SlideShare a Scribd company logo
1 of 69
Netflix Titus, its Feisty
Team, and Daemons
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-titus-2018
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Titus - Netflix’s Container Management Platform
Scheduling
● Service & batch job lifecycle
● Resource management
Container Execution
● AWS Integration
● Netflix Ecosystem Support
Service
Job and Fleet Management
Resource Management &
Optimization
Container Execution
Batch
Stats
Containers Launched
Per Week
Batch vs. Service
3 Million /
Week
Batch runtimes
(< 1s, < 1m, < 1h, < 12h, < 1d, > 1d)
Service runtimes
(< 1 day, < 1 week, < 1 month, > 1 month)
Autoscaling
High Churn
The Titus team
● Design
● Develop
● Operate
● Support
* And Netflix Platform Engineering and Amazon Web Services
Titus Product Strategy
Ordered priority focus on
● Developer Velocity
● Reliability
● Cost Efficiency
Easy migration from VMs to containers
Easy container integration with VMs and Amazon Services
Focus on just what Netflix needs
Deeply integrated AWS container platform
IP per container
● VPC, ENIs, and security groups
IAM Roles and Metadata Endpoint per container
● Container view of 169.254.169.254
Cryptographic identity per container
● Using Amazon instance identity document, Amazon KMS
Service job container autoscaling
● Using Native AWS Cloudwatch, SQS, Autoscaling policies and engine
Application Load Balancing (ALB)
Applications using containers at Netflix
● Netflix API, Node.js Backend UI Scripts
● Machine Learning (GPUs) for personalization
● Encoding and Content use cases
● Netflix Studio use cases
● CDN tracking, planning, monitoring
● Massively parallel CI system
● Data Pipeline and Stream Processing
● Big Data use cases (Notebooks, Presto)
Batch
Q4 15
Basic
Services
1Q 16
Production
Services
4Q 16
Customer
Facing
Services
2Q 17
Q4 2018 Container Usage
Common
Jobs Launched 255K jobs / day
Different applications 1K+ different images
Isolated Titus deployments 7
Services
Single App Cluster Size 5K (real), 12K containers (benchmark)
Hosts managed 7K VMs (435,000 CPUs)
Batch
Containers launched 450K / day (750K / day peak)
Hosts managed (autoscaled) 55K / month
High Level Titus Architecture
Cassandra
Titus Control Plane
● API
● Scheduling
● Job Lifecycle Control
EC2 Autoscaling
Fenzo
container
container
container
docker
Titus Hosts
Mesos agent
Docker
Docker Registry
containercontainerUser Containers
AWS Virtual Machines
Mesos
Titus System ServicesBatch/Workflow
Systems
Service
CI/CD
Open Source
Open sourced April 2018
Help other communities by sharing our approach
Lessons Learned
End to End
User Experience
Our initial view of containers
Image
Registry
Publish
Run
Container
Workload
Monitor
ContainersDeploy new
Container
Workload
“The Runtime”
What about?
Image
Registry
Publish
Run
Container
Workload
Monitor
Containers
Deploy new
Container
Workload
Security
Scanning
Change
Campaigns
Ad hoc
Performance
analysis
What about?
Local Development CI/CD Runtime
End to end tooling
Container orchestration only part of the problem
For Netflix …
● Local Development - Newt
● Continuous Integration - Jenkins + Newt
● Continuous Delivery - Spinnaker
● Change Campaigns - Astrid
● Performance Analysis - Vector and Flamegraphs
Tooling guidance
● Ensure coverage for entire application SDLC
○ Developing an application before deployment
○ Change management, security and compliance tooling for runtime
● What we added to Docker tooling
○ Curated known base images
○ Consistent image tagging
○ Assistance for multi-region/account registries
○ Consistency with existing tools
Operations and
High Availability
● Single container crashes
● Single host crashes
● Control plane fails
● Control plane gets into bad state
Learning how things fail
Increasing
Severity
● Single container crashes
● Single host crashes
○ Taking down multiple containers
● Control plane fails
● Control plane gets into bad state
Learning how things fail
Increasing
Severity
● Single container crashes
● Single host crashes
● Control plane fails
○ Existing containers continue to run
○ New jobs cannot be submitted
○ Replacements and scale ups do not occur
● Control plane gets into bad state
Learning how things fail
Increasing
Severity
● Single container crashes
● Single host crashes
● Control plane fails
● Control plane gets into bad state
○ Can be catastrophic
Learning how things fail
Increasing
Severity
● Most orchestrators will recover
● Most often during startup
or shutdown
● Monitor for crash loops
Case 1 - Single container crashes
Case 2 - Single host crashes
● Need a placement engine that spreads critical workloads
● Need a way to detect and remediate bad hosts
Titus node health monitoring, scheduling
● Extensive health checks
○ Control plane components - Docker, Mesos, Titus executor
○ AWS - ENI, VPC, GPU
○ Netflix Dependencies - systemd state, security systems
Scheduler
Health Check and
Service Discovery
✖
+
+
✖
Titus hosts
Titus node health remediation
● Rate limiting through centralized service is critical
Scheduler ✖
+
+
Infrastructure
Automation
(Winston)
Events
Automation
perform analysis on host
perform remediation on host
if (unrecoverable) {
tell scheduler to reschedule work
terminate instance
}
Titus hosts
Spotting fleet wide issues using logging
● For the hosts, not the containers
○ Need fleet wide view of container runtime, OS problems
○ New workloads will trigger new host problems
● Titus hosts generate 2B log lines per day
○ Stream processing to look for patterns and remediations
● Aggregated logging - see patterns in the large
Case 3 - Control plane hard failures
White box - monitor time
bucketed queue length
Black box - submit
synthetic workloads
Case 4 - Control plane soft failures
I don’t feel so good!
But first, let’s talk about Zombies
● Early on, we had cases where
○ Some but not all of the control plane was working
○ User terminated their containers
○ Containers still running, but shouldn’t have been
● The “fix” - Mesos implicit reconciliation
○ Titus to Mesos - What containers are running?
○ Titus to Mesos - Kill these containers we know shouldn’t be running
○ System converges on consistent state 👍
Disconnected containers
But what if?
Cluster
State
Scheduler
Controller
Host
✖
✖
✖
✖
Backing store
gets corrupted
Or
Control plane reads
store incorrectly
Bad things occur
12,000 containers “reconciled” in < 1m
An hour to restore service
Running
Containers
Not running
Containers
Guidance
● Know how to operate your cluster storage
○ Perform backups and test restores
○ Test corruption
○ Know failure modes, and know how to recover
At Netflix, we ...
● Moved to less aggressive reconciliation
● Page on inconsistent data
○ Let existing containers run
○ Human fixes state and decides how to proceed
● Automated snapshot testing for staging
Security
● Enforcement
○ Seccomp and AppArmor policies
● Cryptographic identity for each container
○ Leveraging host level Amazon and control plane provided identities
○ Validated by central Netflix service before secrets are provided
Reducing container escape vectors
User namespaces
● Root (or user) in container != Root (or user) on host
● Challenge: Getting it to work with persistent storage
Reducing impact of container escape vectors
user_namespaces (7)
Lock down, isolate control plane
● Hackers are scanning for Docker and Kubernetes
● Reported lack of networking isolation in Google Borg
● We also thought our networking was isolated (wasn’t)
Avoiding user host level access
Vector
ssh
perf tools
Scale - Scheduling Speed
How does Netflix failover?
✖
Kong
Netflix regional failover
Kong evacuation of us-east-1
Traffic diverted to other regions
Fail back to us-east-1
Traffic moved back to us-east-1
API Calls Per Region
EU-WEST-1
US-EAST-1
● Increase capacity during scale up of savior region
● Launch 1000s of containers in 7 minutes
Infrastructure challenge
Easy Right?
“we reduced time to schedule 30,000
pods onto 1,000 nodes from
8,780 seconds to 587 seconds”
Easy Right?
“we reduced time to schedule 30,000
pods onto 1,000 nodes from
8,780 seconds to 587 seconds”
Synthetic benchmarks missing
1. Heterogeneous workloads
2. Full end to end launches
3. Docker image pull times
4. Integration with public cloud networking
Titus can do this by ...
● Dynamically changeable scheduling behavior
● Fleet wide networking optimizations
Normal scheduling
VM1
...
VM2 VMn
App 1 App 1
App 2
ENI1 ENI 2
App 2
App 1
ENI1 ENI1 ENI 2
Trade-off for reliability
IP1 IP1 IP1 IP1 IP1
Spread Pack
Scheduling Algorithm
Failover scheduling
VM1
...
VM2 VMn
App 1 App 1
App 2
ENI1 ENI 2
App 2
App 1
ENI1 ENI1 ENI 2
App 1
App 1
App 1
App 1
App 1
App 2
App 2
IP1 IP1 IP1 IP1 IP1
Spread Pack
IP2, IP3 IP2, IP3, IP4 IP2, IP3
Trade-off for speed
Scheduling Algorithm
● Due to normal scheduling, host likely already has ...
○ Docker image downloaded
○ Networking interfaces and security groups configured
● Need to burst allocate IP addresses
○ Opportunistically batch allocate at container launch time
○ Likely if one container was launched more are coming
○ Garbage collect unused later
On each host
Results
us-east-1 / prod
containers started per minute
7500 Launched
In 5 Minutes
Scale - Limits
How far can a single Titus stack go?
● Speed and stability of scheduling
● Blast radius of mistakes
Scaling options
Idealistic
Continue to improve
performance
Avoid making mistakes
Realistic
Test a stack up to a
known scale level
Contain mistakes
Titus “Federation”
● Allows a Titus stack to be scaled out
○ For performance and reliability reasons
● Not to help with
○ Cloud bursting across different resource pools
○ Automatic balancing across resource pools
○ Joining various business unit resource pools
Federation Implementation
● Users need only to know of the external single API
○ VIP - titus-api.us-east-1.prod.netflix.net
● Simple federation proxy spans stacks (cells)
○ Route these apps to cell 1, these others to cell 2
○ Fan out & union queries across cells
○ Operators can route directly to specific cells
Titus Federation
…
Titus API
(Federation Proxy)
Titus cell01
us-west-2
Titus cell02
…
Titus cell01
us-east-1
Titus cell02
…
Titus cell01
eu-west-1
Titus cell02
Titus API
(Federation Proxy)
Titus API
(Federation Proxy)
How many cells?
A few large cells
● Only as many as needed for scale / blast radius
Why? Larger resource pools help with
● Cross workload efficiency
● Operations
● Bad workload impacts
Performance and Efficiency
● A fictional “16 vCPU” host
● Left and right are CPU packages
● Top to bottom are cores with hyperthreads
Simplified view of a server
Consider workload placement
● Consider three workloads
○ Static A, B, and C all which are latency sensitive
○ Burst D which would like more compute when available
● Let’s start placing workloads
Problems
● Potential performance problems
○ Static C is crossing packages
○ Static B can steal from Static C
● Underutilized resources
○ Burst workload isn’t using all available resources
Node level CPU rescheduling
● After containers land on hosts
○ Eventually, dynamic and cross host
● Leverages cpusets
○ Static - placed on single CPU package and exclusive full cores
○ Burst - can consume extra capacity, but variable performance
● Kubernetes - CPU Manager (beta)
Opportunistic workloads
● Enable workloads to burst into underutilized resources
● Differences between utilized and total
Utilization
time
Resources (Total)
Resources (Allocated)
Resources (Utilized)
Questions?
Follow-up: @aspyker
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-titus-2018

More Related Content

What's hot

Docker SF Meetup January 2016
Docker SF Meetup January 2016Docker SF Meetup January 2016
Docker SF Meetup January 2016Patrick Chanezon
 
NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013aspyker
 
"[WORKSHOP] K8S for developers", Denis Romanuk
"[WORKSHOP] K8S for developers", Denis Romanuk"[WORKSHOP] K8S for developers", Denis Romanuk
"[WORKSHOP] K8S for developers", Denis RomanukFwdays
 
Docker open stack boston
Docker open stack bostonDocker open stack boston
Docker open stack bostondotCloud
 
Docker Birthday #3 - Intro to Docker Slides
Docker Birthday #3 - Intro to Docker SlidesDocker Birthday #3 - Intro to Docker Slides
Docker Birthday #3 - Intro to Docker SlidesDocker, Inc.
 
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...Neo4j
 
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...Codemotion
 
Docker Swarm Meetup (15min lightning)
Docker Swarm Meetup (15min lightning)Docker Swarm Meetup (15min lightning)
Docker Swarm Meetup (15min lightning)Mike Goelzer
 
Service Discovery In Kubernetes
Service Discovery In KubernetesService Discovery In Kubernetes
Service Discovery In KubernetesKnoldus Inc.
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetesRishabh Indoria
 
Microservices Runtimes
Microservices RuntimesMicroservices Runtimes
Microservices RuntimesFrank Munz
 
Kubernetes in Docker
Kubernetes in DockerKubernetes in Docker
Kubernetes in DockerDocker, Inc.
 
Docker Containers Deep Dive
Docker Containers Deep DiveDocker Containers Deep Dive
Docker Containers Deep DiveWill Kinard
 
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)Michelle Antebi
 
Dockercon 16 Recap
Dockercon 16 RecapDockercon 16 Recap
Dockercon 16 RecapLee Calcote
 
Container Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and KubernetesContainer Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and KubernetesWill Hall
 
In-Cluster Continuous Testing Framework for Docker Containers
In-Cluster Continuous Testing Framework for Docker ContainersIn-Cluster Continuous Testing Framework for Docker Containers
In-Cluster Continuous Testing Framework for Docker ContainersNeil Gehani
 

What's hot (20)

Docker SF Meetup January 2016
Docker SF Meetup January 2016Docker SF Meetup January 2016
Docker SF Meetup January 2016
 
Containers & Security
Containers & SecurityContainers & Security
Containers & Security
 
NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013
 
"[WORKSHOP] K8S for developers", Denis Romanuk
"[WORKSHOP] K8S for developers", Denis Romanuk"[WORKSHOP] K8S for developers", Denis Romanuk
"[WORKSHOP] K8S for developers", Denis Romanuk
 
Docker open stack boston
Docker open stack bostonDocker open stack boston
Docker open stack boston
 
Docker Birthday #3 - Intro to Docker Slides
Docker Birthday #3 - Intro to Docker SlidesDocker Birthday #3 - Intro to Docker Slides
Docker Birthday #3 - Intro to Docker Slides
 
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
 
The Docker Ecosystem
The Docker EcosystemThe Docker Ecosystem
The Docker Ecosystem
 
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
 
Docker Swarm Meetup (15min lightning)
Docker Swarm Meetup (15min lightning)Docker Swarm Meetup (15min lightning)
Docker Swarm Meetup (15min lightning)
 
Service Discovery In Kubernetes
Service Discovery In KubernetesService Discovery In Kubernetes
Service Discovery In Kubernetes
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
Microservices Runtimes
Microservices RuntimesMicroservices Runtimes
Microservices Runtimes
 
Kubernetes in Docker
Kubernetes in DockerKubernetes in Docker
Kubernetes in Docker
 
Atomic CLI scan
Atomic CLI scanAtomic CLI scan
Atomic CLI scan
 
Docker Containers Deep Dive
Docker Containers Deep DiveDocker Containers Deep Dive
Docker Containers Deep Dive
 
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
 
Dockercon 16 Recap
Dockercon 16 RecapDockercon 16 Recap
Dockercon 16 Recap
 
Container Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and KubernetesContainer Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and Kubernetes
 
In-Cluster Continuous Testing Framework for Docker Containers
In-Cluster Continuous Testing Framework for Docker ContainersIn-Cluster Continuous Testing Framework for Docker Containers
In-Cluster Continuous Testing Framework for Docker Containers
 

Similar to Disenchantment: Netflix Titus, Its Feisty Team, and Daemons

QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemonsaspyker
 
Database as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on KubernetesDatabase as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on KubernetesObjectRocket
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Sharma Podila
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015aspyker
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containersaspyker
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDocker, Inc.
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thingaspyker
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsAll Things Open
 
GCCP JSCOE Session 2
GCCP JSCOE Session 2GCCP JSCOE Session 2
GCCP JSCOE Session 2GDSC
 
Kubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetupKubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetupMist.io
 
Kubernetes is all you need
Kubernetes is all you needKubernetes is all you need
Kubernetes is all you needVishwas N
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101Vishwas N
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)Tibo Beijen
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containerskbajda
 
The Highs and Lows of Stateful Containers
The Highs and Lows of Stateful ContainersThe Highs and Lows of Stateful Containers
The Highs and Lows of Stateful ContainersC4Media
 
Cairo Kubernetes Meetup - October event Talk #1
Cairo Kubernetes Meetup - October event Talk #1Cairo Kubernetes Meetup - October event Talk #1
Cairo Kubernetes Meetup - October event Talk #1omehelba
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudDatadog
 

Similar to Disenchantment: Netflix Titus, Its Feisty Team, and Daemons (20)

QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
Database as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on KubernetesDatabase as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on Kubernetes
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at Netflix
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
 
GCCP JSCOE Session 2
GCCP JSCOE Session 2GCCP JSCOE Session 2
GCCP JSCOE Session 2
 
Intro to Kubernetes
Intro to KubernetesIntro to Kubernetes
Intro to Kubernetes
 
Kubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetupKubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetup
 
Kubernetes is all you need
Kubernetes is all you needKubernetes is all you need
Kubernetes is all you need
 
Docker in prod
Docker in prodDocker in prod
Docker in prod
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
 
The Highs and Lows of Stateful Containers
The Highs and Lows of Stateful ContainersThe Highs and Lows of Stateful Containers
The Highs and Lows of Stateful Containers
 
Cairo Kubernetes Meetup - October event Talk #1
Cairo Kubernetes Meetup - October event Talk #1Cairo Kubernetes Meetup - October event Talk #1
Cairo Kubernetes Meetup - October event Talk #1
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Disenchantment: Netflix Titus, Its Feisty Team, and Daemons

  • 1. Netflix Titus, its Feisty Team, and Daemons
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-titus-2018
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4.
  • 5. Titus - Netflix’s Container Management Platform Scheduling ● Service & batch job lifecycle ● Resource management Container Execution ● AWS Integration ● Netflix Ecosystem Support Service Job and Fleet Management Resource Management & Optimization Container Execution Batch
  • 6. Stats Containers Launched Per Week Batch vs. Service 3 Million / Week Batch runtimes (< 1s, < 1m, < 1h, < 12h, < 1d, > 1d) Service runtimes (< 1 day, < 1 week, < 1 month, > 1 month) Autoscaling High Churn
  • 7. The Titus team ● Design ● Develop ● Operate ● Support * And Netflix Platform Engineering and Amazon Web Services
  • 8. Titus Product Strategy Ordered priority focus on ● Developer Velocity ● Reliability ● Cost Efficiency Easy migration from VMs to containers Easy container integration with VMs and Amazon Services Focus on just what Netflix needs
  • 9. Deeply integrated AWS container platform IP per container ● VPC, ENIs, and security groups IAM Roles and Metadata Endpoint per container ● Container view of 169.254.169.254 Cryptographic identity per container ● Using Amazon instance identity document, Amazon KMS Service job container autoscaling ● Using Native AWS Cloudwatch, SQS, Autoscaling policies and engine Application Load Balancing (ALB)
  • 10. Applications using containers at Netflix ● Netflix API, Node.js Backend UI Scripts ● Machine Learning (GPUs) for personalization ● Encoding and Content use cases ● Netflix Studio use cases ● CDN tracking, planning, monitoring ● Massively parallel CI system ● Data Pipeline and Stream Processing ● Big Data use cases (Notebooks, Presto) Batch Q4 15 Basic Services 1Q 16 Production Services 4Q 16 Customer Facing Services 2Q 17
  • 11. Q4 2018 Container Usage Common Jobs Launched 255K jobs / day Different applications 1K+ different images Isolated Titus deployments 7 Services Single App Cluster Size 5K (real), 12K containers (benchmark) Hosts managed 7K VMs (435,000 CPUs) Batch Containers launched 450K / day (750K / day peak) Hosts managed (autoscaled) 55K / month
  • 12. High Level Titus Architecture Cassandra Titus Control Plane ● API ● Scheduling ● Job Lifecycle Control EC2 Autoscaling Fenzo container container container docker Titus Hosts Mesos agent Docker Docker Registry containercontainerUser Containers AWS Virtual Machines Mesos Titus System ServicesBatch/Workflow Systems Service CI/CD
  • 13. Open Source Open sourced April 2018 Help other communities by sharing our approach
  • 15. End to End User Experience
  • 16. Our initial view of containers Image Registry Publish Run Container Workload Monitor ContainersDeploy new Container Workload “The Runtime”
  • 19. End to end tooling Container orchestration only part of the problem For Netflix … ● Local Development - Newt ● Continuous Integration - Jenkins + Newt ● Continuous Delivery - Spinnaker ● Change Campaigns - Astrid ● Performance Analysis - Vector and Flamegraphs
  • 20. Tooling guidance ● Ensure coverage for entire application SDLC ○ Developing an application before deployment ○ Change management, security and compliance tooling for runtime ● What we added to Docker tooling ○ Curated known base images ○ Consistent image tagging ○ Assistance for multi-region/account registries ○ Consistency with existing tools
  • 22. ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state Learning how things fail Increasing Severity
  • 23. ● Single container crashes ● Single host crashes ○ Taking down multiple containers ● Control plane fails ● Control plane gets into bad state Learning how things fail Increasing Severity
  • 24. ● Single container crashes ● Single host crashes ● Control plane fails ○ Existing containers continue to run ○ New jobs cannot be submitted ○ Replacements and scale ups do not occur ● Control plane gets into bad state Learning how things fail Increasing Severity
  • 25. ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state ○ Can be catastrophic Learning how things fail Increasing Severity
  • 26. ● Most orchestrators will recover ● Most often during startup or shutdown ● Monitor for crash loops Case 1 - Single container crashes
  • 27. Case 2 - Single host crashes ● Need a placement engine that spreads critical workloads ● Need a way to detect and remediate bad hosts
  • 28. Titus node health monitoring, scheduling ● Extensive health checks ○ Control plane components - Docker, Mesos, Titus executor ○ AWS - ENI, VPC, GPU ○ Netflix Dependencies - systemd state, security systems Scheduler Health Check and Service Discovery ✖ + + ✖ Titus hosts
  • 29. Titus node health remediation ● Rate limiting through centralized service is critical Scheduler ✖ + + Infrastructure Automation (Winston) Events Automation perform analysis on host perform remediation on host if (unrecoverable) { tell scheduler to reschedule work terminate instance } Titus hosts
  • 30. Spotting fleet wide issues using logging ● For the hosts, not the containers ○ Need fleet wide view of container runtime, OS problems ○ New workloads will trigger new host problems ● Titus hosts generate 2B log lines per day ○ Stream processing to look for patterns and remediations ● Aggregated logging - see patterns in the large
  • 31. Case 3 - Control plane hard failures White box - monitor time bucketed queue length Black box - submit synthetic workloads
  • 32. Case 4 - Control plane soft failures I don’t feel so good!
  • 33. But first, let’s talk about Zombies
  • 34. ● Early on, we had cases where ○ Some but not all of the control plane was working ○ User terminated their containers ○ Containers still running, but shouldn’t have been ● The “fix” - Mesos implicit reconciliation ○ Titus to Mesos - What containers are running? ○ Titus to Mesos - Kill these containers we know shouldn’t be running ○ System converges on consistent state 👍 Disconnected containers
  • 35. But what if? Cluster State Scheduler Controller Host ✖ ✖ ✖ ✖ Backing store gets corrupted Or Control plane reads store incorrectly
  • 36. Bad things occur 12,000 containers “reconciled” in < 1m An hour to restore service Running Containers Not running Containers
  • 37. Guidance ● Know how to operate your cluster storage ○ Perform backups and test restores ○ Test corruption ○ Know failure modes, and know how to recover
  • 38. At Netflix, we ... ● Moved to less aggressive reconciliation ● Page on inconsistent data ○ Let existing containers run ○ Human fixes state and decides how to proceed ● Automated snapshot testing for staging
  • 40. ● Enforcement ○ Seccomp and AppArmor policies ● Cryptographic identity for each container ○ Leveraging host level Amazon and control plane provided identities ○ Validated by central Netflix service before secrets are provided Reducing container escape vectors
  • 41. User namespaces ● Root (or user) in container != Root (or user) on host ● Challenge: Getting it to work with persistent storage Reducing impact of container escape vectors user_namespaces (7)
  • 42. Lock down, isolate control plane ● Hackers are scanning for Docker and Kubernetes ● Reported lack of networking isolation in Google Borg ● We also thought our networking was isolated (wasn’t)
  • 43. Avoiding user host level access Vector ssh perf tools
  • 45. How does Netflix failover? ✖ Kong
  • 46. Netflix regional failover Kong evacuation of us-east-1 Traffic diverted to other regions Fail back to us-east-1 Traffic moved back to us-east-1 API Calls Per Region EU-WEST-1 US-EAST-1
  • 47. ● Increase capacity during scale up of savior region ● Launch 1000s of containers in 7 minutes Infrastructure challenge
  • 48. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds”
  • 49. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds” Synthetic benchmarks missing 1. Heterogeneous workloads 2. Full end to end launches 3. Docker image pull times 4. Integration with public cloud networking
  • 50. Titus can do this by ... ● Dynamically changeable scheduling behavior ● Fleet wide networking optimizations
  • 51. Normal scheduling VM1 ... VM2 VMn App 1 App 1 App 2 ENI1 ENI 2 App 2 App 1 ENI1 ENI1 ENI 2 Trade-off for reliability IP1 IP1 IP1 IP1 IP1 Spread Pack Scheduling Algorithm
  • 52. Failover scheduling VM1 ... VM2 VMn App 1 App 1 App 2 ENI1 ENI 2 App 2 App 1 ENI1 ENI1 ENI 2 App 1 App 1 App 1 App 1 App 1 App 2 App 2 IP1 IP1 IP1 IP1 IP1 Spread Pack IP2, IP3 IP2, IP3, IP4 IP2, IP3 Trade-off for speed Scheduling Algorithm
  • 53. ● Due to normal scheduling, host likely already has ... ○ Docker image downloaded ○ Networking interfaces and security groups configured ● Need to burst allocate IP addresses ○ Opportunistically batch allocate at container launch time ○ Likely if one container was launched more are coming ○ Garbage collect unused later On each host
  • 54. Results us-east-1 / prod containers started per minute 7500 Launched In 5 Minutes
  • 56. How far can a single Titus stack go? ● Speed and stability of scheduling ● Blast radius of mistakes
  • 57. Scaling options Idealistic Continue to improve performance Avoid making mistakes Realistic Test a stack up to a known scale level Contain mistakes
  • 58. Titus “Federation” ● Allows a Titus stack to be scaled out ○ For performance and reliability reasons ● Not to help with ○ Cloud bursting across different resource pools ○ Automatic balancing across resource pools ○ Joining various business unit resource pools
  • 59. Federation Implementation ● Users need only to know of the external single API ○ VIP - titus-api.us-east-1.prod.netflix.net ● Simple federation proxy spans stacks (cells) ○ Route these apps to cell 1, these others to cell 2 ○ Fan out & union queries across cells ○ Operators can route directly to specific cells
  • 60. Titus Federation … Titus API (Federation Proxy) Titus cell01 us-west-2 Titus cell02 … Titus cell01 us-east-1 Titus cell02 … Titus cell01 eu-west-1 Titus cell02 Titus API (Federation Proxy) Titus API (Federation Proxy)
  • 61. How many cells? A few large cells ● Only as many as needed for scale / blast radius Why? Larger resource pools help with ● Cross workload efficiency ● Operations ● Bad workload impacts
  • 63. ● A fictional “16 vCPU” host ● Left and right are CPU packages ● Top to bottom are cores with hyperthreads Simplified view of a server
  • 64. Consider workload placement ● Consider three workloads ○ Static A, B, and C all which are latency sensitive ○ Burst D which would like more compute when available ● Let’s start placing workloads
  • 65. Problems ● Potential performance problems ○ Static C is crossing packages ○ Static B can steal from Static C ● Underutilized resources ○ Burst workload isn’t using all available resources
  • 66. Node level CPU rescheduling ● After containers land on hosts ○ Eventually, dynamic and cross host ● Leverages cpusets ○ Static - placed on single CPU package and exclusive full cores ○ Burst - can consume extra capacity, but variable performance ● Kubernetes - CPU Manager (beta)
  • 67. Opportunistic workloads ● Enable workloads to burst into underutilized resources ● Differences between utilized and total Utilization time Resources (Total) Resources (Allocated) Resources (Utilized)
  • 69. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-titus-2018