Netflix Titus, its Feisty
Team, and Daemons
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site worldwide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-titus-2018
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Titus - Netflix’s Container Management Platform
Scheduling
● Service & batch job lifecycle
● Resource management
Container Execution
● AWS Integration
● Netflix Ecosystem Support
(Diagram: Service and Batch workloads pass through Job and Fleet Management, Resource Management & Optimization, and Container Execution layers.)
Stats
● Containers launched per week: 3 million (batch vs. service)
● Batch runtimes: < 1s, < 1m, < 1h, < 12h, < 1d, > 1d
● Service runtimes: < 1 day, < 1 week, < 1 month, > 1 month
● Autoscaling, high churn
The Titus team
● Design
● Develop
● Operate
● Support
* And Netflix Platform Engineering and Amazon Web Services
Titus Product Strategy
Ordered priority focus on
● Developer Velocity
● Reliability
● Cost Efficiency
Easy migration from VMs to containers
Easy container integration with VMs and Amazon Services
Focus on just what Netflix needs
Deeply integrated AWS container platform
IP per container
● VPC, ENIs, and security groups
IAM Roles and Metadata Endpoint per container
● Container view of 169.254.169.254
Cryptographic identity per container
● Using Amazon instance identity document, Amazon KMS
Service job container autoscaling
● Using native AWS CloudWatch, SQS, Auto Scaling policies and engine
Application Load Balancing (ALB)
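As a sketch of the per-container metadata endpoint idea above (each container sees its own view of 169.254.169.254 and only its own IAM role), with invented names; this is not the actual Titus metadata proxy:

```python
# Hypothetical sketch: a per-container view of the EC2 metadata endpoint.
# CONTAINER_ROLES and handle_metadata_request are illustrative, not Titus APIs.

CONTAINER_ROLES = {
    "container-a": "arn:aws:iam::123456789012:role/service-a",
    "container-b": "arn:aws:iam::123456789012:role/batch-b",
}

def handle_metadata_request(container_id: str, path: str) -> str:
    """Route a request to 169.254.169.254 based on the calling container,
    so each container is offered only its own IAM role."""
    if path == "/latest/meta-data/iam/security-credentials/":
        role_arn = CONTAINER_ROLES[container_id]
        return role_arn.rsplit("/", 1)[-1]  # the role name this container may assume
    raise KeyError(f"unsupported metadata path: {path}")
```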
Applications using containers at Netflix
● Netflix API, Node.js Backend UI Scripts
● Machine Learning (GPUs) for personalization
● Encoding and Content use cases
● Netflix Studio use cases
● CDN tracking, planning, monitoring
● Massively parallel CI system
● Data Pipeline and Stream Processing
● Big Data use cases (Notebooks, Presto)
Timeline
● Batch - Q4 2015
● Basic Services - Q1 2016
● Production Services - Q4 2016
● Customer Facing Services - Q2 2017
Q4 2018 Container Usage
Common
● Jobs launched: 255K jobs / day
● Different applications: 1K+ different images
● Isolated Titus deployments: 7
Services
● Single app cluster size: 5K containers (real), 12K (benchmark)
● Hosts managed: 7K VMs (435,000 CPUs)
Batch
● Containers launched: 450K / day (750K / day peak)
● Hosts managed (autoscaled): 55K / month
High Level Titus Architecture
(Diagram: the Titus Control Plane - API, scheduling, job lifecycle control - is backed by Cassandra, with Fenzo for placement. Batch/workflow systems, service CI/CD, and Titus system services submit work through the control plane. Mesos brokers resources and EC2 Autoscaling grows the pool. Each Titus host, an AWS virtual machine, runs a Mesos agent and Docker, pulling images from the Docker Registry and running user containers.)
Open Source
Open sourced April 2018
Help other communities by sharing our approach
Lessons Learned
End to End
User Experience
Our initial view of containers - "The Runtime"
(Diagram: publish an image to the registry → run the container workload → monitor containers → deploy a new container workload.)
What about?
(Diagram: the same publish/run/monitor/deploy cycle, plus the parts the runtime view leaves out:)
● Security scanning
● Change campaigns
● Ad hoc performance analysis
What about?
● Local Development → CI/CD → Runtime
End to end tooling
Container orchestration only part of the problem
For Netflix …
● Local Development - Newt
● Continuous Integration - Jenkins + Newt
● Continuous Delivery - Spinnaker
● Change Campaigns - Astrid
● Performance Analysis - Vector and Flamegraphs
Tooling guidance
● Ensure coverage for entire application SDLC
○ Developing an application before deployment
○ Change management, security and compliance tooling for runtime
● What we added to Docker tooling
○ Curated known base images
○ Consistent image tagging
○ Assistance for multi-region/account registries
○ Consistency with existing tools
Operations and
High Availability
Learning how things fail (increasing severity)
● Single container crashes
● Single host crashes
○ Taking down multiple containers
● Control plane fails
○ Existing containers continue to run
○ New jobs cannot be submitted
○ Replacements and scale ups do not occur
● Control plane gets into bad state
○ Can be catastrophic
Case 1 - Single container crashes
● Most orchestrators will recover
● Most often during startup or shutdown
● Monitor for crash loops
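Monitoring for crash loops can be sketched as a sliding-window restart counter; the class name and thresholds here are illustrative, not part of Titus:

```python
from collections import deque

class CrashLoopDetector:
    """Flag a container as crash-looping when it restarts more than
    `max_restarts` times within `window_seconds`. Thresholds are illustrative."""

    def __init__(self, max_restarts: int = 3, window_seconds: float = 300.0):
        self.max_restarts = max_restarts
        self.window = window_seconds
        self.restarts: deque = deque()

    def record_restart(self, timestamp: float) -> bool:
        """Record a restart; return True if the container is now crash-looping."""
        self.restarts.append(timestamp)
        # Drop restarts that fell out of the sliding window.
        while self.restarts and timestamp - self.restarts[0] > self.window:
            self.restarts.popleft()
        return len(self.restarts) > self.max_restarts
```

A container restarting every few seconds trips the detector; one that restarts occasionally (the normal "orchestrator will recover" case) never does.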
Case 2 - Single host crashes
● Need a placement engine that spreads critical workloads
● Need a way to detect and remediate bad hosts
Titus node health monitoring and scheduling
● Extensive health checks
○ Control plane components - Docker, Mesos, Titus executor
○ AWS - ENI, VPC, GPU
○ Netflix Dependencies - systemd state, security systems
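Host health gating can be sketched as an all-checks-must-pass aggregation over the categories above; the check names and return shape are illustrative:

```python
# Sketch of host health aggregation in the spirit of the checks listed above.
# The check names are illustrative, not the actual Titus health model.

def evaluate_host(checks: dict) -> tuple:
    """A host is schedulable only if every health check passes.
    Returns (healthy, names of failing checks)."""
    failing = [name for name, ok in checks.items() if not ok]
    return (not failing, failing)

checks = {
    "docker_daemon": True,
    "mesos_agent": True,
    "titus_executor": True,
    "aws_eni": False,       # e.g. ENI attachment failed on this host
    "systemd_state": True,
}
healthy, failing = evaluate_host(checks)
# healthy == False, failing == ["aws_eni"]
```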
(Diagram: the scheduler combines health checks and service discovery to mark Titus hosts healthy or unhealthy.)
Titus node health remediation
● Rate limiting through a centralized service is critical
(Diagram: the scheduler and Titus hosts emit events to infrastructure automation (Winston), which runs against the affected host:)

    perform analysis on host
    perform remediation on host
    if (unrecoverable) {
        tell scheduler to reschedule work
        terminate instance
    }
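The rate limiting called out above can be sketched as a trailing-window limiter on remediation actions; the class and limits are hypothetical, not the actual Winston service:

```python
class RemediationRateLimiter:
    """Centralized rate limiter for host remediation (e.g. instance
    termination). Prevents automation from terminating the whole fleet when
    a fleet-wide signal misfires. Limits are illustrative."""

    def __init__(self, max_actions: int, per_seconds: float):
        self.max_actions = max_actions
        self.per_seconds = per_seconds
        self.history: list = []

    def allow(self, now: float) -> bool:
        """Permit a remediation action only if fewer than `max_actions`
        were taken in the trailing window."""
        self.history = [t for t in self.history if now - t < self.per_seconds]
        if len(self.history) < self.max_actions:
            self.history.append(now)
            return True
        return False
```

If every host suddenly looks "unrecoverable" (a bad health check, not bad hosts), the limiter caps the blast radius until a human can look.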
Spotting fleet wide issues using logging
● For the hosts, not the containers
○ Need fleet wide view of container runtime, OS problems
○ New workloads will trigger new host problems
● Titus hosts generate 2B log lines per day
○ Stream processing to look for patterns and remediations
● Aggregated logging - see patterns in the large
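A minimal sketch of spotting fleet-wide host issues by counting known failure patterns in host logs; the pattern strings and threshold are invented for illustration (the real pipeline is stream processing over ~2B lines/day):

```python
from collections import Counter

# Sketch: fleet-wide log pattern spotting on host (not container) logs.
# PATTERNS and the alert threshold are invented for illustration.

PATTERNS = ["OOM", "docker daemon unresponsive", "ENI attach timeout"]

def scan_host_logs(lines: list, threshold: int = 2) -> dict:
    """Count known failure patterns across host log lines; return only
    patterns seen at least `threshold` times (candidates for remediation)."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return {p: n for p, n in counts.items() if n >= threshold}
```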
Case 3 - Control plane hard failures
● White box - monitor time-bucketed queue length
● Black box - submit synthetic workloads
Case 4 - Control plane soft failures
I don’t feel so good!
But first, let’s talk about Zombies
● Early on, we had cases where
○ Some but not all of the control plane was working
○ User terminated their containers
○ Containers still running, but shouldn’t have been
● The “fix” - Mesos implicit reconciliation
○ Titus to Mesos - What containers are running?
○ Titus to Mesos - Kill these containers we know shouldn’t be running
○ System converges on consistent state 👍
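The implicit reconciliation described above amounts to a set difference, sketched here with hypothetical job ids:

```python
def reconcile(desired_running: set, actually_running: set) -> set:
    """Mesos-style implicit reconciliation, sketched: ask the cluster what is
    running, then kill every container the scheduler believes should not be.
    Returns the set of container ids to kill."""
    return actually_running - desired_running

# A user terminated "job-2", but part of the control plane missed it:
desired = {"job-1", "job-3"}
running = {"job-1", "job-2", "job-3"}
# reconcile(desired, running) == {"job-2"}  -> the zombie to kill
```

Note how the soft-failure case that follows arises from this same step: if `desired_running` is read from a corrupted store, reconciliation kills containers that should be running.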
Disconnected containers
But what if the backing store gets corrupted, or the control plane reads the store incorrectly?
(Diagram: cluster state → scheduler → controller → host, with every hop failing)
Bad things occur
● 12,000 containers "reconciled" in < 1m: running containers treated as not running
● An hour to restore service
Guidance
● Know how to operate your cluster storage
○ Perform backups and test restores
○ Test corruption
○ Know failure modes, and know how to recover
At Netflix, we ...
● Moved to less aggressive reconciliation
● Page on inconsistent data
○ Let existing containers run
○ Human fixes state and decides how to proceed
● Automated snapshot testing for staging
Security
Reducing container escape vectors
● Enforcement
○ Seccomp and AppArmor policies
● Cryptographic identity for each container
○ Leveraging host level Amazon and control plane provided identities
○ Validated by central Netflix service before secrets are provided
Reducing impact of container escape vectors
User namespaces - user_namespaces(7)
● Root (or user) in container != root (or user) on host
● Challenge: getting it to work with persistent storage
Lock down, isolate control plane
● Hackers are scanning for Docker and Kubernetes
● Reported lack of networking isolation in Google Borg
● We also thought our networking was isolated (wasn’t)
Avoiding user host level access
● Vector
● ssh
● perf tools
Scale - Scheduling Speed
How does Netflix fail over?
Netflix regional failover ("Kong")
● Kong evacuation of us-east-1 - traffic diverted to other regions
● Fail back to us-east-1 - traffic moved back to us-east-1
Infrastructure challenge
(Chart: API calls per region - eu-west-1 and us-east-1 - during a failover)
● Increase capacity during scale up of the savior region
● Launch 1000s of containers in 7 minutes
Easy Right?
“we reduced time to schedule 30,000
pods onto 1,000 nodes from
8,780 seconds to 587 seconds”
Synthetic benchmarks missing
1. Heterogeneous workloads
2. Full end to end launches
3. Docker image pull times
4. Integration with public cloud networking
Titus can do this by ...
● Dynamically changeable scheduling behavior
● Fleet wide networking optimizations
Normal scheduling
(Diagram: VM1 … VMn, with App 1 and App 2 instances spread across hosts, each ENI holding a single IP - IP1)
Trade-off for reliability
Scheduling algorithm: Spread over Pack
Failover scheduling
(Diagram: App 1 and App 2 instances packed densely onto the same VMs, with extra IPs per ENI - IP2, IP3, IP4)
Trade-off for speed
Scheduling algorithm: Pack over Spread
On each host
● Due to normal scheduling, the host likely already has ...
○ Docker image downloaded
○ Networking interfaces and security groups configured
● Need to burst allocate IP addresses
○ Opportunistically batch allocate at container launch time
○ Likely if one container was launched, more are coming
○ Garbage collect unused addresses later
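The burst IP allocation above can be sketched as a per-host pool that batch-allocates on demand and garbage collects later; `request_ips` stands in for the real VPC allocation call and is hypothetical:

```python
# Sketch of opportunistic batch IP allocation during failover, per the bullets
# above. `request_ips` is a stand-in for a VPC address-allocation call.

class HostIpPool:
    def __init__(self, request_ips):
        self.request_ips = request_ips   # callable that batch-allocates n addresses
        self.free: list = []

    def allocate(self, batch_size: int = 4) -> str:
        """On a launch, opportunistically request a batch: if one container
        landed here, more are likely coming."""
        if not self.free:
            self.free.extend(self.request_ips(batch_size))
        return self.free.pop()

    def garbage_collect(self) -> list:
        """Later, release the addresses no container ended up using."""
        unused, self.free = self.free, []
        return unused
```

Two launches in quick succession cost one allocation round-trip instead of two; unused addresses are handed back afterwards.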
Results
(Chart: us-east-1 / prod, containers started per minute)
7,500 launched in 5 minutes
Scale - Limits
How far can a single Titus stack go?
● Speed and stability of scheduling
● Blast radius of mistakes
Scaling options
● Idealistic: continue to improve performance; avoid making mistakes
● Realistic: test a stack up to a known scale level; contain mistakes
Titus “Federation”
● Allows a Titus stack to be scaled out
○ For performance and reliability reasons
● Not to help with
○ Cloud bursting across different resource pools
○ Automatic balancing across resource pools
○ Joining various business unit resource pools
Federation Implementation
● Users only need to know of the single external API
○ VIP - titus-api.us-east-1.prod.netflix.net
● Simple federation proxy spans stacks (cells)
○ Route these apps to cell 1, these others to cell 2
○ Fan out & union queries across cells
○ Operators can route directly to specific cells
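A minimal sketch of the federation proxy's two jobs, routing submissions and fanning out queries; the route table and cell contents are invented for illustration:

```python
# Minimal sketch of the federation proxy described above. Cell names and the
# routing rule are illustrative; the real proxy routes on application policy.

ROUTES = {"api": "cell01", "encoder": "cell02"}   # app -> cell
CELLS = {
    "cell01": [{"app": "api", "id": "j1"}],
    "cell02": [{"app": "encoder", "id": "j2"}],
}

def route_submit(app: str) -> str:
    """Route a job submission for `app` to its assigned cell."""
    return ROUTES[app]

def query_all_jobs() -> list:
    """Fan a query out to every cell and union the results, so callers see
    one logical Titus behind the single API."""
    return [job for cell_jobs in CELLS.values() for job in cell_jobs]
```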
Titus Federation
(Diagram: a Titus API federation proxy in each region - us-west-2, us-east-1, eu-west-1 - fronting that region's cell01, cell02, …)
How many cells?
A few large cells
● Only as many as needed for scale / blast radius
Why? Larger resource pools help with
● Cross workload efficiency
● Operations
● Bad workload impacts
Performance and Efficiency
Simplified view of a server
● A fictional "16 vCPU" host
● Left and right are CPU packages
● Top to bottom are cores with hyperthreads
Consider workload placement
● Consider three workloads
○ Static A, B, and C, all of which are latency sensitive
○ Burst D which would like more compute when available
● Let’s start placing workloads
Problems
● Potential performance problems
○ Static C is crossing packages
○ Static B can steal from Static C
● Underutilized resources
○ Burst workload isn’t using all available resources
Node level CPU rescheduling
● After containers land on hosts
○ Eventually, dynamic and cross host
● Leverages cpusets
○ Static - placed on single CPU package and exclusive full cores
○ Burst - can consume extra capacity, but variable performance
● Kubernetes - CPU Manager (beta)
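The static placement rule (exclusive full cores on a single CPU package) can be sketched against the fictional 16-vCPU host from earlier; the layout and names are illustrative:

```python
# Sketch of node-level CPU placement with cpusets, following the slide:
# static workloads get exclusive full cores on one package; burst workloads
# share what is left. Layout mirrors the fictional 16-vCPU host above.

# Package 0 holds cores 0-3, package 1 holds cores 4-7; each core has 2 threads.
CORE_THREADS = {c: (2 * c, 2 * c + 1) for c in range(8)}

def place_static(requested_cores: int, free_cores: dict) -> list:
    """Give a static workload exclusive full cores from a single package
    (avoiding cross-package traffic and neighbor steal); returns the vCPU
    (hyperthread) ids for its cpuset. `free_cores` maps package -> free cores."""
    for package, cores in free_cores.items():
        if len(cores) >= requested_cores:
            chosen = [cores.pop(0) for _ in range(requested_cores)]
            return [t for core in chosen for t in CORE_THREADS[core]]
    raise RuntimeError("no single package can fit this static workload")
```

Whatever cores remain in `free_cores` form the shared pool burst workloads may consume, at variable performance.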
Opportunistic workloads
● Enable workloads to burst into underutilized resources
● Difference between utilized and total resources
(Chart: utilization over time, showing Resources (Total), Resources (Allocated), and Resources (Utilized))
Questions?
Follow-up: @aspyker
More Related Content

What's hot

An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
Neo4j
 

What's hot (20)

Docker SF Meetup January 2016
Docker SF Meetup January 2016Docker SF Meetup January 2016
Docker SF Meetup January 2016
 
Containers & Security
Containers & SecurityContainers & Security
Containers & Security
 
NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013
 
"[WORKSHOP] K8S for developers", Denis Romanuk
"[WORKSHOP] K8S for developers", Denis Romanuk"[WORKSHOP] K8S for developers", Denis Romanuk
"[WORKSHOP] K8S for developers", Denis Romanuk
 
Docker open stack boston
Docker open stack bostonDocker open stack boston
Docker open stack boston
 
Docker Birthday #3 - Intro to Docker Slides
Docker Birthday #3 - Intro to Docker SlidesDocker Birthday #3 - Intro to Docker Slides
Docker Birthday #3 - Intro to Docker Slides
 
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
An Introduction to Container Organization with Docker Swarm, Kubernetes, Meso...
 
The Docker Ecosystem
The Docker EcosystemThe Docker Ecosystem
The Docker Ecosystem
 
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
From Docker To Kubernetes: A Developer's Guide To Containers - Mandy White - ...
 
Docker Swarm Meetup (15min lightning)
Docker Swarm Meetup (15min lightning)Docker Swarm Meetup (15min lightning)
Docker Swarm Meetup (15min lightning)
 
Service Discovery In Kubernetes
Service Discovery In KubernetesService Discovery In Kubernetes
Service Discovery In Kubernetes
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
Microservices Runtimes
Microservices RuntimesMicroservices Runtimes
Microservices Runtimes
 
Kubernetes in Docker
Kubernetes in DockerKubernetes in Docker
Kubernetes in Docker
 
Atomic CLI scan
Atomic CLI scanAtomic CLI scan
Atomic CLI scan
 
Docker Containers Deep Dive
Docker Containers Deep DiveDocker Containers Deep Dive
Docker Containers Deep Dive
 
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
Docker swarm-mike-goelzer-mv-meetup-45min-workshop 02242016 (1)
 
Dockercon 16 Recap
Dockercon 16 RecapDockercon 16 Recap
Dockercon 16 Recap
 
Container Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and KubernetesContainer Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and Kubernetes
 
In-Cluster Continuous Testing Framework for Docker Containers
In-Cluster Continuous Testing Framework for Docker ContainersIn-Cluster Continuous Testing Framework for Docker Containers
In-Cluster Continuous Testing Framework for Docker Containers
 

Similar to Disenchantment: Netflix Titus, Its Feisty Team, and Daemons

Similar to Disenchantment: Netflix Titus, Its Feisty Team, and Daemons (20)

QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
Database as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on KubernetesDatabase as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on Kubernetes
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at Netflix
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
 
GCCP JSCOE Session 2
GCCP JSCOE Session 2GCCP JSCOE Session 2
GCCP JSCOE Session 2
 
Intro to Kubernetes
Intro to KubernetesIntro to Kubernetes
Intro to Kubernetes
 
Kubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetupKubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetup
 
Kubernetes is all you need
Kubernetes is all you needKubernetes is all you need
Kubernetes is all you need
 
Docker in prod
Docker in prodDocker in prod
Docker in prod
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
 
The Highs and Lows of Stateful Containers
The Highs and Lows of Stateful ContainersThe Highs and Lows of Stateful Containers
The Highs and Lows of Stateful Containers
 
Cairo Kubernetes Meetup - October event Talk #1
Cairo Kubernetes Meetup - October event Talk #1Cairo Kubernetes Meetup - October event Talk #1
Cairo Kubernetes Meetup - October event Talk #1
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 

More from C4Media

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

Recently uploaded (20)

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 

Disenchantment: Netflix Titus, Its Feisty Team, and Daemons

  • 1. Netflix Titus, its Feisty Team, and Daemons
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-titus-2018
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4.
  • 5. Titus - Netflix’s Container Management Platform Scheduling ● Service & batch job lifecycle ● Resource management Container Execution ● AWS Integration ● Netflix Ecosystem Support Service Job and Fleet Management Resource Management & Optimization Container Execution Batch
  • 6. Stats Containers Launched Per Week Batch vs. Service 3 Million / Week Batch runtimes (< 1s, < 1m, < 1h, < 12h, < 1d, > 1d) Service runtimes (< 1 day, < 1 week, < 1 month, > 1 month) Autoscaling High Churn
  • 7. The Titus team ● Design ● Develop ● Operate ● Support * And Netflix Platform Engineering and Amazon Web Services
  • 8. Titus Product Strategy Ordered priority focus on ● Developer Velocity ● Reliability ● Cost Efficiency Easy migration from VMs to containers Easy container integration with VMs and Amazon Services Focus on just what Netflix needs
  • 9. Deeply integrated AWS container platform IP per container ● VPC, ENIs, and security groups IAM Roles and Metadata Endpoint per container ● Container view of 169.254.169.254 Cryptographic identity per container ● Using Amazon instance identity document, Amazon KMS Service job container autoscaling ● Using Native AWS Cloudwatch, SQS, Autoscaling policies and engine Application Load Balancing (ALB)
  • 10. Applications using containers at Netflix ● Netflix API, Node.js Backend UI Scripts ● Machine Learning (GPUs) for personalization ● Encoding and Content use cases ● Netflix Studio use cases ● CDN tracking, planning, monitoring ● Massively parallel CI system ● Data Pipeline and Stream Processing ● Big Data use cases (Notebooks, Presto) Batch Q4 15 Basic Services 1Q 16 Production Services 4Q 16 Customer Facing Services 2Q 17
  • 11. Q4 2018 Container Usage Common Jobs Launched 255K jobs / day Different applications 1K+ different images Isolated Titus deployments 7 Services Single App Cluster Size 5K (real), 12K containers (benchmark) Hosts managed 7K VMs (435,000 CPUs) Batch Containers launched 450K / day (750K / day peak) Hosts managed (autoscaled) 55K / month
  • 12. High Level Titus Architecture ● Titus Control Plane (API, scheduling, job lifecycle control), backed by Cassandra and using Fenzo and EC2 Autoscaling ● Titus Hosts: AWS virtual machines running the Mesos agent, Docker, and user containers, with images pulled from the Docker Registry ● Callers: batch/workflow systems, service CI/CD, and Titus system services, connected to the hosts through Mesos
  • 13. Open Source Open sourced April 2018 Help other communities by sharing our approach
  • 15. End to End User Experience
  • 16. Our initial view of containers (“The Runtime”): publish to the Image Registry, run the container workload, monitor containers, deploy a new container workload
  • 19. End to end tooling Container orchestration only part of the problem For Netflix … ● Local Development - Newt ● Continuous Integration - Jenkins + Newt ● Continuous Delivery - Spinnaker ● Change Campaigns - Astrid ● Performance Analysis - Vector and Flamegraphs
  • 20. Tooling guidance ● Ensure coverage for entire application SDLC ○ Developing an application before deployment ○ Change management, security and compliance tooling for runtime ● What we added to Docker tooling ○ Curated known base images ○ Consistent image tagging ○ Assistance for multi-region/account registries ○ Consistency with existing tools
  • 22. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state
  • 23. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ○ Taking down multiple containers ● Control plane fails ● Control plane gets into bad state
  • 24. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ● Control plane fails ○ Existing containers continue to run ○ New jobs cannot be submitted ○ Replacements and scale ups do not occur ● Control plane gets into bad state
  • 25. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state ○ Can be catastrophic
  • 26. ● Most orchestrators will recover ● Most often during startup or shutdown ● Monitor for crash loops Case 1 - Single container crashes
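The "monitor for crash loops" advice on this slide amounts to counting restarts in a sliding time window. A minimal sketch, with illustrative thresholds (not Titus's real values):

```python
from collections import deque

class CrashLoopDetector:
    """Flag a container as crash-looping when it restarts more than
    `max_restarts` times within `window_seconds`."""

    def __init__(self, max_restarts=3, window_seconds=300):
        self.max_restarts = max_restarts
        self.window = window_seconds
        self._restarts = deque()

    def record_restart(self, now):
        # Record this restart, age out restarts older than the window,
        # and report whether the threshold has been exceeded.
        self._restarts.append(now)
        while now - self._restarts[0] > self.window:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts
```

A few crashes spread over hours stay quiet; a burst of restarts within the window trips the detector, which is the signal an operator (or automation) acts on.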
  • 27. Case 2 - Single host crashes ● Need a placement engine that spreads critical workloads ● Need a way to detect and remediate bad hosts
  • 28. Titus node health monitoring, scheduling ● Extensive health checks ○ Control plane components - Docker, Mesos, Titus executor ○ AWS - ENI, VPC, GPU ○ Netflix dependencies - systemd state, security systems (Diagram: the scheduler consults health checks and service discovery, skipping unhealthy Titus hosts)
  • 29. Titus node health remediation ● Rate limiting through a centralized service is critical (Diagram: host events flow to infrastructure automation (Winston), which performs analysis and remediation on the host; if (unrecoverable) { tell scheduler to reschedule work; terminate instance })
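The slide's pseudocode plus the "rate limiting is critical" point can be sketched together: automation may only terminate a bounded number of hosts per window, so one bad health signal cannot drain the fleet. The class, return strings, and limits below are illustrative assumptions, not the Winston implementation.

```python
from collections import deque

class RemediationService:
    """Sketch of centralized, rate-limited host remediation."""

    def __init__(self, max_terminations=2, window_seconds=3600):
        self.limit = max_terminations
        self.window = window_seconds
        self._terminations = deque()  # timestamps of recent terminations

    def remediate(self, host, unrecoverable, now):
        if not unrecoverable:
            return "repaired in place"
        # Age out terminations that fell outside the window.
        while self._terminations and now - self._terminations[0] > self.window:
            self._terminations.popleft()
        if len(self._terminations) >= self.limit:
            # Too many recent terminations: defer rather than cascade.
            return "deferred (rate limited)"
        self._terminations.append(now)
        # Per the slide: tell the scheduler to reschedule the work,
        # then terminate the instance.
        return "terminated"
```

The deferral path is the important design choice: when the limiter trips, a human gets to ask why so many hosts look unrecoverable at once.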
  • 30. Spotting fleet wide issues using logging ● For the hosts, not the containers ○ Need fleet wide view of container runtime, OS problems ○ New workloads will trigger new host problems ● Titus hosts generate 2B log lines per day ○ Stream processing to look for patterns and remediations ● Aggregated logging - see patterns in the large
  • 31. Case 3 - Control plane hard failures White box - monitor time bucketed queue length Black box - submit synthetic workloads
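The white-box check on this slide (time-bucketed queue length) can be sketched as bucketing pending work by how long it has sat in the queue: counts accumulating in the older buckets mean the control plane is not draining. Bucket bounds here are illustrative.

```python
def queue_age_buckets(enqueue_times, now, bounds=(60, 300, 900)):
    """Count queued items by age bucket: <=60s, <=300s, <=900s, older.
    A growing tail bucket is the white-box signal of a stuck control plane."""
    counts = [0] * (len(bounds) + 1)
    for t in enqueue_times:
        age = now - t
        for i, b in enumerate(bounds):
            if age <= b:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # older than the largest bound
    return counts
```

The black-box complement is simply submitting a tiny synthetic job on a schedule and alerting when it fails to start within a deadline.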
  • 32. Case 4 - Control plane soft failures I don’t feel so good!
  • 33. But first, let’s talk about Zombies
  • 34. ● Early on, we had cases where ○ Some but not all of the control plane was working ○ User terminated their containers ○ Containers still running, but shouldn’t have been ● The “fix” - Mesos implicit reconciliation ○ Titus to Mesos - What containers are running? ○ Titus to Mesos - Kill these containers we know shouldn’t be running ○ System converges on consistent state 👍 Disconnected containers
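The implicit-reconciliation "fix" on this slide reduces to a set difference: ask the cluster what is actually running, then kill anything the control plane believes should not be. A minimal sketch:

```python
def implicit_reconcile(desired_running, reported_running):
    """Return the container ids to kill: everything the cluster reports
    as running that the control plane's desired state does not include."""
    return sorted(set(reported_running) - set(desired_running))
```

Applied periodically, this converges the fleet onto the control plane's view of the world, which is exactly why a corrupted view (the next slides) is so dangerous.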
  • 35. But what if? ● Backing store gets corrupted ● Or control plane reads the store incorrectly (Diagram: bad cluster state propagating through scheduler and controller to hosts)
  • 36. Bad things occur ● 12,000 containers “reconciled” in under a minute ● An hour to restore service (Chart: running vs. not running containers)
  • 37. Guidance ● Know how to operate your cluster storage ○ Perform backups and test restores ○ Test corruption ○ Know failure modes, and know how to recover
  • 38. At Netflix, we ... ● Moved to less aggressive reconciliation ● Page on inconsistent data ○ Let existing containers run ○ Human fixes state and decides how to proceed ● Automated snapshot testing for staging
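The "less aggressive reconciliation" and "page on inconsistent data" changes above can be sketched as a guardrail on the naive set difference: if the store and the cluster disagree about too many containers at once, assume the store (not the fleet) is wrong, leave containers running, and page a human. The threshold and return shape are illustrative assumptions.

```python
def safe_reconcile(desired_running, reported_running, max_kill_fraction=0.05):
    """Kill small divergences automatically; treat large divergences as a
    probable bad control-plane state and escalate instead of killing."""
    zombies = set(reported_running) - set(desired_running)
    if reported_running and len(zombies) / len(reported_running) > max_kill_fraction:
        return {"action": "page", "kill": []}
    return {"action": "kill", "kill": sorted(zombies)}
```

With this shape, the 12,000-container incident on slide 36 would have produced a page instead of a mass kill.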
  • 40. ● Enforcement ○ Seccomp and AppArmor policies ● Cryptographic identity for each container ○ Leveraging host level Amazon and control plane provided identities ○ Validated by central Netflix service before secrets are provided Reducing container escape vectors
  • 41. User namespaces ● Root (or user) in container != Root (or user) on host ● Challenge: Getting it to work with persistent storage Reducing impact of container escape vectors user_namespaces (7)
  • 42. Lock down, isolate control plane ● Hackers are scanning for Docker and Kubernetes ● Reported lack of networking isolation in Google Borg ● We also thought our networking was isolated (wasn’t)
  • 43. Avoiding user host level access Vector ssh perf tools
  • 45. How does Netflix failover? ✖ Kong
  • 46. Netflix regional failover ● Kong evacuation of us-east-1: traffic diverted to other regions ● Fail back to us-east-1: traffic moved back to us-east-1 (Chart: API calls per region, EU-WEST-1 vs. US-EAST-1)
  • 47. ● Increase capacity during scale up of savior region ● Launch 1000s of containers in 7 minutes Infrastructure challenge
  • 48. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds”
  • 49. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds” Synthetic benchmarks missing 1. Heterogeneous workloads 2. Full end to end launches 3. Docker image pull times 4. Integration with public cloud networking
  • 50. Titus can do this by ... ● Dynamically changeable scheduling behavior ● Fleet wide networking optimizations
  • 51. Normal scheduling ● Scheduling algorithm: spread rather than pack ● Trade-off for reliability (Diagram: App 1 and App 2 containers spread across VM1 … VMn over ENI1/ENI2, one IP per container)
  • 52. Failover scheduling ● Scheduling algorithm: pack rather than spread ● Trade-off for speed (Diagram: new App 1 and App 2 containers packed onto VMs already running those apps, with extra IPs (IP2, IP3, IP4) allocated on existing ENIs)
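The "dynamically changeable scheduling behavior" behind slides 51 and 52 can be sketched as one scoring function with a mode switch: spread (normal, for reliability) prefers the host with the fewest replicas of an app, while pack (failover, for speed) prefers the most, reusing warm images and network state. This is an assumption-laden sketch, not Fenzo's actual fitness calculators.

```python
def pick_host(hosts, app, mode):
    """hosts: host name -> list of app names already placed there.
    'spread' minimizes co-located replicas; 'pack' maximizes them."""
    count = lambda h: hosts[h].count(app)
    if mode == "spread":
        return min(sorted(hosts), key=count)
    return max(sorted(hosts), key=count)

hosts = {"vm1": ["app1"], "vm2": [], "vm3": ["app1", "app1"]}
```

Flipping a single mode flag fleet-wide during a Kong evacuation is what lets the same scheduler trade reliability for launch speed.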
  • 53. On each host ● Due to normal scheduling, the host likely already has ... ○ Docker image downloaded ○ Networking interfaces and security groups configured ● Need to burst allocate IP addresses ○ Opportunistically batch allocate at container launch time ○ If one container was launched, more are likely coming ○ Garbage collect unused addresses later
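The batch IP allocation and garbage collection described above can be sketched as a small per-host pool: when a launch needs an IP and the pool is empty, allocate a batch (more launches are probably coming), and release unused addresses later. The batch size, fake address source, and call counter are illustrative assumptions.

```python
class BurstIpAllocator:
    """Illustrative per-host IP pool with opportunistic batch allocation."""

    def __init__(self, batch_size=4):
        self.batch = batch_size
        self.free = []
        self.next_ip = 1
        self.alloc_calls = 0  # stand-in for round trips to the cloud API

    def _allocate_batch(self):
        # One (slow) cloud API call yields a whole batch of addresses.
        self.alloc_calls += 1
        for _ in range(self.batch):
            self.free.append(f"10.0.0.{self.next_ip}")
            self.next_ip += 1

    def acquire(self):
        if not self.free:
            self._allocate_batch()
        return self.free.pop(0)

    def garbage_collect(self):
        # Release everything still unused back to the pool owner.
        released, self.free = self.free, []
        return released
```

Five sequential launches cost two allocation round trips instead of five, which is where the launch-rate headroom on the next slide comes from.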
  • 54. Results ● us-east-1 / prod: 7,500 containers launched in 5 minutes (Chart: containers started per minute)
  • 56. How far can a single Titus stack go? ● Speed and stability of scheduling ● Blast radius of mistakes
  • 57. Scaling options ● Idealistic: continue to improve performance; avoid making mistakes ● Realistic: test a stack up to a known scale level; contain mistakes
  • 58. Titus “Federation” ● Allows a Titus stack to be scaled out ○ For performance and reliability reasons ● Not to help with ○ Cloud bursting across different resource pools ○ Automatic balancing across resource pools ○ Joining various business unit resource pools
  • 59. Federation Implementation ● Users need to know only the single external API ○ VIP - titus-api.us-east-1.prod.netflix.net ● Simple federation proxy spans stacks (cells) ○ Route these apps to cell 1, these others to cell 2 ○ Fan out & union queries across cells ○ Operators can route directly to specific cells
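The federation proxy described above can be sketched in a few lines: one external API, app-to-cell routing rules for submissions, and fan-out plus union for queries. Cells are plain dicts here; all names are illustrative, not the real Titus API.

```python
class FederationProxy:
    """Illustrative single-API front for multiple Titus cells."""

    def __init__(self, cells, routing_rules, default_cell):
        self.cells = cells            # cell name -> {job_id: job}
        self.rules = routing_rules    # app name -> cell name
        self.default = default_cell

    def submit(self, app, job_id, job):
        # Route the submission to its cell; unknown apps go to the default.
        cell = self.rules.get(app, self.default)
        self.cells[cell][job_id] = job
        return cell

    def find_jobs(self, predicate):
        # Fan the query out to every cell and union the results.
        out = []
        for jobs in self.cells.values():
            out.extend(j for j in jobs.values() if predicate(j))
        return out

cells = {"cell01": {}, "cell02": {}}
proxy = FederationProxy(cells, {"bigdata": "cell02"}, default_cell="cell01")
```

Users see one endpoint; operators can still address a specific cell directly by skipping the proxy.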
  • 60. Titus Federation (Diagram: a Titus API federation proxy in each of us-west-2, us-east-1, and eu-west-1, each fronting Titus cell01, cell02, …)
  • 61. How many cells? A few large cells ● Only as many as needed for scale / blast radius Why? Larger resource pools help with ● Cross workload efficiency ● Operations ● Bad workload impacts
  • 63. ● A fictional “16 vCPU” host ● Left and right are CPU packages ● Top to bottom are cores with hyperthreads Simplified view of a server
  • 64. Consider workload placement ● Consider three workloads ○ Static A, B, and C, all of which are latency sensitive ○ Burst D, which would like more compute when available ● Let’s start placing workloads
  • 65. Problems ● Potential performance problems ○ Static C is crossing packages ○ Static B can steal from Static C ● Underutilized resources ○ Burst workload isn’t using all available resources
  • 66. Node level CPU rescheduling ● After containers land on hosts ○ Eventually, dynamic and cross host ● Leverages cpusets ○ Static - placed on single CPU package and exclusive full cores ○ Burst - can consume extra capacity, but variable performance ● Kubernetes - CPU Manager (beta)
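The cpuset placement described above can be sketched as a greedy assignment over the slide's fictional 16-vCPU host (two packages, four cores each, two hyperthreads per core): each static workload gets exclusive full cores on a single package, and the burst pool is whatever remains. The topology encoding and function are illustrative, not Titus's node agent.

```python
def assign_cpusets(static_needs, topology):
    """static_needs: workload name -> number of exclusive full cores.
    topology: list of packages, each a list of cores, each core a list
    of hyperthread ids. Returns (per-workload cpusets, burst cpuset)."""
    assigned = {}
    free = [list(cores) for cores in topology]  # mutable copy per package
    for name, cores_needed in static_needs.items():
        # Keep each static workload on one package to avoid
        # cross-package memory traffic (the Static C problem).
        for pkg in free:
            if len(pkg) >= cores_needed:
                taken = [pkg.pop(0) for _ in range(cores_needed)]
                assigned[name] = sorted(t for core in taken for t in core)
                break
        else:
            raise RuntimeError(f"no package can fit {name}")
    # Burst workloads share every hyperthread left over.
    burst = sorted(t for pkg in free for core in pkg for t in core)
    return assigned, burst

# The fictional host: threads n and n+8 share a core, packages split left/right.
topology = [
    [[0, 8], [1, 9], [2, 10], [3, 11]],    # package 0
    [[4, 12], [5, 13], [6, 14], [7, 15]],  # package 1
]
```

The outputs map directly onto cpuset.cpus values (or Docker's --cpuset-cpus), which is the enforcement mechanism the slide alludes to.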
  • 67. Opportunistic workloads ● Enable workloads to burst into underutilized resources ● Difference between utilized and total (Chart: utilization over time against total, allocated, and utilized resources)
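The chart's gap between the three curves is the opportunistic capacity: resources that were never allocated plus resources that were allocated but sit idle. A one-function sketch, with arbitrary units (e.g. vCPUs); the decomposition is illustrative.

```python
def opportunistic_capacity(total, allocated, utilized):
    """Capacity available to burst workloads at this instant:
    unallocated headroom plus allocated-but-idle resources."""
    unallocated = total - allocated
    idle_allocated = allocated - utilized
    return unallocated + idle_allocated
```

For a 64-vCPU pool with 48 vCPUs allocated and 20 actually utilized, 44 vCPUs are momentarily burstable, with the caveat from the previous slide that performance of burst work on reclaimed capacity is variable.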