SlideShare a Scribd company logo
Netflix Titus, its Feisty
Team, and Daemons
Titus - Netflix’s Container Management Platform
Scheduling
● Service & batch job lifecycle
● Resource management
Container Execution
● AWS Integration
● Netflix Ecosystem Support
Service
Job and Fleet Management
Resource Management &
Optimization
Container Execution
Batch
Stats
Containers Launcher
Per Week
Batch vs. Service
3 Million /
Week
Batch runtimes
(< 1s, < 1m, < 1h, < 12h, < 1d, > 1d)
Service runtimes
(< 1 day, < 1 week, < 1 month, > 1 month)
Autoscaling
High Churn
The Titus team
● Design
● Develop
● Operate
● Support
* And Netflix Platform Engineering and Amazon Web Services
Titus Product Strategy
Ordered priority focus on
● Developer Velocity
● Reliability
● Cost Efficiency
Easy migration from VMs to containers
Easy container integration with VMs and Amazon Services
Focus on just what Netflix needs
Deeply integrated AWS container platform
IP per container
● VPC, ENIs, and security groups
IAM Roles and Metadata Endpoint per container
● Container view of 169.254.169.254
Cryptographic identity per container
● Using Amazon instance identity document, Amazon KMS
Service job container autoscaling
● Using Native AWS Cloudwatch, SQS, Autoscaling policies and engine
Application Load Balancing (ALB)
Applications using containers at Netflix
● Netflix API, Node.js Backend UI Scripts
● Machine Learning (GPUs) for personalization
● Encoding and Content use cases
● Netflix Studio use cases
● CDN tracking, planning, monitoring
● Massively parallel CI system
● Data Pipeline and Stream Processing
● Big Data use cases (Notebooks, Presto)
Batch
Q4 15
Basic
Services
1Q 16
Production
Services
4Q 16
Customer
Facing
Services
2Q 17
Q4 2018 Container Usage
Common
Jobs Launched 255K jobs / day
Different applications 1K+ different images
Isolated Titus deployments 7
Services
Single App Cluster Size 5K (real), 12K containers (benchmark)
Hosts managed 7K VMs (435,000 CPUs)
Batch
Containers launched 450K / day (750K / day peak)
Hosts managed (autoscaled) 55K / month
High Level Titus Architecture
Cassandra
Titus Control Plane
● API
● Scheduling
● Job Lifecycle Control
EC2 Autoscaling
Fenzo
container
container
container
docker
Titus Hosts
Mesos agent
Docker
Docker Registry
containercontainerUser Containers
AWS Virtual Machines
Mesos
Titus System ServicesBatch/Workflow
Systems
Service
CI/CD
Open Source
Open sourced April 2018
Help other communities by sharing our approach
Lessons Learned
End to End
User Experience
Our initial view of containers
Image
Registry
Publish
Run
Container
Workload
Monitor
ContainersDeploy new
Container
Workload
“The Runtime”
What about?
Image
Registry
Publish
Run
Container
Workload
Monitor
Containers
Deploy new
Container
Workload
Security
Scanning
Change
Campaigns
Ad hoc
Performance
analysis
What about?
Local Development CI/CD Runtime
End to end tooling
Container orchestration only part of the problem
For Netflix …
● Local Development - Newt
● Continuous Integration - Jenkins + Newt
● Continuous Delivery - Spinnaker
● Change Campaigns - Astrid
● Performance Analysis - Vector and Flamegraphs
Tooling guidance
● Ensure coverage for entire application SDLC
○ Developing an application before deployment
○ Change management, security and compliance tooling for runtime
● What we added to Docker tooling
○ Curated known base images
○ Consistent image tagging
○ Assistance for multi-region/account registries
○ Consistency with existing tools
Operations and
High Availability
● Single container crashes
● Single host crashes
● Control plane fails
● Control plane gets into bad state
Learning how things fail
Increasing
Severity
● Single container crashes
● Single host crashes
○ Taking down multiple containers
● Control plane fails
● Control plane gets into bad state
Learning how things fail
Increasing
Severity
● Single container crashes
● Single host crashes
● Control plane fails
○ Existing containers continue to run
○ New jobs cannot be submitted
○ Replacements and scale ups do not occur
● Control plane gets into bad state
Learning how things fail
Increasing
Severity
● Single container crashes
● Single host crashes
● Control plane fails
● Control plane gets into bad state
○ Can be catastrophic
Learning how things fail
Increasing
Severity
● Most orchestrators will recover
● Most often during startup
or shutdown
● Monitor for crash loops
Case 1 - Single container crashes
Case 2 - Single host crashes
● Need a placement engine that spreads critical workloads
● Need a way to detect and remediate bad hosts
Titus node health monitoring, scheduling
● Extensive health checks
○ Control plane components - Docker, Mesos, Titus executor
○ AWS - ENI, VPC, GPU
○ Netflix Dependencies - systemd state, security systems
Scheduler
Health Check and
Service Discovery
✖
+
+
✖
Titus hosts
Titus node health remediation
● Rate limiting through centralized service is critical
Scheduler ✖
+
+
Infrastructure
Automation
(Winston)
Events
Automation
perform analysis on host
perform remediation on host
if (unrecoverable) {
tell scheduler to reschedule work
terminate instance
}
Titus hosts
Spotting fleet wide issues using logging
● For the hosts, not the containers
○ Need fleet wide view of container runtime, OS problems
○ New workloads will trigger new host problems
● Titus hosts generate 2B log lines per day
○ Stream processing to look for patterns and remediations
● Aggregated logging - see patterns in the large
Case 3 - Control plane hard failures
White box - monitor time
bucketed queue length
Black box - submit
synthetic workloads
Case 4 - Control plane soft failures
I don’t feel so good!
But first, let’s talk about Zombies
● Early on, we had cases where
○ Some but not all of the control plane was working
○ User terminated their containers
○ Containers still running, but shouldn’t have been
● The “fix” - Mesos implicit reconciliation
○ Titus to Mesos - What containers are running?
○ Titus to Mesos - Kill these containers we know shouldn’t be running
○ System converges on consistent state 👍
Disconnected containers
But what if?
Cluster
State
Scheduler
Controller
Host
✖
✖
✖
✖
Backing store
gets corrupted
Or
Control plane reads
store incorrectly
Bad things occur
12,000 containers “reconciled” in < 1m
An hour to restore service
Running
Containers
Not running
Containers
Guidance
● Know how to operate your cluster storage
○ Perform backups and test restores
○ Test corruption
○ Know failure modes, and know how to recover
At Netflix, we ...
● Moved to less aggressive reconciliation
● Page on inconsistent data
○ Let existing containers run
○ Human fixes state and decides how to proceed
● Automated snapshot testing for staging
Security
● Enforcement
○ Seccomp and AppArmor policies
● Cryptographic identity for each container
○ Leveraging host level Amazon and control plane provided identities
○ Validated by central Netflix service before secrets are provided
Reducing container escape vectors
User namespaces
● Root (or user) in container != Root (or user) on host
● Challenge: Getting it to work with persistent storage
Reducing impact of container escape vectors
user_namespaces (7)
Lock down, isolate control plane
● Hackers are scanning for Docker and Kubernetes
● Reported lack of networking isolation in Google Borg
● We also thought our networking was isolated (wasn’t)
Avoiding user host level access
Vector
ssh
perf tools
Scale - Scheduling Speed
How does Netflix failover?
✖
Kong
Netflix regional failover
Kong evacuation of us-east-1
Traffic diverted to other regions
Fail back to us-east-1
Traffic moved back to us-east-1
API Calls Per Region
EU-WEST-1
US-EAST-1
● Increase capacity during scale up of savior region
● Launch 1000s of containers in 7 minutes
Infrastructure challenge
Easy Right?
“we reduced time to schedule 30,000
pods onto 1,000 nodes from
8,780 seconds to 587 seconds”
Easy Right?
“we reduced time to schedule 30,000
pods onto 1,000 nodes from
8,780 seconds to 587 seconds”
Synthetic benchmarks missing
1. Heterogeneous workloads
2. Full end to end launches
3. Docker image pull times
4. Integration with public cloud networking
Titus can do this by ...
● Dynamically changeable scheduling behavior
● Fleet wide networking optimizations
Normal scheduling
VM1
...
VM2 VMn
App 1 App 1
App 2
ENI1 ENI 2
App 2
App 1
ENI1 ENI1 ENI 2
Trade-off for reliability
IP1 IP1 IP1 IP1 IP1
Spread Pack
Scheduling Algorithm
Failover scheduling
VM1
...
VM2 VMn
App 1 App 1
App 2
ENI1 ENI 2
App 2
App 1
ENI1 ENI1 ENI 2
App 1
App 1
App 1
App 1
App 1
App 2
App 2
IP1 IP1 IP1 IP1 IP1
Spread Pack
IP2, IP3 IP2, IP3, IP4 IP2, IP3
Trade-off for speed
Scheduling Algorithm
● Due to normal scheduling, host likely already has ...
○ Docker image downloaded
○ Networking interfaces and security groups configured
● Need to burst allocate IP addresses
○ Opportunistically batch allocate at container launch time
○ Likely if one container was launched more are coming
○ Garbage collect unused later
On each host
Results
us-east-1 / prod
containers started per minute
7500 Launched
In 5 Minutes
Scale - Limits
How far can a single Titus stack go?
● Speed and stability of scheduling
● Blast radius of mistakes
Scaling options
Idealistic
Continue to improve
performance
Avoid making mistakes
Realistic
Test a stack up to a
known scale level
Contain mistakes
Titus “Federation”
● Allows a Titus stack to be scaled out
○ For performance and reliability reasons
● Not to help with
○ Cloud bursting across different resource pools
○ Automatic balancing across resource pools
○ Joining various business unit resource pools
Federation Implementation
● Users need only to know of the external single API
○ VIP - titus-api.us-east-1.prod.netflix.net
● Simple federation proxy spans stacks (cells)
○ Route these apps to cell 1, these others to cell 2
○ Fan out & union queries across cells
○ Operators can route directly to specific cells
Titus Federation
…
Titus API
(Federation Proxy)
Titus cell01
us-west-2
Titus cell02
…
Titus cell01
us-east-1
Titus cell02
…
Titus cell01
eu-west-1
Titus cell02
Titus API
(Federation Proxy)
Titus API
(Federation Proxy)
How many cells?
A few large cells
● Only as many as needed for scale / blast radius
Why? Larger resource pools help with
● Cross workload efficiency
● Operations
● Bad workload impacts
Performance and Efficiency
● A fictional “16 vCPU” host
● Left and right are CPU packages
● Top to bottom are cores with hyperthreads
Simplified view of a server
Consider workload placement
● Consider three workloads
○ Static A, B, and C all which are latency sensitive
○ Burst D which would like more compute when available
● Let’s start placing workloads
Problems
● Potential performance problems
○ Static C is crossing packages
○ Static B can steal from Static C
● Underutilized resources
○ Burst workload isn’t using all available resources
Node level CPU rescheduling
● After containers land on hosts
○ Eventually, dynamic and cross host
● Leverages cpusets
○ Static - placed on single CPU package and exclusive full cores
○ Burst - can consume extra capacity, but variable performance
● Kubernetes - CPU Manager (beta)
Opportunistic workloads
● Enable workloads to burst into underutilized resources
● Differences between utilized and total
Utilization
time
Resources (Total)
Resources (Allocated)
Resources (Utilized)
Questions?

More Related Content

What's hot

NetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker TalkNetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker Talk
aspyker
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
aspyker
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1
aspyker
 
Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2
aspyker
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
aspyker
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
aspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
The new Netflix API
The new Netflix APIThe new Netflix API
The new Netflix API
Katharina Probst
 
Netflix Cloud Platform and Open Source
Netflix Cloud Platform and Open SourceNetflix Cloud Platform and Open Source
Netflix Cloud Platform and Open Source
aspyker
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talk
aspyker
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
aspyker
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talksRuslan Meshenberg
 
Dev309 from asgard to zuul - netflix oss-final
Dev309  from asgard to zuul - netflix oss-finalDev309  from asgard to zuul - netflix oss-final
Dev309 from asgard to zuul - netflix oss-final
Ruslan Meshenberg
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1
aspyker
 
Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014
Philip Fisher-Ogden
 
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Docker, Inc.
 
Netflix: From Zero to Production-Ready in Minutes (QCon 2017)
Netflix: From Zero to Production-Ready in Minutes (QCon 2017)Netflix: From Zero to Production-Ready in Minutes (QCon 2017)
Netflix: From Zero to Production-Ready in Minutes (QCon 2017)
Tim Bozarth
 
Netflix Story of Embracing the Cloud
Netflix Story of Embracing the CloudNetflix Story of Embracing the Cloud
Netflix Story of Embracing the Cloud
Kate Karniouchina
 
20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software
DevOps Chicago
 

What's hot (20)

NetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker TalkNetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker Talk
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1
 
Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
The new Netflix API
The new Netflix APIThe new Netflix API
The new Netflix API
 
Netflix Cloud Platform and Open Source
Netflix Cloud Platform and Open SourceNetflix Cloud Platform and Open Source
Netflix Cloud Platform and Open Source
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talk
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
 
Dev309 from asgard to zuul - netflix oss-final
Dev309  from asgard to zuul - netflix oss-finalDev309  from asgard to zuul - netflix oss-final
Dev309 from asgard to zuul - netflix oss-final
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1
 
Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014
 
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
 
Netflix: From Zero to Production-Ready in Minutes (QCon 2017)
Netflix: From Zero to Production-Ready in Minutes (QCon 2017)Netflix: From Zero to Production-Ready in Minutes (QCon 2017)
Netflix: From Zero to Production-Ready in Minutes (QCon 2017)
 
Netflix Story of Embracing the Cloud
Netflix Story of Embracing the CloudNetflix Story of Embracing the Cloud
Netflix Story of Embracing the Cloud
 
20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software
 

Similar to QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons

Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
C4Media
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
Sharma Podila
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
All Things Open
 
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
Amazon Web Services
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at Netflix
Docker, Inc.
 
DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...
DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...
DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...
Docker, Inc.
 
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Xiaoman DONG
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
aspyker
 
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open Source
All Things Open
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
reBuy on Kubernetes
reBuy on KubernetesreBuy on Kubernetes
reBuy on Kubernetes
Stephan Lindauer
 
DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses  DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses
Docker, Inc.
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
Datadog
 
Netflix Titus WASP October 2017
Netflix Titus WASP October 2017Netflix Titus WASP October 2017
Netflix Titus WASP October 2017
Andrew Leung
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
kbajda
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
Karol Chrapek
 
WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017
Imesh Gunaratne
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kevin Lynch
 

Similar to QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons (20)

Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
 
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at Netflix
 
DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...
DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...
DockerCon SF 2015 : Reliably shipping containers in a resource rich world usi...
 
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
 
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open Source
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
reBuy on Kubernetes
reBuy on KubernetesreBuy on Kubernetes
reBuy on Kubernetes
 
DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses  DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
Netflix Titus WASP October 2017
Netflix Titus WASP October 2017Netflix Titus WASP October 2017
Netflix Titus WASP October 2017
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
 
WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 

More from aspyker

SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talk
aspyker
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
aspyker
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
aspyker
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3
aspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
aspyker
 
Ibm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinalIbm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinalaspyker
 
Docker Demo IBM Impact 2014
Docker Demo IBM Impact 2014Docker Demo IBM Impact 2014
Docker Demo IBM Impact 2014
aspyker
 
Netflix s2e1lightningtalk
Netflix s2e1lightningtalkNetflix s2e1lightningtalk
Netflix s2e1lightningtalkaspyker
 
Going Cloud Native with IBM Cloud and NetflixOSS for Dev@Pulse
Going Cloud Native with IBM Cloud and NetflixOSS for Dev@PulseGoing Cloud Native with IBM Cloud and NetflixOSS for Dev@Pulse
Going Cloud Native with IBM Cloud and NetflixOSS for Dev@Pulse
aspyker
 

More from aspyker (9)

SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talk
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Ibm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinalIbm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinal
 
Docker Demo IBM Impact 2014
Docker Demo IBM Impact 2014Docker Demo IBM Impact 2014
Docker Demo IBM Impact 2014
 
Netflix s2e1lightningtalk
Netflix s2e1lightningtalkNetflix s2e1lightningtalk
Netflix s2e1lightningtalk
 
Going Cloud Native with IBM Cloud and NetflixOSS for Dev@Pulse
Going Cloud Native with IBM Cloud and NetflixOSS for Dev@PulseGoing Cloud Native with IBM Cloud and NetflixOSS for Dev@Pulse
Going Cloud Native with IBM Cloud and NetflixOSS for Dev@Pulse
 

Recently uploaded

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons

  • 1. Netflix Titus, its Feisty Team, and Daemons
  • 2.
  • 3. Titus - Netflix’s Container Management Platform Scheduling ● Service & batch job lifecycle ● Resource management Container Execution ● AWS Integration ● Netflix Ecosystem Support Service Job and Fleet Management Resource Management & Optimization Container Execution Batch
  • 4. Stats Containers Launcher Per Week Batch vs. Service 3 Million / Week Batch runtimes (< 1s, < 1m, < 1h, < 12h, < 1d, > 1d) Service runtimes (< 1 day, < 1 week, < 1 month, > 1 month) Autoscaling High Churn
  • 5. The Titus team ● Design ● Develop ● Operate ● Support * And Netflix Platform Engineering and Amazon Web Services
  • 6. Titus Product Strategy Ordered priority focus on ● Developer Velocity ● Reliability ● Cost Efficiency Easy migration from VMs to containers Easy container integration with VMs and Amazon Services Focus on just what Netflix needs
  • 7. Deeply integrated AWS container platform IP per container ● VPC, ENIs, and security groups IAM Roles and Metadata Endpoint per container ● Container view of 169.254.169.254 Cryptographic identity per container ● Using Amazon instance identity document, Amazon KMS Service job container autoscaling ● Using Native AWS Cloudwatch, SQS, Autoscaling policies and engine Application Load Balancing (ALB)
  • 8. Applications using containers at Netflix ● Netflix API, Node.js Backend UI Scripts ● Machine Learning (GPUs) for personalization ● Encoding and Content use cases ● Netflix Studio use cases ● CDN tracking, planning, monitoring ● Massively parallel CI system ● Data Pipeline and Stream Processing ● Big Data use cases (Notebooks, Presto) Batch Q4 15 Basic Services 1Q 16 Production Services 4Q 16 Customer Facing Services 2Q 17
  • 9. Q4 2018 Container Usage Common Jobs Launched 255K jobs / day Different applications 1K+ different images Isolated Titus deployments 7 Services Single App Cluster Size 5K (real), 12K containers (benchmark) Hosts managed 7K VMs (435,000 CPUs) Batch Containers launched 450K / day (750K / day peak) Hosts managed (autoscaled) 55K / month
  • 10. High Level Titus Architecture Cassandra Titus Control Plane ● API ● Scheduling ● Job Lifecycle Control EC2 Autoscaling Fenzo container container container docker Titus Hosts Mesos agent Docker Docker Registry containercontainerUser Containers AWS Virtual Machines Mesos Titus System ServicesBatch/Workflow Systems Service CI/CD
  • 11. Open Source Open sourced April 2018 Help other communities by sharing our approach
  • 13. End to End User Experience
  • 14. Our initial view of containers Image Registry Publish Run Container Workload Monitor ContainersDeploy new Container Workload “The Runtime”
  • 17. End to end tooling Container orchestration only part of the problem For Netflix … ● Local Development - Newt ● Continuous Integration - Jenkins + Newt ● Continuous Delivery - Spinnaker ● Change Campaigns - Astrid ● Performance Analysis - Vector and Flamegraphs
  • 18. Tooling guidance ● Ensure coverage for entire application SDLC ○ Developing an application before deployment ○ Change management, security and compliance tooling for runtime ● What we added to Docker tooling ○ Curated known base images ○ Consistent image tagging ○ Assistance for multi-region/account registries ○ Consistency with existing tools
  • 20. ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state Learning how things fail Increasing Severity
  • 21. ● Single container crashes ● Single host crashes ○ Taking down multiple containers ● Control plane fails ● Control plane gets into bad state Learning how things fail Increasing Severity
  • 22. ● Single container crashes ● Single host crashes ● Control plane fails ○ Existing containers continue to run ○ New jobs cannot be submitted ○ Replacements and scale ups do not occur ● Control plane gets into bad state Learning how things fail Increasing Severity
  • 23. ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state ○ Can be catastrophic Learning how things fail Increasing Severity
  • 24. ● Most orchestrators will recover ● Most often during startup or shutdown ● Monitor for crash loops Case 1 - Single container crashes
  • 25. Case 2 - Single host crashes ● Need a placement engine that spreads critical workloads ● Need a way to detect and remediate bad hosts
  • 26. Titus node health monitoring, scheduling ● Extensive health checks ○ Control plane components - Docker, Mesos, Titus executor ○ AWS - ENI, VPC, GPU ○ Netflix Dependencies - systemd state, security systems Scheduler Health Check and Service Discovery ✖ + + ✖ Titus hosts
  • 27. Titus node health remediation ● Rate limiting through centralized service is critical Scheduler ✖ + + Infrastructure Automation (Winston) Events Automation perform analysis on host perform remediation on host if (unrecoverable) { tell scheduler to reschedule work terminate instance } Titus hosts
  • 28. Spotting fleet wide issues using logging ● For the hosts, not the containers ○ Need fleet wide view of container runtime, OS problems ○ New workloads will trigger new host problems ● Titus hosts generate 2B log lines per day ○ Stream processing to look for patterns and remediations ● Aggregated logging - see patterns in the large
  • 29. Case 3 - Control plane hard failures White box - monitor time bucketed queue length Black box - submit synthetic workloads
  • 30. Case 4 - Control plane soft failures I don’t feel so good!
  • 31. But first, let’s talk about Zombies
  • 32. ● Early on, we had cases where ○ Some but not all of the control plane was working ○ User terminated their containers ○ Containers still running, but shouldn’t have been ● The “fix” - Mesos implicit reconciliation ○ Titus to Mesos - What containers are running? ○ Titus to Mesos - Kill these containers we know shouldn’t be running ○ System converges on consistent state 👍 Disconnected containers
  • 33. But what if? Cluster State Scheduler Controller Host ✖ ✖ ✖ ✖ Backing store gets corrupted Or Control plane reads store incorrectly
  • 34. Bad things occur 12,000 containers “reconciled” in < 1m An hour to restore service Running Containers Not running Containers
  • 35. Guidance ● Know how to operate your cluster storage ○ Perform backups and test restores ○ Test corruption ○ Know failure modes, and know how to recover
  • 36. At Netflix, we ... ● Moved to less aggressive reconciliation ● Page on inconsistent data ○ Let existing containers run ○ Human fixes state and decides how to proceed ● Automated snapshot testing for staging
  • 38. ● Enforcement ○ Seccomp and AppArmor policies ● Cryptographic identity for each container ○ Leveraging host level Amazon and control plane provided identities ○ Validated by central Netflix service before secrets are provided Reducing container escape vectors
  • 39. User namespaces ● Root (or user) in container != Root (or user) on host ● Challenge: Getting it to work with persistent storage Reducing impact of container escape vectors user_namespaces (7)
  • 40. Lock down, isolate control plane ● Hackers are scanning for Docker and Kubernetes ● Reported lack of networking isolation in Google Borg ● We also thought our networking was isolated (wasn’t)
  • 41. Avoiding user host level access Vector ssh perf tools
  • 43. How does Netflix failover? ✖ Kong
  • 44. Netflix regional failover Kong evacuation of us-east-1 Traffic diverted to other regions Fail back to us-east-1 Traffic moved back to us-east-1 API Calls Per Region EU-WEST-1 US-EAST-1
  • 45. ● Increase capacity during scale up of savior region ● Launch 1000s of containers in 7 minutes Infrastructure challenge
  • 46. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds”
  • 47. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds” Synthetic benchmarks missing 1. Heterogeneous workloads 2. Full end to end launches 3. Docker image pull times 4. Integration with public cloud networking
  • 48. Titus can do this by ... ● Dynamically changeable scheduling behavior ● Fleet wide networking optimizations
  • 49. Normal scheduling VM1 ... VM2 VMn App 1 App 1 App 2 ENI1 ENI 2 App 2 App 1 ENI1 ENI1 ENI 2 Trade-off for reliability IP1 IP1 IP1 IP1 IP1 Spread Pack Scheduling Algorithm
  • 50. Failover scheduling VM1 ... VM2 VMn App 1 App 1 App 2 ENI1 ENI 2 App 2 App 1 ENI1 ENI1 ENI 2 App 1 App 1 App 1 App 1 App 1 App 2 App 2 IP1 IP1 IP1 IP1 IP1 Spread Pack IP2, IP3 IP2, IP3, IP4 IP2, IP3 Trade-off for speed Scheduling Algorithm
  • 51. ● Due to normal scheduling, host likely already has ... ○ Docker image downloaded ○ Networking interfaces and security groups configured ● Need to burst allocate IP addresses ○ Opportunistically batch allocate at container launch time ○ Likely if one container was launched more are coming ○ Garbage collect unused later On each host
  • 52. Results us-east-1 / prod containers started per minute 7500 Launched In 5 Minutes
  • 54. How far can a single Titus stack go? ● Speed and stability of scheduling ● Blast radius of mistakes
  • 55. Scaling options Idealistic Continue to improve performance Avoid making mistakes Realistic Test a stack up to a known scale level Contain mistakes
  • 56. Titus “Federation” ● Allows a Titus stack to be scaled out ○ For performance and reliability reasons ● Not to help with ○ Cloud bursting across different resource pools ○ Automatic balancing across resource pools ○ Joining various business unit resource pools
  • 57. Federation Implementation ● Users need only to know of the external single API ○ VIP - titus-api.us-east-1.prod.netflix.net ● Simple federation proxy spans stacks (cells) ○ Route these apps to cell 1, these others to cell 2 ○ Fan out & union queries across cells ○ Operators can route directly to specific cells
  • 58. Titus Federation … Titus API (Federation Proxy) Titus cell01 us-west-2 Titus cell02 … Titus cell01 us-east-1 Titus cell02 … Titus cell01 eu-west-1 Titus cell02 Titus API (Federation Proxy) Titus API (Federation Proxy)
  • 59. How many cells? A few large cells ● Only as many as needed for scale / blast radius Why? Larger resource pools help with ● Cross workload efficiency ● Operations ● Bad workload impacts
  • 61. ● A fictional “16 vCPU” host ● Left and right are CPU packages ● Top to bottom are cores with hyperthreads Simplified view of a server
  • 62. Consider workload placement ● Consider three workloads ○ Static A, B, and C all which are latency sensitive ○ Burst D which would like more compute when available ● Let’s start placing workloads
  • 63. Problems ● Potential performance problems ○ Static C is crossing packages ○ Static B can steal from Static C ● Underutilized resources ○ Burst workload isn’t using all available resources
  • 64. Node level CPU rescheduling ● After containers land on hosts ○ Eventually, dynamic and cross host ● Leverages cpusets ○ Static - placed on single CPU package and exclusive full cores ○ Burst - can consume extra capacity, but variable performance ● Kubernetes - CPU Manager (beta)
  • 65. Opportunistic workloads ● Enable workloads to burst into underutilized resources ● Differences between utilized and total Utilization time Resources (Total) Resources (Allocated) Resources (Utilized)