Containers at Netflix
WASP 10/19/17
Andrew Leung
The Whole Titus Team
Motivating Factors For Containers
● From Late 2015 Technical Strategy ...
● Simpler management of compute resources
● Simpler deployment packaging artifacts for compute jobs
● Need for a consistent local developer environment
Provided Innovation Velocity
Media Encoding - encoding research development time
● Using custom VMs - 1 month
● Using customizable containers - 1 week
Niagara
● Build all Netflix codebases in hours
● Saves developers hundreds of hours of debugging
NodeQuark
● Focus returns to app development
● NeWT & Titus simplify and speed up testing and deployment
Consistent Developer Experience
● NeWT - Common local developer experience including
support for container development
○ Container image used for local laptop development
○ Same container image re-used when deployed
● Has benefits in both directions
○ Cloud-like local development environment
○ Easier operational debugging of cloud workloads
What is Titus?
● Cloud runtime platform for container-based jobs
● Scheduling
○ Service & batch job management
○ Advanced resource management across an elastic shared resource pool
● Container Execution
○ Advanced Isolation
○ Docker and AWS Integration
○ Container integration with Netflix infrastructure
[Diagram labels: Service, Batch, Job Management, Resource Management & Optimization, Container Execution, Integration]
Titus Evolution Timeframe
● Titus created
● 4Q 2015: Batch GA
● 1Q 2016: Service support added
● 2Q 2016: Netflix infra & AWS integration
● 4Q 2016: First scale production service
● 2Q 2017: First user path service
Containers Scale Over Time
● From a thousand daily
● To 100K daily
● Spike to 450K
[Chart: containers launched per day]
Titus Current Scale
● Deployed across multiple AWS accounts & three regions
● Over 5,000 instances (mostly m4.4xlarge & r4.8xlarge)
● Over 1,000,000 containers launched over a one-week period
● Around 10,000 long-running containers
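As a rough cross-check of these figures (simple arithmetic, not a number stated in the talk), a million launches spread over seven days is a lower bound of roughly 143K launches per day, which sits between the ~100K daily baseline and the 450K spike on the scale-over-time slide:

```latex
\frac{1{,}000{,}000\ \text{containers}}{7\ \text{days}} \approx 142{,}857\ \text{containers/day}
```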
Current Titus Users (Sampling)
● Service
○ Stream Processing (Flink)
○ UI Services (Node.js, single core)
○ Internal dashboards
● Batch
○ Algorithm model training, personalization & recommendations (with GPUs)
○ Content value analysis
○ Digital watermarking
○ Ad hoc reporting (e.g., Open Connect CDN analysis and planning)
○ Continuous integration builds
● Queued worker model
○ Media encoding experimentation
Archer
Titus Overview
[Architecture diagram. Components shown: Titus UI, Titus API, Rhea, Titus Master (Job Management & Scheduler, Fenzo), Mesos Master, Cassandra, Zookeeper, EC2 Auto-scaling API, Docker Registry, S3, Integration. On each Titus Agent (AWS VMs): Mesos agent, Titus executor, Docker, containers, Pod & VPC network drivers, AWS metadata proxy, metrics agents, logging agent, btrfs.]
AWS Integration
● Making Docker containers integrate with AWS the way VMs do
● Titus adds
○ VPC Connectivity (IP per container)
○ Security Groups
○ EC2 Metadata service
○ IAM Roles
○ Multi-tenant isolation (CPU, memory, disk quota, network)
○ Live and S3-persisted log rotation & management
○ Remote storage (EFS)
○ Autoscaling service jobs
○ GPU Support
○ Environmental context similar to user data
Multi-tenant networking is hard
● Decided early on we wanted full IP stacks per container
● But what about:
○ Security group support
○ IAM role support
○ Network bandwidth isolation
○ Integration with VPC
Networking - VPC Driver (four diagram slides)
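The VPC driver slides are diagrams only. As a hedged illustration of how "an IP per container" can map onto VPC primitives, the sketch below assumes that containers sharing the same set of security groups share one ENI (in a VPC, security groups are attached to network interfaces, not to individual addresses) and that each container gets its own secondary IP on that ENI. `HypotheticalVpcAllocator`, the ENI/IP limits and the allocation policy are all illustrative assumptions, not Titus's actual driver, and the host's primary interface is ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Eni:
    """One elastic network interface bound to a fixed set of security groups."""
    security_groups: FrozenSet[str]
    free_ips: List[str]
    assigned: Dict[str, str] = field(default_factory=dict)  # container id -> IP


class HypotheticalVpcAllocator:
    """Toy allocator: ENIs keyed by security-group set, one dedicated IP per container."""

    def __init__(self, max_enis: int, ips_per_eni: int, ip_pool: List[str]):
        self.max_enis = max_enis          # per-instance ENI limit (depends on instance type)
        self.ips_per_eni = ips_per_eni    # per-ENI IP limit (depends on instance type)
        self._ip_pool = list(ip_pool)     # stand-in for addresses handed out by the subnet
        self.enis: List[Eni] = []

    def _find_or_attach_eni(self, sgs: FrozenSet[str]) -> Optional[Eni]:
        # Reuse an ENI that already carries exactly these security groups and has a free IP.
        for eni in self.enis:
            if eni.security_groups == sgs and eni.free_ips:
                return eni
        # Otherwise attach a new ENI, if the instance still has a free ENI slot.
        if len(self.enis) >= self.max_enis or len(self._ip_pool) < self.ips_per_eni:
            return None
        ips = [self._ip_pool.pop(0) for _ in range(self.ips_per_eni)]
        eni = Eni(security_groups=sgs, free_ips=ips)
        self.enis.append(eni)
        return eni

    def allocate(self, container_id: str, security_groups: List[str]) -> Optional[str]:
        """Return a dedicated IP for the container, or None if the host is out of ENIs/IPs."""
        eni = self._find_or_attach_eni(frozenset(security_groups))  # order of groups is irrelevant
        if eni is None:
            return None
        ip = eni.free_ips.pop(0)
        eni.assigned[container_id] = ip
        return ip

    def release(self, container_id: str) -> None:
        """Return the container's IP to its ENI so a later container can reuse it."""
        for eni in self.enis:
            ip = eni.assigned.pop(container_id, None)
            if ip is not None:
                eni.free_ips.append(ip)
                return
```

The design choice this illustrates is why security groups make multi-tenant networking hard: containers with different group sets cannot share an ENI, so the number of distinct security-group combinations on a host is bounded by its ENI limit.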
Networking - Metadata Proxy
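The metadata proxy slide is a diagram. A minimal sketch of the idea, assuming the proxy sees each container's requests to the EC2 instance metadata endpoint and answers identity-related paths (the container's own IP and its own IAM role credentials) from per-container state, while passing everything else through to the host's real metadata service. The two paths follow the real EC2 metadata layout (`/latest/meta-data/local-ipv4` and `/latest/meta-data/iam/security-credentials/<role>`); `ContainerContext`, `fetch_credentials`, `forward_to_host` and the routing rules are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

CREDS_PREFIX = "/latest/meta-data/iam/security-credentials/"


@dataclass
class ContainerContext:
    """Hypothetical per-container state, e.g. looked up by the request's source IP."""
    ip: str
    iam_role: str


def handle_metadata_request(
    path: str,
    ctx: ContainerContext,
    fetch_credentials: Callable[[str], Dict[str, str]],
    forward_to_host: Callable[[str], Tuple[int, str]],
) -> Tuple[int, str]:
    """Return (HTTP status, body) for one metadata request coming from a container."""
    if path == "/latest/meta-data/local-ipv4":
        # Report the container's own VPC IP, not the host's.
        return 200, ctx.ip
    if path == CREDS_PREFIX:
        # List only the container's role, hiding the host's role.
        return 200, ctx.iam_role
    if path.startswith(CREDS_PREFIX):
        role = path[len(CREDS_PREFIX):]
        if role != ctx.iam_role:
            # Never hand out credentials for a role this container does not own.
            return 404, "Not Found"
        return 200, json.dumps(fetch_credentials(role))
    # Everything else (instance type, region, ...) is answered by the host's service.
    return forward_to_host(path)
```

Keeping the credential lookup and the pass-through as injected callables keeps the routing logic testable without a network.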
Networking - Putting it all together
Isolation
● CPU
○ Fixed shares today (pinning coming)
● Memory
○ Including page cache
● Disk
○ Quotas
● Networking
○ Bandwidth, ENIs and IPs
● Security
○ User namespaces, hosts locked down, secret management
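To make the per-resource isolation more concrete, here is a hedged sketch that turns a task's resource request into standard `docker run` flags. The flags are real Docker options (`--cpu-shares`, `--cpuset-cpus`, `--memory`, `--memory-swap`, and `--storage-opt size=…`, which the btrfs storage driver supports); the `TaskResources` type, the 1024-shares-per-CPU scaling and the choice to disable swap are assumptions for illustration. Network bandwidth is left out because plain `docker run` has no bandwidth flag; on Titus that presumably belongs to its own network drivers.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

SHARES_PER_CPU = 1024  # assumption: Docker's default weight of 1024 treated as "one CPU"


@dataclass
class TaskResources:
    """Hypothetical resource request for one container."""
    cpus: float
    memory_mb: int
    disk_mb: int
    pinned_cpus: Optional[Sequence[int]] = None  # e.g. [0, 1] once CPU pinning is available


def isolation_flags(res: TaskResources) -> List[str]:
    """Translate a resource request into docker run isolation flags."""
    if res.cpus <= 0 or res.memory_mb <= 0 or res.disk_mb <= 0:
        raise ValueError("cpus, memory_mb and disk_mb must be positive")
    flags = [
        # Relative CPU weight ("fixed shares"): only takes effect when CPUs are contended.
        f"--cpu-shares={int(res.cpus * SHARES_PER_CPU)}",
        # Hard memory cap enforced by the memory cgroup.
        f"--memory={res.memory_mb}m",
        # Equal to --memory, so the container gets no extra swap beyond its cap.
        f"--memory-swap={res.memory_mb}m",
        # Writable-layer quota; needs a storage driver that supports it (e.g. btrfs).
        f"--storage-opt=size={res.disk_mb}m",
    ]
    if res.pinned_cpus:
        # Pinning restricts the container to specific CPU cores.
        flags.append("--cpuset-cpus=" + ",".join(str(c) for c in res.pinned_cpus))
    return flags
```

For example, a 2-CPU, 4 GiB task yields `--cpu-shares=2048 --memory=4096m --memory-swap=4096m --storage-opt=size=10240m` for a 10 GiB disk request.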
Netflix Infrastructure Integration
● Provide a single cloud platform (VMs and containers treated the same)
● Titus adds integration with
○ Spinnaker CI/CD and canaries
○ Atlas telemetry and outlier detection
○ Discovery/IPC
○ Edda (and dependent systems)
○ Instance pollers (healthcheck, system metrics)
○ Chaos monkey
○ Traffic control & Kong
○ Netflix secure secret management
○ Interactive access (similar to ssh)
● Supports both reserved critical and elastically scaled flex workloads
● Manages containers under both service and batch systems
Netflix Cloud Infrastructure (VMs + Containers)
Why? Single Consistent Cloud Platform
Spinnaker Setup
Spinnaker screenshots, with callouts:
● Deploy based on new image tags
● Basic resource requirements
● IAM roles & security groups per container
● Deploy strategies same as VMs
● Easily see health & discovery
Container Level Introspection
● Interactive “ssh” and file copy (“scp”) managed by Titus hosts
● Locked down: hosts are secured and accessible only to Titus operators
Scheduling
Fenzo - The heart of Titus scheduling
● Extensible Library for Scheduling Frameworks
● Plugin-based scheduling objectives
○ Bin packing, etc.
● Heterogeneous resources & tasks
● Cluster autoscaling
○ Multiple instance types
● Plugin-based constraint evaluators
○ Resource affinity, task locality, etc.
● Single offer mode added in support of ECS
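Fenzo itself is a Java library with its own interfaces, and the Python sketch below does not reproduce its API. It only illustrates the plugin idea in the list above, under assumed names: constraint plugins veto hosts, fitness plugins score the remaining ones, and the task goes to the best-scoring host. `Host`, `Task`, `has_attribute` and `schedule` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Host:
    name: str
    free_cpus: float
    free_mem_mb: int
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. instance type, AZ


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


# Hypothetical plugin signatures: constraints veto hosts, fitness functions rank them.
Constraint = Callable[[Task, Host], bool]
FitnessFn = Callable[[Task, Host], float]


def has_attribute(key: str, allowed: Set[str]) -> Constraint:
    """Build a constraint plugin, e.g. 'only GPU instance types'."""
    return lambda task, host: host.attributes.get(key) in allowed


def schedule(task: Task, hosts: List[Host], constraints: List[Constraint],
             fitness_fns: List[FitnessFn]) -> Optional[Host]:
    """Return the best host for the task, or None if no host is eligible."""
    eligible = [
        h for h in hosts
        if h.free_cpus >= task.cpus and h.free_mem_mb >= task.mem_mb
        and all(c(task, h) for c in constraints)
    ]
    if not eligible:
        return None
    # Sum the fitness plugin scores; max() keeps the first host on ties.
    return max(eligible, key=lambda h: sum(f(task, h) for f in fitness_fns))
```

Swapping objectives (bin packing, spreading, and so on) then means passing a different list of fitness functions rather than changing the scheduler.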
Scheduling - Capacity Guarantees
● Titus maintains …
● Critical tier
○ guaranteed capacity &
start latencies
● Flex tier
○ more dynamic capacity &
variable start latency
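A minimal model of the two tiers, assuming each tier draws from its own pool of hosts: the critical pool is kept provisioned to its guaranteed size, so a task within the guarantee starts immediately, while the flex pool grows on demand, so a task may first wait for new hosts. The `Tier` type, the reservation rule and the outcome names are illustrative assumptions, not Titus internals; what actually happens to critical work beyond its reservation is not covered by the slide.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    provisioned_cpus: float   # capacity currently running in this tier's pool
    used_cpus: float = 0.0
    elastic: bool = False     # True: pool grows on demand (flex); False: fixed reservation


def admit(tier: Tier, cpus: float) -> str:
    """Return 'start' if the task can start now, 'wait_for_scale_up' if an elastic
    pool must grow first, or 'over_reservation' if a fixed pool is exhausted."""
    if tier.used_cpus + cpus <= tier.provisioned_cpus:
        tier.used_cpus += cpus
        return "start"  # guaranteed capacity: start latency stays predictable
    if tier.elastic:
        # Start latency now depends on how quickly new hosts come up.
        return "wait_for_scale_up"
    return "over_reservation"
```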
Scheduling - Bin Packing, Elastic Scaling
User adds work tasks
● Titus does bin packing
to ensure that we can
downscale entire hosts
efficiently
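A toy illustration of why packing matters for scale-down, under simplifying assumptions: CPU is the only resource, all hosts have the same capacity, and tasks arrive one at a time. It compares best-fit placement (each task goes to the fullest host that still fits it) with spread placement (each task goes to the emptiest host) and counts the hosts left completely idle, which an autoscaler could terminate. Fenzo's own bin-packing fitness calculators are not reproduced here; `place` and `idle_hosts` are hypothetical names.

```python
from typing import List


def place(tasks: List[float], num_hosts: int, capacity: float, best_fit: bool) -> List[float]:
    """Assign each task's CPU demand to a host; return the CPU used on each host."""
    used = [0.0] * num_hosts
    for demand in tasks:
        candidates = [i for i in range(num_hosts) if used[i] + demand <= capacity]
        if not candidates:
            raise RuntimeError(f"no host can fit a task needing {demand} CPUs")
        if best_fit:
            # Fullest host that still fits: keeps the other hosts empty.
            target = max(candidates, key=lambda i: used[i])
        else:
            # Emptiest host: spreads load, so no host ever becomes idle.
            target = min(candidates, key=lambda i: used[i])
        used[target] += demand
    return used


def idle_hosts(used: List[float]) -> int:
    """Hosts with nothing running can be scaled down entirely."""
    return sum(1 for u in used if u == 0)
```

With tasks of 2, 4, 1, 3, 2 and 4 CPUs on four 16-CPU hosts, best fit fills one host and leaves three idle, while spreading leaves none idle.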
Scheduling - Constraints including AZ Balancing
User specifies constraints
● AZ Balancing
● Resource and Task
affinity
● Hard and soft
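One plausible reading of hard versus soft AZ balancing, sketched with hypothetical names. The soft version prefers hosts in the availability zone holding the fewest of this job's tasks but still places the task if that zone has no eligible host. The hard version filters out hosts whose zone would exceed the least-loaded zone by more than an assumed `max_skew`, even if that leaves nowhere to place the task.

```python
from collections import Counter
from typing import Dict, List, Optional


def pick_host_az_balanced(job_task_azs: List[str],
                          candidate_hosts: Dict[str, str]) -> Optional[str]:
    """Soft balancing: candidate_hosts maps host name -> AZ (hosts already known to fit).
    Favor the AZ with the fewest of this job's tasks; never fail just for balance."""
    if not candidate_hosts:
        return None
    per_az = Counter(job_task_azs)  # AZs with no tasks yet count as 0
    # Lowest task count wins; host name breaks ties deterministically.
    return min(candidate_hosts, key=lambda host: (per_az[candidate_hosts[host]], host))


def hard_az_filter(job_task_azs: List[str], candidate_hosts: Dict[str, str],
                   all_azs: List[str], max_skew: int = 1) -> Dict[str, str]:
    """Hard balancing: drop hosts whose AZ would end up more than max_skew tasks
    above the least-loaded AZ (all_azs must be non-empty)."""
    per_az = Counter(job_task_azs)
    floor = min(per_az[az] for az in all_azs)
    return {h: az for h, az in candidate_hosts.items() if per_az[az] + 1 - floor <= max_skew}
```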
Scheduling - Agent upgrades
Operator updates Titus agent
codebase
● New tasks are scheduled on the new cluster
● Batch jobs drain
● Service tasks are migrated via
Spinnaker pipelines
● Old cluster autoscales down
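The upgrade flow above can be summarized as a small piece of state logic. This is a hedged sketch with hypothetical names: the old cluster stops receiving new placements, batch tasks are left to finish, service tasks count as moved once a Spinnaker pipeline has redeployed them elsewhere, and the old cluster becomes eligible to scale down only when nothing remains on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentCluster:
    name: str
    schedulable: bool = True
    batch_tasks: int = 0      # drain on their own as batch jobs finish
    service_tasks: int = 0    # moved off by redeploying through pipelines


def start_upgrade(old: AgentCluster, new: AgentCluster) -> None:
    """Route all new placements to the cluster running the new agent code."""
    old.schedulable = False
    new.schedulable = True


def placement_targets(clusters: List[AgentCluster]) -> List[str]:
    """Clusters the scheduler may place new tasks on."""
    return [c.name for c in clusters if c.schedulable]


def can_scale_down(cluster: AgentCluster) -> bool:
    """An old cluster may shrink once batch work has drained and services have moved."""
    return not cluster.schedulable and cluster.batch_tasks == 0 and cluster.service_tasks == 0
```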
Future
Some Titus Futures
● Perf/Scalability, Ops Enablement, Reliability
○ Better resiliency driven by directed chaos testing
○ More scale (2 orders of magnitude by 2019)
○ Hands-off, canaried automation of all operational tasks
● Scheduling
○ Advanced job and AWS rate limiting
○ Easier and more scalable fleet management
○ “Trough” management and improved batch SLA
● Container Execution
○ Improved isolation
○ Deeper and automated layers of security
○ Pods (system services, then application sidecars)
● Netflix Infrastructure and AWS Integration
○ Chargeback visibility and automated improvements
○ ALB support
Questions?

More Related Content

What's hot

Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...
Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...
Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...WSO2
 
Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...
Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...
Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...OpenStack
 
Using OpenStack Swift for Extreme Data Durability
 Using OpenStack Swift for Extreme Data Durability Using OpenStack Swift for Extreme Data Durability
Using OpenStack Swift for Extreme Data DurabilityChristian Schwede
 
WSO2 Microservices Framework for Java - Product Overview
WSO2 Microservices Framework for Java - Product OverviewWSO2 Microservices Framework for Java - Product Overview
WSO2 Microservices Framework for Java - Product OverviewWSO2
 
Cncf storage-final-filip
Cncf storage-final-filipCncf storage-final-filip
Cncf storage-final-filipJuraj Hantak
 
NATS vs HTTP
NATS vs HTTPNATS vs HTTP
NATS vs HTTPApcera
 
Kubecon 2019_eu-k8s-secrets-csi
Kubecon 2019_eu-k8s-secrets-csiKubecon 2019_eu-k8s-secrets-csi
Kubecon 2019_eu-k8s-secrets-csiRita Zhang
 
The evolving container landscape
The evolving container landscapeThe evolving container landscape
The evolving container landscapeNilesh Trivedi
 
Kubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceKubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceCloudOps2005
 
Neutron Updates - Liberty Edition
Neutron Updates - Liberty Edition Neutron Updates - Liberty Edition
Neutron Updates - Liberty Edition OpenStack Foundation
 
Initial presentation of swift (for montreal user group)
Initial presentation of swift (for montreal user group)Initial presentation of swift (for montreal user group)
Initial presentation of swift (for montreal user group)Marcos García
 
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s worldDávid Kőszeghy
 
Implementing Microservices with NATS
Implementing Microservices with NATSImplementing Microservices with NATS
Implementing Microservices with NATSApcera
 
Running Netflix OSS on Docker with Nirmata
Running Netflix OSS on Docker with NirmataRunning Netflix OSS on Docker with Nirmata
Running Netflix OSS on Docker with NirmataDamien Toledo
 
A New Way of Thinking | NATS 2.0 & Connectivity
A New Way of Thinking | NATS 2.0 & ConnectivityA New Way of Thinking | NATS 2.0 & Connectivity
A New Way of Thinking | NATS 2.0 & ConnectivityNATS
 
Kubera Launch Webinar: Kubernetes native management of Kubernetes native data
Kubera Launch Webinar: Kubernetes native management of Kubernetes native dataKubera Launch Webinar: Kubernetes native management of Kubernetes native data
Kubera Launch Webinar: Kubernetes native management of Kubernetes native dataMayaData Inc
 

What's hot (20)

Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...
Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...
Webinar: Achieving Economies of Web Scale in Your Enterprise with Containeriz...
 
Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...
Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...
Aptira presents OpenStack Load Balancing as a Service at Banglore India OSUG ...
 
Netflix Data Benchmark @ HPTS 2017
Netflix Data Benchmark @ HPTS 2017Netflix Data Benchmark @ HPTS 2017
Netflix Data Benchmark @ HPTS 2017
 
Using OpenStack Swift for Extreme Data Durability
 Using OpenStack Swift for Extreme Data Durability Using OpenStack Swift for Extreme Data Durability
Using OpenStack Swift for Extreme Data Durability
 
WSO2 Microservices Framework for Java - Product Overview
WSO2 Microservices Framework for Java - Product OverviewWSO2 Microservices Framework for Java - Product Overview
WSO2 Microservices Framework for Java - Product Overview
 
Cncf storage-final-filip
Cncf storage-final-filipCncf storage-final-filip
Cncf storage-final-filip
 
NATS vs HTTP
NATS vs HTTPNATS vs HTTP
NATS vs HTTP
 
Kubecon 2019_eu-k8s-secrets-csi
Kubecon 2019_eu-k8s-secrets-csiKubecon 2019_eu-k8s-secrets-csi
Kubecon 2019_eu-k8s-secrets-csi
 
The evolving container landscape
The evolving container landscapeThe evolving container landscape
The evolving container landscape
 
Open stack wtf_(1)
Open stack  wtf_(1)Open stack  wtf_(1)
Open stack wtf_(1)
 
Kubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceKubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz Rice
 
Neutron Updates - Liberty Edition
Neutron Updates - Liberty Edition Neutron Updates - Liberty Edition
Neutron Updates - Liberty Edition
 
Initial presentation of swift (for montreal user group)
Initial presentation of swift (for montreal user group)Initial presentation of swift (for montreal user group)
Initial presentation of swift (for montreal user group)
 
Samuel Bercovici - lbaaS for Havana
Samuel Bercovici - lbaaS for HavanaSamuel Bercovici - lbaaS for Havana
Samuel Bercovici - lbaaS for Havana
 
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
 
Implementing Microservices with NATS
Implementing Microservices with NATSImplementing Microservices with NATS
Implementing Microservices with NATS
 
Running Netflix OSS on Docker with Nirmata
Running Netflix OSS on Docker with NirmataRunning Netflix OSS on Docker with Nirmata
Running Netflix OSS on Docker with Nirmata
 
A New Way of Thinking | NATS 2.0 & Connectivity
A New Way of Thinking | NATS 2.0 & ConnectivityA New Way of Thinking | NATS 2.0 & Connectivity
A New Way of Thinking | NATS 2.0 & Connectivity
 
Glance Updates - Liberty Edition
Glance Updates - Liberty EditionGlance Updates - Liberty Edition
Glance Updates - Liberty Edition
 
Kubera Launch Webinar: Kubernetes native management of Kubernetes native data
Kubera Launch Webinar: Kubernetes native management of Kubernetes native dataKubera Launch Webinar: Kubernetes native management of Kubernetes native data
Kubera Launch Webinar: Kubernetes native management of Kubernetes native data
 

Similar to Netflix Titus WASP October 2017

Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Sharma Podila
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thingaspyker
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsAll Things Open
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containersaspyker
 
Craig Box (Google) - The road to Kubernetes 1.0
Craig Box (Google) - The road to Kubernetes 1.0Craig Box (Google) - The road to Kubernetes 1.0
Craig Box (Google) - The road to Kubernetes 1.0Outlyer
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemonsaspyker
 
Monitoring hybrid container environments
Monitoring hybrid container environments Monitoring hybrid container environments
Monitoring hybrid container environments Samuel Vandamme
 
Unleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platformUnleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platformLakmal Warusawithana
 
Container World 2018
Container World 2018Container World 2018
Container World 2018aspyker
 
WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017Imesh Gunaratne
 
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...Amazon Web Services
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integrationaspyker
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps DevopsSreenivas Makam
 
Scaling Open edX with Kubernetes
Scaling Open edX with KubernetesScaling Open edX with Kubernetes
Scaling Open edX with KubernetesAppsembler
 
Future of Microservices - Jakub Hadvig
Future of Microservices - Jakub HadvigFuture of Microservices - Jakub Hadvig
Future of Microservices - Jakub HadvigWEBtlak
 
DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses  DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses Docker, Inc.
 
Automating using Ansible
Automating using AnsibleAutomating using Ansible
Automating using AnsibleAlok Patra
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101Vishwas N
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 

Similar to Netflix Titus WASP October 2017 (20)

Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
Craig Box (Google) - The road to Kubernetes 1.0
Craig Box (Google) - The road to Kubernetes 1.0Craig Box (Google) - The road to Kubernetes 1.0
Craig Box (Google) - The road to Kubernetes 1.0
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
Monitoring hybrid container environments
Monitoring hybrid container environments Monitoring hybrid container environments
Monitoring hybrid container environments
 
Unleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platformUnleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platform
 
Container World 2018
Container World 2018Container World 2018
Container World 2018
 
WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017WSO2 Kubernetes Reference Architecture - Nov 2017
WSO2 Kubernetes Reference Architecture - Nov 2017
 
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps Devops
 
Scaling Open edX with Kubernetes
Scaling Open edX with KubernetesScaling Open edX with Kubernetes
Scaling Open edX with Kubernetes
 
Future of Microservices - Jakub Hadvig
Future of Microservices - Jakub HadvigFuture of Microservices - Jakub Hadvig
Future of Microservices - Jakub Hadvig
 
DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses  DCSF19 How Docker Simplifies Kubernetes for the Masses
DCSF19 How Docker Simplifies Kubernetes for the Masses
 
Automating using Ansible
Automating using AnsibleAutomating using Ansible
Automating using Ansible
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Netflix Titus WASP October 2017

  • 1. Containers at Netflix WASP 10/19/17 Andrew Leung
  • 3. Motivating Factors For Containers ● From Late 2015 Technical Strategy ... ● Simpler management of compute resources ● Simpler deployment packaging artifacts for compute jobs ● Need for a consistent local developer environment 3
  • 4. Provided Innovation Velocity Media Encoding - encoding research development time ● Using custom VM’s - 1 month ● Using customizable containers - 1 week Niagara ● Build all Netflix codebases in hours ● Saves development 100’s of hours of debugging NodeQuark ● Focus returns to app development ● Newt & Titus simplifies, speeds test and deployment 4
  • 5. Consistent Developer Experience ● NeWT - Common local developer experience including support for container development ○ Container image used for local laptop development ○ Same container image re-used when deployed ● Has benefits in both directions ○ Cloud like local development environment ○ Easier operational debugging of cloud workloads 5
  • 6. What is Titus? ● Cloud runtime platform for container based jobs ● Scheduling ○ Service & batch job management ○ Advanced resource management across elastic shared resource pool ● Container Execution ○ Advanced Isolation ○ Docker and AWS Integration ○ Containers integration with Netflix infrastructure 6 Service Job Management Resource Management & Optimization Container Execution Integration Batch
  • 7. Titus Evolution Timeframe
    ● 4Q 2015: Titus created, Batch GA
    ● 1Q 2016: Service support added
    ● 2Q 2016: Netflix infrastructure & AWS integration
    ● 4Q 2016: First production service at scale
    ● 2Q 2017: First user-path service
  • 8. Containers Scale Over Time
    ● From a thousand daily
    ● To 100K daily
    ● Spike to 450K
    (Chart: containers launched per day)
  • 9. Titus Current Scale
    ● Deployed across multiple AWS accounts & three regions
    ● Over 5,000 instances (mostly m4.4xlarge & r4.8xlarge)
    ● Over a one-week period, launched over 1,000,000 containers
    ● Around 10,000 long-running containers
  • 10. Current Titus Users (Sampling)
    ● Service
      ○ Stream processing (Flink)
      ○ UI services (single-core Node.js)
      ○ Internal dashboards
    ● Batch
      ○ Algorithm model training, personalization & recommendations (with GPUs)
      ○ Content value analysis
      ○ Digital watermarking
      ○ Ad hoc reporting (e.g. Open Connect CDN analysis and planning)
      ○ Continuous integration builds
    ● Queued worker model
      ○ Media encoding experimentation (Archer)
  • 11. Titus Overview
    (Architecture diagram. Control plane: Titus UI, Rhea, Titus API, Titus Master (Job Management & Scheduler, Fenzo), Cassandra, Zookeeper, Mesos Master, EC2 Auto-scaling API, Docker Registry. Titus Agent on AWS VMs: Mesos agent, Titus executor, Docker, containers, Pod & VPC network drivers, AWS metadata proxy, metrics agents, logging agent, btrfs. Also shown: S3, Integration.)
  • 12. AWS Integration
    ● Making Docker containers integrate with AWS the way VMs do
    ● Titus adds
      ○ VPC connectivity (IP per container)
      ○ Security groups
      ○ EC2 metadata service
      ○ IAM roles
      ○ Multi-tenant isolation (CPU, memory, disk quota, network)
      ○ Live and S3-persisted log rotation & management
      ○ Remote storage (EFS)
      ○ Autoscaling of service jobs
      ○ GPU support
      ○ Environmental context similar to EC2 user data
  • 13. Multi-tenant networking is hard
    ● Decided early on that we wanted a full IP stack per container
    ● But what about:
      ○ Security group support
      ○ IAM role support
      ○ Network bandwidth isolation
      ○ Integration with VPC
  • 14-17. Networking - VPC Driver
  • 19. Networking - Putting it all together
  • 20. Isolation
    ● CPU
      ○ Fixed shares today (pinning coming)
    ● Memory
      ○ Including page cache
    ● Disk
      ○ Quotas
    ● Networking
      ○ Bandwidth, ENIs and IPs
    ● Security
      ○ User namespaces, locked-down hosts, secret management
  • 21. Netflix Infrastructure Integration
    ● Provide a single cloud platform (VMs and containers treated the same)
    ● Titus adds integration with
      ○ Spinnaker CI/CD and canaries
      ○ Atlas telemetry and outlier detection
      ○ Discovery/IPC
      ○ Edda (and dependent systems)
      ○ Instance pollers (health checks, system metrics)
      ○ Chaos Monkey
      ○ Traffic control & Kong
      ○ Netflix secure secret management
      ○ Interactive access (à la ssh)
    ● Supports both reserved critical and elastically scaled flex workloads
    ● Manages containers under both service and batch systems
  • 22. Netflix Cloud Infrastructure (VMs + Containers)
    Why? Single Consistent Cloud Platform
  • 24. Deploy based on new image tags
  • 25. Basic resource requirements; IAM roles & security groups per container; deploy strategies same as VMs
  • 29. Container Level Introspection
    ● Interactive “ssh” and file “scp” access, managed by Titus hosts
    ● Locked down, as hosts are secured and accessible only to Titus operators
  • 31. Fenzo - The heart of Titus scheduling
    ● Extensible library for scheduling frameworks
    ● Plugin-based scheduling objectives
      ○ Bin packing, etc.
    ● Heterogeneous resources & tasks
    ● Cluster autoscaling
      ○ Multiple instance types
    ● Plugin-based constraints evaluator
      ○ Resource affinity, task locality, etc.
    ● Single-offer mode added in support of ECS
  • 32. Scheduling - Capacity Guarantees
    ● Titus maintains …
    ● Critical tier
      ○ Guaranteed capacity & start latencies
    ● Flex tier
      ○ More dynamic capacity & variable start latency
  • 33. Scheduling - Bin Packing, Elastic Scaling
    User adds work tasks
    ● Titus does bin packing so that entire hosts can be scaled down efficiently
  • 34. Scheduling - Constraints including AZ Balancing
    User specifies constraints
    ● AZ balancing
    ● Resource and task affinity
    ● Hard and soft
  • 35. Scheduling - Agent upgrades
    Operator updates Titus agent codebase
    ● New scheduling on new cluster
    ● Batch jobs drain
    ● Service tasks are migrated via Spinnaker pipelines
    ● Old cluster autoscales down
  • 37. Some Titus Futures
    ● Perf/Scalability, Ops Enablement, Reliability
      ○ Better resiliency driven by directed chaos testing
      ○ More scale (2 orders of magnitude by 2019)
      ○ Hands-off, canaried automation of all operational tasks
    ● Scheduling
      ○ Advanced job and AWS rate limiting
      ○ Easier and more scalable fleet management
      ○ “Trough” management and improved batch SLA
  • 38. Some Titus Futures
    ● Container Execution
      ○ Improved isolation
      ○ Deeper and automated layers of security
      ○ Pods (system services, then application sidecars)
    ● Netflix Infrastructure and AWS Integration
      ○ Chargeback visibility and automated improvements
      ○ ALB support
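Slides 13 through 19 describe giving every container its own IP stack while still honoring per-container security groups inside a VPC. One plausible model, assumed here rather than taken from the Titus VPC driver, attaches one ENI per distinct security-group set and hands each container a secondary IP on an ENI whose groups match its own. The Python sketch below uses hypothetical names and limits (`HostNetwork`, `max_enis`, `ips_per_eni`) and fake addresses in place of real VPC allocation.

```python
from dataclasses import dataclass, field


@dataclass
class Eni:
    """One elastic network interface; every IP on it shares one security-group set."""
    security_groups: frozenset
    ips: list = field(default_factory=list)  # (container id, ip) pairs


class HostNetwork:
    """Hypothetical per-host allocator: one ENI per distinct security-group set,
    one secondary IP per container."""

    def __init__(self, max_enis: int, ips_per_eni: int):
        self.max_enis = max_enis
        self.ips_per_eni = ips_per_eni
        self.enis: list[Eni] = []
        self._next_ip = 1  # stand-in for addresses the VPC would hand out

    def assign(self, container_id: str, security_groups: set) -> tuple[int, str]:
        """Return (ENI index, IP) for a container with the given security groups."""
        groups = frozenset(security_groups)
        # Reuse an attached ENI that has the same groups and a free IP slot.
        for index, eni in enumerate(self.enis):
            if eni.security_groups == groups and len(eni.ips) < self.ips_per_eni:
                return index, self._add_ip(eni, container_id)
        # Otherwise attach a new ENI, if the instance still has room for one.
        if len(self.enis) >= self.max_enis:
            raise RuntimeError("no ENI capacity left for this security-group set")
        self.enis.append(Eni(groups))
        return len(self.enis) - 1, self._add_ip(self.enis[-1], container_id)

    def _add_ip(self, eni: Eni, container_id: str) -> str:
        ip = f"10.0.0.{self._next_ip}"
        self._next_ip += 1
        eni.ips.append((container_id, ip))
        return ip
```

With `max_enis=2` and `ips_per_eni=2`, two `sg-web` containers share ENI 0, an `sg-db` container forces a second ENI, and a third `sg-web` container is rejected because no slot remains. Grouping by security-group set is what lets containers on one host carry different groups without needing an ENI per container.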
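Slide 20 notes that CPU isolation uses fixed shares today, with pinning still to come. CPU shares (Docker's `--cpu-shares`, default weight 1024) are relative weights that only matter under contention. The sketch below assumes, hypothetically, a weight of 1024 per requested CPU and computes the idealized cores each container gets when all of the listed containers are busy at once.

```python
def contended_cpu(host_cores: float, requested: dict[str, float],
                  weight_per_core: int = 1024) -> dict[str, float]:
    """Idealized cores each busy container receives when all of them compete.

    `requested` lists only the containers that are currently CPU-bound.
    Shares are relative weights: a container's slice is its weight divided by
    the total weight of the busy containers, times the host's cores.
    Thread counts and per-core scheduling effects are ignored.
    """
    # Hypothetical mapping: 1024 shares (Docker's default weight) per requested CPU.
    weights = {name: round(cores * weight_per_core) for name, cores in requested.items()}
    total = sum(weights.values())
    return {name: host_cores * weight / total for name, weight in weights.items()}
```

On an 8-core host with busy containers requesting 1, 2 and 4 CPUs, they receive 8/7, 16/7 and 32/7 cores. If only the 1-CPU container is busy, `contended_cpu(8, {"a": 1})` gives it all 8 cores: shares are weights, not caps, and that noisy-neighbor behavior is one plausible motivation for the planned pinning.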
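Slides 31 and 33 describe plugin-based scheduling objectives, with bin packing used so that whole hosts empty out and can be scaled down. Fenzo is a Java library with its own fitness-calculator plugins; the Python below is not its API, only a minimal best-fit sketch under assumed names (`Host`, `place`, `scale_down_candidates`). Each task goes to the feasible host whose CPU and memory would be fullest after placement, and hosts left with no tasks are the scale-down candidates.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    name: str
    cpus: float
    mem_gb: float
    tasks: list = field(default_factory=list)
    used_cpus: float = 0.0
    used_mem_gb: float = 0.0

    def fits(self, cpus: float, mem_gb: float) -> bool:
        return (self.used_cpus + cpus <= self.cpus
                and self.used_mem_gb + mem_gb <= self.mem_gb)

    def fitness_after(self, cpus: float, mem_gb: float) -> float:
        # Mean of CPU and memory utilization after placement (0..1); higher is tighter.
        return ((self.used_cpus + cpus) / self.cpus
                + (self.used_mem_gb + mem_gb) / self.mem_gb) / 2


def place(hosts: list[Host], task: str, cpus: float, mem_gb: float) -> Optional[Host]:
    """Best fit: assign the task to the feasible host that ends up most utilized."""
    candidates = [h for h in hosts if h.fits(cpus, mem_gb)]
    if not candidates:
        return None  # a real scheduler could scale the cluster up at this point
    best = max(candidates, key=lambda h: h.fitness_after(cpus, mem_gb))
    best.tasks.append(task)
    best.used_cpus += cpus
    best.used_mem_gb += mem_gb
    return best


def scale_down_candidates(hosts: list[Host]) -> list[str]:
    """Hosts with no tasks can be terminated without moving any work."""
    return [h.name for h in hosts if not h.tasks]
```

With three 16-CPU, 64 GB hosts, tasks of (4, 16), (4, 16) and (8, 32) all land on `h1` until it is full, a (2, 8) task goes to `h2`, and `h3` stays empty and can be scaled down. A spreading policy would have put a task on every host and left none idle.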
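Slide 34 lists AZ balancing, resource and task affinity, and hard versus soft constraints. A common reading of that split, assumed here rather than taken from Titus, is that hard constraints remove hosts from consideration while soft constraints only rank the hosts that remain. In the sketch below the hard constraint is a hypothetical required instance type, and AZ balancing is a soft preference for the zone running the fewest tasks of the same job; `Agent` and `choose_agent` are illustrative names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    zone: str
    instance_type: str


def choose_agent(agents: list[Agent], placed_zones: list[str],
                 required_type: Optional[str] = None) -> Optional[Agent]:
    """Pick an agent for the next task of a job.

    placed_zones: availability zones of the job's already-running tasks.
    required_type: hypothetical hard constraint (None means unconstrained).
    """
    # Hard constraint: agents that violate it are never eligible.
    eligible = [a for a in agents
                if required_type is None or a.instance_type == required_type]
    if not eligible:
        return None  # the task stays pending rather than breaking a hard rule
    # Soft constraint: prefer the zone with the fewest tasks of this job (ties: first).
    counts = Counter(placed_zones)
    return min(eligible, key=lambda a: counts[a.zone])
```

Placing three tasks in a row across agents in us-east-1a, us-east-1b and us-east-1c spreads them one per zone. Once a hard constraint excludes one zone's agent, the soft preference can only choose among the agents that remain, so balance yields to the hard rule.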
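Slide 35 outlines agent upgrades: new scheduling goes to a new cluster, batch jobs drain, service tasks are migrated via Spinnaker pipelines, and the old cluster autoscales down. The sketch below models only the old cluster's side of one pass, under stated assumptions: batch tasks are left to finish rather than interrupted, moving a service task is reduced to appending it to a list (a stand-in for a Spinnaker pipeline), and `Task` and `drain_round` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    kind: str  # "batch" or "service"


def drain_round(old_hosts: dict[str, list[Task]], new_cluster: list[Task],
                finished_batch: set[str]) -> list[str]:
    """One pass of moving work off an old agent cluster.

    old_hosts: old-cluster host name -> tasks still running on it (mutated).
    new_cluster: tasks running on the new cluster (mutated).
    finished_batch: ids of batch tasks that completed since the last pass.
    Returns the old hosts that are now empty and can be scaled down.
    """
    for name, tasks in old_hosts.items():
        remaining = []
        for task in tasks:
            if task.kind == "service":
                # Stand-in for a Spinnaker pipeline relaunching it on the new cluster.
                new_cluster.append(task)
            elif task.task_id not in finished_batch:
                remaining.append(task)  # batch work is waited on, never interrupted
        old_hosts[name] = remaining
    empty = [name for name, tasks in old_hosts.items() if not tasks]
    for name in empty:
        del old_hosts[name]  # the old cluster's autoscaler removes idle hosts
    return empty
```

Calling `drain_round` repeatedly reports each old host in the pass where its last task goes away; once it returns every remaining host, the old cluster is empty.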