
DCSF19 Container Security: Theory & Practice at Netflix



Michael Wardrop, Netflix

Usage of containers has grown rapidly at Netflix and is still accelerating. Our container story started organically with developers downloading Docker and using it to improve their developer experience. The first production workloads were simple batch jobs; pioneering micro-services followed, and containers then became a first class platform running critical workloads.

As the types of workloads changed and their importance increased, the security of our container ecosystem needed to evolve and adapt. This session will cover some security theory and architecture, along with practical considerations and lessons we learnt along the way.

Published in: Technology


  1. 1. Michael Wardrop, Senior Security Software Engineer, Platform Security. Container Security: Theory & Practice @ Netflix
  2. 2. Why? There are lots of great presentations about Container Security theory. Not so many about the challenges of doing it in practice. I hope to inspire more sharing so that we learn from each other and improve everyone’s security together.
  3. 3. Who? Security Practitioners Developers who are curious about ‘how the sausage is made’ Builders and Operators of container platforms
  4. 4. Context Know your threat models - My threat models may not be the same as yours Don’t copy & paste security - Tailor solutions to your context Although I am presenting, this is the work of many people from multiple teams over a few years
  5. 5. Containers @ Netflix
  6. 6. Containers at Netflix Started organically with engineers • Improved polyglot development and testing experience Basic batch processing systems • cron in the cloud • Extract, Transform, Load With momentum came demand • Container management platform • Integration with AWS and Netflix ecosystem
  7. 7. est 2015
  8. 8. Titus Netflix’s Container Management Platform > 3 million containers launched per week Scheduling • Service & batch job lifecycle • Resource management AWS & Netflix Integrations High churn • Most batch workloads < 1 hour • Due to auto scaling, most Service containers < 1 day Multi Region, multi AZ Chaos Monkey & regional failover > 1K different images
  9. 9. Titus: High Level Architecture
  10. 10. Titus: High Level Architecture
  11. 11. Newt Netflix Workflow Toolkit - from Productivity Engineering • Initialization of Projects (Stash repos, Jenkins jobs, Spinnaker pipelines, & alerts) • Code generation • Consistent development environment in polyglot world • Isolated, reproducible, and cacheable builds • Container based testing • Good place to incorporate best practices and secure defaults Docker is an important component
  12. 12. Rapid growth of container use cases • 1000+ services • Netflix API, Node.js Backend UI Scripts • Machine Learning (GPUs) for personalization • Encoding and Content use cases • Netflix Studio use cases • CDN tracking and planning • Massively parallel CI system • Data Pipeline routing & Stream Processing as a Service • Big Data platform use cases
  13. 13. Container Security Theory
  14. 14. What’s interesting about OCI Containers? 1. Operating System virtualization - rely on the OS Kernel for security. On Linux, this means: • Namespaces - different userland views • Control Groups - resource limits • Seccomp - Syscall filtering • Mandatory Access Control - AppArmor, SELinux, etc. • Capabilities - break up the power of root • Pivot Root - change the root file system 2. File System Image - bring your dependencies with you. Implemented as a Tar of Tars with some metadata.
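The "Tar of Tars with some metadata" idea can be sketched in a few lines. This is a toy layout, not the real OCI image layout: the file names `layerN.tar` and `manifest.json` and the helper names are assumptions made for illustration, and whiteout (deletion) entries are not handled.

```python
import io
import json
import tarfile


def make_tar(files: dict) -> bytes:
    """Pack a {name: bytes} mapping into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_image(layers: list) -> bytes:
    """Build a toy 'tar of tars': one inner tar per layer plus a manifest."""
    members = {}
    layer_names = []
    for index, layer_files in enumerate(layers):
        name = f"layer{index}.tar"  # hypothetical naming scheme
        members[name] = make_tar(layer_files)
        layer_names.append(name)
    # The metadata records the order in which layers must be applied.
    members["manifest.json"] = json.dumps({"layers": layer_names}).encode()
    return make_tar(members)


def flatten_image(image: bytes) -> dict:
    """Apply layers in manifest order; later layers overwrite earlier files."""
    rootfs = {}
    with tarfile.open(fileobj=io.BytesIO(image)) as outer:
        manifest = json.load(outer.extractfile("manifest.json"))
        for layer_name in manifest["layers"]:
            layer_bytes = outer.extractfile(layer_name).read()
            with tarfile.open(fileobj=io.BytesIO(layer_bytes)) as layer:
                for member in layer.getmembers():
                    rootfs[member.name] = layer.extractfile(member).read()
    return rootfs
```

Because each layer is its own archive, a patched base layer can be swapped in without touching application layers, which is part of why image scanning and patch management look different for containers.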
  15. 15. Container Ecosystem Security What is different? Diagram labels: Registry Image Scanning Patch Management Control Plane Cloud Networking and APIs Developer Identity Service Identity Development Production Secret Management Key Management Version Control Source Code Continuous Integration Continuous Delivery
  16. 16. Container Ecosystem Security What isn’t impacted? Diagram labels: Registry Image Scanning Patch Management Control Plane Cloud Networking and APIs Developer Identity Service Identity Development Production Secret Management Key Management Version Control Source Code Continuous Integration Continuous Delivery
  17. 17. Practice Container Security
  18. 18. Cloud Security AWS EC2 Metadata proxy • Started with one per host, changed to one per container • Block Server Side Request Forgery (SSRF) and XML External Entity (XXE) injection • Honey Credentials Identity & Access Management • IAM Role per container • Limit IAM permissions for the host & bind credentials to the host • Restrict which IAM roles can be used by which Applications Elastic Network Interfaces • VPC routable IP Address per container • Assigning Security Groups to containers Cloud APIs have great power, protect them!
  19. 19. Cloud Security Separate accounts for Control Plane and Workers (12 accounts total) STS service in control plane account • AuthN, AuthZ, & Audit Container’s Identity based on Target IAM account • Workload can be logically in Target account despite executing on Titus Diagram: New Titus Architecture. Labels: Agent Pool; Titus US-East-1 Control Plane Account; Titus US-East-1 Agent Account; Federation (New); Internet; Titus US-East-1 Account; Federation (Old)
  20. 20. Cloud Security OSS tools from Netflix - some assembly required, talk to us
  21. 21. Control Plane Security Root controls ONE host, Control Plane controls ALL hosts. API • V1 was http • V2 was https with optional mutual TLS • V3 mutual TLS only with audit logs Master to Workers communication • Originally relied on Security Groups • no authentication, authorization, or encryption • Dangerous! 1 misconfiguration away from shadow control plane attacks • Mutual TLS authN • AuthZ policies • Auditing got root control?
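The "mutual TLS only" posture of the V3 API can be illustrated with Python's standard `ssl` module. This is a minimal sketch, not the Titus implementation; the function name and the file-path parameters are assumptions, and the paths are optional here only so the snippet runs without certificates on disk.

```python
import ssl


def make_mtls_server_context(ca_file=None, cert_file=None, key_file=None):
    """Build a server-side TLS context that rejects clients without a valid cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED is what makes the TLS "mutual": the handshake fails
    # unless the client presents a certificate chaining to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # CA bundle used to verify client certificates (hypothetical path).
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own identity (hypothetical paths).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

A real deployment would always pass all three files and would then feed the verified client identity into authorization policies and audit logs, as the slide describes.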
  22. 22. Control Plane Security Problem: Invalid Jobs Uses REST/JSON poorly { env: { "PATH" : null } } Symptoms • Scheduler crashes, fails over, crashes, repeat Solutions • Input validation, input fuzz testing, exception handling
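A minimal sketch of the input validation step, assuming a job is submitted as a JSON object with an optional `env` map. The function name and the exact rules are illustrative assumptions, not the Titus schema.

```python
import json


def validate_job(job):
    """Return a list of problems; an empty list means the job is acceptable."""
    if not isinstance(job, dict):
        return ["job must be a JSON object"]
    env = job.get("env", {})
    if not isinstance(env, dict):
        return ["env must be an object mapping names to string values"]
    problems = []
    for key, value in env.items():
        if not key:
            problems.append("env contains an empty variable name")
        elif not isinstance(value, str):
            # Catches the null value from the slide before it reaches the scheduler.
            problems.append(
                f"env[{key!r}] must be a string, got {type(value).__name__}"
            )
    return problems


# The invalid job from the slide, written as well-formed JSON.
bad_job = json.loads('{"env": {"PATH": null}}')
good_job = json.loads('{"env": {"PATH": "/usr/bin"}}')
```

Rejecting the request at the API boundary means a malformed job produces a client error rather than a scheduler crash and failover loop.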
  23. 23. Control Plane Security Problem: Failing Jobs That Repeat, for example Image: “org/imagename:lateest” or Command: /bin/besh -c … Symptoms • Scheduler works really hard • Cloud resources are allocated / deallocated fast Solution • Rate limiting of failing jobs
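One common way to rate limit failing jobs is exponential backoff per job. The slides do not say which scheme Titus uses, so the class below is a sketch under that assumption; the class name, method names, and default delays are hypothetical.

```python
class FailureBackoff:
    """Delay relaunches of a job that keeps failing, doubling the wait each time."""

    def __init__(self, base_seconds=1.0, max_seconds=300.0):
        self.base_seconds = base_seconds
        self.max_seconds = max_seconds
        self._failures = {}      # job id -> consecutive failure count
        self._next_allowed = {}  # job id -> earliest time of the next launch

    def record_failure(self, job_id, now):
        """Register a failure at time `now` and return the delay imposed."""
        count = self._failures.get(job_id, 0) + 1
        self._failures[job_id] = count
        delay = min(self.base_seconds * 2 ** (count - 1), self.max_seconds)
        self._next_allowed[job_id] = now + delay
        return delay

    def record_success(self, job_id):
        """A successful run clears the job's failure history."""
        self._failures.pop(job_id, None)
        self._next_allowed.pop(job_id, None)

    def may_launch(self, job_id, now):
        """True if the job is not currently being held back."""
        return now >= self._next_allowed.get(job_id, 0.0)
```

With this in place, a job whose image tag or command is misspelled consumes one launch attempt per backoff window instead of churning cloud resources continuously.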
  24. 24. Identity for People Pandora: unified identity service Meechum: multi factor Single Sign On Metatron uses Meechum identity to create: ➡X.509 cert • person to service authN via Mutual TLS ➡SSH cert • Bastion access
  25. 25. Metatron: Identity for Services Cryptographic bootstrap of service identity in the cloud Established before application code, supports: • EC2 Instances built on our BaseAMI • Containers on Titus • Netflix Functions All get X.509 certificates for use in Mutual TLS, enabling authentication
  26. 26. Metatron: Identity for Services How? Starts with an Application in Spinnaker, which signs some metadata, and puts it in User Data given to AWS Round 1 • Based on metadata signed by AWS • No freshness guarantee, therefore no support for instance restarts • No Lambda support ☹ Round 2 • Based on KMS encryption context • Freshness guarantee, therefore can refresh identity at any time • Lambda support 🥳 Closest open source equivalent is
  27. 27. Gandalf: Authorization Gandalf decides who can be let in, and who shall not pass. • Web portal for defining policies • REST • gRPC • SSH • custom • Policy updates are pushed out to Authorization agent • All authorization decisions are made locally in ns
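The pattern of pushing policy to an agent and deciding locally can be sketched as follows. Gandalf's actual policy format and API are not shown in the slides, so the rule shape (`identity`, `action`, `resource`) and the class below are purely hypothetical.

```python
class LocalAuthorizer:
    """Holds a pushed copy of the policy and answers checks without a network call."""

    def __init__(self):
        self._allowed = frozenset()  # set of (identity, action, resource) tuples

    def replace_policies(self, rules):
        """Called when the policy service pushes an update; swaps the whole set."""
        self._allowed = frozenset(
            (rule["identity"], rule["action"], rule["resource"]) for rule in rules
        )

    def is_allowed(self, identity, action, resource):
        """Default deny: only exact matches against pushed rules are let in."""
        return (identity, action, resource) in self._allowed


agent = LocalAuthorizer()
agent.replace_policies(
    [{"identity": "app:billing", "action": "GET", "resource": "/invoices"}]
)
```

Since a decision is a single in-memory set lookup, it stays fast and keeps working if the policy service is briefly unreachable, at the cost of policies being only as fresh as the last push.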
  28. 28. SSH Access For extraordinary circumstances Vast majority of Instances and Containers go through their lifecycle without SSH access Initial implementation • Connection from bastion into limited environment on the host • Restricted docker exec and docker cp like functionality Current implementation • After authorization check, the Bastion calls the Titus control plane • A specially configured sshd is injected into the container • The bastion connects directly to the injected sshd
  29. 29. SSH
  30. 30. Secret Protection No secrets in code! Encrypted via Gandalf web portal • Define a policy for which Metatron identities (applications, groups, individuals) can access • Copy a Base64 encoded bundle / download a binary file Files in conventional path are automatically decrypted on instance / container startup and loaded into tmpfs Library support for transparently loading and decrypting from configuration files
  31. 31. Secret Protection The only place secrets should exist in the clear is in RAM when they are being used Diagram: an Instance / Container and a Decryption Server, each holding a Metatron X.509 cert, connected over Mutual TLS; messages labeled Blinded( EncryptedBundle(Secret, Policy Id) ) and Blinded( Secret )
  32. 32. Host Problem: one kernel vulnerability away from loss of containment Solutions • Don’t use a generic kernel, use one tuned for your environment • Get rid of unneeded features, modules, and drivers • Follow kernel hardening best practices like the Kernel Self Protection Project Consider: Firecracker
  33. 33. Runtime Use User Namespaces Docker 1.10 - Introduced User Namespaces • Didn’t work w/ shared networking NS Docker 1.11 - Fixed shared networking NS • User id mapping is per daemon (not per container) Titus uses unique user namespace per container, shared User Id mapping • Avoids problems with shared filesystems
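User id mapping can be made concrete with a small translation function. Each mapping entry follows the three-field layout the Linux kernel uses in `/proc/<pid>/uid_map` (id inside the namespace, id outside, length); the range starting at host id 100000 is an illustrative assumption, not the mapping Titus uses.

```python
def to_host_uid(container_uid, mappings):
    """Translate a uid seen inside a user namespace to the uid on the host.

    `mappings` is a list of (container_start, host_start, length) tuples.
    Returns None when the uid is not mapped at all.
    """
    for container_start, host_start, length in mappings:
        if container_start <= container_uid < container_start + length:
            return host_start + (container_uid - container_start)
    return None


# Hypothetical mapping: container ids 0..65535 map to host ids 100000..165535.
SHARED_MAPPING = [(0, 100000, 65536)]
```

With such a mapping, root inside the container (uid 0) is an unprivileged uid on the host. Sharing one mapping across containers keeps file ownership consistent on shared filesystems, which matches the trade-off noted on the slide.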
  34. 34. Vulnerability Management Problem: Stop known vulnerabilities from getting introduced into your ecosystem Solution: ‘Shift left’ • IDE plugins • Scanning of pull requests & builds in CI system
  35. 35. Vulnerability Management Problem: Discover and eliminate vulnerabilities in your ecosystem Theory: Scan your container images Practice: Discovering vulnerabilities is relatively easy, flushing them from your ecosystem is hard
  36. 36. Change Management People: > 1K Engineers. Applications: > 5K Micro Services. CI: > 600K CI builds per week. Artifacts: > 2K NPMs, > 17K Debians, > 17K AMIs, > 97K JARs. Artifact Churn: Not deployed for ~ 3 days, ~ 18K total; Deleted per day, ~ 13K total. Deployments & Autoscaling: > 3M containers deployed per week • Most batch workloads < 1 hour • Most Service containers < 1 day; ~ 50% VM Instance churn per day
  37. 37. Change Management Who needs to change what when? Change campaigns • Targeted & actionable communication • Email, Spinnaker, linters, build warnings Deprecation cycles • All micro services should be rebuilt & redeployed with latest supported artifact versions every 90 days • Act as a forcing function to purge old / vulnerable software Orange: campaign rules Pink: primary blockers Green: affected services
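The 90 day rebuild rule lends itself to a simple check that a change campaign could run. This is a sketch; the function name and the input shape (service name mapped to last rebuild date) are assumptions made for illustration.

```python
from datetime import date, timedelta

# Deprecation cycle from the slide: rebuild and redeploy every 90 days.
MAX_AGE = timedelta(days=90)


def overdue_services(last_rebuilt, today):
    """Return the names of services whose last rebuild is more than 90 days old."""
    return sorted(
        name for name, built in last_rebuilt.items() if today - built > MAX_AGE
    )


# Hypothetical inventory of services and their last rebuild dates.
inventory = {
    "api-gateway": date(2019, 1, 1),
    "encoder": date(2019, 4, 1),
}
```

The resulting list is the kind of input a targeted, actionable communication needs: who must change what, and by when.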
  38. 38. Takeaways 1. Cloud & Platform control planes are of strategic importance • protect with multiple independent layers of security 2. People and Service identity are the foundation • AuthN, AuthZ, & Auditing • Secret Management
  39. 39. Takeaways 3. Need to take an ecosystem approach • Container security does not happen in isolation • Engineers should get Security involved early in project / platform lifecycle • As a security practitioner you take what is there and iterate 4. Users will need help adopting containers responsibly • Expect problematic containers and workloads 5. Users expect ability to debug and performance tune • Metrics, Monitoring, and Alerting are key • SSH as break glass, not as a crutch
  40. 40. Security: Russell Lewis, OSCON 2016: How Netflix Gives All Its Engineers SSH Access To Instances Running In Production; Ian Haken, USENIX Enigma 2017: Secrets at Scale: Automated Bootstrapping of Secrets & Identity in the Cloud; Manish Mehta, CloudNativeCon 2017: How Netflix Is Solving Authorization Across Their Cloud; Manish Mehta, RWC 2018: Secrets at Scale; Travis McPeak, Enigma 2018: Least Privilege: Security Gain without Developer Pain; Netflix Tech Blog: Security. Titus Team: Netflix OSS: Season 6 Episode 1 - Titus, Slides, Source; QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemon