Resilience and Security @ Scale: Lessons Learned


Presented at Goldman Sachs 3/6/2013.

Published in: Technology

  1. 1. Resilience and Security @ Scale – Lessons Learned. Jason Chan
  2. 2. Netflix, Inc. “Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series . . .” Source:
  3. 3. Me: Director of Engineering @ Netflix. Responsible for: cloud app, product, infrastructure, ops security. Previously: led security team @ VMware; earlier, primarily security consulting at @stake, iSEC Partners.
  4. 4. Netflix in the Cloud – Why?
  5. 5. Availability and the Move to Streaming
  6. 6. “Undifferentiated Heavy Lifting”
  7. 7. Netflix Culture: “may well be the most important document ever to come out of the Valley.” Sheryl Sandberg, Facebook COO
  8. 8. Scale and Usage Curve
  9. 9. Netflix is now ~99% in the cloud
  10. 10. On the way to the cloud . . . (architecture)
  11. 11. On the way to the cloud . . . (organization) (or NoOps, depending on definitions)
  12. 12. Some As-Is #s: 33m+ subscribers; 10,000s of systems; 100s of engineers, apps; ~250 test deployments/day **; ~70 production deployments/day **. (** Sample based on one week's activities.)
  13. 13. Common Approaches to Resilience
  14. 14. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Architectural committees: designed to standardize on design patterns, vendors, etc. Problems for Netflix: Freedom and Responsibility culture; highly aligned and loosely coupled; innovation cycles.
  15. 15. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Change approval boards: designed to control and de-risk change; focus on artifacts, test and rollback plans. Problems for Netflix: Freedom and Responsibility culture; highly aligned and loosely coupled; innovation cycles.
  16. 16. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Centralized deployments: separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly). Problems for Netflix: Freedom and Responsibility culture; highly aligned and loosely coupled; innovation cycles.
  17. 17. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Vendor-specific, component-level HA: high reliance on vendor solutions to provide HA and resilience. Problems for Netflix: traditional data center oriented systems do not translate well to the cloud; heavy use of open source.
  18. 18. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Standards and checklists: designed for repeatable execution. Problems for Netflix: not suitable for load-based scaling and heavy automation; reliance on humans.
  19. 19. Approaches to Resilience @ Netflix
  20. 20. What does the business value? Customer experience. Remember these guys? Innovation and agility. In other words: stability and availability for customer experience; rapid development and change to continually improve product and outpace competition. Not that different from anyone else.
  21. 21. Overall Approach: Understand and solve for relevant failure modes. Rely on automation and tools instead of committees for evaluating architecture and changes. Make deployment easy and standardized.
  22. 22. Cloud Application Failure Modes and Effects
      Failure Mode          Probability   Current Mitigation
      App Failure           High          Automated fallback response
      AWS Region Failure    Low           Wait for recovery
      AWS Zone Failure      Medium        Continue running in 2 of 3 zones
      Datacenter Failure    Medium        Continue migrating to cloud
      Data Store Failure    Low           Restore from S3
      S3 Failure            Low           Restore from remote archive
      Risk-based approach given likely failures. Tackle high-probability events first.
  23. 23. Simian Army
  24. 24. Goals of Simian Army: “Each system has to be able to succeed, no matter what, even all on its own. We're designing each distributed system to expect and tolerate failure from other systems on which it depends.”
  25. 25. Chaos Monkey: “By frequently causing failures, we force our services to be built in a way that is more resilient.” Terminates cluster nodes during business hours. Rejects “If it ain't broke, don't fix it”. Goals: simulate random hardware failures, human error at small scale; identify weaknesses; no service impact.
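A minimal sketch of the selection step described on this slide, not Netflix's actual Chaos Monkey code. The function names, the 9:00 to 15:00 weekday window, and the instance IDs are assumptions for illustration; a real tool would then call the cloud API to terminate the chosen node.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The window is an assumed default, chosen so engineers are around to respond.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: List[str], now: datetime, rng: random.Random) -> Optional[str]:
    """Pick one instance to terminate, or None outside business hours or for an empty cluster."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)
```

Passing the clock and the random source in as arguments keeps the sketch deterministic under test, which matters for a tool whose whole job is to be random in production.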
  26. 26. Chaos Gorilla: Chaos Monkey's bigger brother. Standard deployment pattern is to distribute load/systems/data across three data centers (AZs). What happens if one is lost? Goals: simulate data center loss, hardware/service failures at larger scale; identify weaknesses, dependencies, etc.; minimal service impact.
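The "what happens if one is lost?" question can be asked on paper before it is asked in production. The sketch below is a hypothetical capacity check, assuming instance counts per zone and a required minimum are known; it is not part of Chaos Gorilla itself.

```python
from typing import Dict


def survives_zone_loss(instances_per_zone: Dict[str, int], required_instances: int) -> Dict[str, bool]:
    """For each zone, report whether the remaining zones still meet required_instances."""
    total = sum(instances_per_zone.values())
    return {
        zone: (total - count) >= required_instances
        for zone, count in instances_per_zone.items()
    }
```

An evenly spread cluster of 12 instances that needs 8 to serve peak load survives any single zone loss; a lopsided 6/3/3 split does not survive losing the large zone.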
  27. 27. Latency Monkey: Distributed systems have many upstream/downstream connections. How fault-tolerant are systems to dependency failure/slowdown? Goals: simulate latencies and error codes, see how a service responds; survivable services regardless of dependencies.
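A rough sketch of the idea, assuming in-process injection rather than whatever mechanism Latency Monkey actually uses: wrap a dependency call so it is sometimes slow or failing, and check that the caller degrades to a fallback. `InjectedFailure`, `with_faults`, and `call_with_fallback` are hypothetical names.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real dependency error."""


def with_faults(call, rng, error_rate=0.1, max_delay_s=0.0, sleep=time.sleep):
    """Wrap `call` so it may be delayed by up to max_delay_s and fails with probability error_rate."""
    def wrapped(*args, **kwargs):
        if max_delay_s > 0:
            sleep(rng.uniform(0, max_delay_s))
        if rng.random() < error_rate:
            raise InjectedFailure("simulated dependency error")
        return call(*args, **kwargs)
    return wrapped


def call_with_fallback(call, fallback):
    """Return the dependency's answer, or the fallback value if the call fails."""
    try:
        return call()
    except InjectedFailure:
        return fallback
```

With `error_rate=1.0` every call fails, so a surviving service is one that still returns its fallback response.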
  28. 28. Conformity Monkey: Without architecture review, how do you ensure designs leverage known successful patterns? Conformity Monkey provides automated analysis for pattern adherence. Goals: evaluate deployment modes (data center distribution); evaluate health checks, discoverability, versions of key libraries; help ensure service has best chance of successful operation.
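The checks listed on this slide lend themselves to small rule functions over a description of each cluster. The sketch below assumes a plain dictionary as the cluster description; the field names (`zones`, `health_check_url`, `registered_in_discovery`) are invented for illustration and are not Conformity Monkey's data model.

```python
from typing import Dict, List


def check_conformity(cluster: Dict, min_zones: int = 3) -> List[str]:
    """Return human-readable findings; an empty list means the cluster follows the patterns."""
    findings = []
    if len(set(cluster.get("zones", []))) < min_zones:
        findings.append("instances are not spread across %d zones" % min_zones)
    if not cluster.get("health_check_url"):
        findings.append("no health check configured")
    if not cluster.get("registered_in_discovery", False):
        findings.append("not registered with service discovery")
    return findings
```

Reporting findings rather than blocking the deployment fits the "highly aligned, loosely coupled" culture described earlier: owners get the information and decide what to do.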
  29. 29. Non-Simian Approaches. Org model: engineers write, deploy, support code. Culture: de-centralized with as few processes and rules as possible; lots of local autonomy; “If you're not failing, you're not trying hard enough”; peer pressure. Productive and transparent incident reviews.
  30. 30. AppSec Challenges
  31. 31. Lots of Good Advice: BSIMM; Microsoft SDL; SAFECode
  32. 32. But, what works? Forrester Consulting, 12/10
  33. 33. Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?
  34. 34. Deploying Code at Netflix
  35. 35. A common graph @ Netflix: weekend afternoon ramp-up; lots of watching in prime time; not as much in early morning. Old way: pay and provision for peak, 24/7/365. Multiply this pattern across the dozens of apps that comprise the Netflix streaming service.
  36. 36. Solution: Load-Based Autoscaling
  37. 37. Autoscaling. Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the 'auto' in autoscaling). Results: clusters continuously add & remove nodes; new nodes must mirror existing.
  38. 38. Every change requires a new cluster push (not an incremental change to existing systems)
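The two goals "# of systems matches load" and "load per server is constant" amount to a simple sizing rule. A minimal sketch, assuming load is measured in requests per second and that a target per-instance load has been chosen; the function name and the min/max bounds are illustrative, not an AWS API.

```python
import math


def desired_instances(total_load: float, target_load_per_instance: float,
                      min_instances: int = 1, max_instances: int = 1000) -> int:
    """Instances needed so each one carries at most target_load_per_instance."""
    if target_load_per_instance <= 0:
        raise ValueError("target_load_per_instance must be positive")
    needed = math.ceil(total_load / target_load_per_instance)
    # Clamp to the configured bounds so the cluster never scales to zero or without limit.
    return max(min_instances, min(max_instances, needed))
```

For example, 950 requests per second at a target of 100 per instance calls for 10 instances; as the prime-time peak fades, the same rule shrinks the cluster again.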
  39. 39. Deploying code must be easy (it is)
  40. 40. Netflix Deployment Pipeline: Perforce/Git (code change, config change) → YUM (RPM with app-specific bits) → Bakery (base image + RPM) → AMI (VM template ready to launch) → ASG (cluster config, running systems)
  41. 41. Operational Impact: No changes to running systems. No systems mgmt infrastructure (Puppet, Chef, etc.). Fewer logins to prod. No snowflakes. Trivial “rollback”.
  42. 42. Security Impact. Need to think differently on: vulnerability management; patch management; user activity monitoring; file integrity monitoring; forensic investigations.
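Why rollback is "trivial" when every change is a new cluster can be shown with a toy model. The sketch assumes versioned cluster names of the form `app-vNNN` (the pattern that appears later in the Exploit Monkey sample, e.g. `testapp-live-v001`) and a hypothetical `Deployment` class; it models bookkeeping only and makes no cloud calls.

```python
import re


def next_asg_name(current: str) -> str:
    """Return the name for the next cluster push, e.g. app-v001 -> app-v002."""
    match = re.fullmatch(r"(.+)-v(\d+)", current)
    if match is None:
        return current + "-v000"
    return "%s-v%03d" % (match.group(1), int(match.group(2)) + 1)


class Deployment:
    """Tracks which cluster receives traffic; old clusters are kept, never modified."""

    def __init__(self, active: str):
        self.active = active
        self.previous = None

    def push(self) -> str:
        """Launch a new cluster for the change and make it the active one."""
        self.previous, self.active = self.active, next_asg_name(self.active)
        return self.active

    def rollback(self) -> str:
        """Point traffic back at the untouched previous cluster."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.active, self.previous = self.previous, None
        return self.active
```

Because the previous cluster was never changed in place, rolling back is a routing decision rather than a repair job.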
  43. 43. Architecture, organization, deployment are all different. What about security?
  44. 44. We've adapted too. Some principles we've found useful.
  45. 45. Cloud Application Security: What We Emphasize
  46. 46. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Integrate, in two contexts: 1. Integration with your engineering ecosystem. 2. Integration of your security controls. Organization; SCM, build and release; monitoring and alerting.
  47. 47. Integration: Base AMI Testing. Base AMI: VM/instance template used for all cloud systems (average instance age = ~24 days, one-time sample). The base AMI is managed like other packages, via P4, Jenkins, etc. We watch the SCM directory & kick off testing when it changes: launch an instance of the AMI, perform vuln scan and other checks. Sample alert: SCAN COMPLETED ALERT. Site name: AMI1. Stopped by: N/A. Total Scan Time: 4 minutes 46 seconds. Critical Vulnerabilities: 5. Severe Vulnerabilities: 4. Moderate Vulnerabilities: 4.
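Once the scan result arrives as text like the alert on this slide, turning it into a pass/fail signal for the build is a small parsing step. A sketch under the assumption that the alert keeps the "<Severity> Vulnerabilities: <n>" wording shown above; `parse_scan_alert`, `ami_passes`, and the zero-critical threshold are hypothetical.

```python
import re
from typing import Dict

SAMPLE_ALERT = """SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
"""


def parse_scan_alert(text: str) -> Dict[str, int]:
    """Extract vulnerability counts keyed by lowercase severity."""
    pattern = r"(Critical|Severe|Moderate) Vulnerabilities:\s*(\d+)"
    return {severity.lower(): int(count) for severity, count in re.findall(pattern, text)}


def ami_passes(counts: Dict[str, int], max_critical: int = 0) -> bool:
    """Gate the base AMI on the number of critical findings (assumed policy)."""
    return counts.get("critical", 0) <= max_critical
```

With the sample alert, the parsed counts are 5 critical, 4 severe, and 4 moderate, so this assumed gate would fail the image.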
  48. 48. Integration: Control Packaging and Installation. From the RPM spec file of a webserver: “Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer”. Pulls in the following RPMs: HIDS agent; config assessment/firewall agent; host hardening package; WAF.
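Because the security controls ride along as package dependencies, a build-time check can confirm a spec file still declares them. This is an illustrative sketch, not a Netflix tool: the required set is taken from the `Requires:` line on the slide, and `missing_security_packages` is a hypothetical helper that reads only `Requires:` lines.

```python
from typing import Set

REQUIRED_SECURITY_PACKAGES = {"ossec", "cloudpassage", "nflx-base-harden", "hyperguard-enforcer"}


def missing_security_packages(spec_text: str) -> Set[str]:
    """Return required security packages not named on any Requires: line of the spec."""
    declared = set()
    for line in spec_text.splitlines():
        if line.strip().lower().startswith("requires:"):
            # Accept both space- and comma-separated dependency lists.
            declared.update(line.split(":", 1)[1].replace(",", " ").split())
    return REQUIRED_SECURITY_PACKAGES - declared
```

An empty result means the webserver package will pull in the HIDS agent, config assessment/firewall agent, hardening package, and WAF at install time.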
  49. 49. Integration: Timeline (Chronos). What IP addresses have been blacklisted by the WAF in the last few weeks? GET /api/v1/event?timelines=type:blacklist&start=20130125000000000. Which security groups have changed today? GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
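The two queries on this slide differ only in event type and start time, so they can be built by one helper. The sketch assumes the 17-digit `start` value is `YYYYMMDDHHMMSS` followed by milliseconds (an inference from the examples, not documented here) and uses a placeholder host; only the path and parameter names come from the slide.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_timestamp(moment: datetime) -> str:
    """Format a datetime as YYYYMMDDHHMMSS plus three millisecond digits (assumed format)."""
    return moment.strftime("%Y%m%d%H%M%S") + "%03d" % (moment.microsecond // 1000)


def event_query(base_url: str, event_type: str, start: datetime) -> str:
    """Build the GET URL for events of one type since `start`."""
    query = urlencode(
        {"timelines": "type:" + event_type, "start": chronos_timestamp(start)},
        safe=":",  # keep the colon in type:<name> readable, as on the slide
    )
    return base_url.rstrip("/") + "/api/v1/event?" + query
```

Calling `event_query("https://chronos.example.invalid", "blacklist", datetime(2013, 1, 25))` reproduces the path and query string of the first request shown above.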
  50. 50. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Make the right way easy: developers are lazy.
  51. 51. Making it Easy: Cryptex. Crypto: DDIY (“Don't Do It Yourself”). Many uses of crypto in web/distributed systems: encrypt/decrypt (cookies, data, etc.); sign/verify (URLs, data, etc.). Netflix also uses crypto heavily for device activation, DRM playback, etc.
  52. 52. Making it Easy: Cryptex. Multi-layer crypto system (HSM basis, scale-out layer): easy to use; key management handled transparently; access control and auditable operations.
  53. 53. Making it Easy: Cloud-Based SSO. In the AWS cloud, access to data center services is problematic (examples: AD, LDAP, DNS). But many cloud-based systems require authN, authZ (examples: dashboards, admin UIs). Asking developers to securely handle/accept credentials is also problematic.
  54. 54. Making it Easy: Cloud-Based SSO. Solution: leverage OneLogin SaaS SSO (SAML) used by IT for enterprise apps (e.g. Workday, Google Apps). Uses Active Directory credentials. Provides a single & centralized login page, so developers don't accept username & password directly. Built filter for our base server to make SSO/authN trivial.
  55. 55. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Self-service, with exceptions: self-service is perhaps the most transformative cloud characteristic; failing to adopt this for security controls will lead to friction.
  56. 56. Self-Service: Security Groups. Asgard cloud orchestration tool allows developers to configure their own firewall rules. Limited to same AWS account, no IP-based rules.
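The two limits on self-service rules (same AWS account, no IP-based sources) are easy to state as a check. A sketch assuming a rule is a small dictionary with either a `cidr` or a `source_group` plus `source_account`; this shape and the name `review_rule` are invented for illustration and are not Asgard's model.

```python
from typing import Dict, List


def review_rule(rule: Dict, own_account_id: str) -> List[str]:
    """Return reasons the rule needs a security exception; empty means self-service is fine."""
    reasons = []
    if rule.get("cidr") is not None:
        reasons.append("IP-based source")
    elif rule.get("source_account") != own_account_id:
        reasons.append("source group is in another AWS account")
    return reasons
```

Rules that come back with no reasons can be applied by the developer directly; anything else goes through the exception path, which is the "with exceptions" half of the principle.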
  57. 57. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Trust, but verify: culture precludes traditional “command and control” approach; organizational desire for agile, DevOps, CI/CD blur traditional security engagement touchpoints.
  58. 58. Trust but Verify: Security Monkey. Cloud APIs make verification and analysis of configuration and running state simpler. Security Monkey created as the framework for this analysis. Includes: certificate checking; firewall analysis; IAM entity analysis; limit warnings; resource policy analysis.
  59. 59. Trust but Verify: Security Monkey. Sample alert: From: Security Monkey. Date: Wed, 24 Oct 2012 17:08:18 +0000. To: Security Alerts. Subject: prod Changes Detected. Table of Contents: Security Groups Changed: Security Group <sgname> (eu-west-1 / prod) <#Security Group/<sgname> (eu-west-1 / prod)>
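A "Changes Detected" report like this one can be produced by comparing two snapshots of configuration pulled from the cloud API. The sketch assumes each snapshot maps a security group name to a collection of rule strings; that representation and `diff_security_groups` are hypothetical, not Security Monkey's internals.

```python
from typing import Dict, Iterable


def diff_security_groups(before: Dict[str, Iterable[str]],
                         after: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, list]]:
    """Return added and removed rules for every group that changed between snapshots."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = set(before.get(name, ())), set(after.get(name, ()))
        if old != new:
            changes[name] = {"added": sorted(new - old), "removed": sorted(old - new)}
    return changes
```

Groups that appear or disappear entirely show up as all-added or all-removed, so the same report covers creation and deletion as well as rule edits.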
  60. 60. Trust but Verify: Exploit Monkey. AWS Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans. Sample message: “On 10/23/12 12:35 PM, Exploit Monkey wrote: I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002. I'm starting a vulnerability scan against test app from these private/public IPs:”
  61. 61. Takeaways: Netflix runs a large, dynamic service in AWS. Newer concepts like cloud & DevOps need an updated approach to resilience and security. Specific context can help jumpstart a pragmatic and effective security program.
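The trigger described here, "the current ASG name changed, so rescan", reduces to comparing two observations of app-to-ASG mappings. A minimal sketch with a hypothetical `changed_apps` helper; how the mappings are collected and how the scan is launched are left out.

```python
from typing import Dict, Optional, Tuple


def changed_apps(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each app whose active ASG changed to its (old_name, new_name) pair."""
    return {
        app: (previous.get(app), asg)
        for app, asg in current.items()
        if previous.get(app) != asg
    }
```

Each entry in the result is a candidate for a fresh dynamic scan; newly seen apps appear with `None` as the old name.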
  64. 64. Questions?