Real World Cloud Application Security

  1. 1. Real World Cloud Application Security: Lessons Learned Running Large Scale Systems in the Public Cloud. Jason Chan - chan@netflix.com
  2. 2. Netflix, Inc. “With more than 27 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. is the world’s leading internet subscription service for enjoying movies and TV programs . . .” Source: http://ir.netflix.com
  3. 3. Me • Cloud Security Architect @ Netflix • Responsible for: cloud app, product, infrastructure, ops security • Previously: led security team @ VMware; earlier, primarily security consulting at @stake, iSEC Partners
  4. 4. AppSec Challenges
  5. 5. AppSec Challenges
  6. 6. Lots of Good Advice • BSIMM • Microsoft SDL • SAFECode
  7. 7. But, what works? Forrester Consulting, 12/10
  8. 8. Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?
  9. 9. Netflix Engineering Characteristics
  10. 10. Netflix in the Cloud - Why?
  11. 11. Netflix in the Cloud - Why?
  12. 12. Netflix in the Cloud - Why? “Undifferentiated heavy lifting”
  13. 13. Netflix in the Cloud - Why? “Undifferentiated heavy lifting”
  14. 14. Netflix in the Cloud - Why? “Undifferentiated heavy lifting”
  15. 15. Netflix is now ~99% in Public Cloud
  16. 16. On the way to the cloud . . .
  17. 17. On the way to the cloud . . . (or NoOps, depending on definitions)
  18. 18. Some As-Is #s • 27m+ subscribers • 10,000s of systems • 100s of engineers, apps • ~250 test deployments/day* • ~70 production deployments/day* (* Sample based on this week’s activities)
  19. 19. Deploying Code @ Netflix
  20. 20. A common graph @ Netflix
  21. 21. A common graph @ Netflix • Lots of watching in prime time
  22. 22. A common graph @ Netflix • Lots of watching in prime time • Not as much in early morning
  23. 23. A common graph @ Netflix • Lots of watching in prime time • Not as much in early morning • Old way - pay and provision for peak, 24/7/365
  24. 24. A common graph @ Netflix • Lots of watching in prime time • Not as much in early morning • Old way - pay and provision for peak, 24/7/365 • Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
  25. 25. Solution: Load-Based Autoscaling
  26. 26. Autoscaling
  27. 27. Autoscaling • Goals:
  28. 28. Autoscaling • Goals: # of systems matches load requirements
  29. 29. Autoscaling • Goals: # of systems matches load requirements; load per server is constant
  30. 30. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling)
  31. 31. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling) • Results:
  32. 32. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling) • Results: continuously adding/removing nodes
  33. 33. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling) • Results: continuously adding/removing nodes; new nodes must mirror existing
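The autoscaling goals above can be sketched as a simple sizing rule: choose the node count so that load per server stays at or below a fixed target. This is an illustration only, not Netflix's actual scaling policy; `desired_nodes`, `target_per_node`, and the min/max bounds are hypothetical names and values.

```python
import math


def desired_nodes(total_load: float, target_per_node: float,
                  min_nodes: int = 1, max_nodes: int = 1000) -> int:
    """Return the node count that keeps per-node load at or below the target."""
    if target_per_node <= 0:
        raise ValueError("target_per_node must be positive")
    needed = math.ceil(total_load / target_per_node)
    # Clamp to the allowed cluster size (bounds are assumptions for this sketch).
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical requests-per-second samples: early morning, daytime, prime time.
for rps in (1_200, 9_500, 30_000):
    print(rps, desired_nodes(rps, target_per_node=500))
```

Because the count is recomputed from load alone, nodes are added and removed continuously, which is why every new node has to mirror the existing ones.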
  34. 34. Every change requires a new cluster push (not an incremental change to existing systems)
  35. 35. Deploying must be easy (it is)
  36. 36. Netflix Deployment Pipeline
  37. 37. Netflix Deployment Pipeline: Perforce/Git (code change, config change)
  38. 38. Netflix Deployment Pipeline: Perforce/Git (code change, config change) -> YUM (RPM file with app-specific bits)
  39. 39. Netflix Deployment Pipeline: Perforce/Git (code change, config change) -> YUM (RPM file with app-specific bits) -> Bakery (base image + RPM)
  40. 40. Netflix Deployment Pipeline: Perforce/Git (code change, config change) -> YUM (RPM file with app-specific bits) -> Bakery (base image + RPM) -> AMI (VM template ready to launch)
  41. 41. Netflix Deployment Pipeline: Perforce/Git (code change, config change) -> YUM (RPM file with app-specific bits) -> Bakery (base image + RPM) -> AMI (VM template ready to launch) -> ASG (cluster config, running systems)
  42. 42. Netflix Deployment Pipeline: Perforce/Git (code change, config change) -> YUM (RPM file with app-specific bits) -> Bakery (base image + RPM) -> AMI (VM template ready to launch) -> ASG (cluster config, running systems)
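The pipeline stages can be modeled as immutable values to show why a change always produces a new cluster instead of modifying a running one. This is a toy model under stated assumptions: the class names, the `bake` and `push` helpers, the base image label, and the build numbers are all hypothetical, and the `-vNNN` naming simply follows the ASG names quoted later in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rpm:
    app: str
    build: int


@dataclass(frozen=True)
class Ami:
    base_image: str
    rpm: Rpm


@dataclass(frozen=True)
class Asg:
    name: str
    ami: Ami


def bake(base_image: str, rpm: Rpm) -> Ami:
    """Bakery step: combine the base image with the app-specific RPM."""
    return Ami(base_image=base_image, rpm=rpm)


def push(app: str, version: int, ami: Ami) -> Asg:
    """Each push creates a new versioned cluster; nothing is patched in place."""
    return Asg(name=f"{app}-v{version:03d}", ami=ami)


old = push("testapp", 1, bake("base-example", Rpm("testapp", 41)))
new = push("testapp", 2, bake("base-example", Rpm("testapp", 42)))
print(old.name, "->", new.name)
```

In this model a "rollback" is just routing traffic back to `old`, which was never modified.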
  43. 43. Operational Impact • No changes to running systems • No systems management infrastructure • Fewer logins to prod • No snowflakes • Trivial “rollback”
  44. 44. Security Impact • Need to think differently on: vulnerability management, patch management, user activity monitoring, file integrity monitoring, forensic investigations
  45. 45. Org, architecture, and deployment are different. What about security?
  46. 46. We’ve adapted too. Some principles we’ve found useful.
  47. 47. Integrate
  48. 48. Base AMI Security • AMI = Amazon Machine Image • @ Netflix, all apps are based on “Base AMI”, and new pushes pick up the latest • Concentrating testing and improvements here provides greatest impact • Average age of running instance: 24 days* • 60% of instances less than 1 week old* (* Based on one-time sampling (yesterday))
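The two sampled figures on this slide (average instance age, share of instances under a week old) are easy to compute from launch timestamps. The sketch below assumes you already have a list of launch times from your cloud inventory; `age_stats` and the sample data are hypothetical and do not reproduce the slide's numbers.

```python
from datetime import datetime, timedelta, timezone


def age_stats(launch_times, now):
    """Return (average age in days, fraction of instances younger than 7 days)."""
    if not launch_times:
        raise ValueError("no instances sampled")
    ages = [(now - t).total_seconds() / 86400 for t in launch_times]
    average = sum(ages) / len(ages)
    young = sum(1 for a in ages if a < 7) / len(ages)
    return average, young


# Made-up sample: four instances launched 1, 3, 5 and 60 days ago.
sample_now = datetime(2012, 10, 24, tzinfo=timezone.utc)
sample = [sample_now - timedelta(days=d) for d in (1, 3, 5, 60)]
print(age_stats(sample, sample_now))
```

A young fleet means fixes made in the base AMI reach most running instances quickly, simply through normal pushes.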
  49. 49. Base AMI Testing • The base AMI is managed like other packages, via P4, Jenkins, etc. • We watch the base AMI’s SCM directory & kick off testing when it changes • Launch an instance of the AMI, perform vuln scan and other checks
  50. 50. Base AMI Testing • The base AMI is managed like other packages, via P4, Jenkins, etc. • We watch the base AMI’s SCM directory & kick off testing when it changes • Launch an instance of the AMI, perform vuln scan and other checks
  51. 51. Base AMI Testing • The base AMI is managed like other packages, via P4, Jenkins, etc. • We watch the base AMI’s SCM directory & kick off testing when it changes • Launch an instance of the AMI, perform vuln scan and other checks • Example alert: SCAN COMPLETED ALERT. Site name: AMI1. Stopped by: N/A. Total Scan Time: 4 minutes 46 seconds. Critical Vulnerabilities: 5. Severe Vulnerabilities: 4. Moderate Vulnerabilities: 4.
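A scan alert like the one above can be turned into a pass/fail gate for the base AMI build. This sketch parses the severity counts from the alert text shown on the slide; the gate thresholds (zero critical, zero severe) are an assumption for illustration, not a stated Netflix policy.

```python
import re

# Severity counts as they appear in the alert text on the slide.
ALERT = ("Critical Vulnerabilities: 5 Severe Vulnerabilities: 4 "
         "Moderate Vulnerabilities: 4")


def parse_counts(text):
    """Extract {severity: count} from lines like 'Critical Vulnerabilities: 5'."""
    pairs = re.findall(r"(\w+) Vulnerabilities:\s*(\d+)", text)
    return {severity.lower(): int(count) for severity, count in pairs}


def passes_gate(counts, max_allowed=None):
    """Return True if every gated severity is within its limit (assumed limits)."""
    if max_allowed is None:
        max_allowed = {"critical": 0, "severe": 0}
    return all(counts.get(sev, 0) <= limit for sev, limit in max_allowed.items())


counts = parse_counts(ALERT)
print(counts, passes_gate(counts))
```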
  52. 52. Security Packaging • All security tools use the same toolchain as the rest of engineering (P4/Git, Jenkins, etc.)
  53. 53. • From the RPM spec file of a webserver: Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
  54. 54. • Pulls in the following RPMs: • Host hardening package • WAF agent • OSSEC (HIDS agent) • CloudPassage (config assessment, FW, etc.)
  55. 55. Static Analysis• Available self-service through build environment (FindBugs, PMD)• Jenkins (CI) plugin to display graphs and support drill through to results
  56. 56. MAN Integration
  57. 57. Many systems involved, standardization is important
  58. 58. Central Alerting Gateway • A single place to generate alerts • Python, Java libraries (or json post) to easily alert on events of interest • Ties in to PagerDuty notification system • Allows for stateful alerting and some response • A prerequisite that our tools will leverage
  59. 59. CAG Example
      import CORE.Gateway
      gw = CORE.Gateway.Gateway()
      gw.send("testcluster", "normal", "Something went wrong")
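To illustrate what "stateful alerting" could mean in practice, here is an in-memory stand-in that accepts the same three arguments as the `send` call above and suppresses repeats of an alert that is still open. `StubGateway` and its `resolve` method are hypothetical; this is not the real `CORE.Gateway` and says nothing about how the actual gateway tracks state.

```python
class StubGateway:
    """Hypothetical in-memory stand-in; NOT the real CORE.Gateway."""

    def __init__(self):
        self.open_alerts = {}   # (cluster, message) -> times seen while open
        self.sent = []          # alerts that would actually be forwarded

    def send(self, cluster, severity, message):
        """Forward the first occurrence; count and suppress repeats."""
        key = (cluster, message)
        if key in self.open_alerts:
            self.open_alerts[key] += 1
            return False
        self.open_alerts[key] = 1
        self.sent.append({"cluster": cluster, "severity": severity,
                          "message": message})
        return True

    def resolve(self, cluster, message):
        """Close an alert and return how many times it was seen."""
        return self.open_alerts.pop((cluster, message), 0)


gw = StubGateway()
gw.send("testcluster", "normal", "Something went wrong")
gw.send("testcluster", "normal", "Something went wrong")  # suppressed repeat
print(len(gw.sent), gw.resolve("testcluster", "Something went wrong"))
```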
  60. 60. Chronos • Timeline system (API and UI) with Java/Python libraries, or json post • Track config changes, deployments, etc. • Security tools also leverage for tracking and analysis
  61. 61. Chronos Security Examples • What IP addresses have been blacklisted by the WAF in the last few weeks? GET /api/v1/event?timelines=type:blacklist&start=20121012000000000 • Which security groups have changed today? GET /api/v1/event?timelines=type:securitygroup&start=20121024000000000
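The two GET requests above share one shape, so the query string can be built from an event type and a start time. This sketch only constructs the path shown on the slide; reading the 17-digit `start` value as `YYYYMMDDHHMMSS` plus milliseconds is an assumption inferred from the examples, and `chronos_start` and `event_query` are hypothetical helper names.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_start(ts: datetime) -> str:
    """Format a timestamp as 17 digits: YYYYMMDDHHMMSS plus milliseconds (assumed)."""
    return ts.strftime("%Y%m%d%H%M%S") + f"{ts.microsecond // 1000:03d}"


def event_query(event_type: str, start: datetime) -> str:
    """Build the request path in the same shape as the slide's examples."""
    params = {"timelines": f"type:{event_type}", "start": chronos_start(start)}
    # Keep ':' unescaped so the result matches the slide's examples.
    return "/api/v1/event?" + urlencode(params, safe=":")


print(event_query("blacklist", datetime(2012, 10, 12)))
print(event_query("securitygroup", datetime(2012, 10, 24)))
```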
  62. 62. Make the right way easy (and secure)
  63. 63. Cryptex • Many uses of crypto in web/distributed systems: encrypt/decrypt (cookies, data, etc.); sign/verify (URLs, data, etc.) • Known as an area where developers should not DIY
  64. 64. • Multi-layer crypto system (HSM basis, scale out layer) • Easy for developers to use • Key management handled transparently • Access control and auditable operations
      ICipherContext cipherContext = CryptexClientFactory.getCipherContext(KeySet.testkey);
      // encryption
      String cipherText = cipherContext.encrypt("NETFLIX");
      // decryption
      String plainText = cipherContext.decrypt(cipherText);
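The slides show encrypt/decrypt; the sign/verify use case mentioned for URLs can be illustrated with a keyed hash. This sketch uses Python's standard `hmac` module and is not the Cryptex API. The hard-coded key and the sample URL are for demonstration only; in a system like the one described, key management would sit behind the service, not in application code.

```python
import hashlib
import hmac


def sign(key: bytes, message: str) -> str:
    """Return a hex HMAC-SHA256 signature of the message."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(key: bytes, message: str, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), signature)


demo_key = b"demo-only-key"        # assumption: real keys never live in code
url = "/watch?id=12345"            # hypothetical URL to protect
sig = sign(demo_key, url)
print(verify(demo_key, url, sig), verify(demo_key, url + "6", sig))
```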
  65. 65. Cloud SSO• Authenticated access to dashboards, admin apps in the cloud is problematic • No datacenter access, no LDAP, AD
  66. 66. Cloud SSO• Solution - leverage OneLogin SaaS SSO option (SAML) used by IT for enterprise apps• Built filter that integrates with our platform web server to make SSO/authentication trivial
  67. 67. Trust, but verify
  68. 68. Culture of ‘freedom and responsibility’ precludes traditional centralized, command and control approach
  69. 69. Security Monkey • Cloud APIs make verification and analysis of configuration & running state simpler • Security Monkey created as the framework for this analysis • Includes: cert checking, firewall analysis, IAM entity analysis, limit warnings
  70. 70. Security Monkey • From: Security Monkey • Date: Wed, 24 Oct 2012 17:08:18 +0000 • To: Security Alerts • Subject: prod Changes Detected • Table of Contents: Security Groups: Changed Security Group <sgname> (eu-west-1 / prod) <#Security Group/<sgname> (eu-west-1 / prod)>
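A "Changes Detected" report like the one above boils down to comparing two snapshots of configuration. The sketch below diffs security group rules between two points in time; the snapshot format (group name mapped to a set of `(protocol, port, source)` tuples), the group names, and `diff_rules` are assumptions for illustration, not Security Monkey's actual data model.

```python
def diff_rules(previous, current):
    """Compare snapshots of {group_name: set of (protocol, port, source)}."""
    changes = {}
    for name in sorted(previous.keys() | current.keys()):
        before = previous.get(name, set())
        after = current.get(name, set())
        added, removed = after - before, before - after
        if added or removed:
            changes[name] = {"added": sorted(added), "removed": sorted(removed)}
    return changes


# Hypothetical snapshots taken a day apart.
yesterday = {"web": {("tcp", 80, "elb-group")}}
today = {"web": {("tcp", 80, "elb-group"), ("tcp", 22, "bastion-group")}}
print(diff_rules(yesterday, today))
```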
  71. 71. Exploit Monkey • Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans • Example: On 10/23/12 12:35 PM, Exploit Monkey wrote: “I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002. I’m starting a vulnerability scan against test app from these private/public IPs: 10.29.24.174”
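The trigger described above, a cluster's active ASG name changing, can be detected by comparing two observations of cluster state. This is a sketch under assumptions: `changed_asgs` is a hypothetical helper, and the mapping of cluster name to active ASG name would come from whatever inventory source you have.

```python
def changed_asgs(previous, current):
    """Return {cluster: (old ASG name, new ASG name)} for clusters that changed.

    Clusters seen for the first time are reported with an old name of None.
    """
    return {
        cluster: (previous.get(cluster), name)
        for cluster, name in current.items()
        if previous.get(cluster) != name
    }


before = {"testapp-live": "testapp-live-v001"}
after = {"testapp-live": "testapp-live-v002"}
for cluster, (old, new) in changed_asgs(before, after).items():
    # A real tool would start a dynamic scan here; this sketch only reports.
    print(f"{cluster}: {old} -> {new}, scan candidate")
```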
  72. 72. ELB Checker (gauntlt) • AWS’ Elastic Load Balancer (ELB) provides cross-datacenter traffic balancing, but no security controls (if your cluster is attached to an ELB, it is available to the Internet) • Engineers may misunderstand use cases for ELBs, security features, and/or other measures that can be used to protect ELB-fronted clusters
  73. 73. Solution: gauntlt Testing 1. Launch gauntlt test runner instance, loaded with “master list” of ELBs and expected state 2. Determine “target list” of current ELBs to evaluate 3. Generate per-ELB listener gauntlt attack files 4. Execute attacks 5. Alert on failures and new ELBs 6. Triage findings and update ELB master list
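The comparison behind steps 2 and 5 can be sketched without gauntlt itself: check the current ELBs against the master list, flag ELBs that are not on it, and flag listeners that are open but not expected. The data format (ELB name mapped to a set of listener ports), the ELB names, and `evaluate_elbs` are hypothetical; generating and running the actual attack files (steps 3 and 4) is out of scope here.

```python
def evaluate_elbs(master, current):
    """Compare {elb_name: set of listener ports} against the master list.

    Returns (new_elbs, failures), where failures maps a known ELB to the
    listener ports that are open but not in its expected state.
    """
    new_elbs = sorted(current.keys() - master.keys())
    failures = {}
    for name in sorted(current.keys() & master.keys()):
        unexpected = current[name] - master[name]
        if unexpected:
            failures[name] = sorted(unexpected)
    return new_elbs, failures


master_list = {"api-elb": {443}, "web-elb": {80, 443}}
target_list = {"api-elb": {443, 8080}, "web-elb": {80, 443}, "debug-elb": {22}}
print(evaluate_elbs(master_list, target_list))
```

Anything returned here would feed step 6: triage the finding, then either fix the ELB or update the master list.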
  74. 74. Self-service, with exceptions
  75. 75. AWS Security Groups • Asgard cloud orchestration tool allows developers to configure their own firewall rules • Limited to same-account groups, no IP-based rules • Handles 95% of requirements, JIRAs for additional changes, and Security Monkey to keep an eye on things
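The self-service limits above (same-account groups only, no IP-based rules) can be expressed as a small validation check. The rule format, the account IDs, and `is_self_service_rule` are assumptions for this sketch; this is not how Asgard implements the restriction.

```python
def is_self_service_rule(rule, account_id):
    """Allow only rules whose source is a security group in the same account.

    `rule` is a dict with either 'cidr' (IP-based) or
    'source_group' plus 'source_account' (group-based).
    """
    if rule.get("cidr"):
        return False  # IP-based rules go through the exception process
    return (rule.get("source_group") is not None
            and rule.get("source_account") == account_id)


ACCOUNT = "111111111111"  # hypothetical account ID
rules = [
    {"port": 7001, "source_group": "api", "source_account": ACCOUNT},
    {"port": 22, "cidr": "0.0.0.0/0"},
    {"port": 7001, "source_group": "partner", "source_account": "222222222222"},
]
print([is_self_service_rule(r, ACCOUNT) for r in rules])
```

Rules that fail the check correspond to the cases the slide routes through JIRA.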
  76. 76. Takeaways • Netflix runs a large, dynamic service in AWS • Good guidance + specific context can help jumpstart a pragmatic security program • Newer concepts like cloud & DevOps need updated approach to security • Don’t swim upstream - integrate and collaborate with your engineering partners
  77. 77. Netflix References • http://netflix.github.com/ • http://techblog.netflix.com/ • http://slideshare.net/netflix
  78. 78. Other References • http://www.webpronews.com/netflix-outage-angers-customers-2008-08 • http://www.pcmag.com/article2/0,2817,2395372,00.asp • http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php • http://bsimm.com/online/ • http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884 • http://www.gauntlt.org
  79. 79. Questions? chan@netflix.com
