Cloud Application Security: Lessons Learned

Presented at the Houston OWASP meeting - 2/21/2013.

Published in: Technology

  1. 1. Cloud Application Security: Lessons Learned
      Houston OWASP – 2/21/2013
      Jason Chan - chan@netflix.com
  2. 2. Netflix, Inc.
      “Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series . . .”
      Source: http://ir.netflix.com
  3. 3. Me
      Director of Engineering @ Netflix
      Responsible for:
        • Cloud app, product, infrastructure, ops security
      Previously:
        • Led security team @ VMware
        • Earlier, primarily security consulting at @stake, iSEC Partners
  4. 4. AppSec Challenges
  5. 5. Lots of Good Advice
        • BSIMM
        • Microsoft SDL
        • SAFECode
  6. 6. But, what works? Forrester Consulting, 12/10
  7. 7. Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?
  8. 8. Engineering @ Netflix
  9. 9. Availability and the Move to Streaming
  10. 10. “Undifferentiated Heavy Lifting”
  11. 11. Netflix Culture
      “may well be the most important document ever to come out of the Valley.”
      Sheryl Sandberg, Facebook COO
  12. 12. Scale and Usage Curve
  13. 13. Netflix is now ~99% in the cloud
  14. 14. On the way to the cloud . . . (architecture)
  15. 15. On the way to the cloud . . . (organization) (or NoOps, depending on definitions)
  16. 16. Some As-Is #s
        • 33m+ subscribers
        • 10,000s of systems
        • 100s of engineers, apps
        • ~250 test deployments/day **
        • ~70 production deployments/day **
      ** Sample based on one week’s activities
  17. 17. Deploying Code at Netflix
  18. 18. A common graph @ Netflix
      Weekend afternoon ramp-up
      Lots of watching in prime time
      Not as much in early morning
      Old way - pay and provision for peak, 24/7/365
      Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
  19. 19. Solution: Load-Based Autoscaling
  20. 20. Autoscaling Goals:  # of systems matches load requirements  Load per server is constant  Happens without intervention (the „auto‟ in autoscaling) Results:  Clusters continuously add & remove nodes  New nodes must mirror existing
  21. 21. Every change requires a new cluster push(not an incremental change to existing systems)
  22. 22. Deploying code must be easy (it is)
  23. 23. Netflix Deployment Pipeline
      [Diagram: Perforce/Git (code change, config change) → YUM (RPM with app-specific bits) → Bakery (base image + RPM) → AMI (VM template ready to launch) → ASG (cluster config, running systems)]
  24. 24. Operational Impact
      No changes to running systems
      No systems mgmt infrastructure (Puppet, Chef, etc.)
      Fewer logins to prod
      No snowflakes
      Trivial “rollback”
  25. 25. Security Impact
      Need to think differently on:
        • Vulnerability management
        • Patch management
        • User activity monitoring
        • File integrity monitoring
        • Forensic investigations
  26. 26. Architecture, organization, deployment are all different. What about security?
  27. 27. We’ve adapted too. Some principles we’ve found useful.
  28. 28. Cloud Application Security: What We Emphasize
  29. 29. Points of Emphasis: Integrate; Make the right way easy; Self-service, with exceptions; Trust, but verify
      Integrate:
        • Two contexts:
          1. Integration with your engineering ecosystem
          2. Integration of your security controls
        • Organization
        • SCM, build and release
        • Monitoring and alerting
  30. 30. Integration: Base AMI Testing
      Base AMI – VM/instance template used for all cloud systems
        • Average instance age = ~24 days (one-time sample)
      The base AMI is managed like other packages, via P4, Jenkins, etc.
      We watch the SCM directory & kick off testing when it changes
      Launch an instance of the AMI, perform vuln scan and other checks
      SCAN COMPLETED ALERT
      Site name: AMI1
      Stopped by: N/A
      Total Scan Time: 4 minutes 46 seconds
      Critical Vulnerabilities: 5
      Severe Vulnerabilities: 4
      Moderate Vulnerabilities: 4
  31. 31. Integration: Control Packaging and Installation
        • From the RPM spec file of a webserver:
      Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
      Pulls in the following RPMs:
        • HIDS agent
        • Config assessment/firewall agent
        • Host hardening package
        • WAF
  32. 32. Integration: Timeline (Chronos)
      What IP addresses have been blacklisted by the WAF in the last few weeks?
      GET /api/v1/event?timelines=type:blacklist&start=20130125000000000
      Which security groups have changed today?
      GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
  33. 33. Integration: Static Analysis
        • Available self-service through build environment
        • FindBugs, PMD
        • Jenkins plugin to display graphs and support drill through to results
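The scan alert on slide 30 is structured enough to drive an automated pass/fail decision on a new base AMI. The following is a minimal sketch under assumptions: the function names and the thresholds are hypothetical, and only the three severity lines shown on the slide are parsed.

```python
import re


def parse_scan_summary(text):
    """Extract severity counts from a scan alert like the one on slide 30."""
    counts = {}
    pattern = r"(Critical|Severe|Moderate) Vulnerabilities:\s*(\d+)"
    for severity, count in re.findall(pattern, text):
        counts[severity.lower()] = int(count)
    return counts


def ami_passes(counts, max_critical=0, max_severe=0):
    """Hypothetical gate: thresholds are illustrative, not from the talk."""
    return (counts.get("critical", 0) <= max_critical
            and counts.get("severe", 0) <= max_severe)


alert = (
    "SCAN COMPLETED ALERT Site name: AMI1 Stopped by: N/A "
    "Total Scan Time: 4 minutes 46 seconds "
    "Critical Vulnerabilities: 5 Severe Vulnerabilities: 4 "
    "Moderate Vulnerabilities: 4"
)
summary = parse_scan_summary(alert)
```

With the sample alert, `summary` holds the three counts from the slide, and `ami_passes(summary)` is false under the default zero-tolerance thresholds.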
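The two Chronos queries on slide 32 differ only in event type and start time, so they are easy to generate. This sketch only builds the URL and makes no network call. The host name is a placeholder, and reading the `start` value as `YYYYMMDDHHMMSS` plus three millisecond digits is an assumption inferred from the 17-digit examples on the slide.

```python
from datetime import datetime
from urllib.parse import urlencode

BASE_URL = "https://chronos.example.com"  # hypothetical host, not from the talk


def chronos_timestamp(dt):
    """Format a datetime as 17 digits: YYYYMMDDHHMMSS + milliseconds (assumed)."""
    return dt.strftime("%Y%m%d%H%M%S") + f"{dt.microsecond // 1000:03d}"


def event_query_url(event_type, start):
    """Build a GET URL in the shape shown on the slide."""
    params = {
        "timelines": f"type:{event_type}",
        "start": chronos_timestamp(start),
    }
    # Keep ':' unescaped so the query matches the slide's form.
    return f"{BASE_URL}/api/v1/event?{urlencode(params, safe=':')}"
```

For example, `event_query_url("securitygroup", datetime(2013, 2, 6))` yields the path and query string of the second request on the slide.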
  34. 34. Integration: Static Analysis
  35. 35. Integration: Alerting (Central Alerting Gateway)
      Single place to generate and deliver alerts
      Python, Java libraries (or JSON post)
      Ties in to PagerDuty notification/escalation system
      Permits stateful alerting and some response
      A prerequisite that our security tools will leverage
  36. 36. CAG Example
      import CORE.Gateway
      gw = CORE.Gateway.Gateway()
      # testcluster is a defined app with associated escalation
      # schedule in PagerDuty
      gw.send("testcluster", "normal", "Something went wrong")
  37. 37. Points of Emphasis: Integrate; Make the right way easy; Self-service, with exceptions; Trust, but verify
      Make the right way easy:
        • Developers are lazy
  38. 38. Making it Easy: Cryptex
      Crypto: DDIY (“Don’t Do It Yourself”)
      Many uses of crypto in web/distributed systems:
        • Encrypt/decrypt (cookies, data, etc.)
        • Sign/verify (URLs, data, etc.)
      Netflix also uses it heavily for device activation, DRM playback, etc.
  39. 39. Making it Easy: Cryptex
      Multi-layer crypto system (HSM basis, scale out layer)
        • Easy to use
        • Key management handled transparently
        • Access control and auditable operations
  40. 40. Making it Easy: Cloud-Based SSO
      In the AWS cloud, access to data center services is problematic
        • Examples: AD, LDAP, DNS
      But, many cloud-based systems require authN, authZ
        • Examples: Dashboards, admin UIs
      Asking developers to securely handle/accept credentials is also problematic
  41. 41. Making it Easy: Cloud-Based SSO
      Solution: Leverage OneLogin SaaS SSO (SAML) used by IT for enterprise apps (e.g. Workday, Google Apps)
      Uses Active Directory credentials
      Provides a single & centralized login page
        • Developers don’t accept username & password directly
      Built filter for our base server to make SSO/authN trivial
  42. 42. Points of Emphasis: Integrate; Make the right way easy; Self-service, with exceptions; Trust, but verify
      Self-service, with exceptions:
        • Self-service is perhaps the most transformative cloud characteristic
        • Failing to adopt this for security controls will lead to friction
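To make the "sign/verify (URLs, data, etc.)" use case from slide 38 concrete, here is a generic HMAC-SHA256 sketch using only the Python standard library. It is not the Cryptex API, which the slides do not show. The function names and the `sig` parameter are assumptions, and the key is passed in directly only for illustration; the point of a service like Cryptex is that application code never handles keys this way.

```python
import hashlib
import hmac


def sign_url(url, key):
    """Append an HMAC-SHA256 signature of the URL as a final 'sig' parameter."""
    sig = hmac.new(key, url.encode("utf-8"), hashlib.sha256).hexdigest()
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}sig={sig}"


def verify_url(signed_url, key):
    """Check that the trailing 'sig' parameter matches the rest of the URL."""
    prefix, marker, sig = signed_url.rpartition("sig=")
    if not marker or not prefix or prefix[-1] not in "?&":
        return False  # no signature parameter present
    original = prefix[:-1]  # drop the '?' or '&' that preceded 'sig='
    expected = hmac.new(key, original.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))
```

Even this small example has several places to go wrong (key storage, comparison timing, parameter parsing), which is the argument for DDIY: centralize these decisions once and give developers a simple call.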
  43. 43. Self-Service: Security Groups
      Asgard cloud orchestration tool allows developers to configure their own firewall rules
      Limited to same AWS account, no IP-based rules
  44. 44. Points of Emphasis: Integrate; Make the right way easy; Self-service, with exceptions; Trust, but verify
      Trust, but verify:
        • Culture precludes traditional “command and control” approach
        • Organizational desire for agile, DevOps, CI/CD blur traditional security engagement touchpoints
  45. 45. Trust but Verify: Security Monkey
      Cloud APIs make verification and analysis of configuration and running state simpler
      Security Monkey created as the framework for this analysis
        • Includes:
          • Certificate checking
          • Firewall analysis
          • IAM entity analysis
          • Limit warnings
          • Resource policy analysis
  46. 46. Trust but Verify: Security Monkey
      From: Security Monkey
      Date: Wed, 24 Oct 2012 17:08:18 +0000
      To: Security Alerts
      Subject: prod Changes Detected
      Table of Contents:
      Security Groups
      Changed Security Group
      <sgname> (eu-west-1 / prod) <#Security Group/<sgname> (eu-west-1 / prod)>
  47. 47. Trust but Verify: Exploit Monkey
        • AWS Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans
      On 10/23/12 12:35 PM, Exploit Monkey wrote:
      I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002. I’m starting a vulnerability scan against test app from these private/public IPs: 10.29.24.174
  48. 48. Trust but Verify: ELB Checker (gauntlt)
      AWS Elastic Load Balancer (ELB) provides cross-datacenter traffic balancing, but no security controls
        • If your cluster is attached to an ELB, it is available to the Internet
      Engineers may misunderstand:
        • ELB use cases (and alternatives)
        • Security features
        • Other measures used to protect ELB-fronted clusters
  49. 49. Trust but Verify: ELB Checker (gauntlt)
      1. Launch gauntlt test runner instance, loaded with “master list” of ELBs and expected state
      2. Determine “target list” of current ELBs to evaluate
      3. Generate per-ELB listener gauntlt attack files
      4. Execute attacks
      5. Alert on failures and new ELBs
      6. Triage findings and update master list
  50. 50. Takeaways
        • Netflix runs a large, dynamic service in AWS
        • Newer concepts like cloud & DevOps need an updated approach to application security
        • Specific context can help jumpstart a pragmatic and effective security program
        • Don’t swim upstream - integrate and collaborate with your engineering partners
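The two limits on slide 43 (same AWS account, no IP-based rules) are the "exceptions" in "self-service, with exceptions", and they can be expressed as a small policy check. This is a sketch under assumptions: the rule dictionary shape and the function name are hypothetical and are not Asgard's actual data model.

```python
def validate_rule(rule, own_account_id):
    """Return a list of policy violations for a proposed ingress rule.

    `rule` is a hypothetical dict with either a "source_group" entry
    ({"account_id": ..., "group_name": ...}) or a "cidr" entry.
    An empty list means the rule can be applied self-service.
    """
    problems = []
    if rule.get("cidr"):
        problems.append("IP-based (CIDR) sources are not self-service")
    source = rule.get("source_group")
    if source is None:
        if not rule.get("cidr"):
            problems.append("rule must name a source security group")
    elif source.get("account_id") != own_account_id:
        problems.append("source group belongs to a different AWS account")
    return problems
```

Rules that return violations are not necessarily forbidden; they are the cases that fall outside self-service and go to the security team for review.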
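A change report like the email on slide 46 can be produced by comparing two snapshots of security group configuration. The sketch below shows only that comparison and is not Security Monkey's code; the snapshot shape (group name mapped to a set of rule tuples) is an assumption, and fetching the snapshots from the cloud API is left out.

```python
def diff_security_groups(previous, current):
    """Compare two {group_name: set_of_rules} snapshots.

    Each rule is a hypothetical (protocol, port, source) tuple.
    Returns only the groups whose rules differ between snapshots.
    """
    changes = {}
    for name in sorted(set(previous) | set(current)):
        before = previous.get(name, set())
        after = current.get(name, set())
        added, removed = after - before, before - after
        if added or removed:
            changes[name] = {"added": sorted(added), "removed": sorted(removed)}
    return changes


yesterday = {"web": {("tcp", 80, "elb"), ("tcp", 22, "bastion")}}
today = {"web": {("tcp", 80, "elb")}, "cache": {("tcp", 11211, "web")}}
report = diff_security_groups(yesterday, today)
```

Here `report` lists the removed SSH rule on `web` and the newly created `cache` group, which is the raw material for a "Changes Detected" notification.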
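The trigger described on slide 47, an application's active ASG name changing, can be detected by comparing the last observed mapping with the current one. This is a minimal sketch with hypothetical names; how the ASG names are discovered and how the scanner is launched are outside what the slides show.

```python
def changed_asgs(last_seen, current):
    """Return (app, old_asg, new_asg) for each app whose active ASG name changed.

    Both arguments are hypothetical {app_name: asg_name} mappings. Apps seen
    for the first time are skipped here; a real tool might scan those too.
    """
    changes = []
    for app, new_asg in sorted(current.items()):
        old_asg = last_seen.get(app)
        if old_asg is not None and old_asg != new_asg:
            changes.append((app, old_asg, new_asg))
    return changes


def scan_notice(app, old_asg, new_asg, ips):
    """Compose a notification in the spirit of the email on the slide."""
    return (f"{app} has changed current ASG name from {old_asg} to {new_asg}. "
            f"Starting a vulnerability scan against: {', '.join(ips)}")
```

Because every change ships as a new cluster push, the ASG name is a reliable signal that new code or configuration is live and worth rescanning.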
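Steps 2 and 5 of the workflow above amount to comparing the current ELBs against the master list of expected state. The sketch below covers only that comparison and takes the observed state as plain data. It does not generate or run gauntlt attack files, and the mapping shape (ELB name to a set of listener ports) is an assumption.

```python
def evaluate_elbs(master, current):
    """Compare observed ELB listeners against the expected "master list".

    Both arguments are hypothetical {elb_name: set_of_listener_ports} maps.
    Returns (new_elbs, failures): ELBs missing from the master list, and
    known ELBs exposing ports the master list does not expect.
    """
    new_elbs = sorted(set(current) - set(master))
    failures = {}
    for name in sorted(set(current) & set(master)):
        unexpected = current[name] - master[name]
        if unexpected:
            failures[name] = sorted(unexpected)
    return new_elbs, failures


master_list = {"api-frontend": {80, 443}, "www": {443}}
observed = {"api-frontend": {80, 443, 8080}, "www": {443}, "debug-elb": {7001}}
new_elbs, failures = evaluate_elbs(master_list, observed)
```

Both outputs feed step 5 (alerting); after triage in step 6, accepted findings are written back into the master list so they stop alerting.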
  51. 51. Netflix References
      http://netflix.github.com
      http://techblog.netflix.com
      http://slideshare.net/netflix
  52. 52. Other References
      http://www.webpronews.com/netflix-outage-angers-customers-2008-08
      http://www.pcmag.com/article2/0,2817,2395372,00.asp
      http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php
      http://bsimm.com/online/
      http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884
      http://www.slideshare.net/reed2001/culture-1798664
      http://techcrunch.com/2013/01/31/read-what-facebooks-sandberg-calls-maybe-the-most-important-document-ever-to-come-out-of-the-valley/
      http://www.gauntlt.org
  53. 53. Questions? chan@netflix.com
