Real World Cloud Application Security

Transcript

  • 1. Real World Cloud Application Security: Lessons Learned Running Large Scale Systems in the Public Cloud • Jason Chan - chan@netflix.com
  • 2. Netflix, Inc. “With more than 27 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. is the world’s leading internet subscription service for enjoying movies and TV programs . . .” Source: http://ir.netflix.com
  • 3. Me • Cloud Security Architect @ Netflix • Responsible for: cloud app, product, infrastructure, ops security • Previously: led security team @ VMware; earlier, primarily security consulting at @stake, iSEC Partners
  • 4. AppSec Challenges
  • 6. Lots of Good Advice • BSIMM • Microsoft SDL • SAFECode
  • 7. But, what works? Forrester Consulting, 12/10
  • 8. Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?
  • 9. Netflix Engineering Characteristics
  • 10. Netflix in the Cloud - Why?
  • 12. Netflix in the Cloud - Why? “Undifferentiated heavy lifting”
  • 15. Netflix is now ~99% in Public Cloud
  • 16. On the way to the cloud . . .
  • 17. On the way to the cloud . . . (or NoOps, depending on definitions)
  • 18. Some As-Is #s • 27m+ subscribers • 10,000s of systems • 100s of engineers, apps • ~250 test deployments/day* • ~70 production deployments/day* (* Sample based on this week’s activities)
  • 19. Deploying Code @ Netflix
  • 20. A common graph @ Netflix
  • 21. A common graph @ Netflix • Lots of watching in prime time
  • 22. A common graph @ Netflix • Lots of watching in prime time • Not as much in early morning
  • 23. A common graph @ Netflix • Lots of watching in prime time • Not as much in early morning • Old way - pay and provision for peak, 24/7/365
  • 24. A common graph @ Netflix • Lots of watching in prime time • Not as much in early morning • Old way - pay and provision for peak, 24/7/365 • Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
  • 25. Solution: Load-Based Autoscaling
  • 26. Autoscaling
  • 27. Autoscaling • Goals:
  • 28. Autoscaling • Goals: # of systems matches load requirements
  • 29. Autoscaling • Goals: # of systems matches load requirements; load per server is constant
  • 30. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling)
  • 31. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling) • Results:
  • 32. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling) • Results: continuously adding/removing nodes
  • 33. Autoscaling • Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling) • Results: continuously adding/removing nodes; new nodes must mirror existing
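The first two autoscaling goals (system count tracks load, load per server stays constant) can be illustrated with a small calculation. This is only a sketch under assumptions: the function name, the requests-per-second figures, and the min/max bounds are hypothetical and do not come from the talk.

```python
import math


def desired_instance_count(total_load, target_load_per_server, min_size=1, max_size=1000):
    """Return how many servers keep per-server load near the target.

    All parameters are illustrative assumptions, not Netflix settings.
    """
    if target_load_per_server <= 0:
        raise ValueError("target_load_per_server must be positive")
    needed = math.ceil(total_load / target_load_per_server)
    # Clamp to the cluster's allowed size range.
    return max(min_size, min(max_size, needed))


# Hypothetical requests/second: quiet early morning, busy prime time.
for hour, load in [(4, 1200), (12, 5200), (20, 18000)]:
    print(hour, desired_instance_count(load, target_load_per_server=500))
```

Because the count is recomputed from load rather than fixed at peak, nodes are added and removed continuously, which is why new nodes must mirror existing ones.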
  • 34. Every change requires a new cluster push (not an incremental change to existing systems)
  • 35. Deploying must be easy (it is)
  • 36. Netflix Deployment Pipeline
  • 37. Netflix Deployment Pipeline • Perforce/Git (code change, config change)
  • 38. Netflix Deployment Pipeline • Perforce/Git (code change, config change) → YUM (RPM file with app-specific bits)
  • 39. Netflix Deployment Pipeline • Perforce/Git (code change, config change) → YUM (RPM file with app-specific bits) → Bakery (base image + RPM)
  • 40. Netflix Deployment Pipeline • Perforce/Git (code change, config change) → YUM (RPM file with app-specific bits) → Bakery (base image + RPM) → AMI (VM template ready to launch)
  • 41. Netflix Deployment Pipeline • Perforce/Git (code change, config change) → YUM (RPM file with app-specific bits) → Bakery (base image + RPM) → AMI (VM template ready to launch) → ASG (cluster config, running systems)
  • 42. Netflix Deployment Pipeline • Perforce/Git (code change, config change) → YUM (RPM file with app-specific bits) → Bakery (base image + RPM) → AMI (VM template ready to launch) → ASG (cluster config, running systems)
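The pipeline stages can be modelled as a rough sketch in which each push produces a new immutable image and a new cluster instead of modifying running systems. The class and function names are hypothetical. The `-vNNN` suffix follows the ASG names quoted later in the Exploit Monkey slide, and treating it as a three-digit counter is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ami:
    """A launchable VM template: the base image plus one application RPM."""
    base_image: str
    rpm: str


def bake(base_image: str, rpm: str) -> Ami:
    """Bakery step (illustrative): combine the base image with the app RPM."""
    return Ami(base_image=base_image, rpm=rpm)


def next_asg_name(current: str) -> str:
    """Name the ASG for a new cluster push by bumping the version suffix."""
    match = re.fullmatch(r"(.+)-v(\d{3})", current)
    if match is None:
        # No version suffix yet: start numbering at v001 (assumed convention).
        return f"{current}-v001"
    return f"{match.group(1)}-v{int(match.group(2)) + 1:03d}"


ami = bake("base-ami-example", "testapp-1.0.rpm")
print(ami, next_asg_name("testapp-live-v001"))
```

Frozen dataclasses mirror the "no changes to running systems" idea: a change means baking a new `Ami` and launching under a new ASG name.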
  • 43. Operational Impact • No changes to running systems • No systems management infrastructure • Fewer logins to prod • No snowflakes • Trivial “rollback”
  • 44. Security Impact • Need to think differently on: vulnerability management, patch management, user activity monitoring, file integrity monitoring, forensic investigations
  • 45. Org, architecture, and deployment are different. What about security?
  • 46. We’ve adapted too. Some principles we’ve found useful.
  • 47. Integrate
  • 48. Base AMI Security • AMI = Amazon Machine Image • @ Netflix, all apps are based on “Base AMI”, and new pushes pick up the latest • Concentrating testing and improvements here provides greatest impact • Average age of running instance: 24 days* • 60% of instances less than 1 week old* (* Based on one-time sampling (yesterday))
  • 49. Base AMI Testing • The base AMI is managed like other packages, via P4, Jenkins, etc. • We watch the base AMI’s SCM directory & kick off testing when it changes • Launch an instance of the AMI, perform vuln scan and other checks
  • 51. Base AMI Testing • The base AMI is managed like other packages, via P4, Jenkins, etc. • We watch the base AMI’s SCM directory & kick off testing when it changes • Launch an instance of the AMI, perform vuln scan and other checks
      SCAN COMPLETED ALERT
      Site name: AMI1
      Stopped by: N/A
      Total Scan Time: 4 minutes 46 seconds
      Critical Vulnerabilities: 5
      Severe Vulnerabilities: 4
      Moderate Vulnerabilities: 4
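One way a scan alert like the one above could feed an automated check is to parse the severity counts and apply a threshold. The sketch below assumes the alert arrives as plain text in the layout shown on the slide. The gate itself (`ami_passes` and its `max_critical` threshold) is hypothetical and not something the talk describes.

```python
import re

ALERT = """SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4"""


def vulnerability_counts(alert_text):
    """Extract '<Severity> Vulnerabilities: <n>' lines into a dict."""
    pattern = re.compile(r"^(\w+) Vulnerabilities:\s*(\d+)", re.MULTILINE)
    return {severity.lower(): int(n) for severity, n in pattern.findall(alert_text)}


def ami_passes(counts, max_critical=0):
    """Hypothetical gate: reject the AMI if critical findings exceed the limit."""
    return counts.get("critical", 0) <= max_critical


counts = vulnerability_counts(ALERT)
print(counts, ami_passes(counts))
```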
  • 52. Security Packaging • All security tools use the same toolchain as the rest of engineering (P4/Git, Jenkins, etc.)
  • 53. • From the RPM spec file of a webserver: Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
  • 54. • Pulls in the following RPMs: • Host hardening package • WAF agent • OSSEC (HIDS agent) • CloudPassage (config assessment, FW, etc.)
  • 55. Static Analysis • Available self-service through build environment (FindBugs, PMD) • Jenkins (CI) plugin to display graphs and support drill through to results
  • 56. MAN Integration
  • 57. Many systems involved, standardization is important
  • 58. Central Alerting Gateway • A single place to generate alerts • Python, Java libraries (or json post) to easily alert on events of interest • Ties in to PagerDuty notification system • Allows for stateful alerting and some response • A prerequisite that our tools will leverage
  • 59. CAG Example
      import CORE.Gateway
      gw = CORE.Gateway.Gateway()
      gw.send("testcluster", "normal", "Something went wrong")
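The gateway slide also mentions a JSON post as an alternative to the client libraries. The talk does not show that schema, so the sketch below only illustrates the idea: the field names `cluster`, `severity`, and `message` are assumptions that mirror the three arguments of `gw.send` above, and no endpoint is implied.

```python
import json


def build_alert_body(cluster: str, severity: str, message: str) -> str:
    """Serialize an alert as JSON. Field names are illustrative, not the real schema."""
    if not cluster or not message:
        raise ValueError("cluster and message are required")
    payload = {"cluster": cluster, "severity": severity, "message": message}
    return json.dumps(payload, sort_keys=True)


print(build_alert_body("testcluster", "normal", "Something went wrong"))
```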
  • 60. Chronos • Timeline system (API and UI) with Java/Python libraries, or json post • Track config changes, deployments, etc. • Security tools also leverage for tracking and analysis
  • 61. Chronos Security Examples • What IP addresses have been blacklisted by the WAF in the last few weeks? GET /api/v1/event?timelines=type:blacklist&start=20121012000000000 • Which security groups have changed today? GET /api/v1/event?timelines=type:securitygroup&start=20121024000000000
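The query strings above can be assembled programmatically. In this sketch the path and the `timelines`/`start` parameters are taken from the slide. Reading the 17-digit `start` value as `YYYYMMDDHHMMSS` plus milliseconds is an assumption, and the function names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_timestamp(moment: datetime) -> str:
    """Format as YYYYMMDDHHMMSS plus milliseconds (assumed meaning of the 17 digits)."""
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def event_query(event_type: str, start: datetime) -> str:
    """Build the request path for events of one type since `start`."""
    params = {"timelines": f"type:{event_type}", "start": chronos_timestamp(start)}
    # Keep ':' unescaped so the result matches the form shown on the slide.
    return "/api/v1/event?" + urlencode(params, safe=":")


print(event_query("blacklist", datetime(2012, 10, 12)))
print(event_query("securitygroup", datetime(2012, 10, 24)))
```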
  • 62. Make the right way easy (and secure)
  • 63. Cryptex • Many uses of crypto in web/distributed systems: encrypt/decrypt (cookies, data, etc.), sign/verify (URLs, data, etc.) • Known as an area where developers should not DIY
  • 64. • Multi-layer crypto system (HSM basis, scale out layer) • Easy for developers to use • Key management handled transparently • Access control and auditable operations
      ICipherContext cipherContext = CryptexClientFactory.getCipherContext(KeySet.testkey);
      // encryption
      String cipherText = cipherContext.encrypt("NETFLIX");
      // decryption
      String plainText = cipherContext.decrypt(cipherText);
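The sign/verify use case mentioned for Cryptex can be illustrated with a thin wrapper that keeps key handling away from callers. This is not Cryptex: it is a stand-in built on Python's standard `hmac` module, the key lives in process memory rather than behind an HSM, and the class name `UrlSigner` is hypothetical.

```python
import hashlib
import hmac


class UrlSigner:
    """Illustrative sign/verify wrapper; callers never touch the key directly."""

    def __init__(self, key: bytes):
        self._key = key

    def sign(self, url: str) -> str:
        """Return a hex HMAC-SHA256 signature for the URL."""
        return hmac.new(self._key, url.encode("utf-8"), hashlib.sha256).hexdigest()

    def verify(self, url: str, signature: str) -> bool:
        """Check a signature using a constant-time comparison."""
        return hmac.compare_digest(self.sign(url), signature)


signer = UrlSigner(b"example-key-for-illustration-only")
signature = signer.sign("/watch?id=123")
print(signer.verify("/watch?id=123", signature), signer.verify("/watch?id=456", signature))
```

The design point matches the slide: developers call `sign` and `verify` without choosing algorithms or managing keys themselves.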
  • 65. Cloud SSO • Authenticated access to dashboards, admin apps in the cloud is problematic: no datacenter access, no LDAP, AD
  • 66. Cloud SSO • Solution - leverage OneLogin SaaS SSO option (SAML) used by IT for enterprise apps • Built filter that integrates with our platform web server to make SSO/authentication trivial
  • 67. Trust, but verify
  • 68. Culture of ‘freedom and responsibility’ precludes traditional centralized, command and control approach
  • 69. Security Monkey • Cloud APIs make verification and analysis of configuration & running state simpler • Security Monkey created as the framework for this analysis • Includes: cert checking, firewall analysis, IAM entity analysis, limit warnings
  • 70. Security Monkey
      From: Security Monkey
      Date: Wed, 24 Oct 2012 17:08:18 +0000
      To: Security Alerts
      Subject: prod Changes Detected
      Table of Contents:
        Security Groups
          Changed Security Group
            <sgname> (eu-west-1 / prod) <#Security Group/<sgname> (eu-west-1 / prod)>
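A "Changes Detected" report like the one above implies comparing two snapshots of configuration. The following is a minimal sketch of that comparison, assuming each snapshot maps a security group name to a set of rule strings. The data shape, rule notation, and function name are all hypothetical and say nothing about how Security Monkey is actually built.

```python
def diff_security_groups(previous, current):
    """Compare two snapshots mapping group name -> set of rule strings."""
    changes = {}
    for name in sorted(set(previous) | set(current)):
        before = previous.get(name, set())
        after = current.get(name, set())
        if before != after:
            changes[name] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


yesterday = {"web": {"tcp/80 from elb"}, "db": {"tcp/3306 from web"}}
today = {"web": {"tcp/80 from elb", "tcp/22 from bastion"}, "db": {"tcp/3306 from web"}}
print(diff_security_groups(yesterday, today))
```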
  • 71. Exploit Monkey • Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans
      On 10/23/12 12:35 PM, Exploit Monkey wrote:
      I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002.
      I’m starting a vulnerability scan against test app from these private/public IPs:
      10.29.24.174
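The trigger described above (an application's current ASG name changes, so a scan is started) could be detected with a comparison like the one below. This is a sketch under assumptions: the app-to-ASG mappings and the function name are hypothetical, and the scan itself is not modelled.

```python
def changed_asgs(previous, current):
    """Return (app, old_asg, new_asg) for each app whose active ASG name changed."""
    return [
        (app, previous.get(app), asg)
        for app, asg in sorted(current.items())
        if previous.get(app) != asg
    ]


before = {"testapp-live": "testapp-live-v001", "otherapp": "otherapp-v003"}
after = {"testapp-live": "testapp-live-v002", "otherapp": "otherapp-v003"}
for app, old, new in changed_asgs(before, after):
    print(f"{app} changed current ASG name from {old} to {new}")
```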
  • 72. ELB Checker (gauntlt) • AWS’ Elastic Load Balancer (ELB) provides cross-datacenter traffic balancing, but no security controls (if your cluster is attached to an ELB, it is available to the Internet) • Engineers may misunderstand use cases for ELBs, security features, and/or other measures that can be used to protect ELB-fronted clusters
  • 73. Solution: gauntlt Testing
      1. Launch gauntlt test runner instance, loaded with “master list” of ELBs and expected state
      2. Determine “target list” of current ELBs to evaluate
      3. Generate per-ELB listener gauntlt attack files
      4. Execute attacks
      5. Alert on failures and new ELBs
      6. Triage findings and update ELB master list
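The comparison at the heart of steps 2 and 5 (current ELBs versus the master list, alerting on new ELBs and on unexpected state) can be sketched as follows. This substitutes a plain set comparison for the real gauntlt attack runs, and it assumes each list maps an ELB name to its set of listener ports. The names and ports are made up for illustration.

```python
def evaluate_elbs(master, current):
    """Compare current ELBs to the master list.

    Both arguments map ELB name -> set of listener ports (assumed shape).
    Returns (new_elbs, failures), where failures maps a known ELB to any
    listener ports that are not in its expected state.
    """
    new_elbs = sorted(set(current) - set(master))
    failures = {}
    for name in sorted(set(current) & set(master)):
        unexpected = current[name] - master[name]
        if unexpected:
            failures[name] = sorted(unexpected)
    return new_elbs, failures


master_list = {"api-elb": {443}, "web-elb": {80, 443}}
target_list = {"api-elb": {443, 8080}, "web-elb": {80, 443}, "debug-elb": {22}}
print(evaluate_elbs(master_list, target_list))
```

Anything returned here would go to triage (step 6), after which the master list is updated.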
  • 74. Self-service, with exceptions
  • 75. AWS Security Groups • Asgard cloud orchestration tool allows developers to configure their own firewall rules • Limited to same-account groups, no IP-based rules • Handles 95% of requirements, JIRAs for additional changes, and Security Monkey to keep an eye on things
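The self-service policy above (same-account groups only, no IP-based rules, everything else handled as an exception) can be expressed as a small check. This is a sketch with a hypothetical request shape and made-up account IDs. It is not Asgard's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuleRequest:
    """A requested ingress rule (hypothetical shape)."""
    port: int
    source_group: Optional[str] = None
    source_account: Optional[str] = None
    source_cidr: Optional[str] = None


def is_self_service(rule: RuleRequest, own_account: str) -> bool:
    """True if the rule fits the self-service policy; otherwise it needs a ticket."""
    if rule.source_cidr is not None:
        return False  # IP-based rules are not self-service
    return rule.source_group is not None and rule.source_account == own_account


same_account = RuleRequest(port=7001, source_group="frontend", source_account="111111111111")
ip_based = RuleRequest(port=22, source_cidr="203.0.113.0/24")
print(is_self_service(same_account, "111111111111"), is_self_service(ip_based, "111111111111"))
```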
  • 76. Takeaways • Netflix runs a large, dynamic service in AWS • Good guidance + specific context can help jumpstart a pragmatic security program • Newer concepts like cloud & DevOps need updated approach to security • Don’t swim upstream - integrate and collaborate with your engineering partners
  • 77. Netflix References • http://netflix.github.com/ • http://techblog.netflix.com/ • http://slideshare.net/netflix
  • 78. Other References • http://www.webpronews.com/netflix-outage-angers-customers-2008-08 • http://www.pcmag.com/article2/0,2817,2395372,00.asp • http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php • http://bsimm.com/online/ • http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884 • http://www.gauntlt.org
  • 79. Questions? chan@netflix.com