Real World Cloud Application Security
  • Transcript of "Real World Cloud Application Security"

    1. Real World Cloud Application Security: Lessons Learned Running Large Scale Systems in the Public Cloud. Jason Chan - chan@netflix.com
    2. Netflix, Inc.: “With more than 27 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV programs . . .” Source: http://ir.netflix.com
    3. Me
        • Cloud Security Architect @ Netflix
        • Responsible for cloud app, product, infrastructure, and ops security
        • Previously led the security team @ VMware
        • Earlier, primarily security consulting at @stake and iSEC Partners
    5. AppSec Challenges
    6. Lots of Good Advice
        • BSIMM
        • Microsoft SDL
        • SAFECode
    7. But, what works? (Forrester Consulting, 12/10)
    8. Especially given phenomena such as DevOps, cloud, and agile, and the unique characteristics of an organization?
    9. Netflix Engineering Characteristics
    14. Netflix in the Cloud - Why? “Undifferentiated heavy lifting”
    15. Netflix is now ~99% in Public Cloud
    17. On the way to the cloud . . . (or NoOps, depending on definitions)
    18. Some As-Is #s
        • 27m+ subscribers
        • 10,000s of systems
        • 100s of engineers, apps
        • ~250 test deployments/day *
        • ~70 production deployments/day *
        * Sample based on this week’s activities
    19. Deploying Code @ Netflix
    24. A common graph @ Netflix
        • Lots of watching in prime time
        • Not as much in early morning
        • Old way - pay and provision for peak, 24/7/365
        • Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
    25. Solution: Load-Based Autoscaling
    33. Autoscaling
        • Goals:
          • # of systems matches load requirements
          • Load per server is constant
          • Happens without intervention (the ‘auto’ in autoscaling)
        • Results:
          • Continuously adding/removing nodes
          • New nodes must mirror existing
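The autoscaling goals on this slide (node count tracks load, load per server stays constant) can be sketched as a small calculation. This is an illustration only: the function name, the per-node target, and the min/max bounds are assumptions, not values or policies described in the talk.

```python
import math


def desired_node_count(total_load: float, target_load_per_node: float,
                       min_nodes: int = 1, max_nodes: int = 1000) -> int:
    """Return how many nodes keep per-node load at or below the target.

    All parameters are illustrative assumptions, not values from the talk.
    """
    if target_load_per_node <= 0:
        raise ValueError("target_load_per_node must be positive")
    needed = math.ceil(total_load / target_load_per_node)
    # Clamp to the configured bounds so the group never scales to zero
    # or grows without limit.
    return max(min_nodes, min(max_nodes, needed))
```

Running this on every metrics tick and adding or removing nodes to match the result is one way to get the "happens without intervention" property.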
    34. Every change requires a new cluster push (not an incremental change to existing systems)
    35. Deploying must be easy (it is)
    42. Netflix Deployment Pipeline
        • Perforce/Git: code change, config change
        • YUM: RPM file with app-specific bits
        • Bakery: base image + RPM
        • AMI: VM template ready to launch
        • ASG: cluster config, running systems
    43. Operational Impact
        • No changes to running systems
        • No systems management infrastructure
        • Fewer logins to prod
        • No snowflakes
        • Trivial “rollback”
    44. Security Impact
        • Need to think differently on:
          • Vulnerability management
          • Patch management
          • User activity monitoring
          • File integrity monitoring
          • Forensic investigations
    45. Org, architecture, and deployment are different. What about security?
    46. We’ve adapted too. Some principles we’ve found useful.
    47. Integrate
    48. Base AMI Security
        • AMI = Amazon Machine Image
        • @ Netflix, all apps are based on “Base AMI”, and new pushes pick up the latest
        • Concentrating testing and improvements here provides greatest impact
        • Average age of running instance: 24 days*
        • 60% of instances less than 1 week old*
        * Based on one-time sampling (yesterday)
    51. Base AMI Testing
        • The base AMI is managed like other packages, via P4, Jenkins, etc.
        • We watch the base AMI’s SCM directory & kick off testing when it changes
        • Launch an instance of the AMI, perform vuln scan and other checks
        • Example scan alert:
          SCAN COMPLETED ALERT
          Site name: AMI1
          Stopped by: N/A
          Total Scan Time: 4 minutes 46 seconds
          Critical Vulnerabilities: 5
          Severe Vulnerabilities: 4
          Moderate Vulnerabilities: 4
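A scan alert like the one on this slide could feed an automated pass/fail gate for the base AMI. The sketch below assumes the alert arrives as plain text in the format shown. The gating thresholds and both function names are hypothetical, since the talk does not say how results are acted on.

```python
import re


def parse_scan_alert(text: str) -> dict:
    """Extract '<Severity> Vulnerabilities: <n>' counts from alert text."""
    counts = {}
    for severity, number in re.findall(r"(\w+) Vulnerabilities:\s*(\d+)", text):
        counts[severity.lower()] = int(number)
    return counts


def ami_passes(counts: dict, max_critical: int = 0, max_severe: int = 0) -> bool:
    """Hypothetical gate: fail the AMI if counts exceed the allowed limits."""
    return (counts.get("critical", 0) <= max_critical
            and counts.get("severe", 0) <= max_severe)


alert = ("Site name: AMI1 Total Scan Time: 4 minutes 46 seconds "
         "Critical Vulnerabilities: 5 Severe Vulnerabilities: 4 "
         "Moderate Vulnerabilities: 4")
counts = parse_scan_alert(alert)
```

With the sample alert, `ami_passes(counts)` is false under the default limits, so a build job could stop the AMI from being promoted.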
    52. Security Packaging
        • All security tools use the same toolchain as the rest of engineering (P4/Git, Jenkins, etc.)
    53. From the RPM spec file of a webserver:
        Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
    54. Pulls in the following RPMs:
        • Host hardening package
        • WAF agent
        • OSSEC (HIDS agent)
        • CloudPassage (config assessment, FW, etc.)
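Because the security agents arrive as ordinary RPM dependencies, a build step could check that a spec file still declares them. The sketch below is an assumption about how such a check might look. The package names come from the slide, but treating them as a mandatory set and the function name are hypothetical.

```python
REQUIRED_SECURITY_PACKAGES = {
    "ossec", "cloudpassage", "nflx-base-harden", "hyperguard-enforcer",
}


def missing_security_packages(spec_text: str) -> set:
    """Return required packages not named on any 'Requires:' line."""
    declared = set()
    for line in spec_text.splitlines():
        line = line.strip()
        if line.startswith("Requires:"):
            # Dependencies may be separated by spaces or commas.
            declared.update(line[len("Requires:"):].replace(",", " ").split())
    return REQUIRED_SECURITY_PACKAGES - declared
```

An empty result means every security package is declared. A non-empty result names the ones a reviewer should ask about.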
    55. Static Analysis
        • Available self-service through build environment (FindBugs, PMD)
        • Jenkins (CI) plugin to display graphs and support drill through to results
    56. MAN Integration
    57. Many systems involved, standardization is important
    58. Central Alerting Gateway
        • A single place to generate alerts
        • Python, Java libraries (or JSON POST) to easily alert on events of interest
        • Ties in to PagerDuty notification system
        • Allows for stateful alerting and some response
        • A prerequisite that our tools will leverage
    59. CAG Example
        import CORE.Gateway
        gw = CORE.Gateway.Gateway()
        gw.send("testcluster", "normal", "Something went wrong")
    60. Chronos
        • Timeline system (API and UI) with Java/Python libraries, or JSON POST
        • Track config changes, deployments, etc.
        • Security tools also leverage it for tracking and analysis
    61. Chronos Security Examples
        • What IP addresses have been blacklisted by the WAF in the last few weeks?
          GET /api/v1/event?timelines=type:blacklist&start=20121012000000000
        • Which security groups have changed today?
          GET /api/v1/event?timelines=type:securitygroup&start=20121024000000000
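The query strings on this slide can be generated programmatically. The sketch below only builds the request path. It assumes the 17-digit `start` value is year, month, day, hour, minute, second, and milliseconds, which is inferred from the two examples rather than stated in the talk, and the function name is hypothetical.

```python
from datetime import datetime
from urllib.parse import quote


def chronos_event_query(event_type: str, start: datetime) -> str:
    """Build the request path for events of one type since `start`."""
    # Assumed format: yyyyMMddHHmmss followed by three millisecond digits.
    stamp = start.strftime("%Y%m%d%H%M%S") + f"{start.microsecond // 1000:03d}"
    # Keep ':' literal to match the slide's 'type:<name>' syntax.
    timelines = quote(f"type:{event_type}", safe=":")
    return f"/api/v1/event?timelines={timelines}&start={stamp}"
```

Calling `chronos_event_query("blacklist", datetime(2012, 10, 12))` reproduces the first example path from the slide.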
    62. Make the right way easy (and secure)
    63. Cryptex
        • Many uses of crypto in web/distributed systems:
          • Encrypt/decrypt (cookies, data, etc.)
          • Sign/verify (URLs, data, etc.)
        • Known as an area where developers should not DIY
    64. • Multi-layer crypto system (HSM basis, scale out layer)
        • Easy for developers to use
        • Key management handled transparently
        • Access control and auditable operations
        ICipherContext cipherContext = CryptexClientFactory.getCipherContext(KeySet.testkey);
        // encryption
        String cipherText = cipherContext.encrypt("NETFLIX");
        // decryption
        String plainText = cipherContext.decrypt(cipherText);
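To make the sign/verify use case concrete, here is a generic URL-signing sketch using Python's standard `hmac` module. It is not the Cryptex API. The function names and the `sig` query parameter are assumptions, and the key is passed in directly, which is exactly the key-handling burden a service like Cryptex is meant to take away from developers.

```python
import hashlib
import hmac


def sign_url(url: str, key: bytes) -> str:
    """Append an HMAC-SHA256 signature as a 'sig' query parameter."""
    digest = hmac.new(key, url.encode("utf-8"), hashlib.sha256).hexdigest()
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}sig={digest}"


def verify_url(signed_url: str, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    # Split on the separator that sign_url added ('&sig=' or '?sig=').
    for marker in ("&sig=", "?sig="):
        url, found, digest = signed_url.rpartition(marker)
        if found:
            break
    else:
        return False  # no signature present
    expected = hmac.new(key, url.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("utf-8"), digest.encode("utf-8"))
```

The constant-time comparison is the kind of detail that is easy to miss when every team writes its own version, which is the argument for a shared service.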
    65. Cloud SSO
        • Authenticated access to dashboards, admin apps in the cloud is problematic
          • No datacenter access, no LDAP, AD
    66. Cloud SSO
        • Solution - leverage OneLogin SaaS SSO option (SAML) used by IT for enterprise apps
        • Built filter that integrates with our platform web server to make SSO/authentication trivial
    67. Trust, but verify
    68. Culture of ‘freedom and responsibility’ precludes traditional centralized, command and control approach
    69. Security Monkey
        • Cloud APIs make verification and analysis of configuration & running state simpler
        • Security Monkey created as the framework for this analysis
        • Includes:
          • Cert checking
          • Firewall analysis
          • IAM entity analysis
          • Limit warnings
    70. Security Monkey
        From: Security Monkey
        Date: Wed, 24 Oct 2012 17:08:18 +0000
        To: Security Alerts
        Subject: prod Changes Detected
        Table of Contents:
          Security Groups
            Changed Security Group
              <sgname> (eu-west-1 / prod)
        <#Security Group/<sgname> (eu-west-1 / prod)>
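A "Changed Security Group" alert like the one above implies comparing two snapshots of a group's rules. The sketch below shows that comparison in isolation. The rule representation (a tuple of protocol, port, and source) and the function name are assumptions, and fetching the snapshots from the cloud API is left out.

```python
def diff_security_group(old_rules: set, new_rules: set) -> dict:
    """Compare two snapshots of one group's rules.

    Each rule is a hypothetical (protocol, port, source) tuple.
    """
    return {
        "added": sorted(new_rules - old_rules),
        "removed": sorted(old_rules - new_rules),
    }


before = {("tcp", 443, "sg-frontend"), ("tcp", 22, "sg-bastion")}
after = {("tcp", 443, "sg-frontend"), ("tcp", 8080, "sg-frontend")}
changes = diff_security_group(before, after)
```

A non-empty `added` or `removed` list is what would trigger an email like the one on this slide.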
    71. Exploit Monkey
        • Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans
        • Example notification:
          On 10/23/12 12:35 PM, Exploit Monkey wrote:
          I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002.
          I'm starting a vulnerability scan against test app from these private/public IPs:
          10.29.24.174
    72. ELB Checker (gauntlt)
        • AWS’ Elastic Load Balancer (ELB) provides cross-datacenter traffic balancing, but no security controls (if your cluster is attached to an ELB, it is available to the Internet)
        • Engineers may misunderstand use cases for ELBs, security features, and/or other measures that can be used to protect ELB-fronted clusters
    73. Solution: gauntlt Testing
        1. Launch gauntlt test runner instance, loaded with “master list” of ELBs and expected state
        2. Determine “target list” of current ELBs to evaluate
        3. Generate per-ELB listener gauntlt attack files
        4. Execute attacks
        5. Alert on failures and new ELBs
        6. Triage findings and update ELB master list
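The comparison behind steps 2 and 5 (current ELBs versus the master list) can be sketched without gauntlt itself. This example does not generate or run gauntlt attack files. It assumes, hypothetically, that the observed state has been reduced to a mapping from ELB name to open listener ports, and the function name is also an assumption.

```python
def evaluate_elbs(master: dict, current: dict) -> dict:
    """Compare live ELB listeners against the expected 'master list'.

    Both arguments map an ELB name to its set of listener ports.
    """
    new_elbs = sorted(set(current) - set(master))
    failures = {}
    for name in sorted(set(current) & set(master)):
        unexpected = current[name] - master[name]
        if unexpected:
            failures[name] = sorted(unexpected)
    return {"new_elbs": new_elbs, "failures": failures}
```

Anything in `new_elbs` or `failures` goes to triage (step 6), after which the master list is updated so the next run starts from the reviewed state.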
    74. Self-service, with exceptions
    75. AWS Security Groups
        • Asgard cloud orchestration tool allows developers to configure their own firewall rules
        • Limited to same-account groups, no IP-based rules
        • Handles 95% of requirements, JIRAs for additional changes, and Security Monkey to keep an eye on things
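The self-service limits on this slide (same-account groups only, no IP-based rules) amount to a simple validation. The sketch below is an assumption about how such a check could be expressed. It is not Asgard's implementation, and the function name and parameters are hypothetical.

```python
import ipaddress


def is_self_service_rule(source: str, source_account: str, own_account: str) -> bool:
    """Allow only group-based sources that belong to the same account."""
    try:
        # Any value that parses as an IP or CIDR is an IP-based rule.
        ipaddress.ip_network(source, strict=False)
        return False
    except ValueError:
        pass
    return source_account == own_account
```

Rules that fail the check would fall into the "JIRAs for additional changes" path rather than being rejected outright.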
    76. Takeaways
        • Netflix runs a large, dynamic service in AWS
        • Good guidance + specific context can help jumpstart a pragmatic security program
        • Newer concepts like cloud & DevOps need updated approach to security
        • Don’t swim upstream - integrate and collaborate with your engineering partners
    77. Netflix References
        • http://netflix.github.com/
        • http://techblog.netflix.com/
        • http://slideshare.net/netflix
    78. Other References
        • http://www.webpronews.com/netflix-outage-angers-customers-2008-08
        • http://www.pcmag.com/article2/0,2817,2395372,00.asp
        • http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php
        • http://bsimm.com/online/
        • http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884
        • http://www.gauntlt.org
    79. Questions? chan@netflix.com
