Real World Cloud
 Application Security
Lessons Learned Running Large Scale Systems
            in the Public Cloud


   Jason Chan - chan@netflix.com
Netflix, Inc.
         “With more than 27 million
          streaming members in the
         United States, Canada, Latin
        America, the United Kingdom
        and Ireland, Netflix, Inc. is the
           world's leading internet
       subscription service for enjoying
        movies and TV programs . . .”
Source: http://ir.netflix.com
Me
• Cloud Security Architect @ Netflix
• Responsible for:
  • Cloud app, product, infrastructure, ops
     security
• Previously:
  • Led security team @ VMware
  • Earlier, primarily security consulting at
     @stake, iSEC Partners
AppSec Challenges
Lots of Good Advice
• BSIMM
• Microsoft SDL
• SAFECode
But, what works?




Forrester Consulting, 12/10
Especially, given phenomena
 such as DevOps, cloud, agile,
and the unique characteristics
     of an organization?
Netflix Engineering
 Characteristics
Netflix in the Cloud -
        Why?

           “Undifferentiated heavy lifting”
Netflix is now ~99% in
    Public Cloud
On the way to the cloud . . .
(or NoOps, depending on definitions)
Some As-Is #s

• 27m+ subscribers
• 10,000s of systems
• 100s of engineers, apps
• ~250 test deployments/day *
• ~70 production deployments/day *
* Sample based on this week’s activities
Deploying Code @ Netflix
A common graph @ Netflix
• Lots of watching in prime time
• Not as much in early morning
• Old way - pay and provision for peak, 24/7/365

Multiply this pattern across the dozens of apps
that comprise the Netflix streaming service
Solution: Load-Based
    Autoscaling
Autoscaling
• Goals:
  • # of systems matches load requirements
  • Load per server is constant
  • Happens without intervention (the ‘auto’ in autoscaling)
• Results:
  • Continuously adding/removing nodes
  • New nodes must mirror existing
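The first two goals can be illustrated with a small calculation: pick the server count so that per-server load stays near a target. This is a minimal sketch, not Netflix's scaling policy; the function name, the load figures, the per-server target, and the size bounds are all assumptions made for illustration.

```python
import math


def desired_instances(total_load, target_load_per_server, min_size=1, max_size=1000):
    """Return how many servers keep per-server load near the target."""
    if target_load_per_server <= 0:
        raise ValueError("target_load_per_server must be positive")
    needed = math.ceil(total_load / target_load_per_server)
    # Clamp to the (hypothetical) configured bounds of the group.
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    # Hypothetical requests/second at two points in the daily curve.
    for label, load in [("early morning", 1200), ("prime time", 18000)]:
        print(label, desired_instances(load, target_load_per_server=500))
```

Because the count follows the load curve, nodes are added and removed continuously, which is why every new node has to be launched from an image that mirrors the existing ones.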
Every change requires a new
         cluster push
(not an incremental change to
      existing systems)
Deploying must be easy
        (it is)
Netflix Deployment Pipeline
1. Perforce/Git: code change, config change
2. YUM: RPM file with app-specific bits
3. Bakery: base image + RPM
4. AMI: VM template ready to launch
5. ASG: cluster config, running systems
Operational Impact

• No changes to running systems
• No systems management infrastructure
• Fewer logins to prod
• No snowflakes
• Trivial “rollback”
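Why rollback becomes trivial can be sketched in a few lines: if every push creates a new versioned cluster instead of modifying running systems, rolling back just means pointing at the previous cluster again. The sketch below assumes a `name-vNNN` ASG naming pattern like the `testapp-live-v001` name in the Exploit Monkey email later in this deck; the `Cluster` class and its methods are hypothetical and do not call any AWS API.

```python
import re


def next_asg_name(current):
    """Return the next versioned ASG name, e.g. app-v001 -> app-v002."""
    match = re.fullmatch(r"(.+)-v(\d+)", current)
    if match is None:
        raise ValueError(f"unexpected ASG name: {current!r}")
    base, number = match.groups()
    # Keep the zero-padded width of the original version number.
    return f"{base}-v{int(number) + 1:0{len(number)}d}"


class Cluster:
    """Tracks which ASG is active; running systems are never modified."""

    def __init__(self, first_asg):
        self.active = first_asg
        self.previous = None

    def push(self):
        # A change is deployed as a brand-new ASG built from a new AMI.
        self.previous, self.active = self.active, next_asg_name(self.active)
        return self.active

    def rollback(self):
        # Rolling back is only a switch to the ASG that was active before.
        if self.previous is None:
            raise RuntimeError("no previous ASG to roll back to")
        self.active, self.previous = self.previous, self.active
        return self.active
```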
Security Impact
• Need to think differently on:
 • Vulnerability management
 • Patch management
 • User activity monitoring
 • File integrity monitoring
 • Forensic investigations
Org, architecture,
deployment is different,
 what about security?
We’ve adapted too.
Some principles we’ve
   found useful.
Integrate
Base AMI Security
•   AMI = Amazon Machine                   •   Average age of running
    Image                                      instance: 24 days*

•   @ Netflix, all apps are                 •   60% of instances less
    based on “Base AMI”,                       than 1 week old*
    and new pushes pick up
    the latest

•   Concentrating testing
    and improvements here
    provides greatest impact


* Based on one-time sampling (yesterday)
Base AMI Testing
• The base AMI is managed like other packages, via P4, Jenkins, etc.
• We watch the base AMI’s SCM directory & kick off testing when it changes
• Launch an instance of the AMI, perform vuln scan and other checks

SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities:   4
Moderate Vulnerabilities: 4
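One way to act on a scan alert like the one on this slide is to parse the severity counts and gate the base AMI on them. This is a sketch under assumptions: the alert text is copied from the slide, but the pass/fail thresholds and both function names are hypothetical and not taken from the talk.

```python
import re

ALERT = """SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities:   4
Moderate Vulnerabilities: 4
"""


def parse_counts(alert_text):
    """Extract {severity: count} from '<Severity> Vulnerabilities: N' lines."""
    pattern = r"^(\w+) Vulnerabilities:\s*(\d+)\s*$"
    return {
        severity.lower(): int(count)
        for severity, count in re.findall(pattern, alert_text, re.MULTILINE)
    }


def ami_passes(counts, max_allowed=None):
    """Return True if no severity exceeds its (hypothetical) limit."""
    if max_allowed is None:
        max_allowed = {"critical": 0}  # assumed policy: no criticals allowed
    return all(counts.get(sev, 0) <= limit for sev, limit in max_allowed.items())


if __name__ == "__main__":
    counts = parse_counts(ALERT)
    print(counts, "pass" if ami_passes(counts) else "fail")
```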
Security Packaging


• All security tools use the same toolchain as
  the rest of engineering (P4/Git, Jenkins, etc.)
• From the RPM spec file of a webserver:
  Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
• Pulls in the following RPMs:
   • Host hardening package
   • WAF agent
   • OSSEC (HIDS agent)
   • CloudPassage (config assessment,
    FW, etc.)
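Since the security agents arrive as ordinary RPM dependencies, a build-time check can confirm that an app's spec file still requires them. The sketch below is an illustration only: the package names come from the `Requires:` line on this slide, while the checker functions and the simplified spec parsing (plain `Requires:` lines, optional version constraints skipped) are assumptions.

```python
REQUIRED_SECURITY_PACKAGES = frozenset(
    {"ossec", "cloudpassage", "nflx-base-harden", "hyperguard-enforcer"}
)
_VERSION_OPS = {"<", "<=", "=", ">=", ">"}


def parse_requires(spec_text):
    """Collect package names from 'Requires:' lines of an RPM spec file."""
    deps = set()
    for line in spec_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep or key.lower() != "requires":
            continue
        skip_next = False
        for token in value.replace(",", " ").split():
            if skip_next:
                skip_next = False  # this token is a version, not a package
            elif token in _VERSION_OPS:
                skip_next = True
            else:
                deps.add(token)
    return deps


def missing_security_packages(spec_text, required=REQUIRED_SECURITY_PACKAGES):
    """Return the required security packages the spec does not pull in."""
    return sorted(required - parse_requires(spec_text))
```

A CI job could fail the build whenever `missing_security_packages` returns a non-empty list.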
Static Analysis

• Available self-service through build
  environment (FindBugs, PMD)
• Jenkins (CI) plugin to display graphs and
  support drill through to results
MAN Integration
Many systems involved,
  standardization is
      important
Central Alerting
           Gateway
• A single place to generate alerts
• Python, Java libraries (or json post) to easily
  alert on events of interest
• Ties in to PagerDuty notification system
• Allows for stateful alerting and some
  response
• A prerequisite that our tools will leverage
CAG Example

import CORE.Gateway

# Create a gateway client, then send an alert through the Central Alerting Gateway
gw = CORE.Gateway.Gateway()
gw.send("testcluster", "normal", "Something went wrong")
Chronos

• Timeline system (API and UI) with Java/
  Python libraries, or json post
• Track config changes, deployments, etc.
• Security tools also leverage for tracking and
  analysis
Chronos Security
          Examples
• What IP addresses have been blacklisted by
  the WAF in the last few weeks?
  GET /api/v1/event?timelines=type:blacklist&start=20121012000000000

• Which security groups have changed today?
  GET /api/v1/event?timelines=type:securitygroup&start=20121024000000000
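The two queries above share one shape, so a small helper can build them. This is a sketch that only constructs the request path; it assumes the 17-digit `start` value is `YYYYMMDDhhmmss` followed by three millisecond digits, which is inferred from the examples rather than stated in the talk. The helper names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_timestamp(moment):
    """Format a datetime as 17 digits (assumed: date, time, milliseconds)."""
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def event_query(event_type, start):
    """Build the GET path for events of one type since `start`."""
    params = {
        "timelines": f"type:{event_type}",
        "start": chronos_timestamp(start),
    }
    # Keep ':' unescaped so the path matches the form shown on the slide.
    return "/api/v1/event?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(event_query("blacklist", datetime(2012, 10, 12)))
    print(event_query("securitygroup", datetime(2012, 10, 24)))
```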
Make the right way easy
     (and secure)
Cryptex
• Many uses of crypto in web/distributed
  systems:
 • Encrypt/decrypt (cookies, data, etc.)
 • Sign/verify (URLs, data, etc.)
• Known as an area where developers should
  not DIY
• Multi-layer crypto system (HSM basis, scale-out layer)
 • Easy for developers to use
 • Key management handled transparently
 • Access control and auditable operations
ICipherContext cipherContext =
                CryptexClientFactory.getCipherContext(KeySet.testkey);
// encryption
String cipherText = cipherContext.encrypt("NETFLIX");
// decryption
String plainText = cipherContext.decrypt(cipherText);
Cloud SSO


• Authenticated access to dashboards, admin
  apps in the cloud is problematic
 • No datacenter access, no LDAP, AD
Cloud SSO
• Solution - leverage OneLogin SaaS SSO
  option (SAML) used by IT for enterprise
  apps
• Built filter that integrates with our platform
  web server to make SSO/authentication
  trivial
Trust, but verify
Culture of ‘freedom and
   responsibility’ precludes
    traditional centralized,
command and control approach
Security Monkey
• Cloud APIs make verification and analysis of configuration & running state simpler
• Security Monkey created as the framework for this analysis
• Includes:
  • Cert checking
  • Firewall analysis
  • IAM entity analysis
  • Limit warnings
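The firewall analysis item can be pictured as a diff between two snapshots of security group rules. The sketch below is not Security Monkey's code: it assumes each snapshot is a dict mapping a group name to a set of `(protocol, port, source)` tuples, a simplified model chosen for illustration, and it does no AWS API calls.

```python
def diff_security_groups(before, after):
    """Return {group: {"added": [...], "removed": [...]}} for changed groups."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old_rules = before.get(name, set())
        new_rules = after.get(name, set())
        added = sorted(new_rules - old_rules)
        removed = sorted(old_rules - new_rules)
        if added or removed:
            changes[name] = {"added": added, "removed": removed}
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots taken on two consecutive runs.
    yesterday = {"web": {("tcp", 80, "elb"), ("tcp", 22, "bastion")}}
    today = {"web": {("tcp", 80, "elb"), ("tcp", 8080, "elb")}}
    for group, change in diff_security_groups(yesterday, today).items():
        print(group, change)
```

A non-empty result is the kind of event that could feed a "Changes Detected" notification.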
Security Monkey


     From:  Security Monkey
     Date:  Wed, 24 Oct 2012 17:08:18 +0000
     To:  Security Alerts
     Subject:  prod Changes Detected


             Table of Contents:
                 Security Groups
                 
                         Changed Security Group
                         
                             
                             <sgname> (eu-west-1 / prod)
                              <#Security Group/<sgname> (eu-west-1 / prod)>
                         
Exploit Monkey
• Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans

On 10/23/12 12:35 PM, Exploit Monkey wrote:

I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002.

I'm starting a vulnerability scan against test app from these private/public IPs:
10.29.24.174
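The trigger described above amounts to remembering the last ASG seen per app and reacting when it changes. A minimal sketch, assuming the caller supplies both mappings (app name to current ASG name); the function names are hypothetical, and starting the actual scan is left out.

```python
def changed_apps(previous, current):
    """Return (app, old_asg, new_asg) for apps whose current ASG changed."""
    changes = []
    for app, asg in sorted(current.items()):
        old = previous.get(app)
        # Apps seen for the first time are left for the caller to handle.
        if old is not None and old != asg:
            changes.append((app, old, asg))
    return changes


def scan_notice(app, old_asg, new_asg, ips):
    """Compose a notification similar to the email shown above."""
    return (
        f"I noticed that {app} has changed current ASG name "
        f"from {old_asg} to {new_asg}.\n\n"
        f"I'm starting a vulnerability scan against {app} "
        "from these private/public IPs:\n" + "\n".join(ips)
    )


if __name__ == "__main__":
    last_seen = {"testapp-live": "testapp-live-v001"}
    now = {"testapp-live": "testapp-live-v002"}
    for app, old, new in changed_apps(last_seen, now):
        print(scan_notice(app, old, new, ["10.29.24.174"]))
```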
ELB Checker (gauntlt)

•   AWS’ Elastic Load Balancer (ELB) provides cross-datacenter
    traffic balancing, but no security controls (if your cluster
    is attached to an ELB, it is available to the Internet)

•   Engineers may misunderstand use cases for ELBs,
    security features, and/or other measures that can
    be used to protect ELB-fronted clusters
Solution: gauntlt Testing
1. Launch gauntlt test runner
   instance, loaded with “master
   list” of ELBs and expected state

2. Determine “target list” of
   current ELBs to evaluate

3. Generate per-ELB listener
   gauntlt attack files

4. Execute attacks

5. Alert on failures and new ELBs

6. Triage findings and update ELB
   master list
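The comparison at the heart of steps 2 to 5 can be sketched without any gauntlt specifics. In the sketch below the master list maps each ELB name to `{listener_port: expected_reachable}`, the current state maps each ELB name to its listener ports, and `probe` stands in for executing a generated attack file. All three shapes are assumptions made for illustration, not the format Netflix uses.

```python
def check_elbs(master_list, current_listeners, probe):
    """Return alert strings for new ELBs and for listeners in an unexpected state."""
    alerts = []
    for elb in sorted(current_listeners):
        if elb not in master_list:
            # Step 5: ELBs missing from the master list are reported for triage.
            alerts.append(f"new ELB not in master list: {elb}")
            continue
        for port in sorted(current_listeners[elb]):
            expected = master_list[elb].get(port, False)
            observed = probe(elb, port)  # steps 3-4: one check per listener
            if observed != expected:
                alerts.append(
                    f"{elb}:{port} expected reachable={expected}, "
                    f"observed reachable={observed}"
                )
    return alerts


if __name__ == "__main__":
    master = {"api-elb": {443: True, 8080: False}}
    current = {"api-elb": {443, 8080}, "debug-elb": {80}}
    # Hypothetical probe: pretend every listener answers from the Internet.
    for alert in check_elbs(master, current, lambda elb, port: True):
        print(alert)
```

Step 6 then closes the loop: once a finding is triaged, the master list is updated so the next run reflects the accepted state.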
Self-service, with
   exceptions
AWS Security Groups
•   Asgard cloud orchestration
    tool allows developers to
    configure their own firewall
    rules

•   Limited to same-account
    groups, no IP-based rules

•   Handles 95% of
    requirements, JIRAs for
    additional changes, and
    Security Monkey to keep an
    eye on things
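The self-service limits above (same-account groups only, no IP-based rules) can be expressed as a simple rule check. This is a sketch with a hypothetical rule shape, a dict with an optional `cidr` and an optional `source_group_account`; it is not Asgard's implementation.

```python
def review_rule(rule, own_account):
    """Return reasons a rule cannot be self-service; an empty list means allowed."""
    problems = []
    cidr = rule.get("cidr")
    source_account = rule.get("source_group_account")
    if cidr is not None:
        problems.append("IP-based (CIDR) sources are not self-service")
    if source_account is not None and source_account != own_account:
        problems.append("source group belongs to another account")
    if cidr is None and source_account is None:
        problems.append("rule has no source")
    return problems


if __name__ == "__main__":
    # Hypothetical account IDs and rules.
    rules = [
        {"port": 7001, "source_group_account": "111111111111"},
        {"port": 22, "cidr": "0.0.0.0/0"},
    ]
    for rule in rules:
        reasons = review_rule(rule, own_account="111111111111")
        print(rule["port"], "self-service" if not reasons else f"needs JIRA: {reasons}")
```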
Takeaways
• Netflix runs a large, dynamic service in AWS
• Good guidance + specific context can help
  jumpstart a pragmatic security program
• Newer concepts like cloud & DevOps need
  updated approach to security
• Don’t swim upstream - integrate and
  collaborate with your engineering partners
Netflix References

• http://netflix.github.com/
• http://techblog.netflix.com/
• http://slideshare.net/netflix
Other References
•   http://www.webpronews.com/netflix-outage-angers-customers-2008-08
•   http://www.pcmag.com/article2/0,2817,2395372,00.asp
•   http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php
•   http://bsimm.com/online/
•   http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884
•   http://www.gauntlt.org
Questions?
chan@netflix.com
