2. Netflix, Inc.
“Netflix is the world’s leading Internet television
network with more than 33 million members in
40 countries enjoying more than one billion
hours of TV shows and movies per month,
including original series . . .”
Source: http://ir.netflix.com
3. Me
Director of Engineering @ Netflix
Responsible for:
Cloud app, product, infrastructure, ops security
Previously:
Led security team @ VMware
Earlier, primarily security consulting at @stake, iSEC Partners
11. On the way to the cloud . . . (organization)
(or NoOps, depending on definitions)
12. Some As-Is #s
33m+ subscribers
10,000s of systems
100s of engineers, apps
~250 test deployments/day **
~70 production deployments/day **
** Sample based on one week's activities
14. Common Controls to Promote Resilience
Common controls: architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists
Architectural committees: designed to standardize on design patterns, vendors, etc.
Problems for Netflix:
Freedom and Responsibility culture
Highly aligned and loosely coupled
Innovation cycles
15. Common Controls to Promote Resilience
Change approval boards: designed to control and de-risk change; focus on artifacts, test and rollback plans
Problems for Netflix:
Freedom and Responsibility culture
Highly aligned and loosely coupled
Innovation cycles
16. Common Controls to Promote Resilience
Centralized deployments: a separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly)
Problems for Netflix:
Freedom and Responsibility culture
Highly aligned and loosely coupled
Innovation cycles
17. Common Controls to Promote Resilience
Vendor-specific, component-level HA: high reliance on vendor solutions to provide HA and resilience
Problems for Netflix:
Traditional data-center-oriented systems do not translate well to the cloud
Heavy use of open source
18. Common Controls to Promote Resilience
Standards and checklists: designed for repeatable execution
Problems for Netflix:
Not suitable for load-based scaling and heavy automation
Reliance on humans
20. What does the business value?
Customer experience
Innovation and agility
(Remember these guys?)
In other words:
Stability and availability for customer experience
Rapid development and change to continually improve product and outpace competition
Not that different from anyone else
21. Overall Approach
Understand and solve for relevant failure modes
Rely on automation and tools instead of committees for
evaluating architecture and changes
Make deployment easy and standardized
22. Cloud Application Failure Modes and Effects
Failure Mode       | Probability | Current Mitigation
App Failure        | High        | Automated fallback response
AWS Region Failure | Low         | Wait for recovery
AWS Zone Failure   | Medium      | Continue running in 2 of 3 zones
Datacenter Failure | Medium      | Continue migrating to cloud
Data Store Failure | Low         | Restore from S3
S3 Failure         | Low         | Restore from remote archive
Risk-based approach given likely failures
Tackle high-probability events first
24. Goals of Simian Army
“Each system has to be able to succeed, no matter what, even all on its own.
We're designing each distributed system to expect and tolerate failure from
other systems on which it depends.”
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
26. Chaos Monkey
“By frequently causing failures, we force our services to
be built in a way that is more resilient.”
Terminates cluster nodes during business hours
Rejects "If it ain't broke, don't fix it"
Goals:
Simulate random hardware failures, human error at small scale
Identify weaknesses
No service impact
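The selection logic behind such a tool can be sketched in a few lines. This is an illustrative Python sketch only, not Netflix's actual implementation; the real Chaos Monkey terminates the chosen instance via the AWS API, and the 9am-3pm business-hours window here is an assumption:

```python
import random
from datetime import datetime

def pick_victims(cluster_instances, now=None, seed=None):
    """Pick one instance per cluster to terminate, business hours only.

    cluster_instances: dict mapping cluster name -> list of instance IDs.
    Returns a dict mapping cluster name -> chosen victim (empty off-hours).
    """
    now = now or datetime.now()
    if not (9 <= now.hour < 15):  # assumed business-hours window
        return {}
    rng = random.Random(seed)
    return {cluster: rng.choice(nodes)
            for cluster, nodes in cluster_instances.items() if nodes}
```

Running this on a schedule against live clusters is what forces every service to survive the loss of any single node.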
27. Chaos Gorilla
Chaos Monkey's bigger brother
Standard deployment pattern is to distribute
load/systems/data across three data centers (AZs)
What happens if one is lost?
Goals:
Simulate data center loss, hardware/service failures at larger
scale
Identify weaknesses, dependencies, etc.
Minimal service impact
28. Latency Monkey
Distributed systems have many upstream/downstream
connections
How fault-tolerant are systems to dependency
failure/slowdown?
Goals:
Simulate latencies and error codes, see how a service responds
Survivable services regardless of dependencies
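The fault-injection idea can be sketched as an in-process wrapper around a dependency call; the real Latency Monkey injects faults at the service/RPC layer, and the names below are purely illustrative:

```python
import random
import time

def latency_monkey(delay_s=0.0, error_rate=0.0, seed=None):
    """Decorator that injects latency and errors into a dependency call."""
    rng = random.Random(seed)
    def wrap(fn):
        def wrapper(*args, **kwargs):
            time.sleep(delay_s)                    # simulated slowdown
            if rng.random() < error_rate:          # simulated dependency error
                raise RuntimeError("injected dependency failure")
            return fn(*args, **kwargs)
        return wrapper
    return wrap

# A dependency forced to fail every time (error_rate=1.0):
@latency_monkey(error_rate=1.0, seed=42)
def get_recommendations(user):
    return ["title-1", "title-2"]

def resilient_recommendations(user):
    """A survivable caller: degrade gracefully instead of propagating."""
    try:
        return get_recommendations(user)
    except RuntimeError:
        return []  # fallback response
```

A service passes this kind of test when its fallback path keeps it usable regardless of how its dependencies behave.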
29. Conformity Monkey
Without architecture review, how do you ensure designs
leverage known successful patterns?
Conformity Monkey provides automated analysis for
pattern adherence
Goals:
Evaluate deployment modes (data center distribution)
Evaluate health checks, discoverability, versions of key libraries
Help ensure service has best chance of successful operation
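Conformity checks of this kind reduce to rules over cluster metadata. A minimal sketch, with made-up field names (not Netflix's actual schema or rule set):

```python
def conformity_findings(cluster):
    """Flag deviations from known-good deployment patterns."""
    findings = []
    zones = {inst["zone"] for inst in cluster["instances"]}
    if len(zones) < 3:  # standard pattern: spread across three zones
        findings.append("instances not spread across three zones")
    if not cluster.get("healthcheck_url"):
        findings.append("no health check registered")
    for inst in cluster["instances"]:
        if inst.get("base_ami_age_days", 0) > 30:  # assumed staleness bound
            findings.append(f"{inst['id']}: stale base AMI")
    return findings
```

An empty findings list means the cluster matches the patterns; anything else gets surfaced to the owning team.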
30. Non-Simian Approaches
Org model
Engineers write, deploy, support code
Culture
De-centralized with as few processes and rules as possible
Lots of local autonomy
"If you're not failing, you're not trying hard enough"
Peer pressure
Productive and transparent incident reviews
36. A common graph @ Netflix
Weekend afternoon ramp-up
Lots of watching in prime time, not as much in early morning
Old way: pay and provision for peak, 24/7/365
Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
38. Autoscaling
Goals:
# of systems matches load requirements
Load per server is constant
Happens without intervention (the "auto" in autoscaling)
Results:
Clusters continuously add & remove nodes
New nodes must mirror existing
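The sizing goal ("load per server is constant") can be expressed as a one-line formula; a sketch under assumed parameter names, whereas real deployments use AWS Auto Scaling policies on metrics such as CPU or requests per second:

```python
import math

def desired_capacity(load_rps, target_rps_per_node, min_nodes=2, max_nodes=100):
    """Size a cluster so per-node load stays roughly constant.

    Clamped to [min_nodes, max_nodes] so scaling never drops below a
    safe floor or runs away past a cost ceiling.
    """
    needed = math.ceil(load_rps / target_rps_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

Evaluated continuously, this is what makes clusters add nodes through the prime-time peak and shed them in the early morning.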
39. Every change requires a new cluster push
(not an incremental change to existing systems)
41. Netflix Deployment Pipeline
Perforce/Git (code change, config change) ->
YUM (RPM with app-specific bits) ->
Bakery (base image + RPM baked into an AMI, a VM template ready to launch) ->
ASG (cluster config, running systems)
42. Operational Impact
No changes to running systems
No systems mgmt infrastructure (Puppet, Chef, etc.)
Fewer logins to prod
No snowflakes
Trivial “rollback”
43. Security Impact
Need to think differently on:
Vulnerability management
Patch management
User activity monitoring
File integrity monitoring
Forensic investigations
47. Points of Emphasis
Integrate, in two contexts:
1. Integration with your engineering ecosystem: organization; SCM, build and release; monitoring and alerting
2. Integration of your security controls
Make the right way easy
Self-service, with exceptions
Trust, but verify
48. Integration: Base AMI Testing
Base AMI – VM/instance template used for all cloud systems
Average instance age = ~24 days (one-time sample)
The base AMI is managed like other packages, via P4, Jenkins, etc.
We watch the SCM directory & kick off testing when it changes
Launch an instance of the AMI, perform vuln scan and other checks
SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
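The watch-and-scan flow reduces to: on a base-AMI change, launch an instance of the new image and scan it. A sketch of that control flow, where `launch` and `scan` stand in for the AWS launch call and the scanner's API (both hypothetical here):

```python
def on_ami_commit(prev_rev, new_rev, launch, scan):
    """Kick off a vulnerability scan when the base-AMI package changes.

    prev_rev/new_rev: SCM revisions of the base-AMI directory.
    launch: callable that boots an instance of the new image.
    scan: callable that runs the vuln scan against that instance.
    Returns the scan result, or None when nothing changed.
    """
    if new_rev == prev_rev:
        return None          # no change, nothing to test
    instance = launch(new_rev)
    return scan(instance)
```

Wiring this to an SCM trigger (e.g. a Jenkins job on the P4 directory) gives every base-image change an automatic security gate.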
49. Integration: Control Packaging and Installation
From the RPM spec file of a webserver:
Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
Pulls in the following RPMs:
HIDS agent
Config assessment/firewall agent
Host hardening package
WAF
50. Integration: Timeline (Chronos)
What IP addresses have been blacklisted by the WAF in
the last few weeks?
GET /api/v1/event?timelines=type:blacklist&start=20130125000000000
Which security groups have changed today?
GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
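Queries like these are easy to build programmatically. A sketch that constructs the same URLs as the slide, assuming (from the examples) that the timestamp format is yyyymmddHHMMSS plus milliseconds:

```python
from datetime import datetime

def chronos_query(event_type, start):
    """Build a Chronos timeline query path for events since `start`."""
    ts = start.strftime("%Y%m%d%H%M%S") + "000"  # append milliseconds
    return f"/api/v1/event?timelines=type:{event_type}&start={ts}"
```

For example, `chronos_query("blacklist", datetime(2013, 1, 25))` reproduces the first query on the slide.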
51. Points of Emphasis
Integrate
Make the right way easy (developers are lazy)
Self-service, with exceptions
Trust, but verify
52. Making it Easy: Cryptex
Crypto: DDIY ("Don't Do It Yourself")
Many uses of crypto in web/distributed systems:
Encrypt/decrypt (cookies, data, etc.)
Sign/verify (URLs, data, etc.)
Netflix also uses heavily for device activation, DRM
playback, etc.
53. Making it Easy: Cryptex
Multi-layer crypto system (HSM basis, scale out layer)
Easy to use
Key management handled transparently
Access control and auditable operations
54. Making it Easy: Cloud-Based SSO
In the AWS cloud, access to data center services is
problematic
Examples: AD, LDAP, DNS
But, many cloud-based systems require authN, authZ
Examples: Dashboards, admin UIs
Asking developers to securely handle/accept credentials
is also problematic
55. Making it Easy: Cloud-Based SSO
Solution: Leverage OneLogin SaaS SSO (SAML) used
by IT for enterprise apps (e.g. Workday, Google Apps)
Uses Active Directory credentials
Provides a single & centralized login page
Developers don't accept username & password directly
Built filter for our base server to make SSO/authN trivial
56. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions: self-service is perhaps the most transformative cloud characteristic; failing to adopt it for security controls will lead to friction
Trust, but verify
57. Self-Service: Security Groups
Asgard cloud orchestration tool allows developers to
configure their own firewall rules
Limited to same AWS account, no IP-based rules
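The guardrails on that self-service path amount to a validation step before any rule is applied. A sketch with illustrative field names (not Asgard's actual data model):

```python
def validate_rule(rule, account_id):
    """Accept a self-service firewall rule only within the guardrails:
    source must be a security group in the same AWS account, and
    raw IP/CIDR-based rules are rejected outright."""
    if "cidr" in rule:
        return False                      # no IP-based rules allowed
    return rule.get("source_account") == account_id
```

Anything the validator rejects falls back to the exception path, where the security team reviews it by hand.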
58. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify: culture precludes a traditional "command and control" approach; organizational desire for agile, DevOps, and CI/CD blurs traditional security engagement touchpoints
59. Trust but Verify: Security Monkey
Cloud APIs make verification and analysis of configuration and running state simpler
Security Monkey was created as the framework for this analysis
Includes:
Certificate checking
Firewall analysis
IAM entity analysis
Limit warnings
Resource policy analysis
60. Trust but Verify: Security Monkey
From: Security Monkey
Date: Wed, 24 Oct 2012 17:08:18 +0000
To: Security Alerts
Subject: prod Changes Detected
Table of Contents:
Security Groups
Changed Security Group
<sgname> (eu-west-1 / prod)
<#Security Group/<sgname> (eu-west-1 / prod)>
61. Trust but Verify: Exploit Monkey
AWS Autoscaling group is unit of deployment, so
changes signal a good time to rerun dynamic scans
On 10/23/12 12:35 PM, Exploit Monkey wrote:
I noticed that testapp-live has changed current ASG name from testapp-
live-v001 to testapp-live-v002.
I'm starting a vulnerability scan against test app from these
private/public IPs:
10.29.24.174
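The trigger in that email is just a version bump in the ASG name. A sketch of the detection, assuming the `app-vNNN` naming pattern shown above:

```python
import re

def should_rescan(old_asg, new_asg):
    """True when a cluster's ASG changed version (new deployment),
    which is Exploit Monkey's cue to rerun a dynamic scan."""
    base = lambda name: re.sub(r"-v\d+$", "", name)
    # Same app, different ASG name => a fresh push went out.
    return old_asg != new_asg and base(old_asg) == base(new_asg)
```

Tying scans to deployments this way means every new push gets rescanned without anyone having to schedule it.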
62. Takeaways
Netflix runs a large, dynamic service in AWS
Newer concepts like cloud & DevOps need an
updated approach to resilience and security
Specific context can help jumpstart a pragmatic
and effective security program