2. Netflix, Inc.
“Netflix is the world’s leading Internet television
network with more than 33 million members in
40 countries enjoying more than one billion
hours of TV shows and movies per month,
including original series . . .”
Source: http://ir.netflix.com
3. Me
Director of Engineering @ Netflix
Responsible for:
Cloud app, product, infrastructure, ops security
Previously:
Led security team @ VMware
Earlier, primarily security consulting at @stake, iSEC Partners
11. On the way to the cloud . . . (organization)
(or NoOps, depending on definitions)
12. Some As-Is #s
33m+ subscribers
10,000s of systems
100s of engineers, apps
~250 test deployments/day **
~70 production deployments/day **
** Sample based on one week's activities
14. Common Controls to Promote Resilience
Common controls: architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists
Architectural committees: designed to standardize on design patterns, vendors, etc.
Problems for Netflix:
Freedom and Responsibility culture
Highly aligned and loosely coupled
Innovation cycles
15. Common Controls to Promote Resilience
Change approval boards: designed to control and de-risk change; focus on artifacts, test and rollback plans
Problems for Netflix:
Freedom and Responsibility culture
Highly aligned and loosely coupled
Innovation cycles
16. Common Controls to Promote Resilience
Centralized deployments: a separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly)
Problems for Netflix:
Freedom and Responsibility culture
Highly aligned and loosely coupled
Innovation cycles
17. Common Controls to Promote Resilience
Vendor-specific, component-level HA: high reliance on vendor solutions to provide HA and resilience
Problems for Netflix:
Traditional data-center-oriented systems do not translate well to the cloud
Heavy use of open source
18. Common Controls to Promote Resilience
Standards and checklists: designed for repeatable execution
Problems for Netflix:
Not suitable for load-based scaling and heavy automation
Reliance on humans
20. What does the business value?
Customer experience
Innovation and agility
(Remember these guys?)
In other words:
Stability and availability for customer experience
Rapid development and change to continually improve product and outpace competition
Not that different from anyone else
21. Overall Approach
Understand and solve for relevant failure modes
Rely on automation and tools instead of committees for
evaluating architecture and changes
Make deployment easy and standardized
22. Cloud Application Failure Modes and Effects
Failure Mode       | Probability | Current Mitigation
App Failure        | High        | Automated fallback response
AWS Region Failure | Low         | Wait for recovery
AWS Zone Failure   | Medium      | Continue running in 2 of 3 zones
Datacenter Failure | Medium      | Continue migrating to cloud
Data Store Failure | Low         | Restore from S3
S3 Failure         | Low         | Restore from remote archive
Risk-based approach given likely failures
Tackle high-probability events first
24. Goals of Simian Army
“Each system has to be able to succeed, no matter what, even all on its own.
We're designing each distributed system to expect and tolerate failure from
other systems on which it depends.”
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
26. Chaos Monkey
“By frequently causing failures, we force our services to
be built in a way that is more resilient.”
Terminates cluster nodes during business hours
Rejects "If it ain't broke, don't fix it"
Goals:
Simulate random hardware failures, human error at small scale
Identify weaknesses
No service impact
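The selection logic behind such a tool can be sketched in a few lines. This is an illustrative Python sketch only, not Netflix's actual implementation; the real Chaos Monkey terminates the chosen instance via the AWS API, and the 9am-3pm business-hours window here is an assumption:

```python
import random
from datetime import datetime

def pick_victims(cluster_instances, now=None, seed=None):
    """Pick one instance per cluster to terminate, business hours only.

    cluster_instances: dict mapping cluster name -> list of instance IDs.
    Returns a dict mapping cluster name -> chosen victim (empty off-hours).
    """
    now = now or datetime.now()
    if not (9 <= now.hour < 15):  # assumed business-hours window
        return {}
    rng = random.Random(seed)
    return {cluster: rng.choice(nodes)
            for cluster, nodes in cluster_instances.items() if nodes}
```

Running this on a schedule against live clusters is what forces every service to survive the loss of any single node.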
27. Chaos Gorilla
Chaos Monkey's bigger brother
Standard deployment pattern is to distribute
load/systems/data across three data centers (AZs)
What happens if one is lost?
Goals:
Simulate data center loss, hardware/service failures at larger
scale
Identify weaknesses, dependencies, etc.
Minimal service impact
28. Latency Monkey
Distributed systems have many upstream/downstream
connections
How fault-tolerant are systems to dependency
failure/slowdown?
Goals:
Simulate latencies and error codes, see how a service responds
Survivable services regardless of dependencies
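The fault-injection idea can be sketched as an in-process wrapper around a dependency call; the real Latency Monkey injects faults at the service/RPC layer, and the names below are purely illustrative:

```python
import random
import time

def latency_monkey(delay_s=0.0, error_rate=0.0, seed=None):
    """Decorator that injects latency and errors into a dependency call."""
    rng = random.Random(seed)
    def wrap(fn):
        def wrapper(*args, **kwargs):
            time.sleep(delay_s)                    # simulated slowdown
            if rng.random() < error_rate:          # simulated dependency error
                raise RuntimeError("injected dependency failure")
            return fn(*args, **kwargs)
        return wrapper
    return wrap

# A dependency forced to fail every time (error_rate=1.0):
@latency_monkey(error_rate=1.0, seed=42)
def get_recommendations(user):
    return ["title-1", "title-2"]

def resilient_recommendations(user):
    """A survivable caller: degrade gracefully instead of propagating."""
    try:
        return get_recommendations(user)
    except RuntimeError:
        return []  # fallback response
```

A service passes this kind of test when its fallback path keeps it usable regardless of how its dependencies behave.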
29. Conformity Monkey
Without architecture review, how do you ensure designs
leverage known successful patterns?
Conformity Monkey provides automated analysis for
pattern adherence
Goals:
Evaluate deployment modes (data center distribution)
Evaluate health checks, discoverability, versions of key libraries
Help ensure service has best chance of successful operation
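Conformity checks of this kind reduce to rules over cluster metadata. A minimal sketch, with made-up field names (not Netflix's actual schema or rule set):

```python
def conformity_findings(cluster):
    """Flag deviations from known-good deployment patterns."""
    findings = []
    zones = {inst["zone"] for inst in cluster["instances"]}
    if len(zones) < 3:  # standard pattern: spread across three zones
        findings.append("instances not spread across three zones")
    if not cluster.get("healthcheck_url"):
        findings.append("no health check registered")
    for inst in cluster["instances"]:
        if inst.get("base_ami_age_days", 0) > 30:  # assumed staleness bound
            findings.append(f"{inst['id']}: stale base AMI")
    return findings
```

An empty findings list means the cluster matches the patterns; anything else gets surfaced to the owning team.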
30. Non-Simian Approaches
Org model
Engineers write, deploy, support code
Culture
De-centralized with as few processes and rules as possible
Lots of local autonomy
"If you're not failing, you're not trying hard enough"
Peer pressure
Productive and transparent incident reviews
36. A common graph @ Netflix
Weekend afternoon ramp-up
Lots of watching in prime time, not as much in early morning
Old way: pay and provision for peak, 24/7/365
Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
38. Autoscaling
Goals:
# of systems matches load requirements
Load per server is constant
Happens without intervention (the "auto" in autoscaling)
Results:
Clusters continuously add & remove nodes
New nodes must mirror existing
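The sizing goal ("load per server is constant") can be expressed as a one-line formula; a sketch under assumed parameter names, whereas real deployments use AWS Auto Scaling policies on metrics such as CPU or requests per second:

```python
import math

def desired_capacity(load_rps, target_rps_per_node, min_nodes=2, max_nodes=100):
    """Size a cluster so per-node load stays roughly constant.

    Clamped to [min_nodes, max_nodes] so scaling never drops below a
    safe floor or runs away past a cost ceiling.
    """
    needed = math.ceil(load_rps / target_rps_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

Evaluated continuously, this is what makes clusters add nodes through the prime-time peak and shed them in the early morning.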
39. Every change requires a new cluster push
(not an incremental change to existing systems)
41. Netflix Deployment Pipeline
Perforce/Git (code change, config change) ->
YUM (RPM with app-specific bits) ->
Bakery (base image + RPM baked into an AMI, a VM template ready to launch) ->
ASG (cluster config, running systems)
42. Operational Impact
No changes to running systems
No systems mgmt infrastructure (Puppet, Chef, etc.)
Fewer logins to prod
No snowflakes
Trivial “rollback”
43. Security Impact
Need to think differently on:
Vulnerability management
Patch management
User activity monitoring
File integrity monitoring
Forensic investigations
47. Points of Emphasis
Integrate, in two contexts:
1. Integration with your engineering ecosystem: organization; SCM, build and release; monitoring and alerting
2. Integration of your security controls
Make the right way easy
Self-service, with exceptions
Trust, but verify
48. Integration: Base AMI Testing
Base AMI – VM/instance template used for all cloud systems
Average instance age = ~24 days (one-time sample)
The base AMI is managed like other packages, via P4, Jenkins, etc.
We watch the SCM directory & kick off testing when it changes
Launch an instance of the AMI, perform vuln scan and other checks
SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
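The watch-and-scan flow reduces to: on a base-AMI change, launch an instance of the new image and scan it. A sketch of that control flow, where `launch` and `scan` stand in for the AWS launch call and the scanner's API (both hypothetical here):

```python
def on_ami_commit(prev_rev, new_rev, launch, scan):
    """Kick off a vulnerability scan when the base-AMI package changes.

    prev_rev/new_rev: SCM revisions of the base-AMI directory.
    launch: callable that boots an instance of the new image.
    scan: callable that runs the vuln scan against that instance.
    Returns the scan result, or None when nothing changed.
    """
    if new_rev == prev_rev:
        return None          # no change, nothing to test
    instance = launch(new_rev)
    return scan(instance)
```

Wiring this to an SCM trigger (e.g. a Jenkins job on the P4 directory) gives every base-image change an automatic security gate.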
49. Integration: Control Packaging and Installation
From the RPM spec file of a webserver:
Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
Pulls in the following RPMs:
HIDS agent
Config assessment/firewall agent
Host hardening package
WAF
50. Integration: Timeline (Chronos)
What IP addresses have been blacklisted by the WAF in
the last few weeks?
GET /api/v1/event?timelines=type:blacklist&start=20130125000000000
Which security groups have changed today?
GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
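Queries like these are easy to build programmatically. A sketch that constructs the same URLs as the slide, assuming (from the examples) that the timestamp format is yyyymmddHHMMSS plus milliseconds:

```python
from datetime import datetime

def chronos_query(event_type, start):
    """Build a Chronos timeline query path for events since `start`."""
    ts = start.strftime("%Y%m%d%H%M%S") + "000"  # append milliseconds
    return f"/api/v1/event?timelines=type:{event_type}&start={ts}"
```

For example, `chronos_query("blacklist", datetime(2013, 1, 25))` reproduces the first query on the slide.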
51. Points of Emphasis
Integrate
Make the right way easy (developers are lazy)
Self-service, with exceptions
Trust, but verify
52. Making it Easy: Cryptex
Crypto: DDIY ("Don't Do It Yourself")
Many uses of crypto in web/distributed systems:
Encrypt/decrypt (cookies, data, etc.)
Sign/verify (URLs, data, etc.)
Netflix also uses heavily for device activation, DRM
playback, etc.
53. Making it Easy: Cryptex
Multi-layer crypto system (HSM basis, scale out layer)
Easy to use
Key management handled transparently
Access control and auditable operations
54. Making it Easy: Cloud-Based SSO
In the AWS cloud, access to data center services is
problematic
Examples: AD, LDAP, DNS
But, many cloud-based systems require authN, authZ
Examples: Dashboards, admin UIs
Asking developers to securely handle/accept credentials
is also problematic
55. Making it Easy: Cloud-Based SSO
Solution: Leverage OneLogin SaaS SSO (SAML) used
by IT for enterprise apps (e.g. Workday, Google Apps)
Uses Active Directory credentials
Provides a single & centralized login page
Developers don't accept username & password directly
Built filter for our base server to make SSO/authN trivial
56. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions: self-service is perhaps the most transformative cloud characteristic; failing to adopt it for security controls will lead to friction
Trust, but verify
57. Self-Service: Security Groups
Asgard cloud orchestration tool allows developers to
configure their own firewall rules
Limited to same AWS account, no IP-based rules
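The guardrails on that self-service path amount to a validation step before any rule is applied. A sketch with illustrative field names (not Asgard's actual data model):

```python
def validate_rule(rule, account_id):
    """Accept a self-service firewall rule only within the guardrails:
    source must be a security group in the same AWS account, and
    raw IP/CIDR-based rules are rejected outright."""
    if "cidr" in rule:
        return False                      # no IP-based rules allowed
    return rule.get("source_account") == account_id
```

Anything the validator rejects falls back to the exception path, where the security team reviews it by hand.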
58. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify: culture precludes a traditional "command and control" approach; organizational desire for agile, DevOps, and CI/CD blurs traditional security engagement touchpoints
59. Trust but Verify: Security Monkey
Cloud APIs make verification and analysis of configuration and running state simpler
Security Monkey was created as the framework for this analysis
Includes:
Certificate checking
Firewall analysis
IAM entity analysis
Limit warnings
Resource policy analysis
60. Trust but Verify: Security Monkey
From: Security Monkey
Date: Wed, 24 Oct 2012 17:08:18 +0000
To: Security Alerts
Subject: prod Changes Detected
Table of Contents:
Security Groups
Changed Security Group
<sgname> (eu-west-1 / prod)
<#Security Group/<sgname> (eu-west-1 / prod)>
61. Trust but Verify: Exploit Monkey
AWS Autoscaling group is unit of deployment, so
changes signal a good time to rerun dynamic scans
On 10/23/12 12:35 PM, Exploit Monkey wrote:
I noticed that testapp-live has changed current ASG name from testapp-
live-v001 to testapp-live-v002.
I'm starting a vulnerability scan against test app from these
private/public IPs:
10.29.24.174
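The trigger in that email is just a version bump in the ASG name. A sketch of the detection, assuming the `app-vNNN` naming pattern shown above:

```python
import re

def should_rescan(old_asg, new_asg):
    """True when a cluster's ASG changed version (new deployment),
    which is Exploit Monkey's cue to rerun a dynamic scan."""
    base = lambda name: re.sub(r"-v\d+$", "", name)
    # Same app, different ASG name => a fresh push went out.
    return old_asg != new_asg and base(old_asg) == base(new_asg)
```

Tying scans to deployments this way means every new push gets rescanned without anyone having to schedule it.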
62. Takeaways
Netflix runs a large, dynamic service in AWS
Newer concepts like cloud & DevOps need an
updated approach to resilience and security
Specific context can help jumpstart a pragmatic
and effective security program