Resilience and Compliance at Speed and Scale
Presentation at the Silicon Valley ISACA Spring Conference (April 2014).

Resilience and Compliance at Speed and Scale: Presentation Transcript

  • Resilience and Compliance at Speed and Scale ISACA SV Spring Conference Jason Chan chan@netflix.com linkedin.com/in/jasonbchan @chanjbs
  • About Me  Engineering Director @ Netflix:  Security: product, app, ops, IR, fraud/abuse  Previously:  Led infosec team @ VMware  Consultant - @stake, iSEC Partners
  • About Netflix
  • Common Approaches to Resilience
  • Common Controls to Promote Resilience  Architectural committees  Change approval boards  Centralized deployments  Vendor-specific, component-level HA  Standards and checklists  Designed to standardize on design patterns, vendors, etc.  Problems for Netflix:  Freedom and Responsibility Culture  Highly aligned and loosely coupled  Innovation cycles
  • Common Controls to Promote Resilience  Architectural committees  Change approval boards  Centralized deployments  Vendor-specific, component-level HA  Standards and checklists  Designed to control and de-risk change  Focus on artifacts, test and rollback plans  Problems for Netflix:  Freedom and Responsibility Culture  Highly aligned and loosely coupled  Innovation cycles
  • Common Controls to Promote Resilience  Architectural committees  Change approval boards  Centralized deployments  Vendor-specific, component-level HA  Standards and checklists  Separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly)  Problems for Netflix:  Freedom and Responsibility Culture  Highly aligned and loosely coupled  Innovation cycles
  • Common Controls to Promote Resilience  Architectural committees  Change approval boards  Centralized deployments  Vendor-specific, component-level HA  Standards and checklists  High reliance on vendor solutions to provide HA and resilience  Problems for Netflix:  Traditional data center oriented systems do not translate well to the cloud  Heavy use of open source
  • Common Controls to Promote Resilience  Architectural committees  Change approval boards  Centralized deployments  Vendor-specific, component-level HA  Standards and checklists  Designed for repeatable execution  Problems for Netflix:  Reliance on humans
  • Approaches to Resilience @ Netflix
  • What does the business value?  Customer experience  Innovation and agility  In other words:  Stability and availability for customer experience  Rapid development and change to continually improve product and outpace competition  Not that different from anyone else
  • Overall Approach  Understand and solve for relevant failure modes  Rely on automation and tools, not humans or committees  Make no assumptions that planned controls will work  Provide train tracks and guardrails, but invite deviation
  • Goals of Simian Army “Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.” http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
  • Systems fail
  • Chaos Monkey  “By frequently causing failures, we force our services to be built in a way that is more resilient.”  Terminates cluster nodes during business hours  Rejects “If it ain’t broke, don’t fix it”  Goals:  Simulate random hardware failures, human error at small scale  Identify weaknesses  No service impact
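
A rough sketch of the mechanism follows; it is illustrative only, not the real Chaos Monkey (which is part of the open-source Simian Army at netflix.github.com). The "app" tag and the boto3-based AWS calls are assumptions for the example.

```python
# Chaos Monkey-style termination, sketched for illustration only.
# Assumes AWS credentials are configured and instances carry an "app" tag.
import random
from datetime import datetime

import boto3


def terminate_random_instance(app_name: str, region: str = "us-east-1") -> None:
    now = datetime.now()
    # Run only on weekdays during business hours, so engineers are on hand to respond.
    if now.weekday() >= 5 or not (9 <= now.hour < 15):
        print("Outside business hours; doing nothing.")
        return

    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:app", "Values": [app_name]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instance_ids:
        return

    victim = random.choice(instance_ids)
    print(f"Terminating {victim} from {app_name}")
    ec2.terminate_instances(InstanceIds=[victim])
```
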
  • Lots of systems fail
  • Chaos Gorilla  Chaos Monkey’s bigger brother  Standard deployment pattern is to distribute load/systems/data across three data centers (AZs)  What happens if one is lost?  Goals:  Simulate data center loss, hardware/service failures at larger scale  Identify weaknesses, dependencies, etc.  Minimal service impact
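
The question Chaos Gorilla asks can be pictured as a capacity check: given how a cluster's instances are spread across zones, can it lose any single zone and still serve traffic? A minimal illustration (the zone names and capacity numbers are hypothetical; this is not the actual Chaos Gorilla code):

```python
# Illustrative check of the three-AZ deployment pattern Chaos Gorilla exercises.
from collections import Counter


def survives_zone_loss(instance_zones: list[str], min_healthy: int) -> bool:
    """True if losing any one availability zone still leaves min_healthy nodes."""
    per_zone = Counter(instance_zones)
    total = sum(per_zone.values())
    return all(total - count >= min_healthy for count in per_zone.values())


# Example: 9 nodes spread evenly across three zones, 6 needed to serve traffic.
zones = ["us-east-1a"] * 3 + ["us-east-1b"] * 3 + ["us-east-1c"] * 3
print(survives_zone_loss(zones, min_healthy=6))                       # True
print(survives_zone_loss(zones + ["us-east-1a"] * 3, min_healthy=9))  # False
```
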
  • What about larger catastrophes?
  • Chaos Kong  Simulate an entire region (US west coast, US east coast) failing  For example – hurricane, large winter storm, earthquake, etc.  Goals:  Exercise end-to-end large-scale failover (routing, DNS, scaling up)
  • The sick and wounded
  • Latency Monkey  Distributed systems have many upstream/downstream connections  How fault-tolerant are systems to dependency failure/slowdown?  Goals:  Simulate latencies and error codes, see how a service responds  Survivable services regardless of dependencies
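
One way to picture Latency Monkey's fault injection is a wrapper around dependency calls that sometimes adds delay or returns an error code, so you can observe how the calling service copes. A minimal, self-contained sketch (the rates, delay, and the fetch_recommendations stand-in are hypothetical):

```python
# Sketch of Latency Monkey-style fault injection (illustrative, not Netflix code).
import random
import time
from typing import Callable


def with_injected_faults(call: Callable[[], dict],
                         delay_s: float = 2.0,
                         delay_rate: float = 0.2,
                         error_rate: float = 0.1) -> dict:
    """Invoke a dependency call, sometimes slowing it down or failing it."""
    roll = random.random()
    if roll < error_rate:
        # Simulate the dependency returning a server error.
        return {"status": 503, "body": None}
    if roll < error_rate + delay_rate:
        time.sleep(delay_s)  # simulate a slow dependency
    return call()


# Hypothetical dependency call; a real service would make a network request.
def fetch_recommendations() -> dict:
    return {"status": 200, "body": ["title-1", "title-2"]}


print(with_injected_faults(fetch_recommendations))
```
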
  • Outliers and rebels
  • Conformity Monkey  Without architecture review, how do you ensure designs leverage known successful patterns?  Conformity Monkey provides automated analysis for pattern adherence  Goals:  Evaluate deployment modes (data center distribution)  Evaluate health checks, discoverability, versions of key libraries  Help ensure service has best chance of successful operation
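
A conformity check can be as simple as running each cluster's metadata through a set of rules and reporting violations. The sketch below is illustrative; the rule set and the shape of the cluster record are assumptions, not Netflix's actual checks:

```python
# Conformity Monkey-style rule checks, sketched for illustration.
def check_cluster(cluster: dict) -> list[str]:
    """Return a list of conformity problems for one cluster record."""
    problems = []
    if len(set(cluster.get("availability_zones", []))) < 3:
        problems.append("not spread across three availability zones")
    if not cluster.get("health_check_url"):
        problems.append("no health check registered")
    if not cluster.get("registered_in_discovery", False):
        problems.append("not registered in service discovery")
    if cluster.get("platform_library_version", 0) < 2:
        problems.append("outdated platform library")
    return problems


cluster = {
    "name": "api-prod-v042",
    "availability_zones": ["us-east-1a", "us-east-1b"],
    "health_check_url": "http://example.internal/healthcheck",
    "registered_in_discovery": True,
    "platform_library_version": 1,
}
for problem in check_cluster(cluster):
    print(f"{cluster['name']}: {problem}")
```
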
  • Cruft, junk, and clutter
  • Janitor Monkey  Clutter accumulates, in the form of:  Complexity  Vulnerabilities  Cost  Janitor identifies unused resources and reaps them to save money and reduce exposure  Goals:  Automated hygiene  More freedom for engineers to innovate and move fast
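
As an illustration of the reaping idea, the sketch below lists unattached EBS volumes older than a threshold as cleanup candidates. It assumes boto3 and configured AWS credentials; a real janitor would notify owners and honor a grace period before deleting anything.

```python
# Janitor Monkey-style cleanup sketch (illustrative only): find EBS volumes
# that are unattached and older than a threshold, as candidates for reaping.
from datetime import datetime, timedelta, timezone

import boto3


def unused_volume_candidates(region: str = "us-east-1",
                             min_age_days: int = 14) -> list[str]:
    ec2 = boto3.client("ec2", region_name=region)
    cutoff = datetime.now(timezone.utc) - timedelta(days=min_age_days)
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]  # not attached
    )["Volumes"]
    return [v["VolumeId"] for v in volumes if v["CreateTime"] < cutoff]


for volume_id in unused_volume_candidates():
    print(f"Janitor candidate: {volume_id}")
```
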
  • Non-Simian Approaches  Org model  Engineers write, deploy, support code  Culture  De-centralized with as few processes and rules as possible  Lots of local autonomy  “If you’re not failing, you’re not trying hard enough”  Peer pressure  Productive and transparent incident reviews
  • Software Deployment for Compliance-Sensitive Apps
  • Control Objectives for Software Deployments Visibility and transparency  Who did what, when?  What was the scope of the change or deployment?  Was it reviewed?  Was it tested?  Was it approved? Typically attempted via:  Restricted access/SoD  CMDBs  Change management processes  Test results  Change windows
  • Large and Dynamic Systems Need a Different Approach  No operations organization  No acceptable windows for downtime  Thousands of deployments and changes per day
  • Control Objectives Haven’t Changed Visibility and transparency  Who did what, when?  What was the scope of the change or deployment?  Was it reviewed?  Was it tested?  Was it approved?
  • System-wide view on changes
  • Access to changes by app, region, environment, etc.  Look back in time as needed
  • Changes, via email
  • When? By who? What changed?
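
The kind of record such tooling captures automatically on every deployment might look like the sketch below; the field names are illustrative, not Netflix's actual schema. Publishing it to a searchable store (and to email or chat) answers "who did what, when" without a manual change ticket.

```python
# Illustrative change-event record emitted automatically on deployment.
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
import json


@dataclass
class ChangeEvent:
    app: str            # which application changed
    region: str         # where the change was deployed
    environment: str    # e.g. test or prod
    who: str            # the engineer who pushed the change
    what: str           # cluster or AMI that was deployed
    source_commit: str  # SCM revision behind the deployment
    when: str           # timestamp, recorded automatically


event = ChangeEvent(
    app="api",
    region="us-east-1",
    environment="prod",
    who="jdoe",
    what="api-prod-v042 / ami-0abc1234",
    source_commit="9f2c7e1",
    when=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(event), indent=2))
```
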
  • Integrated awareness
  • Chat integration lets engineers easily access info
  • Automated testing
  • 1000+ tests to compare proposed vs. existing  Automated scoring and deployment decision
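
In spirit, the automated scoring compares metrics from the proposed deployment against the existing one and gates the rollout on the result. A toy version follows; the metrics, tolerance, and pass threshold are assumptions, not Netflix's actual criteria:

```python
# Sketch of automated canary scoring (illustrative only).
def canary_score(baseline: dict[str, float],
                 canary: dict[str, float],
                 tolerance: float = 0.10) -> float:
    """Fraction of metrics where the canary is within tolerance of the baseline."""
    passed = sum(
        1 for metric, base in baseline.items()
        if base == 0
        or abs(canary.get(metric, float("inf")) - base) / base <= tolerance
    )
    return passed / len(baseline)


baseline = {"error_rate": 0.010, "latency_p99_ms": 180.0, "cpu_util": 0.55}
canary = {"error_rate": 0.011, "latency_p99_ms": 230.0, "cpu_util": 0.56}

score = canary_score(baseline, canary)
print(f"canary score: {score:.2f}")
# A simple automated decision: deploy only if enough checks pass.
print("deploy" if score >= 0.95 else "hold")
```
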
  • Complete view of deployment lifecycle
  • Jenkins (CI) job App name Currently running clusters by region/environment
  • Cluster ID Deployment details AMI version SCM commit
  • Modified files Source diffs Link to relevant JIRA(s)
  • Takeaway  Control objectives have not changed, but advantages of new technologies and operational models dictate updated approaches
  • Netflix References  http://netflix.github.com  http://techblog.netflix.com  http://slideshare.net/netflix
  • Questions? chan@netflix.com