Resilience and Compliance at Speed and Scale

Presentation at the Silicon Valley ISACA Spring Conference (April 2014).

  1. Resilience and Compliance at Speed and Scale
     ISACA SV Spring Conference
     Jason Chan, chan@netflix.com, linkedin.com/in/jasonbchan, @chanjbs
  2. About Me
     • Engineering Director @ Netflix:
       • Security: product, app, ops, IR, fraud/abuse
     • Previously:
       • Led infosec team @ VMware
       • Consultant: @stake, iSEC Partners
  3. About Netflix
  4. Common Approaches to Resilience
  5. Common Controls to Promote Resilience
     • Architectural committees
     • Change approval boards
     • Centralized deployments
     • Vendor-specific, component-level HA
     • Standards and checklists
     • Designed to standardize on design patterns, vendors, etc.
     • Problems for Netflix:
       • Freedom and Responsibility Culture
       • Highly aligned and loosely coupled
       • Innovation cycles
  6. Common Controls to Promote Resilience
     • Architectural committees
     • Change approval boards
     • Centralized deployments
     • Vendor-specific, component-level HA
     • Standards and checklists
     • Designed to control and de-risk change
     • Focus on artifacts, test and rollback plans
     • Problems for Netflix:
       • Freedom and Responsibility Culture
       • Highly aligned and loosely coupled
       • Innovation cycles
  7. Common Controls to Promote Resilience
     • Architectural committees
     • Change approval boards
     • Centralized deployments
     • Vendor-specific, component-level HA
     • Standards and checklists
     • Separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly)
     • Problems for Netflix:
       • Freedom and Responsibility Culture
       • Highly aligned and loosely coupled
       • Innovation cycles
  8. Common Controls to Promote Resilience
     • Architectural committees
     • Change approval boards
     • Centralized deployments
     • Vendor-specific, component-level HA
     • Standards and checklists
     • High reliance on vendor solutions to provide HA and resilience
     • Problems for Netflix:
       • Traditional data center oriented systems do not translate well to the cloud
       • Heavy use of open source
  9. Common Controls to Promote Resilience
     • Architectural committees
     • Change approval boards
     • Centralized deployments
     • Vendor-specific, component-level HA
     • Standards and checklists
     • Designed for repeatable execution
     • Problems for Netflix:
       • Reliance on humans
  10. Approaches to Resilience @ Netflix
  11. What does the business value?
      • Customer experience
      • Innovation and agility
      • In other words:
        • Stability and availability for customer experience
        • Rapid development and change to continually improve product and outpace competition
      • Not that different from anyone else
  12. Overall Approach
      • Understand and solve for relevant failure modes
      • Rely on automation and tools, not humans or committees
      • Make no assumptions that planned controls will work
      • Provide train tracks and guardrails, but invite deviation
  13. Goals of Simian Army
      “Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.”
      http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
  14. Systems fail
  15. Chaos Monkey
      • “By frequently causing failures, we force our services to be built in a way that is more resilient.”
      • Terminates cluster nodes during business hours
      • Rejects “If it ain’t broke, don’t fix it”
      • Goals:
        • Simulate random hardware failures, human error at small scale
        • Identify weaknesses
        • No service impact
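
A minimal sketch of the termination logic at the heart of this approach, assuming boto3/EC2 credentials and a hypothetical "cluster" tag on instances; the real Chaos Monkey (open-sourced in Netflix's Simian Army, written in Java) adds per-cluster opt-in/opt-out configuration, termination probabilities, and business-hours scheduling.

    # Hypothetical Chaos Monkey-style sketch: pick one running instance per
    # cluster and terminate it. Assumes boto3 credentials and a "cluster" tag;
    # the real tool adds opt-in/opt-out config, probabilities, and scheduling.
    import random

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    def running_instances_by_cluster():
        """Group running instance IDs by their (assumed) 'cluster' tag."""
        clusters = {}
        paginator = ec2.get_paginator("describe_instances")
        pages = paginator.paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                    if "cluster" in tags:
                        clusters.setdefault(tags["cluster"], []).append(
                            instance["InstanceId"]
                        )
        return clusters

    def unleash_chaos(dry_run=True):
        """Terminate one randomly chosen instance in each cluster."""
        for cluster, instance_ids in running_instances_by_cluster().items():
            victim = random.choice(instance_ids)
            print(f"{cluster}: terminating {victim}")
            if not dry_run:
                ec2.terminate_instances(InstanceIds=[victim])

    if __name__ == "__main__":
        unleash_chaos(dry_run=True)
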
  16. Lots of systems fail
  17. Chaos Gorilla
      • Chaos Monkey’s bigger brother
      • Standard deployment pattern is to distribute load/systems/data across three data centers (AZs)
      • What happens if one is lost?
      • Goals:
        • Simulate data center loss, hardware/service failures at larger scale
        • Identify weaknesses, dependencies, etc.
        • Minimal service impact
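
A hedged sketch of the same idea one level up, under the same boto3 assumptions as above: pick one Availability Zone at random and terminate everything running in it, so the remaining zones must absorb the load.

    # Hypothetical Chaos Gorilla-style sketch: simulate losing an entire AZ by
    # terminating everything running in one randomly chosen zone.
    import random

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    def simulate_az_loss(dry_run=True):
        zones = ec2.describe_availability_zones()["AvailabilityZones"]
        doomed_az = random.choice([z["ZoneName"] for z in zones])
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "availability-zone", "Values": [doomed_az]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        victims = [
            i["InstanceId"]
            for r in resp["Reservations"]
            for i in r["Instances"]
        ]
        print(f"Simulating loss of {doomed_az}: {len(victims)} instances affected")
        if victims and not dry_run:
            ec2.terminate_instances(InstanceIds=victims)
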
  18. What about larger catastrophes?
  19. Chaos Kong
      • Simulate an entire region (US west coast, US east coast) failing
      • For example: hurricane, large winter storm, earthquake, etc.
      • Goals:
        • Exercise end-to-end large-scale failover (routing, DNS, scaling up)
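
Region evacuation is largely a traffic-steering exercise. A sketch of just the DNS piece, assuming weighted records in Route 53 with placeholder zone ID and record names; a real exercise also scales up the surviving region and shifts internal traffic.

    # Hypothetical sketch of the DNS piece of a region failover: drop the
    # weight of the "failed" region to 0 so client traffic shifts to the
    # survivor. The hosted zone ID and record names are placeholders.
    import boto3

    route53 = boto3.client("route53")

    def set_region_weight(zone_id, record_name, region_id, target, weight):
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={
                "Comment": f"Chaos Kong exercise: {region_id} weight -> {weight}",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": record_name,
                        "Type": "CNAME",
                        "SetIdentifier": region_id,
                        "Weight": weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": target}],
                    },
                }],
            },
        )

    # Example (placeholders): steer api.example.com away from us-east-1.
    set_region_weight("Z_EXAMPLE", "api.example.com.", "us-east-1",
                      "api-us-east-1.example.com", 0)
    set_region_weight("Z_EXAMPLE", "api.example.com.", "us-west-2",
                      "api-us-west-2.example.com", 100)
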
  20. The sick and wounded
  21. Latency Monkey
      • Distributed systems have many upstream/downstream connections
      • How fault-tolerant are systems to dependency failure/slowdown?
      • Goals:
        • Simulate latencies and error codes, see how a service responds
        • Survivable services regardless of dependencies
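
A dependency-free sketch of the kind of fault injection involved, written here as a client-side wrapper rather than the server-side injection the real tool performs; the names with_chaos and fetch_recommendations are illustrative only.

    # Hypothetical latency/error injection wrapper (a client-side stand-in for
    # the server-side injection Latency Monkey performs): calls are randomly
    # delayed or failed so timeouts and fallbacks actually get exercised.
    import random
    import time

    class InjectedFault(Exception):
        """Simulated dependency failure."""

    def with_chaos(call, latency_prob=0.1, max_delay_s=2.0, error_prob=0.05):
        """Wrap a dependency call with random latency and error injection."""
        def wrapped(*args, **kwargs):
            if random.random() < latency_prob:
                time.sleep(random.uniform(0, max_delay_s))  # simulate a slow dependency
            if random.random() < error_prob:
                raise InjectedFault("injected 5xx from dependency")
            return call(*args, **kwargs)
        return wrapped

    def fetch_recommendations(user_id):
        return ["title-1", "title-2"]  # stand-in for a real remote call

    chaotic_fetch = with_chaos(fetch_recommendations)

    def recommendations_with_fallback(user_id):
        """Survive dependency failure by degrading to a static list."""
        try:
            return chaotic_fetch(user_id)
        except InjectedFault:
            return ["popular-title-1", "popular-title-2"]

    print(recommendations_with_fallback(user_id=42))
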
  22. Outliers and rebels
  23. Conformity Monkey
      • Without architecture review, how do you ensure designs leverage known successful patterns?
      • Conformity Monkey provides automated analysis for pattern adherence
      • Goals:
        • Evaluate deployment modes (data center distribution)
        • Evaluate health checks, discoverability, versions of key libraries
        • Help ensure service has best chance of successful operation
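
A sketch of the kind of rules such analysis encodes, run over a simplified, made-up cluster description; the real tool evaluates a catalog of rules continuously and notifies the owning team rather than changing anything itself.

    # Hypothetical conformity checks over a simplified cluster description.
    # A real implementation evaluates a catalog of rules and notifies owners.
    MIN_AZS = 3
    MIN_AGENT_VERSION = (2, 0, 0)

    def check_cluster(cluster):
        """Return a list of findings (empty means the cluster conforms)."""
        findings = []
        zones = set(cluster["availability_zones"])
        if len(zones) < MIN_AZS:
            findings.append(f"spread across {len(zones)} AZs, want {MIN_AZS}")
        if not cluster.get("health_check_url"):
            findings.append("no health check configured")
        if not cluster.get("registered_in_discovery", False):
            findings.append("not registered in service discovery")
        if tuple(cluster.get("platform_agent_version", (0, 0, 0))) < MIN_AGENT_VERSION:
            findings.append("key platform library below minimum version")
        return findings

    example = {  # made-up cluster description
        "name": "api-prod-v042",
        "availability_zones": ["us-west-2a", "us-west-2a", "us-west-2b"],
        "health_check_url": "http://localhost:7001/healthcheck",
        "registered_in_discovery": True,
        "platform_agent_version": (1, 9, 3),
    }
    for finding in check_cluster(example):
        print(f"{example['name']}: {finding}")
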
  24. Cruft, junk, and clutter
  25. Janitor Monkey
      • Clutter accumulates, in the form of:
        • Complexity
        • Vulnerabilities
        • Cost
      • Janitor identifies unused resources and reaps them to save money and reduce exposure
      • Goals:
        • Automated hygiene
        • More freedom for engineers to innovate and move fast
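
A hedged sketch of one such reaping rule using boto3: EBS volumes that have sat unattached past a grace period get flagged (age since creation is used as a rough proxy here). The real tool covers many resource types and notifies owners before deleting anything.

    # Hypothetical Janitor Monkey-style rule: flag EBS volumes that have been
    # unattached longer than a grace period (age since creation is a rough
    # proxy). The real tool covers many resource types and notifies owners
    # before deleting anything.
    from datetime import datetime, timedelta, timezone

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    GRACE_PERIOD = timedelta(days=14)

    def reap_unused_volumes(dry_run=True):
        now = datetime.now(timezone.utc)
        resp = ec2.describe_volumes(
            Filters=[{"Name": "status", "Values": ["available"]}]  # unattached
        )
        for vol in resp["Volumes"]:
            age = now - vol["CreateTime"]
            if age > GRACE_PERIOD:
                print(f"reaping {vol['VolumeId']} (unattached, {age.days} days old)")
                if not dry_run:
                    ec2.delete_volume(VolumeId=vol["VolumeId"])

    if __name__ == "__main__":
        reap_unused_volumes(dry_run=True)
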
  26. Non-Simian Approaches
      • Org model
        • Engineers write, deploy, support code
      • Culture
        • De-centralized with as few processes and rules as possible
        • Lots of local autonomy
        • “If you’re not failing, you’re not trying hard enough”
      • Peer pressure
      • Productive and transparent incident reviews
  27. Software Deployment for Compliance-Sensitive Apps
  28. Control Objectives for Software Deployments
      • Visibility and transparency:
        • Who did what, when?
        • What was the scope of the change or deployment?
        • Was it reviewed?
        • Was it tested?
        • Was it approved?
      • Typically attempted via:
        • Restricted access/SoD
        • CMDBs
        • Change management processes
        • Test results
        • Change windows
  29. Large and Dynamic Systems Need a Different Approach
      • No operations organization
      • No acceptable windows for downtime
      • Thousands of deployments and changes per day
  30. Control Objectives Haven’t Changed
      • Visibility and transparency:
        • Who did what, when?
        • What was the scope of the change or deployment?
        • Was it reviewed?
        • Was it tested?
        • Was it approved?
  31. System-wide view on changes
  32. Access to changes by app, region, environment, etc.
      Lookback in time as needed
  33. Changes, via email
  34. When? By who? What changed?
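
A sketch of the kind of structured change record (all field names hypothetical) that can answer those questions and back the surrounding views at once: the system-wide feed, the per-app lookback, and the email and chat notifications.

    # Hypothetical structured change event: one record answers "who, what,
    # when, where" and can feed a searchable change log, email summaries, and
    # chat notifications alike. All field names are illustrative.
    import json
    from dataclasses import asdict, dataclass
    from datetime import datetime, timezone

    @dataclass
    class ChangeEvent:
        app: str              # application being changed
        environment: str      # e.g. "test" or "prod"
        region: str           # e.g. "us-west-2"
        cluster: str          # cluster that was created, resized, or removed
        action: str           # e.g. "deploy", "rollback", "scale"
        actor: str            # who or what triggered it (human or automation)
        source_commit: str    # SCM revision behind the artifact
        ci_build: str         # CI job/build that produced the artifact
        timestamp: str        # when it happened (UTC, ISO 8601)

    event = ChangeEvent(
        app="api",
        environment="prod",
        region="us-west-2",
        cluster="api-prod-v043",
        action="deploy",
        actor="jchan",
        source_commit="3f9d2ab",
        ci_build="api-build-1187",
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

    # Publish wherever the audience is (change log, email, chat); here we
    # just print the JSON form.
    print(json.dumps(asdict(event), indent=2))
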
  35. Integrated awareness
  36. Chat integration lets engineers easily access info
  37. Automated testing
  38. 1000+ tests to compare proposed vs. existing
      Automated scoring and deployment decision
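
A toy sketch of what automated scoring and a deployment decision can look like, with made-up metric names, values, and thresholds: each metric of the proposed (canary) cluster is compared against the existing (baseline) cluster, the results roll up into a score, and the deployment proceeds only above a cutoff.

    # Toy canary-analysis sketch: compare the proposed (canary) cluster against
    # the existing (baseline) cluster metric by metric, aggregate into a score,
    # and gate the deployment on a threshold. All numbers are made up.
    BASELINE = {"error_rate": 0.010, "p99_latency_ms": 220.0, "cpu_util": 0.55}
    CANARY = {"error_rate": 0.011, "p99_latency_ms": 260.0, "cpu_util": 0.58}

    # How much worse than baseline each metric may be before it fails.
    TOLERANCE = {"error_rate": 0.25, "p99_latency_ms": 0.15, "cpu_util": 0.20}

    def score_canary(baseline, canary, tolerance):
        """Return (score out of 100, list of failing metrics)."""
        passed, failed = 0, []
        for metric, base_value in baseline.items():
            allowed = base_value * (1 + tolerance[metric])
            if canary[metric] <= allowed:
                passed += 1
            else:
                failed.append(metric)
        return 100 * passed / len(baseline), failed

    score, failed = score_canary(BASELINE, CANARY, TOLERANCE)
    decision = "promote" if score >= 95 else "roll back"
    print(f"canary score {score:.0f}/100, failing: {failed or 'none'} -> {decision}")
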
  39. Complete view of deployment lifecycle
  40. Jenkins (CI) job
      App name
      Currently running clusters by region/environment
  41. Cluster ID
      Deployment details
      AMI version
      SCM commit
  42. Modified files
      Source diffs
      Link to relevant JIRA(s)
  43. Takeaway
      • Control objectives have not changed, but advantages of new technologies and operational models dictate updated approaches
  44. Netflix References
      • http://netflix.github.com
      • http://techblog.netflix.com
      • http://slideshare.net/netflix
  45. Questions? chan@netflix.com
