Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Josh Evans – Engineering Leader
November 8, 2016
Mastering Chaos
A Netflix Guide to Microservices
Illness in the Family
Myelin Sheathe
Autoimmune disorder
Externally trigger
Treatable
Guillain-Barré Syndrome
Breathing is a miraculous act of
bravery
ELB
and so is taking traffic
Introductions
Microservice Basics
Challenges & Solutions
Organization & Architecture
Our Talk Today
Introductions
Microservice Basics
Challenges & Solutions
Organization & Architecture
Our Talk Today
1999 – 2009
Engineer & Engineering Manager
Ecommerce (DVD  Streaming)
2009 – 2013
Director of Engineering - Playback Serv...
Taking time off
Spending time with family
Thinking about what’s next
Today
Leader in subscription internet tv service
Hollywood, indy, local
Growing slate of original content
86 million members
~19...
Introductions
Microservice Basics
Challenges & Solutions
Organization & Architecture
Our Talk Today
Netflix DVD Data Center - 2000
Linux Host
What microservices are not
Apache
Tomcat
Javaweb
STORE
LoadBalancer
BILLING
HTTP...
What is a microservice?
…the microservice architectural style is an
approach to developing a single application as a
suite of small services, each...
Separation of concerns
Modularity, encapsulation
Scalability
Horizontally scaling
Workload partitioning
Virtualization & e...
Organ Systems
Each organ has a purpose
Organs form systems
Systems form an organism
Edge
ELB
Zuul
NCCP
API
Middle Tier & Platform
Product
• Bucket testing
• Subscriber
• Recommendations
Platform
• Routing
•...
Client Application
Client Library
EVCache Client Service Client
S S S S. . .
DB DB DB DB. . .
. . .
Microservices are an a...
Introductions
Microservice Basics
Challenges & Solutions
Organization & Architecture
Our Talk Today
Dependency
Scale
Variance
Change
Challenges & Solutions
Dependency
Scale
Variance
Change
Challenges & Solutions
Intra-service requests
Client libraries
Data Persistence
Infrastructure
Use Cases
Intra-service Requests
Crossing the Chasm
Linux Host
Linux Host
Linux Host
Linux Host
Crossing the Chasm
Linux Host
Apache Tomcat
Linux Host
Apache Tomcat
Network l...
Cascading Failure
How do you know if it works?
Inoculation
Device Service B
Service C
Internet EdgeZuul
Service A
ELB
FITSynthetic transactions
Override by device or account
% of li...
Device Service B
Service C
Internet EdgeZuul
Service A
ELB
FIT
Fault Injection Testing (FIT)
Enforced throughout the call ...
ELB
API
How do we constrain testing scope?
API Gateway
App 1
App 2
App 4
App 5
App 6
App 3
App 7
App 8
99.99
99.99
99.99
99.99
99.99
99.99
99.99
99.99
Proxy
99.99 99...
Critical Microservices
Client Libraries
• Many clients
• Common business logic
• Common access patterns
Return of the Monolith
Heap consumption
Logical defects
Transitive dependencies
Parasitic Infestation
Client Application
Client Library
EVCache Client Service Client
S S S S. . .
DB DB DB DB. . .
. . .
Simple Logic, Common P...
Persistence
In the presence of a network partition, you must choose
between consistency and availability
CAP Theorem
DB
DB
DB
Network ...
Zone A
Zone B
Zone C
Zone B
Zone C
Client
Zone A
Local Quorum
(Typical)
100ms
Eventual Consistency
Infrastructure
December 24th, 2012
US-East-1
Canada
No place to go
US
Latin America
US-East-1US-West-2 EU-West-1
#NetflixEverywhere Global Architecture
QCon London, 2016
https://www.infoq.com/presentations/netflix-failure-multiple-regi...
Dependency
Scale
Variance
Change
Challenges & Solutions
Stateless services
Stateful services
Hybrid services
Use Cases
Stateless Services
Not a cache or a database
Frequently accessed metadata
No instance affinity
Loss a node is a non-event
What is a stateless...
Minimum size
Desired capacity
Maximum size
Scale out as needed
S3AMI retrieved on demand
Compute efficiency
Node failure
T...
Cluster A Cluster D
Edge Cluster
Cluster B
Cluster C
Surviving Instance Failure
Stateful Services
Databases & caches
Custom apps which hold large amounts of data
Loss of a node is a notable event
What is a stateful servi...
Dedicated Shards – An Antipattern
Squid 1 Squid 2 Squid 3
Client Application
Subscriber Client Library
Cache Client Servic...
Redundancy is fundamental
Zone A Zone B Zone C
. . .. . .. . .
EVCache Writes
Client Application
Client Library
EVCache Client
Client Application
Cl...
Zone A
Client Application
Client Library
EVCache Client
Zone B
Client Application
Client Library
EVCache Client
Zone C
Cli...
Hybrid Services
Client Application
Client Library
EVCache Client Service Client
S S S S. . .
DB DB DB DB. . .
. . .
Hybrid Microservice
. ...
It’s easy to take EVCache for granted
30 million requests/sec
2 trillion requests per day globally
Hundreds of billions of...
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Called by many services
On...
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Excessive Load
X X
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Workload partitioning
Requ...
Dependency
Scale
Variance
Change
Challenges & Solutions
Operational drift
Polyglot & containers
Use Cases
Operational Drift
(Unintentional Variance)
Over time
Alert thresholds
Timeouts, retries, fallbacks
Throughput (RPS)
Across microservices
Reliability best practices
O...
Autonomic Nervous
System
You don’t have to think about
digestion or breathing
Incident
Resolution
Review
Remediation
Analysis
Best Practice
Automation
Adoption
Continuous Learning & Automation
Alerts
Apache & Tomcat
Automated canary analysis
Autoscaling
Chaos
Consistent naming
ELB config
Healthcheck
Immutable mach...
Polyglot & Containers
(Intentional Variance)
The Paved Road
Stash
Nebula/Gradle
BaseAMI/Ubuntu
Jenkins
Spinnaker
Runtime Platform
In the Critical Path
In the Critical Path
Productivity tooling
Insight & triage capabilities
Base image fragmentation
Node management
Library/platform duplication
L...
Raise awareness of costs
Constrain centralized support
Prioritize by impact
Seek reusable solutions
Strategic Stance
Dependency
Scale
Variance
Change
Challenges & Solutions
How do we achieve velocity with confidence?
Global Cloud Management & Delivery
Integrated, Automated Practices
Conformity checks
Red/black pipelines
Automated canaries
Staged deployments
Squeeze tests
Alerts
Apache & Tomcat
Automated canary analysis
Autoscaling
Chaos
Consistent naming
ELB config
Healthcheck
Immutable mach...
https://www.youtube.com/watch?v=IkPb15FfuQU
Introductions
Microservice Basics
Challenges & Solutions
Organization & Architecture
Our Talk Today
Customer Device Netflix Data Center - 2009
NCCP
Electronic Delivery - NRDP 1.x
LoadBalancer
Netflix App
Security
Activatio...
Netflix API - let a 1000 flowers bloom!
Netflix Data Center - 2009
API
Netflix API – from public to private
LoadBalancer
General REST API
JSON schema
HTTP respons...
Customer Device
Netflix Data Center – 2010
API
Hybrid Architecture
LB
Netflix App
Security
Activation
Playback
Platform (N...
Josh: what is the right long term architecture?
Peter: do you care about the organizational
implications?
Conway’s Law
Organizations which design systems are constrained to
produce designs which are copies of the
communication s...
Conway’s Law
If you have four teams working on a compiler you will
end up with a four pass compiler
NCCP
API
Blade Runner
Outcomes
Productivity & new capabilities
Refactored organization
Lessons
Solutions first, team second
Reconfigure teams to...
Introductions
Microservice Basics
Challenges & Solutions
Organization & Architecture
Recap
Our Talk Today
Microservice architectures are
complex and organic
Health depends on discipline and
chaos
Dependency
Circuit breakers, fallbacks, chaos
Simple clients
Eventual consistency
Multi-region failover
Scale
Auto-scaling...
netflix.github.io
techblog.netflix.com
Questions?
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Upcoming SlideShare
Loading in …5
×

Mastering Chaos - A Netflix Guide to Microservices

7,802 views

Published on

QConSF 2016 Abstract:

By embracing the tension between order and chaos and applying a healthy mix of discipline and surrender Netflix reliably operates microservices in the cloud at scale. But every lesson learned and solution developed over the last seven years was born out of pain for us and our customers. Even today we remain vigilant as we evolve our service architecture. For those just starting the microservices journey these lessons and solutions provide a blueprint for success.

In this talk we’ll explore the chaotic and vibrant world of microservices at Netflix. We’ll start with the basics - the anatomy of a microservice, the challenges around distributed systems, and the benefits realized when integrated operational practices and technical solutions are properly leveraged. Then we’ll build on that foundation exploring the cultural, architectural, and operational methods that lead to microservice mastery.

Published in: Internet
  • Be the first to comment

Mastering Chaos - A Netflix Guide to Microservices

  1. 1. Josh Evans – Engineering Leader November 8, 2016 Mastering Chaos A Netflix Guide to Microservices
  2. 2. Illness in the Family
  3. 3. Myelin Sheathe Autoimmune disorder Externally trigger Treatable Guillain-Barré Syndrome
  4. 4. Breathing is a miraculous act of bravery
  5. 5. ELB and so is taking traffic
  6. 6. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  7. 7. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  8. 8. 1999 – 2009 Engineer & Engineering Manager Ecommerce (DVD  Streaming) 2009 – 2013 Director of Engineering - Playback Services 2013 – 2016 Director of Operations Engineering Josh Evans
  9. 9. Taking time off Spending time with family Thinking about what’s next Today
  10. 10. Leader in subscription internet tv service Hollywood, indy, local Growing slate of original content 86 million members ~190 countries, 10s of languages 1000s of device types Microservices on AWS
  11. 11. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  12. 12. Netflix DVD Data Center - 2000 Linux Host What microservices are not Apache Tomcat Javaweb STORE LoadBalancer BILLING HTTP JDBC DB Link HTTP/S Monolithic code base Monolithic database Tightly coupled architecture
  13. 13. What is a microservice?
  14. 14. …the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. - Martin Fowler
  15. 15. Separation of concerns Modularity, encapsulation Scalability Horizontally scaling Workload partitioning Virtualization & elasticity Automated operations On demand provisioning An Evolutionary Response
  16. 16. Organ Systems Each organ has a purpose Organs form systems Systems form an organism
  17. 17. Edge ELB Zuul NCCP API Middle Tier & Platform Product • Bucket testing • Subscriber • Recommendations Platform • Routing • Configuration • Crypto Persistence • Cache • Database
  18. 18. Client Application Client Library EVCache Client Service Client S S S S. . . DB DB DB DB. . . . . . Microservices are an abstraction . . . Microservice
  19. 19. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  20. 20. Dependency Scale Variance Change Challenges & Solutions
  21. 21. Dependency Scale Variance Change Challenges & Solutions
  22. 22. Intra-service requests Client libraries Data Persistence Infrastructure Use Cases
  23. 23. Intra-service Requests
  24. 24. Crossing the Chasm
  25. 25. Linux Host Linux Host Linux Host Linux Host Crossing the Chasm Linux Host Apache Tomcat Linux Host Apache Tomcat Network latency, congestion, failure Logical or scaling failure Service A Service B
  26. 26. Cascading Failure
  27. 27. How do you know if it works?
  28. 28. Inoculation
  29. 29. Device Service B Service C Internet EdgeZuul Service A ELB FITSynthetic transactions Override by device or account % of live traffic up to 100% Fault Injection Testing (FIT)
  30. 30. Device Service B Service C Internet EdgeZuul Service A ELB FIT Fault Injection Testing (FIT) Enforced throughout the call path
  31. 31. ELB API
  32. 32. How do we constrain testing scope?
  33. 33. API Gateway App 1 App 2 App 4 App 5 App 6 App 3 App 7 App 8 99.99 99.99 99.99 99.99 99.99 99.99 99.99 99.99 Proxy 99.99 99.99 Combinatorial Math 99.9910 = 99.9
  34. 34. Critical Microservices
  35. 35. Client Libraries
  36. 36. • Many clients • Common business logic • Common access patterns
  37. 37. Return of the Monolith
  38. 38. Heap consumption Logical defects Transitive dependencies Parasitic Infestation
  39. 39. Client Application Client Library EVCache Client Service Client S S S S. . . DB DB DB DB. . . . . . Simple Logic, Common Patterns . . .
  40. 40. Persistence
  41. 41. In the presence of a network partition, you must choose between consistency and availability CAP Theorem DB DB DB Network B Network C Network D Service Network A X
  42. 42. Zone A Zone B Zone C Zone B Zone C Client Zone A Local Quorum (Typical) 100ms Eventual Consistency
  43. 43. Infrastructure
  44. 44. December 24th, 2012
  45. 45. US-East-1 Canada No place to go US Latin America
  46. 46. US-East-1US-West-2 EU-West-1
  47. 47. #NetflixEverywhere Global Architecture QCon London, 2016 https://www.infoq.com/presentations/netflix-failure-multiple-regions
  48. 48. Dependency Scale Variance Change Challenges & Solutions
  49. 49. Stateless services Stateful services Hybrid services Use Cases
  50. 50. Stateless Services
  51. 51. Not a cache or a database Frequently accessed metadata No instance affinity Loss a node is a non-event What is a stateless service?
  52. 52. Minimum size Desired capacity Maximum size Scale out as needed S3AMI retrieved on demand Compute efficiency Node failure Traffic spikes Performance bugs Auto Scaling Groups
  53. 53. Cluster A Cluster D Edge Cluster Cluster B Cluster C Surviving Instance Failure
  54. 54. Stateful Services
  55. 55. Databases & caches Custom apps which hold large amounts of data Loss of a node is a notable event What is a stateful service?
  56. 56. Dedicated Shards – An Antipattern Squid 1 Squid 2 Squid 3 Client Application Subscriber Client Library Cache Client Service Client S S S S. . . DB DB DB DB. . . Squid n HA Proxy Set 1 Set 2 Set 3 Set n X
  57. 57. Redundancy is fundamental
  58. 58. Zone A Zone B Zone C . . .. . .. . . EVCache Writes Client Application Client Library EVCache Client Client Application Client Library EVCache Client Client Application Client Library EVCache Client
  59. 59. Zone A Client Application Client Library EVCache Client Zone B Client Application Client Library EVCache Client Zone C Client Application Client Library EVCache Client . . .. . .. . . EVCache Reads
  60. 60. Hybrid Services
  61. 61. Client Application Client Library EVCache Client Service Client S S S S. . . DB DB DB DB. . . . . . Hybrid Microservice . . .
  62. 62. It’s easy to take EVCache for granted 30 million requests/sec 2 trillion requests per day globally Hundreds of billions of objects Tens of thousands of memcached instances Milliseconds of latency per request
  63. 63. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Called by many services Online & offline clients Called many times / request 800k – 1M RPS Fallback to service/db Excessive Load
  64. 64. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Excessive Load X X
  65. 65. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Workload partitioning Request-level caching Secure token fallback Chaos under load Solutions Online Offline
  66. 66. Dependency Scale Variance Change Challenges & Solutions
  67. 67. Operational drift Polyglot & containers Use Cases
  68. 68. Operational Drift (Unintentional Variance)
  69. 69. Over time Alert thresholds Timeouts, retries, fallbacks Throughput (RPS) Across microservices Reliability best practices Operational Drift
  70. 70. Autonomic Nervous System You don’t have to think about digestion or breathing
  71. 71. Incident Resolution Review Remediation Analysis Best Practice Automation Adoption Continuous Learning & Automation
  72. 72. Alerts Apache & Tomcat Automated canary analysis Autoscaling Chaos Consistent naming ELB config Healthcheck Immutable machine images Squeeze testing Staged, red/black deployments Timeouts, retries, fallbacks Production Ready
  73. 73. Polyglot & Containers (Intentional Variance)
  74. 74. The Paved Road Stash Nebula/Gradle BaseAMI/Ubuntu Jenkins Spinnaker Runtime Platform
  75. 75. In the Critical Path
  76. 76. In the Critical Path
  77. 77. Productivity tooling Insight & triage capabilities Base image fragmentation Node management Library/platform duplication Learning curve - production expertise Cost of Variance
  78. 78. Raise awareness of costs Constrain centralized support Prioritize by impact Seek reusable solutions Strategic Stance
  79. 79. Dependency Scale Variance Change Challenges & Solutions
  80. 80. How do we achieve velocity with confidence?
  81. 81. Global Cloud Management & Delivery
  82. 82. Integrated, Automated Practices Conformity checks Red/black pipelines Automated canaries Staged deployments Squeeze tests
  83. 83. Alerts Apache & Tomcat Automated canary analysis Autoscaling Chaos Consistent naming ELB config Healthcheck Immutable machine images Squeeze testing Staged, red/black deployments Timeouts, retries, fallbacks Production Ready
  84. 84. https://www.youtube.com/watch?v=IkPb15FfuQU
  85. 85. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  86. 86. Customer Device Netflix Data Center - 2009 NCCP Electronic Delivery - NRDP 1.x LoadBalancer Netflix App Security Activation Playback Platform (NRDP) UI Collaborative design XML payloads Custom responses Versioned firmware releases Long cycles Simple UI – “Queue Reader” ED
  87. 87. Netflix API - let a 1000 flowers bloom!
  88. 88. Netflix Data Center - 2009 API Netflix API – from public to private LoadBalancer General REST API JSON schema HTTP response codes Oauth security model Content Metadata Content Metadata Application
  89. 89. Customer Device Netflix Data Center – 2010 API Hybrid Architecture LB Netflix App Security Activation Playback Platform (NRDP) UI Content Metadata NCCP ED LB Distinct • Services • Protocols • Schemas • Security
  90. 90. Josh: what is the right long term architecture? Peter: do you care about the organizational implications?
  91. 91. Conway’s Law Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations. Any piece of software reflects the organizational structure that produced it.
  92. 92. Conway’s Law If you have four teams working on a compiler you will end up with a four pass compiler
  93. 93. NCCP API
  94. 94. Blade Runner
  95. 95. Outcomes Productivity & new capabilities Refactored organization Lessons Solutions first, team second Reconfigure teams to best support your architecture Outcomes & Lessons
  96. 96. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Recap Our Talk Today
  97. 97. Microservice architectures are complex and organic Health depends on discipline and chaos
  98. 98. Dependency Circuit breakers, fallbacks, chaos Simple clients Eventual consistency Multi-region failover Scale Auto-scaling Redundancy – avoid SPoF Partitioned workloads Failure-driven design Chaos under load Variance Engineered operations Understood cost of variance Prioritized support by impact Change Automated delivery Integrated practices Organization & Architecture Solutions first, team second
  99. 99. netflix.github.io
  100. 100. techblog.netflix.com
  101. 101. Questions?

×