Josh Evans – (Former Netflix) Engineering Leader at Large
June 26, 2017
Refactoring Organizations
A Netflix Study
2009
Devices
…
Queue Reader UX
DVD Foundation
Customer Device Netflix Data Center
NCCP
Electronic Delivery
LoadBalancer
Netflix App
Security
Activation
Playback
Platfor...
Netflix API
Let 1000 flowers bloom!
Netflix Data Center
API
Netflix API
LoadBalancer
REST API
JSON schema
HTTP response codes
OAuth
Content
Metadata
Applicati...
Customer Device
Netflix Data Center
API
Proposal
LB
Netflix App
Security
Activation
Playback
Platform (NRDP)
UI
Content
Me...
Pros
Separation of concerns
Bandwidth
Increased innovation
Cons
API reliability
Heterogeneous architecture
Lack of domain ...
Neil Hunt, CPO
What was the reaction?
Concern
Anger
Tribalism
Conway’s Law
If you have four teams working on a compiler you will end
up with a four pass compiler
Today’s Premise
Conway’s Law describes dysfunction
We must embrace architecture before organization
Technical analogs driv...
Selfless Leadership
Company
Team
You
In that order!
Introductions
Framework
Scaling Teams
IQ v EQ
Conway’s Revenge
Today’s Program
Introductions
Framework
Scaling Teams
IQ v EQ
Conway’s Revenge
Today’s Program
1999 – 2009
Ecommerce (DVD → Streaming)
2009 – 2013
Streaming Infrastructure
2013 – 2016
Operations Engineering
2017
Time ...
Global leader in subscription internet TV
Growing slate of original content
100 million members
190 countries, 10s of lang...
Introductions
Framework
Scaling Teams
IQ v EQ
Conway’s Revenge
Today’s Program
Why do we refactor?
Functionality
Engineering velocity
Functional and operational quality
As we scale!
We refactor to improve or sustain
The ability to enhance a system by adding new functionality at
minimal effort
Functional Scalability
The ease with which a...
The ability for an organization to easily add people and domain
responsibilities in response to increased work and complex...
Common tasks are difficult
Strategic efforts are impractical or impossible
When do we refactor?
How do we refactor?
Technical Patterns
Object-oriented design
Micro-service architecture
Systems engineering
Example: Organizational Polymorphism
With the right people
Instead of a culture of process adherence
We have a culture of creativity and self discipline,
freedom ...
You build it
You run it
Introductions
Framework
Scaling Teams
IQ v EQ
Conway’s Revenge
Today’s Program
2009
Devices in Production
…
Key Platforms in Progress
Anthony Park
Surprise!
John Funge
Big Picture
Device Ubiquity
Product Innovation
Cloud Migration
Internationalization
Service Reliability
How many engineers?
6
What would you do?
Prioritize & Queue
…
…
Task Queue
Completed Tasks
Thread Pool
… …
Prioritize
Service availability
Game consoles
Downloadable apps
CE expansion
Mobile
…
Queue
Audio & subtitles
Internationa...
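
The prioritize-and-queue analogy on the two slides above can be sketched roughly as a priority queue feeding a small, fixed-size worker pool. This is only an illustration: the priority numbers, the helper names (`drain_in_priority_order`, `run_with_pool`), and the stand-in "work" (`str.upper`) are assumptions, and the pool size of 6 simply mirrors the six engineers mentioned in the talk.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical backlog: (priority, task name); a lower number is more urgent.
backlog = [
    (4, "audio & subtitles"),
    (1, "service availability"),
    (3, "mobile"),
    (2, "game consoles"),
]

def drain_in_priority_order(tasks):
    """Return task names in the order a dispatcher would hand them out."""
    heap = list(tasks)
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

def run_with_pool(tasks, workers=6):
    """Submit tasks in priority order to a fixed pool and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(str.upper, name)
                   for name in drain_in_priority_order(tasks)]
        return [f.result() for f in futures]

dispatch_order = drain_in_priority_order(backlog)
completed = run_with_pool(backlog)
```

With a fixed pool, anything that is not at the front of the queue simply waits, which is the constraint the following slides set out to relieve by scaling up.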
Scale Up
Work profile
Roles & throughput
Team structure
Manager
Engineer Engineer Engineer Engineer Engineer Test/Ops
Monolithic Team
One leader
Undifferentiated roles
Ad hoc res...
Monolithic Decomposition
Distinct modules & services
Workload partitioning
Dependency awareness
Loose coupling
Server
NRDP Features
Protocols
Security
Bootstrap
Key Platforms
Device integration
Device launch
Streaming Infrastructure
...
On Call Overload
Vicious Cycle
Philip was great at
Development
Test infrastructure
Project management
Troubleshooting
Philip Fisher-Ogden
You build it
You run it
Risks
Burnout
Slow progress on key initiatives
Philip Fisher-Ogden
Thread Starvation
… …
Shared
exclusive
resource
High priority/frequency
Other - blocked
Tasks
Context Switching
Process 1 Process 2
OS
Interrupt or system call
Save state - pcb1
..
Get state – pcb2
Interrupt or syste...
Thread Pool Isolation
Partition pools & locks
Distribute problematic workloads
… …
… …
… …
…
…
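
A minimal sketch of the thread pool isolation idea on the slide above, assuming two hypothetical workload classes: "escalation" for interrupt-driven on-call work and "project" for planned work. Each class gets its own pool, so a burst of escalations cannot occupy the workers reserved for project tasks. The pool names and sizes are illustrative, not taken from the talk.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def worker_name():
    """Report the name of the thread that ran this task."""
    return threading.current_thread().name

# Two isolated pools: each workload class only ever runs on its own threads.
pools = {
    "escalation": ThreadPoolExecutor(max_workers=1, thread_name_prefix="escalation"),
    "project": ThreadPoolExecutor(max_workers=2, thread_name_prefix="project"),
}

def submit(kind):
    """Route a task to the pool that owns its workload class."""
    return pools[kind].submit(worker_name)

results = {
    "escalation": [submit("escalation").result() for _ in range(3)],
    "project": [submit("project").result() for _ in range(3)],
}

for pool in pools.values():
    pool.shutdown()
```

The organizational counterpart described next follows the same shape: escalations are distributed to a dedicated group rather than interrupting everyone.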
Organizational Solution
Deepen troubleshooting skills
Distribute escalations
Engineer operations
Key Platforms
Device integration
Device launch
Server
NRDP
Protocols
Security
Bootstrap
Insight/Tools
Delivery
Dashboards
...
Cloud Migration
Rapid iteration v. systematic, long-cycle execution
Cloud v. Product
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Traffic Batch Processes
Heterogeneous
Workloads
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Interference
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Interference
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Interference
. . .
Batch
S S S S. . .
DB DB DB DB. . .
. . .
Member Path
Member Path
Member Path
Batch
Batch
Interference
X
Batch
S S S S. . .
DB DB DB DB. . .
. . .
Member Path
Member Path
Member Path
Batch
Batch
Partitioning
Online Offline
. . .
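
The online/offline partitioning shown in the diagrams above can be sketched as a simple routing rule: member-path requests go to one set of servers, batch jobs to another, so the two workloads no longer interfere. The `Request` type, the cluster names, and the sample requests are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    name: str
    workload: str  # "member" or "batch"

# Assumed mapping from workload class to partition.
CLUSTERS = {"member": "online", "batch": "offline"}

def route(request):
    """Pick the partition that serves this request's workload class."""
    return CLUSTERS[request.workload]

def partition(requests):
    """Group request names by the partition they are routed to."""
    placement = {"online": [], "offline": []}
    for request in requests:
        placement[route(request)].append(request.name)
    return placement

placement = partition([
    Request("start playback", "member"),
    Request("nightly report", "batch"),
    Request("browse catalog", "member"),
])
```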
Silverlight Migration
Partitioning & Domain Portability
Streaming Infrastructure Platform Engineering
Engineer as a Library
Ranjit Ranjit
Partitioning & Domain Portability
Platform Engineering
Engineer as a Library
Streaming Infrastructure
Systems
Cloud migration
Key Platforms
Device integration
Device launch
Server
NRDP
Protocols
Security
Bootstrap
Insight/To...
Staffing
6  24
Bottleneck
Systems
Cloud migration
Viewing history
Viewing sessions
Key Platforms
Device integration
Device launch
Server
NRDP
Protoc...
By 2012
cloud migration
Canada, Latin America, UK
massive device expansion
major product improvements
Netflix CDN
Introductions
Framework
Scaling Teams
IQ v EQ
Conway’s Revenge
Today’s Program
IQ
Task-oriented
Logical
Literal
Detached
Autocratic
EQ
Feeling-oriented
Emotional
Social
Empathetic
Democratic
Bimodal Th...
IQ
Design
Evaluation
Implementation
EQ
Inception
Socialization
Overcoming Tribalism
Flawed Inception
Introductions
Framework
Scaling Teams
IQ v EQ
Conway’s Revenge
Today’s Program
2012
Customer Device
Netflix Data Center
API
This…
LB
Netflix App
Security
Activation
Playback
Platform (NRDP)
UI
Content
Metad...
ELB
NCCP
API
…has become this
Zuul
ELB
…and this
Growing complexity
Duplication of effort
Engineering tax
Raising the Stakes
Playback start in 500ms
More UI/Playback scenarios
Faster rate of innovation
Better service reliability
Common tasks are difficult
Strategic efforts are impractical or impossible
When do we refactor?
If you have four teams working on a compiler you will end
up with a four pass compiler
Conway’s Revenge!
We had two teams ...
Mature API team
Robust API platform
Strong operational focus
Trust & respect
A Better Foundation
Daniel Jacobson
Josh: what’s the right architectural solution?
Peter: do you care about the organizational implications?
Moment of Truth
Selfless Leadership
Josh: what’s the right architectural solution?
Peter: do you care about the organizational implications?
Moment of Truth
J...
ELB
NCCP
API
Before
Zuul
After
Integrated architecture
Distributed functionality
Shared services
Common practices
Edge Services
Zuul API Server
Playback
Services
Features Security Data Systems
Platform Insight/Tools
Edge Services
Shared...
Takeaways
Put architecture first
Leverage technical analogs
Know when to use IQ v EQ
Be selfless
www.linkedin.com/in/jevansnflx
Where to find me
?
Refactoring Organizations
Refactoring Organizations - A Netflix Study (QCon NYC 2017)

Is your service architecture and engineering velocity constrained by organizational concerns? Does it seem impossible to give priority to key initiatives regardless of intent? Are engineers switching tasks so often that they are just treading water? Are critical projects endlessly backlogged? Has staffing up pushed the limits of your team structure? Navigating challenges like these can be daunting, and the solutions are fraught with uncertainty. How do you know what, where, and when to change? And whatever the answer is today, it will most certainly vary over time. Effective organizations evolve, at key inflection points, to support critical business and technical goals. There is not only a strong relationship between organizations and the software they produce (Conway’s Law), but many organizational solutions can be derived from analogs in the technical realm. In other words, we can treat organizational improvement as a refactoring exercise. Over the last 20 years Netflix engineering has proven time and again an ability to adapt and grow, resulting in undisputed dominance over the global internet TV market. In this talk we’ll use Netflix as a case study to illustrate how specific strategies, framed as technical analogs, have been employed to maximize engineering agility, velocity, and impact. These powerful yet simple strategies and solutions provide a useful blueprint for organizational success.

Published in: Internet

Refactoring Organizations - A Netflix Study (QCon NYC 2017)

  1. Josh Evans – (Former Netflix) Engineering Leader at Large June 26, 2017 Refactoring Organizations A Netflix Study
  2. 2009
  3. Devices …
  4. Queue Reader UX
  5. DVD Foundation
  6. Customer Device Netflix Data Center NCCP Electronic Delivery LoadBalancer Netflix App Security Activation Playback Platform (NRDP) UI XML/RPC Ticket-based security Custom responses DB HTTP/S DVD Legacy DVD Legacy
  7. Netflix API
  8. Let 1000 flowers bloom!
  9. Netflix Data Center API Netflix API LoadBalancer REST API JSON schema HTTP response codes OAuth Content Metadata Application HTTP/S
  10. Customer Device Netflix Data Center API Proposal LB Netflix App Security Activation Playback Platform (NRDP) UI Content Metadata NCCP ED LB
  11. Pros Separation of concerns Bandwidth Increased innovation Cons API reliability Heterogeneous architecture Lack of domain knowledge Assessment
  12. Neil Hunt, CPO
  13. What was the reaction?
  14. Concern Anger
  15. Tribalism
  16. Conway’s Law
  17. If you have four teams working on a compiler you will end up with a four pass compiler
  18. Today’s Premise Conway’s Law describes dysfunction We must embrace architecture before organization Technical analogs drive better organizational solutions
  19. Selfless Leadership Company Team You In that order!
  20. Introductions Framework Scaling Teams IQ v EQ Conway’s Revenge Today’s Program
  21. Introductions Framework Scaling Teams IQ v EQ Conway’s Revenge Today’s Program
  22. 1999 – 2009 Ecommerce (DVD → Streaming) 2009 – 2013 Streaming Infrastructure 2013 – 2016 Operations Engineering 2017 Time off, exploring options Josh Evans – Engineering Leader at Large @ops_engineering
  23. Global leader in subscription internet TV Growing slate of original content 100 million members 190 countries, 10s of languages 1000s of device types Microservices on AWS Unique company culture
  24. Introductions Framework Scaling Teams IQ v EQ Conway’s Revenge Today’s Program
  25. Why do we refactor?
  26. Functionality Engineering velocity Functional and operational quality As we scale! We refactor to improve or sustain
  27. The ability to enhance a system by adding new functionality at minimal effort Functional Scalability The ease with which a system or component can be modified, added, or removed, to accommodate changing load Load Scalability
  28. The ability for an organization to easily add people and domain responsibilities in response to increased work and complexity The ease with which an organization or team can adapt to shifts in business strategy Organizational Scalability
  29. Common tasks are difficult Strategic efforts are impractical or impossible When do we refactor?
  30. How do we refactor?
  31. Technical Patterns Object-oriented design Micro-service architecture Systems engineering
  32. Example: Organizational Polymorphism
  33. With the right people Instead of a culture of process adherence We have a culture of creativity and self discipline, freedom and responsibility Netflix Culture
  34. You build it You run it
  35. Introductions Framework Scaling Teams IQ v EQ Conway’s Revenge Today’s Program
  36. 2009
  37. Devices in Production …
  38. Key Platforms in Progress
  39. Anthony Park
  40. Surprise! John Funge
  41. Big Picture Device Ubiquity Product Innovation Cloud Migration Internationalization Service Reliability
  42. How many engineers? 6
  43. What would you do?
  44. Prioritize & Queue … … Task Queue Completed Tasks Thread Pool … …
  45. Prioritize Service availability Game consoles Downloadable apps CE expansion Mobile … Queue Audio & subtitles International support New codecs … Prioritize & Queue
  46. Scale Up Work profile Roles & throughput Team structure
  47. Manager Engineer Engineer Engineer Engineer Engineer Test/Ops Monolithic Team One leader Undifferentiated roles Ad hoc responsibilities
  48. Monolithic Decomposition Distinct modules & services Workload partitioning Dependency awareness Loose coupling
  49. Server NRDP Features Protocols Security Bootstrap Key Platforms Device integration Device launch Streaming Infrastructure Device & partner-oriented load balancing
  50. On Call Overload
  51. Vicious Cycle Philip was great at Development Test infrastructure Project management Troubleshooting Philip Fisher-Ogden
  52. You build it You run it
  53. Risks Burnout Slow progress on key initiatives Philip Fisher-Ogden
  54. Thread Starvation … … Shared exclusive resource High priority/frequency Other - blocked Tasks
  55. Context Switching Process 1 Process 2 OS Interrupt or system call Save state - pcb1 .. Get state – pcb2 Interrupt or system call Save state – pcb2 Get state – pcb1 .. Executing Executing Idle Executing Idle Idle
  56. Thread Pool Isolation Partition pools & locks Distribute problematic workloads … … … … … … … …
  57. Organizational Solution Deepen troubleshooting skills Distribute escalations Engineer operations
  58. Key Platforms Device integration Device launch Server NRDP Protocols Security Bootstrap Insight/Tools Delivery Dashboards Performance Operational tools Consolidating Operations Engineering
  59. Cloud Migration
  60. Rapid iteration v. systematic, long-cycle execution Cloud v. Product
  61. S S S S. . . DB DB DB DB. . . . . . . . . Member Traffic Batch Processes Heterogeneous Workloads
  62. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Interference
  63. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Interference
  64. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Interference
  65. . . . Batch S S S S. . . DB DB DB DB. . . . . . Member Path Member Path Member Path Batch Batch Interference X
  66. Batch S S S S. . . DB DB DB DB. . . . . . Member Path Member Path Member Path Batch Batch Partitioning Online Offline . . .
  67. Silverlight Migration
  68. Partitioning & Domain Portability Streaming Infrastructure Platform Engineering Engineer as a Library Ranjit Ranjit
  69. Partitioning & Domain Portability Platform Engineering Engineer as a Library Streaming Infrastructure
  70. Systems Cloud migration Key Platforms Device integration Device launch Server NRDP Protocols Security Bootstrap Insight/Tools Delivery Dashboards Performance Ops tools Streaming Infrastructure
  71. Staffing
  72. 6 → 24
  73. Bottleneck
  74. Systems Cloud migration Viewing history Viewing sessions Key Platforms Device integration Device launch Server NRDP Protocols Security Bootstrap Insight/Tools Delivery Dashboards Performance Ops tools Cloning & Parallel Processing
  75. By 2012 cloud migration Canada, Latin America, UK massive device expansion major product improvements Netflix CDN
  76. Introductions Framework Scaling Teams IQ v EQ Conway’s Revenge Today’s Program
  77. IQ Task-oriented Logical Literal Detached Autocratic EQ Feeling-oriented Emotional Social Empathetic Democratic Bimodal Thinking
  78. IQ Design Evaluation Implementation EQ Inception Socialization Overcoming Tribalism
  79. Flawed Inception
  80. Introductions Framework Scaling Teams IQ v EQ Conway’s Revenge Today’s Program
  81. 2012
  82. Customer Device Netflix Data Center API This… LB Netflix App Security Activation Playback Platform (NRDP) UI Content Metadata NCCP ED LB
  83. ELB NCCP API …has become this Zuul
  84. ELB …and this
  85. Growing complexity Duplication of effort Engineering tax
  86. Raising the Stakes Playback start in 500ms More UI/Playback scenarios Faster rate of innovation Better service reliability
  87. Common tasks are difficult Strategic efforts are impractical or impossible When do we refactor?
  88. If you have four teams working on a compiler you will end up with a four pass compiler Conway’s Revenge! We had two teams and a two-service edge architecture
  89. Mature API team Robust API platform Strong operational focus Trust & respect A Better Foundation Daniel Jacobson
  90. Josh: what’s the right architectural solution? Peter: do you care about the organizational implications? Moment of Truth
  91. Selfless Leadership
  92. Josh: what’s the right architectural solution? Peter: do you care about the organizational implications? Moment of Truth Josh: no – we’ll figure that out later
  93. ELB NCCP API Before Zuul
  94. After Integrated architecture Distributed functionality Shared services Common practices
  95. Edge Services Zuul API Server Playback Services Features Security Data Systems Platform Insight/Tools Edge Services Shared services Organized around microservices, functionality, shared services
  96. Takeaways Put architecture first Leverage technical analogs Know when to use IQ v EQ Be selfless
  97. www.linkedin.com/in/jevansnflx Where to find me
  98. ? Refactoring Organizations
