Josh Evans - Director of Operations Engineering
November 16, 2015
Beyond DevOps:
How Netflix Bridges the Gap
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations/netflix-operations-devops
Presented at QCon San Francisco
www.qconsf.com
Technical Debt
• Java 6
• Perforce
• Single Master Jenkins
• Ant
• CentOS
• Asgard/Mimir
Fall 2013
How do we drive broad-based change?
The Paved Road
• Java 7
• Stash
• Jenkins Shards
• Gradle
• Ubuntu
Some said
• You’re overloading us
• Too many projects
• Poor targeting
Others said
• What took you so long?
• We’ve moved on
• Now we need to migrate
That’s great but…
We’re paying a high tax
• Expectations gap
– Division of labor
– Timing of solutions
– Leadership
• Affects
– Reputation
– Relationships
– Lost opportunities
Organizational Debt
How do we bridge the gap?
“Remember that TIME is money…”
Time is a form of currency
• Product Engineering
• Operations Engineering
• Challenges & Strategies
Our time today…
Product Innovation
winning moments of truth
● Every facet of the product
● 1,400 A/B tests in the last year & accelerating
Continuous Innovation
But wait, there’s more…
Build It
• design
• code
• build
• bake
• test
• deploy
Run It
• configure
• monitor
• triage
• fix
…at scale, globally
You build it, you run it
Internet
• 1000s of starts per second
• 100,000s of requests per second
• 100,000,000 hours of content / day
• 3 AWS Regions, 3 AZs per region
Relentless product innovation
Building & running micro-services at scale, globally
• Product Engineering
• Operations Engineering
• Challenges & Strategies
Our time today…
DevOps is a software development method that
emphasizes the roles of both software developers and
other information-technology (IT) professionals with an
emphasis on IT Operations.
- Wikipedia
The Gap
Why? How?
Quality Velocity
Operational Excellence
Operational Excellence is the continuous improvement of
the management, design, and function of operational
environments to achieve greater quality, velocity, and
competitive advantage.
• Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
Operations Engineering is the application of software
engineering practices to achieve and sustain operational
excellence.
Operations Engineering
• Service provider
• Operational excellence driver
• Cross-cutting solutions
• Undifferentiated heavy lifting
• Product Engineering
• Operations Engineering
• Challenges & Strategies
Our time today…
• You’re overloading us
• What took you so long?
Remember that feedback?
• We made assumptions
– Requirements – what & when
– Time for non-product work
• Move from assumptions to knowledge
• Effect change without imposing a tax?
• Achieve and sustain operational excellence?
How do we…
Time is a form of currency
5 strategies for success
in time-based economies
software & organizational engineering
1. Reach out
• What are your biggest operational pain points?
• How can we help?
• How well are we meeting your needs today?
• What would you like to see from us in the future?
Listen
Shower, rinse, repeat
Talk to your engineering customers
Grease the Squeaky Wheels
• low tolerance for tax
• more vocal than most
• High impact solutions
• Clarity on deliverables
• Lower operational tax
• Leadership, innovation, and partnership
What they wanted
• Deliver on solutions
• Better road map definition & communication
• A more aggressive stance on automation
• Deeper investment into leadership, innovation, planning
Our commitments
2. Make an impact
• Apply what you’ve learned
• Deliver what matters
• global cloud console
• end to end delivery
• automation platform
• velocity with confidence
Pipelines - Automated Global Delivery
3. Make it easy to do the right thing
• Engineering time is scarce
• We must do more heavy lifting
Supply & Demand
• Spinnaker manual step
• Automated migrations – Mimir
Provide on-ramps
Automate proven practices
• Alerting and Monitoring
• Apache & Tomcat Hardening
• Automated Canary Analysis
• Autoscaling
• Chaos Participation
• Consistent Naming
• ELB Configuration
• Healthcheck Configured
• Red-Black Pipeline
• Squeeze Testing
• Timeout & Fallback Tuning
• Workload Reliability
Production Ready?
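The checklist above lends itself to an automated per-service report. The following is a minimal sketch, not Netflix's tooling: the `production_readiness` function and the sample list of completed checks are hypothetical, and only the check names come from the slide.

```python
# Check names taken from the "Production Ready?" slide.
CHECKS = [
    "Alerting and Monitoring",
    "Apache & Tomcat Hardening",
    "Automated Canary Analysis",
    "Autoscaling",
    "Chaos Participation",
    "Consistent Naming",
    "ELB Configuration",
    "Healthcheck Configured",
    "Red-Black Pipeline",
    "Squeeze Testing",
    "Timeout & Fallback Tuning",
    "Workload Reliability",
]


def production_readiness(completed):
    """Return (score, missing) for the checks a service has completed.

    score is the fraction of checklist items done (0.0 to 1.0);
    missing lists the remaining items in checklist order.
    """
    done = set(completed)
    unknown = done - set(CHECKS)
    if unknown:
        raise ValueError(f"unknown checks: {sorted(unknown)}")
    missing = [check for check in CHECKS if check not in done]
    score = (len(CHECKS) - len(missing)) / len(CHECKS)
    return score, missing


# Hypothetical service that has adopted three of the twelve practices.
score, missing = production_readiness(
    ["Autoscaling", "Healthcheck Configured", "Red-Black Pipeline"]
)
print(f"{score:.0%} production ready, {len(missing)} checks outstanding")
```

Reporting the missing items, rather than only a score, keeps the output actionable for the owning team.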
Customers → Load Balancer
• Old Version (v1.0): 100 servers, 95% of traffic
• New Version (v1.1): 5 servers, 5% of traffic
Metrics
Canaries
Customers → Load Balancer
• Old Version (v1.0): 0 servers
• New Version (v1.1): 100 servers, 100% of traffic
Metrics
Canaries
Define
• Metrics
• A threshold
Every n minutes
● Classify metrics
● Compute score
● Make a decision
Automated Canary Analysis
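The loop on this slide (define metrics and a threshold, then classify, score, and decide) can be sketched as follows. The slides do not say how metrics are classified, so the relative-tolerance rule, the function names, and the sample numbers below are all assumptions made for illustration.

```python
from statistics import mean


def classify(baseline, canary, tolerance=0.10):
    """Label one metric 'pass' or 'fail'.

    Assumed rule: the canary mean must lie within `tolerance`
    (relative) of the baseline mean.
    """
    base = mean(baseline)
    can = mean(canary)
    if base == 0:
        return "pass" if can == 0 else "fail"
    return "pass" if abs(can - base) / abs(base) <= tolerance else "fail"


def canary_score(baseline_metrics, canary_metrics, tolerance=0.10):
    """Return the percentage of metrics classified as 'pass'."""
    labels = [
        classify(baseline_metrics[name], canary_metrics[name], tolerance)
        for name in baseline_metrics
    ]
    return 100.0 * labels.count("pass") / len(labels)


def decide(score, threshold):
    """Continue the rollout only if the score meets the threshold."""
    return "continue" if score >= threshold else "rollback"


# Hypothetical samples gathered over one n-minute window.
baseline = {
    "latency_ms": [100, 102, 98],
    "error_rate": [0.010, 0.012, 0.011],
    "cpu_pct": [40, 42, 41],
}
canary = {
    "latency_ms": [103, 101, 99],
    "error_rate": [0.030, 0.028, 0.031],
    "cpu_pct": [41, 43, 40],
}

score = canary_score(baseline, canary)
decision = decide(score, threshold=90)
print(f"score={score:.1f} decision={decision}")
```

In this sample the canary's error rate roughly triples, so one of the three metrics fails and the score falls below the threshold.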
Canary Analysis
Performance
Integration Tests
Chaos
Conformity
Static
Unit Tests
Make it easy to do the
right thing
Static &
Functional
Testing
4. Reduce the cost of change
• Ongoing migrations
• Library propagation
• 100s of micro-services
• Complex dependencies
Continuous, Broad-based Change
Change Engineering
• Locate
• Communicate
• Facilitate
• Automated forensics
– Who last touched x?
– What team?
– Who was their manager?
Who owns this artifact, repository, service?
Whitepages
• Workday wrapper
• App & REST API
• Organization hierarchy
• Metadata
• Change log
(###) ###-####
Krieger
• REST-based service
• Sources
– Whitepages
– Stash
– Edda
– Jenkins
– Spinnaker
– Etc…
{
  "content": {},
  "_links": {
    "employees": {
      "href": "/api/employees/"
    },
    "projects": {
      "href": "/api/projects/"
    },
    "teams": {
      "href": "/api/teams/"
    },
    "applications": {
      "href": "/api/applications/"
    },
    "jobs": {
      "href": "/api/build/jobs"
    },
    "masters": {
      "href": "/api/build/masters"
    },
    "projectDistribution": {
      "href": "/api/teams/projectDistribution"
    }
  }
}
/api/employees?q=jevans
{
  "employees": [
    {
      "id": "241",
      "firstName": "Josh",
      "lastName": "Evans",
      "username": "jevans",
      "email": "jevans@netflix.com",
      "jobTitle": "Director of Operations Engineering",
      "isManager": true,
      "isCurrent": true,
      "title": "Josh Evans (jevans) - Operations Engineering",
      "_links": {
        "self": {
          "href": "/api/employees/241"
        },
        "manager": {
          "href": "/api/employees/117890"
        },
        "team": {
          "href": "/api/teams/f9134a81"
        },
        "projects": {
          "href": "/api/teams/f9134a81/projects"
        }
      }
    }
  ]
}
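The `_links` entries in these responses are what make the "what team?" and "who is their manager?" questions answerable by a program. Here is a minimal sketch of following them, using a trimmed copy of the sample response above instead of a live call. The helper names `link` and `ownership_links` are illustrative and not part of Krieger.

```python
import json

# Trimmed copy of the sample /api/employees?q=jevans response from the slide.
RESPONSE = json.loads("""
{
  "employees": [
    {
      "id": "241",
      "username": "jevans",
      "isManager": true,
      "_links": {
        "self": {"href": "/api/employees/241"},
        "manager": {"href": "/api/employees/117890"},
        "team": {"href": "/api/teams/f9134a81"},
        "projects": {"href": "/api/teams/f9134a81/projects"}
      }
    }
  ]
}
""")


def link(resource, rel):
    """Return the href for a link relation, or None if it is absent."""
    return resource.get("_links", {}).get(rel, {}).get("href")


def ownership_links(response, username):
    """Find an employee by username and return their team and manager links."""
    for employee in response["employees"]:
        if employee["username"] == username:
            return {rel: link(employee, rel) for rel in ("team", "manager")}
    return None


links = ownership_links(RESPONSE, "jevans")
print(links)
```

A client would then request each returned href to resolve the team and manager records.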
• Security vulnerabilities
– Who owns this service?
• Platform updates
– Who is using this version of this library?
Today – Targeted Coordination
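The two questions above reduce to a lookup over an inventory that maps services to owners and dependencies. The sketch below assumes such an inventory exists. The service names, team names, and library versions are hypothetical placeholders, and `affected_owners` is an illustrative helper rather than an existing API.

```python
from collections import defaultdict

# Hypothetical inventory: service -> (owning team, {library: version}).
INVENTORY = {
    "service-a": ("team-x", {"guava": "14.0", "jackson": "2.4"}),
    "service-b": ("team-y", {"guava": "18.0"}),
    "service-c": ("team-x", {"guava": "14.0"}),
}


def affected_owners(inventory, library, version):
    """Group the services using `library` at `version` by owning team."""
    by_team = defaultdict(list)
    for service, (team, deps) in sorted(inventory.items()):
        if deps.get(library) == version:
            by_team[team].append(service)
    return dict(by_team)


# Who needs to hear about a fix for this library version?
targets = affected_owners(INVENTORY, "guava", "14.0")
for team, services in targets.items():
    print(f"notify {team}: {', '.join(services)}")
```

Grouping by team means each owner receives one message listing all of their affected services.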
Automated, efficient technical
project management
• Communication
• Guidance
• Tracking
Low tax for TPMs & engineers
Security Fix Guava
Future – Change Campaigns
5. Develop Partnerships
Beyond supply & demand
• Nearing completion
• Aggressive schedule
• Unexpected delays
• Commitment to June delivery
Spinnaker 1.0 – 1H 2015
• Built their own continuous delivery solution
• Not positioned for engineering-wide support
• Believes in common solutions
Edge Engineering
Partnership in Action
• Strong relationship
• Open discussions about concerns
• Decision - leaned forward
• +2 engineers on Spinnaker
• Successful 1.0 launch
Moving Forward Together
• Containers?
• Achieving alignment
• Collaborative exploration
– Edge, Platform, Operations
– A new paved road?
• Paved Road adopted
– Adding new ones
• Production Ready ongoing
• Migrations easier
• Reputation improving
• Improved
– Service uptime
– Rate of change
Payoffs
Putting it to the test in 2016
• Streaming production & test - EC2 Classic to VPC
• Highly cross-functional
• Complex dependencies
• Zero downtime
Stay tuned…
Five Strategies
1. Reach out
2. Make an impact
3. Make it easy to do the right thing
4. Reduce the cost of change
5. Develop partnerships
Open Sourced!
https://netflix.github.io/
Josh Evans
jevans@netflix.com
@ops_engineering
Questions?
