Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology Orgs

This presentation describes my interpretation of the Why and How of DevOps, and the key findings from my 15 year study of high-performing IT organizations, and how they simultaneously deliver stellar service levels and rapid implementation of new features into the production environment.

Organizations employing DevOps practices such as Google, Amazon, Facebook, Etsy and Twitter are routinely deploying code into production hundreds, or even thousands, of times per day, while providing world-class availability, reliability and security. In contrast, most organizations struggle to do releases more than once every nine months.

I will present how these high-performing organizations achieve this fast flow of work through Product Management and Development, through QA and Infosec, and into IT Operations. By doing so, other organizations can now replicate the extraordinary culture and outcomes enabling their organization to win in the marketplace.

Published in: Business, Technology

  • My name is Gene Kim. My area of passion started when I was the CTO and founder of Tripwire in 1999. I started keeping a list that we called “Gene’s list of people with great kung fu.” These were the organizations that simultaneously…

    In the next 25 minutes, I’m really excited to share with you some of my key learnings, which I hope will not only be applicable to you, but that you’ll be able to put into practice right away to get some amazing results.

    But let me tell you how my journey began…
  • [ picture of messy data center ] Ten minutes into Bill’s first day on the job, he has to deal with a payroll run failure. Tomorrow is payday, and finance just found out that while all the salaried employees are going to get paid, none of the hourly factory employees will. All their records from the factory timekeeping systems were zeroed out. Was it a SAN failure? A database failure? An application failure? Interface failure? Cabling error?
  • Source: http://biobreak.wordpress.com/2010/10/07/games-evangelism-dos-and-donts/
  • Who are they auditing? IT operations.

    I love IT operations. Why? Because when the developers screw up, the only people who can save the day are the IT operations people.
    Memory leak? No problem, we’ll do hourly reboots until you figure that out.

    Who here is from IT operations?

    Bad day:
    Not as prepared for the audit as they thought
    Spending 30% of their time scrambling, generating presentation for auditors
    Or an outage, and the developer is adamant that they didn’t make the change – they’re saying, “it must be the security guys – they’re always causing outages”
    Or, there are 50 systems behind the load balancer, and six systems are acting funny – what’s different, and who made them different?
    Or every server is like a snowflake, each having their own personality

    We as Tripwire practitioners can help them make sure changes are made visible, authorized, deployed completely and accurately, find differences
    Create and enforce a culture of change management and causality
  • EG Parts Unlimited, Inc. DBA Parts Unlimited is in serious trouble. Stock has tumbled 19% in the last 30 days, and is down 52% from its peak three years ago. The company continues to be outmaneuvered by their arch-rival, famous for their ability to anticipate and instantly react to customer needs. Parts Unlimited now trails the competition in sales growth, inventory turns and profitability.

    Parts Unlimited has been promising the release of software called “Phoenix” which – if they can ever get it released – should close the gap. It tightly integrates its retailing and e-commerce channels. With the program already years late, many expect the company to announce another delay in their analyst earnings call next month. 20 million in, years late, and the Board and the Investors are – let’s just say the natives are restless and are looking for heads. Which means not only have some of the players been let go or moved positions, but the board is looking at outsourcing and/or splitting up the company.

    The board has given the team six months to make dramatic improvements.
  • Source: Flickr: birdsandanchors
  • Who’s introducing variance? Well, it’s often these guys. Show me a developer who isn’t causing an outage, I’ll show you one who is on vacation.

    Primary measurement is deploy features quickly – get to market.

    I’ve worked with two of the five largest Internet companies (Google, Microsoft, Yahoo, AOL, Amazon), and I now believe that the biggest differentiator to great time to market is great operations:

    Bad day:
    We do 6 weeks of testing, but deployment still fails. Why? QA environment doesn’t match production
    Or there’s a failure in testing, and no one can agree whether it’s a code failure or an environment failure
    Or changes are made in QA, but no one wrote them down, so they didn’t get replicated downstream in production

    Believe it or not, we as Tripwire practitioners can even help them – make sure environments are available when we need them, that they’re configured correctly the first time, document all the changes, replicate them downstream
  • So who are all these constituencies that we can help, and increase our relevance as Tripwire practitioners and champions?

    How many people here are in infosec?

    Goal: protect critical systems and data
    Safeguard organizational commitments
    Prevent security breaches, help quickly detect and recover from them

    Bad day: no security standards
    No one is complying
    Yes, we’re 3 years behind. “Whaddya gonna do about it?”

    Vs. we (Tripwire owner) can become more relevant and add value by helping infosec leverage all the configuration guidance out there
    Measure variance between production and those known good states
    Trust and verify that when management says they’ve trued up the configurations, they’ve actually done it

    Why? Now, more than ever, there is an ever-increasing number of regulatory and contractual requirements to protect systems and data
  • There are many ways to react to this: like, fear, horror, trying to become invisible… All understandable, given the circumstances…

    Because infosec can no longer take 4 weeks to turn around a security review for application code, or take 6 weeks to turn around a firewall change.
    But, on the other hand, I think it will be the best thing to happen to infosec in the past 20 years. We’re calling this Rugged DevOps, because it’s a way for infosec to integrate into the DevOps process, and be welcomed. And not be viewed as the shrill hysterical folks who slow the business down.
  • Tell story of Amazon, Netflix: they care about availability and security
    It’s not a push, it’s a pull – they’re looking for our help (#1 concern: fear of disintermediation and being marginalized)
  • Eran Feigenbaum

    Director of Security, Google Enterprise
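
The IT operations and infosec notes above talk about finding what is different among servers behind a load balancer and measuring variance between production and a known good state. The following is a minimal sketch of that idea, not Tripwire's actual method: it assumes each server's configuration is already available as a simple key/value mapping, treats the most common configuration as the baseline, and uses hypothetical host names and settings.

```python
import hashlib
from collections import Counter


def fingerprint(config):
    """Hash a config mapping so that identical configs produce identical digests."""
    canonical = "\n".join(f"{key}={config[key]}" for key in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(fleet):
    """Return {host: {setting: (baseline_value, host_value)}} for hosts that
    differ from the most common configuration in the fleet."""
    prints = {host: fingerprint(cfg) for host, cfg in fleet.items()}
    baseline_print, _ = Counter(prints.values()).most_common(1)[0]
    baseline_host = next(h for h, p in prints.items() if p == baseline_print)
    baseline = fleet[baseline_host]

    drift = {}
    for host, cfg in fleet.items():
        if prints[host] == baseline_print:
            continue
        keys = sorted(set(cfg) | set(baseline))
        drift[host] = {
            k: (baseline.get(k), cfg.get(k))
            for k in keys
            if cfg.get(k) != baseline.get(k)
        }
    return drift


# Hypothetical fleet: 50 servers, six of which carry an older library version.
fleet = {f"web{i:02d}": {"openssl": "1.0.1j", "max_conns": "512"} for i in range(1, 51)}
for i in range(45, 51):
    fleet[f"web{i:02d}"] = {"openssl": "1.0.1e", "max_conns": "512"}

drifted = find_drift(fleet)
for host in sorted(drifted):
    print(host, drifted[host])
```

Using the majority configuration as the baseline is only one possible choice; comparing against an explicitly approved "known good" configuration would fit the notes above equally well.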

  • Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology Orgs

    1. 1. @RealGeneKim Why Everyone Needs DevOps Now: My Fifteen Year Journey Studying High Performing IT Organizations Gene Kim Session ID:
    2. 2. IT Operations @RealGeneKim
    3. 3. @RealGeneKim
    4. 4. The Product Managers @RealGeneKim
    5. 5. The Developers @RealGeneKim
    6. 6. @RealGeneKim
    7. 7. @RealGeneKim
    8. 8. @RealGeneKim IT Ops And Dev At War
    9. 9. @RealGeneKim
    10. 10. @RealGeneKim
    11. 11. @RealGeneKim The Downward Spiral…
    12. 12. There Is A Better Way… @RealGeneKim
    13. 13. @RealGeneKim Google, Amazon, Netflix, Spotify, Etsy, Twitter, Facebook…
    14. 14. @RealGeneKim 10 deploys per day Dev & ops cooperation at Flickr John Allspaw & Paul Hammond Velocity 2009 Source: John Allspaw (@allspaw) and Paul Hammond (@ph)
    15. 15. @RealGeneKim
    16. 16. Little bit weird Sits closer to the boss Thinks too hard Pulls levers & turns knobs Easily excited Yells a lot in emergencies Source: John Allspaw (@allspaw) and Paul Hammond (@ph)
    17. 17. Ops who think like devs Devs who think like ops @RealGeneKim Source: John Allspaw (@allspaw) and Paul Hammond (@ph)
    18. 18. @RealGeneKim Dev and Ops Source: John Allspaw (@allspaw) and Paul Hammond (@ph)
    19. 19. DevOps is incomplete, is interpreted wrong, and is too isolated Source: Theo Schlossnagle (@postwait) @RealGeneKim
    20. 20. @RealGeneKim .*Ops Source: Theo Schlossnagle (@postwait)
    21. 21. ^(?<dept>.+)Ops$ @RealGeneKim Source: Theo Schlossnagle (@postwait)
    22. 22. Source: John Jenkins, Amazon.com @RealGeneKim
    23. 23. @RealGeneKim Making Changes When It Matters Most “By installing a rampant innovation culture, we performed 165 experiments in the peak three months of tax season.” “Our business result? Conversion rate of the website is up 50 percent. Employee result? Everyone loves it, because now their ideas can make it to market.” –Scott Cook, Intuit Founder
    24. 24. @RealGeneKim Who Is Doing DevOps?
          • Google, Amazon, Netflix, Etsy, Spotify, Twitter, Facebook …
          • Dynatrace, CSC, IBM, CA, SAP, HP, Microsoft, Red Hat, …
          • GE Capital, Nationwide, BNP Paribas, BNY Mellon, World Bank, Paychex, Intuit …
          • The Gap, Nordstrom, Macy’s, Williams-Sonoma, Target …
          • General Motors, Raytheon, LEGO, Bosch …
          • UK Government, US Department of Homeland Security …
          • Kansas State University…
          Who else?
    25. 25. @RealGeneKim High Performers Are More Agile: 30x more frequent deployments; 8,000x faster lead times than their peers Source: Puppet Labs 2013 State Of DevOps: http://puppetlabs.com/2013-state-of-devops-infographic
    26. 26. @RealGeneKim High Performers Are More Reliable: 2x the change success rate; 12x faster mean time to recover (MTTR) Source: Puppet Labs 2013 State Of DevOps: http://puppetlabs.com/2013-state-of-devops-infographic
    27. 27. @RealGeneKim High Performers Win In The Marketplace: 2x more likely to exceed profitability, market share & productivity goals; 50% higher market capitalization growth over 3 years* Source: Puppet Labs 2014 State Of DevOps
    28. 28. @RealGeneKim Source: Darren Hague (@dhague)
    29. 29. “This book will have a profound effect on IT, just as The Goal did for manufacturing.” –Jez Humble, co-author Continuous Delivery “This is the IT swamp draining manual for anyone who is neck deep in alligators.” –Adrian Cockcroft, Cloud Architect at Netflix “This is The Goal for our decade, and is for any IT professional who wants their life back.” –Charles Betz, IT architect, author “Architecture and Patterns for IT” @RealGeneKim
    30. 30. @RealGeneKim The First Way: Flow
    31. 31. @RealGeneKim “deploys per day” vs. “lead time”
    32. 32. @RealGeneKim “What is your lead time for changes?” “How long does it take to go from code committed to code successfully running in production?”
    33. 33. IT’S A TRAP
    34. 34. @RealGeneKim
    35. 35. @RealGeneKim Create One Step Environment Creation Process
          • Make environments available early in the Development process
          • Make sure Dev builds the code and environment at the same time
          • Create a common Dev, QA and Production environment creation process
    36. 36. @RealGeneKim If I had a magic wand, I’d change the Agile sprints and definition of “done”: “At the end of each sprint, we must have working and shippable code… demonstrated in an environment that resembles production.”
    37. 37. Deploy Smaller Changes, More Frequently * @RealGeneKim Source: http://www.facebook.com/note.php?note_id=14218138919
    38. 38. Deploy Smaller Changes, More Frequently * @RealGeneKim
          • Decouple feature releases from code deployments
          • Deploy features in a disabled state, using feature flags
          • Require all developers check code into trunk daily (at least)
          • Practice deploying smaller changes, which dramatically reduces risk and improves MTTR
    39. 39. @RealGeneKim Experiment: Reducing Batch Size By 50%. And the customer got the feature in half the time! Source: Scott Prugh, Chief Architect, CSG, Inc.
    40. 40. @RealGeneKim “As a lifelong Ops practitioner, I know we need DevOps to make our work humane. In the past, I’ve worked every holiday, on my birthday, my spouse’s birthday, and even on the day my son was born.” Nathan Shimek Engineering Manager, New Context @nathan_shimek
    41. 41. @RealGeneKim Breaking The Bottlenecks In The Flow
          • Environment creation
          • Code deployment
          • Test setup and run (mention @rohansingh)
          • Overly tight architecture
          • Development
          • Product management
    42. 42. @RealGeneKim “In November 2011, running even the most minimal test for CloudFoundry required deploying to 45 virtual machines, which took a half hour. This was way too long, and also prevented developers from testing on their own workstations. By using containers, within months, we got it down to 18 virtual machines so that any developer can deploy the entire system to a single VM in six minutes.” – Elisabeth Hendrickson, Director of Quality Engineering, Pivotal Labs @testobsessed
    43. 43. @RealGeneKim The Problem: Blackboard Learn, 2005-Present [chart: LoC and Commits] Source: David Ashman, Chief Architect, Blackboard, Inc. (@davidbashman)
    44. 44. @RealGeneKim Blackboard Learn Building Blocks Source: David Ashman, Chief Architect, Blackboard, Inc. (@davidbashman)
    45. 45. Top Predictors Of IT Performance (2014)
          • Version control of all production artifacts
          • Continuous integration and deployment
          • Automated acceptance testing
          • Peer-review of production changes (vs. external change approval)
          • High trust culture
          • Proactive monitoring of the production environment
          • Win-win relationship between Dev and Ops
          @RealGeneKim Source: Puppet Labs 2014 State Of DevOps
    46. 46. @RealGeneKim The First Way: Outcomes
          • Creating single repository for code and environments
          • Determinism in the release process
          • Consistent Dev, Test and Production environments, all properly built before deployment begins
          • Features being deployed daily without catastrophic failures
          • Decreased lead time
          • Faster cycle time and release cadence
    47. 47. @RealGeneKim The Second Way: Feedback
    48. 48. @RealGeneKim
    49. 49. @RealGeneKim How many times is the andon cord pulled in a typical day at a Toyota manufacturing plant? 3,500 times per day. Source: http://www.gembapantarei.com/2008/04/how_many_times_do_you_pull_the_andon_cord_each_day.html
    50. 50. @RealGeneKim Why would Toyota do something so disruptive as stopping production thousands of times per day? “It’s the only way we can build 2,000 vehicles per day – that’s one completed vehicle every 55 seconds.”
    51. 51. @RealGeneKim Google Dev And Ops (2013)
          • 15,000 engineers, working on 4,000+ projects
          • All code is checked into one source tree (billions of files!)
          • 5,500 code commits/day
          • 75 million test cases are run daily
          "Automated tests transform fear into boredom." -- Eran Messeri, Google
    52. 52. @RealGeneKim Developers Carry Pagers “We found that when we woke up developers at 2am, defects got fixed faster than ever” – Patrick Lightbody, CEO, BrowserMob “You build it, you run it.” – Werner Vogels CTO, Amazon
    53. 53. @RealGeneKim Developers Carry Pagers “As a developer, there has never been a more satisfying point in my career than when I wrote the code, I pushed the button to deploy it, I watched the metrics to see if it actually worked in production, and fixed it if it broke.” – Tim Tischler Director of Operations Engr, Nike, Inc.
    54. 54. @RealGeneKim Devs Initially Self-Manage Their Own Code Source: Tom Limoncelli (@yesthattom)
    55. 55. @RealGeneKim Return Fragile Services Back To Dev Source: Tom Limoncelli (@yesthattom)
    56. 56. @RealGeneKim Pervasive Production Telemetry “Having a developer add a monitoring metric shouldn’t feel like a schema change.” – John Allspaw, SVP Tech Ops, Etsy
    57. 57. @RealGeneKim
    58. 58. @RealGeneKim People actually look at the logs! (Mention Verizon PCI Data Breach Study)
    59. 59. @RealGeneKim
    60. 60. @RealGeneKim One Of The Highest Predictors Of Performance
    61. 61. @RealGeneKim One Of The Highest Predictors Of Performance
    62. 62. Top Predictors Of IT Performance (2014)
          • Version control of all production artifacts
          • Continuous integration and deployment
          • Automated acceptance testing
          • Peer-review of production changes (vs. external change approval)
          • High trust culture
          • Proactive monitoring of the production environment
          • Win-win relationship between Dev and Ops
          @RealGeneKim Source: Puppet Labs 2014 State Of DevOps
    63. 63. @RealGeneKim The Second Way: Outcomes
          • Defects and security issues getting fixed faster than ever
          • Disciplined automated testing enabling many simultaneous small, agile teams to work productively
          • All groups communicating and coordinating better
          • Everybody is getting more work done
    64. 64. The Third Way: Continual Experimentation And Learning @RealGeneKim
    65. 65. @RealGeneKim Break Things Early And Often “Do painful things more frequently, so you can make it less painful… We don’t get pushback from Dev, because they know it makes rollouts smoother.” – Adrian Cockcroft, Former Architect, Netflix (Now Technology Fellow, Battery Ventures)
    66. 66. @RealGeneKim
    67. 67. @RealGeneKim Inject Failures Often
    68. 68. @RealGeneKim You Don’t Choose Chaos Monkey… Chaos Monkey Chooses You
    69. 69. @RealGeneKim The 2014 AWS Reboot “When we got the news about the emergency EC2 reboots, our jaws dropped. When we got the list of how many Cassandra nodes would be affected, I felt ill. “Then I remembered all the Chaos Monkey exercises we’ve gone through. My reaction was, ‘Bring it on!’” – Christos Kalantzis Netflix Cloud DB Engineering Source: http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
    70. 70. @RealGeneKim The 2014 AWS Reboot “Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes did not reboot successfully. “Netflix customers experienced no downtime that weekend.” – Bruce Wong Netflix Chaos Engineering
    71. 71. @RealGeneKim Allocate 20% Of Cycles To Technical Debt Reduction
    72. 72. “By November 2011, Kevin Scott, LinkedIn’s top engineer, had had enough. The system was taxed as LinkedIn attracted more users, and engineers were burnt out. “To fix the problems, Scott, who’d arrived from Google that February, launched Operation InVersion. “He froze development on new features so engineers could overhaul the computing architecture. “‘We had to tell management we’re not going to deliver anything new while all of engineering works on this project for the next two months,’ Scott says. ‘It was a scary thing.’” @RealGeneKim
    73. 73. @RealGeneKim
    74. 74. @RealGeneKim
    75. 75. Source: Pingdom
    76. 76. @RealGeneKim Why Do I Think This Is Important?
    77. 77. @RealGeneKim The Downward Spiral…
    78. 78. @RealGeneKim
    79. 79. @RealGeneKim Opportunity Cost Of Wasted IT Spending? $2,600,000,000,000.00 per year ($2.6 Trillion US)
    80. 80. @RealGeneKim Our Mission Positively influence the lives of one million IT professionals by 2017.
    81. 81. @RealGeneKim DevOps Enterprise: Lessons Learned
          • On Oct 21-23, we held the DevOps Enterprise Summit, a conference for horses, by horses
          • Macy’s, Disney, GE Capital, Blackboard, Telstra, US Department of Homeland Security, CSG, Raytheon, Ticketmaster, Union Bank of California
          • Leaders driving DevOps transformations talked about
              • The business problem they set out to solve
              • The obstacles they had to overcome
              • The business value they created
    82. 82. @RealGeneKim Want More Learn More? To receive the following:
          • A copy of this presentation
          • A free 140 page excerpt of The Phoenix Project
          • Information on the DevOps Enterprise: Lessons Learned
          • My recommended reading list for enterprise DevOps adoption
          • See early drafts of our upcoming DevOps Cookbook
          Just pick up your phone, and send an email: To: realgenekim@SendYourSlides.com Subject: lisa
    83. 83. Can Large Orgs Be High Performers? Yes. But orgs with 10,000+ employees 40% less likely to be high performing vs. 500 employee orgs… Source: Puppet Labs 2014 State Of DevOps @RealGeneKim
    84. 84. @RealGeneKim Other Side Of Innovation
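
Slide 38 recommends decoupling feature releases from code deployments by deploying features in a disabled state behind feature flags. A minimal sketch of that idea follows. The `FeatureFlags` class, the `bulk_discount` flag name, and the 10% discount are all hypothetical illustrations, not taken from the talk; a real system would likely load flags from configuration or a flag service rather than memory.

```python
class FeatureFlags:
    """In-memory flag store (hypothetical; stands in for a config file or service)."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_enabled(self, name):
        return name in self._enabled


def checkout_total_cents(prices_cents, flags):
    """Sum an order. The discount code path is deployed, but stays dark
    until the flag is switched on."""
    subtotal = sum(prices_cents)
    if flags.is_enabled("bulk_discount") and len(prices_cents) >= 3:
        return subtotal * 90 // 100  # hypothetical 10% discount
    return subtotal


flags = FeatureFlags()          # code is deployed with the feature disabled
order = [1000, 2000, 3000]

total_when_dark = checkout_total_cents(order, flags)
flags.enable("bulk_discount")   # release: a flag flip, not a deployment
total_when_released = checkout_total_cents(order, flags)
flags.disable("bulk_discount")  # rollback is also just a flag flip
total_after_rollback = checkout_total_cents(order, flags)

print(total_when_dark, total_when_released, total_after_rollback)
```

The point of the design is that turning the feature on or off never requires shipping new code, which is what lets deployments stay small and frequent.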

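Slide 32 defines lead time for changes as how long it takes to go from code committed to code successfully running in production. As a rough sketch of how a team might measure it, the example below computes lead times from pairs of commit and deploy timestamps. The timestamps are made up for illustration, and choosing the median as the summary statistic is an assumption, not something the talk prescribes.

```python
from datetime import datetime
from statistics import median


def lead_time_seconds(committed_at, deployed_at):
    """Seconds from code committed to code running in production."""
    if deployed_at < committed_at:
        raise ValueError("deploy time precedes commit time")
    return (deployed_at - committed_at).total_seconds()


# Hypothetical (commit, deploy) timestamp pairs for three changes.
changes = [
    (datetime(2014, 11, 3, 9, 0), datetime(2014, 11, 3, 9, 40)),    # 40 minutes
    (datetime(2014, 11, 3, 10, 0), datetime(2014, 11, 3, 12, 0)),   # 2 hours
    (datetime(2014, 11, 3, 11, 0), datetime(2014, 11, 4, 13, 0)),   # 26 hours
]

lead_times = [lead_time_seconds(commit, deploy) for commit, deploy in changes]
median_hours = median(lead_times) / 3600
worst_hours = max(lead_times) / 3600

print(f"median lead time: {median_hours:.1f} h, worst: {worst_hours:.1f} h")
```

Tracking the worst case alongside the median helps show whether a few slow changes are hiding behind a healthy-looking typical value.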