My personal goal is to prescriptively define 1) what Dev needs to do to become a reliable partner, 2) what IT Operations needs to do to become a reliable partner, and then 3) how they work together to deliver unbelievable value to the business. Of course, the goal is more than happy coexistence. It’s to replicate the Etsy and LinkedIn stories: increase the rate of features that we can put into production, while simultaneously maintaining the reliability, stability, security and survivability of the production environment.
Since 1986, I’ve been a QA engineer writing filesystem QA tests, a system administrator, a developer and an auditor, and I’ve worked in infosec, process design and operations research. Incidentally, I almost moved to Seattle to be on the Microsoft NT network test team in 1991 (TCP/IP stack). For 13 years, I was the founder/CTO of Tripwire, but my primary passion is studying high performing IT operations and security organizations. When I met Chris 3 years ago, he helped me see clearly one of the primary obstacles for successful transformations. I’ll describe this later. First, let me talk about what I meant by “high performers” back in 1999.
The DevOps Cookbook: Codifying Kick-Ass Business Practices That Matter Gene Kim, CISA, TOCICO Jonah #lspe September 19, 2011
Effective pairing of preventive and detective controls
Source: IT Process Institute
Visible Ops: Playbook of High Performers The IT Process Institute has been studying high-performing organizations since 1999 What is common to all the high performers? What is different between them and average and low performers? How did they become great? Answers have been codified in the Visible Ops Methodology The “Visible Ops Handbook” is now available from the ITPI www.ITPI.org
2007: Three Controls Predict 60% Of Performance To what extent does an organization define, monitor and enforce the following? Standardized configuration strategy Process discipline Controlled access to production systems Source: IT Process Institute, 2008
Why Was I So Unsatisfied With The State Of IT Practice? IT operations work continued to be viewed as tactical Information security and compliance programs were sucking all the air out of the room (due to scoping problems) The activation energy for successful improvement programs was still too high The IT operations issues overshadowed by development Issues are amplified 10x in production: outages, findings, lawsuits Technical debt builds up over time IT operations is often the constraint in the organization Linkage of IT performance to business performance not obvious enough “Why doesn’t the business care? I found the pump handle!”
Seeing The Bigger Problem Operations Sees… Fragile applications are prone to failure Long time required to figure out “which bit got flipped” Detective control is a salesperson Too much time required to restore service Too much firefighting and unplanned work Planned project work cannot complete Frustrated customers leave Market share goes down Business misses Wall Street commitments Business makes even larger promises to Wall Street Dev Sees… More urgent, date-driven projects put into the queue Even more fragile code put into production More releases have increasingly “turbulent installs” Release cycles lengthen to amortize “cost of deployments” Failing bigger deployments more difficult to diagnose Most senior and constrained IT ops resources have less time to fix underlying process problems Ever increasing backlog of infrastructure projects that could fix root cause and reduce costs Ever increasing amount of tension between IT Ops and Development These aren’t IT Operations problems…These are business problems!
The Dreaded Disease IT Operations Constipatus (noun) Occurs when IT Operations creates fatal blockages in project flow. Creates blinding pain in Dev organization. Blockage worsens with chronic break/fix and security/compliance work, and when technical debt is never paid off. Causes host to lose energy, become unable to achieve organizational goals. Dangerous to CEOs. Photo credit: http://www.flickr.com/photos/keenepubliclibrary/2435790649/
DevOps Can Break A Core Chronic Conflict In IT * Every IT organization is pressured to simultaneously: Respond more quickly to urgent business needs Provide stable, secure and predictable IT service Words often used to describe ITIL process owners: “hysterical, irrelevant, bureaucratic, bottleneck, difficult to understand, not aligned with the business, immature, shrill, perpetually focused on irrelevant technical minutiae…” Source: The authors acknowledge Dr. Eliyahu Goldratt, creator of the Theory of Constraints and author of The Goal, has written extensively on the theory and practice of identifying and resolving core, chronic conflicts.
Framed This Way, Help Can Come From A Surprising Place The VP Application Development will often have the following complaints: IT Operations is the bottleneck We complete the code, but it takes too long for IT Operations to get the code into production Environments are never available when we need them Releases often cause chaos and disruption to all the other production services Turbulent installs have become the norm: 30 min installs take 3 days Due to slow OS upgrades, applications delayed by 2 quarters We are always late getting features to market
A Reframed IT Operations Problem Statement Increase flow from Dev to Production Increase throughput Decrease WIP Our goal is to create a system of operations that allows Planned work to quickly move to production Ensure service is quickly restored when things go wrong How does this relate to Visible Ops? We focused much on “unplanned work” What’s happening to all the planned work? At any given time, what should IT Ops be working on? Now we are focusing on the flow of planned work
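The link between "increase throughput" and "decrease WIP" on the slide above can be made concrete with Little's Law from queueing theory: average cycle time equals average work in progress divided by average throughput. The sketch below is only an illustration; the function name and the numbers are assumptions, not figures from the talk.

```python
def average_cycle_time(wip_items: float, throughput_per_week: float) -> float:
    """Little's Law: average cycle time = average WIP / average throughput.

    Returns the average time (in weeks) a piece of planned work spends
    between entering the queue and reaching production.
    """
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_week


# Hypothetical numbers: 40 changes in flight, 10 completed per week.
before = average_cycle_time(40, 10)

# Halving WIP at the same throughput halves the average cycle time.
after = average_cycle_time(20, 10)

print(f"before: {before} weeks, after: {after} weeks")
```

With throughput held constant, cutting WIP is the direct lever on how quickly planned work reaches production, which is why the two bullets appear together.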
Goal #1: Decrease Cycle Time Of Releases Create determinism in the release process Move packaging responsibility to development Release early and often Decrease cycle time Reduce deployment times from 6 hours to 45 minutes Refactor deployment process that had 1300+ steps spanning 4 weeks Never again “fix forward,” instead “roll back,” escalating any deviation from plan to Dev Verify for all handoffs (e.g., correctness, accuracy, timeliness, etc…) Ensure environments are properly built before deployment begins Control code and environments down the preproduction runways Hold Dev, QA, Int, and Staging owners accountable for integrity
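One way to read "ensure environments are properly built before deployment begins" and "escalating any deviation from plan to Dev" is as a gate that compares the environment actually built against the one that was planned. The following is a minimal sketch under that assumption; `EnvironmentSpec`, `environment_deviations` and `ready_to_deploy` are hypothetical names, and the package data is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical description of one environment (Dev, QA, Int, Staging)."""
    name: str
    os_version: str
    packages: Dict[str, str] = field(default_factory=dict)  # package -> version


def environment_deviations(expected: EnvironmentSpec,
                           actual: EnvironmentSpec) -> List[str]:
    """List every difference between the planned and the built environment."""
    deviations = []
    if expected.os_version != actual.os_version:
        deviations.append(
            f"os: expected {expected.os_version}, found {actual.os_version}")
    # Check packages that are missing, unexpected, or at the wrong version.
    for pkg in sorted(set(expected.packages) | set(actual.packages)):
        want = expected.packages.get(pkg)
        have = actual.packages.get(pkg)
        if want != have:
            deviations.append(f"{pkg}: expected {want}, found {have}")
    return deviations


def ready_to_deploy(expected: EnvironmentSpec, actual: EnvironmentSpec) -> bool:
    """Deployment may begin only when there are no deviations at all."""
    return not environment_deviations(expected, actual)


planned = EnvironmentSpec("staging", "rhel-5.6",
                          {"httpd": "2.2.3", "openssl": "0.9.8e"})
built = EnvironmentSpec("staging", "rhel-5.6",
                        {"httpd": "2.2.3", "openssl": "0.9.8b"})

for problem in environment_deviations(planned, built):
    print("escalate to Dev:", problem)
```

The gate reports deviations instead of patching them in place, which matches the slide's preference for rolling back and escalating over fixing forward.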
Goal #2: Increase Production Rigor Define what work is and where work can come from Protect the integrity of the work queue (e.g., are checks being written that won’t clear?) To preserve and increase throughput, elevate preventive projects and maintenance tasks Document all work, changes and outcomes so that it is repeatable Ops builds Agile standardized deployment stories, to be completed after Dev sprints are complete Maintain adequate situational awareness so that incidents can be quickly detected and corrected Standardize unplanned work and escalations Always seek to eradicate unplanned work and increase throughput Lean Principle: “Better -> Faster -> Cheaper”
Some Principles Because operations is constrained, it is always better to prevent than recover Operations work must be planned We strive to have continual situational awareness We will strive to control as many dimensions of our work as possible We ruthlessly pursue an understanding of any deviations from normal We expect systems in operations to never stop working We never do one-offs (they must be exceptions, not the rule) We require determinism to enable resiliency We strive for the improvement and mastery of the environment
Creating A System Of Operations Inj: 1: Projects: ensure rapid project releases from Development Inj: 1.1: Create effective centralized work demand queue Inj: 1.2: Protect integrity of work queue (e.g., write only checks that will clear) Inj: 1.3: Release early and often: Freeze projects if necessary, choking materials release to reduce WIP, allow longer runways of work Inj: 1.4: Elevate any deviations or incidents that stop flow of work Inj: 1.5: Standardize product deployments with Development Inj: 1.6: Continually seek ways to increase flow Inj: 2: Ensure reliable IT operations Inj: 2.1: When failures occur, detect/correct quickly inside the plant (e.g., production) Inj: 2.2: Prevent failures (e.g., maintenance) Inj: 2.3: Study and create projects to reduce/eradicate unplanned work Inj: 2.4: Seek ways to increase production Inj: 3: Subordinate infosec/PMO/etc. to enable Inj 1 & 2
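The centralized work demand queue and the idea of "choking materials release to reduce WIP" can be sketched as a single queue that releases the next job only while work in progress is under a limit. This is a minimal illustration, not the method described in the talk; the `WorkQueue` class, its method names and the job names are all hypothetical.

```python
from collections import deque
from typing import Optional


class WorkQueue:
    """Single intake queue that refuses to release work past a WIP limit."""

    def __init__(self, wip_limit: int):
        if wip_limit < 1:
            raise ValueError("wip_limit must be at least 1")
        self.wip_limit = wip_limit
        self.backlog = deque()      # accepted work, oldest first
        self.in_progress = set()    # work currently released to Ops

    def submit(self, job: str) -> None:
        """All work enters through the one queue."""
        self.backlog.append(job)

    def release(self) -> Optional[str]:
        """Release the oldest job, or None if WIP is full or nothing waits."""
        if len(self.in_progress) >= self.wip_limit or not self.backlog:
            return None
        job = self.backlog.popleft()
        self.in_progress.add(job)
        return job

    def complete(self, job: str) -> None:
        """Finishing a job frees capacity for the next release."""
        self.in_progress.remove(job)


queue = WorkQueue(wip_limit=2)
for job in ["deploy billing fix", "patch web tier", "upgrade database"]:
    queue.submit(job)

first = queue.release()
second = queue.release()
blocked = queue.release()   # WIP limit reached, so nothing is released
queue.complete(first)
third = queue.release()     # capacity freed, the next job is released
```

Because release is the only way work starts, the queue also protects its own integrity: a commitment is made only when there is capacity to honor it.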
The Prescriptive DevOps Cookbook Capture and codify how to start and finish successful DevOps transformations Create isomorphic mapping between plant floors and IT shops Co-authoring with Patrick Debois, Mike Orzen, John Willis Describe in detail how to replicate the transformations described in “When IT Fails: The Novel” Goals How do Development, IT Operations and Infosec become dependable partners How do they work together to solve business problems (and Infosec, too)
Goal Statement Build a system of work where Dev and Ops can be relied upon so that they work together to simultaneously achieve: fast flow of features into production deliver services in production that are: Attributes of Rugged DevOps Scalability, availability, survivability, sustainability, security, supportability, defensibility
Underpinning Principles Agile: increase velocity Lean: reduce WIP Systems thinking: Dev, Test, IT Operations, Project Management, Information Security Lean: implementing effective countermeasures
Cookbook Outline Part 1: Enable IT Operations to become a dependable partner Part 2: Enable Dev to become a dependable partner Part 3: Dev and IT Operations to create breakthrough results
Part 1: IT Ops Enable fast, repeatable and predictable flow of planned work Create single work queue, master list of commitments, master production schedule Create catalog of acceptable work: bill of materials, work centers, routings Runners, repeaters and strangers Create job release function
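In Lean usage, "runners" are frequent routine jobs, "repeaters" recur regularly but less often, and "strangers" are rare or one-off requests. A catalog of acceptable work could be seeded by counting how often each type of work appears in past tickets. The sketch below assumes that approach; the function name, the thresholds and the sample history are hypothetical and would need tuning for a real shop.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_work(history: Iterable[str],
                  runner_min: int = 12,
                  repeater_min: int = 3) -> Dict[str, str]:
    """Label each work type by how often it occurred in the history.

    runner_min and repeater_min are assumed thresholds (occurrences in the
    observed period), not values taken from the talk.
    """
    labels = {}
    for work_type, count in Counter(history).items():
        if count >= runner_min:
            labels[work_type] = "runner"
        elif count >= repeater_min:
            labels[work_type] = "repeater"
        else:
            labels[work_type] = "stranger"
    return labels


# Made-up ticket history for one period.
history = (["password reset"] * 20
           + ["database schema change"] * 4
           + ["datacenter move"])

catalog = classify_work(history)
for work_type, label in catalog.items():
    print(f"{label:9} {work_type}")
```

Runners are the natural first candidates for standardized routings, while strangers are the ones worth questioning before they enter the queue at all.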
Part 1: IT Ops Minimize disruption from unplanned work Standardize unplanned work: make it repeatable Modify first response: ensure constrained resources have all data at hand to diagnose Elevate preventive activities to reduce incidents Stories about reducing reliance on Brent
Part 2: Dev Continuous deployment and integration in place Working through some assumptions about Agile methods in place
Part 3: DevOps Pick a pilot project Baseline current performance Create organization Someone needs to see the end-to-end flow from Dev to Production to Incident Enable correct feedback loops
Part 3: DevOps Dev and Ops work together in Sprint 0 and 1 to create code and environments Create environment that Dev deploys into Create downstream environments: QA, Staging, Production Create the Agile information radiator Integrate infosec and QA into daily sprint activities
Part 3: DevOps Embed Ops person into Dev structure Describes non-functional requirements, use cases and stories from Ops Has a vote like other team members Responsible for bringing Ops experiences into “quality at the source” Has special responsibility for pulling the Andon cord
Part 3: DevOps Potentially decouple production releases from Sprint boundaries Issue: how to enable deployments that are more frequent than the typical 1 or 2 week intervals Sprints vs. Kanbans
Part 3: DevOps Put Dev into Ops escalation chain MobBrowser case study: “Waking up developers at 3am is a great feedback loop: defects get fixed very quickly” Determine when SOD is a control being relied upon
The Theory of Constraints Approach To Visible Ops Dr. Goldratt wrote The Goal in 1984, describing Alex’s challenge to fix his plant’s cost and due date issues within 90 days Some tenets that went against common wisdom: Every flow of work has a constraint/bottleneck Any improvement not made at the bottleneck is merely an illusion Fallacy of cost accounting as operational management tool
When IT Fails: The Novel Day 1 Steve Masters, CEO Dick Landry, CFO Parts Unlimited $4B revenue/year
When IT Fails: The Novel Day 2 Bill Palmer, VP IT Operations (promoted) Wes Davis, Director, Distributed Systems Patty McKee, Director, IT Service Support Services The payroll outage All salaried employees will get paid, but not the hourlies CISO put in tokenization application in the factories, breaking database query that uses SSN IT Ops thought it was a SAN firmware upgrade failure All HR apps go down CFO is on front page of news, apologizing to community
When IT Fails: The Novel Day 4 Chris Allers, VP Application Development Sarah Moulton, SVP Retail Products “We can deploy by next week by cutting some corners, but IT Ops is in the way… again…” “Bill, your team lacks a sense of urgency. We must go. We’ve already bought the newspaper ads – they’re bought, paid for and being printed…”
When IT Fails: The Novel Day 3 Nancy Mailer, Chief Audit Executive John Pesche, CISO IT Operations has 980 IT general control deficiencies on critical financial systems, potentially dooming financial statement to having a footnote. Needs management response in 1 week. Bill grapples with who to put on the project. 1 yr of work, just to fix issues, even without Phoenix.
The Goal For IT: Day 10 The Deployment Database conversion, the point of no return, taking 1000x longer. In store POS won’t come up by Sat 8am, maybe by next Tuesday Emptying shopping cart shows last successful order credit card #
Resources From the IT Process Institute www.itpi.org Both Visible Ops Handbooks ITPI IT Controls Performance Study “Lean IT” by Orzen and Bell Winner of the Shingo Prize 2011 “Inspired: How To Create Products That Customers Love” by Cagan “Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation” by Humble, Farley Follow Gene Kim @RealGeneKim mailto:firstname.lastname@example.org http://realgenekim.me/blog
Call To Action If you’re interested in reviewing early versions of “When IT Fails: The Novel,” email me. If you’re interested in helping build or review the DevOps Cookbook, email me. I’m email@example.com Thank you for allowing me to join your tribe!
About Gene Kim I’ve spent the last 12 years studying high performing IT organizations, trying to understand: What do they have in common? What is present in successful transformations, absent in unsuccessful transformations? How do we lower the activation energy required to create the transformations? Founder and former CTO of Tripwire, Inc. Co-author of Visible Ops Handbook, Security Visible Ops Handbook Active researcher Co-founder of IT Process Institute Committee member of Institute of Internal Auditors Leader of PCI Security Standards Council Scoping SIG