ServiceNow ITIL at Ludicrous Speeds - Rugged DevOps
1. ITIL At Ludicrous Speeds:
Rugged DevOps and More……
Gene Kim
Author, Visible Ops Handbook
Knowledge12
May 16, 2012
Session ID:
@RealGeneKim, genek@realgenekim.me
2. Where Did The High Performers Come From?
3. Visible Ops: Playbook of High Performers
The IT Process Institute has been studying high-performing organizations since 1999
What is common to all the high performers?
What is different between them and average and low performers?
How did they become great?
Answers have been codified in the Visible Ops Methodology
www.ITPI.org
18. Why DevOps Is So Important To Me
19. Since 1999, We’ve Benchmarked 1500+ IT Organizations
Source: EMA (2009)
Source: IT Process Institute (2008)
20. High Performing IT Organizations
High performers maintain a posture of compliance
Fewest number of repeat audit findings
One-third amount of audit preparation effort
High performers find and fix security breaches faster
5 times more likely to detect breaches by automated control
5 times less likely to have breaches result in a loss event
When high performers implement changes…
14 times more changes
One-half the change failure rate
One-quarter the first fix failure rate
10x faster MTTR for Sev 1 outages
When high performers manage IT resources…
One-third the amount of unplanned work
8 times more projects and IT services
6 times more applications
Source: IT Process Institute, 2008
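Two of the comparisons above, change failure rate and MTTR, are simple ratios. As a minimal sketch, assuming each change record carries a "failed" flag and each Sev 1 outage a start and restore time (these record shapes are hypothetical and are not the survey's methodology), they could be computed like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    deployed_at: datetime
    failed: bool  # assumption: True if the change caused an incident or was rolled back


@dataclass
class Outage:
    started: datetime
    restored: datetime


def change_failure_rate(changes: list[Change]) -> float:
    """Fraction of changes that failed (0.0 when there were no changes)."""
    if not changes:
        return 0.0
    return sum(c.failed for c in changes) / len(changes)


def mean_time_to_restore(outages: list[Outage]) -> timedelta:
    """Average time from the start of an outage to restoration of service."""
    if not outages:
        return timedelta(0)
    total = sum((o.restored - o.started for o in outages), timedelta(0))
    return total / len(outages)
```

These numbers only mean something when compared against the same team's history, or against peers measured the same way.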
21. 2007: Three Controls Predict 60% Of Performance
To what extent does an organization define, monitor and enforce the following?
Standardized configuration strategy
Process discipline
Controlled access to production systems
Source: IT Process Institute, 2008
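Of the three controls, a standardized configuration strategy is the easiest to monitor mechanically: compare each production system against a known-good baseline and report every difference. A minimal sketch, assuming configurations have already been flattened into key/value dictionaries (the setting names in the test are hypothetical):

```python
from typing import Optional


def config_drift(baseline: dict[str, str],
                 actual: dict[str, str]) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {setting: (expected, found)} for every setting that differs from the baseline.

    A setting present on only one side is reported with None for the missing value.
    """
    drift = {}
    for key in sorted(baseline.keys() | actual.keys()):
        expected, found = baseline.get(key), actual.get(key)
        if expected != found:
            drift[key] = (expected, found)
    return drift
```

An empty result is the "enforce" half of the control: it shows the system matches the standard, not merely that someone said it does.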
22. Tough Love From Ari Balogh
23. The Downward Spiral
Operations Sees…
Too many fragile and insecure applications in production
Too much time required to restore service
Too much firefighting and unplanned work
Planned project work cannot complete
Frustrated customers leave
Market share goes down
Business misses Wall Street commitments
Business makes even larger promises to Wall Street
Dev Sees…
More urgent, date-driven projects put into the queue
Even more fragile code (less secure) put into production
More releases have increasingly “turbulent installs”
Release cycles lengthen to amortize “cost of deployments”
Bigger deployment failures
More time spent on firefighting
Ever-increasing backlog of work that could help the business win
Ever-increasing amount of tension between IT Ops, Development, Design…
These aren’t ITSM or IT Operations problems…
These are business problems!
24. My Mission: Figure Out How To Break The IT Core Chronic Conflict
Every IT organization is pressured to simultaneously:
Respond more quickly to urgent business needs
Provide stable, secure and predictable IT service
Words often used to describe process improvement:
“hysterical, irrelevant, bureaucratic, bottleneck, difficult to understand, not
aligned with the business, immature, shrill, perpetually focused on irrelevant
technical minutiae…”
Source: The authors acknowledge Dr. Eliyahu Goldratt, creator of the Theory of Constraints and author of The Goal, who has written extensively on the theory and practice of identifying and resolving core, chronic conflicts.
25. Good News: It Can Be Done
Bad News: You Can’t Do It Alone
30. Product Management And Design
Source: Flickr: birdsandanchors
31. DevOps: It’s A Real Movement
I would never do another startup that didn’t employ DevOps-like principles
It’s not just startups: it’s happening in the enterprise and in the public sector, too
I believe working in DevOps environments will be a necessary skillset 5 years from now
Agile helped Dev regain trust with the business; DevOps will help all of IT
IT becoming more automated relies on DevOps practices (especially PaaS)
32. If I Could Wave A Magic Wand, Everyone Will…
Become conversant with DevOps and recognize the practices when you see them
Be energized about how ITSM practitioners can contribute in this organizational journey
Leave with some concrete steps to get some great outcomes
Become a part of a team that starts putting DevOps practices into place
33. How Do You Do DevOps?
34. The Prescriptive DevOps Cookbook
“DevOps Cookbook” Authors
Patrick Debois, Mike Orzen, John Willis
Goals
Codify how to start and finish DevOps transformations
How Development, IT Operations and Infosec become dependable partners
Describe in detail how to replicate the transformations described in “When IT Fails: The Novel”
35. “The Goal” by Dr. Eliyahu Goldratt
39. The First Way:
Systems Thinking
(Business) → (Customer)
40. The First Way:
Systems Thinking (Left To Right)
Don’t pass defects downstream
Don’t optimize locally
Always increase flow: elevate bottlenecks, reduce WIP, throttle release of work, reduce batch sizes
Understanding where reliance is placed
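One way to see why reducing WIP increases flow is Little's Law, a standard queueing result that is not on the slide: in a stable system, average cycle time equals average work in process divided by average throughput. At a fixed throughput, halving WIP therefore halves how long each item waits. A minimal sketch with hypothetical numbers:

```python
def average_cycle_time(avg_wip: float, avg_throughput_per_day: float) -> float:
    """Little's Law rearranged: average time in system = average WIP / average throughput.

    Assumes a stable system (work arrives about as fast as it is completed),
    measured over a long enough period. The result is in days.
    """
    if avg_throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return avg_wip / avg_throughput_per_day


# Hypothetical team that finishes 4 items per day.
before = average_cycle_time(avg_wip=40, avg_throughput_per_day=4)  # 10.0 days
after = average_cycle_time(avg_wip=20, avg_throughput_per_day=4)   # 5.0 days
```

Smaller batches and a throttled release of work are two ways to bring WIP down without first needing more throughput.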
41. Phase 1: Extend the Agile CI/CR Processes
Create a one-step Dev, Test and Production environment creation procedure in Sprint 0
Create the one-step automated code deployment procedure
Properly integrate release, configuration and change management into the value stream (as well as QA and infosec)
Ensure developers don’t leave until the production change is successful
Assign an Ops person to the Dev team
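The value of a one-step deployment procedure is that a single command runs every step in the same order every time, and it halts on the first error instead of leaving a half-finished deployment behind. A minimal sketch, assuming each step is an external command. The step script names are hypothetical placeholders:

```python
import subprocess
import sys
from typing import Sequence


def deploy(steps: Sequence[Sequence[str]]) -> bool:
    """Run each deployment step (an argv list) in order.

    Stops at the first step that exits non-zero, so later steps never run
    against a half-deployed environment. Returns True only if every step succeeded.
    """
    for step in steps:
        result = subprocess.run(list(step), capture_output=True, text=True)
        if result.returncode != 0:
            print(f"deploy halted at: {' '.join(step)}\n{result.stderr}", file=sys.stderr)
            return False
    return True


# Hypothetical step scripts; each organization supplies its own.
DEPLOY_STEPS = [
    ["./build_package.sh"],
    ["./provision_environment.sh", "production"],
    ["./push_release.sh"],
    ["./run_smoke_tests.sh"],
]
```

Because the CI server and whoever is on call would both run the same deploy(DEPLOY_STEPS), a successful result also gives developers a clear signal that the production change is done.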
42. Definition: Kanban Board
Signaling tool to reduce WIP and increase flow
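A Kanban board signals by refusing to start new work when a column is full. That forces the team to finish or unblock what is already in progress. A minimal sketch of that rule. The column names and limits are illustrative assumptions:

```python
class KanbanBoard:
    """Columns with WIP limits: a card may only move into a column that has room."""

    def __init__(self, limits: dict[str, int]):
        self.limits = dict(limits)  # column -> maximum number of cards allowed
        self.columns: dict[str, list[str]] = {name: [] for name in limits}

    def has_room(self, column: str) -> bool:
        return len(self.columns[column]) < self.limits[column]

    def add(self, card: str, column: str) -> bool:
        """Place a new card; refused (False) if the column is at its WIP limit."""
        if not self.has_room(column):
            return False
        self.columns[column].append(card)
        return True

    def pull(self, card: str, src: str, dst: str) -> bool:
        """Move a card downstream; refused (False) if dst is at its WIP limit.

        A refusal is the signal: finish or unblock work in dst before starting more.
        """
        if card not in self.columns[src]:
            raise KeyError(f"{card!r} is not in column {src!r}")
        if not self.has_room(dst):
            return False
        self.columns[src].remove(card)
        self.columns[dst].append(card)
        return True
```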
43. The First Way:
Systems Thinking: ITSM Insurgency
Have someone attend the daily Agile standups
Gain awareness of what the team is working on
Find the automated infrastructure project team (e.g., Puppet, Chef)
Release managers can provide hardening guidance
Integrate and extend their production configuration monitoring
Find where code packaging is performed
Integrate security testing pre- and post-deployment
Integrate testing into the continuous integration and release process
Add security test scripts to the automated test library
Define which changes/deploys cannot be made without triggering a full retest
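The rule about which changes cannot be made without a full retest can live as executable policy in the pipeline rather than on a wiki page. A minimal sketch, assuming the rule is expressed as path patterns matched against the files a change touches. The patterns are hypothetical examples:

```python
from fnmatch import fnmatch
from typing import Iterable, Sequence

# Hypothetical patterns; each organization decides which areas demand a full retest.
# Note: in fnmatch, "*" also matches "/", so "auth/*" covers nested files too.
FULL_RETEST_PATTERNS = (
    "auth/*",            # authentication and session handling
    "crypto/*",          # cryptographic code
    "config/firewall*",  # network exposure
    "db/migrations/*",   # schema changes
)


def requires_full_retest(changed_paths: Iterable[str],
                         patterns: Sequence[str] = FULL_RETEST_PATTERNS) -> bool:
    """True if any changed file matches a pattern that mandates a full regression and security retest."""
    return any(fnmatch(path, pattern) for path in changed_paths for pattern in patterns)
```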
44. The First Way:
Outcomes
Determinism in the release process
A single repository for code and environments
Consistent Dev, QA, Int, and Staging environments, all properly built before deployment begins
Decreased cycle time
Reduced deployment times from 6 hours to 45 minutes
Refactored a deployment process that had 1300+ steps spanning 4 weeks
Faster release cadence
46. The Second Way:
Amplify Feedback Loops (Right to Left)
Expose visual data so everyone can see how their decisions affect the entire system
Get Development closer to Operations and customers
Create a reliable system of work that improves itself
47. Phase 2: Extend Release Process And Create Right -> Left Feedback Loops
Embed Dev into the Ops escalation process
Invite Dev to post-mortem/root cause analysis meetings
Have Dev cross-train IT Operations
Ensure application monitoring/metrics are in place to aid Ops and Infosec work (e.g., incident/problem management)
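For monitoring to aid incident and problem management, the application itself has to report what it is doing. A minimal in-process sketch, assuming a hypothetical Metrics helper and illustrative metric names such as login.failed. A production system would more likely export these values through whatever monitoring agent is already in place:

```python
import time
from collections import Counter
from contextlib import contextmanager


class Metrics:
    """In-process counters and timings that an application could expose to Ops and Infosec."""

    def __init__(self):
        self.counters = Counter()
        self.timings: dict[str, list[float]] = {}

    def incr(self, name: str, amount: int = 1) -> None:
        """Count an event, e.g. a failed login or a payment error."""
        self.counters[name] += amount

    @contextmanager
    def timed(self, name: str):
        """Record how long the wrapped block took, in seconds, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings.setdefault(name, []).append(time.perf_counter() - start)

    def snapshot(self) -> dict:
        """Current counts plus the slowest observed duration per timed operation."""
        return {
            "counters": dict(self.counters),
            "max_seconds": {name: max(values) for name, values in self.timings.items()},
        }
```

A spike in a counter like login.failed is as useful to Infosec as a slow db.query timing is to Ops, which is why both live in the same snapshot.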
48. The Second Way:
Amplify Feedback Loops: ITSM Insurgency
Find areas in the incident and problem management processes where Development knowledge could help
Ensure that countermeasures are captured in the Agile backlog
Find that developer who really cares about the production environment
49. The Second Way:
Outcomes
Defects and security issues getting fixed faster than ever
Reusable Ops and Infosec user stories now part of the Agile process
All groups communicating and coordinating better
Everybody is getting more work done
50. The Third Way:
Culture Of Continual Experimentation And Learning
51. The Third Way:
Culture Of Continual Experimentation And Learning
Foster a culture that rewards:
Experimentation (taking risks) and learning from failure
Repetition is the prerequisite to mastery
Why?
You need a culture that keeps pushing into the danger zone
And have the habits that enable you to survive in the danger zone
52. You Don’t Choose Chaos Monkey…
Chaos Monkey Chooses You
53. Phase 3: Organize Dev and Ops To Achieve Organizational Goals
Allocate 20% of Dev cycles to non-functional requirements
Integrate fault injection and resilience into design, development and production (e.g., Chaos Monkey)
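Chaos Monkey is Netflix's tool for randomly terminating production instances, so that engineers design for failure and practice recovering from it routinely. The sketch below illustrates only the core decision logic: pick a random victim, then check that the service keeps enough capacity. It is not Netflix's implementation. The instance ids, the probability parameter and the min_healthy threshold are hypothetical, and a real tool would call the hosting platform's terminate API at the point where this code just returns the victim:

```python
import random
from typing import Optional, Sequence


def pick_victim(instances: Sequence[str], rng: random.Random,
                probability: float = 1.0) -> Optional[str]:
    """With the given probability, choose one instance to terminate this round.

    Returns the chosen instance id, or None if nothing is terminated.
    (A real tool would now call the platform's terminate API; not shown.)
    """
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(instances)


def still_has_capacity(instances: Sequence[str], victim: Optional[str], min_healthy: int) -> bool:
    """After losing the victim, does the group still meet its minimum healthy count?"""
    remaining = [i for i in instances if i != victim]
    return len(remaining) >= min_healthy
```

Passing in a seeded random.Random makes a given "attack" reproducible, which helps when rehearsing the cleanup afterwards.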
54. The Third Way:
Culture Of Continual Experimentation And Learning: ITSM
Ensure that process improvement projects are in the Agile backlog
Make technical debt visible
Help prioritize work against features and other non-functional requirements
Release your Chaos Monkey
Rehearse cleaning up after the Chaos Monkey
Find processes that waste everyone’s time
56. The Third Way:
Outcomes
Technical debt is being paid off
Exploitable attack surface area decreases
Continual reduction of unplanned work
More cycles for planned work
More resilient code and environments
Balancing nimbleness and practiced repetition
Enabling wider range of risk/reward balance
74. And Do More With Less Effort…
75. This Is An Important Problem
Operations Sees…
Fragile applications are prone to failure
Long time required to figure out “which bit got flipped”
Detective control is a salesperson
Too much time required to restore service
Too much firefighting and unplanned work
Urgent security rework and remediation
Planned project work cannot complete
Frustrated customers leave
Market share goes down
Business misses Wall Street commitments
Business makes even larger promises to Wall Street
Dev Sees…
More urgent, date-driven projects put into the queue
Even more fragile code (less secure) put into production
More releases have increasingly “turbulent installs”
Release cycles lengthen to amortize “cost of deployments”
Bigger deployment failures are more difficult to diagnose
Most senior and constrained IT ops resources have less time to fix underlying process problems
Ever-increasing backlog of work that could help the business win
Ever-increasing amount of tension between IT Ops, Development, Design…
78. If I Could Wave A Magic Wand, Everyone Will…
Become conversant with DevOps and recognize the practices when you see them
Be energized about how ITSM practitioners can contribute in this organizational journey
Leave with some concrete steps to get some great outcomes
Become a part of a team that starts putting DevOps practices into place
…And fill out the survey forms!
79. When IT Fails: The Novel and The DevOps Cookbook
Coming in July 2012
“In the tradition of the best MBA case studies, this book should be mandatory reading for business and IT graduates alike.”
Paul Muller, VP Software Marketing, Hewlett-Packard
“The greatest IT management book of our generation.”
Branden Williams, CTO Marketing, RSA
Gene Kim, Tripwire founder, Visible Ops co-author
80. When IT Fails: The Novel and The DevOps Cookbook
Our mission is to positively affect the lives of 1 million IT workers by 2017
If you would like the “Top 10 Things Infosec Needs To Know About DevOps,” sample chapters and updates on the book:
Gene Kim, Tripwire founder, Visible Ops co-author
Sign up at http://itrevolution.com
Email genek@realgenekim.me
Hand me a business card
Editor's Notes
There are many ways to react to this, like fear, horror, or trying to become invisible… All understandable, given the circumstances…
Tell the story of Amazon and Netflix: they care about availability and security. It’s not a push, it’s a pull; they’re looking for our help (#1 concern: fear of disintermediation and being marginalized).
How each side actively impedes the achievement of the other’s goals.
Who are they auditing? IT operations. I love IT operations. Why? Because when the developers screw up, the only people who can save the day are the IT operations people. Memory leak? No problem, we’ll do hourly reboots until you figure that out. Who here is from IT operations? Bad day: they were not as prepared for the audit as they thought, and they spend 30% of their time scrambling to generate presentations for the auditors. Or there’s an outage, and the developer is adamant that they didn’t make the change; they’re saying, “it must be the security guys, they’re always causing outages.” Or there are 50 systems behind the load balancer, and six of them are acting funny: what’s different, and who made them different? Or every server is like a snowflake, each with its own personality. We as Tripwire practitioners can help them make sure changes are made visible, authorized, and deployed completely and accurately, and find the differences. Create and enforce a culture of change management and causality.
Who’s introducing variance? Well, it’s often these guys. Show me a developer who isn’t causing an outage, and I’ll show you one who is on vacation. Their primary measurement is deploying features quickly: getting to market. I’ve worked with two of the five largest Internet companies (Google, Microsoft, Yahoo, AOL, Amazon), and I now believe that the biggest differentiator for great time to market is great operations. Bad day: we do 6 weeks of testing, but deployment still fails. Why? The QA environment doesn’t match production. Or there’s a failure in testing, and no one can agree whether it’s a code failure or an environment failure. Or changes are made in QA, but no one wrote them down, so they didn’t get replicated downstream in production. Believe it or not, we as Tripwire practitioners can even help them: make sure environments are available when we need them and configured correctly the first time, document all the changes, and replicate them downstream.
So who are all these constituencies that we can help, increasing our relevance as Tripwire practitioners and champions? How many people here are in infosec? Goal: protect critical systems and data, safeguard organizational commitments, prevent security breaches, and help quickly detect and recover from them. Bad day: there are no security standards. No one is complying. Yes, we’re 3 years behind. “Whaddyagonna do about it?” Versus: we (Tripwire owners) can become more relevant and add value by helping infosec leverage all the configuration guidance out there, measure variance between production and those known good states, and trust and verify that when management says “we’ve trued up the configurations,” they’ve actually done it. Why? Now, more than ever, there is an ever-increasing number of regulatory and contractual requirements to protect systems and data.
My personal goal is to prescriptively define 1) what Dev needs to do to become a reliable partner, 2) what IT Operations needs to do to become a reliable partner, and then 3) how they work together to deliver unbelievable value to the business. Of course, the goal is more than happy coexistence. It’s to replicate the Etsy and LinkedIn stories: increase the rate of features that we can put into production, while simultaneously maintaining the reliability, stability, security and survivability of the production environment.
[ picture of messy data center ] Ten minutes into Bill’s first day on the job, he has to deal with a payroll run failure. Tomorrow is payday, and finance just found out that while all the salaried employees are going to get paid, none of the hourly factory employees will. All their records from the factory timekeeping systems were zeroed out.Was it a SAN failure? A database failure? An application failure? Interface failure? Cabling error?