DevOps Roadtrip Final Speaking Deck

DENVER - SEATTLE - SAN FRANCISCO - MINNEAPOLIS - NEW YORK CITY

#DevOpsRoadTrip
Wifi:
ssid: Skyline
pass: Sounders15

JASON HAND |
DevOps Evangelist
• Holds over 15 years of experience
as a developer, system
administrator, and support specialist
• Fully immersed in the world of agile
development and the DevOps
movement with Colorado tech
startups

A little about VictorOps…
VictorOps is the real-time incident
management platform that combines the
power of people and data to embolden
DevOps pros to handle incidents as they
occur.

Agenda
12:00 - 1:00 - Registration &
Lunch
1:00 - Opening Remarks | Jason
Hand, DevOps Evangelist,
VictorOps
1:15 – Real-Life Stories + Expert
Panel Q&A
Matt Augustine – CTO & Co-
Founder at PlayFab
Courtney Kissler – VP of Retail at
Starbucks
2:00 - BREAK
2:15 - Breakout Sessions
3:30 - BREAK
3:40 – “Failure” as “Success”:
The Mindset, Methods, and
Landmines
J. Paul Reed – DevOps
Consultant
4:25 - Closing Remarks and
Raffle!
4:45 - Happy Hour

“How Organizations Process Information”
Ron Westrum: A Typology of Organizational Cultures
The 2014 State of DevOps Report shows that, in the context of IT, job satisfaction is the biggest predictor of
profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how
effectively organizations process information, as determined by a model created by sociologist Ron
Westrum, shown below. 1
1: https://continuousdelivery.com/implementing/culture/

Words are how we think – stories are how we link.
- Christina Baldwin
Oral narrative is and for a long time has been the
chief basis of culture itself.
- John D. Niles
Stories from the road

Incident Management Maturity
(figure axes: Time to Repair (TTR) vs. Continuous Improvement Efforts)

Reactive (chaotic)
• No automation
• No operational stack awareness
• Poor collaboration between teams (Dev & Ops)
• Documentation not available
• No standardized communication
• High focus on consistent continuous learning

Tactical (obvious)
• Uses a NOC
• Some monitoring & alerting instrumentation
• Collaboration in crisis
• "Mission critical" processes are available
• Understood crisis communication protocols
• Remediation data available to IT Operations

Integrated (complicated)
• Team rotations, paging policies, role hunting
• Continuous improvement of key health indicators
• Technical collaboration across all incidents
• Docs up to date and easily accessible
• Consistent real-time communication practices

Strategic (complex)
• Automated docs and remediation
• Actionable Alerts with full context
• High collaboration among all teams
• Documentation part of remediation
• Targeted, proactive crisis comms
• High focus on continuous learning

Automation
Awareness
Collaboration
Documentation
User Empathy
Learning

MATT AUGUSTINE|
CTO & CO-FOUNDER, PLAYFAB
• Matt leads engineering and product development at PlayFab, a
backend platform for online games.
• Well versed in the challenges of growing an engineering team
from a single person (himself) to a highly-functioning group,
cranking out features and supporting customers, all while
continuously improving product quality and reliability.
• Prior to PlayFab, Matt had over a decade of software
development experience, working at Uber Entertainment and
Microsoft on technologies ranging from video games to file
synchronization.
• He is passionate about building reliable systems that are used
by millions of people.

Matt Augustine |
CTO & Co-Founder, PlayFab

The PlayFab Story
• Developed backend services for local game studio Uber Entertainment
• Realized that many game developers needed the same technology
• Found an amazing CEO, James Gwertzman, and spun out PlayFab as a new company in January 2014
• Operating today with > 100 live games and 10M MAU

Succeeding with Spun-Out Tech
1. Establishing the New Standard
2. DevOps Team of More Than One
3. Escaping Reactive Mode

Old Product, New Product
• Starting codebase developed to a different standard
• Minimum Viable Process – before first hire
• Fork and ruthlessly prune existing codebase
• New product, new standards

DevOps Team of One… to Many
• Establish on-call rotation, even if you always have to get involved at first
• Only alert on outages + a few key metrics
• Counters more actionable than

More Customers, More Problems
Dealing with usage patterns you never anticipated, every
day

Escaping Reactive Mode
Distraction | Solution
Unpredictable traffic | Auto-scaling everything (compute + storage); load test to 10X current traffic peaks
Machine failures cause partial outages | Every server role runs in >1 DC with health-check-based failover
Regressions in complex legacy code | Gradually improve test coverage by adding tests whenever code is touched
Functional bugs in new features (super embarrassing) | Unit tests for all new functionality + peer review of all code changes
Customer feature / limit change requests | Define limits for everything with max allowable increases, and make limit changes self-serve
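The last row (limits with a maximum allowable increase, changed by the customer rather than by an engineer) could be sketched roughly as follows. This is an illustration only, not PlayFab's implementation; the `Limit` class, the `request_limit_change` function, and the `api_calls_per_minute` limit are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    """A per-customer limit with a self-serve ceiling (hypothetical model)."""
    name: str
    current: int
    max_self_serve: int  # highest value a customer may set without staff review


def request_limit_change(limit: Limit, requested: int) -> str:
    """Apply the change if it is within the self-serve ceiling, else flag it."""
    if requested < 0:
        raise ValueError("limit must be non-negative")
    if requested <= limit.max_self_serve:
        limit.current = requested
        return "applied"
    # Above the ceiling: leave the limit untouched and route to a human.
    return "needs_review"


# Hypothetical example limit and values.
calls = Limit(name="api_calls_per_minute", current=1000, max_self_serve=5000)
first = request_limit_change(calls, 3000)    # within the ceiling
second = request_limit_change(calls, 10000)  # above the ceiling
print(first, second, calls.current)
```

Only requests above the ceiling reach an engineer, which is the point of the row: routine limit changes stop being interruptions.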

Engineering @ PlayFab
• No “DevOps Team”
• All engineers design web services and run them on the cloud

COURTNEY KISSLER |
VP OF RETAIL, STARBUCKS
• An experienced leader, working specifically with technology
teams accountable for eCommerce, customer mobile,
personalization, loyalty, marketing, payments, customer care,
digital foundation and store technology experiences.
• She is passionate about connecting technology investments to
business outcomes, delivering impactful solutions and giving
teams line of sight to how their work contributes to those
outcomes.
• Graduate of Eastern Washington University with a B.S. in
Computer Information Systems and worked at two startups,
CyberSafe and WorldStream Communications
• Most recently at Nordstrom prior to joining Starbucks.

WHAT I’M GOING TO TALK ABOUT…
• Incident Management
• Critical/High
• Medium/Low
• OMTM (One Metric That Matters)
• Current Condition/Target Condition
• Tactics
• Additional Benefits/Outcomes

TEAM…
Why isn’t anyone listening to us????
This is so frustrating.
I’m going to move to another team (or leave the organization).

FIX IT
OR…UNTIL YOU CAN FIX IT…HIDE IT

RESIST THE GO-TO
MOVE…HEROICS!!!

CRITICAL/HIGH:
ONE METRIC THAT MATTERS
Metric | Current | Target
Mean Time to Recovery (MTTR) | 4-6 hours | 2-3 hours
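As a rough illustration of how this metric can be tracked, the sketch below computes MTTR as the average of (resolved time minus opened time) across incidents and compares it with the upper end of the 2-3 hour target. The incident timestamps are made-up sample data, and `mean_time_to_recovery` is a hypothetical helper, not something taken from the talk.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - opened) over a list of (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: (opened, resolved) for Critical/High incidents.
incidents = [
    (datetime(2016, 1, 4, 9, 0), datetime(2016, 1, 4, 13, 0)),    # 4 hours
    (datetime(2016, 1, 11, 22, 0), datetime(2016, 1, 12, 4, 0)),  # 6 hours
    (datetime(2016, 1, 20, 14, 0), datetime(2016, 1, 20, 19, 0)), # 5 hours
]

TARGET_MAX_HOURS = 3  # upper end of the 2-3 hour target condition

mttr = mean_time_to_recovery(incidents)
mttr_hours = mttr.total_seconds() / 3600
meets_target = mttr_hours <= TARGET_MAX_HOURS
print(f"MTTR: {mttr_hours:.1f} h, meets target: {meets_target}")
```

With this sample data the result falls in the 4-6 hour "current" band, so the target is not yet met.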

TACTICS
• Automation
• Deployments
• Testing
• Monitoring & Alerting
• Capacity for unplanned work
• Organization (removing silos)

MEDIUM/LOW:
ONE METRIC THAT MATTERS
Metric | Current | Target
# of incidents | 1300 | 50

ANOTHER PATTERN…HOW DO WE GET OFF THE HAMSTER WHEEL?

TACTICS
• All work visible
• WIP limits
• Team – self-organized
• Improvement kata
• A3 problem solving

ADDITIONAL METRICS
• Critical/Highs – incident count
• Understanding ratio of breakthrough vs. operational
• Cycle Time
• Deployment frequency
• Mean Time To Detect (MTTD)

ADDITIONAL OUTCOMES
• BEHAVIOR CHANGE
• ENGAGED LEADERS
• IMPROVED TRUST
• TEAM MORALE
• PERSONAL DEVELOPMENT

Breakout Sessions
• It broke in production, now what? Strategies for managing failure and getting back to business - Jeff Norris, Technical Principal of Snap CI at ThoughtWorks
• Finding Signal in the Noise - Matt Williams, Evangelist at Datadog
• Security & Compliance in a DevOps World - J. Paul Reed, DevOps Consultant
• Devs On-Call, How and Why to Get Started - Matt Augustine, CTO & Co-Founder at PlayFab
• ChatOps - Jason Hand, VictorOps
• The Leadership Evolution: How to lead in this brave new world - Courtney Kissler, Starbucks

JEFF NORRIS |
TECHNICAL PRINCIPAL, SNAP CI, THOUGHTWORKS
• Jeff Norris is the Technical Principal for Snap CI, which
provides Continuous Delivery and Continuous Integration in the
cloud.
• Before joining the Snap CI team, Jeff worked for many years at
ThoughtWorks and led ThoughtWorks’ longest running project,
an international leasing application with high uptime
requirements that processed billions of dollars of equipment
annually.
• Jeff is a co-author of ThoughtWorks’ Technology Radar
(thoughtworks.com/radar).
• When not developing systems or leading teams, Jeff teaches
and coaches tech leads throughout the Americas.

MATT WILLIAMS |
EVANGELIST, DATADOG
• Passionate about the power of monitoring and metrics to make
large-scale systems stable and manageable
• Usually touring the country speaking and writing about
monitoring with Datadog.
• When he’s not on the road, he’s coding.
• You can find Matt on Twitter at @Technovangelist.

J. PAUL REED |
DEVOPS CONSULTANT
• Over a decade of experience in the trenches as a build/release
and tools engineer, working with such organizations as
VMware, Mozilla, and Symantec.
• In 2012, he founded Release Engineering Approaches, a
consultancy incorporating a host of tools and techniques to help
organizations “Simply Ship. Every time.”
• Worked with organizations across a number of industries, from
financial services to cloud-based infrastructure, with teams from
2 to 200.
• Paul is also a founding host of The Ship Show, a twice-monthly
podcast tackling topics related to build engineering, DevOps,
and release management.

Incident Mgmt Maturity (pie chart)
Legend: Reactive, Tactical, Integrated, Strategic
Values as extracted: 8%, 48%, 28%, 16%

Incident Management Maturity
(figure axes: Time to Repair (TTR) vs. Continuous Improvement Efforts)

Reactive (0-4) (chaotic)
• No automation
• No operational stack awareness
• Poor collaboration between teams (Dev & Ops)
• Documentation not available
• No standardized communication
• High focus on consistent continuous learning

Tactical (5-9) (obvious)
• Uses a NOC
• Some monitoring & alerting instrumentation
• Collaboration in crisis
• "Mission critical" processes are available
• Understood crisis communication protocols
• Remediation data available to IT Operations

Integrated (10-14) (complicated)
• Team rotations, paging policies, role hunting
• Continuous improvement of key health indicators
• Technical collaboration across all incidents
• Docs up to date and easily accessible
• Consistent real-time communication practices

Strategic (15-18) (complex)
• Automated docs and remediation
• Actionable Alerts with full context
• High collaboration among all teams
• Documentation part of remediation
• Targeted, proactive crisis comms
• High focus on continuous learning
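The score bands above (0-4, 5-9, 10-14, 15-18) can be turned into a small lookup. The sketch below is illustrative only. The deck does not say how the score is produced, so the code assumes it is the sum of six category ratings (Automation, Awareness, Collaboration, Documentation, User Empathy, Learning), each rated 0 to 3, which happens to yield the 0-18 range. That scoring rule and the function names are assumptions.

```python
# Score bands as printed on the slide.
BANDS = [
    (0, 4, "Reactive"),
    (5, 9, "Tactical"),
    (10, 14, "Integrated"),
    (15, 18, "Strategic"),
]

# ASSUMPTION (not stated in the deck): six categories, each rated 0-3.
CATEGORIES = (
    "automation", "awareness", "collaboration",
    "documentation", "user_empathy", "learning",
)


def total_score(ratings):
    """Sum the six category ratings after validating each is in 0..3."""
    missing = set(CATEGORIES) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for category in CATEGORIES:
        if not 0 <= ratings[category] <= 3:
            raise ValueError(f"{category} rating must be between 0 and 3")
    return sum(ratings[category] for category in CATEGORIES)


def maturity_level(score):
    """Map a 0-18 score to the maturity level named on the slide."""
    for low, high, name in BANDS:
        if low <= score <= high:
            return name
    raise ValueError("score must be between 0 and 18")


# Hypothetical self-assessment.
ratings = {
    "automation": 1, "awareness": 2, "collaboration": 2,
    "documentation": 1, "user_empathy": 2, "learning": 1,
}
score = total_score(ratings)
print(score, maturity_level(score))  # 9 Tactical
```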

How Organizations Process Information
Ron Westrum: A Typology of Organizational Cultures
The 2014 State of DevOps Report shows that, in the context of IT, job satisfaction is the biggest predictor of
profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how
effectively organizations process information, as determined by a model created by sociologist Ron
Westrum, shown below. 1
1: https://continuousdelivery.com/implementing/culture/

Reduce MTTR
State of DevOps Report (2015)
– by Puppet Labs

Ordered
• Obvious: cause and effect obvious from experience. Sense - Categorize - Respond
• Complicated: cause and effect requires analysis. Sense - Analyze - Respond
Un-ordered
• Complex: cause and effect only apparent in hindsight. Probe - Sense - Respond
• Chaotic: cause and effect cannot be related. Act - Sense - Respond

The systems we engineer, maintain, and improve are
Complicated
.. or ..
Known unknowns

The systems we engineer, maintain, and improve are
Complex
Unknown unknowns

What are the..
Contributing
Factors?

Identifying a “root cause” helps us to …
Put it back
how it was

What we really want is to..
Continuously
Improve

Reactive
(chaotic)
No automation
No operational stack awareness
Poor collaboration between teams (Dev & Ops)
Documentation not available
No standardized communication
High focus on consistent continuous learning

Tactical
(obvious)
Uses a NOC
Some monitoring & alerting instrumentation
Collaboration in crisis
"Mission critical" processes are available
Understood crisis communication protocols
Remediation data available to IT Operations

Integrated
(complicated)
Team rotations, paging policies, role hunting
Continuous improvement of key health indicators
Technical collaboration across all incidents
Docs up to date and easily accessible
Consistent real-time communication practices

Strategic
(complex)
Automated docs and remediation
Actionable Alerts with full context
High collaboration among all teams
Documentation part of remediation
Targeted, proactive crisis comms
High focus on continuous learning

“Six Trends Shape DevOps Adoption, Q1 2015”
Forrester report
• The Foundation For Success Is In Place . . . Mostly
• Fear Of Failure Will Hamper Advancement
• Monitoring And Analytics Strategies Must Make A Big Leap Forward
• The Focus On Customer Experience Is Not Second Nature . . . Yet
• Change And Release Processes Are Not Delivering Business Needs
• You Must Prioritize And Focus Sourcing Strategies

Failure not seen as opportunity to
learn
Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report

Awareness
http://blog.vmware.com

© 2015 Forrester Research, Inc. Reproduction Prohibited
Single Source Of Truth Lacking In
Many Orgs – 95% only most of
the time or less
Source: April 15, 2015 “Six Trends That Will Shape DevOps Adoption”, Forrester report

Teams siloed throughout life cycle

User Empathy
https://open.buffer.com/wp-content/uploads/2015/12/empathy3.jpg

Automation
http://thelifedesignproject.com/wp-content/uploads/2009/09/373881476_217d24ef6d.jpg

Delays in Notifications Lead To
Customers Finding the Problem First

Documentation
http://blog.vmware.com
