3. JASON HAND |
DevOps Evangelist
• Holds over 15 years of experience
as a developer, system
administrator, and support specialist
• Fully immersed in the world of agile
development and the DevOps
movement with Colorado tech
startups
#DevOpsRoadTrip
4. A little about VictorOps…
VictorOps is the real-time incident
management platform that combines the
power of people and data to embolden
DevOps pros to handle incidents as they
occur.
5. Agenda
12:00 - 1:00 - Registration & Lunch
1:00 - Opening Remarks | Jason Hand, DevOps Evangelist, VictorOps
1:15 - Real-Life Stories + Expert Panel Q&A
Matt Augustine - CTO & Co-Founder at PlayFab
Courtney Kissler - VP of Retail at Starbucks
2:00 - BREAK
2:15 - Breakout Sessions
3:30 - BREAK
3:40 - “Failure” as “Success”: The Mindset, Methods, and Landmines
J. Paul Reed - DevOps Consultant
4:25 - Closing Remarks and Raffle!
4:45 - Happy Hour
18. “How Organizations Process Information”
Ron Westrum: A Typology of Organizational Cultures
The 2014 State of DevOps Report shows that in the context of IT, job satisfaction is the biggest predictor of
profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how
effectively organizations process information, as determined by a model created by sociologist Ron
Westrum, shown below. 1
1: https://continuousdelivery.com/implementing/culture/
21. Words are how we think – stories are how we link.
- Christina Baldwin
Oral narrative is and for a long time has been the
chief basis of culture itself.
- John D. Niles
Stories from the road
25. Incident Management Maturity
Chart labels: TimeToRepair (TTR); Continuous Improvement Efforts
Reactive (chaotic)
No automation
No operational stack awareness
Poor collaboration between teams (Dev & Ops)
Documentation not available
No standardized communication
High focus on consistent continuous learning
Tactical (obvious)
Uses a NOC
Some monitoring & alerting instrumentation
Collaboration in crisis
"Mission critical" processes are available
Understood crisis communication protocols
Remediation data available to IT Operations
Integrated (complicated)
Team rotations, paging policies, role hunting
Continuous improvement of key health indicators
Technical collaboration across all incidents
Docs up to date and easily accessible
Consistent real-time communication practices
Strategic (complex)
Automated docs and remediation
Actionable Alerts with full context
High collaboration among all teams
Documentation part of remediation
Targeted, proactive crisis comms
High focus on continuous learning
32. MATT AUGUSTINE |
CTO & CO-FOUNDER, PLAYFAB
• Matt leads engineering and product development at PlayFab, a
backend platform for online games.
• Well versed in the challenges of growing an engineering team
from a single person (himself) to a highly-functioning group,
cranking out features and supporting customers, all while
continuously improving product quality and reliability.
• Prior to PlayFab, Matt had over a decade of software
development experience, working at Uber Entertainment and
Microsoft on technologies ranging from video games to file
synchronization.
• He is passionate about building reliable systems that are used
by millions of people.
35. The PlayFab Story
Developed backend services for local game studio, Uber
Entertainment
Realized that many game developers needed the same
technology.
Found an amazing CEO, James Gwertzman, and spun out
PlayFab as a new company in January 2014.
Operating today with > 100 live games and 10M MAU
36. Succeeding with Spun-Out Tech
1. Establishing the New Standard
2. DevOps Team of More Than One
3. Escaping Reactive Mode
37. Old Product, New Product
Starting codebase developed to a different
standard
Minimum Viable Process – before first hire
Fork and ruthlessly prune existing codebase
New product, new standards
38. DevOps Team of One… to Many
Establish on-call rotation, even if you always have to get involved at first
Only alert on outages + a few key metrics
Counters more actionable than
39. More Customers, More Problems
Dealing with usage patterns you never anticipated, every
day
40. Escaping Reactive Mode
Distraction | Solution
Unpredictable traffic | Auto-scaling everything (compute + storage); load test to 10X current traffic peaks
Machine failures cause partial outages | Every server role runs in >1 DC with health check based failover
Regressions in complex legacy code | Gradually improve test coverage by adding tests whenever the code is touched
Functional bugs in new features (super embarrassing) | Unit tests for all new functionality + peer review of all code changes
Customer feature / limit change requests | Define limits for everything with max allowable increases, and make limit changes self-serve
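The last row above (limits with a maximum allowable increase, changed self-serve) could be sketched roughly as follows. This is a minimal illustration, not PlayFab's implementation: the `Limit` class, the `request_limit_change` function, and the rule that decreases are always allowed are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    """A per-customer limit (hypothetical shape, for illustration only)."""
    name: str
    current: int
    max_increase: int  # largest single increase allowed without human review


def request_limit_change(limit: Limit, requested: int) -> bool:
    """Apply a self-serve change; return True if it was applied.

    Increases larger than max_increase are rejected so they can be
    routed to a human instead of being applied automatically.
    """
    if requested < 0:
        raise ValueError("a limit cannot be negative")
    if requested - limit.current > limit.max_increase:
        return False
    limit.current = requested
    return True
```

A request inside the allowed increase is applied immediately; anything larger returns `False`, which is where a support request would take over.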
41. Engineering @ PlayFab
No “DevOps Team”
All engineers design web
services and run them on the
cloud
43. COURTNEY KISSLER |
VP OF RETAIL, STARBUCKS
• An experienced leader, working specifically with technology
teams accountable for eCommerce, customer mobile,
personalization, loyalty, marketing, payments, customer care,
digital foundation and store technology experiences.
• She is passionate about connecting technology investments to
business outcomes, delivering impactful solutions and giving
teams line of sight to how their work contributes to those
outcomes.
• Graduate of Eastern Washington University with a B.S. in
Computer Information Systems and worked at two startups,
CyberSafe and WorldStream Communications
• Most recently at Nordstrom prior to joining Starbucks.
44. WHAT I’M GOING TO TALK ABOUT…
Incident Management
Critical/High
Medium/Low
OMTM (One Metric That Matters)
Current Condition/Target Condition
Tactics
Additional Benefits/Outcomes
53. MEDIUM/LOW: ONE METRIC THAT MATTERS
Metric | Current | Target
# of incidents | 1300 | 50
ANOTHER PATTERN… HOW DO WE GET OFF THE HAMSTER WHEEL?
54. TACTICS
All work visible
WIP limits
Team – self-organized
Improvement kata
A3 problem solving
55. ADDITIONAL METRICS
Critical/Highs – incident count
Understanding ratio of breakthrough vs.
operational
Cycle Time
Deployment frequency
Mean Time To Detect (MTTD)
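As a rough illustration of the last metric, MTTD can be computed as the average gap between when an incident began and when it was detected. The function name and the `(started, detected)` pair format are assumptions for this sketch; the slides do not say how the metric was actually collected.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average of (detected - started) over a list of (started, detected) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Start the sum from a zero timedelta so the gaps add up correctly.
    total = sum((detected - started for started, detected in incidents), timedelta())
    return total / len(incidents)
```

For example, two incidents detected after 10 and 20 minutes give an MTTD of 15 minutes.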
60. Breakout Sessions
It broke in production, now what? Strategies for managing failure
and getting back to business
- Jeff Norris, Technical Principal of Snap CI, at ThoughtWorks
Finding Signal in the Noise - Matt Williams, Evangelist at
Datadog
Security & Compliance in a DevOps World - J. Paul Reed,
DevOps Consultant
Devs On-Call, How and Why to Get Started - Matt Augustine,
CTO & Co-Founder at PlayFab
ChatOps - Jason Hand, VictorOps
The Leadership Evolution: How to lead in this brave new world -
Courtney Kissler, Starbucks
61. JEFF NORRIS |
TECHNICAL PRINCIPAL, SNAP CI, THOUGHTWORKS
• Jeff Norris is the Technical Principal for Snap CI, which
provides Continuous Delivery and Continuous Integration in the
cloud.
• Before joining the Snap CI team, Jeff worked for many years at
ThoughtWorks and led ThoughtWorks’ longest running project,
an international leasing application with high uptime
requirements that processed billions of dollars of equipment
annually.
• Jeff is a co-author of ThoughtWorks’ Technology Radar
(thoughtworks.com/radar).
• When not developing systems or leading teams, Jeff teaches
and coaches tech leads throughout the Americas.
62. MATT WILLIAMS |
EVANGELIST, DATADOG
• Passionate about the power of monitoring and metrics to make
large-scale systems stable and manageable
• Usually touring the country speaking and writing about
monitoring with Datadog.
• When he’s not on the road, he’s coding.
• You can find Matt on Twitter at @Technovangelist.
63. J. PAUL REED |
DEVOPS CONSULTANT
• Over a decade of experience in the trenches as a build/release
and tools engineer, working with such organizations as
VMware, Mozilla, and Symantec.
• In 2012, he founded Release Engineering Approaches, a
consultancy incorporating a host of tools and techniques to help
organizations “Simply Ship. Every time.”
• Worked with organizations across a number of industries, from
financial services to cloud-based infrastructure, with teams from
2 to 200.
• Paul is also a founding host of The Ship Show, a twice-monthly
podcast tackling topics related to build engineering, DevOps,
and release management.
72. Incident Management Maturity
Chart labels: TimeToRepair (TTR); Continuous Improvement Efforts
Reactive (0 - 4) (chaotic)
No automation
No operational stack awareness
Poor collaboration between teams (Dev & Ops)
Documentation not available
No standardized communication
High focus on consistent continuous learning
Tactical (5 - 9) (obvious)
Uses a NOC
Some monitoring & alerting instrumentation
Collaboration in crisis
"Mission critical" processes are available
Understood crisis communication protocols
Remediation data available to IT Operations
Integrated (10 - 14) (complicated)
Team rotations, paging policies, role hunting
Continuous improvement of key health indicators
Technical collaboration across all incidents
Docs up to date and easily accessible
Consistent real-time communication practices
Strategic (15 - 18) (complex)
Automated docs and remediation
Actionable Alerts with full context
High collaboration among all teams
Documentation part of remediation
Targeted, proactive crisis comms
High focus on continuous learning
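The score bands on this slide can be read as a simple lookup from a total score to a maturity stage. The sketch below copies only the bands and stage names from the slide; the `Stage` type and the `maturity_stage` function are hypothetical, and how the score itself is tallied is not stated in the deck.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    domain: str
    low: int
    high: int


# Bands and labels as shown on the slide; the structure is illustrative.
STAGES = (
    Stage("Reactive", "chaotic", 0, 4),
    Stage("Tactical", "obvious", 5, 9),
    Stage("Integrated", "complicated", 10, 14),
    Stage("Strategic", "complex", 15, 18),
)


def maturity_stage(score: int) -> Stage:
    """Return the stage whose band contains the given score."""
    for stage in STAGES:
        if stage.low <= score <= stage.high:
            return stage
    raise ValueError(f"score must be between 0 and 18, got {score}")
```

A score of 12, for instance, falls in the Integrated band.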
73. How Organizations Process Information
Ron Westrum: A Typology of Organizational Cultures
The 2014 State of DevOps Report shows that in the context of IT, job satisfaction is the biggest predictor of
profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how
effectively organizations process information, as determined by a model created by sociologist Ron
Westrum, shown below. 1
1: https://continuousdelivery.com/implementing/culture/
76. Un-ordered | Ordered
Ordered
Obvious: Cause & Effect Obvious From Experience (Sense - Categorize - Respond)
Complicated: Cause & Effect Requires Analysis (Sense - Analyze - Respond)
Un-ordered
Complex: Cause & Effect Only Apparent in Hindsight (Probe - Sense - Respond)
Chaotic: Cause & Effect Cannot Be Related (Act - Sense - Respond)
77. The systems we engineer, maintain, and improve are
Complicated
.. or ..
Known unknowns
78. The systems we engineer, maintain, and improve are
Complex
Unknown unknowns
86. Reactive
(chaotic)
No automation
No operational stack awareness
Poor collaboration between teams (Dev & Ops)
Documentation not available
No standardized communication
High focus on consistent continuous learning
87. Tactical
(obvious)
Uses a NOC
Some monitoring & alerting instrumentation
Collaboration in crisis
"Mission critical" processes are available
Understood crisis communication protocols
Remediation data available to IT Operations
88. Integrated
(complicated)
Team rotations, paging policies, role hunting
Continuous improvement of key health indicators
Technical collaboration across all incidents
Docs up to date and easily accessible
Consistent real-time communication practices
89. Strategic
(complex)
Automated docs and remediation
Actionable Alerts with full context
High collaboration among all teams
Documentation part of remediation
Targeted, proactive crisis comms
High focus on continuous learning
90. “Six Trends Shape DevOps Adoption, Q1 2015”
Forrester report
• The Foundation For Success Is In Place . . . Mostly
• Fear Of Failure Will Hamper Advancement
• Monitoring And Analytics Strategies Must Make A Big Leap Forward
• The Focus On Customer Experience Is Not Second Nature . . . Yet
• Change And Release Processes Are Not Delivering Business Needs
• You Must Prioritize And Focus Sourcing Strategies
Greenfield development – allure of working at a startup. At spinout, green field is overgrown and full of weeds.
Different standard. One usage pattern, few (any?) tests, reporting a bug means walking down the hall and bugging you, and “you’re calling it wrong” is an acceptable answer.
Sources, build server, at least one test as gate, deployment script, uptime monitor, alert/page
Separate codebase, minimize merging. Separate deployment / AWS account.
But the fact that the code has low test coverage is not an excuse. Lead by example – don’t fall into old patterns.
Good news, you’ve already instituted policy of devs on call. Bad news, you’re the only dev.
Will talk more in breakout, but devs wearing pager is good motivation – personal plus peer pressure.
Set the tone early that paging should not be frequent - page only on things that impact customers, and are actionable
Counters/Metrics more useful than logs – can alert on them, get quick visual confirmation of changes, explore relationships etc. Metrics for everything – logs for the unexpected.
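A minimal sketch of why counters lend themselves to alerting: a threshold on a ratio of two counters is a one-line check, whereas the same question asked of logs needs parsing first. The counter names, the 5% threshold, and the `record` and `should_page` helpers are assumptions for illustration, not anything described in the talk.

```python
from collections import Counter

# In-process counters; a real system would ship these to a metrics service.
counters = Counter()


def record(event: str, count: int = 1) -> None:
    """Increment a named counter."""
    counters[event] += count


def should_page(errors: int, requests: int, max_error_rate: float = 0.05) -> bool:
    """Page only when the customer-facing error rate exceeds the threshold."""
    if requests == 0:
        return False  # no traffic means nothing customer-impacting to page on
    return errors / requests > max_error_rate
```

With 200 requests and 4 errors the rate is 2%, so nobody is paged; at 20 errors it is 10% and the check fires.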
Think through
If you’ve survived until now, you are lucky enough to have a constant stream of exciting new problems
I used to think that startup CTOs spent their time coming up with technical vision and other “big picture” stuff. Boy was I wrong – the role is actually being the global exception handler.
To become more effective, must reduce the number of exceptions and catch the remaining ones sooner.
Time you spend troubleshooting, fixing or even thinking about these issues is time you aren’t spending on making your product better.