THE NEXT TEN YEARS OF DEVOPS
Ellen Chisa - @ellenchisa - 11/25/2021
🙋‍♀️ 🙋 🙋‍♂️
Self Provisioning Infra
(hello world gif)
DevOpsDays
2009
Ten years
from now
iPads were new
Google+ just launched
Qwikster
Nokia
Angular was new
No NPM
AWS Console was Manageable
SysAdmin vs. DevOps was a thing
1. Business requires (fast) change
2. Change causes outages
3. Lowering the risk of change through tools and culture
John Allspaw (Flickr/Yahoo!) and Paul Hammond (Flickr)
"10+ Deploys Per Day: Dev and Ops Cooperation at Flickr"
Dev and Ops
Everyone has a computer in
their pocket
Metaverse, web3, crypto, NFTs
Jamstack Apps
1.3 Million Packages in NPM
150 services in AWS Console
Deploying 26,280x more often
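A quick check of the arithmetic behind the 26,280x figure. This is a sketch under an assumption: the speaker notes suggest the baseline is one deploy every three years and the current rate is roughly one deploy per hour. The variable names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years

# Assumed baseline: one deploy every three years.
hours_between_deploys_then = 3 * HOURS_PER_YEAR
# Assumed current rate: one deploy every hour.
hours_between_deploys_now = 1

improvement = hours_between_deploys_then // hours_between_deploys_now
print(improvement)  # 26280

# Applying the same factor again to one deploy per hour.
deploys_per_second = improvement / 3600
print(round(deploys_per_second, 1))  # 7.3
```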
What’s most annoying today?
It’s still the same thing, but...
“Their evidence refutes the
bimodal IT notion that you
have to choose between
speed and stability—instead,
speed depends on stability,
so good IT practices give you
both.”
Increase Speed
Reduce Risk
4 ways we’ll keep doing that,
and what that means ten
years from now.
1. Real time feedback loops
Scratch
Replay.io
Snaplet
Zuplo
Realtime feedback loops in
all tools will increase speed
and reduce risk.
2. Incident infra
Incident Occurs
Self healing infrastructure &
playbooks become automated
resolution of knowns.
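One way to picture "automated resolution of knowns" is a lookup from known alert types to playbook functions, with anything unknown handed to a human. The sketch below is hypothetical: the alert kinds, the `PLAYBOOKS` table, and the remediation functions are made up for illustration and do not come from any specific tool.

```python
from typing import Callable, Dict


def restart_service(alert: dict) -> str:
    # Hypothetical remediation: a real playbook would call an orchestrator.
    return f"restarted {alert['service']}"


def clear_disk(alert: dict) -> str:
    # Hypothetical remediation: a real playbook would free disk space.
    return f"cleared temp files on {alert['service']}"


# Known failure modes mapped to their automated playbooks.
PLAYBOOKS: Dict[str, Callable[[dict], str]] = {
    "process_down": restart_service,
    "disk_full": clear_disk,
}


def handle_alert(alert: dict) -> str:
    """Run the playbook for a known alert kind, otherwise page a human."""
    playbook = PLAYBOOKS.get(alert["kind"])
    if playbook is None:
        return f"page on-call: unknown issue '{alert['kind']}' on {alert['service']}"
    return f"auto-resolved: {playbook(alert)}"


print(handle_alert({"kind": "disk_full", "service": "api"}))
print(handle_alert({"kind": "latency_spike", "service": "api"}))
```

The split mirrors the idea on the slide: knowns are resolved automatically, and only unknowns reach the pager.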
Pager Goes Off
“In the future systems will be
much smarter about escalating
to the best possible people
considering a bunch of factors
like time zone, area of
expertise, and recency of
contact with the system being
reported on (the last N
committers to a project, or the
last N to update some config)”
- Paul Nakata
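The escalation idea in the quote can be sketched as a scoring function over candidate responders. Everything here is an assumption for illustration: the `Engineer` fields, the waking-hours window, the 90-day recency decay, and the equal weighting of the three factors.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


@dataclass
class Engineer:
    name: str
    utc_offset_hours: int              # crude stand-in for a time zone
    expertise: Set[str]                # areas the person knows well
    last_commit: Dict[str, datetime]   # system name -> last commit time


def score(engineer: Engineer, system: str, area: str, now: datetime) -> float:
    """Score a responder on time zone, expertise, and recency of contact."""
    local_hour = (now + timedelta(hours=engineer.utc_offset_hours)).hour
    awake = 1.0 if 8 <= local_hour < 22 else 0.0
    expert = 1.0 if area in engineer.expertise else 0.0
    last = engineer.last_commit.get(system)
    # Recency decays linearly to zero over an assumed 90 days.
    recency = 0.0 if last is None else max(0.0, 1.0 - (now - last).days / 90)
    return awake + expert + recency


def pick_responder(engineers: List[Engineer], system: str, area: str,
                   now: datetime) -> Engineer:
    return max(engineers, key=lambda e: score(e, system, area, now))


now = datetime(2021, 11, 25, 12, 0, tzinfo=timezone.utc)
team = [
    Engineer("ana", 2, {"database"}, {"billing": now - timedelta(days=3)}),
    Engineer("bo", -8, {"database"}, {"billing": now - timedelta(days=1)}),
]
print(pick_responder(team, "billing", "database", now).name)  # ana
```

In this example both engineers know the area, and Bo touched the system more recently, but it is 4 a.m. for Bo, so Ana is paged.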
Incident Opens
Fiberplane.dev
Incident Resolves
Incident infrastructure will
reduce risk and increase
speed.
3. DevEx Focus
Live Values
1. Take into account learning style
2. Not too hard
3. Not too easy
4. Progressive disclosure
5. Docs & error messages to enable
users to solve their own issues
6. Customization is expert mode
1. Take into account learning style
Codesee
1. Take into account learning style
2. Not too hard
Optic
1. Take into account learning style
2. Not too hard
3. Not too easy
4. Progressive disclosure
1. Take into account learning style
2. Not too hard
3. Not too easy
4. Progressive disclosure
5. Docs & error messages to enable
users to solve their own issues
Great Errors & Documentation
1. Take into account learning style
2. Not too hard
3. Not too easy
4. Progressive disclosure
5. Docs & error messages to enable
users to solve their own issues
6. Customization is expert mode
Customizations
Sarah Drasner and Jake Downs thread
Tools with good DevEx will
increase speed and reduce
risk.
4. Business Metrics
Today’s Big Four Metrics (Accelerate)
Speed:
1. Deployment Frequency (the frequency at which new releases go to production)
2. Lead Time For Changes (the time until a commit goes to production)
Risk:
1. Change Failure Rate (the share of deployments to production that lead to errors,
out of all deployments).
2. Mean Time to Restore (the time it takes to resolve a service impairment in
production)
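A minimal sketch of how the four metrics might be computed from a list of deploy records. The `Deploy` record and its field names are hypothetical. Change failure rate is computed here as failed deploys over all deploys, and time to restore is measured from the failed deploy to restoration; both are assumptions about the definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is fixed


def four_metrics(deploys: List[Deploy], period_days: int) -> dict:
    if not deploys:
        raise ValueError("need at least one deploy")
    n = len(deploys)
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        # Speed
        "deploys_per_day": n / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / n,
        # Risk
        "change_failure_rate": len(failures) / n,
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


t0 = datetime(2021, 11, 24, 9, 0)
history = [
    Deploy(t0, t0 + timedelta(hours=1)),
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0, t0 + timedelta(hours=3), failed=True,
           restored_at=t0 + timedelta(hours=3, minutes=30)),
    Deploy(t0, t0 + timedelta(hours=2)),
]
metrics = four_metrics(history, period_days=2)
print(metrics["deploys_per_day"], metrics["change_failure_rate"])  # 2.0 0.25
```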
Executives (and investors) care about...
Existing Metrics
Communicating in
business metrics will get
buy in for future
investments.
If we get:
- Full organizational buy-in for the importance of ops
- Tools designed for flow and developer experience
- Real time feedback when something goes wrong
- Incidents managed as well as deploys
What happens next?
Smoother cross team collaboration and
more infra resources
Work is more fun
Speed goes towards ∞
Risk goes towards 0
Self-provisioning runtimes
Everyone writes software
We’re going to need a bigger room for this conference
Tools Appendix by Section & References
Realtime Feedback:
- scratch.mit.edu (kids)
- replay.io (debug)
- snaplet.dev (database snapshots)
- zuplo.com (api gateway)
Incidents:
- ab.bot (chatbot infra)
- fiberplane.dev (incident notebook)
- kumospace.com (comms platform)
- honeycomb.io (observability)
- allma.io (incident slack infra)
- jeli.io (incident analysis)
- blameless.com (incident analysis)
Developer Experience:
- darklang.com (backend)
- codesee.io (architecture maps)
- tryfabric.com (slack <> github)
- useoptic.com (API changes)
- railway.app (infra setup)
Books, articles, talks:
- Accelerate (Forsgren, Humble, Kim)
- 10+ Deploys per day (Allspaw, Hammond)
- Incident Review Best Practices Survey
(Pragmatic Engineer)
- Great Documentation Examples
(WorkOS)
- Preparing a Board Deck (Sequoia)
- Self Provisioning Runtime (Swyx)
Thank you

KEYNOTE | WHAT'S COMING IN THE NEXT 10 YEARS OF DEVOPS? // ELLEN CHISA, boldstart VC


Editor's Notes

  • #3 Ops (like the system part) Developers (like the code part)
  • #4 Started a company. Programming language, batteries included. Still exists. For every tool I mention in here, I’ll link you at the end.
  • #6 Boldstart.vc Some companies in the room Share lessons from lots of people!
  • #7 Anchors us squarely in the middle of this journey
  • #8 Remind you or tell you because I know lots of us in the room may not have been around for this
  • #9 Web Accessibility Companions Office for Mobile
  • #10 Self service APIs. Automation. Cloud provisioning. Matters more than ever as we have more businesses that rely heavily on software. Who does it matter to? Internal customer, not a business/product center. Engineers are looking, Hashicorp or Snyk examples - DevOps 2.0. Developers will physically take over in our inner loop and is being distributed. Everything is the product - Beth Long. Doing things that matter to the business.
  • #12 26280x better again would be 7.3 deploys/second
  • #16 Balsa did a survey recently saying basically this
  • #21 Productivity with faster, better, more secure operations Implicit benefit to productivity Builds on observability - builds on monitoring/observability before it happens ServiceNow and Lightstep Replay Snaplet Dark, Lambdragon, Natto
  • #22 Kids - seem familiar, Logo Turtle, Mindstorms, etc. MIT Mathematician - learning from feedback - Norbert Wiener https://en.wikipedia.org/wiki/Norbert_Wiener
  • #23 Fullstory + Chrome DevTools
  • #26 Individuals write code faster - debug faster, etc. Individuals check their work in atomic steps - Zuplo example with people finding other routes
  • #27 Deployments babysat, now happens automatically Manual process vs. automated process Document https://newsletter.pragmaticengineer.com/p/incident-review-best-practices
  • #28 I don’t know about you, but I haven’t seen many places with automatic resolutions. If it has a manual playbook someone could read, that could definitely be automated. Personal ownership: certain types of problems always go to some group of people. Right now most systems rely on an explicit definition of who is on call, but in the future systems will be much smarter about escalating to the best possible people considering a bunch of factors like time zone, area of expertise, and recency of contact with the system being reported on (the last N committers to a project, or the last N to update some config). Automated recovery policies. Playbooks. Monitoring: knowns -- automated -- Pagerduty, Runbook. Observability: unknowns -- get the right person who knows the most.
  • #29 After a pager needs to go, pages will change
  • #30 Doesn’t have to be in Slack necessarily, but can be - Allma, blameless People automatically get directed to the right place Pulls in related information Easy “status” section
  • #31 Firehydrant, Blameless
  • #33 Less bad when something goes wrong Continue to learn for next time to be able to go faster
  • #34 Framework for what is good. We know tools go better when people adopt; what makes them adopt? Solve a real problem, have a good experience. This is for those of you who are building tools for devs or choosing them.
  • #35 Right information at the right time, not a good DevEx
  • #36 muh·hay·lee Chik·sent·mee·hai·ee
  • #40 Not too hard - what people have, be it chat or OpenAPI
  • #41 Feels like GitHub. Works on top of the API you have. Has an OSS version and a hosted version.
  • #49 If you’ve gotten here, you’re winning!
  • #52 Goes to infinity - once every three years to once a minute we’ve already improved this by 12,000,000x Goes to 0 Goes to 0 Doesn’t matter if the previous has gone to 0
  • #55 Spoiler alert: MBA people don’t care about your normal metrics
  • #57 Time to ticket resolution (lead time for changes) Ratio of successful to poor experiences (change failure rate)
  • #58 Hiring Velocity Dev NPS Dev Testing