Building A Culture of Observability At Stripe


I've been working on Observability for many years, and while I didn't join Stripe specifically to work on Observability, I quickly decided that's what I needed to do. How does one go about changing a company to have a culture of observability, measuring, and monitoring? Let's see if my ideas worked and what you can learn from my experiences!

Published in: Technology

  1. 1. Building a Culture of Observability at Stripe Maaaaaaaybe?
  2. 2. Cory “gphat” Watson • Joined Stripe in August, 2015 • Previously at Keen IO and Twitter • Generalist
  3. 3. Starting Point • Stripe had some visibility, but not enough. • No clear ownership, broken windows. • Lack of confidence, vision for future. • Very reactive.
  4. 4. This isn’t about a specific technology. This is about people.
  5. 5. Did it work?
  6. 6. See my resume at: (jk)
  7. 7. You’re here because you know this is important.
  8. 8. How can we get others to agree and work toward it?
  9. 9. Stripe Org Facts • ~450 employees, 100% growth in last year • ~2 dozen teams • ~200 services • Thousands of hosts (AWS) • Ruby, JVM, lots of OSS stuff • Team: 3 + intern (starting Q2)
  10. 10. Where to begin?
  11. 11. Start Over, Kinda • Spend time with the tools • Improve if possible • Replace if not • Leverage past knowledge
  12. 12. Empathy and Respect • People not generally evil, but they are busy! • Stressed, doing best with what they have • Being a hater is lazy • Help people be great at their jobs
  13. 13. Replaced Existing System • Maybe a bad call, technically better • Overcoming momentum is hard, adds work • Declaring bankruptcy • Saved us ops headaches • Still going
  14. 14. Tip: Nemawashi • Start small, you’re a great guinea pig • Quietly lay a foundation and gather feedback • Ask how you can improve, follow up! • Engage discontent! Usually fine. Sometimes you need whisky.
  15. 15. Identify Power Users • Find interested parties • Talk to them, give them what they need • Empower them to help others • Watch them grow!
  16. 16. Value • What are you improving? • How can you measure it? • Is this the best way?
  17. 17. What is Observability? Why do we want it?
  18. 18. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  19. 19. Systems output work. If the internal state goes bad, the work goes bad. We need to add sensors!
  20. 20. Make This Great [diagram labels: Programmer, Reference, System, Sensor(s), Work]
  21. 21. Flat Org Work Ethic • Probably the biggest challenge, getting started • So, ya know, get started • Be willing to do the work, shave the preposterous line of yaks • Stigmergy • Strike when good opportunities arise (incidents, etc)
  22. 22. Advertise • Don’t be afraid! • Promote team accomplishments. • More so, promote the accomplishments of others. • Humbly ask to help, then learn. • We send monthly “State of” addresses…
  23. 23. Make It Easy & Good • Harder than it sounds (email!) • Make it easy/automatic to do things right and hard to do wrong. • Quality is important.
  24. 24. Automated Monitors • Baseline monitoring • Common problems, common solutions • Users have no state, are surprised • People care when you show them failure and how to fix it.
  25. 25. Automatic Ticket Creation And Resolution!
  26. 26. Investigation Dashboard Such Helpful!
  27. 27. Getting Feedback How we improve.
  28. 28. Teach the Basics • Company curriculum: Teach ‘em early! • Measuring work metrics • Metrics types • Schemas (dotted, tags, etc) • Rates, histograms • Visualizations
  29. 29. Ownership • Poor story for this • Org was ready for this, management was on board. • Evolving, tools are lacking.
  30. 30. Did it work?
  31. 31. Yes, but not done. • Some teams? Hell yes. Strong champions, huge improvement. • Some other teams, kinda the same. • Some other other teams, what is Observability and why do I care? Rare!
  32. 32. Usage? • 200+ dashboards created, 339 in old (over 2 years) • 200+ monitors created, dozens in old (nobody trusted, was unreliable!) • ~3000 distinct metrics (can’t compare, tags now!) • All positive feedback from automation. (Avg 4.5, 2.5% response)
  33. 33. Tools? • Dozens of OSS PRs, OSS *StatsD library (Scala), internal libraries (we own) • Vast improvement over old pipeline, no loss • New styles, better naming, more consistency • Being tied to a commercial product cuts both ways
  34. 34. Adjustments? • Embracing other tools (log analysis, error catching) • Beginning to work on strategic things (global timers, histograms and sets) • Need to improve metrics on our own work (we got by easy for a while) • Monitoring is hard, need to fix.
  35. 35. Summary • Start small • Seek feedback • Think on your value • Measure effectiveness • Enjoy!
  36. 36. Thanks Team @antifuchs and @shu, all of Stripe @gphat
  37. 37. Questions? @gphat Info Slides Feedback Talk Help me improve.
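
Slide 28 lists metric schemas ("dotted, tags, etc") as part of the basics to teach. As a rough illustration of the difference, here is a minimal Python sketch that formats the same timer measurement both ways. It assumes the plain StatsD line format (`name:value|type`) and the DogStatsD-style tag suffix (`|#key:value,...`). The slides do not say which wire format Stripe used, and the metric names, tag keys, and helper functions below are hypothetical.

```python
def dotted_metric(name_parts, value, metric_type):
    """Format a plain StatsD line; every dimension is baked into the dotted name."""
    return f"{'.'.join(name_parts)}:{value}|{metric_type}"


def tagged_metric(name, value, metric_type, tags):
    """Format a DogStatsD-style line; dimensions travel as key:value tags."""
    # Sort tags so the output is stable regardless of dict ordering.
    tag_str = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return f"{name}:{value}|{metric_type}|#{tag_str}"


# Hypothetical example: a 42 ms timer ("ms" type) for one endpoint in one region.
dotted = dotted_metric(["api", "charges", "us-west", "latency"], 42, "ms")
tagged = tagged_metric(
    "api.latency", 42, "ms", {"endpoint": "charges", "region": "us-west"}
)

print(dotted)  # api.charges.us-west.latency:42|ms
print(tagged)  # api.latency:42|ms|#endpoint:charges,region:us-west
```

With the dotted schema, each combination of endpoint and region becomes its own metric name. With tags, there is one metric name that can be filtered or grouped by dimension, which is one reason metric counts are hard to compare across the two styles (as slide 32 notes).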