Building A Culture of Observability At Stripe

Maaaaaaaybe?
Cory “gphat” Watson
• Joined Stripe in August, 2015
• Previously at Keen IO and Twitter
• Generalist
Starting Point
• Stripe had some visibility, but not enough.
• No clear ownership, broken windows.
• Lack of confidence, vision for future.
• Very reactive.
This isn’t about a specific technology. This is about people.
Did it work?
See my resume at:
onemogin.com/resume
(jk)
You’re here because you know this is important.
How can we get others to agree and work toward it?
Stripe Org Facts
• ~450 employees, 100% growth in last year
• ~2 dozen teams
• ~200 services
• Thousands of hosts (AWS)
• Ruby, JVM, lots of OSS stuff
• Team: 3 + intern (starting Q2)
Where to begin?
Start Over, Kinda
• Spend time with the tools
• Improve if possible
• Replace if not
• Leverage past knowledge
Empathy and Respect
• People not generally evil, but they are busy!
• Stressed, doing best with what they have
• Being a hater is lazy
• Help people be great at their jobs
Replaced Existing System
• Maybe a bad call, technically better
• Overcoming momentum is hard, adds work
• Declaring bankruptcy
• Saved us ops headaches
• Still going
Tip: Nemawashi
• Start small, you’re a great guinea pig
• Quietly lay a foundation and gather feedback
• Ask how you can improve, follow up!
• Engage discontent! Usually fine. Sometimes you need whisky.
Identify Power Users
• Find interested parties
• Talk to them, give them what they need
• Empower them to help others
• Watch them grow!
Value
• What are you improving?
• How can you measure it?
• Is this the best way?
What is Observability?
Why do we want it?
In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.
Systems output work.
If the internal state goes bad,
the work goes bad.
We need to add sensors!
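One way to picture "adding sensors" is to wrap a unit of work so that every call reports whether it succeeded and how long it took. This is only a minimal sketch: the `Sensors` class, the `observe` decorator, and the `charge` function are hypothetical names made up for illustration, and the in-memory store stands in for whatever metrics client is actually in use.

```python
import functools
import time
from collections import defaultdict


class Sensors:
    """In-memory stand-in for a metrics client (hypothetical, for illustration)."""

    def __init__(self):
        self.counts = defaultdict(int)      # metric name -> number of events
        self.timings = defaultdict(list)    # metric name -> durations in seconds

    def observe(self, name):
        """Decorator that counts successes and failures and times every call."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    result = fn(*args, **kwargs)
                    self.counts[name + ".success"] += 1
                    return result
                except Exception:
                    self.counts[name + ".failure"] += 1
                    raise  # the sensor reports the failure, it does not hide it
                finally:
                    elapsed = time.perf_counter() - start
                    self.timings[name + ".duration"].append(elapsed)
            return wrapper
        return decorator


sensors = Sensors()


@sensors.observe("charge")
def charge(amount_cents):
    """Hypothetical unit of work: reject non-positive amounts."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return {"status": "ok", "amount": amount_cents}
```

Calling `charge(100)` bumps `charge.success`; calling `charge(0)` raises and bumps `charge.failure`. Either way a duration is recorded, so the work itself becomes the external output that reveals the internal state.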
Make This Great
[Diagram with the labels: Programmer, Reference, System, Sensor(s), Work]
Flat Org Work Ethic
• Probably the biggest challenge, getting started
• So, ya know, get started
• Be willing to do the work, shave the preposterous line of yaks
• Stigmergy
• Strike when good opportunities arise (incidents, etc)
Advertise
• Don’t be afraid!
• Promote team accomplishments.
• More so, promote the accomplishments of others.
• Humbly ask to help, then learn.
• We send monthly “State of” addresses…
Make It Easy & Good
• Harder than it sounds (email!)
• Make it easy/automatic to do things right and hard to do wrong.
• Quality is important.
Automated Monitors
• Baseline monitoring
• Common problems, common solutions
• Users have no state, are surprised
• People care when you show them failure and how to fix it.
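Baseline monitoring could be sketched as a function that stamps the same small set of checks onto every service, each carrying a message that says how to fix the problem. Everything here is an assumption for illustration: the `Monitor` fields, the metric names, the thresholds, and the query syntax are hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    name: str
    query: str        # hypothetical query syntax
    threshold: float
    message: str      # shown when the monitor fires: the failure and the fix


# Common problems, common solutions (hypothetical metrics and thresholds).
BASELINE_CHECKS = [
    ("disk_used_pct", 90.0, "Disk is nearly full. Free space or grow the volume."),
    ("error_rate_pct", 5.0, "Error rate is elevated. Check recent deploys and logs."),
]


def baseline_monitors(service):
    """Return the baseline monitors every service gets automatically."""
    return [
        Monitor(
            name=f"{service}: {metric} high",
            query=f"{metric}{{service:{service}}}",
            threshold=threshold,
            message=message,
        )
        for metric, threshold, message in BASELINE_CHECKS
    ]


def is_firing(monitor, value):
    """A monitor fires when the observed value exceeds its threshold."""
    return value > monitor.threshold
```

Because the list of checks lives in one place, a new service gets the same coverage as every other service without its owners having to know what to ask for.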
Automatic Ticket Creation
And Resolution!
Investigation Dashboard
Such Helpful!
Getting Feedback
How we improve.
Teach the Basics
• Company curriculum: Teach ‘em early!
• Measuring work metrics
• Metrics types
• Schemas (dotted, tags, etc)
• Rates, histograms
• Visualizations
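The basics above can be sketched in a few lines. The snippet formats a metric in the StatsD line format (`name:value|type`), optionally with the tag suffix (`|#key:value`) used by the DogStatsD extension, derives a rate from two counter readings, and summarizes a histogram with a nearest-rank percentile. The helper names and the example metric names are hypothetical, chosen only for illustration.

```python
import math


def statsd_line(name, value, metric_type, tags=None):
    """Format one metric as a StatsD line.

    metric_type is a StatsD type code such as "c" (counter), "g" (gauge),
    "ms" (timer) or "s" (set). Tags use the DogStatsD "|#k:v,k:v" suffix.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        tag_text = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line


def rate_per_second(previous_count, current_count, interval_seconds):
    """Turn two readings of an ever-increasing counter into a rate."""
    return (current_count - previous_count) / interval_seconds


def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Dotted schema: every dimension is baked into the metric name.
dotted = statsd_line("api.us-east.requests.200", 1, "c")

# Tagged schema: one metric name, dimensions carried as tags.
tagged = statsd_line("api.requests", 1, "c", {"region": "us-east", "status": "200"})
```

With the dotted schema each new dimension creates a new metric name, while the tagged schema keeps one name and lets the tags be filtered or grouped later.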
Ownership
• Poor story for this
• Org was ready for this, management was on board.
• Evolving, tools are lacking.
Did it work?
Yes, but not done.
• Some teams? Hell yes. Strong champions, huge improvement.
• Some other teams, kinda the same.
• Some other other teams, what is Observability and why do I care? Rare!
Usage?
• 200+ dashboards created, 339 in old (over 2 years)
• 200+ monitors created, dozens in old (nobody trusted, was unreliable!)
• ~3000 distinct metrics (can’t compare, tags now!)
• All positive feedback from automation. (Avg 4.5, 2.5% response)
Tools?
• Dozens of OSS PRs, OSS *StatsD library (Scala), internal libraries (we own)
• Vast improvement over old pipeline, no loss
• New styles, better naming, more consistency
• Being tied to a commercial product cuts both ways
Adjustments?
• Embracing other tools (log analysis, error catching)
• Beginning to work on strategic things (global timers, histograms and sets)
• Need to improve metrics on our own work (we got by easy for a while)
• Monitoring is hard, need to fix.
Summary
• Start small
• Seek feedback
• Think on your value
• Measure effectiveness
• Enjoy!
Thanks
Team @antifuchs and @shu, all of Stripe
onemogin.com
@gphat
github.com/gphat
cory@stripe.com
Questions?
@gphat
Info
Slides
Feedback
Talk
Help me improve.