
Monitorama PDX - Monitoring for Distributed Operational Responsibility

Published in: Technology, Design

  1. Monitoring for Distributed Operational Responsibility
     Martin Parm, Infrastructure Engineer, Monitoring team
  2. Distributed Operational Responsibility
     Giving operational responsibility back to the feature teams (read: developers) instead of having a monolithic SRE team
     ❖ Capacity planning
     ❖ Configuration management
     ❖ Monitoring, alerting and on-call etc.
     But we provide the tools and infrastructure for them!
  3. But... but why...????
     ❖ Organizational Scalability - too frequent changes for a monolithic SRE team to keep up
     ❖ Getting The Right Person(tm) on the problem faster
     ❖ Accountability - making the right people hurt
     ❖ Autonomy - feature teams make all their own planning and decisions
     So let’s talk about monitoring...
  4. Human challenges
     ❖ Developers need training, but not a new education
     ❖ Developers need autonomy, but will do stupid things
     ❖ Developers need to care about metrics and analytics, but not the pipeline
     So how does that affect tooling?
  5. Alerting - What developers should care about
     Diagram labels: Metrics and events; Magic monitoring pipeline; Alerting rules
  6. Alerting - The reality
     Diagram labels: Apache Kafka; FFWD; Metrics and events; Other stuff; Even more stuff
  7. Developers collect their own metrics
     ...but we provide several different abstraction levels depending on complexity of the task
     ❖ Script hooks, i.e. drop a script in a folder
     ❖ Python scripts using the Riemann library
     ❖ Talk directly to FFWD using a supported protocol
  8. Developers write their own alerting rules
     ...but we help them by providing...
     ❖ Continuous integration with integration tests
     ❖ Abstractions from externals like PagerDuty
     ❖ Shared common functionality
  9. Impact on the monitoring team
     ❖ We build monitoring as a platform with many levels of entry
     ❖ Self-service is king!
     ❖ We spend a lot of our time teaching and talking rather than typing
     ...and that’s a good thing!
  10. ...and finally
     Distributed Operational Responsibility is work-in-progress
     ❖ We don’t know if this will work well
     ❖ We will run into new problems
     ❖ We will keep changing the way we work anyway
  11. Thank you for your time and patience!
     Martin Parm
     email: parmus@spotify.com
     twitter: @parmus_dk
     FFWD: https://github.com/spotify/ffwd
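
Slide 7 lists "talk directly to FFWD using a supported protocol" as the lowest-level way for developers to emit metrics. The slides do not show what that looks like, so the following is only a minimal sketch. It assumes a local FFWD agent accepting JSON datagrams over UDP on port 19000, and it assumes the message fields `type`, `key`, `value` and `attributes`. The address, the field names, and the helper names `build_metric` and `send_metric` are illustrative assumptions, not taken from the talk; check them against the agent configuration in use.

```python
import json
import socket

# Assumption: a local FFWD agent accepts JSON datagrams on this UDP address.
# Verify host and port against your own agent configuration.
FFWD_ADDR = ("127.0.0.1", 19000)


def build_metric(key, value, attributes=None):
    """Serialize one metric as a UTF-8 encoded JSON datagram.

    The field names used here are assumptions about the JSON input format.
    """
    message = {
        "type": "metric",
        "key": key,
        "value": value,
        "attributes": attributes or {},
    }
    return json.dumps(message).encode("utf-8")


def send_metric(key, value, attributes=None, addr=FFWD_ADDR):
    """Send one metric to the agent over UDP (fire and forget)."""
    payload = build_metric(key, value, attributes)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)
    return payload
```

A feature team's service could then call something like `send_metric("playlist.requests", 42, {"what": "request-count"})` from its own code. UDP keeps the sender decoupled from the pipeline, which matches the point on slide 4 that developers should care about metrics but not about the pipeline behind them.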

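
Slide 8 says developers write their own alerting rules, supported by continuous integration with tests and by abstractions from externals like PagerDuty. The talk does not show the rule format, so the sketch below is purely hypothetical: a rule is modeled as a plain function, and the paging backend is hidden behind a callable that the platform would supply. The names `Alert`, `error_rate_rule` and `evaluate`, and the 5% default threshold, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    service: str
    summary: str


# Hypothetical abstraction: a notifier is any callable that accepts an Alert.
# A PagerDuty-backed notifier would be provided by the monitoring platform.
Notifier = Callable[[Alert], None]


def error_rate_rule(service: str, errors: int, requests: int,
                    threshold: float = 0.05) -> Optional[Alert]:
    """Return an Alert when the error rate exceeds the threshold, else None."""
    if requests == 0:
        return None  # no traffic, nothing to judge
    rate = errors / requests
    if rate > threshold:
        return Alert(service, f"error rate {rate:.1%} exceeds {threshold:.1%}")
    return None


def evaluate(alert: Optional[Alert], notify: Notifier) -> bool:
    """Forward an alert to the notifier; report whether one was sent."""
    if alert is None:
        return False
    notify(alert)
    return True
```

Because the notifier is just a callable, a CI test can pass `list.append` in place of the real paging integration and assert on what would have been sent, without touching any external service.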