Monitorama PDX - Monitoring for Distributed Operational Responsibility
Infrastructure Engineer, Monitoring team
Giving operational responsibility back to the feature teams
(read: developers) instead of having a monolithic SRE team
❖ Capacity planning
❖ Configuration management
❖ Monitoring, alerting and on-call
But we provide the tools and infrastructure for them!
Distributed Operational Responsibility
❖ Organizational Scalability - changes are too frequent for a
monolithic SRE team to keep up
❖ Getting The Right Person(tm) on the problem faster
❖ Accountability - making the right people hurt
❖ Autonomy - feature teams do all their own planning
So let’s talk about monitoring....
But... but why...????
❖ Developers need training, but not a new education
❖ Developers need autonomy, but will do stupid things
❖ Developers need to care about metrics and analytics,
but not the pipeline
So how does that affect tooling?
Alerting - What developers should care about
Metrics and events
Alerting - The reality
Metrics and events
Developers collect their own metrics
...but we provide several different abstraction levels
depending on the complexity of the task
❖ Script hooks, i.e. drop a script in a folder
❖ Python scripts using the Riemann library
❖ Talk directly to FFWD using a supported protocol
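As a rough illustration of the lowest abstraction level (talking to the local agent directly), the sketch below sends one metric as a JSON datagram over UDP. The address, port 19000, and the JSON field names are assumptions made for this example, not details from the talk; check the agent's protocol documentation before relying on them. build_metric and send_metric are hypothetical helper names.

```python
import json
import socket

# Assumed for illustration only: agent address, port, and field names.
AGENT_ADDR = ("127.0.0.1", 19000)


def build_metric(key, value, attributes=None):
    """Encode one metric as a single JSON datagram (hypothetical schema)."""
    payload = {
        "type": "metric",
        "key": key,
        "value": float(value),
        "attributes": attributes or {},
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def send_metric(key, value, attributes=None, addr=AGENT_ADDR):
    """Fire-and-forget: send one datagram to the agent over UDP.

    Returns the number of bytes in the datagram that was sent.
    """
    data = build_metric(key, value, attributes)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(data, addr)
    return len(data)
```

A feature team's service could call send_metric("queue.depth", 3, {"role": "worker"}) from its own code. UDP is used here so that a missing or slow agent never blocks the application, at the cost of silently dropped metrics.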
Developers write their own alerting rules
....but we help them by providing....
❖ Continuous integration with integration tests
❖ Abstractions from externals like PagerDuty
❖ Shared common functionality
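To make the idea of testable, developer-owned alerting rules concrete, here is a minimal sketch in Python. It is not the rule language used in the talk: Event, Alert, threshold_rule and run_rules are hypothetical names, and the notify callback stands in for whatever abstraction hides an external pager such as PagerDuty.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Event:
    service: str
    metric: float
    host: str


@dataclass
class Alert:
    summary: str
    severity: str


Rule = Callable[[Event], Optional[Alert]]


def threshold_rule(service: str, limit: float, severity: str = "critical") -> Rule:
    """Return a rule that alerts when `service` reports a value above `limit`."""
    def rule(event: Event) -> Optional[Alert]:
        if event.service == service and event.metric > limit:
            summary = f"{service} on {event.host} is {event.metric} (limit {limit})"
            return Alert(summary, severity)
        return None
    return rule


def run_rules(rules: Iterable[Rule], events: Iterable[Event],
              notify: Callable[[Alert], None]) -> int:
    """Feed every event through every rule and pass any alerts to `notify`.

    Returns the number of alerts that fired.
    """
    rules = list(rules)
    fired = 0
    for event in events:
        for rule in rules:
            alert = rule(event)
            if alert is not None:
                notify(alert)
                fired += 1
    return fired
```

Because the pager is injected as a plain callback, an integration test in CI can pass list.append as notify and assert on the collected alerts, while production wiring supplies the real notifier.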
Impact on the monitoring team
❖ We build monitoring as a platform with many levels of abstraction
❖ Self-service is king!
❖ We spend a lot of our time teaching and talking rather
....and that’s a good thing!
Distributed Operational Responsibility is work-in-progress
❖ We don’t know if this will work well
❖ We will run into new problems
❖ We will keep changing the way we work
Thank you for your
time and patience!