/ Robert Treat
Less Alarming Alerts
Saturday, April 9, 16
Hello /@robtreat2
Former
WebDev SysAdmin DBA
I have now been promoted to where I can do the least damage
Hello /@robtreat2
Now
CEO @OMNITI
Hello /@robtreat2
Who Cares What Some Suit Thinks?
Hello /@robtreat2
Phantom Pages
Memory Lane /@robtreat2
Benny
Memory Lane /@robtreat2
MyFirstPager
Memory Lane /@robtreat2
Multiple Rotations
Memory Lane /@robtreat2
always available, phone only
no pager for years
Hello /@robtreat2
Phantom Pages
Hello /@robtreat2
I manage the SRE team at OmniTI
we manage multiple sites
24x7
millions of users
(omniti.com/is/hiring)
Why God Why?
paging is useful
“broken systems should not be
just another day at the office”
-- me
Why God Why?
paging is useful
Who has ever gotten an alert and ignored it?
(/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
Why God Why?
paging is useful
How many alerts were received in the past
week that were not actionable?
(no human action was required)
Why God Why?
paging CAN BE useful
Can We Fix It?
how to improve?
Can We Fix It?
hello@omniti.com
we offer operationally focused services to
help build and manage your infrastructure
:-)
Terms
• Metrics
• (anything which can be measured)
• Graphs
• (trending systems)
• Notices
• (notification of event; email)
• ALERTS
• (wake’n you up; pages)
Onward and Upward
If you want to improve
your alerts
use systems thinking to reason about your
“system”
Onward and Upward
alerts should be seen as evidence
that your system is behaving in a way
outside of your existing understanding
Onward and Upward
If you want to improve
your alerts
think in terms your business can get on
board with
Onward and Upward
for every alert you receive
What is the business impact of this alert?
Onward and Upward
for every alert you receive
What is the remediation for this alert?
Onward and Upward
remediation:
• Summarize the problem
• What was done to solve the problem?
• Who was notified?
• Can this be prevented?
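As a rough sketch, the four remediation questions above could be captured in a small record that is mailed to the team and linked from the alerting system. The `RemediationNote` class, its field names, and the sample values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationNote:
    """One write-up per alert received (hypothetical structure)."""
    alert_name: str
    summary: str                  # Summarize the problem
    actions_taken: str            # What was done to solve the problem?
    notified: list[str] = field(default_factory=list)  # Who was notified?
    prevention: str = ""          # Can this be prevented?

    def as_email(self) -> str:
        """Render the note as plain text to send to everyone on the team."""
        who = ", ".join(self.notified) if self.notified else "no one"
        return (
            f"Alert: {self.alert_name}\n"
            f"Problem: {self.summary}\n"
            f"Fix: {self.actions_taken}\n"
            f"Notified: {who}\n"
            f"Prevention: {self.prevention or 'unknown'}\n"
        )


# Made-up example values.
note = RemediationNote(
    alert_name="checkout-latency",
    summary="Checkout requests timed out for 10 minutes",
    actions_taken="Restarted the app server",
    notified=["on-call", "account manager"],
    prevention="Cap worker memory so the restart is not needed",
)
print(note.as_email())
```

Keeping the note as structured data rather than free text is one possible design choice; it makes it easier to spot the gaps and patterns across many write-ups.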
Onward and Upward
send the answer to these questions
to everyone on the team
every time
Onward and Upward
link to this documentation
from your alerting system
Onward and Upward
• Knowledge Transfer
• Gaps Exposed
• Patterns will emerge
Onward and Upward
you might be a bad alert
• cannot determine business impact
• no remediation necessary
• no one needs to be told
• workarounds are available
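The checklist above can be read as a simple predicate over the answers collected for each alert. The sketch below is one way to encode it; `AlertReview` and its field names are assumptions made for the example, not part of any alerting tool.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    """Answers gathered when reviewing one alert (hypothetical fields)."""
    has_business_impact: bool    # can we state the business impact?
    needs_remediation: bool      # did a human have to do something?
    someone_must_be_told: bool   # does anyone need to be notified?
    workaround_available: bool   # can the business carry on without a fix?


def bad_alert_reasons(review: AlertReview) -> list[str]:
    """Return the checklist items this alert trips; empty means it passes."""
    reasons = []
    if not review.has_business_impact:
        reasons.append("cannot determine business impact")
    if not review.needs_remediation:
        reasons.append("no remediation necessary")
    if not review.someone_must_be_told:
        reasons.append("no one needs to be told")
    if review.workaround_available:
        reasons.append("workarounds are available")
    return reasons


# A made-up review of a low-value alert that trips every item.
disk_warning = AlertReview(
    has_business_impact=False,
    needs_remediation=False,
    someone_must_be_told=False,
    workaround_available=True,
)
print(bad_alert_reasons(disk_warning))
```

Returning the reasons rather than a bare True/False keeps the review discussable: the list says which question the alert failed.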
Onward and Upward
if you can’t fix it, you don’t
need to wake up for it
Onward and Upward
if it can wait until morning,
you don’t need to wake up
for it
Onward and Upward
in case of bad alert
• remove the alert
• convert the alert to a notice
• implement fixes
Onward and Upward
pro tip:
never let anyone add an alert
unless they can answer these
questions first
Can We Really Do This?
this is partially an organizational issue
Can We Really Do This?
thought exercise:
if you launched a new web site today,
you really only need one alarm
Can We Really Do This?
“I don’t care if my servers are on fire,
as long as I am still making money”
-- Kevin, actual OmniTI customer
This sounds good but...
Most SA/SRE types want to be
proactive, not reactive.
i.e., they want to alert on leading
indicators, not on problems
This sounds good but...
Carrie: I-I'm just making sure we don't get hit again.
Saul: Well, I'm glad someone's looking out for us, Carrie.
Carrie: I'm serious. I-I missed something once before, I
won't... I can't let that happen again.
Saul: It was ten years ago. Everyone missed something
that day.
Carrie: Yeah, everyone's not me.
Based On A True Story
site down: monitor was checking 200
response code.
failed to notice absence of response
code.
easily fixed, but reactive
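One reading of the story above is that the check only reacted to a wrong status code, so a site that returned nothing at all never tripped it. The sketch below shows a check that treats "no response" as down. The function names are made up for the example, and the error handling is illustrative rather than exhaustive.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if no response arrived at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success code.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: no response code exists.
        return None


def site_is_up(status: Optional[int]) -> bool:
    """Only an explicit 200 counts as up; a missing response is down."""
    return status == 200


# A check written as "alert if status is not None and status != 200" would
# stay silent when status is None, which is the gap described above.
for status in (200, 503, None):
    print(status, "up" if site_is_up(status) else "DOWN")
```

In use, `site_is_up(fetch_status(url))` would be the value the monitor alerts on; asserting the one good outcome is simpler than enumerating every bad one.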
Based On A True Story
“root cause” ==> OOM
why don’t we alert on OOM?
OOM does not consistently cause outages
Based On A True Story
too many false positives lead to
ignoring alarms
Digression
Friedman, Naparstek, Taussig-Rubbo
Alarmingly Useless: The Case For Banning Car Alarms In NYC
http://transalt.org/files/news/reports/caralarms/report.pdf
Blackstone, Buck, Hakim
Evaluation of alternative policies to combat false emergency calls
http://isc.temple.edu/economics/wkpapers/Pubs/FalsePolicy.pdf
Wickens, Rice, Keller, Hutchins, Hughes, Clayton
False Alerts in Air Traffic Control Conflict Alerting System: Is There A Cry Wolf Effect?
http://www.tc.faa.gov/LOGISTICS/grants/pdf/2007/07-G-002.pdf
Görges M, Markewitz BA, Westenskow DR
Improving Alarm Performance In The Medical Intensive Care Unit Using Delays and Clinical Context
http://www.ncbi.nlm.nih.gov/pubmed/19372334
“In an intensive care unit, alarms are used to call attention to a patient, to
alert a change in the patient's physiology, or to warn of a failure in a medical
device; however, up to 94% of the alarms are false.”
Digression
AESOP
The Boy Who Cried Wolf
Based On A True Story
• send notice of OOM?
• fix the cause of OOM?
• make a useful alert?
Based On A True Story
useful alerting
• script that checks for OOM
• restart app server when found
• find offending process; kill it
• spin up new node; kill old node
in the event all of these fail, send an alert?
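The escalation above (try each automated fix in turn, page only if all of them fail) might be sketched as follows. The step names mirror the slide, but the step bodies are stand-ins, and `remediate` and `page` are hypothetical helpers rather than any real tool's API.

```python
from typing import Callable, Iterable, Tuple

# Each step tries one fix and returns True if the service is healthy afterwards.
Step = Tuple[str, Callable[[], bool]]


def remediate(steps: Iterable[Step], page: Callable[[str], None]) -> bool:
    """Try each automated fix in order; page a human only if all of them fail."""
    tried = []
    for name, attempt in steps:
        tried.append(name)
        try:
            if attempt():
                return True  # fixed without waking anyone
        except Exception:
            # A crashing fix counts as a failed fix; move on to the next one.
            pass
    page("OOM remediation failed after: " + ", ".join(tried))
    return False


# Stand-in steps for illustration; real ones would call your init system,
# process table, and provisioning API.
pages = []
steps = [
    ("restart app server", lambda: False),
    ("kill offending process", lambda: False),
    ("replace node", lambda: True),
]
fixed = remediate(steps, pages.append)
print(fixed, pages)
```

The page message lists what was already tried, so whoever is woken up does not have to repeat those steps at 2AM.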
Based On A True Story
thought exercise:
if you launched a new web site today,
you really only need one alert
In Conclusion
if we need software that runs 24x7, we should
design resiliency into our software,
not human intervention
In Conclusion
thinking doesn’t scale
especially at 2AM
In Conclusion
thanks!
more:
Surge 2016
http://surge.omniti.com
@robtreat2
@omniti

Less Alarming Alerts - SRECon 2016

  • 1. / Robert Treat Less Alarming Alerts Saturday, April 9, 16
  • 2. Hello /@robtreat2 Former WebDev, SysAdmin, DBA. I have now been promoted to where I can do the least damage
  • 4. Hello /@robtreat2 Who Cares What Some Suit Thinks?
  • 8. Memory Lane /@robtreat2 Multiple Rotations
  • 9. Memory Lane /@robtreat2 always available, phone only; no pager for years
  • 11. Hello /@robtreat2 I manage the SRE team at OmniTI; we manage multiple sites, 24x7, millions of users (omniti.com/is/hiring)
  • 12. Why God Why? paging is useful: “broken systems should not be just another day at the office” -- me
  • 13. Why God Why? paging is useful. Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
  • 14. Why God Why? paging is useful. How many alerts were received in the past week that were not actionable? (no human action was required)
  • 15. Why God Why? paging CAN BE useful
  • 16. Can We Fix It? how to improve?
  • 17. Can We Fix It? hello@omniti.com we offer operationally focused services to help build and manage your infrastructure :-)
  • 21. Terms • Metrics (anything which can be measured) • Graphs (trending systems) • Notices (notification of event; email) • ALERTS (waking you up; pages)
  • 23. Onward and Upward If you want to improve your alerts, use systems thinking to reason about your “system”
  • 24. Onward and Upward alerts should be seen as evidence that your system is behaving in a way outside of your existing understanding
  • 25. Onward and Upward If you want to improve your alerts, think in terms your business can get on board with
  • 26. Onward and Upward for every alert you receive: What is the business impact of this alert?
  • 27. Onward and Upward for every alert you receive: What is the remediation for this alert?
  • 28. Onward and Upward remediation: • Summarize the problem • What was done to solve the problem? • Who was notified? • Can this be prevented?
  • 29. Onward and Upward send the answers to these questions to everyone on the team, every time
  • 30. Onward and Upward link to this documentation from your alerting system
  • 31. Onward and Upward • Knowledge Transfer • Gaps Exposed • Patterns will emerge
  • 32. Onward and Upward you might be a bad alert if: • cannot determine business impact • no remediation necessary • no one needs to be told • workarounds are available
  • 33. Onward and Upward if you can’t fix it, you don’t need to wake up for it
  • 34. Onward and Upward if it can wait until morning, you don’t need to wake up for it
  • 37. Onward and Upward in case of bad alert • remove the alert • convert the alert to a notice • implement fixes
  • 38. Onward and Upward pro tip: never let anyone add an alert unless they can answer these questions first
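The questions from the slides above can be turned into a gate for proposed alerts. What follows is a minimal sketch, not something shown in the talk: the `AlertProposal` fields, the `triage` function, and the exact mapping from answers to "page", "notice", or "remove" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlertProposal:
    """Hypothetical record of the answers someone must give before adding an alert."""
    name: str
    business_impact: str = ""          # what the business loses while this fires
    remediation: str = ""              # what a human is expected to do about it
    who_to_notify: str = ""            # who needs to be told
    workaround_available: bool = False
    can_wait_until_morning: bool = False


def triage(p: AlertProposal) -> str:
    """Return 'page', 'notice', or 'remove' for a proposed alert.

    The mapping is an assumption based on the 'bad alert' checklist:
    no business impact or nobody to tell means the alert is removed;
    nothing to fix, a workaround, or 'can wait until morning' means
    it becomes a notice instead of a page.
    """
    if not p.business_impact.strip() or not p.who_to_notify.strip():
        return "remove"
    if not p.remediation.strip() or p.workaround_available or p.can_wait_until_morning:
        return "notice"
    return "page"
```

In this sketch an alert only pages when every question has an answer and the problem cannot wait; anything weaker is downgraded rather than allowed to wake someone up.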
  • 39. Can We Really Do This? this is partially an organizational issue
  • 40. Can We Really Do This? thought exercise: if you launched a new web site today, you really only need one alarm
  • 41. Can We Really Do This? “I don’t care if my servers are on fire, as long as I am still making money” -- Kevin, actual OmniTI customer
  • 42. This sounds good but... Most SA/SRE types want to be proactive, not reactive, i.e. they want to alert on leading indicators, not on problems
  • 43. This sounds good but... Carrie: I-I'm just making sure we don't get hit again. Saul: Well, I'm glad someone's looking out for us, Carrie. Carrie: I'm serious. I-I missed something once before, I won't... I can't let that happen again. Saul: It was ten years ago. Everyone missed something that day. Carrie: Yeah, everyone's not me.
  • 44. Based On A True Story site down: monitor was checking for a 200 response code; it failed to notice the absence of any response code. easily fixed, but reactive
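One way to avoid the failure mode on the slide above is to treat "no response at all" as a failed check rather than as "nothing to compare". This is a minimal sketch under assumptions: the talk does not show the actual monitor, and `fetch_status` and `is_healthy` are hypothetical names. It uses only Python's standard `urllib`.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None when no response code was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success code (e.g. 500).
        return err.code
    except (OSError, http.client.HTTPException):
        # Timeout, refused connection, DNS failure, garbled reply: no code at all.
        return None


def is_healthy(status: Optional[int]) -> bool:
    """Healthy only on an explicit 200; a missing status counts as down."""
    return status == 200
```

The design choice is that health is asserted positively: anything other than an observed 200, including silence, fails the check.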
  • 45. Based On A True Story “root cause” ==> OOM (out of memory). why don’t we alert on OOM? OOM does not consistently cause outages
  • 46. Based On A True Story too many false positives lead to ignored alarms
  • 47. Digression
      Friedman, Naparstek, Taussig-Rubbo. Alarmingly Useless: The Case For Banning Car Alarms In NYC. http://transalt.org/files/news/reports/caralarms/report.pdf
      Blackstone, Buck, Hakim. Evaluation of alternative policies to combat false emergency calls. http://isc.temple.edu/economics/wkpapers/Pubs/FalsePolicy.pdf
      Wickens, Rice, Keller, Hutchins, Hughes, Clayton. False Alerts in Air Traffic Control Conflict Alerting System: Is There A Cry Wolf Effect? http://www.tc.faa.gov/LOGISTICS/grants/pdf/2007/07-G-002.pdf
      Görges M, Markewitz BA, Westenskow DR. Improving Alarm Performance In The Medical Intensive Care Unit Using Delays and Clinical Context. http://www.ncbi.nlm.nih.gov/pubmed/19372334
      “In an intensive care unit, alarms are used to call attention to a patient, to alert a change in the patient's physiology, or to warn of a failure in a medical device; however, up to 94% of the alarms are false.”
  • 48. Digression AESOP: The Boy Who Cried Wolf
  • 49. Based On A True Story • send notice of OOM? • fix the cause of OOM? • make a useful alert?
  • 50. Based On A True Story useful alerting • script that checks for OOM • restart app server when found • find offending process; kill it • spin up new node; kill old node • in the event all of these fail, send an alert?
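The escalation on the slide above (try each automated fix in turn, and only page a human when all of them fail) could be sketched roughly as follows. This is an illustration, not the script from the talk: `handle_oom`, the `oom_detected` probe, the step callables, and the `page` callback are all hypothetical and would be supplied by the environment.

```python
from typing import Callable, List, Sequence, Tuple

# A remediation step is a (name, action) pair; names and actions are caller-supplied.
Step = Tuple[str, Callable[[], None]]


def handle_oom(
    oom_detected: Callable[[], bool],
    steps: Sequence[Step],
    page: Callable[[str], None],
) -> str:
    """Run remediation steps in order; page a human only if none of them clears the OOM."""
    if not oom_detected():
        return "no-oom"
    tried: List[str] = []
    for name, action in steps:
        tried.append(name)
        try:
            action()
        except Exception:
            continue  # a failed step just means we move on to the next one
        if not oom_detected():
            return "fixed:" + name
    page("OOM persists after: " + ", ".join(tried))
    return "paged"
```

A caller would pass steps such as ("restart app server", ...), ("kill offending process", ...), and ("replace node", ...) in that order, so the page is sent only after every automated option has been tried.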
  • 51. Based On A True Story thought exercise: if you launched a new web site today, you really only need one alert
  • 52. In Conclusion if we need software that runs 24x7, we should design resiliency into our software, not human intervention
  • 53. In Conclusion thinking doesn’t scale, especially at 2AM