Lightning Fast Monitoring against Lightning Fast Outages
Maxime Petazzoni
@mpetazzoni
#MonitorSF – May 2017
At SignalFx
SignalFx uses SignalFx for all monitoring & alerting
DevOps culture, SWEs on-call
Distributed responsibility for production
Optimizing for least amount of operational work
Problem statement
Today's systems are complex, distributed infrastructures
Large amounts of real-time traffic
Systems fail in new, unpredictable and exotic ways
Among those failures lies the "lightning fast outage"
Lightning fast outages
Sudden meltdown of one or more components/services
Develops from anomaly to outage very quickly
New, unexpected or unusual load
Specifics depend on system and architecture
Causes
System behavior tied to user activity or user input
Query with unexpectedly large results
Sudden increase in traffic
Bug in seldom executed/tested code path
Hard to catch
Most monitoring systems are not sophisticated enough
Not granular or not fast enough
Straight from "all good" to "nothing works"
No details on evolution of the anomaly
Even with metrics
With us in the 21st century and using metrics?
(a lot of people still don't, and still put up with Nagios checks!?)
1m, or even 10s resolution may still not be enough
Graphite/Influx users: do you trust your right edge?
How accurate is your recent data through aggregations?
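To make the resolution point concrete, here is a minimal sketch (the numbers and the `ResolutionDemo` class are illustrative assumptions, not from the talk) that rolls six 10-second samples up into one 1-minute mean. A single 10-second spike that is obvious at full resolution mostly vanishes once averaged.

```java
import java.util.Arrays;

/** Illustrative only: shows how a coarse rollup can hide a short spike. */
class ResolutionDemo {
    /** Averages consecutive groups of `factor` samples (six 10s points become one 1m point). */
    static double[] rollupMean(double[] samples, int factor) {
        double[] out = new double[samples.length / factor];
        for (int i = 0; i < out.length; i++) {
            double sum = 0;
            for (int j = 0; j < factor; j++) {
                sum += samples[i * factor + j];
            }
            out[i] = sum / factor;
        }
        return out;
    }

    /** Largest value in the series, or NaN if the series is empty. */
    static double max(double[] values) {
        return Arrays.stream(values).max().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical queue depth sampled every 10s, with one 10s spike to 900.
        double[] tenSecond = {50, 50, 50, 900, 50, 50, 50, 50, 50, 50, 50, 50};
        double[] oneMinute = rollupMean(tenSecond, 6);
        System.out.println("max at 10s resolution: " + max(tenSecond));
        System.out.println("max at 1m resolution:  " + max(oneMinute));
    }
}
```

With these made-up samples, an alert threshold of 500 would fire on the 10-second series but never on the 1-minute series.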
The ultimate slow/weak link
The on-call human!
Even if the anomaly is detected before it becomes an outage, humans can't act fast enough
Think about your MTTA and MTTR
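As a rough way to put numbers on this, the sketch below computes MTTA (mean time to acknowledge) and MTTR (taken here as mean time to resolve) from incident timestamps. The `Incident` record shape is an assumption for illustration, and both means are measured from the moment the alert fired; definitions vary between teams.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Illustrative helper; the incident shape is hypothetical. */
class IncidentStats {
    /** When the alert fired, when a human acknowledged it, and when it was resolved. */
    static final class Incident {
        final Instant fired;
        final Instant acknowledged;
        final Instant resolved;

        Incident(Instant fired, Instant acknowledged, Instant resolved) {
            this.fired = fired;
            this.acknowledged = acknowledged;
            this.resolved = resolved;
        }
    }

    /** Mean of (acknowledged - fired) across incidents; zero for an empty list. */
    static Duration meanTimeToAcknowledge(List<Incident> incidents) {
        return mean(incidents, true);
    }

    /** Mean of (resolved - fired) across incidents; zero for an empty list. */
    static Duration meanTimeToResolve(List<Incident> incidents) {
        return mean(incidents, false);
    }

    private static Duration mean(List<Incident> incidents, boolean toAcknowledge) {
        if (incidents.isEmpty()) {
            return Duration.ZERO;
        }
        long totalMillis = 0;
        for (Incident incident : incidents) {
            Instant end = toAcknowledge ? incident.acknowledged : incident.resolved;
            totalMillis += Duration.between(incident.fired, end).toMillis();
        }
        return Duration.ofMillis(totalMillis / incidents.size());
    }
}
```

If the MTTA alone is longer than the time an anomaly takes to become an outage, no amount of human speed-up closes the gap.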
Solution?
A lightning fast monitoring system
A dynamic stack
Your very own IFTTT
But before we go into details...
What does it look like?
Lightning fast monitoring and automated remediation saving the day
A fast, real-time monitoring system
Observability
Metrics, metrics, metrics
Complete, real-time, high-resolution observability
Dimensions / tags (but watch for cardinality)
Push vs pull? Aha.
Instrument all the things
Metrics have good value/$
Develop a culture of code instrumentation
Instrument at all levels: host, container, application
Don't stop at infrastructure and framework-level metrics
Custom application metrics are often the most valuable
Application instrumentation
Count requests, actions, errors, data
Measure queues, thread pools, buffers, in-flight tasks
Time methods, wait times, response times, latency
metrics.counter("errors", "class", e.getClass().getName()).inc();
metrics.counter("requests", "endpoint", "search").inc();
metrics.registerGauge("queue_size", queue::size);
try (Timer.Context c = metrics.timer("request.timer").time()) {
    // process data
}
Powerful alerting
Fast, real-time alerting based on metrics analytics
Accurate "right now", even on most recent data
Able to handle reporting lag dynamically
Build smart, dynamic alerts; a static threshold is often a smell
Detecting causality between redundant alerts is still a really hard problem...
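One simple form a "dynamic" alert can take is comparing each new value against the recent behavior of the same signal rather than a fixed number. The sketch below is an assumption for illustration (it is not SignalFx's detector): it flags a value that exceeds the rolling mean of the previous window by a configurable number of standard deviations.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative rolling mean + k-sigma detector; names and parameters are hypothetical. */
class DynamicThresholdDetector {
    private final int windowSize;
    private final double sigmas;
    private final double minStddev;
    private final Deque<Double> window = new ArrayDeque<>();

    /**
     * @param windowSize number of previous samples used as the baseline
     * @param sigmas     how many standard deviations above the mean counts as anomalous
     * @param minStddev  floor on the deviation so a perfectly flat signal does not alert on noise
     */
    DynamicThresholdDetector(int windowSize, double sigmas, double minStddev) {
        this.windowSize = windowSize;
        this.sigmas = sigmas;
        this.minStddev = minStddev;
    }

    /** Returns true if `value` is anomalous versus the preceding window, then records it. */
    boolean observe(double value) {
        boolean anomalous = false;
        if (window.size() == windowSize) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean))
                    .sum() / windowSize;
            double stddev = Math.max(Math.sqrt(variance), minStddev);
            anomalous = value > mean + sigmas * stddev;
            window.removeFirst();
        }
        window.addLast(value);
        return anomalous;
    }
}
```

Note that this sketch adds anomalous values to the baseline as well, so a sustained shift stops alerting after one window; a real detector would need a deliberate policy for that.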
Building a dynamic stack
From provisioning to application configuration
Elastic infrastructure
Automated provisioning (much faster with containers!)
Dynamic service discovery (ZK/Curator, etcd, ...)
Horizontal scalability
Consider your 3rd party components too
All you should need is a zkConnectString
Elastic infrastructure (2)
Deeply integrate discovery in your application framework
At SignalFx, Guice, Thrift and ZK/Curator come together:
public class Foo {
    private final AnalyticsService.Iface analytics;

    @Inject
    public Foo(AnalyticsService.Iface analytics) {
        /*
         * Injects an implementation of the service interface that
         * makes Thrift RPC calls to the target discovered service.
         */
        this.analytics = analytics;
    }

    public void doSomething(String program) {
        /*
         * Call is always backed by currently advertised instances,
         * partitioned and load-balanced as defined by the service.
         */
        analytics.execute(program);
    }
}
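The idea that every call is "backed by currently advertised instances" can be sketched without any of the real machinery. The class below is a hypothetical in-memory stand-in for a ZooKeeper-backed registry (it does not use Curator, Guice or Thrift): instances advertise and withdraw themselves, and each call picks round-robin from whatever is advertised at that moment.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical in-memory stand-in for a discovery service such as ZooKeeper. */
class InMemoryDiscovery {
    private final List<String> instances = new CopyOnWriteArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    /** Called by an instance when it comes up. */
    void advertise(String hostPort) {
        instances.add(hostPort);
    }

    /** Called when an instance shuts down or is taken out of rotation. */
    void withdraw(String hostPort) {
        instances.remove(hostPort);
    }

    /** Round-robin pick among the instances advertised right now. */
    String pick() {
        Object[] snapshot = instances.toArray();
        if (snapshot.length == 0) {
            throw new IllegalStateException("no instances advertised");
        }
        int index = Math.floorMod(next.getAndIncrement(), snapshot.length);
        return (String) snapshot[index];
    }
}
```

Because the instance list is re-read on each call, withdrawing an instance takes effect on the very next pick, which is what makes later remediation actions such as "take out of load balancer" cheap to apply.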
Internal control endpoints
Build operations endpoints into your applications
Internal HTTP API, JMX operations, control console, ...
Load shedding
Take out of load balancer / stop consuming
Impeachment (to elect new leader)
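A load-shedding control endpoint can be quite small. The sketch below is an assumption for illustration: the `/shed/on` and `/shed/off` paths and the `LoadShedder` class are made up, and it uses the JDK's built-in `com.sun.net.httpserver` server bound to localhost only. The request path in the application would call `tryAccept()` before taking on new work.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical load-shedding switch with an internal HTTP control endpoint. */
class LoadShedder {
    private final AtomicBoolean shedding = new AtomicBoolean(false);

    /** Application code calls this before accepting new work. */
    boolean tryAccept() {
        return !shedding.get();
    }

    void setShedding(boolean enabled) {
        shedding.set(enabled);
    }

    /** Starts a localhost-only control server: POST /shed/on and POST /shed/off. */
    HttpServer startControlServer(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/shed/on", exchange -> toggle(exchange, true));
        server.createContext("/shed/off", exchange -> toggle(exchange, false));
        server.start();
        return server;
    }

    private void toggle(HttpExchange exchange, boolean enabled) throws IOException {
        if (!"POST".equals(exchange.getRequestMethod())) {
            exchange.sendResponseHeaders(405, -1); // method not allowed, no body
            exchange.close();
            return;
        }
        setShedding(enabled);
        byte[] body = ("shedding=" + enabled + "\n").getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }
}
```

Keeping the switch itself separate from the HTTP wiring means the same flag can also be exposed over JMX or a control console.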
Dynamic application configuration
Ability to perform real-time configuration changes
Control thread pools, maximum queue and buffer sizes
Control task execution intervals
Enable/disable features and safety valves
Respect semantics for graceful degradation
Dynamic application configuration (2)
At SignalFx, application configuration is in ZooKeeper
Config properties at environment and service levels
Changes seen and reflected in real-time
Callbacks
With real-time monitoring, effect is directly visible
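The callback mechanism can be pictured with a small stand-in. `DynamicProperty` below is a hypothetical class, not the SignalFx configuration library: it holds the current value, and `update` (which a ZooKeeper watch or similar backend would call) swaps the value and then runs every registered callback.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical live config value with change callbacks. */
class DynamicProperty<T> {
    private volatile T value;
    private final List<Runnable> callbacks = new CopyOnWriteArrayList<>();

    DynamicProperty(T initial) {
        this.value = initial;
    }

    /** Always returns the most recently published value. */
    T get() {
        return value;
    }

    void addCallback(Runnable callback) {
        callbacks.add(callback);
    }

    void removeCallback(Runnable callback) {
        callbacks.remove(callback);
    }

    /** Invoked by the config backend when the stored value changes. */
    void update(T newValue) {
        value = newValue;
        for (Runnable callback : callbacks) {
            callback.run();
        }
    }
}
```

Callbacks take no argument and read the property themselves, so a callback that fires late still sees the latest value rather than a stale one.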
Dynamic application configuration (3)
public interface FooConfig extends ConfigInterface {
    // When declared as Property<>, allows for attaching callbacks
    @Config(name = "pool.threads", defaultValue = "16")
    Property<Integer> getThreadPoolSize();

    // But can also just be a primitive
    @Config(name = "feature.enabled", defaultValue = "true")
    boolean isFeatureEnabled();

    // Or even a JSON-decoded value
    @Config(name = "blacklist", defaultValue = "[]", json = "true")
    Set<String> getBlacklist();
}
Dynamic application configuration (4)
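public class Foo {
    private final FooConfig config;
    private final ThreadPoolExecutor executor;
    // Method reference rather than a field-initializer lambda: a lambda here may
    // not read the blank final fields 'config' and 'executor' by simple name.
    private final Runnable poolSizeCallback = this::resizePool;

    @Inject
    public Foo(FooConfig config) {
        this.config = config;
        executor = new ThreadPoolExecutor(...);
        config.getThreadPoolSize().addCallback(poolSizeCallback);
    }

    private void resizePool() {
        int threads = config.getThreadPoolSize().get();
        // Since Java 9 the setters reject core > maximum, so grow the maximum
        // first and shrink the core first.
        if (threads > executor.getMaximumPoolSize()) {
            executor.setMaximumPoolSize(threads);
            executor.setCorePoolSize(threads);
        } else {
            executor.setCorePoolSize(threads);
            executor.setMaximumPoolSize(threads);
        }
    }

    public Future<Result> doSomething(String arg) {
        return config.isFeatureEnabled()
            ? executor.submit(() -> computeResult())
            : Futures.immediateFailedFuture(...);
    }

    public void shutdown() {
        executor.shutdown();
        config.getThreadPoolSize().removeCallback(poolSizeCallback);
    }
}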
Putting it together: your own IFTTT
From automated to automatic
Automatic remediation
Hardest part: going from tribal knowledge to script
What action to take, to what extent, on what alert?
Depending on action taken, page on-call to address issue
Test your automation!
Exercise and validate automated actions in CI
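Turning tribal knowledge into a script largely means writing down, per alert, which action runs and whether a human still gets paged. The sketch below is one possible shape under stated assumptions: the `Remediator` class, the alert names and the in-memory audit log are all hypothetical. Unknown alerts and failed actions always fall back to paging.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical alert-to-action dispatcher with an audit trail. */
class Remediator {
    /** A remediation step plus whether the on-call should still be paged afterwards. */
    private static final class Rule {
        final Runnable action;
        final boolean pageOnCall;

        Rule(Runnable action, boolean pageOnCall) {
            this.action = action;
            this.pageOnCall = pageOnCall;
        }
    }

    private final Map<String, Rule> rulesByAlert = new HashMap<>();
    private final List<String> auditLog = new ArrayList<>();

    void register(String alertName, Runnable action, boolean pageOnCall) {
        rulesByAlert.put(alertName, new Rule(action, pageOnCall));
    }

    /** Handles a fired alert; returns true if the on-call should be paged. */
    boolean onAlert(String alertName) {
        Rule rule = rulesByAlert.get(alertName);
        if (rule == null) {
            auditLog.add(alertName + ": no rule, paging on-call");
            return true;
        }
        try {
            rule.action.run();
            auditLog.add(alertName + ": action applied");
            return rule.pageOnCall;
        } catch (RuntimeException e) {
            auditLog.add(alertName + ": action failed (" + e.getMessage() + "), paging on-call");
            return true;
        }
    }

    /** Read-only view of everything the automation did, in order. */
    List<String> auditLog() {
        return Collections.unmodifiableList(auditLog);
    }
}
```

Because the rules are plain objects, a CI test can register a fake action, fire the alert, and assert on both the side effect and the audit log without touching production.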
Wiring
Define and verify your alerts (anomaly detector preflighting)
Webhooks and lambdas to the rescue
You can also act on alert resolution
Record audit log
Profit Sleep tight
Lightning Fast Monitoring and Automated Remediation
