Lightning Fast Monitoring and Automated Remediation
1. Lightning Fast Monitoring against
Lightning Fast Outages
Maxime Petazzoni
@mpetazzoni
#MonitorSF – May 2017
2. At SignalFx
SignalFx uses SignalFx for all monitoring & alerting
DevOps culture, SWEs on-call
Distributed responsibility for production
Optimizing for least amount of operational work
3. Problem statement
Today's systems are complex, distributed infrastructures
Large amounts of real-time traffic
Systems fail in new, unpredictable and exotic ways
Among those failures lies the "lightning fast outage"
4. Lightning fast outages
Sudden meltdown of one or more components/services
Develops from anomaly to outage very quickly
New, unexpected or unusual load
Specifics depend on the system and architecture
5. Causes
System behavior tied to user activity or user input
Query with unexpectedly large results
Sudden increase in traffic
Bug in seldom executed/tested code path
6. Hard to catch
Most monitoring systems are not sophisticated enough
Not granular or not fast enough
Straight from "all good" to "nothing works"
No details on evolution of the anomaly
7. Even with metrics
So, you're with us in the 21st century and using metrics?
(a lot of people still don't, and still put up with Nagios checks!?)
1m, or even 10s resolution may still not be enough
Graphite/Influx users: do you trust your right edge?
How accurate is your recent data through aggregations?
8. The ultimate slow/weak link
The on-call human!
Even if the anomaly is detected before it becomes an
outage, humans can't act fast enough
Think about your MTTA and MTTR
9. Solution?
A lightning fast monitoring system
A dynamic stack
Your very own IFTTT
But before we go into details...
10. What does it look like?
Lightning fast monitoring and
automated remediation saving the day
14. Instrument all the things
Metrics have good value/$
Develop a culture of code instrumentation
Instrument at all levels: host, container, application
Don't stop at infrastructure and framework-level metrics
Custom application metrics are often the most valuable
15. Application instrumentation
Count requests, actions, errors, data
Measure queues, thread pools, buffers, in-flight tasks
Time methods, wait times, response times, latency
metrics.counter("errors", "class", e.getClass().getName()).inc();
metrics.counter("requests", "endpoint", "search").inc();
metrics.registerGauge("queue_size", queue::size);

try (Timer.Context c = metrics.timer("request.timer").time()) {
    // process data
}
16. Powerful alerting
Fast, real-time alerting based on metrics analytics
Accurate "right now", even on most recent data
Able to handle reporting lag dynamically
Build smart, dynamic alerts; static thresholds are often a smell
Detecting causality between redundant alerts still a really
hard problem...
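One way to read "static thresholds are often a smell": compare each incoming value against a rolling baseline instead of a fixed number. A minimal sketch of that idea (class, method, and parameter names are mine for illustration, not SignalFx's detector API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: fire an alert when a value strays more than
// `k` standard deviations from a rolling baseline of recent values,
// rather than crossing a fixed, static threshold.
class DynamicThresholdDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double k;

    DynamicThresholdDetector(int windowSize, double k) {
        this.windowSize = windowSize;
        this.k = k;
    }

    /** Feeds one observation; returns true if it should alert. */
    boolean observe(double value) {
        boolean alert = false;
        if (window.size() >= windowSize) {
            double mean = window.stream()
                    .mapToDouble(d -> d).average().orElse(0);
            double var = window.stream()
                    .mapToDouble(d -> (d - mean) * (d - mean))
                    .average().orElse(0);
            alert = Math.abs(value - mean) > k * Math.sqrt(var);
        }
        window.addLast(value);
        if (window.size() > windowSize) {
            window.removeFirst();  // keep only the recent baseline
        }
        return alert;
    }
}
```

A real detector would also have to cope with reporting lag and seasonality; this only shows why the threshold itself should move with the data.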
18. Elastic infrastructure
Automated provisioning (much faster with containers!)
Dynamic service discovery (ZK/Curator, etcd, ...)
Horizontal scalability
Consider your 3rd party components too
All you should need is a zkConnectString
19. Elastic infrastructure (2)
Deeply integrate discovery into your application framework
At SignalFx, Guice, Thrift and ZK/Curator come together:
public class Foo {
    private final AnalyticsService.Iface analytics;

    @Inject
    public Foo(AnalyticsService.Iface analytics) {
        /*
         * Injects an implementation of the service interface that
         * makes Thrift RPC calls to the target discovered service.
         */
        this.analytics = analytics;
    }

    public void doSomething(String program) {
        /*
         * Call is always backed by currently advertised instances,
         * partitioned and load-balanced as defined by the service.
         */
        analytics.execute(program);
    }
}
20. Internal control endpoints
Build operations endpoints into your applications
Internal HTTP API, JMX operations, control console, ...
Load shedding
Take out of load balancer / stop consuming
Impeachment (to elect new leader)
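Behind any of these control endpoints (HTTP, JMX, or console) sits a piece of shared state that the endpoint flips and the hot path consults. A minimal sketch of the load-shedding case, with hypothetical names (the HTTP/JMX wiring itself is omitted):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the state behind a "load shedding" control
// endpoint. An internal HTTP handler or JMX operation would call
// shedLoad()/resume(); the request path checks shouldAccept() before
// taking on new work.
class LoadShedder {
    private final AtomicBoolean shedding = new AtomicBoolean(false);

    // Flipped by the operations endpoint (HTTP, JMX, console, ...).
    void shedLoad() { shedding.set(true); }
    void resume() { shedding.set(false); }

    // Consulted on the hot path; when shedding, reject or queue-drop
    // instead of accepting new requests.
    boolean shouldAccept() { return !shedding.get(); }
}
```

The same pattern covers "take out of load balancer / stop consuming": the toggle just gates the consumer loop instead of the request path.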
21. Dynamic application configuration
Ability to perform real-time configuration changes
Control thread pools, maximum queue and buffer sizes
Control task execution intervals
Enable/disable features and safety valves
Respect semantics for graceful degradation
22. Dynamic application configuration (2)
At SignalFx, application configuration is in ZooKeeper
Config properties at environment and service levels
Changes seen and reflected in real-time
Callbacks
With real-time monitoring, effect is directly visible
23. Dynamic application configuration (3)
public interface FooConfig extends ConfigInterface {
    // When declared as Property<>, allows for attaching callbacks
    @Config(name = "pool.threads", defaultValue = "16")
    Property<Integer> getThreadPoolSize();

    // But can also just be a primitive
    @Config(name = "feature.enabled", defaultValue = "true")
    boolean isFeatureEnabled();

    // Or even a JSON-decoded value
    @Config(name = "blacklist", defaultValue = "[]", json = "true")
    Set<String> getBlacklist();
}
24. Dynamic application configuration (4)
public class Foo {
    private final FooConfig config;
    private final ThreadPoolExecutor executor;

    private final Runnable poolSizeCallback = () -> {
        int threads = config.getThreadPoolSize().get();
        executor.setCorePoolSize(threads);
        executor.setMaximumPoolSize(threads);
    };

    @Inject
    public Foo(FooConfig config) {
        this.config = config;
        executor = new ThreadPoolExecutor(...);
        config.getThreadPoolSize().addCallback(poolSizeCallback);
    }

    public Future<Result> doSomething(String arg) {
        return config.isFeatureEnabled()
                ? executor.submit(() -> computeResult())
                : Futures.immediateFailedFuture(...);
    }

    public void shutdown() {
        executor.shutdown();
        config.getThreadPoolSize().removeCallback(poolSizeCallback);
    }
}
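The Property<> machinery used above can be approximated by a small observable value holder. This is a sketch of the pattern only; in the real system the change notification is driven by a ZooKeeper watch, not by application code calling set():

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the Property<T> pattern: a configuration value holder that
// runs registered callbacks whenever the backing store reports a change.
class Property<T> {
    private final AtomicReference<T> value;
    private final List<Runnable> callbacks = new CopyOnWriteArrayList<>();

    Property(T initial) { this.value = new AtomicReference<>(initial); }

    T get() { return value.get(); }

    void addCallback(Runnable cb) { callbacks.add(cb); }
    void removeCallback(Runnable cb) { callbacks.remove(cb); }

    // Invoked when the backing store (e.g. a ZooKeeper node) changes;
    // updates the value, then notifies all registered callbacks.
    void set(T newValue) {
        value.set(newValue);
        callbacks.forEach(Runnable::run);
    }
}
```

With real-time monitoring alongside, flipping such a property and watching the metrics react is what makes the effect of a configuration change directly visible.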
26. Automatic remediation
Hardest part: going from tribal knowledge to script
What action to take, to what extent, on what alert?
Depending on action taken, page on-call to address issue
Test your automation!
Exercise and validate automated actions in CI
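Turning that tribal knowledge into script might start with something like this registry (illustrative names; the real wiring would be a webhook handler or lambda). The dry-run flag is what lets CI exercise and validate every registered action without touching production, and the audit log records what ran:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: map alert names to remediation actions, keep an
// audit log of everything handled, and fall back to paging the on-call
// for alerts with no registered action.
class RemediationRegistry {
    private final Map<String, Runnable> actions = new HashMap<>();
    private final List<String> auditLog = new ArrayList<>();
    private final boolean dryRun;

    RemediationRegistry(boolean dryRun) { this.dryRun = dryRun; }

    void register(String alertName, Runnable action) {
        actions.put(alertName, action);
    }

    /** Called by the webhook handler; returns true if an action existed. */
    boolean onAlert(String alertName) {
        Runnable action = actions.get(alertName);
        if (action == null) {
            auditLog.add("UNHANDLED " + alertName + ": paging on-call");
            return false;
        }
        auditLog.add((dryRun ? "DRY-RUN " : "EXECUTED ") + alertName);
        if (!dryRun) {
            action.run();  // e.g. shed load, scale out, impeach leader
        }
        return true;
    }

    List<String> auditLog() { return auditLog; }
}
```

The same onAlert hook can be invoked on alert resolution to undo or confirm the action taken.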
27. Wiring
Define and verify your alerts (anomaly detector preflighting)
Webhooks and lambdas to the rescue
You can also act on alert resolution
Record audit log
Profit! Sleep tight.