Lightning Fast Monitoring and Automated Remediation
1. Lightning Fast Monitoring against
Lightning Fast Outages
Maxime Petazzoni
@mpetazzoni
#MonitorSF – May 2017
2. At SignalFx
SignalFx uses SignalFx for all monitoring & alerting
DevOps culture, SWEs on-call
Distributed responsibility for production
Optimizing for least amount of operational work
3. Problem statement
Today's systems are complex, distributed infrastructures
Large amounts of real-time traffic
Systems fail in new, unpredictable and exotic ways
Among those failures lies the "lightning fast outage"
4. Lightning fast outages
Sudden meltdown of one or more components/services
Develops from anomaly to outage very quickly
New, unexpected or unusual load
Specifics depend on the system and architecture
5. Causes
System behavior tied to user activity or user input
Query with unexpectedly large results
Sudden increase in traffic
Bug in seldom executed/tested code path
6. Hard to catch
Most monitoring systems are not sophisticated enough
Not granular or not fast enough
Straight from "all good" to "nothing works"
No details on evolution of the anomaly
7. Even with metrics
So, you're with us in the 21st century and using metrics?
(a lot of people still don't, and still put up with Nagios checks!?)
1m, or even 10s resolution may still not be enough
Graphite/Influx users: do you trust your right edge?
How accurate is your recent data through aggregations?
8. The ultimate slow/weak link
The on-call human!
Even if the anomaly is detected before it becomes an
outage, humans can't act fast enough
Think about your MTTA and MTTR
9. Solution?
A lightning fast monitoring system
A dynamic stack
Your very own IFTTT
But before we go into details...
10. What does it look like?
Lightning fast monitoring and
automated remediation saving the day
14. Instrument all the things
Metrics have good value/$
Develop a culture of code instrumentation
Instrument at all levels: host, container, application
Don't stop at infrastructure and framework-level metrics
Custom application metrics are often the most valuable
15. Application instrumentation
Count requests, actions, errors, data
Measure queues, thread pools, buffers, in-flight tasks
Time methods, wait times, response times, latency
metrics.counter("errors", "class", e.getClass().getName()).inc();
metrics.counter("requests", "endpoint", "search").inc();
metrics.registerGauge("queue_size", queue::size);

try (Timer.Context c = metrics.timer("request.timer").time()) {
    // process data
}
16. Powerful alerting
Fast, real-time alerting based on metrics analytics
Accurate "right now", even on most recent data
Able to handle reporting lag dynamically
Build smart, dynamic alerts; static thresholds are often a smell
Detecting causality between redundant alerts still a really
hard problem...
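One way to read "static thresholds are often a smell": compare each incoming value against a rolling baseline instead of a fixed number. A minimal sketch of that idea (class, method, and parameter names are mine for illustration, not SignalFx's detector API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: fire an alert when a value strays more than
// `k` standard deviations from a rolling baseline of recent values,
// rather than crossing a fixed, static threshold.
class DynamicThresholdDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double k;

    DynamicThresholdDetector(int windowSize, double k) {
        this.windowSize = windowSize;
        this.k = k;
    }

    /** Feeds one observation; returns true if it should alert. */
    boolean observe(double value) {
        boolean alert = false;
        if (window.size() >= windowSize) {
            double mean = window.stream()
                    .mapToDouble(d -> d).average().orElse(0);
            double var = window.stream()
                    .mapToDouble(d -> (d - mean) * (d - mean))
                    .average().orElse(0);
            alert = Math.abs(value - mean) > k * Math.sqrt(var);
        }
        window.addLast(value);
        if (window.size() > windowSize) {
            window.removeFirst();  // keep only the recent baseline
        }
        return alert;
    }
}
```

A real detector would also have to cope with reporting lag and seasonality; this only shows why the threshold itself should move with the data.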
18. Elastic infrastructure
Automated provisioning (much faster with containers!)
Dynamic service discovery (ZK/Curator, etcd, ...)
Horizontal scalability
Consider your 3rd party components too
All you should need is a zkConnectString
19. Elastic infrastructure (2)
Deeply integrate discovery into your application framework
At SignalFx, Guice, Thrift and ZK/Curator come together:
public class Foo {
    private final AnalyticsService.Iface analytics;

    @Inject
    public Foo(AnalyticsService.Iface analytics) {
        /*
         * Injects an implementation of the service interface that
         * makes Thrift RPC calls to the target discovered service.
         */
        this.analytics = analytics;
    }

    public void doSomething(String program) {
        /*
         * Call is always backed by currently advertised instances,
         * partitioned and load-balanced as defined by the service.
         */
        analytics.execute(program);
    }
}
20. Internal control endpoints
Build operations endpoints into your applications
Internal HTTP API, JMX operations, control console, ...
Load shedding
Take out of load balancer / stop consuming
Impeachment (to elect new leader)
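Behind any of these control endpoints (HTTP, JMX, or console) sits a piece of shared state that the endpoint flips and the hot path consults. A minimal sketch of the load-shedding case, with hypothetical names (the HTTP/JMX wiring itself is omitted):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the state behind a "load shedding" control
// endpoint. An internal HTTP handler or JMX operation would call
// shedLoad()/resume(); the request path checks shouldAccept() before
// taking on new work.
class LoadShedder {
    private final AtomicBoolean shedding = new AtomicBoolean(false);

    // Flipped by the operations endpoint (HTTP, JMX, console, ...).
    void shedLoad() { shedding.set(true); }
    void resume() { shedding.set(false); }

    // Consulted on the hot path; when shedding, reject or queue-drop
    // instead of accepting new requests.
    boolean shouldAccept() { return !shedding.get(); }
}
```

The same pattern covers "take out of load balancer / stop consuming": the toggle just gates the consumer loop instead of the request path.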
21. Dynamic application configuration
Ability to perform real-time configuration changes
Control thread pools, maximum queue and buffer sizes
Control task execution intervals
Enable/disable features and safety valves
Respect semantics for graceful degradation
22. Dynamic application configuration (2)
At SignalFx, application configuration is in ZooKeeper
Config properties at environment and service levels
Changes seen and reflected in real-time
Callbacks
With real-time monitoring, effect is directly visible
23. Dynamic application configuration (3)
public interface FooConfig extends ConfigInterface {
    // When declared as Property<>, allows for attaching callbacks
    @Config(name = "pool.threads", defaultValue = "16")
    Property<Integer> getThreadPoolSize();

    // But can also just be a primitive
    @Config(name = "feature.enabled", defaultValue = "true")
    boolean isFeatureEnabled();

    // Or even a JSON-decoded value
    @Config(name = "blacklist", defaultValue = "[]", json = "true")
    Set<String> getBlacklist();
}
24. Dynamic application configuration (4)
public class Foo {
    private final FooConfig config;
    private final ThreadPoolExecutor executor;

    private final Runnable poolSizeCallback = () -> {
        int threads = config.getThreadPoolSize().get();
        executor.setCorePoolSize(threads);
        executor.setMaximumPoolSize(threads);
    };

    @Inject
    public Foo(FooConfig config) {
        this.config = config;
        executor = new ThreadPoolExecutor(...);
        config.getThreadPoolSize().addCallback(poolSizeCallback);
    }

    public Future<Result> doSomething(String arg) {
        return config.isFeatureEnabled()
                ? executor.submit(() -> computeResult())
                : Futures.immediateFailedFuture(...);
    }

    public void shutdown() {
        executor.shutdown();
        config.getThreadPoolSize().removeCallback(poolSizeCallback);
    }
}
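The Property<> machinery used above can be approximated by a small observable value holder. This is a sketch of the pattern only; in the real system the change notification is driven by a ZooKeeper watch, not by application code calling set():

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the Property<T> pattern: a configuration value holder that
// runs registered callbacks whenever the backing store reports a change.
class Property<T> {
    private final AtomicReference<T> value;
    private final List<Runnable> callbacks = new CopyOnWriteArrayList<>();

    Property(T initial) { this.value = new AtomicReference<>(initial); }

    T get() { return value.get(); }

    void addCallback(Runnable cb) { callbacks.add(cb); }
    void removeCallback(Runnable cb) { callbacks.remove(cb); }

    // Invoked when the backing store (e.g. a ZooKeeper node) changes;
    // updates the value, then notifies all registered callbacks.
    void set(T newValue) {
        value.set(newValue);
        callbacks.forEach(Runnable::run);
    }
}
```

With real-time monitoring alongside, flipping such a property and watching the metrics react is what makes the effect of a configuration change directly visible.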
26. Automatic remediation
Hardest part: going from tribal knowledge to script
What action to take, to what extent, on what alert?
Depending on action taken, page on-call to address issue
Test your automation!
Exercise and validate automated actions in CI
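Turning that tribal knowledge into script might start with something like this registry (illustrative names; the real wiring would be a webhook handler or lambda). The dry-run flag is what lets CI exercise and validate every registered action without touching production, and the audit log records what ran:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: map alert names to remediation actions, keep an
// audit log of everything handled, and fall back to paging the on-call
// for alerts with no registered action.
class RemediationRegistry {
    private final Map<String, Runnable> actions = new HashMap<>();
    private final List<String> auditLog = new ArrayList<>();
    private final boolean dryRun;

    RemediationRegistry(boolean dryRun) { this.dryRun = dryRun; }

    void register(String alertName, Runnable action) {
        actions.put(alertName, action);
    }

    /** Called by the webhook handler; returns true if an action existed. */
    boolean onAlert(String alertName) {
        Runnable action = actions.get(alertName);
        if (action == null) {
            auditLog.add("UNHANDLED " + alertName + ": paging on-call");
            return false;
        }
        auditLog.add((dryRun ? "DRY-RUN " : "EXECUTED ") + alertName);
        if (!dryRun) {
            action.run();  // e.g. shed load, scale out, impeach leader
        }
        return true;
    }

    List<String> auditLog() { return auditLog; }
}
```

The same onAlert hook can be invoked on alert resolution to undo or confirm the action taken.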
27. Wiring
Define and verify your alerts (anomaly detector preflighting)
Webhooks and lambdas to the rescue
You can also act on alert resolution
Record audit log
Profit! Sleep tight.