Number of Services in Production
Availability and performance of our services are critical to
running our business
The software we develop has to deliver on our SLAs
How (besides sane design):
Healthchecks + Nagios
Historical Data with Graphs
Gauges – instantaneous value
Counters – value that is incremented and decremented
Meters – rate over time (mean, plus 1-, 5-, & 15-minute moving avg.)
Histograms – distribution of data (mean, median, max, std.
dev., 75th, 90th, 95th, 98th, 99th, & 99.9th percentiles)
Timers – Meter of requests & Histogram of duration (frequency
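To make the metric types concrete, here is a minimal plain-Java sketch of a gauge, a counter, and a histogram. It is not the Metrics library's API: the class names (SketchCounter, SketchHistogram) and the sample values are made up for illustration, the histogram stores every value (a production library would typically sample instead), and meters and timers are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Counter: a value that can be incremented and decremented. */
final class SketchCounter {
    private long count;

    void inc() { count++; }
    void dec() { count--; }
    long getCount() { return count; }
}

/** Histogram: keeps every recorded value and summarises the distribution. */
final class SketchHistogram {
    private final List<Long> values = new ArrayList<>();

    void update(long value) { values.add(value); }

    double mean() {
        return values.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    long max() {
        return values.stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    /** Nearest-rank percentile for p in (0, 100]; returns 0 when empty. */
    long percentile(double p) {
        if (values.isEmpty()) {
            return 0L;
        }
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }
}

final class MetricTypesDemo {
    public static void main(String[] args) {
        // Gauge: an instantaneous value, read on demand from live state.
        List<String> queue = new ArrayList<>();
        Supplier<Integer> queueDepth = queue::size;
        queue.add("job-1");
        System.out.println("gauge queue depth = " + queueDepth.get());

        // Counter: goes up when a session opens, down when it closes.
        SketchCounter openSessions = new SketchCounter();
        openSessions.inc();
        openSessions.inc();
        openSessions.dec();
        System.out.println("counter open sessions = " + openSessions.getCount());

        // Histogram: distribution of (hypothetical) response sizes in bytes.
        SketchHistogram sizes = new SketchHistogram();
        for (long size : new long[] {100, 200, 300, 400}) {
            sizes.update(size);
        }
        System.out.println("histogram mean = " + sizes.mean()
                + ", max = " + sizes.max()
                + ", 75th = " + sizes.percentile(75));
    }
}
```

The gauge is just a function over live state, which is why it needs no class of its own here; the counter and histogram hold state between updates.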
Metrics - Healthchecks
Verify that your service is running correctly
Dropwizard: What is it?
Quality open source Java webservice components glued
together in a modular way
Eliminates the need for picking a platform stack, it's all there
It's opinionated. If you don't like a Dropwizard core
component, that's too bad, don't use Dropwizard
Developers focus on business logic, not framework
It's easy, maintainable, and it works!
A Few Words from Coda…
“I had no one I had to toss a WAR to. I had no one to
stand up a Tomcat server and fiddle with it until their
eyes bled. I had no one who didn't trust me to spin up
my own threads or connection pools. So I wrote
something which worked as simply and in as straight-
forward a manner as possible because my own ass
was on the line if it didn't work.”
Dropwizard: The Ingredients
Jersey for REST
Jackson for JSON
Jetty for a webserver
Metrics for measuring
YAML for configuring
Dropwizard for weaving everything together
Dropwizard – Healthchecks
Register hooks that check the health of your app
An HTTP endpoint that iterates over all the hooks
What “healthy” means is decided by you (e.g. database
connections, client connections, deadlock count)
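The hook-and-endpoint idea can be sketched in plain Java as a registry of named checks plus a method that runs them all. This is an illustration of the pattern, not Dropwizard's own HealthCheck classes: HealthResult, HealthRegistry, and the pool and deadlock numbers are hypothetical, and the HTTP layer is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Outcome of one check: healthy or not, plus a message for the operator. */
final class HealthResult {
    final boolean healthy;
    final String message;

    private HealthResult(boolean healthy, String message) {
        this.healthy = healthy;
        this.message = message;
    }

    static HealthResult ok() { return new HealthResult(true, "OK"); }
    static HealthResult failed(String message) { return new HealthResult(false, message); }
}

/** Holds the registered hooks and runs all of them on request. */
final class HealthRegistry {
    private final Map<String, Supplier<HealthResult>> checks = new LinkedHashMap<>();

    void register(String name, Supplier<HealthResult> check) {
        checks.put(name, check);
    }

    /** Runs every hook; a hook that throws is reported as unhealthy. */
    Map<String, HealthResult> runAll() {
        Map<String, HealthResult> results = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<HealthResult>> entry : checks.entrySet()) {
            HealthResult result;
            try {
                result = entry.getValue().get();
            } catch (RuntimeException ex) {
                result = HealthResult.failed(String.valueOf(ex.getMessage()));
            }
            results.put(entry.getKey(), result);
        }
        return results;
    }

    /** What an HTTP endpoint could use to choose a success or error status. */
    boolean allHealthy() {
        return runAll().values().stream().allMatch(r -> r.healthy);
    }
}

final class HealthcheckDemo {
    public static void main(String[] args) {
        HealthRegistry registry = new HealthRegistry();
        int idleDbConnections = 3;   // hypothetical pool state
        int deadlockedThreads = 0;   // hypothetical deadlock count

        registry.register("database", () -> idleDbConnections > 0
                ? HealthResult.ok()
                : HealthResult.failed("no idle connections"));
        registry.register("deadlocks", () -> deadlockedThreads == 0
                ? HealthResult.ok()
                : HealthResult.failed(deadlockedThreads + " deadlocked threads"));

        registry.runAll().forEach((name, r) ->
                System.out.println(name + ": " + (r.healthy ? "OK" : "FAIL " + r.message)));
    }
}
```

Treating a thrown exception as an unhealthy result keeps one broken hook from hiding the status of the others.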
Dropwizard + Metrics
Dropwizard has lots of platform instrumentation baked in using
Metrics, and it comes for free (e.g. Jetty, JVM, log counts)
Ability to add Timers to your endpoints with @Timed
Ability to add arbitrary metrics as you see fit
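With @Timed the framework wraps the endpoint method for you. As a rough, hand-rolled analogue of that wrapping, the plain-Java sketch below times a block and counts the calls. SketchTimer, lookupUser, and the returned value are hypothetical names for illustration, not part of Dropwizard or Metrics.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Records how many times a block ran and how long it took in total. */
final class SketchTimer {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    /** Runs the work, recording its duration even when it throws. */
    <T> T time(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            count.incrementAndGet();
        }
    }

    long getCount() { return count.get(); }

    /** Mean duration in milliseconds; 0 when nothing has been timed yet. */
    double meanMillis() {
        long n = count.get();
        return n == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / n;
    }
}

final class TimedEndpointDemo {
    private static final SketchTimer LOOKUP_TIMER = new SketchTimer();

    /** Hypothetical endpoint body, wrapped by hand instead of by an annotation. */
    static String lookupUser(String id) {
        return LOOKUP_TIMER.time(() -> "user-" + id);
    }

    public static void main(String[] args) {
        lookupUser("42");
        lookupUser("43");
        System.out.println("calls = " + LOOKUP_TIMER.getCount()
                + ", mean ms = " + LOOKUP_TIMER.meanMillis());
    }
}
```

Recording in a finally block means failed requests are timed too, which matters when the slow calls are the ones that end in errors.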
Abandonware for Play 2.X, which was still beta
Everything and the kitchen sink
Also I hate XML
What do I get out of it? Dev
Storytelling: causation & correlation
Integral piece of the operational excellence puzzle
State of the world – Dashboards
Developers focus on features, operations is mostly a free lunch
Code review & demo
Disclaimer: You need Graphite to really harness the value
The grid is slow. Why?
Is it load?
Is it dependent service latency?
How does that compare to yesterday?
JVM throws out of memory, what's the problem?
What does the GC sawtooth look like?
When did it change?
Is it correlated with increased load?
How is that new 'performance' tweak?
If you never measured, then you didn't tune. True story!
What does my 5XX graph look like?
Operational Excellence: The ingredients
Application Instrumentation (Dropwizard)
Time Series Data & Graphing (Graphite, D3)
Centralized logging & log parsing (Rsyslog, Logstash, Nagios)
Automated alerting & escalation (PagerDuty)
DW & Graphite will get you very far, but if you want total control &
visibility you need the rest. This is the stack that RTR is moving
towards, rather than relying on basic Java logging SMTP appenders.
OMG, we are on GMA, are we OK?
Each service runs in a cluster behind an LB
'OK' is somewhat service specific
Basically you need a lot of info at your fingertips. Pictures are
worth a thousand words. Get yourself some dashboards!
Tasseo dashboard (D3)
• Red, Yellow, & Green Lights
• Endless cool things: Graphite + D3
If we see yellow or red, start diagnosing
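The red, yellow, and green lights boil down to comparing a metric against two thresholds. A minimal sketch, assuming a metric where higher is worse; the Light enum, the classify helper, and the threshold values are illustrative and are not taken from Tasseo's configuration.

```java
/** Traffic-light status for one dashboard tile. */
enum Light { GREEN, YELLOW, RED }

final class TrafficLight {
    /**
     * Classifies a value where higher is worse (latency, error rate).
     * Assumes warning <= critical.
     */
    static Light classify(double value, double warning, double critical) {
        if (value >= critical) {
            return Light.RED;
        }
        if (value >= warning) {
            return Light.YELLOW;
        }
        return Light.GREEN;
    }

    public static void main(String[] args) {
        // Hypothetical thresholds: warn at 250 ms, critical at 1000 ms.
        double[] p99LatencyMs = {120.0, 400.0, 1500.0};
        for (double latency : p99LatencyMs) {
            System.out.println(latency + " ms -> " + classify(latency, 250.0, 1000.0));
        }
    }
}
```

For metrics where lower is worse (free connections, for example) the comparisons would be flipped.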
Free Lunch? Not really
DB connection pool monitoring
HTTP client connection pool monitoring
JVM Heap & GC info
HTTP Server response counts
HTTP Server connection info
Endpoint duration & throughput stats
Where do I sign up?
You install Graphite, a one-time hit + some TLC. Medium
You annotate your endpoints and maybe add finer telemetry.
You configure your service to feed into Graphite,
ideally consistently across services via a 'Bundle'. Easy
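In practice a reporter does the feeding for you, but it helps to know what ends up on the wire. Graphite's plaintext protocol is one line per sample: metric path, value, and Unix timestamp in seconds, conventionally sent to port 2003. The sketch below formats and sends such a line; GraphiteLine and the metric name in main are hypothetical, and the host and port you pass to send depend on your own setup.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class GraphiteLine {
    /**
     * Formats one sample as "<metric path> <value> <unix seconds>\n",
     * the shape of Graphite's plaintext protocol.
     */
    static String format(String path, double value, long epochSeconds) {
        // Fields are separated by spaces, so the path must not contain whitespace.
        String safePath = path.trim().replaceAll("\\s+", "_");
        return safePath + " " + value + " " + epochSeconds + "\n";
    }

    /** Sends one already formatted line over TCP; host and port are up to the caller. */
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(line);
            out.flush();
        }
    }

    public static void main(String[] args) {
        // Hypothetical metric name; main only prints, it does not open a socket.
        long now = System.currentTimeMillis() / 1000L;
        System.out.print(format("services.checkout.requests.p99", 12.5, now));
    }
}
```

Keeping the metric path scheme consistent across services is what makes cross-service dashboards easy later, which is the point of pushing this configuration into a shared bundle.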
Show a simple Dropwizard codebase
Do some curls
Show the admin endpoints