3. We want to know
Runtime of certain parts of the system
Data throughput
Performance bottlenecks
4. Why we want to do that
Detect suddenly dropping throughput
Detect suddenly longer-running jobs/requests
Explore performance trends
See the performance impact of new implementations
5.
6. To achieve that
Collect Performance Metrics, Aggregate and Visualise them
Easy in Monolithic Applications
More difficult in Distributed Applications
7. Distributed Applications
Metrics have to be collected from many hosts
Distributed contexts have to be handled
Data has to be aggregated (in the right order) and visualised
—> Distributed Tracing Systems,
first described in Google's Dapper paper
Popular implementations are OpenZipkin and Jaeger (Uber)
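The core idea behind Dapper-style tracing can be sketched in a few lines: every unit of work becomes a span carrying the trace id of its root request and the span id of its parent, so spans collected from many hosts can later be stitched back into one tree. The class below is a hypothetical illustration, not the Zipkin or Jaeger API.

```java
import java.util.UUID;

// Minimal sketch of a Dapper-style trace context (illustrative names,
// not the OpenZipkin/Jaeger API).
public class TraceContext {
    public final String traceId;  // shared by all spans of one request
    public final String spanId;   // unique per unit of work
    public final String parentId; // span id of the caller, null for the root

    private TraceContext(String traceId, String spanId, String parentId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
    }

    // Root span: starts a new trace.
    public static TraceContext newTrace() {
        return new TraceContext(newId(), newId(), null);
    }

    // Child span: same trace id, fresh span id, parent link to the caller.
    public TraceContext childSpan() {
        return new TraceContext(traceId, newId(), spanId);
    }

    private static String newId() {
        return UUID.randomUUID().toString();
    }
}
```

In a real system the context is serialised into message headers when crossing JVM boundaries; that serialisation is what the tracing libraries handle for you.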
8. Let's collect some metrics
Business-, Application- and System-Metrics
Application- and System-Metrics via JMX
Business-Metrics via Code Instrumentation (DropWizard, kamon.io)
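The two collection paths can be sketched side by side: system metrics come almost for free via JMX MXBeans, while business metrics need explicit instrumentation in the code. The class below is a self-contained stdlib sketch; the hand-rolled counter stands in for what Dropwizard's Meter or kamon.io provide out of the box.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the two collection paths (class names are illustrative).
public class MetricsSketch {

    // System metric via JMX: current heap usage in bytes.
    public static long heapUsedBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    // Business metric via instrumentation: a tiny event counter,
    // the kind of thing a metrics library gives you as a Meter.
    public static class Counter {
        private final AtomicLong count = new AtomicLong();

        public void mark() { count.incrementAndGet(); }

        public long getCount() { return count.get(); }
    }
}
```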
9. and persist the metrics
A good idea is to use a
time-series database (InfluxDB, Graphite)
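InfluxDB, for example, accepts writes in its plain-text line protocol: `measurement,tag=value field=value timestamp`. A sketch of building such a line (the measurement and tag names here are made up):

```java
// Builds a single InfluxDB line-protocol entry. Illustrative only:
// a real writer would escape special characters and batch lines.
public class LineProtocol {
    public static String line(String measurement,
                              String tagKey, String tagValue,
                              String fieldKey, double fieldValue,
                              long timestampNanos) {
        return measurement + "," + tagKey + "=" + tagValue + " "
             + fieldKey + "=" + fieldValue + " " + timestampNanos;
    }
}
```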
10. Visualisation is key
Make insights accessible by
visualising data and configuring alerts
(e.g. Grafana, Graphite, Chronograf)
11.
12. Our System
Java Application
Consists of several independent batch jobs
Most batches process data
Some batches use external asynchronous services to enrich data
(response times range from seconds to weeks)
Runs in a distributed environment
13. Our Requirements
Runtime of single methods
Batch runtime
Business-process duration (spanning multiple JVMs)
Attach runtime parameters to the metrics
Measure data throughput
And
Low Code Impact
Metric collection should be decoupled and not harm the system
Visualisation should be awesome
14. Implementation
Own metrics library with two kinds of metrics:
Simple Metric, which measures the runtime of single methods
Distributed Metric, which spans multiple JVMs
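The Simple Metric idea can be sketched with try-with-resources: the timing wrapper reports the elapsed runtime on close, keeping the measurement code out of the business logic (the low-code-impact requirement above). This is a hypothetical illustration, not the actual library.

```java
import java.util.function.LongConsumer;

// Illustrative sketch of a "Simple Metric": times a block of code and
// hands the elapsed nanoseconds to a reporter callback on close.
public class SimpleMetric implements AutoCloseable {
    private final long start = System.nanoTime();
    private final LongConsumer reporter;

    public SimpleMetric(LongConsumer reporter) {
        this.reporter = reporter;
    }

    @Override
    public void close() {
        reporter.accept(System.nanoTime() - start);
    }
}

// Usage:
// try (SimpleMetric m = new SimpleMetric(nanos -> registry.record("job", nanos))) {
//     runBatch(); // the code being measured
// }
```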
18. Learnings
There is no free lunch
Start with your Dashboard
Find the right audience
Choose the right level of measurement
You will produce lots of data
Measure as much as you can; you don't know what you will need (Coda Hale)