Hardware fails, applications fail, our code... well, it fails too (at least mine does). To prevent software failures we test. Hardware failures are inevitable, so we write code that tolerates them, and then we test that too. From tests we gather metrics and act on them by improving the parts that perform inadequately. Measuring the right things at the right places in an application is as much about good engineering practice and maintaining SLAs as it is about end-user experience, and it may differentiate a successful product from a failure.
In order to act on performance metrics such as maximum latency and consistency of response times, we need to know their accurate values. The problem is that popular tools report results for such metrics that are not only inaccurate but also too optimistic.
During my presentation I will simulate services that require monitoring and show how the gathered metrics differ from the real numbers. We will do all this with what currently seems to be the most popular metrics pipeline, Graphite together with the com.codahale metrics library, and get completely false results. We will then learn to tune it and get much better accuracy. We will use JMeter to measure latency and observe how falsely reassuring the results are. We will check how Graphite averages data, only to watch helplessly as important latency spikes disappear. Finally, I will show how HdrHistogram helps in gathering reliable metrics. We will also run tests measuring the performance of different metric classes.
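
To give a feel for the averaging problem before the talk, here is a minimal sketch in plain Java. It is not Graphite's code. It only assumes that older data points are rolled up by averaging, which is how Graphite behaves unless a different aggregation method is configured. The class name, method names and latency numbers are all hypothetical and chosen for illustration.

```java
import java.util.Arrays;

public class AveragingHidesSpikes {

    // Roll up consecutive groups of `factor` points into their average.
    // This mimics an "average" roll-up; it is an illustration, not Graphite code.
    public static double[] downsampleByAverage(double[] points, int factor) {
        int buckets = (points.length + factor - 1) / factor;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            int from = b * factor;
            int to = Math.min(from + factor, points.length);
            double sum = 0;
            for (int i = from; i < to; i++) {
                sum += points[i];
            }
            out[b] = sum / (to - from);
        }
        return out;
    }

    // Largest value in the series, or NaN for an empty series.
    public static double max(double[] points) {
        return Arrays.stream(points).max().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical data: 60 per-second latency readings in milliseconds,
        // all at 10 ms except a single 2000 ms spike.
        double[] perSecond = new double[60];
        Arrays.fill(perSecond, 10.0);
        perSecond[30] = 2000.0;

        // Roll the minute up into one averaged point.
        double[] perMinute = downsampleByAverage(perSecond, 60);

        System.out.println("max before roll-up: " + max(perSecond) + " ms");
        System.out.println("max after roll-up:  " + max(perMinute) + " ms");
    }
}
```

With these made-up numbers the 2000 ms spike survives in the per-second series, but the rolled-up minute shows only about 43 ms. The spike is still in the average, just diluted beyond recognition.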