I’m going to talk to you about Time Series data.
I’ll show you what it is and how we use it.
It's the most important and valuable alerting and diagnostic tool we have.
Focus: a monitoring estate ingesting ~100k data points a second.
A single point in your estate – means nothing
Sound good? What would be even more interesting would be if we could see how that value changed over time. Let’s bring in Bernard’s brothers…..
And finally, what if we could also bring in some of the other metrics we mentioned?
Really useful data.
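To make the "single point means nothing" idea concrete, here is a toy illustration (the metric name and numbers are made up):

```python
# Toy illustration (made-up numbers): a single sample tells you
# nothing, but the same metric over time shows a trend.
single_point = ("cpu.user", 1700000000, 95.0)   # is 95% bad? No idea.

series = [                                       # (timestamp, value)
    (1700000000, 12.0),
    (1700000010, 14.5),
    (1700000020, 13.9),
    (1700000030, 71.2),
    (1700000040, 95.0),
]
# With history, the same 95.0 is clearly a sudden spike.
spike = series[-1][1] - series[0][1]
print(spike)  # 83.0
```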
This data is the lifeblood of your system.
If you don’t think this data is valuable then none of the rest of what I have to say will be of any interest….
* Virtual or physical, including network devices, storage arrays and your good old-fashioned application, web and database servers.
FREE!!! OMG!!!
Our first implementation.
As a side note, this is a pretty effective way of getting the team that owns the hardware to provide you with decent servers in a data centre. You can jump the queue by showing them something like this.
We chose OpenTSDB
We made a more usable visualiser
Ticketmaster made Metrilyx.
TSDB is GREAT for retrospective Root Cause Analysis
We still have ALL of the data since we started.
500 billion data points.
ingesting data from the PRODUCTION estate at 70k points a second.
“if only I could have been notified when this happened”
And this
They wanted a dashboard of graphs that update in real time.
Either way, TSDB doesn’t really support these requirements in a scalable manner.
Let’s go back to the TSDB architecture to see why.
From TSDB website.
The metric data is sampled (by the COLLECTOR)
LOCAL or REMOTE via SNMP (it’s not always possible to deploy a COLLECTOR on every machine)
Sent to the TSD
deduping and compression
writes to HBase, which in most cases runs on HDFS.
alerting and crons.
HTTP or RPC calls to the TSD, which in turn goes to HBase. That's a BIG problem.
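As an aside, the collector-to-TSD leg above is just a plain-text protocol: collectors send "put" lines over TCP (port 4242 by default). A minimal sketch of building one of those lines:

```python
# Build an OpenTSDB telnet-style "put" line; in production this
# would be written to the TSD's TCP socket (port 4242 by default).
def format_put(metric, timestamp, value, tags):
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}"

line = format_put("sys.cpu.user", 1356998400, 42.5, {"host": "web01"})
print(line)  # put sys.cpu.user 1356998400 42.5 host=web01
```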
So that’s the architecture – and here’s a physical implementation.
We decided to write our own solution, called TSP, available open source.
drop-in replacement for the tcollectors (Forwarders).
More efficient
Write to multiple targets.
Still write to TSDB
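The multiple-target idea can be sketched like this (the class and names are illustrative, not the real TSP internals):

```python
# Illustrative sketch of a Forwarder that fans each sample out to
# multiple targets; the real TSP implementation will differ.
class Forwarder:
    def __init__(self, targets):
        # targets: callables that each accept one metric line
        self.targets = targets

    def write(self, line):
        # every target gets every sample, so TSDB keeps receiving
        # data while new consumers are added alongside it
        for target in self.targets:
            target(line)

tsdb_lines, feed_lines = [], []
fwd = Forwarder([tsdb_lines.append, feed_lines.append])
fwd.write("put sys.cpu.user 1356998400 42.5 host=web01")
```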
The second new component: the Aggregator.
Out of the Aggregator comes the Site Feed.
A stream of the metric data from ALL sources in the estate, and I can add any number of subscribers.
I can now use these valuable metrics in REAL TIME from multiple CONCURRENT consumers.
Currently 3 consumers
PLUS TSDB
Simple Health Check on the feed.
Long delays.
METRICS that stop.
Profiles the source of the metrics.
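A sketch of the "metrics that stop" check, assuming we track the last time each metric was seen (names and thresholds here are made up):

```python
# Illustrative staleness check: flag any metric whose latest sample
# is older than max_age seconds.
def stale_metrics(last_seen, now, max_age=300):
    return sorted(m for m, ts in last_seen.items() if now - ts > max_age)

last_seen = {"sys.cpu.user": 995, "app.requests": 400}
print(stale_metrics(last_seen, now=1000))  # ['app.requests']
```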
In-memory TSDB. Supports Nagios. Not perfect, but easy.
Riemann.
Riemann handles the CONFIGURED alert problem well.
But there are 10s of thousands of metrics captured because we like to capture ALL of the metrics. How do we find the valuable information in there?
Luckily for us, Etsy asked the same question and then provided what we hope is the answer with Kale.
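Kale (via its Skyline component) runs an ensemble of statistical detectors over every metric. As a rough illustration of the idea only, not Kale's actual algorithms, a single three-sigma detector looks like:

```python
# Simplified anomaly detector: flag a point more than `threshold`
# standard deviations from the mean of a recent window of samples.
import statistics

def is_anomalous(window, point, threshold=3.0):
    mean = statistics.mean(window)
    stdev = statistics.stdev(window)
    if stdev == 0:
        return point != mean
    return abs(point - mean) / stdev > threshold

history = [10.0, 11.0, 10.5, 9.8, 10.2, 10.1, 10.4, 9.9]
print(is_anomalous(history, 10.3))  # steady value inside the window
print(is_anomalous(history, 30.0))  # a spike well outside the window
```

Skyline's real value is running many such detectors and taking a consensus, which cuts down false positives from any single test.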
After that… we don't know.
Maybe the future is self-aware artificial intelligence defence network