Druid is an open-source data store designed for OLAP queries on event or time-series data. It ingests streaming or batch time-series data, aggregates the data during ingestion, and allows querying of aggregated metrics over time intervals via an HTTP API. It consists of several node types that work together, including brokers for querying, historical nodes for data storage and serving, real-time nodes for ingesting streaming data, and coordinators for managing data distribution. Druid uses other technologies like ZooKeeper, MySQL/Postgres, and deep storage systems to manage configurations, metadata, and actual data storage.
2. Is it about a woods creature?
Nope. It’s about…
a data store designed for OLAP queries on event
(time-series) data.
3. Boring facts
• Open-source, community-driven project
• ~3400 stars, ~7300 commits on GitHub at the moment
• Written in Java
• Very modular / extensible thanks to Guice
4. What does it do?
• Ingests streaming or batch time-series data and splits it into segments, each covering a configured time interval
• During ingestion, it performs aggregation using the configured aggregators to create metrics
• Lets you run several types of queries over the served segments via a nice HTTP API, e.g. simple aggregations over metrics (example below)
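For a rough illustration, a minimal timeseries query posted to the Broker's query endpoint might look like the Python sketch below. The data source, metric names and host/port are assumptions made up for this example; the endpoint path and query shape follow the Druid query documentation.

import requests

# Minimal timeseries query: hourly sums of a "count" metric.
# "events", "total_events" and the Broker address are illustrative;
# adjust them to your own ingestion spec and cluster.
query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2016-06-01/2016-06-02"],
    "aggregations": [
        {"type": "longSum", "name": "total_events", "fieldName": "count"}
    ],
}

resp = requests.post("http://localhost:8082/druid/v2/?pretty", json=query)
resp.raise_for_status()
print(resp.json())  # one entry per hour with the aggregated metric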
6. How does it work?
Druid consists of several types of services (hereinafter, nodes) which together make up a complete working system:
• Broker
• Historical
• Real-time
• Coordinator
• Overlord, Middle Manager, Peons, etc. (aka boring
nodes)
7. How does it work?
It also has three external dependencies (configuration sketch below):
• Apache ZooKeeper for configuration management, leader election and data flow organisation (e.g. Coordinator-to-Historical communication)
• Metadata Storage, such as MySQL or PostgreSQL, used to store (guess what?) various metadata about the system
• Deep Storage, where all the compressed segment data is stored (S3, HDFS, local FS)
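A rough sketch of how these three dependencies show up in Druid's common runtime properties, rendered here as a Python dict for readability. The property names follow the Druid configuration docs; all values are made up for illustration and depend on your deployment.

common_runtime_properties = {
    # Apache ZooKeeper: coordination, leader election, segment announcements
    "druid.zk.service.host": "localhost:2181",

    # Metadata Storage: rules, segment metadata, task state, configs
    "druid.metadata.storage.type": "postgresql",
    "druid.metadata.storage.connector.connectURI": "jdbc:postgresql://localhost:5432/druid",

    # Deep Storage: where the compressed segments actually live
    "druid.storage.type": "local",  # or "hdfs", "s3"
    "druid.storage.storageDirectory": "/tmp/druid/deepStorage",
}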
8. Nodes: Real-time
The Real-time node is responsible for ingesting streaming data.
It also exposes an HTTP API for querying, so data is available right after processing / aggregation (see the sketch below).
That is how Druid allows querying of real-time* data.
*: by real-time here we mean something that happened ~0-15 seconds ago.
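Since recent intervals are served by the Real-time node (and the Broker routes such queries there), a query covering the last few minutes returns freshly ingested rows. A hedged sketch, reusing the illustrative "events" datasource from above and a Broker on its default port:

import requests
from datetime import datetime, timedelta, timezone

# Query the last 10 minutes at minute granularity to see just-ingested data.
fmt = "%Y-%m-%dT%H:%M:%SZ"
now = datetime.now(timezone.utc)
interval = (now - timedelta(minutes=10)).strftime(fmt) + "/" + now.strftime(fmt)

query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "minute",
    "intervals": [interval],
    "aggregations": [{"type": "longSum", "name": "events", "fieldName": "count"}],
}

print(requests.post("http://localhost:8082/druid/v2/", json=query).json())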
9. Nodes: Historical
This guy is responsible for loading data (hereinafter, segments) from Deep Storage and serving it via an HTTP API (same as the Real-time dude).
The Historical Node uses ZooKeeper to learn which segments it should load.
It uncompresses segment data and caches it locally on the FS (see the sketch below).
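Where that local cache lives and how much disk it may use are configurable. A small sketch of the relevant Historical properties, again shown as a Python dict for readability; the property names follow the Druid docs, the paths and sizes are made up.

historical_properties = {
    # Local directories where uncompressed segments are cached, with a size cap (bytes)
    "druid.segmentCache.locations": '[{"path": "/tmp/druid/indexCache", "maxSize": 10000000000}]',
    # Total segment size this node advertises to the Coordinator
    "druid.server.maxSize": "10000000000",
}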
10. Nodes: Coordinator
The node which manages segments and coordinates their distribution across Historical Nodes. It uses ZooKeeper for:
• getting the current cluster state
• assigning segments to Historical Nodes
Loading and dropping of segments is managed via Rules stored in the Metadata Storage (example below).
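Rules are plain JSON and can be managed through the Coordinator's HTTP API. A minimal sketch, assuming a Coordinator on its default port 8081 and an illustrative datasource named "events"; rule field names follow the current Druid docs and may differ slightly in older versions.

import requests

# Keep the last 30 days loaded with two replicas, drop everything older.
rules = [
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]

resp = requests.post("http://localhost:8081/druid/coordinator/v1/rules/events", json=rules)
resp.raise_for_status()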
11. Nodes: Broker
This is the only Node which is usually touched by client applications, because it exposes the HTTP API for querying segment data.
It splits an incoming query into smaller ones based on segment metadata stored in ZooKeeper and queries the corresponding Historical and Real-time Nodes.
12. Nodes: Overlord
The indexing service powers Druid's batch data ingestion. It consists of three node types: Overlord, Middle Manager and Peon.
The Overlord accepts indexing tasks and distributes them to Middle Manager Nodes. It communicates with the latter through ZooKeeper.
It provides a simple HTTP API to create, shut down and view the status of indexing tasks (example below).
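A hedged sketch of that API, assuming an Overlord on its default port 8090 and an already prepared task spec file named index_task.json (both illustrative):

import json
import requests

OVERLORD = "http://localhost:8090"

# Submit a batch index task; the spec file describes the datasource,
# parse spec, intervals and aggregators.
with open("index_task.json") as f:
    task_spec = json.load(f)

resp = requests.post(OVERLORD + "/druid/indexer/v1/task", json=task_spec)
task_id = resp.json()["task"]

# Check the task status (poll until it reaches SUCCESS or FAILED).
status = requests.get(OVERLORD + "/druid/indexer/v1/task/" + task_id + "/status").json()
print(status["status"]["status"])  # e.g. RUNNING, SUCCESS, FAILED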
13. Nodes: Middle Manager
The Middle Manager executes submitted indexing tasks. It spawns separate JVMs, i.e. Peon Node processes, for this. Each Peon runs one task at a time.
The MM retrieves indexing tasks from ZooKeeper, stores each task as a JSON file and runs a Peon, providing it with the path to the task file.
After processing, the Peon stores the segment data in Deep Storage.
14. Live example
• Run all nodes locally
• Local FS is used for Deep Storage
• Derby is used for Metadata Storage
• ZooKeeper … well, it’s ZooKeeper, gotta run it
• Kafka 0.8.7.6.5.4.3.2.1 for streaming data <3
• Various Bash and Node scripts for loading data (sketch below)
• Imply Pivot for visualisation of queried results
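The loading scripts essentially push JSON events into the Kafka topic that the Real-time node consumes. A rough Python equivalent, assuming a local Kafka broker and an illustrative topic named "events"; the event fields are made up.

import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit one illustrative event per second.
while True:
    event = {"timestamp": int(time.time() * 1000), "page": "/home", "count": 1}
    producer.send("events", event)
    time.sleep(1)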
15. Problems?
Everybody can tell you the good sides of Druid. Time for the issues!
• DevOps effort is needed to spin up all the Nodes and dependencies (did we clone Dmytro already?)
• Limited number of aggregations
• Druid loves cookies… no, space, it needs space, more space, and it's still not enough anyway.