Agenda
❏ What is Druid?
❏ History & Motivations
❏ Druid concepts
❏ Sharding data
❏ Node types
❏ Architecture overview
❏ Loading Data
❏ Querying Data
❏ External dependencies
❏ Visualization tool
❏ Who is using Druid?
What is Druid?
The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect the fact that the architecture of the system can shift to solve different types of data problems.
What is Druid?
"Druid is a system built to allow fast ("real-time") access to large sets of
seldom-changing data".
"Designed for 100% uptime"
History & Motivations
Paid online data analytics tools
❏ IBM’s Netezza
❏ HP’s Vertica
❏ EMC’s Greenplum
MapReduce
Druid's concepts - Data
❏ Timestamp column (x axis)
❏ Dimension columns (filtering data)
❏ Metric columns (numeric values used in computations such as mean, sum, count...)
Druid's concepts - Roll-up
Individual events are not very interesting on their own. At ingestion time, Druid can roll up raw events into a compacted, aggregated version of the original data.
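The roll-up step can be sketched in Python. This is a simplified illustration, not Druid's actual implementation; the column names and values are hypothetical:

```python
from collections import defaultdict

# Raw events: (timestamp, page, city, characters_added).
# All names and values here are hypothetical, for illustration only.
events = [
    ("2024-01-01T01:02:33Z", "Justin_Bieber", "SF", 10),
    ("2024-01-01T01:03:51Z", "Justin_Bieber", "SF", 25),
    ("2024-01-01T01:04:06Z", "Ke$ha", "Calgary", 30),
]

def rollup(events):
    """Collapse events that share an hour bucket and identical dimensions."""
    buckets = defaultdict(lambda: {"count": 0, "added": 0})
    for ts, page, city, added in events:
        hour = ts[:13] + ":00:00Z"  # truncate timestamp to hour granularity
        buckets[(hour, page, city)]["count"] += 1   # raw events merged into this row
        buckets[(hour, page, city)]["added"] += added  # pre-aggregated metric
    return dict(buckets)

compact = rollup(events)  # 3 raw events roll up into 2 summary rows
```

Note the trade-off roll-up makes: the summary rows are much smaller, but the individual events can no longer be recovered.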
Sharding data
Sharding data by time
Immutable blocks of data called "segments"
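Time-based sharding can be sketched as mapping every event timestamp onto a fixed time chunk; events in the same chunk end up in the same segment. A minimal sketch, assuming hour granularity (real Druid segments are also identified by datasource and version, not modeled here):

```python
from datetime import datetime, timedelta

def segment_interval(ts: str, granularity_hours: int = 1):
    """Return the [start, end) time chunk an event timestamp falls into."""
    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    start = t.replace(minute=0, second=0, microsecond=0)
    start = start.replace(hour=(start.hour // granularity_hours) * granularity_hours)
    return start, start + timedelta(hours=granularity_hours)

# Two events in the same hour map to the same immutable segment interval.
a = segment_interval("2024-01-01T01:02:33Z")
b = segment_interval("2024-01-01T01:59:59Z")
```

Because a segment covers a closed time interval, it never has to change once built, which is what makes the blocks immutable.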
Node types - Broker
[Diagram: broker nodes use ZooKeeper to locate real-time and historical nodes and forward queries to them]
Fault tolerance
Brokers can be run in parallel or in a hot fail-over configuration.
Availability
If broker nodes are unable to communicate with ZooKeeper, they use their last known view of the cluster and continue to forward queries to real-time and historical nodes.
Node types - Realtime
Fault tolerance
Loss of access to the local disk can result in data loss if this is the only method of adding data to the system.
Availability
A Kafka queue between the producer and the real-time node lets events be replayed if the node fails.
❏ Process events in memory
❏ Periodically persist them to disk
❏ Merge indexes together and build immutable blocks of data (segments)
[Diagram: stream sources (Kafka, Spark, Flink); deep storage (S3, HDFS); metadata storage (Derby by default, MySQL, PostgreSQL)]
Node types - Historical (workers)
[Diagram: historical nodes receive segment load and drop instructions via ZooKeeper]
Fault tolerance
If a historical node fails, another historical node takes its place with no data loss.
Availability
Historical nodes depend on ZooKeeper for segment load and unload instructions. Should ZooKeeper become unavailable, historical nodes are no longer able to serve new data or drop outdated data.
[Diagram: segments pulled from deep storage (S3, HDFS); metadata storage (MySQL, PostgreSQL)]
Node types - Coordinator
[Diagram: the coordinator uses ZooKeeper to assign segments to historical nodes]
Fault tolerance
Coordinators can be run in a hot fail-over configuration.
Availability
If no coordinators are running, then changes to the data topology will stop happening (no new data and no data balancing decisions).
Metadata storage: Derby (default), MySQL, PostgreSQL
External dependencies - ZooKeeper
Fault tolerance
If this is not available, data topology changes cannot be made, but the brokers will maintain their most recent view of the data topology and continue serving requests accordingly.
External dependencies - Metadata storage
Fault tolerance
If this is not available, the coordinator will be unable to find out about new segments in the system, but it will continue with its current view of the segments that should exist in the cluster.
External dependencies - Deep storage
Fault tolerance
If this is not available, new data will not be able to enter the cluster, but the cluster will continue operating as is.
Architecture - overview
[Diagram: full cluster overview, with deep storage (S3, HDFS) and metadata storage (Derby by default, MySQL, PostgreSQL)]
Loading Data
1. Real-time ingestion (exactly once is NOT guaranteed)
2. Batch ingestion (exactly once IS guaranteed)
Loading Data - example of batch data
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-index.json localhost:8090/druid/indexer/v1/task
Loading Data - status of ingestion task
http://localhost:8090/console.html
Loading Data - datasource view
http://localhost:8081/#/datasources/wikiticker
Querying the Data
curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/wikiticker-top-pages.json http://localhost:8082/druid/v2/?pretty
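The same query can be issued from Python. The body below is a topN query in the spirit of quickstart/wikiticker-top-pages.json; the datasource, interval, dimension, and aggregator names are assumptions for illustration and may differ from the actual quickstart file:

```python
import json
import urllib.request

# A Druid topN query: top 25 pages by number of edits.
# Field values below are illustrative assumptions, not the quickstart file's contents.
query = {
    "queryType": "topN",
    "dataSource": "wikiticker",
    "intervals": ["2015-09-12/2015-09-13"],
    "granularity": "all",
    "dimension": "page",
    "metric": "edits",
    "threshold": 25,
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}],
}

def post_query(broker="http://localhost:8082"):
    """POST the query to the broker (requires a running Druid cluster)."""
    req = urllib.request.Request(
        broker + "/druid/v2/?pretty",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Queries go to the broker (port 8082 in the quickstart), which fans them out to real-time and historical nodes and merges the partial results.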
Visualization tool (Pivot)
Who is using Druid?
QUESTIONS?

druid.io
