Agenda
❏ What is Druid?
❏ History & Motivations
❏ Druid concepts
❏ Sharding data
❏ Node types
❏ Architecture overview
❏ Loading Data
❏ Querying Data
❏ External dependencies
❏ Visualization tool
❏ Who is using Druid?
What is Druid?
The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect the fact that the architecture of the system can shift to solve different types of data problems.
What is Druid?
"Druid is a system built to allow fast ("real-time") access to large sets of
seldom-changing data".
"Designed for 100% uptime"
History & Motivations
Paid online data analytics tools
❏ IBM’s Netezza
❏ HP’s Vertica
❏ EMC’s Greenplum
MapReduce
Druid's concepts - Data
❏ Timestamp column (x axis)
❏ Dimension columns (filtering data)
❏ Metric columns (numeric values used in computations such as mean, sum, count...)
Druid's concepts - Roll-up
Individual events are not very interesting on their own. At ingestion time, Druid can roll up raw events into a compacted, aggregated version of the original data.
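The roll-up step can be sketched in Python. This is a simplified illustration, not Druid's actual implementation; the column names and values are hypothetical:

```python
from collections import defaultdict

# Raw events: (timestamp, page, city, characters_added).
# All names and values here are hypothetical, for illustration only.
events = [
    ("2024-01-01T01:02:33Z", "Justin_Bieber", "SF", 10),
    ("2024-01-01T01:03:51Z", "Justin_Bieber", "SF", 25),
    ("2024-01-01T01:04:06Z", "Ke$ha", "Calgary", 30),
]

def rollup(events):
    """Collapse events that share an hour bucket and identical dimensions."""
    buckets = defaultdict(lambda: {"count": 0, "added": 0})
    for ts, page, city, added in events:
        hour = ts[:13] + ":00:00Z"  # truncate timestamp to hour granularity
        buckets[(hour, page, city)]["count"] += 1   # raw events merged into this row
        buckets[(hour, page, city)]["added"] += added  # pre-aggregated metric
    return dict(buckets)

compact = rollup(events)  # 3 raw events roll up into 2 summary rows
```

Note the trade-off roll-up makes: the summary rows are much smaller, but the individual events can no longer be recovered.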
Sharding data
Sharding data by time
Immutable blocks of data called "segments"
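Time-based sharding can be sketched as mapping every event timestamp onto a fixed time chunk; events in the same chunk end up in the same segment. A minimal sketch, assuming hour granularity (real Druid segments are also identified by datasource and version, not modeled here):

```python
from datetime import datetime, timedelta

def segment_interval(ts: str, granularity_hours: int = 1):
    """Return the [start, end) time chunk an event timestamp falls into."""
    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    start = t.replace(minute=0, second=0, microsecond=0)
    start = start.replace(hour=(start.hour // granularity_hours) * granularity_hours)
    return start, start + timedelta(hours=granularity_hours)

# Two events in the same hour map to the same immutable segment interval.
a = segment_interval("2024-01-01T01:02:33Z")
b = segment_interval("2024-01-01T01:59:59Z")
```

Because a segment covers a closed time interval, it never has to change once built, which is what makes the blocks immutable.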
Node types - Broker
[Diagram: broker nodes use ZooKeeper to locate real-time and historical nodes and forward queries to them]
Fault tolerance
Brokers can be run in parallel or in a hot fail-over configuration.
Availability
If broker nodes are unable to communicate with ZooKeeper, they use their last known view of the cluster and continue to forward queries to real-time and historical nodes.
Node types - Realtime
Fault tolerance
Loss of access to the local disk can result in data loss if this is the only method of adding data to the system.
Availability
A Kafka queue between the producer and the real-time node lets events be replayed if the node fails.
❏ Process events in memory
❏ Periodically persist them to disk
❏ Merge indexes together and build immutable blocks of data (segments)
[Diagram: stream sources (Kafka, Spark, Flink); deep storage (S3, HDFS); metadata storage (Derby by default, MySQL, PostgreSQL)]
Node types - Historical (workers)
[Diagram: historical nodes receive segment load and drop instructions via ZooKeeper]
Fault tolerance
If a historical node fails, another historical node takes its place with no data loss.
Availability
Historical nodes depend on ZooKeeper for segment load and unload instructions. Should ZooKeeper become unavailable, historical nodes are no longer able to serve new data or drop outdated data.
[Diagram: segments pulled from deep storage (S3, HDFS); metadata storage (MySQL, PostgreSQL)]
Node types - Coordinator
[Diagram: the coordinator uses ZooKeeper to assign segments to historical nodes]
Fault tolerance
Coordinators can be run in a hot fail-over configuration.
Availability
If no coordinators are running, then changes to the data topology will stop happening (no new data and no data balancing decisions).
Metadata storage: Derby (default), MySQL, PostgreSQL
External dependencies - ZooKeeper
Fault tolerance
If this is not available, data topology changes cannot be made, but the brokers will maintain their most recent view of the data topology and continue serving requests accordingly.
External dependencies - Metadata storage
Fault tolerance
If this is not available, the coordinator will be unable to find out about new segments in the system, but it will continue with its current view of the segments that should exist in the cluster.
External dependencies - Deep storage
Fault tolerance
If this is not available, new data will not be able to enter the cluster, but the cluster will continue operating as is.
Architecture - overview
[Diagram: full cluster overview, with deep storage (S3, HDFS) and metadata storage (Derby by default, MySQL, PostgreSQL)]
Loading Data
1. Real-time ingestion (exactly once is NOT guaranteed)
2. Batch ingestion (exactly once IS guaranteed)
Loading Data - example of batch data
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-index.json localhost:8090/druid/indexer/v1/task
Loading Data - status of ingestion task
http://localhost:8090/console.html
Loading Data - datasource view
http://localhost:8081/#/datasources/wikiticker
Querying the Data
curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/wikiticker-top-pages.json http://localhost:8082/druid/v2/?pretty
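The same query can be issued from Python. The body below is a topN query in the spirit of quickstart/wikiticker-top-pages.json; the datasource, interval, dimension, and aggregator names are assumptions for illustration and may differ from the actual quickstart file:

```python
import json
import urllib.request

# A Druid topN query: top 25 pages by number of edits.
# Field values below are illustrative assumptions, not the quickstart file's contents.
query = {
    "queryType": "topN",
    "dataSource": "wikiticker",
    "intervals": ["2015-09-12/2015-09-13"],
    "granularity": "all",
    "dimension": "page",
    "metric": "edits",
    "threshold": 25,
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}],
}

def post_query(broker="http://localhost:8082"):
    """POST the query to the broker (requires a running Druid cluster)."""
    req = urllib.request.Request(
        broker + "/druid/v2/?pretty",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Queries go to the broker (port 8082 in the quickstart), which fans them out to real-time and historical nodes and merges the partial results.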
Visualization tool (Pivot)
Who is using Druid?
QUESTIONS?

druid.io
