Real-Time Analytics With Druid
Aaron Brooks
Solutions Engineer
abrooks@hortonworks.com
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Druid Overview
 Architecture
 Data model & queries
 Druid with Hive
 Demo
Druid Overview
Druid capabilities
 Streaming ingestion capability
 Data Freshness – analyze events as they occur
 Fast response time (ideally < 1sec query time)
 Arbitrary slicing and dicing
 Multi-tenancy – 1000s of concurrent users
 Scalability and Availability
 Rich real-time visualization with Superset
Druid is a distributed, real-time, column-oriented datastore
designed to quickly ingest and index large amounts of data
and make it available for real-time query.
Companies Using Druid
History
 Development started at Metamarkets
in 2011
 Initial use case
– power ad-tech analytics product
 Open sourced in late 2012
– GPL licensed initially
– Switched to Apache V2 in early 2015
 150+ committers today
 In production at many companies
Druid Is Red Hot Technology
[Chart] Popularity of Major Data Management Technologies by GitHub Followers, September 2012 – September 2017 (Source: GitHub). Series: Cassandra, Hadoop, Kafka, Spark, Storm, Druid. In data / analytics, only Spark and Kafka have more traction than Druid. Druid is the foundation of the modern streaming architecture.
Cool stuff you can do with Druid
 Spatial Indexing
– Query within a rectangular region or a given radius of a point.
 Use cases:
– In-store push offers.
– Count users / devices within a radius.
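As a hedged sketch in Druid's native JSON query language, a count of devices within a radius might look like this (the datasource, dimension name, and coordinates are illustrative, and assume the `coordinates` dimension was spatially indexed at ingestion time):

```json
{
  "queryType": "timeseries",
  "dataSource": "devices",
  "granularity": "all",
  "intervals": ["2017-01-01/2017-01-02"],
  "aggregations": [{ "type": "count", "name": "devices_in_range" }],
  "filter": {
    "type": "spatial",
    "dimension": "coordinates",
    "bound": { "type": "radius", "coords": [40.75, -73.99], "radius": 0.1 }
  }
}
```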
Druid: Fast Facts
Most Events per Day
30 Billion Events / Day
(Metamarkets)
Most Computed Metrics
2 Billion Metrics / Min
(Jolata)
Largest Cluster
200 Nodes
(Metamarkets)
Largest Hourly Ingestion
3TB per Hour
(Netflix)
Druid Architecture
Druid: Services
 Broker
 Coordinator
 Historical
 Realtime
 Router
 Overlord, Middle Managers and Peons
 Different node types for solving different problems
 Processes dedicated for
– Historical data
– Ingestion
– Coordination
– Result merging
A Typical Druid Deployment
[Diagram] Many nodes, sized by data volume and query load; HDFS or S3 as deep storage; Superset, dashboards, or BI tools on top.
Druid: Segments
 Data in Druid is stored in Segment Files.
 Partitioned by time, supports fast time-based slice-and-dice.
 Ideally, segment files are each smaller than 1GB.
 If files are large, smaller time partitions are needed.
[Diagram] Time axis → Segment 1: Monday | Segment 2: Tuesday | Segment 3: Wednesday | Segment 4: Thursday | Segment 5: Friday
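Day-sized segments like the ones above are controlled by the granularitySpec of the ingestion spec; a minimal sketch (interval values are illustrative):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2017-01-02/2017-01-07"]
}
```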
Historical Nodes
 Main workhorses of the Druid cluster
 Load immutable, read-optimized segments
 Respond to queries
 Use memory-mapped files to load segments
Broker Nodes
 Keeps track of segment announcements in the cluster
– (This information is kept in ZooKeeper, much as Storm and HBase do.)
 Scatters queries across historical and realtime nodes
– (Clients issue queries to this node, but queries are processed elsewhere.)
 Merges results from the different query nodes
 (Distributed) caching layer
Coordinator Nodes
 Assigns segments to historical nodes
 Interval-based cost function to distribute segments
 Makes sure query load is uniform across historical nodes
 Handles replication of data
 Configurable rules to load/drop data
Druid: Segment Data Structures
 Within a Segment:
– Timestamp Column Group.
– Dimensions Column Group.
– Metrics Column Group.
– Indexes that facilitate fast lookup and aggregation.
What Makes Druid Fast?
Druid High Level Architecture
[Diagram] Batch Data → Hadoop → Historical Nodes; Streaming Data → ETL (Samza, Kafka, Storm, Spark, etc.) → Realtime Nodes → handoff to Historical Nodes; Broker Node receives queries and fans them out to Historical and Realtime nodes.
Druid: Batch Indexing
 Indexing is performed by related components:
– Overlord
– Middle Managers
– Peons
 Batch indexing is done on data that already exists in
Deep Storage (e.g. HDFS).
 Middle Managers spawn peons, which run ingestion tasks
 Middle Managers receive task definitions that define which tasks to run and their properties
 Each peon runs a single task
Stream Ingestion: Real time Index Tasks
 Ability to ingest streams of data
 Stores data in a write-optimized structure
 Periodically converts the write-optimized structure to read-optimized segments
 Events are queryable as soon as they are ingested
 Both push-based (Tranquility) and pull-based ingestion
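On the pull-based side, a trimmed Kafka indexing service supervisor spec might look like the sketch below; the dataSchema is elided, and the topic, broker address, and datasource name are illustrative:

```json
{
  "type": "kafka",
  "dataSchema": { "dataSource": "wikiticker" },
  "ioConfig": {
    "topic": "wikiticker",
    "consumerProperties": { "bootstrap.servers": "kafkabroker:9092" },
    "taskCount": 1,
    "taskDuration": "PT1H"
  }
}
```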
Druid: Realtime Indexing
[Diagram] Push path: Spark / Flink / Storm / Python clients push events via Tranquility to the Indexing Service (Overlord → MiddleManager → Peons). Pull path: peons pull events directly from Kafka. Completed segments are pushed to Deep Storage and loaded by Historical nodes into their segment caches; the Coordinator, Broker, and ZooKeeper coordinate segment assignment and discovery.
Data model and queries
Queries
 Timeseries: time-based queries
 TopN: equivalent to group-by + order over one dimension
– approximate if more than 1000 dimension values
 GroupBy
 Time boundary: returns the earliest and latest data points of a data set
 Search queries / Select
 For each query we can use operators like:
– Granularity (roll-up)
– Filters
– Aggregation / Post-Aggregation
– Etc.
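As a concrete sketch, a native TopN query for the ten most-edited pages might look like this (it assumes a wikiticker datasource with a count metric rolled up at ingestion):

```json
{
  "queryType": "topN",
  "dataSource": "wikiticker",
  "intervals": ["2015-09-12/2015-09-13"],
  "granularity": "all",
  "dimension": "page",
  "metric": "edits",
  "threshold": 10,
  "aggregations": [{ "type": "longSum", "name": "edits", "fieldName": "count" }]
}
```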
Group-by example
Results example
Druid And Hive
Druid Hive integration
Query Druid from Hive with SQL:
 Data already existing in Druid
 Druid has its own JSON-based query language
 No native BI tool integration
 Point Hive to the broker and specify the data source name
 Use Hive as a virtualization layer
 Query Druid data with SQL and plug in any BI tool
Hive query acceleration:
 Data already existing in Hive
 Data stored in a distributed filesystem such as HDFS or S3, in a format Hive can read (e.g. TSV, CSV, ORC, Parquet)
 Perform pre-processing over various data sources before feeding them to Druid
 Accelerate queries over Hive data
 Join between hot and cold data
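The hot/cold join can be sketched in HiveQL as below; both table names and columns are hypothetical, with recent data served by Druid and historical data by Hive:

```sql
-- Hypothetical hot/cold join: druid_wikiticker is Druid-backed,
-- wiki_history is an ordinary Hive table in HDFS.
SELECT d.page,
       SUM(d.c_added)     AS added_today,
       MAX(h.total_edits) AS edits_all_time
FROM druid_wikiticker d
JOIN wiki_history h ON d.page = h.page
GROUP BY d.page;
```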
Query Druid from Hive with SQL
 Point Hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
 Simple CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE druid_wikiticker                        -- Hive table name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'  -- Hive storage handler classname
TBLPROPERTIES ("druid.datasource" = "wikiticker");            -- Druid data source name
 Broker node endpoint specified as a Hive configuration parameter
 Automatic Druid data schema discovery via a segment metadata query
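Once the external table exists, ordinary HiveQL runs against it and is pushed down to Druid where possible; a hedged example (column names assume the wikiticker schema used later in this deck):

```sql
SELECT page, COUNT(*) AS events
FROM druid_wikiticker
GROUP BY page
ORDER BY events DESC
LIMIT 10;
```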
Query Druid from Hive (screenshots: Hive table creation, query with SQL, Hive query plan)
Hive queries acceleration
 Use a Create Table As Select (CTAS) statement:
CREATE TABLE druid_wikiticker                                 -- Hive table name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'  -- Hive storage handler classname
TBLPROPERTIES ("druid.datasource" = "wikiticker",             -- Druid data source name
               "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
 Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type
– Timestamp -> __time
– Dimensions -> page, user
– Metrics -> c_added, c_removed
Credit: jcamacho@apache.org
Index Hive in Druid (screenshots: Druid SQL query, Hive SQL query)
Example of Druid queries
Cryptocurrency Market Data
Screenshots: Druid query, results in shell, results in Superset
Superset
Demo
