Real-Time Analytics With Druid
Aaron Brooks
Solutions Engineer
abrooks@hortonworks.com
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Druid Overview
 Architecture
 Data model & queries
 Druid with Hive
 Demo
Druid Overview
Druid capabilities
 Streaming ingestion capability
 Data Freshness – analyze events as they occur
 Fast response time (ideally < 1sec query time)
 Arbitrary slicing and dicing
 Multi-tenancy – 1000s of concurrent users
 Scalability and Availability
 Rich real-time visualization with Superset
Druid is a distributed, real-time, column-oriented datastore
designed to quickly ingest and index large amounts of data
and make it available for real-time query.
Companies Using Druid
History
 Development started at Metamarkets
in 2011
 Initial use case
– power ad-tech analytics product
 Open sourced in late 2012
– GPL licensed initially
– Switched to Apache V2 in early 2015
 150+ committers today
 In production at many companies
Druid Is Red Hot Technology
[Chart] Popularity of Major Data Management Technologies by GitHub Followers, September 2012 – September 2017 (Source: GitHub). Series: Cassandra, Hadoop, Kafka, Spark, Storm, Druid. In data / analytics, only Spark and Kafka have more traction than Druid. Druid is the foundation of the modern streaming architecture.
Cool stuff you can do with Druid
 Spatial Indexing
– Query within a rectangular region or a given radius of a point.
 Use cases:
– In-store push offers.
– Count users / devices within a radius.
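As a hedged sketch in Druid's native JSON query language, a count of devices within a radius might look like this (the datasource, dimension name, and coordinates are illustrative, and assume the `coordinates` dimension was spatially indexed at ingestion time):

```json
{
  "queryType": "timeseries",
  "dataSource": "devices",
  "granularity": "all",
  "intervals": ["2017-01-01/2017-01-02"],
  "aggregations": [{ "type": "count", "name": "devices_in_range" }],
  "filter": {
    "type": "spatial",
    "dimension": "coordinates",
    "bound": { "type": "radius", "coords": [40.75, -73.99], "radius": 0.1 }
  }
}
```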
Druid: Fast Facts
Most Events per Day
30 Billion Events / Day
(Metamarkets)
Most Computed Metrics
2 Billion Metrics / Min
(Jolata)
Largest Cluster
200 Nodes
(Metamarkets)
Largest Hourly Ingestion
3TB per Hour
(Netflix)
Druid Architecture
Druid: Services
 Broker
 Coordinator
 Historical
 Realtime
 Router
 Overlord, Middle Managers and Peons
 Different node types for solving different problems
 Processes dedicated for
– Historical data
– Ingestion
– Coordination
– Result merging
A Typical Druid Deployment
[Diagram] Many nodes, sized by data volume and query load; HDFS or S3 as deep storage; Superset, dashboards, or BI tools on top.
Druid: Segments
 Data in Druid is stored in Segment Files.
 Partitioned by time, supports fast time-based slice-and-dice.
 Ideally, segment files are each smaller than 1GB.
 If files are large, smaller time partitions are needed.
[Diagram] Time axis → Segment 1: Monday | Segment 2: Tuesday | Segment 3: Wednesday | Segment 4: Thursday | Segment 5: Friday
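Day-sized segments like the ones above are controlled by the granularitySpec of the ingestion spec; a minimal sketch (interval values are illustrative):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2017-01-02/2017-01-07"]
}
```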
Historical Nodes
 Main workhorses of the Druid cluster
 Load immutable, read-optimized segments
 Respond to queries
 Use memory-mapped files to load segments
Broker Nodes
 Keeps track of segment announcements in the cluster
– (This information is kept in ZooKeeper, much as Storm and HBase do.)
 Scatters queries across historical and realtime nodes
– (Clients issue queries to this node, but queries are processed elsewhere.)
 Merges results from the different query nodes
 (Distributed) caching layer
Coordinator Nodes
 Assigns segments to historical nodes
 Interval-based cost function to distribute segments
 Makes sure query load is uniform across historical nodes
 Handles replication of data
 Configurable rules to load/drop data
Druid: Segment Data Structures
 Within a Segment:
– Timestamp Column Group.
– Dimensions Column Group.
– Metrics Column Group.
– Indexes that facilitate fast lookup and aggregation.
What Makes Druid Fast?
Druid High Level Architecture
[Diagram] Batch Data → Hadoop → Historical Nodes; Streaming Data → ETL (Samza, Kafka, Storm, Spark, etc.) → Realtime Nodes → handoff to Historical Nodes; Broker Node receives queries and fans them out to Historical and Realtime nodes.
Druid: Batch Indexing
 Indexing is performed by related components:
– Overlord
– Middle Managers
– Peons
 Batch indexing is done on data that already exists in
Deep Storage (e.g. HDFS).
 Middle Managers spawn peons, which run ingestion tasks
 Middle Managers receive task definitions that define which tasks to run and their properties
 Each peon runs a single task
Stream Ingestion: Real time Index Tasks
 Ability to ingest streams of data
 Stores data in a write-optimized structure
 Periodically converts the write-optimized structure to read-optimized segments
 Events are queryable as soon as they are ingested
 Both push-based (Tranquility) and pull-based ingestion
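On the pull-based side, a trimmed Kafka indexing service supervisor spec might look like the sketch below; the dataSchema is elided, and the topic, broker address, and datasource name are illustrative:

```json
{
  "type": "kafka",
  "dataSchema": { "dataSource": "wikiticker" },
  "ioConfig": {
    "topic": "wikiticker",
    "consumerProperties": { "bootstrap.servers": "kafkabroker:9092" },
    "taskCount": 1,
    "taskDuration": "PT1H"
  }
}
```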
Druid: Realtime Indexing
[Diagram] Push path: Spark / Flink / Storm / Python clients push events via Tranquility to the Indexing Service (Overlord → MiddleManager → Peons). Pull path: peons pull events directly from Kafka. Completed segments are pushed to Deep Storage and loaded by Historical nodes into their segment caches; the Coordinator, Broker, and ZooKeeper coordinate segment assignment and discovery.
Data model and queries
Queries
 Timeseries: time-based queries
 TopN: equivalent to group-by + order over one dimension
– approximate if more than 1000 dimension values
 GroupBy
 Time boundary: returns the earliest and latest data points of a data set
 Search queries / Select
 For each query we can use operators like:
– Granularity (roll-up)
– Filters
– Aggregation / Post-Aggregation
– Etc.
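As a concrete sketch, a native TopN query for the ten most-edited pages might look like this (it assumes a wikiticker datasource with a count metric rolled up at ingestion):

```json
{
  "queryType": "topN",
  "dataSource": "wikiticker",
  "intervals": ["2015-09-12/2015-09-13"],
  "granularity": "all",
  "dimension": "page",
  "metric": "edits",
  "threshold": 10,
  "aggregations": [{ "type": "longSum", "name": "edits", "fieldName": "count" }]
}
```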
Group-by example
Results example
Druid And Hive
Druid Hive integration
Query Druid from Hive with SQL:
 Data already existing in Druid
 Druid has its own JSON-based query language
 No native BI tool integration
 Point Hive to the broker and specify the data source name
 Use Hive as a virtualization layer
 Query Druid data with SQL and plug in any BI tool
Hive query acceleration:
 Data already existing in Hive
 Data stored in a distributed filesystem such as HDFS or S3, in a format Hive can read (e.g. TSV, CSV, ORC, Parquet)
 Perform pre-processing over various data sources before feeding them to Druid
 Accelerate queries over Hive data
 Join between hot and cold data
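The hot/cold join can be sketched in HiveQL as below; both table names and columns are hypothetical, with recent data served by Druid and historical data by Hive:

```sql
-- Hypothetical hot/cold join: druid_wikiticker is Druid-backed,
-- wiki_history is an ordinary Hive table in HDFS.
SELECT d.page,
       SUM(d.c_added)     AS added_today,
       MAX(h.total_edits) AS edits_all_time
FROM druid_wikiticker d
JOIN wiki_history h ON d.page = h.page
GROUP BY d.page;
```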
Query Druid from Hive with SQL
 Point Hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
 Simple CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE druid_wikiticker                        -- Hive table name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'  -- Hive storage handler classname
TBLPROPERTIES ("druid.datasource" = "wikiticker");            -- Druid data source name
 Broker node endpoint specified as a Hive configuration parameter
 Automatic Druid data schema discovery via a segment metadata query
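Once the external table exists, ordinary HiveQL runs against it and is pushed down to Druid where possible; a hedged example (column names assume the wikiticker schema used later in this deck):

```sql
SELECT page, COUNT(*) AS events
FROM druid_wikiticker
GROUP BY page
ORDER BY events DESC
LIMIT 10;
```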
Query Druid from Hive (screenshots: Hive table creation, query with SQL, Hive query plan)
Hive queries acceleration
 Use a Create Table As Select (CTAS) statement:
CREATE TABLE druid_wikiticker                                 -- Hive table name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'  -- Hive storage handler classname
TBLPROPERTIES ("druid.datasource" = "wikiticker",             -- Druid data source name
               "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
 Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type
– Timestamp -> __time
– Dimensions -> page, user
– Metrics -> c_added, c_removed
Credit: jcamacho@apache.org
Index Hive in Druid (screenshots: Druid SQL query, Hive SQL query)
Example of Druid queries
Cryptocurrency Market Data
Screenshots: Druid query, results in shell, results in Superset
Superset
Demo
