Time-series data analysis and persistence with Druid

© Hortonworks Inc. 2011 – 2018. All Rights Reserved
Big Data Developers in Madrid
Time-series data analysis and persistence with Druid
(IoT, clickstream analytics, ...)
Raúl Marín
Solutions Engineering @ Hortonworks
May 10th, 2018

▪ What’s Druid?
▪ What does Druid look like?
▪ How is data modeled in Druid?
▪ How to query data on Druid
▪ Demo
Agenda

⬢ Development started in 2011 at Metamarkets; open-sourced in late 2012
⬢ Initial use case: interactive ad analytics
⬢ 150+ contributors
⬢ Main features:
– Column-oriented distributed data store
– Scalable to petabytes of data and thousands of concurrent users
– Batch and real-time ingestion
– Sub-second response for arbitrary time-based slice-and-dice:
• Data partitioned by time
• Automatic data summarization
• Approximate algorithms (HyperLogLog, theta sketches)
• Automatic indexing on load
What’s Druid? An overview
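
As an illustration of the automatic data summarization listed above, here is a minimal sketch that rolls raw events up to hourly granularity. The event fields (`page`, `country`, `clicks`) and the `rollup` helper are hypothetical and are not Druid APIs; Druid performs a comparable step at ingestion time.

```python
from collections import defaultdict
from datetime import datetime

def rollup(events, dimensions, metric):
    """Truncate each timestamp to the hour, then aggregate the metric over
    rows that share the same (hour, dimension values) key."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        hour = datetime.fromisoformat(event["timestamp"]).replace(
            minute=0, second=0, microsecond=0)
        key = (hour.isoformat(),) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)

# Hypothetical raw events (three rows in, two summarized rows out).
raw_events = [
    {"timestamp": "2018-05-10T10:01:00", "page": "home", "country": "ES", "clicks": 1},
    {"timestamp": "2018-05-10T10:17:00", "page": "home", "country": "ES", "clicks": 3},
    {"timestamp": "2018-05-10T11:05:00", "page": "home", "country": "ES", "clicks": 2},
]
summary = rollup(raw_events, ["page", "country"], "clicks")
```

The coarser the chosen granularity, the fewer rows are stored, at the cost of losing the individual events.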

⬢ Interactive and Exploratory Analytics on event data
⬢ Suitable for BI/OLAP workloads that demand interactive visualization of complex data streams:
– Real-time bidding events
– User activity streams
– Voice call logs
– Network traffic flows
– Firewall events
– Application performance metrics
⬢ Querying event data at large scale poses multiple challenges:
– Window joining not guaranteed
– Potentially duplicated events
Where does Druid shine?

High Level Druid Architecture
[Diagram. Components shown: Broker Node (receives queries); three Historical Nodes (batch data); two Realtime Nodes (streaming data from Kafka, Storm, Spark, API (Twitter), etc.); handoff of realtime data; Deep Storage on HDFS (HDP) or S3.]

⬢ Scatter queries across historical and realtime nodes
– (Clients issue queries to the broker, but the queries are processed on the other nodes.)
⬢ Merge the partial results returned by those nodes
⬢ (Distributed) caching layer
Broker Nodes
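
The scatter/merge behaviour described above can be sketched as follows. This is a toy model under stated assumptions: the node names, the in-memory `NODE_ROWS` data, and the functions `query_node` and `broker_query` are all hypothetical, and real brokers talk to nodes over HTTP rather than calling local functions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each node holds the "country" values of its own segments.
NODE_ROWS = {
    "historical-1": ["ES", "ES", "FR"],
    "historical-2": ["FR", "US"],
    "realtime-1": ["ES"],
}

def query_node(node):
    """Each node computes a partial result (count per country) locally."""
    return Counter(NODE_ROWS[node])

def broker_query(nodes):
    """Scatter the query to all nodes in parallel, then merge the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(query_node, nodes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)

result = broker_query(list(NODE_ROWS))
```

Counts merge cleanly because addition is associative; that property is what lets each node aggregate independently before the broker combines the results.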

⬢ Ingest streams of data
⬢ Support both push- and pull-based ingestion
⬢ Store incoming data in a write-optimized structure
⬢ Periodically convert the write-optimized structure into read-optimized segments
⬢ Events are queryable as soon as they are ingested
Realtime Nodes
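
A toy model of the write-optimized to read-optimized conversion described above, assuming a simple row-count threshold for persisting. The class `RealtimeBuffer` and its methods are hypothetical and only mimic the idea; they do not reflect Druid's actual in-memory structures.

```python
class RealtimeBuffer:
    """Append-only buffer (write-optimized) that is periodically frozen
    into sorted, immutable tuples (read-optimized)."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.buffer = []    # cheap appends, unsorted
        self.segments = []  # immutable tuples, each sorted by timestamp

    def ingest(self, timestamp, value):
        self.buffer.append((timestamp, value))
        if len(self.buffer) >= self.max_rows:
            self.persist()

    def persist(self):
        """Freeze the current buffer into a sorted, immutable segment."""
        self.segments.append(tuple(sorted(self.buffer)))
        self.buffer = []

    def query(self, start, end):
        """Rows are visible whether or not they have been persisted yet."""
        rows = [row for segment in self.segments for row in segment]
        rows += self.buffer
        return sorted(row for row in rows if start <= row[0] < end)

node = RealtimeBuffer(max_rows=2)
node.ingest(3, "c")
node.ingest(1, "a")  # second row triggers a persist
node.ingest(2, "b")  # still only in the write-optimized buffer
visible = node.query(0, 10)
```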

⬢ Shared-nothing architecture
⬢ Main workhorses of a Druid cluster
⬢ Load immutable, read-optimized segments
⬢ Respond to queries
⬢ Use memory-mapped files to load segments
Historical nodes

⬢ Assigns segments to historical nodes
⬢ Interval-based cost function to distribute segments
⬢ Makes sure query load is uniform across historical nodes
⬢ Handles replication of data
⬢ Configurable rules to load/drop data
Coordinator Nodes
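
To make the configurable load/drop rules more concrete, here is a sketch of what a retention rule list might look like, built as Python dictionaries and serialized to JSON. The rule types `loadByPeriod` and `dropForever` and the `tieredReplicants` field follow Druid's documented retention rules as best I know them; the one-month period and the replica count of 2 are assumptions chosen for illustration, so check the documentation for your Druid version before using them.

```python
import json

# Rules are evaluated in order; the first rule that matches a segment applies.
retention_rules = [
    {
        "type": "loadByPeriod",
        "period": "P1M",                            # assumed: keep the last month
        "tieredReplicants": {"_default_tier": 2},   # assumed: two replicas
    },
    {"type": "dropForever"},                        # everything older is dropped
]

rules_json = json.dumps(retention_rules, indent=2)
```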

⬢ A highly-available & distributed service for indexing tasks
⬢ Indexing is performed by related components:
– Overlord
– Middle Managers
– Peons
⬢ The indexing task is defined in a JSON file and submitted to the Overlord.
Indexing service
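
A hedged sketch of such a JSON task definition and its submission to the Overlord. The overall layout (`dataSchema`, `parser`, `granularitySpec`, `ioConfig` with a `firehose`) follows the native batch ingestion spec of Druid releases from around 2018, and newer releases changed this layout. The host name, column names, interval, and file paths are hypothetical; `wikiticker` is reused from the Hive example in this deck.

```python
import json
import urllib.request

# Hypothetical Overlord address; /druid/indexer/v1/task is the task endpoint.
OVERLORD_URL = "http://overlord.example.com:8090/druid/indexer/v1/task"

index_task = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "wikiticker",
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "time", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["page", "user"]},
                },
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",   # one segment per day
                "queryGranularity": "HOUR",    # roll up to hourly rows
                "intervals": ["2015-09-12/2015-09-13"],
            },
        },
        "ioConfig": {
            "type": "index",
            "firehose": {"type": "local", "baseDir": "/tmp/data",
                         "filter": "wikiticker.json"},
        },
    },
}

def submit_task(task, url=OVERLORD_URL):
    """POST the task JSON to the Overlord and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `submit_task(index_task)` requires a reachable Overlord, so it is defined here but not invoked.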

⬢ Data is organized in Data Sources, the top-level abstraction, equivalent to tables
⬢ Within a Data Source, data is stored in Segments
⬢ Segments are partitioned by time and, optionally, by some dimensions
⬢ Segment size matters in order to avoid resource contention (~GBs)
Data Storage in Druid
[Diagram: time axis split into one segment per day. Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segment 5: Friday]
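
The benefit of time partitioning can be sketched in a few lines: events are grouped into one segment per day, and a query over a time interval only needs to touch the segments whose day falls inside it. The function names and sample events are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime

def partition_by_day(events):
    """Group (timestamp, payload) events into one segment per calendar day."""
    segments = defaultdict(list)
    for timestamp, payload in events:
        segments[datetime.fromisoformat(timestamp).date()].append(payload)
    return dict(segments)

def segments_for_interval(segments, start, end):
    """Only segments whose day falls in [start, end) need to be scanned."""
    return sorted(day for day in segments if start <= day < end)

# Hypothetical events spread over Monday to Wednesday.
events = [
    ("2018-05-07T09:00:00", "login"),
    ("2018-05-07T18:30:00", "logout"),
    ("2018-05-08T10:15:00", "login"),
    ("2018-05-09T11:45:00", "click"),
]
segments = partition_by_day(events)
to_scan = segments_for_interval(segments, date(2018, 5, 8), date(2018, 5, 10))
```

A query for Tuesday and Wednesday skips the Monday segment entirely, regardless of how much data it holds.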

⬢ A segment contains:
– A timestamp column
– One or more dimension columns
– One or more metric columns
– Indexes to facilitate fast lookups and aggregations
Segment Data Structure
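
How indexes on dimension columns speed up lookups and aggregations can be illustrated with dictionary encoding plus one bitmap per dimension value. This is a simplified sketch: Python integers stand in for compressed bitmaps, and the function names and sample columns are hypothetical.

```python
def build_dimension_column(values):
    """Return (dictionary, encoded column, bitmap per value).
    Bit i of a value's bitmap is set when row i holds that value."""
    dictionary = {value: code for code, value in enumerate(sorted(set(values)))}
    encoded = [dictionary[value] for value in values]
    bitmaps = {value: 0 for value in dictionary}
    for row, value in enumerate(values):
        bitmaps[value] |= 1 << row
    return dictionary, encoded, bitmaps

def rows_matching(bitmap):
    """List the row numbers whose bit is set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

country = ["ES", "FR", "ES", "US"]   # dimension column
clicks = [1, 2, 3, 4]                # metric column

dictionary, encoded, bitmaps = build_dimension_column(country)

# Filter "country = ES OR country = US" is a bitwise OR of two bitmaps.
matching_rows = rows_matching(bitmaps["ES"] | bitmaps["US"])
total_clicks = sum(clicks[row] for row in matching_rows)
```

Filters become bitwise operations over the bitmaps, and only the matching rows of the metric column are read for the aggregation.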

Typical queries and operators available on Druid
⬢ Time based (time-series)
⬢ Filters (search/select)
⬢ Group by
⬢ Top N - equivalent to a group by plus ordering over one dimension (results are approximated for efficiency if there are more than 1000 dimension values)
⬢ Time boundary - earliest and latest data points of a data set
⬢ Granularity (roll up)
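
The Top N semantics described above (group by one dimension, order by a metric, keep the first N) can be sketched in plain Python. The `top_n` helper and the sample rows are hypothetical, and this version is exact, whereas Druid's Top N may approximate for high-cardinality dimensions.

```python
from collections import defaultdict

def top_n(rows, dimension, metric, n):
    """Group by one dimension, sum the metric, order descending, keep n."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[metric]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Hypothetical rows.
rows = [
    {"page": "a", "edits": 5},
    {"page": "b", "edits": 3},
    {"page": "a", "edits": 2},
    {"page": "c", "edits": 4},
]
top_pages = top_n(rows, "page", "edits", 2)
```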

Queries and results are expressed in JSON (HTTP REST API)
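
A minimal sketch of a native JSON query posted to the broker. The broker address and the `wikiticker` data source are taken from the Hive example in this deck; the interval, the `count` metric column, and the `run_query` helper are assumptions for illustration.

```python
import json
import urllib.request

# Broker host and port as used in the Hive configuration in this deck.
BROKER_URL = "http://druid.broker.hostname:8082/druid/v2/"

# Hourly totals over one day; "count" is an assumed metric column.
timeseries_query = {
    "queryType": "timeseries",
    "dataSource": "wikiticker",
    "granularity": "hour",
    "intervals": ["2015-09-12/2015-09-13"],
    "aggregations": [
        {"type": "longSum", "name": "edits", "fieldName": "count"},
    ],
}

def run_query(query, url=BROKER_URL):
    """POST a native query to the broker and return the parsed JSON result."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_query(timeseries_query)` needs a running broker; a timeseries result is a JSON array with one entry per time bucket.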

Superset - BI Dashboarding fully integrated with Druid

Query Druid from Hive with SQL (BI tools integration)
⬢ Druid supports SQL natively (experimental feature) → Hive integration preferred
⬢ Point Hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
⬢ Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_wikiticker
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
⬢ Broker node endpoint specified as a Hive configuration parameter
⬢ Automatic Druid data schema discovery: segment metadata query
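In the CREATE EXTERNAL TABLE statement above:
– druid_wikiticker: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler class name
– wikiticker: Druid data source name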

DEMO
