Interactive Analytics at Scale
in Apache Hive using Druid
Jesús Camacho Rodríguez
DataWorks Summit Europe
April 5, 2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Motivation
Interactive analytics on event data
▪ BI/OLAP applications that require interactive visualization of complex data streams
– Real-time bidding events
– User activity streams
– Voice call logs
– Network traffic flows
– Firewall events
– Application performance metrics
▪ Querying event data at large scale poses multiple challenges
Druid overview
▪ Development started in 2011; open-sourced in late 2012
▪ Initial use case: interactive ad analytics
▪ 150+ contributors
▪ Main features
– Column-oriented distributed data store
– Batch and real-time ingestion
– Scalable to petabytes of data
– Sub-second response for arbitrary time-based slice-and-dice
• Data partitioned by time dimension
• Automatic data summarization
• Approximate algorithms (HyperLogLog, theta sketches)
Most events per day: 30 billion events/day (Metamarkets)
Most computed metrics: 1 billion metrics/min (Jolata)
Largest cluster: 200 nodes (Metamarkets)
Largest hourly ingestion: 2 TB per hour (Netflix)
Druid architecture
[Figure: Druid architecture diagram; the only surviving label is "Dashboards, BI tools"]
Persistent storage
▪ Data in Druid is stored in segment files
▪ Segments are partitioned by time, which supports fast time-based slice-and-dice
▪ Ideally, each segment file is smaller than 1 GB
▪ If files are large, smaller time partitions are needed
[Figure: time axis split into one segment per day: Segment 1 (Monday), Segment 2 (Tuesday), Segment 3 (Wednesday), Segment 4 (Thursday), Segment 5 (Friday)]
Segment data structures
▪ Within a segment
– Timestamp column
– Dimension columns
– Metric columns
– Indexes to facilitate fast lookup and aggregation
Querying
▪ HTTP REST API
▪ Queries and results expressed in JSON
▪ Multiple query types
– Time boundary
– Segment metadata
– Timeseries
– TopN
– GroupBy
– Select
{
  "queryType": "groupBy",
  "dataSource": "product_sales_index",
  "granularity": "all",
  "dimensions": [ "product_id" ],
  "aggregations": [ { "type": "doubleSum", "name": "s", "fieldName": "sales" } ],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Important to use the appropriate query type → impact on query performance
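As a rough illustration of the HTTP/JSON interface described on this slide, the sketch below wraps a native groupBy query in a POST request. The broker address `localhost:8082` and the helper name `build_druid_request` are assumptions made for the example; the `/druid/v2/` path is the native query endpoint in Druid's documentation. The request is only constructed, not sent.

```python
import json
import urllib.request

# Assumption: a broker reachable at localhost:8082; adjust for a real cluster.
BROKER = "localhost:8082"


def build_druid_request(query: dict, broker: str = BROKER) -> urllib.request.Request:
    """Wrap a native Druid query (a dict) in an HTTP POST request (hypothetical helper)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{broker}/druid/v2/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The groupBy query from the slide, expressed as a Python dict.
query = {
    "queryType": "groupBy",
    "dataSource": "product_sales_index",
    "granularity": "all",
    "dimensions": ["product_id"],
    "aggregations": [{"type": "doubleSum", "name": "s", "fieldName": "sales"}],
    "limitSpec": {
        "type": "default",
        "limit": 10,
        "columns": [{"dimension": "s", "direction": "descending"}],
    },
    "intervals": ["2010-01-01T00:00:00.000/2012-01-01T00:00:00.000"],
}

request = build_druid_request(query)
# Against a live broker, the request could be sent with:
#   with urllib.request.urlopen(request) as response:
#       rows = json.loads(response.read())
```

Keeping request construction separate from sending makes the query payload easy to inspect before it reaches a cluster.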
Druid + Apache Hive
▪ Integration brings benefits to both Druid and Apache Hive
– Indexing complex query results in Druid using Hive
– Introducing a SQL interface on top of Druid
– Being able to execute complex operations on Druid data
– Efficient execution of OLAP queries in Hive
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
Druid data sources in Hive
▪ The user needs to provide information about Druid data sources to Hive
▪ Two options, depending on requirements
– Register Druid data sources in Hive
• Data is already stored in Druid
– Create Druid data sources from Hive
• Data is stored in Hive
• User may want to pre-process the data before storing it in Druid
Druid data sources in Hive
Registering Druid data sources
▪ Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
– druid_table_1: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler class name
– wikiticker: Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
Druid data sources in Hive
Creating Druid data sources
▪ Use a CREATE TABLE AS SELECT (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
– druid_table_1: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler class name
– wikiticker: Druid data source name
– DAY: Druid segment granularity
Druid data sources in Hive
Creating Druid data sources
▪ Use a CREATE TABLE AS SELECT (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
– Timestamp: __time
– Dimensions: page, user
– Metrics: c_added, c_removed
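The column-type inference mentioned on this slide could look roughly like the sketch below. The rule used here (the `__time` column becomes the timestamp, numeric Hive types become metrics, everything else becomes a dimension) is an assumption for illustration, as are the helper name `classify_columns` and the Hive types given for `src`; the actual rules in Hive may differ.

```python
# Assumption: these Hive type names are treated as numeric, hence as Druid metrics.
NUMERIC_HIVE_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def classify_columns(schema: dict) -> dict:
    """Map each Hive column to a Druid role (hypothetical helper, not Hive's code)."""
    roles = {}
    for name, hive_type in schema.items():
        base_type = hive_type.lower().split("(")[0]  # "decimal(10,2)" -> "decimal"
        if name == "__time":
            roles[name] = "timestamp"
        elif base_type in NUMERIC_HIVE_TYPES:
            roles[name] = "metric"
        else:
            roles[name] = "dimension"
    return roles


# Hypothetical Hive schema for the columns selected in the CTAS statement.
src_schema = {
    "__time": "timestamp",
    "page": "string",
    "user": "string",
    "c_added": "bigint",
    "c_removed": "bigint",
}
roles = classify_columns(src_schema)
```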
Druid data sources in Hive
Creating Druid data sources
▪ File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Original CTAS physical plan: Table Scan → Select → File Sink
CTAS query results:
__time                page    user   c_added  c_removed
2011-01-01T01:05:00Z  Justin  Boxer  1800     25
2011-01-02T19:00:00Z  Justin  Reach  2912     42
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170
2011-01-02T18:00:00Z  Miley   Ashu   2232     34
Druid data sources in Hive
Creating Druid data sources
▪ File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Rewritten CTAS physical plan: Table Scan → Select (truncate timestamp to day granularity) → Reduce → File Sink
CTAS query results:
__time                page    user   c_added  c_removed  __time_granularity
2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z
Druid data sources in Hive
Creating Druid data sources
▪ File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Rewritten CTAS physical plan: Table Scan → Select → Reduce → File Sink
CTAS query results, grouped by segment:
Segment 2011-01-01
2011-01-01T01:05:00Z  Justin  Boxer  1800  25   2011-01-01T00:00:00Z
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953  17   2011-01-01T00:00:00Z
Segment 2011-01-02
2011-01-02T19:00:00Z  Justin  Reach  2912  42   2011-01-02T00:00:00Z
2011-01-02T13:00:00Z  Ke$ha   Helz   3194  170  2011-01-02T00:00:00Z
2011-01-02T18:00:00Z  Miley   Ashu   2232  34   2011-01-02T00:00:00Z
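The partitioning step on these slides can be sketched as follows: each timestamp is truncated to day granularity, and rows that share a truncated value end up in the same segment. This only illustrates the idea in plain Python; the function names are hypothetical and it is not how the Hive File Sink operator is implemented.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Rows from the slide: (__time, page, user, c_added, c_removed).
ROWS = [
    ("2011-01-01T01:05:00Z", "Justin", "Boxer", 1800, 25),
    ("2011-01-02T19:00:00Z", "Justin", "Reach", 2912, 42),
    ("2011-01-01T11:00:00Z", "Ke$ha", "Xeno", 1953, 17),
    ("2011-01-02T13:00:00Z", "Ke$ha", "Helz", 3194, 170),
    ("2011-01-02T18:00:00Z", "Miley", "Ashu", 2232, 34),
]


def truncate_to_day(timestamp: str) -> str:
    """Floor an ISO-8601 UTC timestamp to midnight of the same day."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    floored = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored.strftime(TIMESTAMP_FORMAT)


def partition_by_day(rows):
    """Group rows by their day-truncated timestamp, one group per segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[truncate_to_day(row[0])].append(row)
    return dict(segments)


segments = partition_by_day(ROWS)
```

With the sample rows, two groups result: one keyed by 2011-01-01 with two rows and one keyed by 2011-01-02 with three rows, matching the two segments on the slide.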
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
Querying Druid data sources
Querying Druid data sources
▪ Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
▪ Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
▪ Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
▪ It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed
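One way to picture the "always executed" contract is a planner that pushes operators into Druid only while they qualify and leaves the rest to Hive. The sketch below models this on a linear plan; the `Operator` class, its `pushable` flag (standing in for a rule's precondition check) and the `Window` operator are hypothetical and do not reflect Calcite's rule API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    pushable: bool  # hypothetical stand-in for a rewriting rule's precondition check


def split_plan(plan):
    """Return (operators pushed into Druid, operators left for Hive).

    Operators are pushed from the scan upward until the first one that
    fails its precondition; everything from there on runs in Hive.
    """
    pushed = []
    for index, operator in enumerate(plan):
        if not operator.pushable:
            return pushed, plan[index:]
        pushed.append(operator)
    return pushed, []


plan = [
    Operator("Druid Scan", True),
    Operator("Filter", True),
    Operator("Aggregate", True),
    Operator("Window", False),  # hypothetical operator Druid cannot evaluate
    Operator("Sort Limit", True),
]
in_druid, in_hive = split_plan(plan)
```

Even when nothing beyond the scan qualifies, the query still runs: Druid returns raw rows and Hive evaluates the remaining operators.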
Druid query recognition (powered by Apache Calcite)
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
▪ Top 10 users that added the most characters from the beginning of 2010 until the end of 2011
[Figure: query logical plan with operators Druid Scan, Filter, Project, Aggregate, Sort Limit, Sink]
Druid query recognition (powered by Apache Calcite)
▪ Possible to express filters on the time dimension using SQL standard functions, as in EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
Druid query recognition (powered by Apache Calcite)
▪ Initially:
– Scan is executed in Druid (select query)
– Rest of the query is executed in Hive
[Figure: query logical plan; Druid Scan is labeled as a Druid select query, and the remaining operators (Filter, Project, Aggregate, Sort Limit, Sink) as Apache Hive]
Druid query recognition (powered by Apache Calcite)
▪ Rewriting rules push computation into Druid
– Need to check that the operator meets some preconditions before pushing it to Druid
[Figure: query logical plan with a rewriting rule applied; the Druid query is still a select query]
Druid query recognition (powered by Apache Calcite)
▪ Rewriting rules push computation into Druid
– Need to check that the operator meets some preconditions before pushing it to Druid
[Figure: query logical plan after further rewriting rules; the Druid query is now a groupBy query]
Physical plan transformation
Druid JSON query (groupBy):
{
  "queryType": "groupBy",
  "dataSource": "wikiticker",
  "granularity": "all",
  "dimensions": [ "user" ],
  "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
[Figure: the query logical plan (Druid Scan, Filter, Project, Aggregate, Sort Limit, Sink) becomes the query physical plan Table Scan → Select → File Sink]
▪ Table Scan uses Druid input format
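To make the translation concrete, the sketch below shows how the SQL pieces could map onto the JSON fields: the year filter becomes an interval, GROUP BY becomes `dimensions`, `sum` becomes a `longSum` aggregation, and ORDER BY with LIMIT becomes the `limitSpec`. Both helper functions are hypothetical illustrations, not Hive or Calcite code. The interval ends at the start of the following year because Druid intervals exclude their end instant.

```python
def year_range_to_interval(first_year: int, last_year: int) -> str:
    """Translate EXTRACT(year ...) BETWEEN first AND last into a half-open interval."""
    start = f"{first_year:04d}-01-01T00:00:00.000"
    end = f"{last_year + 1:04d}-01-01T00:00:00.000"
    return f"{start}/{end}"


def group_by_query(data_source, dimension, metric, alias, interval, limit):
    """Assemble a native groupBy query as a dict (hypothetical helper)."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "dimensions": [dimension],  # GROUP BY `user`
        "aggregations": [  # sum(`c_added`) AS s
            {"type": "longSum", "name": alias, "fieldName": metric}
        ],
        "limitSpec": {  # ORDER BY s DESC LIMIT 10
            "type": "default",
            "limit": limit,
            "columns": [{"dimension": alias, "direction": "descending"}],
        },
        "intervals": [interval],  # WHERE EXTRACT(year FROM `__time`) BETWEEN ...
    }


druid_query = group_by_query(
    data_source="wikiticker",
    dimension="user",
    metric="c_added",
    alias="s",
    interval=year_range_to_interval(2010, 2011),
    limit=10,
)
```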
Druid input format
▪ Submits query to Druid and generates records out of the query results
▪ Current version
– Timeseries, TopN, and GroupBy queries are not partitioned
– Select queries: realtime and historical nodes are contacted directly
[Figure: Timeseries, TopN, GroupBy: one node read by a single Table Scan record reader. Select: several nodes, each read by one or more Table Scan record readers.]
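The difference between the two cases can be sketched as a split-planning function: aggregating queries yield a single split, while a select query yields one split per node that holds data. The function name, the tuple layout and the host names below are all hypothetical; the real input format discovers the nodes by other means.

```python
def plan_splits(query_type: str, broker: str, data_nodes: list) -> list:
    """Return (host, query_type) pairs, one per record reader (hypothetical helper)."""
    if query_type == "select":
        # Select: contact the realtime and historical nodes directly, in parallel.
        return [(node, query_type) for node in data_nodes]
    # Timeseries, TopN, GroupBy: not partitioned, so a single reader runs the query.
    return [(broker, query_type)]


# Hypothetical host names, for illustration only.
nodes = ["historical-1:8083", "historical-2:8083", "realtime-1:8084"]
select_splits = plan_splits("select", "broker:8082", nodes)
group_by_splits = plan_splits("groupBy", "broker:8082", nodes)
```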
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
Querying Druid data sources
Demonstration
Demonstration
▪ Implementation in Apache Hive 2.3 - Apache Hive 3.0
– Release in Q2 2017
– Relies on Druid 0.9.2 and Apache Calcite 1.12.0
▪ Current status (master)
– Registering, creating, overwriting and deleting Druid data sources
– Querying Druid from Hive
• Bypass broker for Druid Select queries
Demonstration
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
Querying Druid data sources
Demonstration
Road ahead
Road ahead
▪ Tighten integration between Druid and Apache Hive/Apache Calcite
– Recognize more functions → push more computation to Druid
– Support complex column types
– Close the gap between semantics of different systems
• Time zone handling, null values
▪ Broader perspective
– Materialized views support in Apache Hive
• Data stored in Apache Hive
• Create materialized view in Druid
– Denormalized star schema for a certain time period
• Automatic input query rewriting over the materialized view (Apache Calcite)
Acknowledgments
▪ Apache Hive, Apache Calcite and Druid communities
– Slim Bouguerra, Julian Hyde, Nishant Bangarwa, Ashutosh Chauhan, Gunther Hagleitner, Carter Shanklin, and many others
Thank You
@ApacheHive | @ApacheCalcite | @druidio
http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html

More Related Content

What's hot

An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
DataWorks Summit
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
DataWorks Summit
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
Guido Schmutz
 

What's hot (20)

Actor Patterns and NATS - Boulder Meetup
Actor Patterns and NATS - Boulder MeetupActor Patterns and NATS - Boulder Meetup
Actor Patterns and NATS - Boulder Meetup
 
Building Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache KafkaBuilding Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache Kafka
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Tarun poladi resume
Tarun poladi resumeTarun poladi resume
Tarun poladi resume
 
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
 
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkSanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Amazon Redshift Masterclass
Amazon Redshift MasterclassAmazon Redshift Masterclass
Amazon Redshift Masterclass
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Serialization and performance in Java
Serialization and performance in JavaSerialization and performance in Java
Serialization and performance in Java
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain
 
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
 

Similar to Interactive Analytics at Scale in Apache Hive Using Druid

Similar to Interactive Analytics at Scale in Apache Hive Using Druid (20)

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with Druid
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Understanding apache-druid
Understanding apache-druidUnderstanding apache-druid
Understanding apache-druid
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Scalable olap with druid
Scalable olap with druidScalable olap with druid
Scalable olap with druid
 

More from DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf

Interactive Analytics at Scale in Apache Hive Using Druid

  • 1. Interactive Analytics at Scale in Apache Hive using Druid. Jesús Camacho Rodríguez, DataWorks Summit Europe, April 5, 2017
  • 2. Motivation: Interactive analytics on event data (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
      ▪ BI/OLAP applications that require interactive visualization of complex data streams
          – Real-time bidding events
          – User activity streams
          – Voice call logs
          – Network traffic flows
          – Firewall events
          – Application performance metrics
      ▪ Querying event data at large scale poses multiple challenges
  • 3. Druid overview
      ▪ Development starts in 2011, open-sourced in late 2012
      ▪ Initial use case: interactive ad-analytics
      ▪ +150 contributors
      ▪ Main features
          – Column-oriented distributed data store
          – Batch and real-time ingestion
          – Scalable to petabytes of data
          – Sub-second response for arbitrary time-based slice-and-dice
              • Data partitioned by time dimension
              • Automatic data summarization
              • Approximate algorithms (HyperLogLog, theta)
      ▪ Most Events per Day: 30 Billion Events / Day (Metamarkets)
      ▪ Most Computed Metrics: 1 Billion Metrics / Min (Jolata)
      ▪ Largest Cluster: 200 Nodes (Metamarkets)
      ▪ Largest Hourly Ingestion: 2TB per Hour (Netflix)
  • 4. Druid architecture (figure: architecture diagram; recoverable label: Dashboards, BI tools)
  • 5. Persistent storage
      ▪ Data in Druid is stored in segment files
      ▪ Partitioned by time, supports fast time-based slice-and-dice
      ▪ Ideally, segment files are each smaller than 1GB
      ▪ If files are large, smaller time partitions are needed
      (Figure: time axis split into Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segment 5: Friday)
  • 6. Segment data structures
      ▪ Within a segment
          – Timestamp column
          – Dimension columns
          – Metric columns
          – Indexes to facilitate fast lookup and aggregation
  • 7. Querying
      ▪ HTTP REST API
      ▪ Queries and results expressed in JSON
      ▪ Multiple query types
          – Time boundary
          – Segment metadata
          – Timeseries
          – TopN
          – GroupBy
          – Select
      ▪ Important to use an adequate query type: it impacts query performance
      Example query from the slide:
      {
        "queryType": "groupBy",
        "dataSource": "product_sales_index",
        "granularity": "all",
        "dimension": "product_id",
        "aggregations": [ { "type": "doubleSum", "name": "s", "fieldName": "sales" } ],
        "limitSpec": { "limit": 10, "columns": [ { "dimension": "s", "direction": "descending" } ] },
        "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
      }
  • 8. Druid + Apache Hive
      ▪ Integration brings benefits both to Druid and Apache Hive
          – Indexing complex query results in Druid using Hive
          – Introducing a SQL interface on top of Druid
          – Being able to execute complex operations on Druid data
          – Efficient execution of OLAP queries in Hive
  • 9. Agenda: Interactive Analytics at Scale in Hive using Druid
      ▪ Introduction
      ▪ Registering and creating Druid data sources
  • 10. Druid data sources in Hive
      ▪ User needs to provide Druid data sources information to Hive
      ▪ Two different options depending on requirements
          – Register Druid data sources in Hive
              • Data is already stored in Druid
          – Create Druid data sources from Hive
              • Data is stored in Hive
              • User may want to pre-process the data before storing it in Druid
  • 11. Druid data sources in Hive: Registering Druid data sources
      ▪ Simple CREATE EXTERNAL TABLE statement
          CREATE EXTERNAL TABLE druid_table_1
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker");
      Callouts: Hive table name (druid_table_1), Hive storage handler classname (org.apache.hadoop.hive.druid.DruidStorageHandler), Druid data source name (wikiticker)
      ⇢ Broker node endpoint specified as a Hive configuration parameter
      ⇢ Automatic Druid data schema discovery: segment metadata query
  • 12. Druid data sources in Hive: Creating Druid data sources
      ▪ Use Create Table As Select (CTAS) statement
          CREATE TABLE druid_table_1
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
          AS SELECT __time, page, user, c_added, c_removed FROM src;
      Callouts: Hive table name (druid_table_1), Hive storage handler classname (org.apache.hadoop.hive.druid.DruidStorageHandler), Druid data source name (wikiticker), Druid segment granularity (DAY)
  • 13. Druid data sources in Hive: Creating Druid data sources
      ▪ Use Create Table As Select (CTAS) statement
          CREATE TABLE druid_table_1
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
          AS SELECT __time, page, user, c_added, c_removed FROM src;
      ⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
      Callouts on the selected columns: Timestamp, Dimensions, Metrics
  • 14. Druid data sources in Hive: Creating Druid data sources
      ▪ File Sink operator uses Druid output format
          – Creates segment files and registers them in Druid
          – Data needs to be partitioned by time granularity
              • Granularity specified as configuration parameter
      (Figure: original CTAS physical plan with operators Table Scan, Select, File Sink)
      CTAS query results:
          __time                page    user   c_added  c_removed
          2011-01-01T01:05:00Z  Justin  Boxer  1800     25
          2011-01-02T19:00:00Z  Justin  Reach  2912     42
          2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17
          2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170
          2011-01-02T18:00:00Z  Miley   Ashu   2232     34
  • 15. Druid data sources in Hive: Creating Druid data sources
      ▪ File Sink operator uses Druid output format
          – Creates segment files and registers them in Druid
          – Data needs to be partitioned by time granularity
              • Granularity specified as configuration parameter
      (Figure: rewritten CTAS physical plan with operators Table Scan, Select, Reduce, File Sink; annotation: Truncate timestamp to day granularity)
      CTAS query results:
          __time                page    user   c_added  c_removed  __time_granularity
          2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
          2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
          2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
          2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
          2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z
  • 16. Druid data sources in Hive: Creating Druid data sources
      ▪ File Sink operator uses Druid output format
          – Creates segment files and registers them in Druid
          – Data needs to be partitioned by time granularity
              • Granularity specified as configuration parameter
      (Figure: rewritten CTAS physical plan with operators Table Scan, Select, Reduce, File Sink)
      CTAS query results, grouped by segment:
          Segment 2011-01-01
          2011-01-01T01:05:00Z  Justin  Boxer  1800  25   2011-01-01T00:00:00Z
          2011-01-01T11:00:00Z  Ke$ha   Xeno   1953  17   2011-01-01T00:00:00Z
          Segment 2011-01-02
          2011-01-02T19:00:00Z  Justin  Reach  2912  42   2011-01-02T00:00:00Z
          2011-01-02T13:00:00Z  Ke$ha   Helz   3194  170  2011-01-02T00:00:00Z
          2011-01-02T18:00:00Z  Miley   Ashu   2232  34   2011-01-02T00:00:00Z
  • 17. Agenda: Interactive Analytics at Scale in Hive using Druid
      ▪ Introduction
      ▪ Registering and creating Druid data sources
      ▪ Querying Druid data sources
  • 18. Querying Druid data sources
      ▪ Automatic rewriting when query is expressed over Druid table
          – Powered by Apache Calcite
          – Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
      ▪ Translate (sub)plan of operators into valid Druid JSON query
          – Druid query is encapsulated within Hive TableScan operator
      ▪ Hive TableScan uses Druid input format
          – Submits query to Druid and generates records out of the query results
      ▪ It might not be possible to push all computation to Druid
          – Our contract is that the query should always be executed
  • 19. Druid query recognition (powered by Apache Calcite)
      Apache Hive - SQL query:
          SELECT `user`, sum(`c_added`) AS s
          FROM druid_table_1
          WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
          GROUP BY `user`
          ORDER BY s DESC
          LIMIT 10;
      ▪ Top 10 users that added the most characters from the beginning of 2010 until the end of 2011
      (Figure: query logical plan with operators Druid Scan, Filter, Project, Aggregate, Sort Limit, Sink)
  • 20. Druid query recognition (powered by Apache Calcite)
      Apache Hive - SQL query, description, and logical plan figure as on slide 19
      ▪ Possible to express filters on time dimension using SQL standard functions
  • 21. Druid query recognition (powered by Apache Calcite)
      Apache Hive - SQL query and logical plan figure as on slide 19; the figure is labeled with Apache Hive and a Druid query of type select
      ▪ Initially:
          – Scan is executed in Druid (select query)
          – Rest of the query is executed in Hive
  • 22. Druid query recognition (powered by Apache Calcite)
      Apache Hive - SQL query and logical plan figure as on slide 19; the figure is labeled with Apache Hive, a Druid query of type select, and a rewriting rule
      ▪ Rewriting rules push computation into Druid
          – Need to check that operator meets some pre-conditions before pushing it to Druid
  • 23. Druid query recognition (powered by Apache Calcite)
      Text identical to slide 22
  • 24. Druid query recognition (powered by Apache Calcite)
      Apache Hive - SQL query and logical plan figure as on slide 19; the figure is labeled with Apache Hive, a Druid query of type groupBy, and a rewriting rule
      ▪ Rewriting rules push computation into Druid
          – Need to check that operator meets some pre-conditions before pushing it to Druid
  • 25. Druid query recognition (powered by Apache Calcite)
      Text identical to slide 24
  • 26. Physical plan transformation
      (Figure: query logical plan with operators Druid Scan, Filter, Project, Aggregate, Sort Limit, Sink, labeled with Apache Hive and a Druid query of type groupBy; query physical plan with operators Table Scan, Select, File Sink)
      ▪ Table Scan uses Druid Input Format
      Druid JSON query:
      {
        "queryType": "groupBy",
        "dataSource": "users_index",
        "granularity": "all",
        "dimension": "user",
        "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
        "limitSpec": { "limit": 10, "columns": [ { "dimension": "s", "direction": "descending" } ] },
        "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
      }
  • 27. Druid input format
      ▪ Submits query to Druid and generates records out of the query results
      ▪ Current version
          – Timeseries, TopN, and GroupBy queries are not partitioned
          – Select queries: realtime and historical nodes are contacted directly
      (Figure: two panels, "Timeseries, TopN, GroupBy" and "Select", each showing nodes running Table Scan operators with record readers)
  • 28. Agenda: Interactive Analytics at Scale in Hive using Druid
      ▪ Introduction
      ▪ Registering and creating Druid data sources
      ▪ Querying Druid data sources
      ▪ Demonstration
  • 29. Demonstration
      ▪ Implementation in Apache Hive 2.3 - Apache Hive 3.0
          – Release in Q2 2017
          – Relies on Druid 0.9.2 and Apache Calcite 1.12.0
      ▪ Current status (master)
          – Registering, creating, overwriting and deleting Druid data sources
          – Querying Druid from Hive
              • Bypass broker for Druid Select queries
  • 30. Demonstration
  • 31. Agenda: Interactive Analytics at Scale in Hive using Druid
      ▪ Introduction
      ▪ Registering and creating Druid data sources
      ▪ Querying Druid data sources
      ▪ Demonstration
      ▪ Road ahead
  • 32. Road ahead
      ▪ Tighten integration between Druid and Apache Hive/Apache Calcite
          – Recognize more functions
      ▪ Push more computation to Druid
          – Support complex column types
          – Close the gap between semantics of different systems
              • Time zone handling, null values
      ▪ Broader perspective
          – Materialized views support in Apache Hive
              • Data stored in Apache Hive
              • Create materialized view in Druid
          – Denormalized star schema for a certain time period
              • Automatic input query rewriting over the materialized view (Apache Calcite)
  • 33. Acknowledgments
      ▪ Apache Hive, Apache Calcite and Druid communities
          – Slim Bouguerra, Julian Hyde, Nishant Bangarwa, Ashutosh Chauhan, Gunther Hagleitner, Carter Shanklin, and many others
  • 34. Thank You
      @ApacheHive | @ApacheCalcite | @druidio
      http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
      http://calcite.apache.org/docs/druid_adapter.html
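
The time partitioning described on slides 14 to 16 can be illustrated with a small Python sketch. It truncates each row's __time to day granularity and groups rows by the truncated value, giving one group per Druid segment. This is only an illustration under stated assumptions, not Hive's implementation: the function names truncate_to_day and partition_by_day are hypothetical, and the rows are the sample CTAS results from the slides.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Sample CTAS query results from the slides:
# (__time, page, user, c_added, c_removed)
ROWS = [
    ("2011-01-01T01:05:00Z", "Justin", "Boxer", 1800, 25),
    ("2011-01-02T19:00:00Z", "Justin", "Reach", 2912, 42),
    ("2011-01-01T11:00:00Z", "Ke$ha", "Xeno", 1953, 17),
    ("2011-01-02T13:00:00Z", "Ke$ha", "Helz", 3194, 170),
    ("2011-01-02T18:00:00Z", "Miley", "Ashu", 2232, 34),
]

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_day(ts: str) -> str:
    """Hypothetical helper: floor a UTC ISO timestamp to midnight of its day."""
    parsed = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    floored = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored.strftime(TS_FORMAT)


def partition_by_day(rows):
    """Hypothetical helper: group rows by truncated timestamp (one key per segment)."""
    segments = defaultdict(list)
    for row in rows:
        segments[truncate_to_day(row[0])].append(row)
    return dict(segments)


segments = partition_by_day(ROWS)
for day in sorted(segments):
    print("Segment", day[:10], "->", [row[2] for row in segments[day]])
```

Grouping on the truncated value plays the role of the extra __time_granularity column in the rewritten plan: rows that share it end up in the same segment.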

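The translation shown on slides 19 to 26 can be sketched in Python as well. The sketch assembles a groupBy query with the same fields as the JSON on slide 26, and turns the filter EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 into a Druid interval. It is a toy under stated assumptions, not the Calcite rewriting rules: build_group_by and year_range_to_interval are hypothetical names, and the JSON layout simply mirrors the slide, so it should be checked against the Druid query documentation before being sent to a real broker.

```python
import json


def year_range_to_interval(first_year: int, last_year: int) -> str:
    """Hypothetical helper: inclusive year range -> half-open ISO interval."""
    start = f"{first_year:04d}-01-01T00:00:00.000"
    end = f"{last_year + 1:04d}-01-01T00:00:00.000"
    return f"{start}/{end}"


def build_group_by(data_source, dimension, metric, alias, limit,
                   first_year, last_year):
    """Hypothetical helper: build a query dict shaped like the slide's JSON."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "dimension": dimension,                      # GROUP BY `user`
        "aggregations": [                            # sum(`c_added`) AS s
            {"type": "longSum", "name": alias, "fieldName": metric}
        ],
        "limitSpec": {                               # ORDER BY s DESC LIMIT 10
            "limit": limit,
            "columns": [{"dimension": alias, "direction": "descending"}],
        },
        "intervals": [year_range_to_interval(first_year, last_year)],
    }


query = build_group_by("users_index", "user", "c_added", "s", 10, 2010, 2011)
print(json.dumps(query, indent=2))
```

The end of the interval is the first instant of 2012 because the SQL BETWEEN is inclusive of all of 2011, which matches the interval on the slide.
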
Editor's Notes

  1. - Add more info about materialized views?