SlideShare a Scribd company logo
Druid @Hadoop Ecosystem
Slim Bouguerra, Nishant Bangarwa , Jesús
Camacho Rodríguez, Ashutosh Chauhan,
Gunther Hagleitner, Julian Hyde, Carter Shanklin
Druid meetup
21/02/2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
1. Security enhancement
2. Deployment and management
3. SQL interaction
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Security : Integration with Kerberos
Spnego
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kerberos/Spnego integration (druid 0.10)
 Securing all the endpoint except some specified ones if needed.
 UI access to coordinator and overlord is protected as well (browser configuration
needed).
Druid
Druid
KDC server
User
Browser1 kinit user
2 Token
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kerberos/Spnego integration (druid 0.10)
 Securing all the endpoint except some specified ones if needed.
 UI access to coordinator and overlord is protected as well (browser configuration
needed).
druid.hadoop.security.spnego
.keytab
keytab_dir/spnego.service.k
eytab
This is the SPNEGO service
keytab that is used for
authentication.
druid.hadoop.security.spnego
.principal
HTTP/_HOST@realm
This is the SPNEGO service
principal that is used for
authentication
curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-
Type: application/json' http://_endpoint
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Security Next: Integrate with Apache Ranger/ Apache KNOX
 Leveraging SSO via Apache KNOX
 Data source Level user/group based authorization.
 Row/Column level user/group based authorization.
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Deployment and management: Apache
Ambari integration
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Simple Druid Management with Ambari
 UI is the source of truth (What you see is what you get !).
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Simple Druid Management with Ambari
 Works with hadoop/hdfs zookeeper… superset, etc..
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Simple Druid Management with Ambari
Versions
managements
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Deployment and management via Ambari/HDP
 UI is the source of truth (What you see is what you get !).
 Works with hadoop/hdfs out of the box.
 Installs and configures Superset (Ex Caravel -> Ex Panomamix ) UI.
 Integrates with Kerberos (Hadoop and HDFS interaction/ intra Druid security).
 Supports rolling deployments.
 Monitoring via Graphana dashboard (backed by Hbase).
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SQL interface: Hive integration
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Benefits both to Druid and Apache Hive
 Efficient execution of OLAP queries in
Hive to power BI tools.
 Interaction with realtime data.
 Create/Drop data source using SQL
syntax.
 Being able to execute complex SQL
operations out of the box on Druid data
and other sources like joins and window
functions.
Hive side Druid side
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data source creation
 Data already existing in druid
– All you need is to point hive to broker and specify datasource name
 Data outside of druid
– Data already existing in Hive .
– Data stored in distributed filesystem like HDFS, S3 in a format that can be read by hive eg TSV, CSV
ORC, Parquet etc.
– Need Perform some pre-processing over various data sources before feeding it to druid
Create Table statement
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 Point hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
 Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
Hive table name
Hive storage handler classname
Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
Registering Druid data sources
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 Point hive to druid metadata storage and deep storage path
– Set hive.druid.metadata.password=diurd
– Set hive.druid.metadata.username=druid
– Set hive.druid.metadata.uri=jdbc:mysql://host/druid_db
– Set druid.storage.storageDirectory=s3a://druid-cloud-bucket/
 Use Create Table As Select (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker”, "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Hive table name
Hive storage handler classname
Druid data source name
Creating Druid data sources
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 Use Create Table As Select (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler’
TBLPROPERTIES ("druid.datasource" = "wikiticker”, "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
Creating Druid data sources
Timestamp Dimensions Metrics
Credit jcamacho@apache.org
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
__time page user c_added c_removed
2011-01-01T01:05:00Z Justin Boxer 1800 25
2011-01-02T19:00:00Z Justin Reach 2912 42
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17
2011-01-02T13:00:00Z Ke$ha Helz 3194 170
2011-01-02T18:00:00Z Miley Ashu 2232 34
CTAS query results
Select
File Sink
Original CTAS
physical plan
Table Scan
Credit jcamacho@apache.org
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Rewritten CTAS
physical plan
Druid data sources in Hive
Creating Druid data sources
 File Data needs to be partitioned by time granularity
"druid.segment.granularity" = "HOUR"
Table Scan
Select
File Sink
__time page user c_added c_removed __time_granularity
2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z
2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z
2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z
CTAS query results
Truncate timestamp to day granularity
Select
File Sink: Druid
output format
Reduce
Table Scan
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Rewritten CTAS
physical plan
Druid data sources in Hive
Creating Druid data sources
 File Sink operator uses Druid output format
– Creates segment files and save segments descriptors
metadata to hdfs.
– After successful reducer operation all the descriptors
will be committed to metadata storage atomically.
– Wait for handoff if coordinator is detected.
Table Scan
Select
2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z
CTAS query results
Select
File Sink
Reduce
Table Scan
2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z
2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z
Segment 2011-01-01
Segment 2011-01-02
File Sink: Druid
output format
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 Use Insert Overwrite As Select (CTAS) statement:
– Can only append or overwrite.
– Need to keep the same schema.
Update data sources
INSERT OVERWRITE TABLE druid_table_1 AS
SELECT __time, page, user, c_added, c_removed
FROM src;
 Use Drop table to delete meta data from hive and data source from druid
DROP TABLE druid_table_1 [PURGE];
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Querying Druid data sources
 Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to
different kinds of Druid queries (Timeseries, GroupBy, Select)
 Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
 Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
– interaction with Druid broker node or historicals in parallel
 It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid input format extends InputFormat<NullWritable, DruidWritable>
 Submits query to Druid and generates records out of the query
results
 Current version
– Timeseries, TopN, and GroupBy queries are not partitioned
– Select queries partitioned along time dimension column considering
uniform distribution
 Ongoing work for select query
– Bypass broker: query Druid realtime and historical nodes directly
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next
 Push more time filters predicates and/or computation down the chain.
 Make use of Long/Float Columns.
 Complex column types (sketches, HLL etc…).
 Stream version of Select query.
 Interact with coordinator for data creation.
 Time semantic (Time zone handling).
 Null semantic.
Hive integration
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You
@ApacheHive | @ApacheCalcite | @druidio | @ApacheAmbari
http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html
https://issues.apache.org/jira/browse/AMBARI-17981
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo:

More Related Content

What's hot

Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
Uwe Printz
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
MongoDB
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
Kashif Khan
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and Kafka
Daria Litvinov
 
Powering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big DataPowering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big Data
DataWorks Summit/Hadoop Summit
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Uwe Printz
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
DataWorks Summit/Hadoop Summit
 
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
StampedeCon
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
Cloudera, Inc.
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
StampedeCon
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
prajods
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
DataWorks Summit/Hadoop Summit
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
Spark Summit
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
DataWorks Summit
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
DataStax Academy
 
Scalable olap with druid
Scalable olap with druidScalable olap with druid
Scalable olap with druid
Kashif Khan
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
MapR Technologies
 
Working with the Scalding Type -Safe API
Working with the Scalding Type -Safe APIWorking with the Scalding Type -Safe API
Working with the Scalding Type -Safe API
DataWorks Summit/Hadoop Summit
 

What's hot (20)

Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and Kafka
 
Powering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big DataPowering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big Data
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
 
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
 
Scalable olap with druid
Scalable olap with druidScalable olap with druid
Scalable olap with druid
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Working with the Scalding Type -Safe API
Working with the Scalding Type -Safe APIWorking with the Scalding Type -Safe API
Working with the Scalding Type -Safe API
 

Similar to Druid at Hadoop Ecosystem

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit/Hadoop Summit
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
DataWorks Summit
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
Julien Le Dem
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
marklpollack
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
Sumeet Singh
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
Yahoo Developer Network
 
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks
 
Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2
Hortonworks
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
DataWorks Summit
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
POSSCON
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
DataWorks Summit/Hadoop Summit
 
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with Druid
Raúl Marín
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
DataWorks Summit
 
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
Abdelkrim Hadjidj
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
Hortonworks
 
Realtime Analytics in Hadoop
Realtime Analytics in HadoopRealtime Analytics in Hadoop
Realtime Analytics in Hadoop
Rommel Garcia
 
Realtime analytics + hadoop 2.0
Realtime analytics + hadoop 2.0Realtime analytics + hadoop 2.0
Realtime analytics + hadoop 2.0
Rommel Garcia
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
Hortonworks
 

Similar to Druid at Hadoop Ecosystem (20)

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
 
Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with Druid
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Realtime Analytics in Hadoop
Realtime Analytics in HadoopRealtime Analytics in Hadoop
Realtime Analytics in Hadoop
 
Realtime analytics + hadoop 2.0
Realtime analytics + hadoop 2.0Realtime analytics + hadoop 2.0
Realtime analytics + hadoop 2.0
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
 

Recently uploaded

Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 

Recently uploaded (20)

Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 

Druid at Hadoop Ecosystem

  • 1. Druid @Hadoop Ecosystem Slim Bouguerra, Nishant Bangarwa , Jesús Camacho Rodríguez, Ashutosh Chauhan, Gunther Hagleitner, Julian Hyde, Carter Shanklin Druid meetup 21/02/2017
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 1. Security enhancement 2. Deployment and management 3. SQL interaction
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security : Integration with Kerberos Spnego
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Kerberos/Spnego integration (druid 0.10)  Securing all the endpoint except some specified ones if needed.  UI access to coordinator and overlord is protected as well (browser configuration needed). Druid Druid KDC server User Browser1 kinit user 2 Token
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Kerberos/Spnego integration (druid 0.10)  Securing all the endpoint except some specified ones if needed.  UI access to coordinator and overlord is protected as well (browser configuration needed). druid.hadoop.security.spnego .keytab keytab_dir/spnego.service.k eytab This is the SPNEGO service keytab that is used for authentication. druid.hadoop.security.spnego .principal HTTP/_HOST@realm This is the SPNEGO service principal that is used for authentication curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content- Type: application/json' http://_endpoint
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security Next: Integrate with Apache Ranger/ Apache KNOX  Leveraging SSO via Apache KNOX  Data source Level user/group based authorization.  Row/Column level user/group based authorization.
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Deployment and management: Apache Ambari integration
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Simple Druid Management with Ambari  UI is the source of truth (What you see is what you get !).
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Simple Druid Management with Ambari  Works with hadoop/hdfs zookeeper… superset, etc..
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Simple Druid Management with Ambari Versions managements
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Deployment and management via Ambari/HDP  UI is the source of truth (What you see is what you get !).  Works with hadoop/hdfs out of the box.  Installs and configures Superset (Ex Caravel -> Ex Panomamix ) UI.  Integrates with Kerberos (Hadoop and HDFS interaction/ intra Druid security).  Supports rolling deployments.  Monitoring via Graphana dashboard (backed by Hbase).
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved SQL interface: Hive integration
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Benefits both to Druid and Apache Hive  Efficient execution of OLAP queries in Hive to power BI tools.  Interaction with realtime data.  Create/Drop data source using SQL syntax.  Being able to execute complex SQL operations out of the box on Druid data and other sources like joins and window functions. Hive side Druid side
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data source creation  Data already existing in druid – All you need is to point hive to broker and specify datasource name  Data outside of druid – Data already existing in Hive . – Data stored in distributed filesystem like HDFS, S3 in a format that can be read by hive eg TSV, CSV ORC, Parquet etc. – Need Perform some pre-processing over various data sources before feeding it to druid Create Table statement
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  Point hive to the broker: – SET hive.druid.broker.address.default=druid.broker.hostname:8082;  Simple CREATE EXTERNAL TABLE statement CREATE EXTERNAL TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker"); Hive table name Hive storage handler classname Druid data source name ⇢ Broker node endpoint specified as a Hive configuration parameter ⇢ Automatic Druid data schema discovery: segment metadata query Registering Druid data sources
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  Point hive to druid metadata storage and deep storage path – Set hive.druid.metadata.password=diurd – Set hive.druid.metadata.username=druid – Set hive.druid.metadata.uri=jdbc:mysql://host/druid_db – Set druid.storage.storageDirectory=s3a://druid-cloud-bucket/  Use Create Table As Select (CTAS) statement CREATE TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker”, "druid.segment.granularity" = "HOUR") AS SELECT __time, page, user, c_added, c_removed FROM src; Hive table name Hive storage handler classname Druid data source name Creating Druid data sources
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  Use Create Table As Select (CTAS) statement CREATE TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler’ TBLPROPERTIES ("druid.datasource" = "wikiticker”, "druid.segment.granularity" = "HOUR") AS SELECT __time, page, user, c_added, c_removed FROM src; ⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type Creating Druid data sources Timestamp Dimensions Metrics Credit jcamacho@apache.org
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive __time page user c_added c_removed 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T18:00:00Z Miley Ashu 2232 34 CTAS query results Select File Sink Original CTAS physical plan Table Scan Credit jcamacho@apache.org
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Rewritten CTAS physical plan Druid data sources in Hive Creating Druid data sources  File Data needs to be partitioned by time granularity "druid.segment.granularity" = "HOUR" Table Scan Select File Sink __time page user c_added c_removed __time_granularity 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z 2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z CTAS query results Truncate timestamp to day granularity Select File Sink: Druid output format Reduce Table Scan
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Rewritten CTAS physical plan Druid data sources in Hive Creating Druid data sources  File Sink operator uses Druid output format – Creates segment files and save segments descriptors metadata to hdfs. – After successful reducer operation all the descriptors will be committed to metadata storage atomically. – Wait for handoff if coordinator is detected. Table Scan Select 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z CTAS query results Select File Sink Reduce Table Scan 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z 2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z Segment 2011-01-01 Segment 2011-01-02 File Sink: Druid output format
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  Use Insert Overwrite As Select (CTAS) statement: – Can only append or overwrite. – Need to keep the same schema. Update data sources INSERT OVERWRITE TABLE druid_table_1 AS SELECT __time, page, user, c_added, c_removed FROM src;  Use Drop table to delete meta data from hive and data source from druid DROP TABLE druid_table_1 [PURGE];
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Querying Druid data sources  Automatic rewriting when query is expressed over Druid table – Powered by Apache Calcite – Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, GroupBy, Select)  Translate (sub)plan of operators into valid Druid JSON query – Druid query is encapsulated within Hive TableScan operator  Hive TableScan uses Druid input format – Submits query to Druid and generates records out of the query results – interaction with Druid broker node or historicals in parallel  It might not be possible to push all computation to Druid – Our contract is that the query should always be executed
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid input format extends InputFormat<NullWritable, DruidWritable>  Submits query to Druid and generates records out of the query results  Current version – Timeseries, TopN, and GroupBy queries are not partitioned – Select queries partitioned along time dimension column considering uniform distribution  Ongoing work for select query – Bypass broker: query Druid realtime and historical nodes directly
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Next  Push more time filters predicates and/or computation down the chain.  Make use of Long/Float Columns.  Complex column types (sketches, HLL etc…).  Stream version of Select query.  Interact with coordinator for data creation.  Time semantic (Time zone handling).  Null semantic. Hive integration
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank You @ApacheHive | @ApacheCalcite | @druidio | @ApacheAmbari http://cwiki.apache.org/confluence/display/Hive/Druid+Integration http://calcite.apache.org/docs/druid_adapter.html https://issues.apache.org/jira/browse/AMBARI-17981
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demo:

Editor's Notes

  1. - All nodes expose JSON REST API