Druid at Hadoop Ecosystem

Slim Bouguerra, Nishant Bangarwa, Jesús Camacho Rodríguez, Ashutosh Chauhan,
Gunther Hagleitner, Julian Hyde, Carter Shanklin
Druid meetup
21/02/2017

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
1. Security enhancement
2. Deployment and management
3. SQL interaction

Security: Integration with Kerberos/SPNEGO

Kerberos/SPNEGO integration (Druid 0.10)
 Secures all endpoints, except some specified ones if needed.
 UI access to the coordinator and overlord is protected as well (browser configuration needed).
[Diagram: User, Browser, KDC server, Druid. Step 1: kinit user. Step 2: token.]

 druid.hadoop.security.spnego.keytab
– Value: keytab_dir/spnego.service.keytab
– This is the SPNEGO service keytab that is used for authentication.
 druid.hadoop.security.spnego.principal
– Value: HTTP/_HOST@realm
– This is the SPNEGO service principal that is used for authentication.
curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-Type: application/json' http://_endpoint
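
The `_HOST` placeholder in the principal follows the Hadoop convention of substituting each node's own host name, so one configuration can be shared by every node. A minimal Python sketch of that substitution, assuming the Druid extension applies the same rule; `expand_principal`, the host name, and the realm are hypothetical and used only for illustration:

```python
def expand_principal(principal: str, hostname: str) -> str:
    """Replace the _HOST placeholder with the lower-cased host name.

    Assumption: mirrors the Hadoop-style rule for principals of the
    form primary/_HOST@REALM; any other form is returned unchanged.
    """
    primary, slash, rest = principal.partition("/")
    if not slash:
        return principal  # no instance component, nothing to expand
    instance, at, realm = rest.partition("@")
    if instance != "_HOST" or not at:
        return principal
    return f"{primary}/{hostname.lower()}@{realm}"


# Hypothetical host and realm.
print(expand_principal("HTTP/_HOST@EXAMPLE.COM", "Broker1.example.com"))
```

With this rule, each node resolves the shared setting to its own service principal, which is what the keytab on that node must contain.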

Security next: integrate with Apache Ranger / Apache Knox
 Leveraging SSO via Apache Knox.
 Data source level user/group based authorization.
 Row/column level user/group based authorization.

Deployment and management: Apache Ambari integration

Simple Druid management with Ambari
 UI is the source of truth (what you see is what you get!).
 Works with Hadoop/HDFS, ZooKeeper, Superset, etc.
 Version management.

Deployment and management via Ambari/HDP
 UI is the source of truth (what you see is what you get!).
 Works with Hadoop/HDFS out of the box.
 Installs and configures the Superset UI (formerly Caravel, formerly Panoramix).
 Integrates with Kerberos (Hadoop and HDFS interaction, intra-Druid security).
 Supports rolling deployments.
 Monitoring via Grafana dashboards (backed by HBase).

SQL interface: Hive integration

Benefits to both Druid and Apache Hive
 Efficient execution of OLAP queries in Hive to power BI tools.
 Interaction with realtime data.
 Create/drop data sources using SQL syntax.
 Execute complex SQL operations, such as joins and window functions, out of the box on Druid data and other sources.
[Slide columns: Hive side, Druid side]

Data source creation
 Data already existing in Druid
– All you need is to point Hive to the broker and specify the data source name
 Data outside of Druid
– Data already existing in Hive.
– Data stored in a distributed filesystem such as HDFS or S3, in a format that Hive can read, e.g. TSV, CSV, ORC, Parquet, etc.
– Need to perform some pre-processing over various data sources before feeding them to Druid
Create Table statement

Druid data sources in Hive
Registering Druid data sources
 Point Hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
 Simple CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
– druid_table_1: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler classname
– wikiticker: Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query

Creating Druid data sources
 Point Hive to the Druid metadata storage and deep storage path:
– SET hive.druid.metadata.password=diurd;
– SET hive.druid.metadata.username=druid;
– SET hive.druid.metadata.uri=jdbc:mysql://host/druid_db;
– SET druid.storage.storageDirectory=s3a://druid-cloud-bucket/;
 Use a CREATE TABLE AS SELECT (CTAS) statement:
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
– druid_table_1: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler classname
– wikiticker: Druid data source name

 Use a CREATE TABLE AS SELECT (CTAS) statement:
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
– __time: timestamp
– page, user: dimensions
– c_added, c_removed: metrics
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type
Credit jcamacho@apache.org
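
The type inference described on this slide can be pictured with a small Python sketch. The exact rule is an assumption chosen to match the slide's labels (the `__time` timestamp column becomes the Druid timestamp, string columns become dimensions, numeric columns become metrics); the Hive types given for `src` and the name `druid_role` are hypothetical:

```python
# Hypothetical Hive schema for the src table; the column types are assumptions.
SRC_SCHEMA = {
    "__time": "timestamp",
    "page": "string",
    "user": "string",
    "c_added": "int",
    "c_removed": "int",
}

NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def druid_role(column: str, hive_type: str) -> str:
    """Classify one Hive column as a Druid timestamp, dimension, or metric."""
    if column == "__time" and hive_type == "timestamp":
        return "timestamp"
    if hive_type == "string":
        return "dimension"
    if hive_type in NUMERIC_TYPES:
        return "metric"
    raise ValueError(f"no rule for column {column!r} of type {hive_type!r}")


roles = {name: druid_role(name, hive_type) for name, hive_type in SRC_SCHEMA.items()}
print(roles)
```

Under these assumptions, `page` and `user` land in the dimension set and `c_added` and `c_removed` in the metric set, matching the Timestamp / Dimensions / Metrics labels on the slide.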

CTAS query results
__time                page    user   c_added  c_removed
2011-01-01T01:05:00Z  Justin  Boxer  1800     25
2011-01-02T19:00:00Z  Justin  Reach  2912     42
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170
2011-01-02T18:00:00Z  Miley   Ashu   2232     34
Original CTAS physical plan: Table Scan -> Select -> File Sink
Credit jcamacho@apache.org

Rewritten CTAS physical plan
 Data needs to be partitioned by time granularity: "druid.segment.granularity" = "HOUR"
Original plan: Table Scan -> Select -> File Sink
Rewritten plan: Table Scan -> Select (truncate timestamp to day granularity) -> Reduce -> File Sink: Druid output format
CTAS query results
__time                page    user   c_added  c_removed  __time_granularity
2011-01-01T01:05:00Z  Justin  Boxer  1800     25         2011-01-01T00:00:00Z
2011-01-02T19:00:00Z  Justin  Reach  2912     42         2011-01-02T00:00:00Z
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953     17         2011-01-01T00:00:00Z
2011-01-02T13:00:00Z  Ke$ha   Helz   3194     170        2011-01-02T00:00:00Z
2011-01-02T18:00:00Z  Miley   Ashu   2232     34         2011-01-02T00:00:00Z

Rewritten CTAS physical plan
 File Sink operator uses the Druid output format
– Creates segment files and saves the segment descriptor metadata to HDFS.
– After a successful reducer operation, all the descriptors are committed to the metadata storage atomically.
– Waits for handoff if a coordinator is detected.
Rewritten plan: Table Scan -> Select -> Reduce -> File Sink: Druid output format
CTAS query results, by segment
Segment 2011-01-01
2011-01-01T01:05:00Z  Justin  Boxer  1800  25   2011-01-01T00:00:00Z
2011-01-01T11:00:00Z  Ke$ha   Xeno   1953  17   2011-01-01T00:00:00Z
Segment 2011-01-02
2011-01-02T19:00:00Z  Justin  Reach  2912  42   2011-01-02T00:00:00Z
2011-01-02T13:00:00Z  Ke$ha   Helz   3194  170  2011-01-02T00:00:00Z
2011-01-02T18:00:00Z  Miley   Ashu   2232  34   2011-01-02T00:00:00Z
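
The partitioning step in the rewritten plan can be illustrated in a few lines of Python. This is only a sketch of the idea (truncate each timestamp, then group rows by the truncated value so each group becomes one segment), not Hive's implementation; the function names are hypothetical and the rows are the ones shown on the slides:

```python
from collections import defaultdict
from datetime import datetime, timezone

ROWS = [
    ("2011-01-01T01:05:00Z", "Justin", "Boxer", 1800, 25),
    ("2011-01-02T19:00:00Z", "Justin", "Reach", 2912, 42),
    ("2011-01-01T11:00:00Z", "Ke$ha", "Xeno", 1953, 17),
    ("2011-01-02T13:00:00Z", "Ke$ha", "Helz", 3194, 170),
    ("2011-01-02T18:00:00Z", "Miley", "Ashu", 2232, 34),
]

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_day(ts: str) -> str:
    """Return the start of the UTC day that contains the given timestamp."""
    parsed = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    floor = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return floor.strftime(TS_FORMAT)


def partition_rows(rows):
    """Group rows by truncated timestamp, one bucket per future segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[truncate_to_day(row[0])].append(row)
    return dict(segments)


segments = partition_rows(ROWS)
for start, bucket in sorted(segments.items()):
    print(start, len(bucket))
```

The grouping key plays the role of the `__time_granularity` column: rows that share it are routed to the same reducer and written into the same segment.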

Update data sources
 Use an INSERT OVERWRITE ... SELECT statement:
– Can only append or overwrite.
– Need to keep the same schema.
INSERT OVERWRITE TABLE druid_table_1
SELECT __time, page, user, c_added, c_removed
FROM src;
 Use DROP TABLE to delete the metadata from Hive and the data source from Druid:
DROP TABLE druid_table_1 [PURGE];

Querying Druid data sources
 Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to
different kinds of Druid queries (Timeseries, GroupBy, Select)
 Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
 Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
– interaction with Druid broker node or historicals in parallel
 It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed
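
To make the translation target concrete, here is a hand-written Python sketch that builds a Druid native Timeseries query as JSON, the kind of payload a pushed-down aggregation over a Druid table would be turned into. It is an illustration of the JSON format, not output captured from Hive; the helper name, the output column name, and the interval are assumptions:

```python
import json


def timeseries_query(datasource: str, metric: str, interval: str,
                     granularity: str = "day") -> dict:
    """Build a Druid native Timeseries query that sums one metric."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "aggregations": [
            {"type": "longSum", "name": f"sum_{metric}", "fieldName": metric}
        ],
        "intervals": [interval],
    }


# Hypothetical interval covering the two days of sample data.
query = timeseries_query(
    "wikiticker", "c_added", "2011-01-01T00:00:00Z/2011-01-03T00:00:00Z"
)
print(json.dumps(query, indent=2))
```

A query with grouping on dimensions would map to a GroupBy query instead, and a plain projection to a Select query, which is why identifying the pattern in the logical plan is the main challenge.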

Druid input format extends InputFormat<NullWritable, DruidWritable>
 Submits the query to Druid and generates records out of the query results
 Current version
– Timeseries, TopN, and GroupBy queries are not partitioned
– Select queries are partitioned along the time dimension column, assuming a uniform distribution
 Ongoing work for select query
– Bypass broker: query Druid realtime and historical nodes directly
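
Partitioning a Select query along the time column under a uniform-distribution assumption amounts to cutting the query interval into equal-length pieces, one per input split. A minimal Python sketch of that idea, with a hypothetical `split_interval` helper and an arbitrary split count; it does not reproduce how Hive chooses the number of splits:

```python
from datetime import datetime


def split_interval(start: datetime, end: datetime, num_splits: int):
    """Cut [start, end) into num_splits contiguous, equal-length sub-intervals."""
    if num_splits < 1 or end <= start:
        raise ValueError("need num_splits >= 1 and end > start")
    step = (end - start) / num_splits
    bounds = [start + step * i for i in range(num_splits)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


# Hypothetical query interval and split count.
splits = split_interval(datetime(2011, 1, 1), datetime(2011, 1, 3), 4)
for lo, hi in splits:
    print(f"{lo.isoformat()}/{hi.isoformat()}")
```

Each sub-interval can then be issued as its own Select query. If the data is skewed in time, equal-length intervals yield unequal amounts of work, which is the cost of the uniformity assumption.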

Next: Hive integration
 Push more time filter predicates and/or computation down the chain.
 Make use of Long/Float columns.
 Complex column types (sketches, HLL, etc.).
 Stream version of the Select query.
 Interact with the coordinator for data creation.
 Time semantics (time zone handling).
 Null semantics.

Thank You
@ApacheHive | @ApacheCalcite | @druidio | @ApacheAmbari
http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html
https://issues.apache.org/jira/browse/AMBARI-17981

Demo:
