Druid at Hadoop Ecosystem

Presentation at the Druid meetup hosted by Branch.

Druid @Hadoop Ecosystem
Slim Bouguerra, Nishant Bangarwa, Jesús Camacho Rodríguez, Ashutosh Chauhan, Gunther Hagleitner, Julian Hyde, Carter Shanklin
Druid meetup, 21/02/2017

1. Security enhancements
2. Deployment and management
3. SQL interaction

Security: Integration with Kerberos/SPNEGO

Kerberos/SPNEGO integration (Druid 0.10)
 Secures all the endpoints, except specific ones if needed.
 UI access to the coordinator and overlord is protected as well (browser configuration needed).
[Diagram: (1) the user obtains a ticket from the KDC via kinit; (2) the browser presents a token to Druid.]

Kerberos/SPNEGO integration (Druid 0.10)
 Secures all the endpoints, except specific ones if needed.
 UI access to the coordinator and overlord is protected as well (browser configuration needed).

druid.hadoop.security.spnego.keytab    = keytab_dir/spnego.service.keytab
  The SPNEGO service keytab used for authentication.
druid.hadoop.security.spnego.principal = HTTP/_HOST@realm
  The SPNEGO service principal used for authentication.

Example authenticated request:
curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H 'Content-Type: application/json' http://_endpoint

Security next: Integration with Apache Ranger/Apache Knox
 Leveraging SSO via Apache Knox.
 Data-source-level user/group-based authorization.
 Row/column-level user/group-based authorization.

Deployment and management: Apache Ambari integration

Simple Druid Management with Ambari
 UI is the source of truth (what you see is what you get!).

Simple Druid Management with Ambari
 Works with Hadoop/HDFS, ZooKeeper, Superset, etc.

Simple Druid Management with Ambari
 Version management.

Deployment and management via Ambari/HDP
 UI is the source of truth (what you see is what you get!).
 Works with Hadoop/HDFS out of the box.
 Installs and configures the Superset (ex-Caravel, ex-Panoramix) UI.
 Integrates with Kerberos (Hadoop and HDFS interaction, intra-Druid security).
 Supports rolling deployments.
 Monitoring via a Grafana dashboard (backed by HBase).

SQL interface: Hive integration

Benefits to both Druid and Apache Hive
 Efficient execution of OLAP queries in Hive to power BI tools.
 Interaction with real-time data.
 Create/drop data sources using SQL syntax.
 Execute complex SQL operations, such as joins and window functions, over Druid data and other sources out of the box.
[Diagram: how the benefits split between the Hive side and the Druid side.]

Data source creation
 Data already existing in Druid:
– All you need is to point Hive to the broker and specify the data source name.
 Data outside of Druid (CREATE TABLE statement):
– Data already existing in Hive.
– Data stored in a distributed filesystem such as HDFS or S3, in a format Hive can read, e.g. TSV, CSV, ORC, Parquet.
– Useful when some pre-processing over various data sources is needed before feeding them to Druid.

Druid data sources in Hive: registering Druid data sources
 Point Hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
 Simple CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE druid_table_1                            -- Hive table name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'   -- Hive storage handler class name
TBLPROPERTIES ("druid.datasource" = "wikiticker");             -- Druid data source name

⇢ Broker node endpoint specified as a Hive configuration parameter.
⇢ Automatic Druid data schema discovery: segment metadata query.

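Once registered, the table behaves like any other Hive table. A minimal usage sketch (hedged: the exact DESCRIBE output depends on the discovered segment metadata):

-- Inspect the schema discovered via the Druid segment metadata query,
-- then run a trivial query that Hive answers by querying Druid.
DESCRIBE druid_table_1;

SELECT COUNT(*) FROM druid_table_1;
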
Druid data sources in Hive: creating Druid data sources
 Point Hive to the Druid metadata storage and the deep storage path:
– SET hive.druid.metadata.username=druid;
– SET hive.druid.metadata.password=diurd;
– SET hive.druid.metadata.uri=jdbc:mysql://host/druid_db;
– SET druid.storage.storageDirectory=s3a://druid-cloud-bucket/;
 Use a CREATE TABLE AS SELECT (CTAS) statement:

CREATE TABLE druid_table_1                                     -- Hive table name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'   -- Hive storage handler class name
TBLPROPERTIES ("druid.datasource" = "wikiticker",              -- Druid data source name
               "druid.segment.granularity" = "HOUR")
AS SELECT __time, page, user, c_added, c_removed FROM src;

Druid data sources in Hive: creating Druid data sources
 Use a CREATE TABLE AS SELECT (CTAS) statement:

CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker",
               "druid.segment.granularity" = "HOUR")
AS SELECT __time,               -- timestamp
          page, user,           -- dimensions
          c_added, c_removed    -- metrics
FROM src;

⇢ Inference of the Druid column types (timestamp, dimensions, metrics) depends on the Hive column types.
Credit: jcamacho@apache.org

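A minimal sketch of that type-driven mapping, made explicit with casts. Hedged: druid_typed, typed_example, raw_events, and its columns are hypothetical names, and the exact inference rules may vary by version:

-- Per the slide above: the timestamp column becomes Druid's __time,
-- string columns become dimensions, numeric columns become metrics.
CREATE TABLE druid_typed
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "typed_example",
               "druid.segment.granularity" = "DAY")
AS SELECT
  CAST(event_ts AS timestamp) AS __time,    -- timestamp -> Druid time column
  CAST(page     AS string)    AS page,      -- string    -> dimension
  CAST(added    AS bigint)    AS c_added    -- numeric   -> metric
FROM raw_events;
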
Druid data sources in Hive
CTAS query results:

__time               | page   | user  | c_added | c_removed
2011-01-01T01:05:00Z | Justin | Boxer | 1800    | 25
2011-01-02T19:00:00Z | Justin | Reach | 2912    | 42
2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953    | 17
2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194    | 170
2011-01-02T18:00:00Z | Miley  | Ashu  | 2232    | 34

Original CTAS physical plan: Table Scan → Select → File Sink
Credit: jcamacho@apache.org

Druid data sources in Hive: creating Druid data sources (rewritten CTAS physical plan)
 Data needs to be partitioned by time granularity: "druid.segment.granularity" = "HOUR"

Rewritten CTAS physical plan: Table Scan → Select (truncate timestamp to day granularity; sketched in SQL after this slide) → Reduce → File Sink (Druid output format)

CTAS query results:

__time               | page   | user  | c_added | c_removed | __time_granularity
2011-01-01T01:05:00Z | Justin | Boxer | 1800    | 25        | 2011-01-01T00:00:00Z
2011-01-02T19:00:00Z | Justin | Reach | 2912    | 42        | 2011-01-02T00:00:00Z
2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953    | 17        | 2011-01-01T00:00:00Z
2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194    | 170       | 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z | Miley  | Ashu  | 2232    | 34        | 2011-01-02T00:00:00Z

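A minimal SQL sketch of that truncation step. Hedged: the real plan computes this projection internally; to_date is used here only to mimic the day-granularity truncation shown in the rows above:

-- Extra projection that lets the reducer group rows into one segment
-- per granularity bucket.
SELECT __time, page, `user`, c_added, c_removed,
       CAST(to_date(__time) AS timestamp) AS __time_granularity
FROM src;
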
Druid data sources in Hive: creating Druid data sources (rewritten CTAS physical plan)
 The File Sink operator uses the Druid output format:
– Creates the segment files and saves the segment descriptor metadata to HDFS.
– After the reducers finish successfully, all the descriptors are committed to the metadata storage atomically.
– Waits for handoff if a coordinator is detected.

CTAS query results, grouped into segments:

Segment 2011-01-01:
2011-01-01T01:05:00Z | Justin | Boxer | 1800 | 25  | 2011-01-01T00:00:00Z
2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953 | 17  | 2011-01-01T00:00:00Z

Segment 2011-01-02:
2011-01-02T19:00:00Z | Justin | Reach | 2912 | 42  | 2011-01-02T00:00:00Z
2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194 | 170 | 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z | Miley  | Ashu  | 2232 | 34  | 2011-01-02T00:00:00Z

Druid data sources in Hive: updating data sources
 Use an INSERT OVERWRITE ... SELECT statement:
– Can only append or overwrite (see the sketch after this slide).
– Must keep the same schema.

INSERT OVERWRITE TABLE druid_table_1
SELECT __time, page, user, c_added, c_removed FROM src;

 Use DROP TABLE to delete the metadata from Hive and the data source from Druid:

DROP TABLE druid_table_1 [PURGE];

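A hedged sketch of the append-or-overwrite contract above. Assumption: INSERT INTO maps to append, which the slide implies but does not show; new_batch and full_reload are hypothetical staging tables with the same schema as the original CTAS:

-- Append new rows to the existing Druid data source (assumed syntax).
INSERT INTO TABLE druid_table_1
SELECT __time, page, `user`, c_added, c_removed FROM new_batch;

-- Replace the contents of the Druid data source.
INSERT OVERWRITE TABLE druid_table_1
SELECT __time, page, `user`, c_added, c_removed FROM full_reload;
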
Querying Druid data sources
 Automatic rewriting when the query is expressed over a Druid table (example after this slide):
– Powered by Apache Calcite.
– Main challenge: identify patterns in the logical plan corresponding to the different kinds of Druid queries (Timeseries, GroupBy, Select).
 Translate the (sub)plan of operators into a valid Druid JSON query:
– The Druid query is encapsulated within the Hive TableScan operator.
 The Hive TableScan uses the Druid input format:
– Submits the query to Druid and generates records out of the query results.
– Interacts with the Druid broker node, or with the historicals in parallel.
 It might not be possible to push all computation down to Druid:
– Our contract is that the query should always be executed.

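An example of a query shape this rewriting targets, over the druid_table_1 registered earlier. Hedged: how much is pushed down depends on the plan Calcite recognizes; EXPLAIN shows the generated Druid query inside the TableScan operator:

-- A time-range filter plus a grouped aggregation: a pattern Calcite can
-- translate into a single Druid GroupBy query instead of a full scan.
EXPLAIN
SELECT page, SUM(c_added) AS total_added
FROM druid_table_1
WHERE `__time` >= '2011-01-01 00:00:00'
  AND `__time` <  '2011-02-01 00:00:00'
GROUP BY page
ORDER BY total_added DESC
LIMIT 10;
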
Druid input format
 Extends InputFormat<NullWritable, DruidWritable>.
 Submits the query to Druid and generates records out of the query results.
 Current version:
– Timeseries, TopN, and GroupBy queries are not partitioned.
– Select queries are partitioned along the time dimension column, assuming a uniform distribution.
 Ongoing work for the Select query:
– Bypass the broker: query the Druid realtime and historical nodes directly.

Next: Hive integration
 Push more time filter predicates and/or computation down the chain.
 Make use of long/float columns.
 Complex column types (sketches, HLL, etc.).
 Streaming version of the Select query.
 Interact with the coordinator for data creation.
 Time semantics (time zone handling).
 Null semantics.

Thank You
@ApacheHive | @ApacheCalcite | @druidio | @ApacheAmbari
http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html
https://issues.apache.org/jira/browse/AMBARI-17981

Demo:
