Apache Drill


Short introduction to Apache Drill

Published in: Data & Analytics

  1. Apache Drill: Introduction. Jakub Pieprzyk
  2. Long time ago… Hadoop: big data batch processing. Huge volumes of data. Responsiveness was not a concern.
  3. Batch processing by design: HDFS with Map-Reduce applications. [Diagram: chains of map (M) and reduce (R) stages]
  4. At some point in time... The Hadoop cluster holds a lot of data useful for ad-hoc analysis. Data exploration is hard to perform in batch mode (“data lake”, “schema on read”); it involves a lot of iterative tasks. Servers have more RAM, SSD drives...
  5. Big Data (& Fast SQL) Analytics
  6. A wide range of products emerged... Tez; Spark; Facebook: Presto; (Google Dremel) → Apache Drill; Cloudera: Impala
  7. Apache Drill
  8. Apache Drill: a scalable query engine. Queries different data sources, both with a schema and schema-free: JDBC / Mongo / File System / Hive / HBase; text files / Parquet / Sequence files / MapR-DB
  9. Integration with existing BI tools: Apache Drill comes with a JDBC/ODBC driver. Support for many data sources and formats, combined with responsiveness, makes it a good candidate for a Business Intelligence backend.
  10. Interfaces: command line (~beeline), JDBC/ODBC, Web Console, C/Java API, REST API
  11. Architecture highlights: A cluster of nodes, each running the drillbit service. A drillbit is responsible for receiving queries, generating the plan and executing it. Zookeeper is used to maintain cluster membership. Clients can connect to any node (or via Zookeeper) and submit queries.
  12. Architecture highlights (cont.): The schema can be discovered at runtime, so there is no need to know it before executing the query. Storage plugins can access custom databases. A distributed cache is used to share metadata, plans and statistics (Infinispan in-memory key-value data store).
  13. Performance: columnar processing; data locality (when executed on a Hadoop cluster); vectorization (processing a vector of values from a single column rather than whole rows).
  14. Simple query reading data from a classpath file (the file is JSON): FROM cp.`employee.json`
  15. Hive → Drill migration? Apache Drill is a good candidate for a fast SQL solution over Hadoop. When deployed alongside Hive it adds ad-hoc query capabilities. It can use the Hive Metastore and Hive UDFs.
  16. Hive → Drill: Data types roughly match those in Hive (although DECIMAL is still in alpha). Analytical functions are like those in Hive, but not yet 100% implemented, e.g. a moving average: AVG(x) OVER (ORDER BY time ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING). Hive UDFs are supported (but the JAR needs to be uploaded to every host).
  17. Web Console: http://<node>:8047
  18. Web Console - Running Queries
  19. Web Console - Running Queries
  20. Web Console - Query Execution
  21. Web Console - Storage Plugins
  22. Web Console - Hive Plugin Configuration
  23. Embarrassingly simple performance test... ...just to put some numbers in the presentation ;) Hadoop cluster: 3 nodes running data node / node manager / Apache Drill (2 nodes: 16 GB RAM, 2 CPUs x 2 cores; 1 node: 10 GB RAM, 2 CPUs x 2 cores), +1 node running name node / resource manager / Hive server.
  24. Hive MR vs. Drill: Wikipedia pageview counts. Sample rows (project, article, page views, bytes): en A1_road_in_London 1 35107; en A1_steak_sauce 1 13905; en A1_volleyball_league_(Greece) 1 17636; en A1chieve 1 6558; en A2%20road 1 7402
  25. Hive schema: create table wiki_pagecounts(prj string, page string, pv int, bytes bigint) partitioned by (ts string) row format delimited fields terminated by ' ';
  26. Timing: Hive (MR) vs. Drill. Q1: simple count per partition (group by). Q2: top page within hour/language (row_number). Q3: mobile share (group by, case statement). Q4: top pages with percentage of page views (join, group by, row_number).
  27. Integration with YARN? Currently (Drill 1.5) not supported. There is a ticket for this: DRILL-142. It would make deployment much easier and resource management more efficient.
  28. Kerberos? Currently (Drill 1.5) Drill doesn’t support Kerberos when accessing HDFS. Ticket opened: DRILL-3584. Without it, fitting Drill into an existing secured Hadoop environment may be challenging.
  29. Apache Drill GitHub commits
  30. Thanks!
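
Slide 13 describes vectorization as processing a vector of values from a single column rather than whole rows. The sketch below is only a conceptual illustration in Python of the two layouts; it does not reproduce Drill's actual in-memory format, and the three rows are borrowed from the sample data on slide 24. The helper name `to_columns` is hypothetical.

```python
from array import array

# Row-oriented layout: every record carries all of its fields together.
rows = [
    {"page": "A1_road_in_London", "pv": 1, "bytes": 35107},
    {"page": "A1_steak_sauce", "pv": 1, "bytes": 13905},
    {"page": "A1_volleyball_league_(Greece)", "pv": 1, "bytes": 17636},
]


def to_columns(records):
    """Pivot row records into one contiguous sequence per column (hypothetical helper)."""
    return {
        "page": [r["page"] for r in records],
        "pv": array("q", (r["pv"] for r in records)),        # 64-bit signed integers
        "bytes": array("q", (r["bytes"] for r in records)),  # 64-bit signed integers
    }


columns = to_columns(rows)

# Row-wise aggregation walks every record and picks one field out of each.
total_row_wise = sum(r["bytes"] for r in rows)

# Column-wise aggregation reads a single contiguous vector and ignores the rest.
total_columnar = sum(columns["bytes"])
```

Both totals are equal; the difference is that the columnar version never touches the `page` or `pv` values, which is the property a columnar engine exploits.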
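
Slide 10 lists a REST API among the interfaces and slide 17 gives the web console port (8047). As a minimal sketch, a query like the one on slide 14 could be submitted over HTTP as shown below. The slide shows only the FROM clause, so the `SELECT *` and `LIMIT 5` parts are assumptions, as are the host name `localhost` and the helper names `build_query_request` and `run_query`. The sketch also assumes a drillbit without authentication enabled.

```python
import json
import urllib.request

DRILL_PORT = 8047  # web console / REST port from slide 17


def build_query_request(host, sql):
    """Build (but do not send) a POST request for Drill's /query.json endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{DRILL_PORT}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(host, sql, timeout=30):
    """Send the query to a running drillbit and return the result rows."""
    request = build_query_request(host, sql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response).get("rows", [])


# Assumption: the slide shows only "FROM cp.`employee.json`";
# the SELECT list and LIMIT are filled in for illustration.
EXAMPLE_SQL = "SELECT * FROM cp.`employee.json` LIMIT 5"
request = build_query_request("localhost", EXAMPLE_SQL)
```

Calling `run_query("localhost", EXAMPLE_SQL)` requires a reachable drillbit; any node of the cluster can be used as the host, since clients can connect to any node (slide 11).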
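
Slide 16 mentions the moving average `AVG(x) OVER (ORDER BY time ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)`. To make the window frame concrete, here is a small Python sketch of what that frame computes. It assumes the input values are already sorted by time and contain no NULLs; the function name `moving_average` is hypothetical.

```python
def moving_average(values, preceding=2, following=2):
    """Average each value with up to `preceding` earlier and `following` later rows.

    Mirrors a ROWS BETWEEN n PRECEDING AND m FOLLOWING frame: near the edges
    the frame simply contains fewer rows, it is not padded.
    """
    result = []
    for i in range(len(values)):
        start = max(0, i - preceding)
        end = min(len(values), i + following + 1)  # slice end is exclusive
        window = values[start:end]
        result.append(sum(window) / len(window))
    return result


averages = moving_average([1, 2, 3, 4, 5])
```

For the first row the frame holds only three values (the row itself and the two following rows), so the frame size varies between three and five in this example.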
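
The sample rows on slide 24 and the Hive schema on slide 25 describe the same space-delimited layout (prj, page, pv, bytes). The sketch below parses those sample rows in Python and sums page views per project. It only illustrates the data format: the aggregation is not one of the benchmark queries Q1-Q4, and the names `PageCount`, `parse_line` and `views_per_project` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class PageCount(NamedTuple):
    """One record, with fields named after the Hive columns on slide 25."""
    prj: str
    page: str
    pv: int
    bytes: int


# Sample rows copied from slide 24.
SAMPLE = """\
en A1_road_in_London 1 35107
en A1_steak_sauce 1 13905
en A1_volleyball_league_(Greece) 1 17636
en A1chieve 1 6558
en A2%20road 1 7402"""


def parse_line(line):
    """Split on single spaces, matching `fields terminated by ' '` in the schema."""
    prj, page, pv, size = line.split(" ")
    return PageCount(prj, page, int(pv), int(size))


def views_per_project(records):
    """Sum page views per project, similar in spirit to a GROUP BY on prj."""
    totals = Counter()
    for record in records:
        totals[record.prj] += record.pv
    return totals


records = [parse_line(line) for line in SAMPLE.splitlines()]
totals = views_per_project(records)
total_bytes = sum(record.bytes for record in records)
```

Article titles keep their URL encoding (for example `A2%20road`), so splitting on a single space yields exactly four fields per row for this sample.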