Using Apache Drill


Jim Scott, CHUG co-founder and Director, Enterprise Strategy and Architecture for MapR presents "Using Apache Drill". This presentation was given on August 13th, 2014 at the Nokia office in Chicago, IL.

Jim has held positions running Operations, Engineering, Architecture and QA teams. He has worked in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. His work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.

Apache Drill brings the power of standard ANSI SQL:2003 to your desktop and your clusters. It is like AWK for Hadoop. Drill supports querying schemaless systems like HBase, Cassandra and MongoDB. Standard JDBC and ODBC APIs let you call Drill from your custom applications. Leveraging an efficient columnar storage format, an optimistic execution engine and a cache-conscious memory layout, Apache Drill is blazing fast. Coordination, query planning, optimization, scheduling, and execution are all distributed throughout the nodes in a system to maximize parallelization. This presentation contains live demonstrations.
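To make the "schemaless" claim concrete, here is a minimal Python sketch of discovering a schema while reading self-describing records, rather than declaring it in advance. It only illustrates the idea and is not Drill's implementation: the function name `discover_schema` and the dotted-path convention are assumptions made for this example, and the two sample records are adapted from the "Data is Stored in Many Forms" slide in this deck.

```python
from typing import Any, Dict, Iterable, Set


def discover_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each dotted field path to the set of type names seen for it."""
    schema: Dict[str, Set[str]] = {}

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            # Nested object: descend and extend the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            # Leaf value (including lists): record its Python type name.
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


# Two records with different fields; no schema is declared up front.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["skiing", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["singing"], "preschool": "CCLC"},
]

schema = discover_schema(records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

The fields `district` and `preschool` each appear in only one record, yet both end up in the discovered schema, which is the behavior a fixed, pre-declared table definition cannot offer.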

The video can be found here:

Published in: Technology

  • Modeled after Dremel, based on the white paper from Google
    With the additional flexibility required to support a broader range of data formats and data sources
    The design goal is to scale to 10,000+ servers and to process petabytes of data and trillions of records in seconds
  • Hortonworks has used code from Drill in Tez
  • These are not people who can only create an Abstract Syntax Tree. They have worked on Oracle, DB2, ParAccel, Teradata, SQL Server, Vertica
    You don’t use a QWERTY-like keyboard.
    Do you really want to use another SQL-like syntax?
    Facebook, Visa, Mesosphere, many universities, etc., and even Oracle
  • So many tools and applications. Great performance
    One technology – standard across multiple databases
    As applications evolve, schemas change rapidly
  • Why do we tolerate applications that only support the parts and pieces they choose for SQL?
  • A Drillbit is simply a Drill process running on any node in the cluster
    Have I mentioned JDBC and ODBC drivers? This means you can reach Drill through standard database interfaces from any tool that supports them.
  • Using Apache Drill

    1. © 2014 MapR Technologies. Using Apache Drill
    2. Agenda
       • About Apache Drill
       • Query Execution
       • Demonstration
       • Q and A
    3. About Apache Drill
    4. (no text on this slide)
    5. Community
       • Mentors: MapR, Lucid Works, Elasticsearch, University members
       • Notable Committers: MapR, Microsoft, Hortonworks, Concurrent, Oracle, Ohm Data
    6. APACHE DRILL
       • Pioneering Data Agility for Hadoop
       • Apache open source project
       • Scale-out execution engine for low-latency queries
       • Unified SQL-based API for analytics & operational applications
       • 40+ contributors
       • 150+ years of experience building databases and distributed systems
    7. Rethink SQL for Big Data
       • ANSI SQL: Ubiquitous
       • Familiar: No context switch BI/Analytics
       • One technology: Painful to manage different technologies
       • Enterprise ready: System-of-record, HA, DR, Security, multi-tenancy, …
       • Flexible data-model
         - Allow schemas to evolve rapidly
         - Support semi-structured data types
       • Agility: Self-service possible when developer and DBA are the same
       • Scalability: In all dimensions (schemas, processes, management)
       Slide labels: Preserve, Invent
    8. Drill Supports Schema Discovery On-The-Fly
       Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
       • Fixed schema
       • Leverage schema in centralized repository (Hive Metastore)
       Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
       • Fixed schema, evolving schema or schema-less
       • Leverage schema in centralized repository or self-describing data
    9. SQL
       select * from A where A.a in (select B.b from B where B.b = A.c);
       • Did you know Apache HIVE cannot compute this query?
         - e.g. Hive, Impala, Spark SQL
    10. YOU CAN’T HANDLE REAL SQL!
    11. Semi-structured Data
        select cf.month, cf.year from hbase.table1;
        • Of course you know an RDBMS cannot handle this query?
          - Nor can HIVE and its variants like Impala, Spark SQL
        • There’s no meta-store definition available
    12. YOU CAN’T HANDLE AN HBASE API!
    13. Interactive SQL-on-Hadoop options

        |                | Drill 1.0 | Hive 0.13 w/ Tez | Impala 1.x | Shark 0.9 | Presto 0.56 |
        | Latency        | Low | Medium | Low | Medium | Low |
        | Files          | Yes (all Hive file formats, plus JSON, Text, …) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (all Hive file formats) | Yes (RC, Sequence, Text) |
        | HBase/MapR-DB  | Yes | Yes | Various issues | Yes | No |
        | Schema         | Hive or schema-less | Hive | Hive | Hive | Hive |
        | SQL support    | ANSI SQL | HiveQL | HiveQL (subset) | HiveQL | ANSI SQL |
        | Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | JDBC |
        | Hive compat    | High | High | Low | High | High |
        | Large joins    | Yes | Yes | No | No | No |
        | Nested data    | Yes | Limited | No | Limited | Limited |
        | Concurrency    | High | Limited | Medium | Limited | Medium |
    14. Data is Stored in Many Forms
        • Flat files in DFS
          - Complex data (Thrift, Avro, protobuf)
          - Columnar data (Parquet, ORC)
          - Loosely defined (JSON)
          - Traditional files (CSV, TSV)
        • Data stored in NoSQL stores
          - Relational-like (rows, columns)
          - Sparse data (NoSQL maps)
          - Embedded blobs (JSON)
          - Document stores (nested objects)
        { name: { first: Michael, last: Smith }, hobbies: [skiing, soccer], district: Los Altos }
        { name: { first: Jennifer, last: Gates }, hobbies: [singing], preschool: CCLC }
    15. Drill’s Data Model is Flexible
        Figure: data formats (HBase, JSON, BSON, CSV, TSV, Parquet, Avro) shown along two flexibility axes, fixed schema to schema-less and flat to complex.
        RDBMS/SQL-on-Hadoop table:
          Name      Gender  Age
          Michael   M       6
          Jennifer  F       3
        Apache Drill table:
          { name: { first: Michael, last: Smith }, hobbies: [skiing, soccer], district: Los Altos }
          { name: { first: Jennifer, last: Gates }, hobbies: [singing], preschool: CCLC }
    16. Query Execution
    17. Data Source is in the Query
        SELECT timestamp, message FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` WHERE errorLevel > 2
        • A storage engine instance
          - DFS
          - HBase
          - Hive Metastore/HCatalog
        • A workspace
          - Sub-directory
          - Hive database
          - HBase namespace
        • A table
          - pathnames
          - HBase table
          - Hive table
    18. Runtime Compilation is Faster
        • JIT is smart, but more gains with runtime compilation
        • Janino: Java-based Java compiler
        From
    19. Drill Compiler (diagram labels)
        • CodeModel generates code
        • Janino compiles runtime byte-code
        • Precompiled byte-code templates
        • Merge byte-code of the two classes
        • Loaded class
    20. Basic query flow
        Diagram: a query arrives at one of three Drillbits, each with a Distributed Cache and each sitting on DFS / HBase, with Zookeeper alongside.
        1. Query comes to any Drillbit (JDBC, ODBC, CLI)
        2. Drillbit generates execution plan based on query optimization & locality
        3. Fragments are farmed to individual nodes
        4. Data is returned to driving node
        * Curator/Zookeeper for ephemeral cluster membership info
    21. Demonstration
    22. Download and try Drill!
    23. Q&A
        Engage with us! @mapr, maprtech, MapR, maprtech, mapr-technologies
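
The four steps on the "Basic query flow" slide can be sketched in a few lines of Python. This is a toy, single-process illustration under stated assumptions, not Drill's planner or execution engine: the names `plan_fragments`, `execute_fragment` and `run_query`, the node names, and the sample log rows are all hypothetical, and the fragments run one after another here where a real cluster would run them in parallel.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def plan_fragments(block_locations: Dict[str, str]) -> Dict[str, List[str]]:
    """Step 2: group data blocks by the node that stores them (locality)."""
    plan: Dict[str, List[str]] = {}
    for block, node in sorted(block_locations.items()):
        plan.setdefault(node, []).append(block)
    return plan


def execute_fragment(blocks: List[str],
                     storage: Dict[str, List[Row]],
                     predicate: Callable[[Row], bool]) -> List[Row]:
    """Step 3: one node scans only its own blocks and filters the rows."""
    return [row for block in blocks for row in storage[block] if predicate(row)]


def run_query(block_locations: Dict[str, str],
              storage: Dict[str, List[Row]],
              predicate: Callable[[Row], bool]) -> List[Row]:
    """Steps 1 and 4: the driving node plans, farms out work, gathers results."""
    results: List[Row] = []
    for node, blocks in plan_fragments(block_locations).items():
        results.extend(execute_fragment(blocks, storage, predicate))
    return results


# Hypothetical cluster: three blocks of log data spread over two nodes.
block_locations = {"p001": "node-a", "p002": "node-b", "p003": "node-a"}
storage = {
    "p001": [{"message": "disk full", "errorLevel": 3}],
    "p002": [{"message": "login ok", "errorLevel": 1}],
    "p003": [{"message": "timeout", "errorLevel": 4}],
}

# Same filter as the slide's example query: WHERE errorLevel > 2
rows = run_query(block_locations, storage, lambda r: r["errorLevel"] > 2)
print(sorted(r["message"] for r in rows))
```

Grouping blocks by the node that holds them is the point of the sketch: each fragment reads only local data, and only the filtered rows travel back to the driving node.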