HBaseCon 2013 Talk



Slides from talk at HBaseCon, June 13, 2013. Includes a comparison of how Drill, Phoenix, Impala and Stinger/Tez relate.

Published in: Technology


  1. Apache Drill: YASOH
     yet another sql on h(base|adoop)
     Jacques Nadeau, HBaseCon, June 13, 2013
     jacques@apache.org | @intjesus
  2. Me
     • Software Architect @ MapR, leading our Apache Drill contributions
     • Previously:
       – Led development of a distributed search engine at YapMap
       – Led the R&D team at contextual advertising company Quigo, sold to AOL
       – Built big data warehousing and analytical reporting products at Aquantive, sold to Microsoft
  3. Apache Drill
     • Apache Incubating Project
     • Interactive analysis of large-scale datasets
       – Inspired by Google Dremel
     • MapReduce's greatest strength is also an Achilles heel for high-performance queries
       – Pessimistic execution is great for long-running jobs
       – Optimistic execution is better for shorter jobs
       – Hive solves many needs, but its organic growth and dependence on MapReduce make it hard to bring forward
       – Tez is a new project that tries to bring Hive a new execution model
     • Not done: alpha next month
  4. Basic Process
     [Diagram: Zookeeper; three nodes, each with a Drillbit, a Distributed Cache, and DFS/HBase; an incoming Query]
     1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
     2. Drillbit generates execution plan based on query optimization & locality
     3. Fragments are farmed to individual nodes
     4. Data is returned to driving node
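The locality-based planning in steps 2 and 3 can be pictured with a small sketch. This is not Drill's planner: the function name `assign_fragments`, the block-to-replica map, and the least-loaded tie-break are all assumptions made for illustration. It only shows the idea of sending each unit of work to a node that already holds the data, and falling back to any node when none does.

```python
from collections import defaultdict

def assign_fragments(blocks, drillbits):
    """Assign each data block to a Drillbit, preferring a co-located one.

    blocks: dict of block id -> hosts holding a replica (hypothetical input)
    drillbits: list of hosts running a Drillbit (hypothetical input)
    """
    load = defaultdict(int)
    assignment = {}
    for block_id in sorted(blocks):
        local = [h for h in blocks[block_id] if h in drillbits]
        # Fall back to every Drillbit when no replica is local.
        candidates = local or drillbits
        # Least-loaded candidate wins; host name breaks ties deterministically.
        host = min(candidates, key=lambda h: (load[h], h))
        assignment[block_id] = host
        load[host] += 1
    return assignment

blocks = {
    "b1": ["node1", "node2"],
    "b2": ["node2", "node3"],
    "b3": ["node4"],  # no Drillbit runs on node4
}
drillbits = ["node1", "node2", "node3"]
plan = assign_fragments(blocks, drillbits)
print(plan)
```

Here `b1` and `b2` land on nodes that hold their data, while `b3` has no local Drillbit and goes to the least-loaded remaining node.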
  5. Core Modules within a Drillbit
     [Diagram labels: SQL Parser, Optimizer, Logical Plan, Physical Plan, Execution, Storage Engine Interface, DFS Engine, HBase Engine, RPC Endpoint, Distributed Cache]
  6. SQL Options for HBase
     Values are listed in column order: Drill | Phoenix | Impala | Hive+Tez. Rows with fewer than four ✓ marks lost their column alignment in the transcript.
     Overall
       – Status: Alpha | 1.2 | 1.0 | Alpha
       – Typical shortest query: 100ms | 10ms | 100ms | ??
       – Query HBase: ✓ | ✓ | ✓ | ✓
       – Query any SerDe: ✓ ✓
       – Hive UDF support: ✓ ✓
       – Contribution/dev model: Apache | GitHub | MySQL | Apache
       – Execution programming language: Java | Java | C++ | Java
     Query language
       – Supports write: ✓ ✓ ✓
       – Query language: SQL2003 | SQL92 | ~HiveQL | HiveQL
     Data
       – Supports data without schema: ✓
       – Nested relational operators: ✓
       – Internal sort & join: ✓ ✓ ✓
       – External sort/join/aggregation: ✓ ✓
     Execution
       – Code generation: ✓ ✓
       – Columnar execution: ✓
       – Vectorized operators: ✓ ✓
  7. What’s different about Drill
     • Late-bind schema doesn’t require metastore definitions
         SELECT cf1.month, cf1.year
         FROM hbase.table1
     • Nested data as a first-class entity: extensions to SQL for nested data types, similar to BigQuery (four-value semantics)
         SELECT c.name, c.address, COUNT(c.children)
         FROM (
           SELECT CONVERT_FROM(cf1.user-json-blob, JSON) AS c
           FROM hbase.table1
         )
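To make the late-bound, nested-data idea concrete, here is a rough Python analogue of the second query on this slide. It is only a sketch: the sample rows are invented, and reading `COUNT(c.children)` as a per-record count of a nested array is an assumption, since the slide does not spell out the semantics. The point is that no schema is declared up front; fields are discovered when each JSON blob is decoded.

```python
import json

# Invented sample rows shaped like an HBase table with column family cf1.
rows = [
    {"cf1": {"user-json-blob": b'{"name": "Ann", "address": "1 Main St", "children": ["Bo", "Cy"]}'}},
    {"cf1": {"user-json-blob": b'{"name": "Dee", "address": "2 Oak Ave"}'}},
]

def project(row):
    """Decode the blob at read time and project three fields from it."""
    c = json.loads(row["cf1"]["user-json-blob"])  # schema is discovered here
    return {
        "name": c.get("name"),
        "address": c.get("address"),
        # A missing nested field is treated as an empty list (assumption).
        "children": len(c.get("children", [])),
    }

result = [project(r) for r in rows]
for record in result:
    print(record)
```

The second record has no `children` field at all, which a fixed metastore schema would have to anticipate, while a late-bound reader simply handles it per record.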
  8. What’s different about Drill, cont’d
     • Community-driven Apache development process and peace of mind
     • Leverages recent research approaches
       – Late record materialization
       – Vectorized operators
     • Extensibility
       – Supports Hive UDFs/SerDes
       – Well-defined storage engine and operator interfaces
       – Logical and physical plan API layers for optimization and extension
       – Targeting Phoenix support
     • Works like other things in the Hadoop ecosystem
       – Apache development process & Java codebase
  9. Drill + HBase Roadmap
     • Native support for Orderly complex keys
       – Orderly encodes a compound field (including null support) as a single, sortable byte value
     • Drill on top of Phoenix to leverage great Coprocessor work
     • Optimized HBase join leveraging bloom filters
     • Memory-mapped RegionServer <> Drillbit communication
     • Expression evaluation bytecode pushdown
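The idea behind a single, sortable byte value for a compound key can be sketched as follows. This is not Orderly's actual byte format: the marker bytes, the NUL terminator, and the helper names `encode_str`, `encode_int64` and `encode_key` are assumptions chosen to keep the example short. What it demonstrates is the property that matters for HBase row keys: comparing the encoded bytes gives the same order as comparing the original fields, with nulls sorting first.

```python
def encode_str(s):
    """Encode an optional string so that byte order matches string order."""
    if s is None:
        return b"\x00"  # null marker sorts before any present value
    raw = s.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("NUL bytes are not supported in this sketch")
    return b"\x01" + raw + b"\x00"  # terminator keeps prefixes ordered first

def encode_int64(v):
    """Encode an optional signed 64-bit integer as sortable big-endian bytes."""
    if v is None:
        return b"\x00"
    # Shifting by 2**63 maps the signed range onto unsigned order.
    return b"\x01" + (v + 2**63).to_bytes(8, "big")

def encode_key(name, ts):
    """Concatenate the field encodings into one compound row key."""
    return encode_str(name) + encode_int64(ts)

keys = [("bob", 5), ("al", -3), (None, 7), ("al", None), ("alice", 0), ("al", 10)]
by_bytes = sorted(keys, key=lambda k: encode_key(*k))
print(by_bytes)
```

Because the ordering lives in the bytes themselves, a range scan over the row keys returns compound keys in field order without decoding them first.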
  10. Other Interesting Things
     • Drill keeps data off-heap to avoid garbage collection problems
       – Metadata stays on heap
       – Utilizes Netty’s arena-based NativeByteBuffer pooling and ByteBuf abstraction
       – RPC engine specifically designed to avoid extra memory copies
       – In-memory representation is documented, allowing native operators as required
     • Code is compiled at a record batch level, avoiding record-level function call overhead
       – Janino + ASM for code compilation
       – Recompiled for each schema change
     • Record batches are maintained in columnar format and leverage a selection vector execution method to speed query performance
       – Minimizes branches and instruction complexity
       – Maximizes cache locality
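The selection vector method mentioned on this slide can be illustrated with a minimal sketch. It uses plain Python lists rather than Drill's off-heap buffers, and the batch contents and the `filter_batch` helper are invented for the example. Each filter narrows a list of surviving row indices instead of copying column data, and later operators read only the rows the vector names.

```python
# One record batch stored column by column (invented sample data).
batch = {
    "year":  [2011, 2012, 2013, 2013, 2012],
    "month": [3, 11, 6, 1, 9],
    "sales": [100, 250, 75, 300, 50],
}

def filter_batch(batch, column, predicate, selection=None):
    """Return the indices of rows that pass, without copying column data."""
    values = batch[column]
    if selection is None:
        selection = range(len(values))  # first filter sees every row
    return [i for i in selection if predicate(values[i])]

# Chain two filters; each one only narrows the selection vector.
sel = filter_batch(batch, "year", lambda y: y >= 2012)
sel = filter_batch(batch, "sales", lambda s: s >= 100, sel)

# A downstream operator touches only the selected rows.
total = sum(batch["sales"][i] for i in sel)
print(sel, total)
```

Each filter scans a single column in a tight loop, which is the access pattern behind the cache locality and low branch count listed above.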
  11. Thanks!
     • Join the community
       – Join the mailing list:
         • drill-dev-subscribe@incubator.apache.org
         • drill-user-subscribe@incubator.apache.org
       – Fork us on GitHub: http://github.com/apache/incubator-drill/
       – Create a JIRA: https://issues.apache.org/jira/browse/DRILL
     • Join the Drill team at MapR Technologies
     • Let us know what you think on the Drill mailing lists
     • Shout out to supporting projects
       – Jackson
       – Typesafe HOCON
       – Netty4
       – Protobuf
       – Vanilla Java
       – Larray
       – Hazelcast
       – Curator
       – Optiq
       – Hive ORC
       – Parquet
       – Janino
       – ASM
       – Yammer Metrics
       – Guava
       – Carrot HPPC