HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI SQL Capabilities for Apache HBase

1
Apache Drill: YASOH
yet another sql on h(base|adoop)
Jacques Nadeau, HBaseCon June 13, 2013
jacques@apache.org | @intjesus

2
Me
 Software Architect @ MapR leading our Apache Drill
contributions
 Previously:
– Led development of distributed search engine at YapMap
– Led R&D team at contextual advertising company Quigo, sold
to AOL
– Built big data warehousing and analytical reporting products at
aQuantive, sold to Microsoft

3
Apache Drill
 Apache Incubating Project
 Interactive Analysis of large scale datasets
– Inspired by Google Dremel
 MapReduce’s greatest strength is also its Achilles’ heel for
high-performance queries
– Pessimistic execution is great for long running jobs
– Optimistic execution is better for shorter jobs
– Hive solves many needs but its organic growth and dependence on
MapReduce make it hard to bring forward
– Tez is a new project that tries to bring Hive a new execution model
 Not done yet; alpha next month

4
Basic Process
[Diagram: Zookeeper; three nodes, each with a Drillbit and a Distributed Cache on top of DFS/HBase; a Query entering the cluster]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Data is returned to driving node
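
The four steps above can be sketched as a toy scatter-gather in Python. This is only an illustration of the flow, not Drill's implementation: `Drillbit`, `submit`, `execute_fragment`, and `plan_fragments` are hypothetical names, and "locality" is reduced to sending each block to the node that stores it, with no query optimization.

```python
from dataclasses import dataclass, field


@dataclass
class Drillbit:
    """Toy stand-in for a Drillbit co-located with some data blocks (hypothetical)."""
    name: str
    local_blocks: dict = field(default_factory=dict)  # block id -> list of rows

    def execute_fragment(self, block_ids, predicate):
        # Step 3: run one plan fragment against locally stored blocks only.
        return [row for b in block_ids for row in self.local_blocks[b] if predicate(row)]

    def submit(self, cluster, block_ids, predicate):
        # Step 1: any Drillbit can accept the query and act as the driving node.
        by_name = {bit.name: bit for bit in cluster}
        results = []
        for node_name, blocks in plan_fragments(cluster, block_ids).items():
            # Step 4: fragment results flow back to the driving node.
            results.extend(by_name[node_name].execute_fragment(blocks, predicate))
        return results


def plan_fragments(cluster, block_ids):
    """Step 2: assign each block to the node that holds it (locality only, no cost model)."""
    plan = {}
    for block in block_ids:
        owner = next(bit for bit in cluster if block in bit.local_blocks)
        plan.setdefault(owner.name, []).append(block)
    return plan


cluster = [
    Drillbit("node1", {"b1": [{"year": 2012}, {"year": 2013}]}),
    Drillbit("node2", {"b2": [{"year": 2013}]}),
    Drillbit("node3", {"b3": [{"year": 2011}]}),
]
# The query is sent to node2 here, but any node would return the same rows.
rows = cluster[1].submit(cluster, ["b1", "b2", "b3"], lambda r: r["year"] == 2013)
```

A real engine would run the fragments in parallel and stream record batches over RPC; the loop here is sequential to keep the sketch short.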

5
Core Modules within a Drillbit
[Diagram: modules inside a Drillbit]
– RPC Endpoint
– SQL Parser
– LogicalPlan
– Optimizer
– PhysicalPlan
– Execution
– StorageEngineInterface: DFS Engine, HBase Engine
– Distributed Cache

6
SQL Options for HBase
Feature | Drill | Phoenix | Impala | Hive+Tez
Overall
Status | Alpha | 1.2 | 1.0 | Alpha
Typical Shortest Query | 100ms | 10ms | 100ms | ??
Query HBase | ✓ | ✓ | ✓ | ✓
Query Any SerDe ✓ ✓
Hive UDF support ✓ ✓
Contribution/Dev Model | Apache | GitHub | MySQL | Apache
Execution programming language | Java | Java | C++ | Java
Query language
Supports Write ✓ ✓ ✓
Query Language | SQL2003 | SQL92 | ~HiveQL | HiveQL
Data
Supports data without schema ✓
Nested Relational Operators ✓
Internal sort & join ✓ ✓ ✓
External Sort/Join/Aggregation ✓ ✓
Execution
Code Generation ✓ ✓
Columnar Execution ✓
Vectorized Operators ✓ ✓

7
What’s different about Drill
 Late-bind schema doesn’t require metastore definitions
SELECT cf1.month, cf1.year FROM hbase.table1
 Nested data as first class entity: Extensions to SQL for nested data types,
similar to BigQuery (four-value semantics)
SELECT
c.name, c.address, COUNT(c.children)
FROM(
SELECT
CONVERT_FROM(cf1.user-json-blob, JSON) AS c
FROM
hbase.table1
)
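
The idea behind the two queries above (no declared schema, nested values decoded at read time) can be mimicked in a few lines of Python. This is a sketch under assumptions, not Drill code: the in-memory `table1`, the `convert_from_json` helper, and the sample rows are made up for illustration, and `COUNT(c.children)` is read here as "number of elements in the nested `children` array".

```python
import json

# Hypothetical HBase-like rows: row key -> {column: raw bytes}. No schema is declared anywhere.
table1 = {
    b"row1": {b"cf1:user-json-blob": b'{"name": "Ann", "address": "1 Main St", "children": ["Bo", "Cy"]}'},
    b"row2": {b"cf1:user-json-blob": b'{"name": "Dan", "children": []}'},
}


def convert_from_json(raw):
    """Decode a stored blob into a nested value at read time (stand-in for CONVERT_FROM)."""
    return json.loads(raw.decode("utf-8"))


def query(table):
    out = []
    for _, cells in sorted(table.items()):
        c = convert_from_json(cells[b"cf1:user-json-blob"])
        # Fields are resolved per record; a missing field yields None instead of a schema error.
        out.append((c.get("name"), c.get("address"), len(c.get("children", []))))
    return out


result = query(table1)
```

The second row has no `address` field, and the query still runs: the shape of each record is discovered while reading it rather than looked up in a metastore.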

8
What’s different about Drill, cont’d
 Community-driven Apache development process and peace of
mind
 Leverages recent research approaches
– Late record materialization
– Vectorized Operators
 Extensibility
– Supports Hive UDFs/SerDes
– Well defined storage engine and operator interfaces
– Logical and physical plan API layers for optimization and extension
– Targeting Phoenix support
 Works like other things in the Hadoop ecosystem
– Apache development process & Java codebase

9
Drill + HBase Roadmap
 Native support for Orderly complex keys
– Orderly encodes a compound field (including null support) as a single,
sortable byte value
 Drill on top of Phoenix to leverage great Coprocessor work
 Optimized HBase join leveraging bloomfilters
 Memory mapped RegionServer <> Drillbit communication
 Expression evaluation bytecode pushdown
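
To make the "single, sortable byte value" idea concrete, here is a minimal Python sketch of an order-preserving compound key with null support. It is not Orderly's actual byte format: `encode_field` and `encode_key` are hypothetical helpers, only 64-bit integers and NUL-free strings are handled, and each key position is assumed to hold one fixed type.

```python
import struct


def encode_field(value):
    """Encode one field so that unsigned byte order matches value order (None sorts first)."""
    if value is None:
        return b"\x00"
    if isinstance(value, int):
        if not -(1 << 63) <= value < (1 << 63):
            raise OverflowError("this sketch only handles 64-bit integers")
        # Adding 2**63 flips the sign bit, so negatives sort before positives as unsigned bytes.
        return b"\x01" + struct.pack(">Q", value + (1 << 63))
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if b"\x00" in raw:
            raise ValueError("this sketch does not escape NUL bytes")
        # The 0x00 terminator makes a string sort before any longer string it prefixes.
        return b"\x01" + raw + b"\x00"
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def encode_key(*fields):
    """Concatenate the encoded fields into one byte string usable as a row key."""
    return b"".join(encode_field(f) for f in fields)


keys = [("us", 2013), ("us", None), ("u", 7), ("us", -5), (None, 1)]
by_bytes = sorted(keys, key=lambda k: encode_key(*k))
```

Because the row keys sort the same way as the logical tuples, a range predicate on the leading fields can be turned into a start/stop row scan instead of a full table scan.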

10
Other Interesting Things
 Drill keeps data off-heap to avoid garbage collection problems
– Metadata stays on heap
– Utilizes Netty’s arena-based NativeByteBuffer pooling and ByteBuf
abstraction
– RPC engine specifically designed to avoid extra memory copies
– In memory representation is documented, allowing native operators as
required
 Code is compiled at a record batch level, avoiding record level
function call overhead
– Janino + ASM for code compilation
– Recompiled for each schema change
 Record batches are maintained in columnar format and leverage a
selection vector execution method to speed query performance
– Minimizes branches and instruction complexity
– Maximizes cache locality
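
The selection vector method mentioned above can be illustrated with a small Python sketch. It only shows the data flow (filters narrow a list of row positions while the column data stays in place); it does not reproduce Drill's off-heap buffers or generated code, and `filter_batch`, `sum_selected`, and the sample batch are hypothetical.

```python
def filter_batch(column, selection, predicate):
    """Return a new selection vector; column data is never copied or compacted."""
    return [i for i in selection if predicate(column[i])]


def sum_selected(column, selection):
    """Aggregate only the positions still present in the selection vector."""
    total = 0
    for i in selection:
        total += column[i]
    return total


# One record batch stored column by column (hypothetical data).
batch = {
    "year": [2011, 2013, 2013, 2012, 2013],
    "clicks": [10, 20, 30, 40, 50],
}

selection = list(range(len(batch["year"])))  # initially every row is selected
selection = filter_batch(batch["year"], selection, lambda y: y == 2013)
selection = filter_batch(batch["clicks"], selection, lambda c: c > 20)
total = sum_selected(batch["clicks"], selection)
```

Each operator loops over a single column at a time, which is what keeps the inner loop tight; in a compiled engine that loop is the place where the branch and cache-locality benefits would show up.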

11
Thanks!
 Join the Community
– Join the mailing list:
• drill-dev-subscribe@incubator.apache.org
• drill-user-subscribe@incubator.apache.org
– Fork us on GitHub:
http://github.com/apache/incubator-drill/
– Create a JIRA:
https://issues.apache.org/jira/browse/DRILL
 Join the Drill team at MapR
Technologies
 Let us know what you think on the
Drill mailing lists
 Shout out to
supporting projects
– Jackson
– Typesafe HOCON
– Netty4
– Protobuf
– Vanilla Java
– Larray
– Hazelcast
– Curator
– Optiq
– Hive ORC
– Parquet
– Janino
– ASM
– Yammer Metrics
– Guava
– Carrot HPPC
