HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI SQL Capabilities for Apache HBase
1.
Apache Drill: YASOH
yet another sql on h(base|adoop)
Jacques Nadeau, HBaseCon June 13, 2013
jacques@apache.org | @intjesus
2.
Me
Software Architect @ MapR leading our Apache Drill
contributions
Previously:
– Led development of distributed search engine at YapMap
– Led R&D team at contextual advertising company Quigo, sold
to AOL
– Built big data warehousing and analytical reporting products at
aQuantive, sold to Microsoft
3.
Apache Drill
Apache Incubating Project
Interactive Analysis of large scale datasets
– Inspired by Google Dremel
MapReduce's greatest strength is also an Achilles heel for high
performance queries
– Pessimistic execution is great for long running jobs
– Optimistic execution is better for shorter jobs
– Hive solves many needs but its organic growth and dependence on
MapReduce make it hard to bring forward
– Tez is a new project that tries to bring Hive a new execution model
Not Done—alpha next month
4.
Basic Process
Zookeeper
DFS/HBase DFS/HBase DFS/HBase
Drillbit
Distributed Cache
Drillbit
Distributed Cache
Drillbit
Distributed Cache
Query
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Data is returned to driving node
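The four steps above can be sketched in a few lines of Python. This is a toy model, not Drill's planner: the region names, node names, row values and every function name here are hypothetical, and "locality" is reduced to grouping regions by the node that stores them.

```python
from collections import defaultdict

# Hypothetical cluster layout: which node stores which HBase region.
REGION_LOCATIONS = {
    "region-a": "node1",
    "region-b": "node2",
    "region-c": "node3",
    "region-d": "node1",
}

# Hypothetical values stored in each region.
REGION_ROWS = {
    "region-a": [1, 2, 3],
    "region-b": [4, 5],
    "region-c": [6],
    "region-d": [7, 8, 9, 10],
}


def plan_fragments(regions):
    """Step 2: group the regions to scan by the node that holds them."""
    fragments = defaultdict(list)
    for region in regions:
        fragments[REGION_LOCATIONS[region]].append(region)
    return dict(fragments)


def run_fragment(local_regions):
    """Step 3: a node scans only its local regions and pre-aggregates."""
    return sum(sum(REGION_ROWS[r]) for r in local_regions)


def execute_query(regions):
    """Steps 1-4 as seen from the Drillbit that received the query."""
    fragments = plan_fragments(regions)
    partials = {node: run_fragment(local) for node, local in fragments.items()}
    # Step 4: partial results return to the driving node and are merged.
    return sum(partials.values()), fragments


total, fragments = execute_query(list(REGION_ROWS))
```

Here the query is a plain SUM, so each fragment can return one partial value and the driving node only adds them up.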
6.
SQL Options for HBase
Drill Phoenix Impala Hive+Tez
Overall
Status Alpha 1.2 1.0 Alpha
Typical Shortest Query 100ms 10ms 100ms ??
Query HBase ✓ ✓ ✓ ✓
Query Any SerDe ✓ ✓
Hive UDF support ✓ ✓
Contribution/Dev Model Apache GitHub MySQL Apache
Execution programming language Java Java C++ Java
Query language
Supports Write ✓ ✓ ✓
Query Language SQL2003 SQL92 ~HiveQL HiveQL
Data
Supports data without schema ✓
Nested Relational Operators ✓
Internal sort & join ✓ ✓ ✓
External Sort/Join/Aggregation ✓ ✓
Execution
Code Generation ✓ ✓
Columnar Execution ✓
Vectorized Operators ✓ ✓
7.
What’s different about Drill
Late-bind schema doesn’t require metastore definitions
SELECT cf1.month, cf1.year FROM hbase.table1
Nested data as first class entity: Extensions to SQL for nested data types,
similar to BigQuery (four-value semantics)
SELECT
  c.name, c.address, COUNT(c.children)
FROM (
  SELECT
    CONVERT_FROM(cf1.user-json-blob, JSON) AS c
  FROM
    hbase.table1
)
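To make the two ideas concrete, here is a rough Python analogue of what the query above asks for. It is an illustration only: the sample rows are invented, and reading COUNT(c.children) as a per-record count of the nested list is an assumption about the intended semantics, not a statement of how Drill evaluates it.

```python
import json

# Hypothetical HBase rows: column family "cf1" holds a JSON blob per user.
# No schema is declared anywhere; fields are discovered when a row is read.
ROWS = [
    {"cf1": {"user-json-blob":
             '{"name": "Ann", "address": "1 Main St", "children": ["Bo", "Cy"]}'}},
    {"cf1": {"user-json-blob":
             '{"name": "Dee", "address": "2 Oak Ave"}'}},
]


def query(rows):
    """Mimic: SELECT c.name, c.address, COUNT(c.children) FROM (...)."""
    results = []
    for row in rows:
        # CONVERT_FROM(cf1.user-json-blob, JSON) AS c
        c = json.loads(row["cf1"]["user-json-blob"])
        # A missing nested field is treated as an empty list, not an error.
        children = c.get("children", [])
        results.append((c.get("name"), c.get("address"), len(children)))
    return results


results = query(ROWS)
```

The second row has no "children" field at all, which is the kind of per-record variation a late-bound schema has to tolerate.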
8.
What’s different about Drill, cont’d
Community-driven Apache development process and peace of
mind
Leverages recent research approaches
– Late record materialization
– Vectorized Operators
Extensibility
– Supports Hive UDFs/SerDes
– Well defined storage engine and operator interfaces
– Logical and physical plan API layers for optimization and extension
– Targeting Phoenix support
Works like other things in the Hadoop ecosystem
– Apache development process & Java codebase
9.
Drill + HBase Roadmap
Native support for Orderly complex keys
– Orderly encodes a compound field (including null support) as a single,
sortable byte value
Drill on top of Phoenix to leverage great Coprocessor work
Optimized HBase join leveraging bloomfilters
Memory mapped RegionServer <> Drillbit communication
Expression evaluation bytecode pushdown
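The Orderly item above relies on an order-preserving encoding: a compound key becomes one byte string whose plain byte order matches the logical order of the fields. The sketch below shows the general idea for an (integer, string) key with nulls sorting first. It is not Orderly's actual wire format; the tag bytes and function names are made up for illustration, and the string is assumed to be the last field so no terminator is needed.

```python
import struct

NULL_TAG = b"\x00"     # hypothetical tag: null sorts before any value
PRESENT_TAG = b"\x01"  # hypothetical tag: a value follows


def encode_int(value):
    """Encode a signed 64-bit int (or None) so byte order equals numeric order."""
    if value is None:
        return NULL_TAG
    # Adding 2**63 flips the sign bit, mapping signed order onto
    # unsigned big-endian byte order.
    return PRESENT_TAG + struct.pack(">Q", value + (1 << 63))


def encode_str(value):
    """Encode a string (or None); UTF-8 byte order matches code point order."""
    if value is None:
        return NULL_TAG
    return PRESENT_TAG + value.encode("utf-8")


def encode_key(number, text):
    """Concatenate the encoded fields into a single sortable row key."""
    return encode_int(number) + encode_str(text)


keys = [(5, "b"), (None, "z"), (-3, "a"), (5, None), (5, "a"), (-3, "é")]
sorted_by_bytes = sorted(keys, key=lambda k: encode_key(*k))
```

Because HBase stores rows sorted by raw key bytes, a key encoded this way lets a range scan over bytes behave like a range scan over the original fields.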
10.
Other Interesting Things
Drill keeps data off-heap to avoid garbage collection problems
– Metadata stays on heap
– Utilizes Netty’s arena-based NativeByteBuffer pooling and ByteBuf
abstraction
– RPC engine specifically designed to avoid extra memory copies
– In memory representation is documented, allowing native operators as
required
Code is compiled at a record batch level, avoiding record level
function call overhead
– Janino + ASM for code compilation
– Recompiled for each schema change
Record batches are maintained in columnar format and leverage a
selection vector execution method to speed query performance
– Minimize branches and instruction complexity
– Maximizes cache locality
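The selection vector idea from the last item can be shown with a small Python sketch. The batch contents and function names are hypothetical, and Python lists give none of the cache or branch benefits mentioned above; the point is only the data flow, where filters produce row indices instead of copying columns.

```python
# A record batch stored column by column (hypothetical data).
batch = {
    "year":  [2011, 2012, 2013, 2013, 2012],
    "month": [1, 6, 3, 12, 9],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
}


def filter_to_selection(column, predicate):
    """Scan one column and return the indices of rows that pass."""
    return [i for i, value in enumerate(column) if predicate(value)]


def refine_selection(column, selection, predicate):
    """Apply a further filter, touching only rows already selected."""
    return [i for i in selection if predicate(column[i])]


def gather(column, selection):
    """Materialize values for the selected rows only, as late as possible."""
    return [column[i] for i in selection]


# WHERE year = 2013 AND month > 6, then project sales.
selection = filter_to_selection(batch["year"], lambda y: y == 2013)
selection = refine_selection(batch["month"], selection, lambda m: m > 6)
result = gather(batch["sales"], selection)
```

Each filter reads a single column, and the "sales" column is not touched until the final gather, which is the late materialization mentioned on an earlier slide.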
11.
Thanks!
Join the Community
– Join the mailing list:
• drill-dev-subscribe@incubator.apache.org
• drill-user-subscribe@incubator.apache.org
– Fork us on GitHub:
http://github.com/apache/incubator-drill/
– Create a JIRA:
https://issues.apache.org/jira/browse/DRILL
Join the Drill team at MapR
Technologies
Let us know what you think on the
Drill mailing lists
Shout out to
supporting projects
– Jackson
– Typesafe HOCON
– Netty4
– Protobuf
– Vanilla Java
– Larray
– Hazelcast
– Curator
– Optiq
– Hive ORC
– Parquet
– Janino
– ASM
– Yammer Metrics
– Guava
– Carrot HPPC