Drill lightning-london-big-data-10-01-2012

Apache Drill
Interactive Analysis of Large-Scale Datasets

Ted Dunning
Chief Application Architect, MapR

Big Data Processing
Batch processing Interactive analysis Stream processing
Query runtime Minutes to hours Milliseconds to Never-ending
minutes
Data volume TBs to PBs GBs to PBs Continuous stream
Programming MapReduce Queries DAG
model
Users Developers Analysts and Developers
developers
Google project MapReduce Dremel
Open source Hadoop Storm and S4
project MapReduce

Introducing Apache Drill…

Google Dremel
• Interactive analysis of large-scale datasets
– Trillion records at interactive speeds
– Complementary to MapReduce
– Used by thousands of Google employees
– Paper published at VLDB 2010
• Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva
Shivakumar, Matt Tolton, Theo Vassilakis

• Model
– Nested data model with schema
• Most data at Google is stored/transferred in Protocol Buffers
• Normalization (to relational) is prohibitive
– SQL-like query language with nested data support

• Implementation
– Column-based storage and processing
– In-situ data access (GFS and Bigtable)
– Tree architecture as in Web search (and databases)

Architecture

• Only the execution engine knows the physical attributes of the cluster
– # nodes, hardware, file locations, …

• Public interfaces enable extensibility
– Developers can build parsers for new query languages
– Developers can provide an execution plan directly

• Each level of the plan has a human readable representation
– Facilitates debugging and unit testing

Nested Query Languages
• DrQL
– SQL-like query language for nested data
– Compatible with Google BigQuery/Dremel
• BigQuery applications should work with Drill
– Designed to support efficient column-based processing
• No record assembly during query processing

• Mongo Query Language
– {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

• Other languages/programming models can plug in

DrQL Example

SELECT DocId AS Id,
COUNT(Name.Language.Code) WITHIN Name AS
Cnt,
Name.Url + ',' + Name.Language.Code AS
Str
FROM t
WHERE REGEXP(Name.Url, '^http')
AND DocId < 20;

* Example from the Dremel paper

Design Principles
Flexible Easy
• Pluggable query languages • Unzip and run
• Extensible execution engine • Zero configuration
• Pluggable data formats • Reverse DNS not needed
• Column-based and row-based • IP addresses can change
• Schema and schema-less • Clear and concise log messages
• Pluggable data sources

Dependable Fast
• No SPOF • C/C++ core with Java support
• Instant recovery from crashes • Google C++ style guide
• Min latency and max throughput
(limited only by hardware)

Get Involved!
• Download (a longer version of) these slides
– http://www.mapr.com/company/events/bay-area-hug/9-19-2012

• Join the project
– drill-dev-subscribe@incubator.apache.org #apachedrill

• Contact me:
– tdunning@maprtech.com
– tdunning@apache.org
– ted.dunning@maprtech.com
– @ted_dunning

• Join MapR
– jobs@mapr.com

Drill lightning-london-big-data-10-01-2012

More Related Content

What's hot

Viewers also liked

Similar to Drill lightning-london-big-data-10-01-2012

More from Ted Dunning

Drill lightning-london-big-data-10-01-2012

Editor's Notes