Apache Drill @ PJUG, Jan 15, 2013

Gera Shegalov @PJUG, Jan 15, 2013

/home/gera: whoami

■ Saarland University
■ 1st intern in Immortal DB @ Microsoft Research
■ JMS, RDBMS HA @ Oracle

■ Hadoop MapReduce / Hadoop Core
■ Founding member of Apache Drill

■ Open enterprise-grade distribution for Hadoop
● Easy, dependable and fast
● Open source with standards-based extensions

■ MapR is deployed at 1000’s of companies
● From small Internet startups to Fortune 100

■ MapR customers analyze massive amounts of data:
● Hundreds of billions of events daily
● 90% of the world’s Internet population monthly
● $1 trillion in retail purchases annually

■ MapR in the Cloud:
● partnered with Google: Hadoop on Google Compute Engine
● partnered with Amazon: M3/M5 options for Elastic Map Reduce

Agenda
■ What?
● What exactly does Drill do?

■ Why?
● Why do we need Apache Drill?

■ Who?
● Who is doing this?

■ How?
● How does Drill work inside?

■ Conclusion
● How can you help?
● Where can you find out more?

Apache Drill Overview

■ Drill overview
● Low latency interactive queries
● Standard ANSI SQL support
● Domain Specific Languages / Your own QL

■ Open-Source
● Apache Incubator
● 100’s involved across US and Europe
● Community consensus on API, functionality

Big Data Processing
                     Batch processing    Interactive analysis       Stream processing
Query runtime        Minutes to hours    Milliseconds to minutes    Never-ending
Data volume          TBs to PBs          GBs to PBs                 Continuous stream
Programming model    MapReduce           Queries                    DAG
Users                Developers          Analysts and developers    Developers
Google project       MapReduce           Dremel
Open source project  Hadoop MapReduce    Apache Drill               Storm and S4

Latency Matters

■ Ad-hoc analysis with interactive tools

■ Real-time dashboards

■ Event/trend detection and analysis
● Network intrusions
● Fraud
● Failures

Nested Query Languages

■ DrQL
● SQL-like query language for nested data

● Compatible with Google BigQuery/Dremel
● BigQuery applications should work with Drill

● Designed to support efficient column-based processing
● No record assembly during query processing

■ Mongo Query Language
● {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

■ Other languages/programming models can plug in

Nested Data Model
■ The data model in Dremel is Protocol Buffers
● Nested
● Schema
■ Apache Drill is designed to support multiple data models
● Schema: Protocol Buffers, Apache Avro, …
● Schema-less: JSON, BSON, …
■ Flat records are supported as a special case of nested data
● CSV, TSV, …
Avro IDL
enum Gender {
  MALE, FEMALE
}

record User {
  string name;
  Gender gender;
  long followers;
}

JSON
{
  "name": "Srivas",
  "gender": "Male",
  "followers": 100
}
{
  "name": "Raina",
  "gender": "Female",
  "followers": 200,
  "zip": "94305"
}
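
The claim that flat records are a special case of nested data can be illustrated with a minimal Python sketch. The helper `get_path` and the CSV text are assumptions made for illustration and are not part of Drill; the two JSON records are the ones shown on the slide.

```python
import csv
import io
import json

def get_path(record, path, default=None):
    """Follow a dotted path such as "address.zip" through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Schema-less JSON records from the slide: only the second one has "zip".
json_records = [
    json.loads('{"name": "Srivas", "gender": "Male", "followers": 100}'),
    json.loads('{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'),
]

# A flat CSV row (hypothetical data) is a nested record of depth one,
# so the same accessor works unchanged.
csv_text = "name,gender,followers\nSrivas,Male,100\n"
csv_records = list(csv.DictReader(io.StringIO(csv_text)))

zips = [get_path(r, "zip") for r in json_records]
names = [get_path(r, "name") for r in json_records + csv_records]
```

A missing field simply yields the default value, which is one way a schema-less reader could tolerate records of different shapes.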

Extensibility
■ Nested query languages
● Pluggable model

● DrQL

● Mongo Query Language

● Cascading

■ Distributed execution engine
● Extensible model (e.g., Dryad)

● Low-latency

● Fault tolerant

■ Nested data formats
● Pluggable model

● Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
● Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)

■ Scalable data sources
● Pluggable model

● Hadoop

● HBase

Design Principles

Flexible
● Pluggable query languages
● Extensible execution engine
● Pluggable data formats
● Column-based and row-based
● Schema and schema-less
● Pluggable data sources
● N(ot)O(nly) Hadoop

Easy
● Unzip and run
● Zero configuration
● Reverse DNS not needed
● IP addresses can change
● Clear and concise log messages

Dependable
● No SPOF
● Instant recovery from crashes

Fast
● Minimum Java core
● C/C++ core with Java support
● Google C++ style guide
● Min latency and max throughput (limited only by hardware)

Execution Engine
Operator layer is serialization-aware
Processes individual records

Execution layer is not serialization-aware
Processes batches of records (blobs/JSON trees)
Responsible for communication, dependencies and fault tolerance
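
The split between the two layers can be sketched roughly as follows. This is an illustrative Python sketch, not Drill code: the function names `filter_operator` and `execute` are hypothetical, JSON stands in for the batch encoding, and communication and fault tolerance are left out.

```python
import json

def filter_operator(predicate):
    """Operator layer: serialization-aware, works record by record."""
    def run(batch_bytes):
        records = json.loads(batch_bytes)            # decode the batch
        kept = [r for r in records if predicate(r)]  # per-record logic
        return json.dumps(kept).encode("utf-8")      # re-encode for transport
    return run

def execute(batches, operators):
    """Execution layer: treats each batch as an opaque blob.

    It only routes blobs through the operator chain and never
    looks inside a record.
    """
    for blob in batches:
        for op in operators:
            blob = op(blob)
        yield blob

# Two hypothetical batches of donut records.
batches = [
    json.dumps([{"ppu": 0.55}, {"ppu": 1.25}]).encode("utf-8"),
    json.dumps([{"ppu": 0.69}]).encode("utf-8"),
]
cheap = filter_operator(lambda r: r["ppu"] < 1.00)
results = [json.loads(b) for b in execute(batches, [cheap])]
```

Keeping the execution layer unaware of the encoding means a different batch format would only require changes inside the operators.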

DrQL Example
local-logs = donuts.json:
{
  "id": "0003",
  "type": "donut",
  "name": "Old Fashioned",
  "ppu": 0.55,
  "sales": 300,
  "batters":
    {
      "batter":
        [
          { "id": "1001", "type": "Regular" },
          { "id": "1002", "type": "Chocolate" }
        ]
    },
  "topping":
    [
      { "id": "5001", "type": "None" },
      { "id": "5002", "type": "Glazed" },
      { "id": "5003", "type": "Chocolate" },
      { "id": "5004", "type": "Maple" }
    ]
}

SELECT
  ppu,
  typeCount =
    COUNT(*) OVER PARTITION BY ppu,
  quantity =
    SUM(sales) OVER PARTITION BY ppu,
  sales =
    SUM(ppu*sales) OVER PARTITION BY ppu
FROM local-logs donuts
WHERE donuts.ppu < 1.00
ORDER BY donuts.ppu DESC;
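
What the query above is meant to compute can be approximated in plain Python. This is a sketch under assumptions: it follows the usual SQL reading in which WHERE is applied before the windowed aggregates, and every record except "Old Fashioned" is made up for illustration.

```python
from collections import defaultdict

donuts = [
    {"id": "0003", "name": "Old Fashioned", "ppu": 0.55, "sales": 300},   # from the slide
    {"id": "9001", "name": "Hypothetical A", "ppu": 0.55, "sales": 100},  # made up
    {"id": "9002", "name": "Hypothetical B", "ppu": 0.75, "sales": 40},   # made up
    {"id": "9003", "name": "Hypothetical C", "ppu": 1.25, "sales": 10},   # made up
]

# WHERE donuts.ppu < 1.00
rows = [d for d in donuts if d["ppu"] < 1.00]

# PARTITION BY ppu
partitions = defaultdict(list)
for d in rows:
    partitions[d["ppu"]].append(d)

# One output row per input row, carrying the aggregates of its partition.
result = []
for d in rows:
    part = partitions[d["ppu"]]
    result.append({
        "ppu": d["ppu"],
        "typeCount": len(part),                              # COUNT(*)
        "quantity": sum(p["sales"] for p in part),           # SUM(sales)
        "sales": sum(p["ppu"] * p["sales"] for p in part),   # SUM(ppu*sales)
    })

# ORDER BY donuts.ppu DESC
result.sort(key=lambda r: r["ppu"], reverse=True)
```

Unlike GROUP BY, the windowed form keeps one output row per input row, so both 0.55 donuts appear in the result with the same partition totals.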

Query Components

■ User Query (DrQL) components:
● SELECT

● FROM

● WHERE

● GROUP BY

● HAVING

● (JOIN)

■ Logical operators:
● Scan

● Filter

● Aggregate

● (Join)

Logical Plan Syntax:
Operators & Expressions
query:[
  {
    op:"sequence",
    do:[
      {
        op: "scan",
        memo: "initial_scan",
        ref: "donuts",
        source: "local-logs",
        selection: {data: "activity"}
      },
      {
        op: "transform",
        transforms: [
          { ref: "donuts.quantity", expr: "donuts.sales"}
        ]
      },
      {
        op: "filter",
        expr: "donuts.ppu < 1.00"
      },
---
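
How an interpreter might walk the visible part of this plan can be sketched in Python. This is not the Drill reference interpreter: `run_sequence`, `eval_expr`, and the `sources` dictionary are hypothetical, the plan is written as a Python dict with quoted keys, the `memo` and `selection` fields are ignored, and only the two expression shapes shown on the slide are handled.

```python
def field(row, path):
    """Resolve "ref.name" against a row keyed by scan reference."""
    ref, name = path.split(".", 1)
    return row[ref][name]

def eval_expr(expr, row):
    """Evaluate only "ref.name" and "ref.name < number"."""
    if "<" in expr:
        left, right = (part.strip() for part in expr.split("<", 1))
        return field(row, left) < float(right)
    return field(row, expr.strip())

def run_sequence(plan, sources):
    rows = []
    for step in plan["do"]:
        if step["op"] == "scan":
            # Wrap each source record under the scan's reference name.
            rows = [{step["ref"]: dict(rec)} for rec in sources[step["source"]]]
        elif step["op"] == "transform":
            for row in rows:
                for t in step["transforms"]:
                    ref, name = t["ref"].split(".", 1)
                    row.setdefault(ref, {})[name] = eval_expr(t["expr"], row)
        elif step["op"] == "filter":
            rows = [row for row in rows if eval_expr(step["expr"], row)]
        else:
            raise ValueError("unsupported op: " + step["op"])
    return rows

plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "ref": "donuts", "source": "local-logs"},
        {"op": "transform",
         "transforms": [{"ref": "donuts.quantity", "expr": "donuts.sales"}]},
        {"op": "filter", "expr": "donuts.ppu < 1.00"},
    ],
}
# The second record is made up so that the filter has something to drop.
sources = {"local-logs": [{"ppu": 0.55, "sales": 300}, {"ppu": 1.25, "sales": 10}]}
out = run_sequence(plan, sources)
```

Each step consumes the rows produced by the previous one, which is the behavior the "sequence" operator name suggests.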

Logical Streaming Example

{ @id: <refnum>, op: "window-frame",
  input: <input>,
  keys: [
    <name>,...
  ],
  ref: <name>,
  before: 2,
  after: here
}

Input records: 0 1 2 3 4
Resulting frames: 0, 01, 012, 123, 234
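
The framing shown on this slide can be reproduced with a short Python sketch. It assumes that "after: here" means the frame ends at the current record, modeled here as `after=0`; the function name `window_frames` is hypothetical.

```python
def window_frames(records, before, after=0):
    """Yield one frame per record: up to `before` earlier records,
    the record itself, and up to `after` later records."""
    for i in range(len(records)):
        start = max(0, i - before)
        end = min(len(records), i + after + 1)
        yield records[start:end]

frames = list(window_frames([0, 1, 2, 3, 4], before=2))
# frames == [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

The first two frames are shorter because fewer than two earlier records exist at those positions.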

Representing a DAG

{ @id: 19, op: "aggregate",
input: 18,
type: <simple|running|repeat>,
keys: [<name>,...],
aggregations: [
{ref: <name>, expr: <aggexpr> },...
]
}
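
Because operators refer to their inputs by `@id`, a plan forms a DAG rather than a simple list. One way to derive an execution order from such references is a depth-first traversal, sketched below in Python. The function `execution_order` is hypothetical, and operators 18 and 20 are invented to give the slide's operator 19 some neighbors.

```python
def execution_order(operators):
    """Return operator ids so that every operator follows its inputs."""
    by_id = {op["@id"]: op for op in operators}
    order, visiting = [], set()

    def visit(op_id):
        if op_id in order:
            return
        if op_id in visiting:
            raise ValueError("cycle at operator %r" % op_id)
        visiting.add(op_id)
        inputs = by_id[op_id].get("input", [])
        if not isinstance(inputs, list):
            inputs = [inputs]          # a single input is given as a bare id
        for dep in inputs:
            visit(dep)
        visiting.discard(op_id)
        order.append(op_id)

    for op_id in by_id:
        visit(op_id)
    return order

operators = [
    {"@id": 19, "op": "aggregate", "input": 18},  # shape taken from the slide
    {"@id": 18, "op": "scan"},                    # made up
    {"@id": 20, "op": "filter", "input": 18},     # made up, shares input 18
]
order = execution_order(operators)
```

Operator 18 feeds two consumers here, which is the situation a strictly sequential plan cannot express.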

Multiple Inputs

{ @id: 25, op: "cogroup",
groupings: [
{ref: 23, expr: "id"}, {ref: 24, expr: "id"}
]
}
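
A cogroup collects, for each key value, the matching records from every input side by side. The Python sketch below shows that idea under assumptions: the function `cogroup` and both record lists are hypothetical stand-ins for the outputs of operators 23 and 24.

```python
from collections import defaultdict

def cogroup(groupings):
    """groupings: list of (records, key_field) pairs, one per input.

    Returns {key: [records_from_input_0, records_from_input_1, ...]}.
    """
    groups = defaultdict(lambda: [[] for _ in groupings])
    for index, (records, key_field) in enumerate(groupings):
        for record in records:
            groups[record[key_field]][index].append(record)
    return dict(groups)

# Hypothetical outputs of operators 23 and 24.
left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 1, "qty": 5}, {"id": 1, "qty": 7}, {"id": 3, "qty": 1}]

grouped = cogroup([(left, "id"), (right, "id")])
```

Keys present in only one input still appear, with an empty list for the other side, so inner and outer joins can both be derived from the grouped result.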

Physical Scan Operators

                        Scan with schema                   Scan without schema
Operator output         Protocol Buffers                   JSON-like (MessagePack)
Supported data formats  ColumnIO (column-based             JSON
                        protobuf/Dremel)                   HBase
                        RecordIO (row-based protobuf)
                        CSV
SELECT … FROM …         ColumnIO(proto URI, data URI)      Json(data URI)
                        RecordIO(proto URI, data URI)      HBase(table name)

Hadoop Integration

■ Hadoop data sources
● Hadoop FileSystem API (HDFS/MapR-FS)
● HBase

■ Hadoop data formats
● Apache Avro
● RCFile

■ MapReduce-based tools to create column-based formats

■ Table registry in HCatalog

■ Run long-running services in YARN

Where is Drill now?

■ API Definition

■ Reference Implementation for Logical Plan Interpreter
● 1:1 mapping logical/physical op
● Single JVM

■ Demo

Contribute!

■ Participate in Design discussions: JIRA, ML, Wiki, Google Doc!

■ Write a parser for your favorite QL / Domain-Specific Language

■ Write Storage Engine API implementations
● HDFS, HBase, relational, XML DB.

■ Write Physical Operators
● scan-hbase, scan-cassandra, scan-mongo
● scan-jdbc, scan-odbc, scan-jms (browse topic/queue), scan-*
● combined functionality operators: group-aggregate, ...
● sort-merge-join, hash-join, index-lookup-join

■ Etc...

Thanks, Q&A

■ Download these slides
● http://www.mapr.com/company/events/pjug-1-15-2013

■ Join the project
● drill-dev-subscribe@incubator.apache.org
● #apachedrill

■ Contact me:
● gshegalov@maprtech.com

■ Join MapR
● jobs@mapr.com
