PhillyDB Talk - Beyond Batch

Beyond Batch
Drill & Storm

Brad Anderson

©MapR Technologies

whoami
• Brad Anderson

• Solutions Architect at MapR (Atlanta)

• ATLHUG co-chair

• 'boorad' most places (twitter, github)

• banderson@maprtech.com

MapR - Faster and More Scalable

Benchmark                                          MapR 2.1.1      CDH 4.1.1       MapR Speed Increase

Terasort (1x replication, compression disabled)
  Total                                            13m 35s         26m 6s          1.9x
  Map                                              7m 58s          21m 8s          2.7x
  Reduce                                           13m 32s         23m 37s         1.7x

DFSIO throughput/node
  Read                                             1003 MB/s       656 MB/s        1.5x
  Write                                            924 MB/s        654 MB/s        1.4x

YCSB (50% read, 50% update)
  Throughput                                       36,584.4 op/s   12,500.5 op/s   2.9x
  Runtime                                          3.80 hr         11.11 hr        2.9x

YCSB (95% read, 5% update)
  Throughput                                       24,704.3 op/s   10,776.4 op/s   2.3x
  Runtime                                          0.56 hr         1.29 hr         2.3x

Benchmark hardware configuration:
10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE

          MapR/Google   Apache Hadoop
Time      54 sec        62 sec
Nodes     1,003         1,460
Disks     1,003         5,840
Cores     4,012         11,680
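
The "MapR Speed Increase" column can be re-derived from the raw figures on this slide. The following is a minimal sketch of that arithmetic, not part of the original talk; the class and method names (SpeedupCheck, timeSpeedup, rateSpeedup) are made up for illustration. Runtimes use baseline / improved, throughputs use improved / baseline.

```java
import java.util.Locale;

// Hypothetical helper that recomputes speedup ratios from the benchmark slide.
class SpeedupCheck {
    // Convert a duration written as "13m 35s" into seconds.
    static int seconds(int minutes, int secs) {
        return minutes * 60 + secs;
    }

    // For runtimes, lower is better: speedup = slower time / faster time.
    static double timeSpeedup(double baseline, double improved) {
        return baseline / improved;
    }

    // For throughput, higher is better: speedup = faster rate / slower rate.
    static double rateSpeedup(double improved, double baseline) {
        return improved / baseline;
    }

    // Format with one decimal place, as on the slide.
    static String fmt(double ratio) {
        return String.format(Locale.ROOT, "%.1fx", ratio);
    }

    public static void main(String[] args) {
        System.out.println("Terasort total: " + fmt(timeSpeedup(seconds(26, 6), seconds(13, 35))));
        System.out.println("Terasort map:   " + fmt(timeSpeedup(seconds(21, 8), seconds(7, 58))));
        System.out.println("DFSIO read:     " + fmt(rateSpeedup(1003, 656)));
        System.out.println("YCSB 50/50:     " + fmt(rateSpeedup(36584.4, 12500.5)));
    }
}
```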

Beyond Batch
HBase & M7

Apache Drill

Storm

Solr & Elastic Search


Latency Matters

Batch Interactive Streaming


Big Data Picture

                      Batch processing    Interactive analysis      Stream processing

Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending

Data volume           TBs to PBs          GBs to PBs                Continuous stream

Programming model     MapReduce           Queries                   DAG

Users                 Developers          Analysts and Developers   Developers

Google project        MapReduce           Dremel

Open source project   Hadoop MapReduce    Apache Drill              Storm, S4

Interactive SQL Initiatives for Hadoop

• SQL based OLTP
• SQL based analytics
• Real-time interactive queries
• Impala*
• Real-time interactive queries
• SQL conversion to MapReduce

* Does not work with other distributions

Google Dremel
• Interactive analysis of large-scale datasets
• Trillion records at interactive speeds
• Complementary to MapReduce
• Used by thousands of Google employees
• Paper published at VLDB 2010
• Model
  • Nested data model with schema
  • Most data at Google is stored/transferred in Protocol Buffers
  • SQL-like query language with nested data support
• Implementation
  • Column-based storage and processing
  • In-situ data access (GFS and Bigtable)
  • Tree architecture as in Web search (and databases)

Google BigQuery
• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
• Files must be in CSV format
• Nested data not supported [yet] except built-in datasets
• Schema definition required


Drill Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Columns and Rows
  • Schema and Schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Dependable
• No SPOF
• Instant recovery from crashes


DrQL Example

Input record (table t):

    DocId: 10
    Links
      Forward: 20
      Forward: 40
      Forward: 60
    Name
      Language
        Code: 'en-us'
        Country: 'us'
      Language
        Code: 'en'
      Url: 'http://A'
    Name
      Url: 'http://B'
    Name
      Language
        Code: 'en-gb'
        Country: 'gb'

Query:

    SELECT DocId AS Id,
      COUNT(Name.Language.Code) WITHIN Name AS Cnt,
      Name.Url + ',' + Name.Language.Code AS Str
    FROM t
    WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Result:

    Id: 10
    Name
      Cnt: 2
      Language
        Str: 'http://A,en-us'
        Str: 'http://A,en'
    Name
      Cnt: 0

* Example from the Dremel paper
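
To make the semantics of the query concrete, here is a minimal in-memory sketch that evaluates it against the sample record by hand. This is plain Java, not Drill or Dremel code; the classes Doc, Name and Language and the method query are hypothetical and simply mirror the nested schema on the slide. The Links group is left out because the query never reads it. WITHIN Name means the count is computed once per Name group, and the WHERE clause drops any Name whose Url is missing or does not start with "http".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical hand-written evaluation of the DrQL example; not a Drill API.
class DrqlExampleSketch {
    static class Language {
        final String code;
        final String country; // optional, may be null
        Language(String code, String country) { this.code = code; this.country = country; }
    }

    static class Name {
        final String url; // optional, may be null
        final List<Language> languages;
        Name(String url, Language... languages) {
            this.url = url;
            this.languages = Arrays.asList(languages);
        }
    }

    static class Doc {
        final int docId;
        final List<Name> names;
        Doc(int docId, Name... names) {
            this.docId = docId;
            this.names = Arrays.asList(names);
        }
    }

    // The record from the slide (Links omitted: the query does not use it).
    static Doc sampleRecord() {
        return new Doc(10,
            new Name("http://A", new Language("en-us", "us"), new Language("en", null)),
            new Name("http://B"),
            new Name(null, new Language("en-gb", "gb")));
    }

    // Returns the result rendered line by line, in the slide's layout.
    static List<String> query(Doc doc) {
        List<String> out = new ArrayList<String>();
        if (doc.docId >= 20) {
            return out; // WHERE DocId < 20
        }
        out.add("Id: " + doc.docId);
        Pattern startsWithHttp = Pattern.compile("^http");
        for (Name name : doc.names) {
            // WHERE REGEXP(Name.Url, '^http'): a Name without a matching Url is dropped.
            if (name.url == null || !startsWithHttp.matcher(name.url).find()) {
                continue;
            }
            // Name.Url + ',' + Name.Language.Code, one value per Language with a Code.
            List<String> strs = new ArrayList<String>();
            for (Language lang : name.languages) {
                if (lang.code != null) {
                    strs.add(name.url + "," + lang.code);
                }
            }
            out.add("Name");
            out.add("  Cnt: " + strs.size()); // COUNT(Name.Language.Code) WITHIN Name
            if (!strs.isEmpty()) {
                out.add("  Language");
                for (String s : strs) {
                    out.add("    Str: '" + s + "'");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : query(sampleRecord())) {
            System.out.println(line);
        }
    }
}
```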

Data Flow


Extensibility
• Nested query languages
  • DrQL
  • Mongo Query Language
  • Cascading, Hive, Pig

• Distributed execution engine
  • Extensible model (e.g., Dryad)
  • Low-latency
  • Fault tolerant


Extensibility
Nested data formats
• Pluggable model
• Column-based (ColumnIO/Dremel, Trevni, RCFile)
• Row-based (RecordIO, Avro, JSON, CSV)
• Schema (Protocol Buffers, Avro, CSV)
• Schema-less (JSON, BSON)

Scalable data sources
• Pluggable model
• Hadoop
• HBase


Drill Architecture

[Diagram: pipeline spanning Client and Cluster]

Driver --Query (text)--> Parser --AST (text)--> Compiler --Plan (text)--> Execution Engine --API--> Data Source

• Public interfaces enable extensibility
  – Add a new query language by implementing a parser
  – Add a new data source by implementing an API
  – Provide a plan directly to the execution engine to control execution
• Each level of the plan has a human readable representation
  – Facilitates debugging and development
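
The extension points listed above can be pictured with a small sketch. Everything here is hypothetical and invented for illustration (QueryParser, DataSource, Plan, Engine, the "scan:"/"limit:" step syntax and the toy "FIRST n FROM table" language); it is not Drill's actual interface. It shows a parser plugged in per language, a data source behind an API, a plan with a readable text form, and a plan handed directly to the engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of pluggable parsers and data sources; not Drill's API.
class PluggableEngineSketch {
    // A plan is an ordered list of operator descriptions with a readable text form.
    static class Plan {
        final List<String> steps;
        Plan(String... steps) { this.steps = Arrays.asList(steps); }
        String toText() { return String.join(" -> ", steps); }
    }

    // Extension point 1: a new query language is a new parser.
    interface QueryParser { Plan parse(String queryText); }

    // Extension point 2: a new data source implements this API.
    interface DataSource { List<String> scan(String table); }

    static class Engine {
        private final Map<String, QueryParser> parsers = new HashMap<String, QueryParser>();
        private final DataSource source;

        Engine(DataSource source) { this.source = source; }

        void registerLanguage(String name, QueryParser parser) { parsers.put(name, parser); }

        List<String> run(String language, String queryText) {
            return execute(parsers.get(language).parse(queryText));
        }

        // Extension point 3: a caller may supply a plan directly, bypassing every parser.
        List<String> execute(Plan plan) {
            List<String> rows = new ArrayList<String>();
            for (String step : plan.steps) {
                String[] op = step.split(":");
                if (op[0].equals("scan")) {
                    rows = new ArrayList<String>(source.scan(op[1]));
                } else if (op[0].equals("limit")) {
                    int n = Math.min(rows.size(), Integer.parseInt(op[1]));
                    rows = new ArrayList<String>(rows.subList(0, n));
                } else {
                    throw new IllegalArgumentException("unknown operator: " + step);
                }
            }
            return rows;
        }
    }

    // Toy query language for the demo: "FIRST <n> FROM <table>".
    static Plan parseToy(String queryText) {
        String[] words = queryText.trim().split("\\s+");
        return new Plan("scan:" + words[3], "limit:" + words[1]);
    }

    static Engine demoEngine() {
        Engine engine = new Engine(
            table -> Arrays.asList(table + "-row1", table + "-row2", table + "-row3"));
        engine.registerLanguage("toy", PluggableEngineSketch::parseToy);
        return engine;
    }

    public static void main(String[] args) {
        Engine engine = demoEngine();
        System.out.println(parseToy("FIRST 2 FROM t").toText());
        System.out.println(engine.run("toy", "FIRST 2 FROM t"));
        System.out.println(engine.execute(new Plan("scan:t")));
    }
}
```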

Drill Architecture (2)

[Diagram components: DrQL Clients; Cascading/Pig/... Clients; Driver; Drill Query Servers; DrQL Parser; Other Parser; Intermediate Parser; Compiler; Drill Workers]

Query Components
• Query components:
  • SELECT
  • FROM
  • WHERE
  • GROUP BY
  • HAVING
  • JOIN

• Key logical operators:
  • Scan
  • Filter
  • Aggregate
  • Join


Scan Operators
• Drill supports multiple data formats by having per-format scan operators
• Queries involving multiple data formats/sources are supported

• Fields and predicates can be pushed down into the scan operator

• Scan operators may have adaptive side-effects (database cracking)
  • Produce ColumnIO from RecordIO
  • Google PowerDrill stores materialized expressions with the data

                         Scan with schema                          Scan without schema

Operator output          Protocol Buffers                          JSON-like (MessagePack)

Supported data formats   ColumnIO (column-based protobuf/Dremel)   JSON
                         RecordIO (row-based protobuf)             HBase
                         CSV

SELECT … FROM …          ColumnIO(proto URI, data URI)             Json(data URI)
                         RecordIO(proto URI, data URI)             HBase(table name)
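
Pushing fields and predicates into the scan means the scan itself filters rows and drops unneeded columns before anything reaches the downstream operators. The sketch below illustrates the idea over in-memory rows; the scan signature, the row helper and the sample data are assumptions made for illustration and are not Drill's scan operator API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical scan operator with projection and predicate pushdown.
class ScanPushdownSketch {
    // Build one row from alternating field names and values.
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> r = new LinkedHashMap<String, Object>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            r.put((String) keyValues[i], keyValues[i + 1]);
        }
        return r;
    }

    // The scan applies the predicate and the field list itself, so rows that
    // fail the filter and columns nobody asked for never leave the scan.
    static List<Map<String, Object>> scan(List<Map<String, Object>> source,
                                          List<String> fields,
                                          Predicate<Map<String, Object>> predicate) {
        List<Map<String, Object>> out = new ArrayList<Map<String, Object>>();
        for (Map<String, Object> r : source) {
            if (!predicate.test(r)) {
                continue; // pushed-down predicate
            }
            Map<String, Object> projected = new LinkedHashMap<String, Object>();
            for (String f : fields) { // pushed-down projection
                if (r.containsKey(f)) {
                    projected.put(f, r.get(f));
                }
            }
            out.add(projected);
        }
        return out;
    }

    // Made-up sample data.
    static List<Map<String, Object>> sampleRows() {
        return Arrays.asList(
            row("DocId", 10, "Url", "http://A", "Code", "en-us"),
            row("DocId", 20, "Url", "http://C", "Code", "en"),
            row("DocId", 30, "Url", "ftp://D", "Code", "en-gb"));
    }

    public static void main(String[] args) {
        System.out.println(scan(sampleRows(), Arrays.asList("DocId", "Url"),
            r -> ((Integer) r.get("DocId")) < 25));
    }
}
```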

Execution Engine Layers
• Drill execution engine has two layers
  • Operator layer is serialization-aware
    • Processes individual records
  • Execution layer is not serialization-aware
    • Processes batches of records (blobs)
    • Responsible for communication, dependencies and fault tolerance
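
One way to picture the split between the two layers: the execution layer only moves opaque byte arrays between operators, while each operator decodes a batch, works record by record, and encodes the result again. The sketch below assumes a made-up wire format (newline-separated UTF-8 integers) and hypothetical names (runPipeline, doubleEachRecord); it is not Drill's engine code and leaves out communication and fault tolerance entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical two-layer engine: blobs at the execution layer, records at the operator layer.
class EngineLayersSketch {
    // Assumed serialization: one integer per line, UTF-8.
    static byte[] encode(int... values) {
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            sb.append(v).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    static List<Integer> decode(byte[] batch) {
        List<Integer> out = new ArrayList<Integer>();
        for (String s : new String(batch, StandardCharsets.UTF_8).split("\n")) {
            if (!s.isEmpty()) {
                out.add(Integer.parseInt(s));
            }
        }
        return out;
    }

    // Operator layer: serialization-aware, processes individual records.
    static byte[] doubleEachRecord(byte[] batch) {
        List<Integer> records = decode(batch);
        int[] doubled = new int[records.size()];
        for (int i = 0; i < doubled.length; i++) {
            doubled[i] = records.get(i) * 2;
        }
        return encode(doubled);
    }

    // Execution layer: never looks inside a batch, only routes blobs through the operators.
    static List<byte[]> runPipeline(List<byte[]> batches, List<UnaryOperator<byte[]>> operators) {
        List<byte[]> results = new ArrayList<byte[]>();
        for (byte[] batch : batches) {
            byte[] blob = batch;
            for (UnaryOperator<byte[]> op : operators) {
                blob = op.apply(blob);
            }
            results.add(blob);
        }
        return results;
    }

    // Two batches pushed through two doubling operators, decoded for display.
    static List<List<Integer>> demo() {
        List<UnaryOperator<byte[]>> ops = new ArrayList<UnaryOperator<byte[]>>();
        ops.add(EngineLayersSketch::doubleEachRecord);
        ops.add(EngineLayersSketch::doubleEachRecord);
        List<List<Integer>> decoded = new ArrayList<List<Integer>>();
        for (byte[] blob : runPipeline(Arrays.asList(encode(1, 2), encode(3)), ops)) {
            decoded.add(decode(blob));
        }
        return decoded;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```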


Hadoop Integration
• Hadoop data sources
  • Hadoop FileSystem API (HDFS/MapR-FS)
  • HBase
• Hadoop data formats
  • Apache Avro
  • RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN


Fully Open


Momentum
Over 200 people on the Drill mailing list
Over 200 members of the Bay Area Drill User Group
Over 100 participants at the first meetup in Sunnyvale, CA
• MapR, Cisco, Intel, eBay, Google, Yahoo!, LinkedIn, …
Drill meetups across the US and Europe
OpenDremel team and source code merged with Apache Drill
Simba Technologies – ODBC inventor developing a Drill ODBC driver
• Tableau, MicroStrategy, Excel, SAP Crystal Reports, …

Before Storm

Queues Workers


Example

(simplified)

Storm

Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
“Just works”

Concepts


Streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Unbounded sequence of tuples

Spouts

public interface ISpout extends Serializable {
    void open(Map conf,
              TopologyContext context,
              SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}
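
The ack and fail callbacks are what make guaranteed processing possible: a spout can remember each emitted message until it is acked and re-queue it when it fails. Below is a minimal stand-alone sketch of that bookkeeping. It does not implement the real ISpout interface or use any Storm classes; ReplayingSpoutSketch, its emitted list (standing in for the collector) and its long message ids are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical replay-on-failure bookkeeping, modeled on the ISpout method names.
class ReplayingSpoutSketch {
    private final Deque<String> queue = new ArrayDeque<String>();
    private final Map<Long, String> pending = new HashMap<Long, String>();
    private long nextId = 0;

    // Stands in for the output collector: every emitted message is recorded here.
    final List<String> emitted = new ArrayList<String>();

    ReplayingSpoutSketch(List<String> messages) {
        queue.addAll(messages);
    }

    // Emit the next message, if any, and remember it until it is acked.
    // Returns the message id, or -1 when there is nothing to emit.
    long nextTuple() {
        String msg = queue.poll();
        if (msg == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, msg);
        emitted.add(msg);
        return id;
    }

    // Fully processed downstream: forget the message.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    // Processing failed: put the message back so a later nextTuple() replays it.
    void fail(long msgId) {
        String msg = pending.remove(msgId);
        if (msg != null) {
            queue.addLast(msg);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ReplayingSpoutSketch spout = new ReplayingSpoutSketch(Arrays.asList("a", "b"));
        long first = spout.nextTuple();
        long second = spout.nextTuple();
        spout.ack(first);
        spout.fail(second);          // "b" goes back on the queue
        spout.ack(spout.nextTuple()); // "b" is emitted again and acked
        System.out.println(spout.emitted);
    }
}
```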


Bolts
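// Package names below are the pre-Apache backtype.storm ones;
// later Apache Storm releases moved them to org.apache.storm.
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;

public class DoubleAndTripleBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf,
                        TopologyContext context,
                        OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        int val = input.getInteger(0);
        // Anchor the new tuple to the input, then ack the input.
        _collector.emit(input, new Values(val*2, val*3));
        _collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("double", "triple"));
    }
}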


Trident
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"),
              new Split(),
              new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(),
                             new Count(),
                             new Fields("count"))
        .parallelismHint(6);
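
For readers new to Trident, the topology above computes a running word count: split each sentence into words, group by word, and keep a count per word. The same logic over a finite list, written as ordinary single-process Java with no Storm dependency, might look like the sketch below. The class name WordCountSketch and the sample sentences are made up for illustration; unlike the Trident version, nothing here is distributed, streamed or persisted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process equivalent of split -> groupBy -> count.
class WordCountSketch {
    static Map<String, Long> count(List<String> sentences) {
        Map<String, Long> counts = new TreeMap<String, Long>();
        for (String sentence : sentences) {
            // Corresponds to each(..., new Split(), ...): one word per output tuple.
            for (String word : sentence.split(" ")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Corresponds to groupBy("word") + Count(): one running total per word.
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList(
            "the cow jumped over the moon",
            "the man went to the store")));
    }
}
```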


Storm

[Diagram components: Raw Data; Queue; Storm (realtime processes); Parallel Cluster Ingest; Hadoop (batch processes); Apps; Business Value]

Storm

[Diagram components: Raw Data; Queue; TailSpout; Storm (realtime processes); Georg; Hadoop (batch processes); Apps; Business Value]

Get Involved!
• Slides
  • http://slideshare.net/boorad/phillydb

• Join the Apache Drill mailing list
  • drill-dev-subscribe@incubator.apache.org

• Watch TailSpout & Georg development
  • https://github.com/{tdunning | boorad | rlankenau}/mapr-spout

• Join MapR
  • jobs@mapr.com
  • banderson@maprtech.com

• @boorad