Hcj 2013-01-21

The Power of Hadoop to
Transform Business

©MapR Technologies - Confidential

My Background

 University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big

 Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG

 MapR
 Founding member of Apache Drill


MapR Technologies

 Silicon Valley Startup
– Top investors
– Top technical and management team
• Google, Microsoft, EMC, NetApp, Oracle
 Enterprise quality distribution for Hadoop
 Many extensions to basic Hadoop function
 Strong supporter of Apache Drill


Philosophy First

What is History?


The study of the past

(what came before now)


What is the future?

(it comes after now)


But the future also
has a past!


Do you remember the
future?


Some things
turned out
as expected


Many things
are
different!


Hadoop has
a history


Hadoop also
has a
future


The Old Future of Hadoop

 Map-reduce and HDFS
– more and more, but not really different

 Eco-system additions
– Simpler programming (Hive and Pig)
– Key-value store
– Ad hoc query

 Stands apart from other computing
– Required by HDFS and other limitations


The New Future of Hadoop

 Real-time processing
– Combines real-time and long-time

 Integration with traditional IT
– No need to stand apart

 Integration with new technologies
– Solr, Node.js, Twisted all should interface directly

 Fast and flexible computation
– Drill logical plan language


Example #1
Search Abuse


History matrix

One row per user

One column per thing


Recommendation based on
cooccurrence

Cooccurrence gives item-item
mapping

One row and column per thing


Cooccurrence matrix can also be
implemented as a search index
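
As a rough illustration of the idea on these slides, the sketch below builds item-item cooccurrence counts from a per-user history and scores unseen items for one user, which is roughly what a search query against indexed cooccurrence data would do. It uses raw counts only; the function names and sample data are hypothetical, and this is not Mahout's or Solr's implementation.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """history maps user -> set of items; returns item -> {other item: count}."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        # Every pair of items seen by the same user cooccurs once.
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, user_items, top_n=3):
    """Score unseen items by summed cooccurrence with the user's own items."""
    scores = defaultdict(int)
    for seen in user_items:
        for other, n in counts[seen].items():
            if other not in user_items:
                scores[other] += n
    # Highest score first; ties broken by item name for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical history: one row per user, one column (set member) per thing.
history = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c", "d"},
}
counts = cooccurrence(history)
print(recommend(counts, {"a"}))
```

In a search-engine deployment, each item's strongest cooccurring items would be stored as a field on that item's document, and the user's recent history would become the query.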


[Diagram, offline indexing: Complete history → Cooccurrence (Mahout) → Solr indexing (three Solr indexers) → Index shards. Item meta-data is a second input.]

[Diagram, online recommendation: User history → Web tier → Solr search (three Solr nodes) → Index shards. Item meta-data is a second input.]


Objective Results

 At a very large credit card company

 History is all transactions, all web interaction

 Processing time cut from 20 hours per day to 3

 Recommendation engine load time decreased from 8 hours to 3
minutes


Example #2
Web Technology


[Diagram, built up over three slides]
Real-time data → Fast analysis (Storm) → Analytic output
Raw logs → Large analysis (map-reduce) → Analytic output
Browser query → Presentation tier (d3 + node.js) → Analytic output


Objective Results

 Real-time + long-time analysis is seamless

 Web tier can be rooted directly on Hadoop cluster

 No need to move data


Example #3
Apache Drill


Big Data Processing – Hadoop

                     Batch processing
Query runtime        Minutes to hours
Data volume          TBs to PBs
Programming model    MapReduce
Users                Developers
Google project       MapReduce
Open source project  Hadoop MapReduce


Big Data Processing – Hadoop and Storm

                     Batch processing   Stream processing
Query runtime        Minutes to hours   Never-ending
Data volume          TBs to PBs         Continuous stream
Programming model    MapReduce          DAG (pre-programmed)
Users                Developers         Developers
Open source project  Hadoop MapReduce   Storm or Apache S4


Big Data Processing – The missing part

                     Batch processing   Interactive analysis   Stream processing
Query runtime        Minutes to hours                          Never-ending
Data volume          TBs to PBs                                Continuous stream
Programming model    MapReduce                                 DAG (pre-programmed)
Users                Developers                                Developers
Open source project  Hadoop MapReduce                          Storm and S4


Big Data Processing – The missing part

                     Batch processing   Interactive analysis      Stream processing
Query runtime        Minutes to hours   Milliseconds to minutes   Never-ending
Data volume          TBs to PBs         GBs to PBs                Continuous stream
Programming model    MapReduce          Queries (ad hoc)          DAG (pre-programmed)
Users                Developers         Analysts and developers   Developers
Open source project  Hadoop MapReduce                             Storm and S4


Big Data Processing

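                     Batch processing   Interactive analysis      Stream processing
Query runtime        Minutes to hours   Milliseconds to minutes   Never-ending
Data volume          TBs to PBs         GBs to PBs                Continuous stream
Programming model    MapReduce          Queries (ad hoc)          DAG (pre-programmed)
Users                Developers         Analysts and developers   Developers
Google project       MapReduce          Dremel
Open source project  Hadoop MapReduce   Apache Drill              Storm and S4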

Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)


Simple Architecture

[Architecture diagram. Labels: Query language, Interface, Logical Language, Transform, Physical Plan, Optimize, Execute]


Standard Interfaces

[Architecture diagram. Labels: Query language, Interface, Logical Language, Transform, Physical Plan, Optimize, Execute. Standard interfaces marked on the diagram: SQL 2003, Drill logical syntax, Scanner API]


Logical Plan Syntax:

query:[
  {
    op:"sequence", do:[
      { op: "scan",
        memo: "initial_scan",
        ref: "donuts",
        source: "local-logs",
        selection: {data: "activity"}
      },
      { op: "transform",
        transforms: [ { ref: "donuts.quantity", expr: "donuts.sales"} ]
      },
      { op: "filter",
        expr: "donuts.ppu < 1.00"
      },
      …
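
To make the semantics of a "sequence" of operators more concrete, here is a minimal sketch of an interpreter for the three operators shown on the slide. It is an illustration only, not Drill's API: `run_sequence`, `get_path`, `set_path`, the sample rows, and the use of a Python callable under a `predicate` key (in place of parsing the `expr` string) are all assumptions.

```python
def get_path(record, path):
    """Read a dotted field path such as 'donuts.sales' from nested dicts."""
    value = record
    for part in path.split("."):
        value = value[part]
    return value


def set_path(record, path, value):
    """Write a dotted field path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    target = record
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value


def run_sequence(ops, sources):
    """Apply each operator in order to the output of the previous one."""
    records = []
    for op in ops:
        kind = op["op"]
        if kind == "scan":
            # Each source row becomes a record nested under the scan's "ref".
            records = [{op["ref"]: dict(row)} for row in sources[op["source"]]]
        elif kind == "transform":
            for record in records:
                for t in op["transforms"]:
                    # Only plain field references are supported as expressions.
                    set_path(record, t["ref"], get_path(record, t["expr"]))
        elif kind == "filter":
            records = [r for r in records if op["predicate"](r)]
        else:
            raise ValueError(f"unsupported op: {kind}")
    return records


# Hypothetical data standing in for the "local-logs" source.
sources = {"local-logs": [
    {"name": "plain", "ppu": 0.55, "sales": 35},
    {"name": "filled", "ppu": 1.25, "sales": 14},
]}
plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform",
     "transforms": [{"ref": "donuts.quantity", "expr": "donuts.sales"}]},
    {"op": "filter", "predicate": lambda r: r["donuts"]["ppu"] < 1.00},
]
result = run_sequence(plan, sources)
print(result)
```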


Logical Streaming Example

Input records: 0 1 2 3 4
Window frames: 0, 01, 012, 123, 234

{ @id: <refnum>, op: "window-frame",
  input: <input>,
  keys: [
    <name>,...
  ],
  ref: <name>,
  before: 2,
  after: here
}
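
The window frames pictured on this slide can be reproduced with a short sketch. It assumes that `before: 2` means "include up to two preceding records" and that `after: here` means the frame ends at the current record; the function name `window_frames` is hypothetical, and partitioning by `keys` is left out.

```python
def window_frames(records, before, after=0):
    """Return one frame per record: up to `before` earlier records,
    the record itself, and up to `after` later records."""
    frames = []
    for i in range(len(records)):
        start = max(0, i - before)
        end = min(len(records), i + after + 1)
        frames.append(records[start:end])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2)
print(frames)  # [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```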


Logical Plan

scan-json "table-1"

filter exp1

flatten

aggregate exp2


Execution Plan

scan-json "table-1" scan-json "table-1" scan-json "table-1"

filter exp1 filter exp1 filter exp1

flatten flatten flatten
node1 node2 node3

aggregate exp2
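
The shape of this execution plan (the same scan, filter, flatten pipeline on each node, followed by a single aggregate) can be sketched as below. This is a toy model under stated assumptions: threads stand in for nodes, and `node_pipeline`, `execute`, and the sample partitions are hypothetical names and data, not part of Drill.

```python
from concurrent.futures import ThreadPoolExecutor


def node_pipeline(rows, predicate):
    """Per-node work: filter the scanned rows, then flatten their item lists."""
    kept = [row for row in rows if predicate(row)]
    return [item for row in kept for item in row["items"]]


def execute(partitions, predicate, aggregate):
    """Run the per-node pipeline on every partition, then aggregate once."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(
            pool.map(lambda rows: node_pipeline(rows, predicate), partitions)
        )
    return aggregate(item for partial in partials for item in partial)


# Hypothetical data: one list of rows per node (node1, node2, node3).
partitions = [
    [{"ok": True, "items": [1, 2]}],
    [{"ok": False, "items": [10]}],
    [{"ok": True, "items": [3]}, {"ok": True, "items": [4, 5]}],
]
total = execute(partitions, lambda row: row["ok"], sum)
print(total)  # 15
```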


Representing a DAG

18

aggregate exp2

19
{ @id: 19, op: "aggregate",
input: 18,
type: <simple|running|repeat>,
keys: [<name>,...],
aggregations: [
{ref: <name>, expr: <aggexpr> },...
]
}
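
Because each operator names its input by `@id`, a plan is a DAG that can be evaluated by following those references. The sketch below shows one way to do that with memoized recursion; `evaluate`, the handler functions, and the inline `rows` field on the scan operator are assumptions made for illustration and are not Drill's execution engine.

```python
def evaluate(ops, target_id, handlers):
    """Evaluate the operator with id `target_id`, resolving inputs by @id."""
    by_id = {op["@id"]: op for op in ops}
    cache = {}

    def run(op_id):
        if op_id not in cache:
            op = by_id[op_id]
            inputs = op.get("input")
            if inputs is None:
                upstream = []
            elif isinstance(inputs, list):
                upstream = [run(i) for i in inputs]
            else:
                upstream = [run(inputs)]
            # Each operator runs once, even if several operators consume it.
            cache[op_id] = handlers[op["op"]](op, *upstream)
        return cache[op_id]

    return run(target_id)


def scan(op):
    return op["rows"]


def aggregate(op, rows):
    """Sum the hypothetical field 'v' grouped by the operator's keys."""
    totals = {}
    for row in rows:
        key = tuple(row[name] for name in op["keys"])
        totals[key] = totals.get(key, 0) + row["v"]
    return totals


ops = [
    {"@id": 18, "op": "scan",
     "rows": [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]},
    {"@id": 19, "op": "aggregate", "input": 18, "keys": ["k"]},
]
result = evaluate(ops, 19, {"scan": scan, "aggregate": aggregate})
print(result)
```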


Non-SQL queries
[Diagram: a plan containing non-SQL operators. Labels: scan-json "table-1" (two inputs), streaming k-means, ball k-means, k, aggregate exp2, k-means join, cluster features]




The future is
not what we
thought it
would be


It is better!


Get Involved!

Tweet:
#hcj13w
#mapr
@ted_dunning


Get Involved!

 Download these slides
– http://www.mapr.com/company/events/hcj-01-21-2013

 Join the Drill project
– drill-dev-subscribe@incubator.apache.org
– #apachedrill

 Contact me:
– tdunning@maprtech.com
– tdunning@apache.org
– @ted_dunning

 Join MapR (in Japan!)
– jobs@mapr.com

