The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

© 2014 MapR Technologies
The Future of Hadoop: Data Agility

Data is doubling in size every two years
[Chart: total data in the world, in zettabytes]
2011: 1.8 zettabytes
2013: 4.4 zettabytes
2020: 44 zettabytes
IDC estimates that in 2020, there will be 44 zettabytes of data in the world
Source: IDC Digital Universe
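As a rough check on the "doubling every two years" headline, the short Python sketch below computes the doubling time implied by two data points, assuming steady exponential growth between them. The pairing of years to values (1.8 ZB in 2011, 4.4 ZB in 2013, 44 ZB in 2020) is read from the chart and is an assumption; the function name is illustrative.

```python
import math

def doubling_time(v0, v1, years):
    """Years needed to double, assuming steady exponential growth from v0 to v1."""
    return years * math.log(2) / math.log(v1 / v0)

# Figures in zettabytes as read from the slide; the year/value pairing is an assumption.
print(round(doubling_time(4.4, 44.0, 2020 - 2013), 2))  # 2013 -> 2020
print(round(doubling_time(1.8, 4.4, 2013 - 2011), 2))   # 2011 -> 2013
```

Under these assumptions, the 2013 to 2020 projection works out to a doubling time of a little over two years, which is in line with the slide's headline.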

Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, structured vs. unstructured data, 1980 to 2020]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Unstructured Data is Ubiquitous
Social Media
Messages
Audio
Sensors
Mobile Data
Email
Clickstream

Hadoop Adoption is Exploding
[Chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14]

The MapR Distribution for Hadoop
Best Product
Exponential Growth
• 3X bookings Q1 ‘13 to Q1 ‘14
• 80% of accounts expand 3X
• 90% software licenses
• <1% lifetime churn
• >$1B in incremental revenue generated by 1 customer
500+ Customers
Big Data
Riding the Wave with Hadoop
The Big Data Platform of Choice

360° Customer View
5PB
CUSTOMER DATA

Largest Biometric Database in the World
1.2B PEOPLE

The Future of Hadoop: Data Agility

Distance to Data
Existing approaches require a middleman (IT) between the business (analysts, developers) and the data:
• MapReduce: business, then “plumbing” development, then data
• Hive and other SQL-on-Hadoop: business, then modeling and transformations, then data

Real-World Data Modeling and Transformations

Distance to Data
Existing approaches require a middleman (IT):
• MapReduce: business, then “plumbing” development, then data
• Hive and other SQL-on-Hadoop: business, then modeling and transformations, then data
Data Agility:
• Business (analysts, developers) works directly with the data

Why Improve Distance to Data?
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
1. Improve time to value
2. Reduce the burden on IT

APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
40+ contributors
150+ years of experience building databases and distributed systems

Evolution Towards Self-Service Data Exploration
Approach                        Data Modeling and Transformation   Data Visualization
Traditional BI w/ RDBMS         IT-driven                          IT-driven
Self-Service BI w/ RDBMS        IT-driven                          Self-service
SQL-on-Hadoop                   IT-driven                          Self-service
Self-Service Data Exploration   Not needed                         Self-service
Zero-day analytics

(1) Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
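To make "self-describing" concrete, here is a minimal Python sketch that derives field paths and types from the two example records themselves, with no schema supplied from outside. It is only an illustration of the idea, not how Drill is implemented; the `describe` helper and the `list<...>` type notation are made up for this example.

```python
def describe(value, prefix=""):
    """Return {dotted_path: type_name} for one self-describing record."""
    if isinstance(value, dict):
        schema = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            schema.update(describe(child, path))
        return schema
    if isinstance(value, list):
        # Summarize the element types found in the list, e.g. "list<str>".
        inner = sorted({type(item).__name__ for item in value})
        return {prefix: f"list<{'|'.join(inner)}>"}
    return {prefix: type(value).__name__}

# The two records from the slide.
michael = {"name": {"first": "Michael", "last": "Smith"},
           "hobbies": ["ski", "soccer"], "district": "Los Altos"}
jennifer = {"name": {"first": "Jennifer", "last": "Gates"},
            "hobbies": ["sing"], "preschool": "CCLC"}

for record in (michael, jennifer):
    print(describe(record))
```

The two records yield different field sets (one has district, the other preschool), which is exactly the situation a fixed, pre-declared table definition struggles with.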

(2) Drill’s Data Model is Flexible
[Chart: data formats placed along two axes of flexibility, from fixed schema to schema-less and from flat to complex]
Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}

(3) Drill Supports Schema Discovery On-The-Fly
1. Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
2. Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
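The contrast between the two approaches can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Drill's mechanism: `read_declared` stands in for a reader that requires a schema up front, `read_discovering` for one that learns fields as records arrive, and both function names are hypothetical.

```python
def read_declared(records, schema):
    """Schema declared in advance: reject any record with an undeclared field."""
    for record in records:
        unknown = set(record) - set(schema)
        if unknown:
            raise ValueError(f"undeclared fields: {sorted(unknown)}")
        yield record

def read_discovering(records):
    """Schema on the fly: accept every record and grow the known field list."""
    seen = []
    for record in records:
        new = [name for name in record if name not in seen]
        seen.extend(new)
        yield record, new  # 'new' lists the fields first observed in this record

# The two example records used earlier in the deck.
RECORDS = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

for record, new_fields in read_discovering(RECORDS):
    print("new fields:", new_fields)

try:
    list(read_declared(RECORDS, ["name", "hobbies", "district"]))
except ValueError as err:
    print("declared-schema reader failed:", err)
```

With a declared schema, the second record cannot be read until someone updates the schema; with discovery, the new preschool field simply shows up.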

Quick Tour
Self-Service Data Exploration with Apache Drill


Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
The table reference in the FROM clause has three parts:
• A storage engine instance (dfs1): DFS, HBase, Hive Metastore/HCatalog
• A workspace (logs): sub-directory, Hive database
• A table (`AppServerLogs/2014/Jan/p001.parquet`): pathnames, HBase table, Hive table
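The three-part naming in the FROM clause (storage engine instance, workspace, table) can be illustrated with a small Python sketch that splits such a reference apart. This is not Drill's parser; the regular expression and the `parse_table_ref` helper are assumptions made for illustration, and they only handle the backquoted form shown on the slide.

```python
import re

# storage.workspace.`table`; the backquoted table part may contain dots and slashes.
TABLE_REF = re.compile(r"^(?P<storage>\w+)\.(?P<workspace>\w+)\.`(?P<table>[^`]+)`$")

def parse_table_ref(ref):
    """Split a storage.workspace.`table` reference into its three parts."""
    match = TABLE_REF.match(ref)
    if match is None:
        raise ValueError(f"not a storage.workspace.`table` reference: {ref!r}")
    return match.group("storage"), match.group("workspace"), match.group("table")

print(parse_table_ref("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`"))
```

Because the source is named in the query itself, no table needs to be registered before the file can be queried.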

Query Directory Trees
-- Query file: How many errors per level in Jan 2014?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;
-- Query directory sub-tree: How many errors per level?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;
-- Query some partitions: How many errors per level by month from 2012?
SELECT errorLevel, dirs[2], count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
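The partition query treats directory names as queryable values. The Python sketch below mimics that idea on in-memory data: it derives a dirs list from each file path and filters on the year before touching the file's contents. The file names and error levels are hypothetical, and the indexing (dirs[1] as year, dirs[2] as month) is inferred from the query on the slide.

```python
from collections import Counter
from pathlib import PurePosixPath

def dirs_of(path):
    """Directory names of a file path, relative to the workspace root."""
    return [part for part in PurePosixPath(path).parent.parts if part != "/"]

# Hypothetical files and the error levels recorded in each one.
FILES = {
    "/AppServerLogs/2011/Dec/part0001.parquet": [3, 3],
    "/AppServerLogs/2012/Jan/part0001.parquet": [1, 2, 2],
    "/AppServerLogs/2014/Jan/part0001.parquet": [2, 5],
}

def errors_per_level_by_month(files, from_year):
    """Count errors per (level, month), skipping files from earlier years."""
    counts = Counter()
    for path, levels in files.items():
        dirs = dirs_of(path)  # e.g. ["AppServerLogs", "2014", "Jan"]
        if int(dirs[1]) < from_year:
            continue  # the whole file is skipped based on its path alone
        for level in levels:
            counts[(level, dirs[2])] += 1
    return dict(counts)

print(errors_per_level_by_month(FILES, 2012))
```

As in the slide's query, grouping is by month name only, so January 2012 and January 2014 fall into the same group.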

Works with HBase and Embedded Blobs
-- Query an HBase table directly (no schemas)
SELECT cf1.month, cf1.year
FROM hbase.table1;
-- Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
)
GROUP BY profile.name;
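What decoding an embedded blob amounts to can be shown with plain Python: the cell holds bytes, and the JSON inside is parsed at read time. The rows below are hypothetical stand-ins for an HBase table, and taking the count as the number of listed children is one reading of the slide's count(...) expression, not a statement about SQL semantics.

```python
import json
from collections import Counter

# Hypothetical HBase-like rows: row key -> column family -> qualifier -> bytes.
USERS = {
    b"u1": {"cf1": {"profileBlob": b'{"name": "Ann", "children": ["Bo", "Cy"]}'}},
    b"u2": {"cf1": {"profileBlob": b'{"name": "Dee", "children": []}'}},
}

def children_per_name(rows):
    """Decode each JSON blob and total the listed children per profile name."""
    counts = Counter()
    for families in rows.values():
        profile = json.loads(families["cf1"]["profileBlob"])  # bytes -> dict
        counts[profile["name"]] += len(profile.get("children", []))
    return dict(counts)

print(children_per_name(USERS))
```

Nothing about the blob's structure is declared ahead of time; the fields are found when the bytes are parsed.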

Combine Data Sources on the Fly
-- Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs,
     dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND
      logs.errorLevel > 5;
-- Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
SELECT users.name, count(*) as tweetCount
FROM hive.social.tweets tweets,
     hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;
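The first join can be pictured as a simple hash join over two differently shaped sources. The Python sketch below uses hypothetical in-memory stand-ins for the log directory and the profiles file; the field names follow the slide's query, while the sample values and the `users_with_errors` helper are invented for illustration.

```python
# Hypothetical stand-ins for the log directory and the JSON profiles file.
LOGS = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 2, "errorLevel": 3},
    {"uid": 1, "errorLevel": 9},
    {"uid": 3, "errorLevel": 6},
]
PROFILES = [
    {"id": 1, "name": "Ann", "emails": {"work": "ann@example.com"}},
    {"id": 2, "name": "Bo", "emails": {"work": "bo@example.com"}},
    {"id": 3, "name": "Cy", "emails": {}},
]

def users_with_errors(logs, profiles, min_level=5):
    """Distinct (name, work email) for users with an error above min_level."""
    by_id = {profile["id"]: profile for profile in profiles}  # build side
    result = set()
    for entry in logs:  # probe side
        if entry["errorLevel"] > min_level and entry["uid"] in by_id:
            user = by_id[entry["uid"]]
            result.add((user["name"], user["emails"].get("work")))
    return sorted(result, key=lambda row: row[0])

print(users_with_errors(LOGS, PROFILES))
```

A profile without a work email still appears in the result, with no value for that field, rather than causing the join to fail.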

Summary
• Enable rapid data exploration and application development while
reducing the burden on IT
• Apache Drill beta coming soon
– Email tshiran@mapr.com
• Get involved
– Download and play: http://incubator.apache.org/drill/
– Ask questions: drill-user@incubator.apache.org
– Contribute: http://github.com/apache/incubator-drill/

Thank You
@mapr maprtech
tshiran@mapr.com
Tomer Shiran, VP Product Management
MapRTechnologies
maprtech
mapr-technologies
