Webinar: Selecting the Right SQL-on-Hadoop Solution

Selecting the Right SQL-on-Hadoop Solution:
What You Need to Know
© 2014 MapR Technologies

Rick F. van der Lans
Rick F. van der Lans is an independent consultant, lecturer, and author. He
specializes in data warehousing, business intelligence, database technology,
and data virtualization. He is managing director of R20/Consultancy B.V. Rick
has been involved in various projects in which data warehousing and
integration technology were applied.
Rick van der Lans is an internationally acclaimed lecturer. He has lectured
professionally for the last twenty-five years in many European and
Middle Eastern countries, the USA, South America, and Australia. He has been
invited by several major software vendors to present keynote speeches.
He is the author of several books on computing, including his new Data
Virtualization for Business Intelligence Systems. Some of these books are
available in different languages. His popular Introduction to
SQL is available in English, Dutch, Italian, Chinese, and German and is sold
worldwide. He also authored The SQL Guide to Ingres and SQL for MySQL
Developers.
As an author for BeyeNetwork.com, a writer of whitepapers, chairman of the
annual European Enterprise Data and Business Intelligence Conference, and
a columnist for a few IT magazines, he has close contacts with many
vendors.
R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl. You can get in touch with Rick via:
Email: rick@r20.nl
Twitter: @Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223
Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands

Self-Service Data Exploration with Apache Drill

The MapR Distribution including Apache Hadoop
• Top Ranked
• Exponential Growth
• 500+ Customers
• Premier Investors
• >2x annual bookings
• 90% software licenses
• 80% of accounts expand 3X
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer

The Power of the Open Source Community
APACHE HADOOP AND OSS ECOSYSTEM
EXECUTION ENGINES
• Batch: Spark, Cascading, Pig, MapReduce v1 & v2, Tez*
• Streaming: Spark Streaming, Storm*
• NoSQL & Search: HBase, Solr, Accumulo*
• SQL: Drill, Shark, Impala, Hive
• ML, Graph: GraphX, MLLib, Mahout
• YARN
DATA GOVERNANCE AND OPERATIONS
• Provisioning & coordination: Savannah*, Juju, Whirr, ZooKeeper
• Workflow & Data Governance: Falcon*, Oozie
• Data Integration & Access: Hue, HttpFS, Flume, Sqoop
• Security: Knox*, Sentry*
MapR Data Platform
• MapR-FS, MapR-DB
• Management
* Certification/support planned for 2014

Today’s Data Comes in Different Shapes…
Social Media
Messages
Audio
Sensors
Mobile Data
Email
Clickstream

Real-World Data Modeling and Transformations

Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration (zero-day analytics) | Not needed | Self-service

Why Decrease the Distance to Data?
Improve time to value
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
Reduce the burden on IT
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users

APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
• 40+ contributors
• 150+ years of experience building databases and distributed systems

Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data

MapR Optimized Data Architecture
Sources
• RELATIONAL, SAAS, MAINFRAME
• DOCUMENTS, EMAILS
• BLOGS, TWEETS, LINK DATA
• LOG FILES, CLICKSTREAMS
• SENSORS
MAPR DISTRIBUTION FOR HADOOP
• Streaming (Spark Streaming, Storm)
• Batch / Search (MR, Spark, Hive, Pig, …)
• NoSQL ODBMS (HBase, Accumulo, …)
• MapR Data Platform: MapR-DB, MapR-FS
DATA WAREHOUSE
Data Movement
Data Access
Analytics
• Search
• Schema-less data exploration
• BI, reporting
• Ad-hoc integrated analytics
Data Transformation, Enrichment and Integration
Machine Learning
Operational Apps
• Recommendations
• Fraud Detection
• Logistics

(1) Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
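To make "self-describing" concrete, here is a minimal Python sketch that derives a schema from the two sample records alone, with no schema declared in advance. It illustrates the idea only and is not how Drill is implemented. The helper names `field_paths` and `discover_schema` are assumptions made for this example.

```python
import json

# The two sample records from the slide, written as valid JSON.
records = [
    json.loads('{"name": {"first": "Michael", "last": "Smith"},'
               ' "hobbies": ["ski", "soccer"], "district": "Los Altos"}'),
    json.loads('{"name": {"first": "Jennifer", "last": "Gates"},'
               ' "hobbies": ["sing"], "preschool": "CCLC"}'),
]


def field_paths(value, prefix=""):
    """Yield a dotted path for every leaf field; lists count as leaves."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix


def discover_schema(rows):
    """Return the sorted union of field paths seen across all rows."""
    paths = set()
    for row in rows:
        paths.update(field_paths(row))
    return sorted(paths)


schema = discover_schema(records)
print(schema)
```

The two records share `name` and `hobbies` but differ in `district` and `preschool`; the discovered schema is the union of the fields seen in the data.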

(2) Drill’s Data Model is Flexible
Flexibility along two axes:
• Schema: from fixed schema to schema-less
• Structure: from flat to complex
Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}

Quick Tour
Self-Service Data Exploration with Apache Drill

Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
A storage engine instance (dfs1 in the query above)
- DFS
- HBase
- Hive Metastore/HCatalog
A workspace (logs in the query above)
- Sub-directory
- Hive database
A table (the backquoted path in the query above)
- pathnames
- HBase table
- Hive table

Query Directory Trees
# Query file: How many errors per level in Jan 2014?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;
# Query directory sub-tree: How many errors per level?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;
# Query some partitions: How many errors per level by month from 2012?
SELECT errorLevel, dirs[2], count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
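The partition query above treats directory levels under the queried root as filterable values. A minimal Python sketch of that idea follows; it is an illustration, not Drill's implementation. The helper `dir_levels` and the 2011 and 2012 paths are hypothetical, and the sketch indexes levels from 0 in the Python style, whereas the slide writes `dirs[1]` for the year.

```python
from pathlib import PurePosixPath


def dir_levels(root, file_path):
    """Return the directory names between root and the file, outermost first."""
    relative = PurePosixPath(file_path).relative_to(root)
    return list(relative.parts[:-1])  # drop the file name itself


# Only the 2014 path appears on the slides; the others are made up.
files = [
    "/AppServerLogs/2011/Dec/part0001.parquet",
    "/AppServerLogs/2012/Jan/part0001.parquet",
    "/AppServerLogs/2014/Jan/part0001.parquet",
]

# Keep only files whose first directory level (the year) is 2012 or later,
# which mirrors the WHERE clause on the year directory.
selected = [
    path for path in files
    if int(dir_levels("/AppServerLogs", path)[0]) >= 2012
]
print(selected)
```

Because the filter uses only the path, files in non-matching directories can be skipped without being opened.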

Works with HBase and Embedded Blobs
# Query an HBase table directly (no schemas)
SELECT cf1.month, cf1.year
FROM hbase.table1;
# Embedded JSON value inside column profileBlob inside column family cf1 of
# the HBase table users
SELECT profile.name, count(profile.children)
FROM (
SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
FROM hbase.users
)
GROUP BY profile.name;

Combine Data Sources on the Fly
# Join log directory with JSON file (user profiles) to identify the
# name and email address for anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs,
dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND
logs.errorLevel > 5;
# Join a Hive table and an HBase table (without Hive metadata)
# to determine the number of tweets per user
SELECT users.name, count(*) as tweetCount
FROM hive.social.tweets tweets,
hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;

SQL Technologies Available on MapR
Feature | Drill 0.5 | Hive 0.13 w/ Tez | Impala 1.x | Shark 0.9 | Vertica
Category | Data Exploration | Simple SQL-on-Hadoop (schema) | Simple SQL-on-Hadoop (schema) | Simple SQL-on-Hadoop (schema) | Advanced SQL & Analytics
Latency | Low | Medium | Low | Low (for in-memory), Med (for on disk) | Low
Files | Yes (all Hive formats) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (all Hive file formats) | Proprietary; all Hive file formats can be used as external tables
HBase/MapR-DB | Yes | Yes, performance issues | Yes, performance issues | Yes, performance issues | No
Hive compatibility | High | High | Medium | High | NA
Schema | Schema-less or Hive or HBase | Hive | Hive | Hive | Proprietary or Hive
SQL support | ANSI SQL | HiveQL | HiveQL (subset) | HiveQL | ANSI SQL + advanced analytics
Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC, ADO.NET, …
Large datasets | Yes | Yes | Limited | Yes | Yes
Nested data | Yes | Limited | No | Limited | Limited
Machine learning | No | No | No | Yes | No
Transactions | No | No | No | No | Yes
Optimizer | Limited | Limited | Limited | Limited | High
Concurrency | Medium | Medium | Medium | Limited | High

Q&A
Engage with us!
• SQL-on-Hadoop engines explained
– http://info.mapr.com/wp-sql-on-hadoop-engines-explained
• Get demo and tutorials on Apache Drill
– https://www.mapr.com/products/apache-drill
• Apache Drill 0.5 available now
– Download and play: http://incubator.apache.org/drill/
– Ask questions: drill-user@incubator.apache.org
– Contribute: http://github.com/apache/incubator-drill/
• Contact / follow us
– @rick_vanderlans – Rick van der Lans
– @swooledge – Steve Wooledge

Copyright © 1991 - 2014 R20/Consultancy B.V.,
The Hague, The Netherlands. All rights
reserved. No part of this material may be
reproduced, stored in a retrieval system, or
transmitted in any form or by any means,
electronic, mechanical, photographic, or
otherwise, without the explicit written permission
of the copyright owners.
SQL‐on‐Hadoop
Explained
by
Rick F. van der Lans
R20/Consultancy BV
Twitter @rick_vanderlans
www.r20.nl

It’s All About Analytics

Requirements for Data Storage Technology
High data storage scalability
High data processing scalability
High performance
Low price/performance ratio
All data types
High schema flexibility
Fast loading
Enterprise-grade

Comparison Data Storage Technologies
Requirement | Hadoop | Classic SQL DB
High data storage scalability | Yes | Less
High data processing scalability | Yes | Less
High performance | Yes | Less
Low price/performance ratio | Yes | No
All data types | Yes | Most data types
High schema flexibility | Depends | No
Fast loading | Yes | Yes
Enterprise-grade | Depends | Yes

Manipulating Hadoop Data
Three access stacks, each listed top to bottom:
• Apache HBase API → Apache HBase → Apache HDFS API → Apache HDFS
• Apache MapReduce API → Apache MapReduce → Apache HDFS API → Apache HDFS
• Apache HDFS API → Apache HDFS

Performance Dominates in Hadoop
Productivity
Maintainability
Performance Time-to-market
Scalability
Availability

The Need for SQL‐on‐Hadoop
Add high productivity and maintainability, while
retaining high performance and scalability
Advantages SQL-on-Hadoop
• Well-known database language (especially in the BI
community)
• Large target audience
• High productivity and maintainability
• Openness to many reporting and analytical tools

Productivity is as Important
Productivity
Maintainability
Time-to-market
Performance
Scalability
Availability

Different Solutions
Solution stacks, each listed top to bottom:
• Apache HiveQL → Apache Hive → Apache MapReduce API → Apache MapReduce → Apache HDFS API → Apache HDFS
• A SQL Dialect → SQL‐on‐Hadoop → Apache HBase API → Apache HBase → Apache HDFS API → Apache HDFS
• A SQL Dialect → SQL‐on‐Hadoop → Apache HDFS API → Any HDFS (this stack is shown twice)
Not all SQL‐on‐Hadoop Engines are Created Equal
Batch-oriented query environment
(data mining)
Interactive query environment
(OLAP, self-service BI, data
visualization)
Point-queries (retrieving individual
objects)
Investigative analytics (data science)
Operational intelligence (real-time
analytics)
Transactional (production systems)

Technological Challenges
Non-SQL-to-SQL
transformational changes
• Nested data
• Variable data
• Schema-less data
• Self-describing data
Architectural Challenges
• Managing concurrent queries/users
• Parallel execution of complex
operations
• Running complex analytical functions
• Cost-based optimization

Is All Data Relational Data?
SQL-on-Hadoop on HDFS:
• create table
• insert data
• select data
Other application and SQL-on-Hadoop on the same HDFS:
• create file (other application)
• insert data (SQL-on-Hadoop)
• select data (SQL-on-Hadoop)

Transforming Nested Data (1)
CUSTOMER_ID  LAST_NAME  FIRST_NAME  CUSTOMER_ORDERS
                                    CUSTOMER_ORDER_ID  ORDER_TIMESTAMP
75295        Sylvian    David       203699             2008-01-16
                                    306892             2008-07-21
                                    477047             2008-12-09
103819       Scaggs     Boz         70675              2008-10-19
                                    530223             2008-12-01
132171       Rundgren   Todd        210220             2008-04-21
                                    485584             2008-10-14
                                    718579             2008-11-23
                                    741912             2008-12-24

Transforming Nested Data (2)
Alternative 1:
CUSTOMER_ID LAST_NAME FIRST_NAME CUSTOMER_ORDERS
75295 Sylvian David {203699,2008-01-16},{306892,2008-07-21},{477047,2008-12-09}
103819 Scaggs Boz {70675,2008-10-19},{530223,2008-12-01}
132171 Rundgren Todd {210220,2008-04-21},{485584,2008-10-14},{718579,2008-11-23},{741912,2008-12-24}
Alternative 2:
CUSTOMER_ID CUSTOMER_ORDER_ID ORDER_TIMESTAMP LAST_NAME FIRST_NAME
75295 203699 2008-01-16 Sylvian David
75295 306892 2008-07-21 Sylvian David
75295 477047 2008-12-09 Sylvian David
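Alternative 2 can be sketched in a few lines of Python: each nested order becomes its own row and the customer columns are repeated. This is an illustrative sketch using two of the customers from the slide. The dictionary layout and the function name `flatten` are assumptions, not part of any SQL-on-Hadoop engine.

```python
# Nested input: one record per customer, orders embedded as a list.
customers = [
    {"customer_id": 75295, "last_name": "Sylvian", "first_name": "David",
     "orders": [(203699, "2008-01-16"), (306892, "2008-07-21"),
                (477047, "2008-12-09")]},
    {"customer_id": 103819, "last_name": "Scaggs", "first_name": "Boz",
     "orders": [(70675, "2008-10-19"), (530223, "2008-12-01")]},
]


def flatten(nested):
    """Emit one flat row per order, repeating the customer columns."""
    rows = []
    for customer in nested:
        for order_id, order_timestamp in customer["orders"]:
            rows.append((customer["customer_id"], order_id, order_timestamp,
                         customer["last_name"], customer["first_name"]))
    return rows


flat_rows = flatten(customers)
for row in flat_rows:
    print(row)
```

The result is a flat table that any SQL tool can query, at the cost of repeating the customer name on every order row.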

Transforming Variable Data (1)
Example 1:
CUSTOMER_ID CUSTOMER_ORDER_ID ORDER_TIMESTAMP
75295 203699 2008-01-16
75295 306892 2008-07-21
75295 477047 2008-12-09
CUSTOMER_ID CUSTOMER_ORDER_ID ORDER_TIMESTAMP ORDER_PROCESSED
463281 203643 2008-01-16 2008-01-20
CUSTOMER_ID CUSTOMER_ORDER_ID ORDER_TIMESTAMP ORDER_CANCELLED
463246 285825 2008-01-19 2008-10-20
Example 2:
………………
CUSTOMER_ID CUSTOMER_NAME TELEPHONE_NUMBERS
463246 O’Keefe {5157818, 2362436}
463249 Zappa {1234567, 3262836, 4374777}
463350 Donahue {3854757}

Transforming Variable Data (2)
Alternative 1:
CUSTOMER_ID CUSTOMER_NAME TELEPHONE_1 TELEPHONE_2 TELEPHONE_3
463246 O’Keefe 5157818 2362436 ?
463249 Zappa 1234567 3262836 4374777
463350 Donahue 3854757 ? ?
Alternative 2:
CUSTOMER_ID CUSTOMER_NAME TELEPHONE_NUMBER
463246 O’Keefe 5157818
463246 O’Keefe 2362436
463249 Zappa 1234567
463249 Zappa 3262836
463249 Zappa 4374777
463350 Donahue 3854757
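Both alternatives for the telephone-number example can be sketched in Python. The first pads every customer to a fixed number of columns, using `None` where the slide shows `?`. The second emits one row per number. The function names and the input layout are assumptions made for this illustration.

```python
# Variable-length input: each customer has a different number of phones.
customers = [
    (463246, "O'Keefe", [5157818, 2362436]),
    (463249, "Zappa", [1234567, 3262836, 4374777]),
    (463350, "Donahue", [3854757]),
]


def to_fixed_columns(rows):
    """Alternative 1: pad each phone list to the widest one with None."""
    width = max(len(phones) for _, _, phones in rows)
    return [
        (cust_id, name, *phones, *([None] * (width - len(phones))))
        for cust_id, name, phones in rows
    ]


def to_one_row_per_number(rows):
    """Alternative 2: repeat the customer columns for every phone number."""
    return [
        (cust_id, name, phone)
        for cust_id, name, phones in rows
        for phone in phones
    ]


wide = to_fixed_columns(customers)
tall = to_one_row_per_number(customers)
print(wide)
print(tall)
```

The wide form has to grow a column whenever a customer exceeds the current maximum, while the tall form absorbs any number of values without a schema change.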

Transforming Schema‐Less Data
Weblog record
datestamp ip request 6/1/2012 11:10:19 AM 107.1.187.170 GET
/x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticke
HTTP/1.1 6/1/2012 5:53:49 AM 107.1.2.180 GET /tv/3/player/vendor/Chef%
/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1 6/1/2012
107.34.51.63 GET /tv/3/search/content/The%20Andy%20Griffith%20Show/s/T
Andy%20Griffith%20Show HTTP/1.1 6/1/2012 3:12:43 PM 107.5.115.117 GET
/tv/3/search/content/Kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe
%20Lee%20Gifford's%20epic%20'Today'%20gaffe HTTP/1.1 6/1/2012 4:48:35
108.225.132.245 GET /tv/3/search/content/Deadliest%20Catch/s/Deadliest
HTTP/1.1 6/1/2012 10:25:12 AM 108.246.20.125 GET /x.php?u=http://studi
5.financialcontent.com/synacor?Page=QUOTE&Ticker=DJ:DJI HTTP/1.1
6/1/2012 1:58:14 AM 108.246.25.117 GET /tv/3/player/vendor/Chef%20Tips
/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1
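The weblog record above carries no schema, so a structure has to be imposed when reading it. The following Python sketch does this with a regular expression that splits the text into `datestamp`, `ip`, and `request`. The request paths in the sample string are shortened, hypothetical stand-ins, because the paths on the slide are truncated. The pattern is an assumption that fits this sample only.

```python
import re

# Two records run together on one line, as in the slide's weblog dump.
# The request paths are shortened placeholders, not the original values.
log = (
    "6/1/2012 11:10:19 AM 107.1.187.170 GET /x.php?u=example HTTP/1.1 "
    "6/1/2012 5:53:49 AM 107.1.2.180 GET /tv/3/player HTTP/1.1"
)

RECORD = re.compile(
    r"(?P<datestamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<request>\S+ \S+ HTTP/\d\.\d)"
)

# Each match becomes one row with three named columns.
parsed = [match.groupdict() for match in RECORD.finditer(log)]
for row in parsed:
    print(row)
```

The schema lives entirely in the reader: a different pattern would yield different columns from the same bytes.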

Transforming Self‐Describing Data
ID      VALUE
75295   { "employee" : {
            "number" : "6",
            "name" : "Manzarek",
            "initials" : "R",
            "street" : "Haseltine Lane"}
        }
103819  { "employee" : {
            "number" : "7",
            "name" : "Metheny",
            "initials" : "P",
            "street" : "Brownstreet"}
        }
132171  { "employee" : {
            "number" : "15",
            "name" : "Metheny",
            "initials" : "M"}
        }

ID      EMPLOYEE_NUMBER  EMPLOYEE_NAME  EMPLOYEE_INITIALS  EMPLOYEE_STREET
75295   6                Manzarek       R                  Haseltine Lane
103819  7                Metheny        P                  Brownstreet
132171  15               Metheny        M                  ?
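The mapping from the JSON values to the relational rows can be sketched in Python with the standard `json` module. A field missing from a document, such as `street` for ID 132171, becomes `None`, which corresponds to the `?` in the table. The column list and the function name `to_relational` are assumptions made for this illustration.

```python
import json

# (ID, VALUE) pairs as in the table above; VALUE is a JSON document.
rows = [
    (75295, '{"employee": {"number": "6", "name": "Manzarek",'
            ' "initials": "R", "street": "Haseltine Lane"}}'),
    (103819, '{"employee": {"number": "7", "name": "Metheny",'
             ' "initials": "P", "street": "Brownstreet"}}'),
    (132171, '{"employee": {"number": "15", "name": "Metheny",'
             ' "initials": "M"}}'),
]

# Target columns, in the order EMPLOYEE_NUMBER .. EMPLOYEE_STREET.
COLUMNS = ("number", "name", "initials", "street")


def to_relational(pairs):
    """Parse each JSON value and project it onto the fixed column list."""
    table = []
    for row_id, value in pairs:
        employee = json.loads(value)["employee"]
        table.append((row_id, *(employee.get(col) for col in COLUMNS)))
    return table


table = to_relational(rows)
for row in table:
    print(row)
```

The column names come from the keys inside the data, which is what makes the values self-describing.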

Architectural Challenges
Managing concurrent queries/users
Parallel execution of complex
operations
Running complex analytical functions
Cost-based optimization

Use Cases of SQL‐on‐Hadoop
Traditional Interactive Reporting and
Analytics
Self-Service Business Intelligence
Batch Reporting
Point Queries
Operational Processing
Investigative Analytics
Data Stream Processing
Storage of Cold Data Warehouse Data
Storage of External Data
Fast Staging Area
ETL (Pre)Processing Platform
New Use Cases and Non-Relational Data
…

Watch out for Big Data Silos!
batch
processing
investigative
analytics
point
queries
operational
processing
interactive
reporting
data stream
analytics processing
Silo 1 Silo 2 Silo 3 Silo 4 Silo 5 Silo 6 Silo 7

The Integration Labyrinth
Each workload has its own dedicated integration solution (the diagram shows seven; six are labeled):
• batch processing
• point queries
• interactive reporting
• operational processing
• investigative analytics
• data stream

One Platform to Rule Them All
batch
processing
investigative
analytics
point
queries
operational
processing
interactive
reporting
data stream
One Data Management Platform

Closing Remarks
SQL offers standardization
and independence
SQL increases productivity
and eases maintenance
Many SQL-on-Hadoop engines
available
One platform
Being able to process all types
of data is important
Productivity
Maintainability
Time-to-market
Performance
Scalability
Availability
