Selecting the Right SQL-on-Hadoop Solution: 
What You Need to Know 
© 2014 MapR Technologies
Rick F. van der Lans 
Rick F. van der Lans is an independent consultant, lecturer, and author. He 
specializes in data warehousing, business intelligence, database technology, 
and data virtualization. He is managing director of R20/Consultancy B.V. Rick 
has been involved in various projects in which data warehousing and 
integration technology were applied. 
Rick van der Lans is an internationally acclaimed lecturer. He has lectured 
professionally for the last twenty-five years in many European and 
Middle Eastern countries, the USA, South America, and Australia. He has been 
invited by several major software vendors to present keynote speeches. 
He is the author of several books on computing, including his new Data 
Virtualization for Business Intelligence Systems. Some of these books are 
available in different languages. His popular Introduction to 
SQL is available in English, Dutch, Italian, Chinese, and German and is sold 
worldwide. He also authored The SQL Guide to Ingres and SQL for MySQL 
Developers. 
As author for BeyeNetwork.com, writer of whitepapers, chairman for the 
annual European Enterprise Data and Business Intelligence Conference, and 
as columnist for a few IT magazines, he has close contacts with many 
vendors. 
R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl. You can get in touch with Rick via: 
Email: rick@r20.nl 
Twitter: @Rick_vanderlans 
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223 
Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands 2
Self-Service Data Exploration with Apache Drill 
The MapR Distribution including Apache Hadoop 
Top Ranked
Exponential Growth
500+ Customers
Premier Investors
>2x annual bookings
90% software licenses
80% of accounts expand 3X
< 1% lifetime churn
> $1B in incremental revenue generated by 1 customer
The Power of the Open Source Community
[Diagram: the Apache Hadoop and OSS ecosystem on the MapR Data Platform (MapR-FS, MapR-DB).]
Diagram labels: Execution engines; Data governance and operations; Management; Security.
Categories: Batch; SQL; ML, Graph; Streaming; NoSQL & Search; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access.
Components: Accumulo*, Cascading, Drill, Falcon*, Flume, GraphX, HBase, Hive, HttpFS, Hue, Impala, Juju, Knox*, Mahout, MapReduce v1 & v2, MLLib, Oozie, Pig, Savannah*, Sentry*, Shark, Solr, Spark, Spark Streaming, Sqoop, Storm*, Tez*, Whirr, YARN, ZooKeeper.
* Certification/support planned for 2014
Today’s Data Comes in Different Shapes… 
Social Media 
Messages 
Audio 
Sensors 
Mobile Data 
Email 
Clickstream
Real-World Data Modeling and Transformations 
Evolution Towards Self-Service Data Exploration
                                 | Traditional BI w/ RDBMS | Self-Service BI w/ RDBMS | SQL-on-Hadoop | Self-Service Data Exploration
Data Modeling and Transformation | IT-driven               | IT-driven                | IT-driven     | Not needed
Data Visualization               | IT-driven               | Self-service             | Self-service  | Self-service
Zero-day analytics
Why Decrease the Distance to Data?
Improve time to value
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
Reduce the burden on IT
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
40+ contributors
150+ years of experience building databases and distributed systems
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
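To make the idea of schema discovery on the fly more concrete, here is a minimal Python sketch. It is not how Drill is implemented; it only shows the principle of deriving field names and types while reading self-describing records. The function name `discover_schema` and the sample records are assumptions made for illustration.

```python
import json

def discover_schema(records):
    """Return {field: set of type names} observed while reading the records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

# Two self-describing JSON records with different fields (hypothetical sample data).
lines = [
    '{"name": "Michael", "hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": "Jennifer", "hobbies": ["sing"], "preschool": "CCLC"}',
]

schema = discover_schema(json.loads(line) for line in lines)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

No schema is declared before the read: the field `preschool` simply appears when the second record is encountered.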
MapR Optimized Data Architecture
[Diagram: sources feed the MapR Distribution for Hadoop, which exchanges data with a data warehouse and serves analytics and operational applications.]
Sources: relational, SaaS, mainframe; documents, emails; blogs, tweets, link data; log files, clickstreams; sensors.
MapR Distribution for Hadoop: Streaming (Spark Streaming, Storm); Batch / Search (MR, Spark, Hive, Pig, …); NoSQL ODBMS (HBase, Accumulo, …); MapR Data Platform (MapR-FS, MapR-DB).
Data warehouse: data movement, data access.
Workloads: data transformation, enrichment and integration; machine learning; analytics; search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics.
Operational apps: recommendations, fraud detection, logistics.
(1) Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
(2) Drill’s Data Model is Flexible
[Diagram: data formats (HBase, JSON, BSON, CSV, TSV, Parquet, Avro) placed along two flexibility axes: fixed schema to schema-less, and flat to complex.]
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
Quick Tour 
Self-Service Data Exploration with Apache Drill 
Zero to Results in 2 Minutes (3 Commands)
# Install
$ tar xzf apache-drill.tar.gz
# Launch shell (embedded mode)
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local
# Query
0: jdbc:drill:zk=local>
SELECT count(*) AS incidents, columns[1] AS category
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
GROUP BY columns[1]
ORDER BY incidents DESC;
# Results
+------------+-----------------+
| incidents  | category        |
+------------+-----------------+
| 8372       | LARCENY/THEFT   |
| 4247       | OTHER OFFENSES  |
| 3765       | NON-CRIMINAL    |
| 2502       | ASSAULT         |
...
35 rows selected (0.847 seconds)
Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
The table reference in the FROM clause has three parts:
- A storage engine instance (here dfs1): DFS, HBase, or Hive Metastore/HCatalog
- A workspace (here logs): a sub-directory or a Hive database
- A table (here the path to p001.parquet): pathnames, an HBase table, or a Hive table
Query Directory Trees 
# Query file: How many errors per level in Jan 2014? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` 
GROUP BY errorLevel; 
# Query directory sub-tree: How many errors per level? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs` 
GROUP BY errorLevel; 
# Query some partitions: How many errors per level by month from 2012? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs` 
WHERE dirs[1] >= 2012 
GROUP BY errorLevel, dirs[2];
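The partition query above relies on directory names acting as implicit columns. The following Python sketch shows that idea under stated assumptions: the file listing is hypothetical, the helper `partition_dirs` is an illustrative name, and the mapping of the slide's `dirs[1]` and `dirs[2]` to the year and month levels is inferred from the example path. Python lists are 0-based, so the year is the first element here.

```python
from collections import Counter
from pathlib import PurePosixPath

def partition_dirs(path, root):
    """Directory names between the queried root and the file itself."""
    relative = PurePosixPath(path).relative_to(root)
    return list(relative.parts[:-1])

# Hypothetical file listing under the queried directory.
files = [
    "/AppServerLogs/2011/Dec/part0001.parquet",
    "/AppServerLogs/2012/Jan/part0001.parquet",
    "/AppServerLogs/2014/Jan/part0001.parquet",
    "/AppServerLogs/2014/Jan/part0002.parquet",
]

# Keep only files whose year directory is 2012 or later,
# then count the selected files per (year, month) partition.
selected = Counter()
for f in files:
    year, month = partition_dirs(f, "/AppServerLogs")
    if int(year) >= 2012:
        selected[(year, month)] += 1

print(dict(selected))
```

Files under 2011 are never opened, which is the point of filtering on directory levels: the predicate prunes whole partitions before any data is read.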
Works with HBase and Embedded Blobs 
# Query an HBase table directly (no schemas) 
SELECT cf1.month, cf1.year 
FROM hbase.table1; 
# Embedded JSON value inside column profileBlob inside column family cf1 of
# the HBase table users
SELECT profile.name, count(profile.children) 
FROM ( 
SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile 
FROM hbase.users 
) 
Combine Data Sources on the Fly 
# Join log directory with JSON file (user profiles) to identify the
# name and email address for anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work 
FROM dfs.logs.`/data/logs` logs, 
dfs.users.`/profiles.json` users 
WHERE logs.uid = users.id AND 
logs.errorLevel > 5; 
# Join a Hive table and an HBase table (without Hive metadata)
# to determine the number of tweets per user
SELECT users.name, count(*) as tweetCount
FROM hive.social.tweets tweets,
hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;
SQL Technologies Available on MapR
Categories: Data Exploration (Drill); Simple SQL-on-Hadoop, schema (Hive, Impala, Shark); Advanced SQL & Analytics (Vertica)
Criterion          | Drill 0.5                    | Hive 0.13 w/ Tez            | Impala 1.x                 | Shark 0.9                          | Vertica
Latency            | Low                          | Medium                      | Low                        | Low for in-memory, Med for on disk | Low
Files              | Yes (all Hive formats)       | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (all Hive file formats)        | Proprietary; all Hive file formats can be used as external tables
HBase/MapR-DB      | Yes                          | Yes, performance issues     | Yes, performance issues    | Yes, performance issues            | No
Hive compatibility | High                         | High                        | Medium                     | High                               | NA
Schema             | Schema-less or Hive or HBase | Hive                        | Hive                       | Hive                               | Proprietary or Hive
SQL support        | ANSI SQL                     | HiveQL                      | HiveQL (subset)            | HiveQL                             | ANSI SQL + advanced analytics
Client support     | ODBC/JDBC                    | ODBC/JDBC                   | ODBC/JDBC                  | ODBC/JDBC                          | ODBC/JDBC, ADO.NET, …
Large datasets     | Yes                          | Yes                         | Limited                    | Yes                                | Yes
Nested data        | Yes                          | Limited                     | No                         | Limited                            | Limited
Machine learning   | No                           | No                          | No                         | Yes                                | No
Transactions       | No                           | No                          | No                         | No                                 | Yes
Optimizer          | Limited                      | Limited                     | Limited                    | Limited                            | High
Concurrency        | Medium                       | Medium                      | Medium                     | Limited                            | High
Q&A
Engage with us!
• SQL-on-Hadoop engines explained
– http://info.mapr.com/wp-sql-on-hadoop-engines-explained
• Get demo and tutorials on Apache Drill
– https://www.mapr.com/products/apache-drill
• Apache Drill 0.5 available now
– Download and play: http://incubator.apache.org/drill/
– Ask questions: drill-user@incubator.apache.org
– Contribute: http://github.com/apache/incubator-drill/
• Contact / follow us
– @rick_vanderlans: Rick van der Lans
– @swooledge: Steve Wooledge
SQL‐on‐Hadoop Explained
by Rick F. van der Lans
R20/Consultancy BV
Twitter @rick_vanderlans
www.r20.nl
Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners.
It’s All About Analytics 
Requirements for Data Storage Technology 
High data storage scalability 
High data processing scalability 
High performance 
Low price/performance ratio 
All data types 
High schema flexibility 
Fast loading 
Enterprise-grade 
Comparison of Data Storage Technologies
Requirement                      | Hadoop  | Classic SQL DB
High data storage scalability    | Yes     | Less
High data processing scalability | Yes     | Less
High performance                 | Yes     | Less
Low price/performance ratio      | Yes     | No
All data types                   | Yes     | Most data types
High schema flexibility          | Depends | No
Fast loading                     | Yes     | Yes
Enterprise-grade                 | Depends | Yes
Manipulating Hadoop Data
[Diagram: three access stacks, each listed from top to bottom.]
1. Apache HBase API -> Apache HBase -> Apache HDFS API -> Apache HDFS
2. Apache MapReduce API -> Apache MapReduce -> Apache HDFS API -> Apache HDFS
3. Apache HDFS API -> Apache HDFS
Performance Dominates in Hadoop 
Productivity 
Maintainability 
Performance
Time-to-market
Scalability
Availability
The Need for SQL‐on‐Hadoop 
Add high productivity and maintainability, while retaining high performance and scalability
Advantages of SQL-on-Hadoop
• Well-known database language (especially in the BI community)
• Large target audience
• High productivity and maintainability
• Openness to many reporting and analytical tools
Productivity is as Important 
Productivity 
Maintainability 
Time-to-market 
Performance 
Scalability 
Availability 
Different Solutions
[Diagram: four access stacks, each listed from top to bottom.]
1. Apache HiveQL -> Apache Hive -> Apache MapReduce API -> Apache MapReduce -> Apache HDFS API -> Apache HDFS
2. A SQL Dialect -> SQL‐on‐Hadoop -> Apache HBase API -> Apache HBase -> Apache HDFS API -> Apache HDFS
3. A SQL Dialect -> SQL‐on‐Hadoop -> Apache HDFS API -> Any HDFS
4. A SQL Dialect -> SQL‐on‐Hadoop -> Apache HDFS API -> Any HDFS
Not all SQL‐on‐Hadoop Engines are Created Equal 
Batch-oriented query environment (data mining)
Interactive query environment (OLAP, self-service BI, data visualization)
Point-queries (retrieving individual objects)
Investigative analytics (data science)
Operational intelligence (real-time analytics)
Transactional (production systems)
Technological Challenges 
Non-SQL-to-SQL transformational changes
• Nested data
• Variable data
• Schema-less data
• Self-describing data
Architectural Challenges
• Managing concurrent queries/users
• Parallel execution of complex operations
• Running complex analytical functions
• Cost-based optimization
Is All Data Relational Data?
Scenario 1: the SQL-on-Hadoop engine does everything on HDFS
❶ create table (SQL-on-Hadoop)
❷ insert data (SQL-on-Hadoop)
❸ select data (SQL-on-Hadoop)
Scenario 2: another application writes to HDFS, the SQL-on-Hadoop engine only reads
❶ create file (other application)
❷ insert data (other application)
❸ select data (SQL-on-Hadoop)
Transforming Nested Data (1) 
CUSTOMER_ID  LAST_NAME  FIRST_NAME  CUSTOMER_ORDERS
75295        Sylvian    David       CUSTOMER_ORDER_ID  ORDER_TIMESTAMP
                                    203699             2008-01-16
                                    306892             2008-07-21
                                    477047             2008-12-09
103819       Scaggs     Boz         CUSTOMER_ORDER_ID  ORDER_TIMESTAMP
                                    70675              2008-10-19
                                    530223             2008-12-01
132171       Rundgren   Todd        CUSTOMER_ORDER_ID  ORDER_TIMESTAMP
                                    210220             2008-04-21
                                    485584             2008-10-14
                                    718579             2008-11-23
                                    741912             2008-12-24
Transforming Nested Data (2) 
Alternative 1:
CUSTOMER_ID  LAST_NAME  FIRST_NAME  CUSTOMER_ORDERS
75295        Sylvian    David       {203699,2008-01-16},{306892,2008-07-21},{477047,2008-12-09}
103819       Scaggs     Boz         {70675,2008-10-19},{530223,2008-12-01}
132171       Rundgren   Todd        {210220,2008-04-21},{485584,2008-10-14},{718579,2008-11-23},{741912,2008-12-24}
Alternative 2:
CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP  LAST_NAME  FIRST_NAME
75295        203699             2008-01-16       Sylvian    David
75295        306892             2008-07-21       Sylvian    David
75295        477047             2008-12-09       Sylvian    David
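As a rough illustration of Alternative 2, the Python sketch below turns nested customer records into one flat row per customer and order. The dictionary layout and the function name `flatten_orders` are assumptions for this example; the values come from the nested table on the previous slide.

```python
# Nested input: each customer carries a list of (order id, order timestamp) pairs.
customers = [
    {"customer_id": 75295, "last_name": "Sylvian", "first_name": "David",
     "customer_orders": [(203699, "2008-01-16"), (306892, "2008-07-21"),
                         (477047, "2008-12-09")]},
    {"customer_id": 103819, "last_name": "Scaggs", "first_name": "Boz",
     "customer_orders": [(70675, "2008-10-19"), (530223, "2008-12-01")]},
]

def flatten_orders(customers):
    """Alternative 2: emit one flat row per (customer, order) pair."""
    rows = []
    for c in customers:
        for order_id, order_ts in c["customer_orders"]:
            rows.append((c["customer_id"], order_id, order_ts,
                         c["last_name"], c["first_name"]))
    return rows

rows = flatten_orders(customers)
for row in rows:
    print(row)
```

The customer columns are repeated on every order row, which is the price of a flat relational shape.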
Transforming Variable Data (1) 
Example 1:
CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP
75295        203699             2008-01-16
75295        306892             2008-07-21
75295        477047             2008-12-09

CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP  ORDER_PROCESSED
463281       203643             2008-01-16       2008-01-20

CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP  ORDER_CANCELLED
463246       285825             2008-01-19       2008-10-20
…
Example 2:
CUSTOMER_ID  CUSTOMER_NAME  TELEPHONE_NUMBERS
463246       O’Keefe        {5157818, 2362436}
463249       Zappa          {1234567, 3262836, 4374777}
463350       Donahue        {3854757}
Transforming Variable Data (2) 
Alternative 1:
CUSTOMER_ID  CUSTOMER_NAME  TELEPHONE_1  TELEPHONE_2  TELEPHONE_3
463246       O’Keefe        5157818      2362436      ?
463249       Zappa          1234567      3262836      4374777
463350       Donahue        3854757      ?            ?
Alternative 2:
CUSTOMER_ID  CUSTOMER_NAME  TELEPHONE_NUMBER
463246       O’Keefe        5157818
463246       O’Keefe        2362436
463249       Zappa          1234567
463249       Zappa          3262836
463249       Zappa          4374777
463350       Donahue        3854757
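Both alternatives can be sketched in a few lines of Python. This is only an illustration of the two transformations; the function names `to_fixed_columns` and `to_rows` and the tuple layout are assumptions, and `None` plays the role of the `?` in the tables above.

```python
customers = [
    (463246, "O'Keefe", [5157818, 2362436]),
    (463249, "Zappa", [1234567, 3262836, 4374777]),
    (463350, "Donahue", [3854757]),
]

def to_fixed_columns(customers, width=3):
    """Alternative 1: pad each list to `width` columns with None.

    Numbers beyond `width` are dropped, which is the weakness of this layout.
    """
    return [(cid, name, *(phones + [None] * width)[:width])
            for cid, name, phones in customers]

def to_rows(customers):
    """Alternative 2: one row per telephone number."""
    return [(cid, name, phone)
            for cid, name, phones in customers
            for phone in phones]

for row in to_fixed_columns(customers):
    print(row)
for row in to_rows(customers):
    print(row)
```

Alternative 1 needs a maximum list length to be chosen up front, whereas Alternative 2 handles any number of values at the cost of repeating the customer columns.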
Transforming Schema‐Less Data 
Weblog record 
datestamp ip request 6/1/2012 11:10:19 AM 107.1.187.170 GET 
/x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticke 
HTTP/1.1 6/1/2012 5:53:49 AM 107.1.2.180 GET /tv/3/player/vendor/Chef% 
/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1 6/1/2012 
107.34.51.63 GET /tv/3/search/content/The%20Andy%20Griffith%20Show/s/T 
Andy%20Griffith%20Show HTTP/1.1 6/1/2012 3:12:43 PM 107.5.115.117 GET 
/tv/3/search/content/Kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe 
%20Lee%20Gifford's%20epic%20'Today'%20gaffe HTTP/1.1 6/1/2012 4:48:35 
108.225.132.245 GET /tv/3/search/content/Deadliest%20Catch/s/Deadliest 
HTTP/1.1 6/1/2012 10:25:12 AM 108.246.20.125 GET /x.php?u=http://studi 
5.financialcontent.com/synacor?Page=QUOTE&Ticker=DJ:DJI HTTP/1.1 
6/1/2012 1:58:14 AM 108.246.25.117 GET /tv/3/player/vendor/Chef%20Tips 
/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1 
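A weblog like the one above has no declared schema, so structure has to be imposed while reading. The Python sketch below does this with a regular expression that splits the text into the three columns named in the header (datestamp, ip, request). The pattern and the name `parse_weblog` are assumptions for illustration, and because the records on the slide are truncated, the sample uses made-up request paths.

```python
import re

# datestamp (date, time, AM/PM), client IP, then the request up to the protocol token.
LOG_PATTERN = re.compile(
    r"(?P<datestamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<request>[A-Z]+ \S+ HTTP/\d\.\d)"
)

def parse_weblog(text):
    """Return one dict per log record found in the raw text."""
    return [m.groupdict() for m in LOG_PATTERN.finditer(text)]

# Hypothetical sample: two records run together; the paths are invented.
sample = ("6/1/2012 11:10:19 AM 107.1.187.170 GET /x.php?u=example HTTP/1.1 "
          "6/1/2012 5:53:49 AM 107.1.2.180 GET /tv/3/player HTTP/1.1")

records = parse_weblog(sample)
for record in records:
    print(record)
```

The schema lives entirely in the reader: a different pattern would yield different columns from the same file.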
Transforming Self‐Describing Data 
ID      VALUE
75295   { "employee" : {
            "number"  : "6",
            "name"    : "Manzarek",
            "initials": "R",
            "street"  : "Haseltine Lane"}
        }
103819  { "employee" : {
            "number"  : "7",
            "name"    : "Metheny",
            "initials": "P",
            "street"  : "Brownstreet"}
        }
132171  { "employee" : {
            "number"  : "15",
            "name"    : "Metheny",
            "initials": "M"}
        }
ID      EMPLOYEE_NUMBER  EMPLOYEE_NAME  EMPLOYEE_INITIALS  EMPLOYEE_STREET
75295   6                Manzarek       R                  Haseltine Lane
103819  7                Metheny        P                  Brownstreet
132171  15               Metheny        M                  ?
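The transformation from the self-describing values to the flat table can be sketched in Python as follows. The column list and the function name `to_relational` are assumptions for this example; a missing key becomes `None`, which corresponds to the `?` in the table above.

```python
import json

# Target columns of the flat table (EMPLOYEE_NUMBER, EMPLOYEE_NAME, ...).
COLUMNS = ["number", "name", "initials", "street"]

rows = [
    (75295, '{"employee": {"number": "6", "name": "Manzarek", '
            '"initials": "R", "street": "Haseltine Lane"}}'),
    (103819, '{"employee": {"number": "7", "name": "Metheny", '
             '"initials": "P", "street": "Brownstreet"}}'),
    (132171, '{"employee": {"number": "15", "name": "Metheny", "initials": "M"}}'),
]

def to_relational(rows):
    """Flatten each JSON value into fixed columns; absent keys become None."""
    out = []
    for row_id, value in rows:
        employee = json.loads(value)["employee"]
        out.append((row_id, *(employee.get(col) for col in COLUMNS)))
    return out

flat = to_relational(rows)
for row in flat:
    print(row)
```

The field names travel with the data, so the only decision left to the reader is which of them to expose as columns.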
Architectural Challenges 
Managing concurrent queries/users
Parallel execution of complex operations
Running complex analytical functions
Cost-based optimization
Use Cases of SQL‐on‐Hadoop 
Traditional Interactive Reporting and Analytics
Self-Service Business Intelligence
Batch Reporting
Point Queries
Operational Processing
Investigative Analytics
Data Stream Processing
Storage of Cold Data Warehouse Data
Storage of External Data
Fast Staging Area
ETL (Pre)Processing Platform
New Use Cases and Non-Relational Data
…
Watch out for Big Data Silos!
[Diagram: seven separate silos (Silo 1 to Silo 7), one per workload: batch processing, investigative analytics, point queries, operational processing, interactive reporting, data stream processing, analytics.]
The Integration Labyrinth
[Diagram: each of the seven workloads (analytics, batch processing, point queries, interactive reporting, operational processing, investigative analytics, data stream processing) has its own dedicated integration solution.]
One Platform to Rule Them All
[Diagram: the seven workloads (batch processing, investigative analytics, point queries, operational processing, interactive reporting, data stream processing, analytics) all run on One Data Management Platform.]
Closing Remarks 
SQL offers standardization and independence
SQL increases productivity and eases maintenance
Many SQL-on-Hadoop engines available
One platform
Being able to process all types of data is important
Productivity
Maintainability
Time-to-market
Performance
Scalability
Availability

Webinar: Selecting the Right SQL-on-Hadoop Solution

  • 1.
    Selecting the RightSQL-on-Hadoop Solution: What You Need to Know © 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 1
  • 2.
    Rick F. vander Lans Rick F. van der Lans is an independent consultant, lecturer, and author. He specializes in data warehousing, business intelligence, database technology, and data virtualization. He is managing director of R20/Consultancy B.V.. Rick has been involved in various projects in which data warehousing, and integration technology was applied. Rick van der Lans is an internationally acclaimed lecturer. He has lectured professionally for the last twenty five years in many of the European and Middle East countries, the USA, South America, and in Australia. He has been invited by several major software vendors to present keynote speeches. He is the author of several books on computing, including his new Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. Books such as the popular Introduction to SQL is available in English, Dutch, Italian, Chinese, and German and is sold world wide. He also authored The SQL Guide to Ingres and SQL for MySQL Developers. As author for BeyeNetwork.com, writer of whitepapers, chairman for the annual European Enterprise Data and Business Intelligence Conference, and as columnist for a few IT magazines, he has close contacts with many vendors. R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl. You can get in touch with Rick via: Email: rick@r20.nl Twitter: @Rick_vanderlans LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223 Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands 2
  • 3.
    Self-Service Data Explorationwith Apache Drill © 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 3
  • 4.
    The MapR Distributionincluding Apache Hadoop © 2014 MapR Technologies 4 Top Ranked Exponential Growth 500+ Customers Premier Investors >2x annual bookings 90% software licenses 80% of accounts expand 3X < 1% lifetime churn > $1B in incremental revenue generated by 1 customer
  • 5.
    The Power ofthe Open Source Community Provisioning & coordination Savannah* Workflow & Data Governance Data Integration & Access Hue HttpFS Flume Knox* Falcon* Whirr MapR-FS MapR-DB © 2014 MapR Technologies 5 Management APACHE HADOOP AND OSS ECOSYSTEM Streaming Storm* NoSQL & Search Solr MapR Data Platform Security SQL Drill Shark Impala YARN Batch Spark Cascading Pig Spark Streaming HBase Juju ML, Graph GraphX MLLib Mahout MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Tez* Accumulo* Hive Sqoop Sentry* Oozie ZooKeeper * Certification/support planned for 2014
  • 6.
    Today’s Data Comesin Different Shapes… © 2014 MapR Technologies 6 Social Media Messages Audio Sensors Mobile Data Email Clickstream
  • 7.
    Real-World Data Modelingand Transformations © 2014 MapR Technologies 7
  • 8.
    © 2014 MapRTechnologies 8
  • 9.
    Evolution Towards Self-ServiceData Exploration © 2014 MapR Technologies 9 Data Modeling and Transformation Data Visualization IT-driven IT-driven IT-driven Self-service IT-driven Self-service Not needed Self-service Traditional BI w/ RDBMS Self-Service BI w/ RDBMS SQL-on-Hadoop Self-Service Data Exploration Zero-day analytics
  • 10.
    Improve time tovalue Redu2ce the burden on IT © 2014 MapR Technologies 10 Why Decrease the Distance to Data? • Enable rapid data exploration and application development • IT should provide a valuable service without “getting in the way” • Can’t add DBAs to keep up with the exponential data growth • Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
  • 11.
    • Pioneering DataAgility for Hadoop • Apache open source project • Scale-out execution engine for low-latency queries • Unified SQL-based API for analytics & operational applications © 2014 MapR Technologies 11 APACHE DRILL 40+ contributors 150+ years of experience building databases and distributed systems
  • 12.
    Drill Supports SchemaDiscovery On-The-Fly Schema Declared In Advance Schema2 Discovered On-The-Fly © 2014 MapR Technologies 12 • Fixed schema • Leverage schema in centralized repository (Hive Metastore) • Fixed schema, evolving schema or schema-less • Leverage schema in centralized repository or self-describing data SCHEMA ON WRITE SCHEMA BEFORE READ SCHEMA ON THE FLY
  • 13.
    Optimized Data ArchitectureMachine Learning © 2014 MapR Technologies 13 MapR Optimized Data Architecture Sources RELATIONAL, SAAS, MAINFRAME DOCUMENTS, EMAILS BLOGS, TWEETS, LINK DATA LOG FILES, CLICKSTREAMS SENSORS Streaming (Spark Streaming, Storm) Batch / Search (MR, Spark, Hive, Pig, …) NoSQL ODBMS (HBase, Accumulo, …) MapR Data Platform MapR-DB MAPR DISTRIBUTION FOR HADOOP MapR-FS MAPR DISTRIBUTION FOR HADOOP DATA WAREHOUSE Data Movement Data Access Analytics Search Schema-less data exploration BI, reporting Ad-hoc integrated analytics Data Transformation, Enrichment and Integration Operational Apps Recommendations Fraud Detection Logistics
  • 14.
    © 2014 MapRTechnologies 14 (1) Self-Describing Data is Ubiquitous Flat files in DFS • Complex data (Thrift, Avro, protobuf) • Columnar data (Parquet, ORC) • Loosely defined (JSON) • Traditional files (CSV, TSV) Data stored in NoSQL stores • Relational-like (rows, columns) • Sparse data (NoSQL maps) • Embedded blobs (JSON) • Document stores (nested objects) { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 15.
    RDBMS/SQL-on-Hadoop table ApacheDrill table © 2014 MapR Technologies 15 (2) Drill’s Data Model is Flexible Fixed schema Schema-less HBase JSON BSON CSV TSV Parquet Avro Flat Complex Flexibility Flexibility Name Gender Age Michael M 6 Jennifer F 3 { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 16.
    Quick Tour Self-ServiceData Exploration with Apache Drill © ©20 21041 M4 aMpaRp RTe Tcehcnhonloogloiegsies 16
  • 17.
    Zero to Resultsin 2 Minutes (3 Commands) $ tar xzf apache-drill.tar.gz $ apache-drill/bin/sqlline -u jdbc:drill:zk=local 0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv` GROUP BY columns[1] ORDER BY incidents DESC; +------------+------------+ | incidents | category | +------------+------------+ | 8372 | LARCENY/THEFT | | 4247 | OTHER OFFENSES | | 3765 | NON-CRIMINAL | | 2502 | ASSAULT | ... 35 rows selected (0.847 seconds) Install Launch shell (embedded mode) Query Results © 2014 MapR Technologies 17
  • 18.
    © 2014 MapRTechnologies 18 A storage engine instance - DFS - HBase - Hive Metastore/HCatalog A workspace - Sub-directory - Hive database A table - pathnames - HBase table - Hive table Data Source is in the Query SELECT timestamp, message FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` WHERE errorLevel > 2
  • 19.
    © 2014 MapRTechnologies 19 Query Directory Trees # Query file: How many errors per level in Jan 2014? SELECT errorLevel, count(*) FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` GROUP BY errorLevel; # Query directory sub-tree: How many errors per level? SELECT errorLevel, count(*) FROM dfs.logs.`/AppServerLogs` GROUP BY errorLevel; # Query some partitions: How many errors per level by month from 2012? SELECT errorLevel, count(*) FROM dfs.logs.`/AppServerLogs` WHERE dirs[1] >= 2012 GROUP BY errorLevel, dirs[2];
  • 20.
    Works with HBaseand Embedded Blobs # Query an HBase table directly (no schemas) SELECT cf1.month, cf1.year FROM hbase.table1; # Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users SELECT profile.name, count(profile.children) FROM ( SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile FROM hbase.users ) © 2014 MapR Technologies 20
  • 21.
    © 2014 MapRTechnologies 21 Combine Data Sources on the Fly # Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message. SELECT DISTINCT users.name, users.emails.work FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users WHERE logs.uid = users.id AND logs.errorLevel > 5; # Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user SELECT users.name, count(*) as tweetCount FROM hive.social.tweets tweets, hbase.users users WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8') GROUP BY tweets.userId;
  • 22.
    Data Exploration SimpleSQL-on-Hadoop (schema) Advanced SQL & Analytics © 2014 MapR Technologies 22 SQL Technologies Available on MapR Drill 0.5 Hive 0.13 w/ Tez Impala 1.x Shark 0.9 Vertica Latency Low Medium Low Low for in-memory) Med for on disk Low Files Yes (all Hive formats) Yes (all Hive file formats) Yes (Parquet, Sequence, …) Yes (all Hive file formats) Proprietary All Hive file formats can be used as external tables HBase/MapR-DB Yes Yes, Performance issues Yes, performance issues Yes, Performance issues No Hive compatibility High High Medium High NA Schema Schema-less or Hive or Hbase Hive Hive Hive Proprietary or Hive SQL support ANSI SQL HiveQL HiveQL (subset) HiveQL ANSI SQL + advanced analytics Client support ODBC/JDBC ODBC/JDBC ODBC/JDBC ODBC/JDBC ODBC/JDBC, ADO.NET, … Large datasets Yes Yes Limited Yes Yes Nested data Yes Limited No Limited Limited Machine learning No No No Yes No Transactions No No No No Yes Optimizer Limited Limited Limited Limited High Concurrency Medium Medium Medium Limited High
  • 23.
    © 2014 MapRTechnologies 23 Q& A Engage with us! • SQL-on-Hadoop engines explained http://info.mapr.com/wp-sql-on-hadoop-engines-explained • Get demo and tutorials on Apache Drill – https://www.mapr.com/products/apache-drill • Apache Drill 0.5 available now – Download and play: http://incubator.apache.org/drill/ – Ask questions: drill-user@incubator.apache.org – Contribute: http://github.com/apache/incubator-drill/ @rick_vanderlans – Rick van der Lans @swooledge – Steve Wooledge • Contact / follow us
  • 24.
    Copyright © 1991- 2014 R20/Consultancy B.V., The Hague, The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners. SQL‐on‐Hadoop Explained by Rick F. van der Lans R20/Consultancy BV Twitter @rick_vanderlans www.r20.nl
  • 25.
    Copyright © 1991- 2014 R20/Consultancy B.V., The Hague, The Netherlands 2
  • 26.
    Copyright © 1991- 2014 R20/Consultancy B.V., The Hague, The Netherlands 3
  • 27.
    Copyright © 1991- 2014 R20/Consultancy B.V., The Hague, The Netherlands 4
  • 28.
    Copyright © 1991- 2014 R20/Consultancy B.V., The Hague, The Netherlands 5
  • 29.
    It’s All AboutAnalytics Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands 6
  • 30.
    Requirements for DataStorage Technology High data storage scalability High data processing scalability High performance Low price/performance ratio All data types High schema flexibility Fast loading Enterprise-grade Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands 7
  • 31.
    Comparison Data StorageTechnologies Hadoop Classic SQL DB High data storage scalability Yes Less High data processing scalability Yes Less High performance Yes Less Low price/performance ratio Yes No All data types Yes Most data types High schema flexibility Depends No Fast loading Yes Yes Enterprise-grade Depends Yes Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands 8
  • 32.
    Manipulating Hadoop Data
    Diagram: three access stacks (layers listed top to bottom)
      • Apache HBase API, Apache HBase, Apache HDFS API, Apache HDFS
      • Apache MapReduce API, Apache MapReduce, Apache HDFS API, Apache HDFS
      • Apache HDFS API, Apache HDFS
  • 33.
    Performance Dominates in Hadoop
    Productivity, Maintainability, Performance, Time-to-market, Scalability, Availability
  • 34.
    The Need for SQL-on-Hadoop
    Add high productivity and maintainability, while retaining high performance and scalability.
    Advantages of SQL-on-Hadoop
      • Well-known database language (especially in the BI community)
      • Large target audience
      • High productivity and maintainability
      • Openness to many reporting and analytical tools
  • 35.
    Productivity is as Important
    Productivity, Maintainability, Time-to-market, Performance, Scalability, Availability
  • 36.
    Different Solutions
    Diagram: four access stacks (layers listed top to bottom)
      • Apache HiveQL, Apache Hive, Apache MapReduce API, Apache MapReduce, Apache HDFS API, Apache HDFS
      • A SQL Dialect, SQL-on-Hadoop, Apache HBase API, Apache HBase, Apache HDFS API, Apache HDFS
      • A SQL Dialect, SQL-on-Hadoop, Apache HDFS API, Any HDFS
      • A SQL Dialect, SQL-on-Hadoop, Apache HDFS API, Any HDFS
  • 37.
    Not all SQL-on-Hadoop Engines are Created Equal
      • Batch-oriented query environment (data mining)
      • Interactive query environment (OLAP, self-service BI, data visualization)
      • Point-queries (retrieving individual objects)
      • Investigative analytics (data science)
      • Operational intelligence (real-time analytics)
      • Transactional (production systems)
  • 38.
    Technological Challenges
    Non-SQL-to-SQL transformational changes
      • Nested data
      • Variable data
      • Schema-less data
      • Self-describing data
    Architectural Challenges
      • Managing concurrent queries/users
      • Parallel execution of complex operations
      • Running complex analytical functions
      • Cost-based optimization
  • 39.
    Is All Data Relational Data?
      • Diagram 1 (SQL-on-Hadoop on HDFS): create table, insert data, select data
      • Diagram 2 (SQL-on-Hadoop and another application on HDFS): create file, insert data, select data
  • 40.
    Transforming Nested Data (1)
      CUSTOMER_ID  LAST_NAME  FIRST_NAME  CUSTOMER_ORDERS
                                          CUSTOMER_ORDER_ID  ORDER_TIMESTAMP
      75295        Sylvian    David       203699             2008-01-16
                                          306892             2008-07-21
                                          477047             2008-12-09
      103819       Scaggs     Boz         70675              2008-10-19
                                          530223             2008-12-01
      132171       Rundgren   Todd        210220             2008-04-21
                                          485584             2008-10-14
                                          718579             2008-11-23
                                          741912             2008-12-24
  • 41.
    Transforming Nested Data (2)
    Alternative 1:
      CUSTOMER_ID  LAST_NAME  FIRST_NAME  CUSTOMER_ORDERS
      75295        Sylvian    David       {203699,2008-01-16},{306892,2008-07-21},{477047,2008-12-09}
      103819       Scaggs     Boz         {70675,2008-10-19},{530223,2008-12-01}
      132171       Rundgren   Todd        {210220,2008-04-21},{485584,2008-10-14},{718579,2008-12-24}
    Alternative 2:
      CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP  LAST_NAME  FIRST_NAME
      75295        203699             2008-01-16       Sylvian    David
      75295        306892             2008-07-21       Sylvian    David
      75295        477047             2008-12-09       Sylvian    David
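The step from the nested customer record to Alternative 2 can be sketched in a few lines of Python. This is only an illustration of the flattening idea, not how any particular SQL-on-Hadoop engine implements it; the dictionary layout and the function name `flatten_orders` are assumptions made for the example.

```python
# Nested input: one record per customer, with a repeating group of orders.
# The key names mirror the column names on the slide; the structure is assumed.
customers = [
    {
        "customer_id": 75295,
        "last_name": "Sylvian",
        "first_name": "David",
        "customer_orders": [
            {"customer_order_id": 203699, "order_timestamp": "2008-01-16"},
            {"customer_order_id": 306892, "order_timestamp": "2008-07-21"},
            {"customer_order_id": 477047, "order_timestamp": "2008-12-09"},
        ],
    },
]


def flatten_orders(records):
    """Return one flat row per (customer, order) pair, as in Alternative 2."""
    rows = []
    for customer in records:
        for order in customer["customer_orders"]:
            rows.append(
                (
                    customer["customer_id"],
                    order["customer_order_id"],
                    order["order_timestamp"],
                    customer["last_name"],
                    customer["first_name"],
                )
            )
    return rows


flat_rows = flatten_orders(customers)
for row in flat_rows:
    print(row)
```

Note that the customer columns are repeated on every output row, which is the price of Alternative 2: the result is relational, but redundant.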
  • 42.
    Transforming Variable Data (1)
    Example 1:
      CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP
      75295        203699             2008-01-16
      75295        306892             2008-07-21
      75295        477047             2008-12-09

      CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP  ORDER_PROCESSED
      463281       203643             2008-01-16       2008-01-20

      CUSTOMER_ID  CUSTOMER_ORDER_ID  ORDER_TIMESTAMP  ORDER_CANCELLED
      463246       285825             2008-01-19       2008-10-20
      …
    Example 2:
      CUSTOMER_ID  CUSTOMER_NAME  TELEPHONE_NUMBERS
      463246       O’Keefe        {5157818, 2362436}
      463249       Zappa          {1234567, 3262836, 4374777}
      463350       Donahue        {3854757}
  • 43.
    Transforming Variable Data (2)
    Alternative 1:
      CUSTOMER_ID  CUSTOMER_NAME  TELEPHONE_1  TELEPHONE_2  TELEPHONE_3
      463246       O’Keefe        5157818      2362436      ?
      463249       Zappa          1234567      3262836      4374777
      463350       Donahue        3854757      ?            ?
    Alternative 2:
      CUSTOMER_ID  CUSTOMER_NAME  TELEPHONE_NUMBER
      463246       O’Keefe        5157818
      463246       O’Keefe        2362436
      463249       Zappa          1234567
      463249       Zappa          3262836
      463249       Zappa          4374777
      463350       Donahue        3854757
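Both alternatives for the telephone-number example can be expressed as small transformations. The sketch below is a hedged illustration in Python; the tuple layout and the function names `to_fixed_columns` and `to_one_row_per_value` are hypothetical, and `None` plays the role of the "?" (null) on the slide.

```python
# Variable-length input: each customer carries a list of telephone numbers.
customers = [
    (463246, "O'Keefe", [5157818, 2362436]),
    (463249, "Zappa", [1234567, 3262836, 4374777]),
    (463350, "Donahue", [3854757]),
]


def to_fixed_columns(records, width=3):
    """Alternative 1: pad each list to `width` columns, using None for nulls.

    Assumption: values beyond `width` are dropped, which is the main
    weakness of this alternative.
    """
    rows = []
    for customer_id, name, phones in records:
        padded = (phones + [None] * width)[:width]
        rows.append((customer_id, name, *padded))
    return rows


def to_one_row_per_value(records):
    """Alternative 2: emit one row per telephone number."""
    return [
        (customer_id, name, phone)
        for customer_id, name, phones in records
        for phone in phones
    ]


wide_rows = to_fixed_columns(customers)
tall_rows = to_one_row_per_value(customers)
print(wide_rows)
print(tall_rows)
```

Alternative 1 keeps one row per customer but needs a column count fixed in advance; Alternative 2 handles any number of values at the cost of repeating the customer columns.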
  • 44.
    Transforming Schema-Less Data
    Weblog record
      datestamp              ip               request
      6/1/2012 11:10:19 AM   107.1.187.170    GET /x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticke HTTP/1.1
      6/1/2012 5:53:49 AM    107.1.2.180      GET /tv/3/player/vendor/Chef% /player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1
      6/1/2012               107.34.51.63     GET /tv/3/search/content/The%20Andy%20Griffith%20Show/s/T Andy%20Griffith%20Show HTTP/1.1
      6/1/2012 3:12:43 PM    107.5.115.117    GET /tv/3/search/content/Kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe %20Lee%20Gifford's%20epic%20'Today'%20gaffe HTTP/1.1
      6/1/2012 4:48:35       108.225.132.245  GET /tv/3/search/content/Deadliest%20Catch/s/Deadliest HTTP/1.1
      6/1/2012 10:25:12 AM   108.246.20.125   GET /x.php?u=http://studi 5.financialcontent.com/synacor?Page=QUOTE&Ticker=DJ:DJI HTTP/1.1
      6/1/2012 1:58:14 AM    108.246.25.117   GET /tv/3/player/vendor/Chef%20Tips /fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1
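Schema-less data such as a weblog line only becomes columns once some code imposes a structure on it at read time. A minimal Python sketch of that idea follows. The regular expression, the field names, and the function `parse_weblog_line` are assumptions for illustration, and the sample lines are taken from the slide as they appear there (including their truncation).

```python
import re

# Assumed layout: date, optional time, optional AM/PM, IPv4 address, request.
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{4})"
    r"(?:\s+(?P<time>\d{1,2}:\d{2}:\d{2})(?:\s+(?P<ampm>AM|PM))?)?"
    r"\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
    r"\s+(?P<request>.+)$"
)


def parse_weblog_line(line):
    """Split one raw weblog line into named fields; return None if it does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # The request is "<method> <path> <protocol>"; the path may contain spaces.
    method, _, rest = match.group("request").partition(" ")
    path, _, protocol = rest.rpartition(" ")
    return {
        "date": match.group("date"),
        "time": match.group("time"),    # None when the time is missing
        "ampm": match.group("ampm"),    # None when AM/PM is missing
        "ip": match.group("ip"),
        "method": method,
        "path": path,
        "protocol": protocol,
    }


sample_lines = [
    "6/1/2012 11:10:19 AM 107.1.187.170 GET /x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticke HTTP/1.1",
    "6/1/2012 4:48:35 108.225.132.245 GET /tv/3/search/content/Deadliest%20Catch/s/Deadliest HTTP/1.1",
]
parsed = [parse_weblog_line(line) for line in sample_lines]
for record in parsed:
    print(record)
```

The optional groups matter: several records on the slide lack a time or an AM/PM marker, so a parser that assumes a fixed layout would reject them.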
  • 45.
    Transforming Self-Describing Data
      ID      VALUE
      75295   { "employee" : { "number" : "6", "name" : "Manzarek", "initials" : "R", "street" : "Haseltine Lane" } }
      103819  { "employee" : { "number" : "7", "name" : "Metheny", "initials" : "P", "street" : "Brownstreet" } }
      132171  { "employee" : { "number" : "15", "name" : "Metheny", "initials" : "M" } }

      ID      EMPLOYEE_NUMBER  EMPLOYEE_NAME  EMPLOYEE_INITIALS  EMPLOYEE_STREET
      75295   6                Manzarek       R                  Haseltine Lane
      103819  7                Metheny        P                  Brownstreet
      132171  15               Metheny        M                  ?
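The mapping from the JSON values to the flat EMPLOYEE_* columns can be sketched with Python's standard `json` module. The column list `COLUMNS` and the function `to_row` are hypothetical names chosen for this example; a missing key becomes `None`, matching the "?" shown for the third record.

```python
import json

# Columns to extract from the "employee" object (assumed fixed for this sketch).
COLUMNS = ["number", "name", "initials", "street"]

records = [
    (75295, '{"employee": {"number": "6", "name": "Manzarek", "initials": "R", "street": "Haseltine Lane"}}'),
    (103819, '{"employee": {"number": "7", "name": "Metheny", "initials": "P", "street": "Brownstreet"}}'),
    (132171, '{"employee": {"number": "15", "name": "Metheny", "initials": "M"}}'),
]


def to_row(record_id, value):
    """Turn one (ID, JSON text) pair into a flat row; absent keys become None."""
    employee = json.loads(value).get("employee", {})
    row = {"ID": record_id}
    for column in COLUMNS:
        row["EMPLOYEE_" + column.upper()] = employee.get(column)
    return row


rows = [to_row(record_id, value) for record_id, value in records]
for row in rows:
    print(row)
```

Because the column list is fixed up front, any key that appears in the data but not in `COLUMNS` is silently ignored; discovering columns from the data itself would be a different design.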
  • 46.
    Architectural Challenges
      • Managing concurrent queries/users
      • Parallel execution of complex operations
      • Running complex analytical functions
      • Cost-based optimization
  • 47.
    Use Cases of SQL-on-Hadoop
      • Traditional Interactive Reporting and Analytics
      • Self-Service Business Intelligence
      • Batch Reporting
      • Point Queries
      • Operational Processing
      • Investigative Analytics
      • Data Stream Processing
      • Storage Cold Data Warehouse Data
      • Storage of External Data
      • Fast Staging Area
      • ETL (Pre)Processing Platform
      • New Use Cases and Non-Relational Data
      • …
  • 48.
    Watch out for Big Data Silos!
    Seven workloads, each in its own silo (Silo 1 to Silo 7):
      • batch processing
      • investigative analytics
      • point queries
      • operational processing
      • interactive reporting
      • data stream
      • analytics processing
  • 49.
    The Integration Labyrinth
    Each silo has its own dedicated integration solution:
      • analytics processing
      • batch processing
      • point queries
      • interactive reporting
      • operational processing
      • investigative analytics
      • data stream
  • 50.
    One Platform to Rule Them All
    One Data Management Platform for:
      • batch processing
      • investigative analytics
      • point queries
      • operational processing
      • interactive reporting
      • data stream
      • analytics processing
  • 51.
    Closing Remarks
      • SQL offers standardization and independency
      • SQL increases productivity and eases maintenance
      • Many SQL-on-Hadoop engines available
      • One platform
      • Being able to process all types of data is important
    Productivity, Maintainability, Time-to-market, Performance, Scalability, Availability