© 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 1
Introducing the Total Data Warehouse 
Matthew Aslett 
Research Director, Data Management and Analytics, 451 Research 
© 2014 by The 451 Group. All rights reserved
 Matthew Aslett 
• Research Director, Data Platforms and Analytics 
 matthew.aslett@451research.com 
 www.twitter.com/maslett 
 Responsible for data management 
and analytics research agenda 
 Focus on operational and analytic 
databases, including NoSQL, 
NewSQL, and Hadoop 
© 2014 by The 451 Group. All rights reserved 
 With 451 Research since 2007
© 2014 by The 451 Group. All rights reserved 
Company Overview 
 One company with 3 operating 
divisions 
 Syndicated research, advisory, 
professional services, datacenter 
certification, and events 
 Global focus 
 270+ staff 
 1,500+ client organizations: 
enterprises, vendors, service 
providers, and investment firms 
 Organic and growth through 
acquisition
Hadoop and the data warehouse 
 The rise of Apache Hadoop has been driven largely by demand for more 
flexible approaches to data management and analytics 
 Overcoming the limitations of traditional analytic databases and their adherence 
© 2014 by The 451 Group. All rights reserved 
to strictly defined schema. 
 Hadoop is largely complementary to existing data warehouse deployments 
 However, there is clear evidence that at least some workloads are being 
migrated from existing enterprise data warehouses to Hadoop 
 E.g. Teradata’s CEO noted in October 2013 that, on average, 20% of the 
total ETL workload on Teradata data warehouses could potentially move to 
Hadoop (4‐8% of the total Teradata data warehouse workload) 
 That has driven many people to question the extent to which Hadoop will 
replace the data warehouse
Describe the relationship between Hadoop and the enterprise 
data warehouse within your organization 
Survey conducted: Sept/Oct 2013 
© 2014 by The 451 Group. All rights reserved 
Sample: 98 
Hadoop and the data warehouse 
Hadoop not yet 
used 
Hadoop for 
workloads not 
Hadoop replacing 
Permanently 
migrating 
workloads to 
Hadoop 
EDW 
Temporarily previously on EDW 
offloading 
workloads to 
Hadoop 
 Two‐thirds of 
Hadoop 
engagement is 
currently 
non‐threatening 
or additive to 
existing data 
warehouse 
deployments
Hadoop replacing the data warehouse? 
 Frames the question incorrectly 
 based on an assumption that a ‘data warehouse’ is by default based on an 
© 2014 by The 451 Group. All rights reserved 
analytic relational database 
 A data warehouse as an enterprise platform for storing, processing and 
analyzing data 
 could be based on an analytic database, Hadoop, or a combination of the two 
 Hadoop is primarily used to handle unstructured and semi‐structured data 
 not a good fit – in terms of economics and data formats – for analytic databases 
 The future analytic data‐processing landscape will be a hybrid of analytic 
databases and Hadoop 
 each used where appropriate for the individual analytic use case.
Introducing the Total Data Warehouse 
 There are various phrases used to describe this hybrid landscape 
 in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse 
 The primary platforms in a Total Data Warehouse are expected to be analytic 
databases and Hadoop 
 However we also expect to see the Total Data Warehouse comprise other data 
storage and processing platforms 
 Exploratory analytics/discovery platforms 
 Search 
 Graph processing 
 Stream processing 
 Machine learning 
 Log processing 
 NoSQL databases 
 NewSQL databases 
© 2014 by The 451 Group. All rights reserved
 There are various phrases used to describe this hybrid landscape 
 in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse 
© 2014 by The 451 Group. All rights reserved 
The Total Data Warehouse 
PRE‐DEFINED 
REPORTING 
AD HOC 
ANALYTICS 
STATISTICAL 
ANALYTICS 
PREDICTIVE 
ANALYTICS 
MACHINE 
LEARNING MAPREDUCE 
SEARCH‐BASED 
ANALYTICS 
GRAPH 
ANALYTICS 
STREAM 
PROCESSING 
MULTI‐STRUCTURED 
DATA 
APPLICATIONS 
OPERATIONAL 
INTELLIGENCE 
NOSQL 
MULTI‐STRUCTURED 
DATA 
APPLICATIONS 
ANALYTIC 
DATABASE 
STRUCTURED DATA 
(NEW) SQL 
DATABASE 
STRUCTURED DATA 
APPLICATIONS 
HADOOP 
DISTRIBUTED 
FILE SYSTEM 
MULTI‐STRUCTURED 
DATA 
YARN 
LOG 
PROCESSING 
EXPLORATORY 
ANALYTICS 
PLATFORM 
MULTI‐STRUCTURED 
DATA
Data gravity and the Total Data Warehouse 
 ‘Data gravity’ suggests that processing resources will migrate to the 
platform that stores the most data, or perhaps the most important data 
 The balance of power is currently with the analytic database 
 However, Hadoop’s flexibility to support data‐processing engines beyond 
MapReduce could tip the balance in its favor in the long term 
 Apache YARN enables multiple versions of MapReduce, and for HDFS to 
support data‐processing frameworks in addition to MapReduce 
 Native SQL analytics 
 Stream processing 
 Graph processing 
 Bulk synchronous parallel computing 
 Machine learning 
 Apache Spark provides an in‐memory platform supporting high‐performance 
processing and multiple data processing engines 
© 2014 by The 451 Group. All rights reserved
Example Total Data Warehouses 
 Teradata’s Unified Data Architecture and QueryGrid ‐ enables querying of 
data in Teradata Database, Aster Database and Hortonworks 
 Pivotal’s Big Data Suite ‐ HD Hadoop distribution/Greenplum 
Database/GemFire distributed data grid and HAWQ SQL‐on‐Hadoop query 
engine 
 Cirro offers a federated approach to performing joins and query processing 
across multiple sources of data including relational database and Hadoop 
 Microsoft PolyBase enables SQL Server 2012 PDW analysts to query data in 
Hadoop using Microsoft’s T‐SQL 
 PolyBase is only available as part of the Microsoft Analytics Platform System 
(APS) 
 APS is an appliance that combines SQL Server 2012 PDW with Microsoft’s 
HDInsight distribution of Apache Hadoop 
 APS is also the only way that customers can adopt SQL Server 2012 PDW data 
warehousing environment 
 For Microsoft at least, Hadoop is an integral part of the next‐generation data 
warehouse 
© 2014 by The 451 Group. All rights reserved
 SQL‐on‐Hadoop engines clearly have a role to play in enabling the 
Total Data Warehouse 
 SQL‐based querying of data in HDFS 
 Federation of queries across multiple data platforms 
 SQL‐on‐Hadoop initiatives exploded in recent years as a means of uniting 
the large army of trained SQL analysts with the flexible data storage and 
processing capabilities of Hadoop 
 But SQL‐on‐Hadoop engines are not created equal 
 Batch SQL‐on‐Hadoop 
 Interactive SQL‐on‐Hadoop 
 SQL‐and‐Hadoop 
 Operational SQL‐on‐Hadoop 
 And the various offerings within those categories are differentiated 
© 2014 by The 451 Group. All rights reserved 
The role of SQL‐on‐Hadoop
Approach Details Examples 
© 2014 by The 451 Group. All rights reserved 
SQL on/and Hadoop 
Batch 
SQL‐on‐Hadoop 
Native SQL‐like processing of 
data in HDFS (via MR/Tez) Hive on MapReduce 
Interactive 
SQL‐on‐Hadoop 
Specialist SQL‐based query 
engine running on Hadoop 
Apache Drill, Cloudera 
Impala, Hive on Tez, 
Spark SQL 
SQL‐and‐Hadoop 
Federated querying of data in 
Hadoop and RDBMS 
Teradata, Microsoft, 
Oracle, IBM 
Operational 
SQL‐on‐Hadoop 
Operational database that 
stores in in HDFS 
Splice Machine, 
Trafodion
Approach Key features 
© 2014 by The 451 Group. All rights reserved 
SQL on Hadoop examples 
Hive on Tez 
Faster native querying than Hive on MapReduce, 
HiveQL compatibility, extreme‐scale data joins 
Apache Drill 
ANSI SQL, Hadoop, MongoDB, Cassandra, Riak, etc; 
consume JSON data, query hierarchical data 
Cloudera Impala High performance ad hoc processing, 
HiveQL compatibility, Parquet file format 
Spark SQL 
In‐memory SQL processing, Catalyst query 
optimizer, replacing Shark (Hive on Spark)
 Hadoop is largely complementary to existing data warehouse deployments 
 The future analytic data‐processing landscape will be a hybrid of analytic 
databases and Hadoop 
 we call this the Total Data Warehouse 
 ‘Data gravity’ suggests that processing resources will migrate to the 
platform that stores the most data, or perhaps the most important data 
 The balance of power is currently with the analytic database 
 Hadoop’s flexibility tip the balance in its favor in the long term 
 SQL‐on‐Hadoop engines clearly have a role to play in enabling the 
Total Data Warehouse 
 But SQL‐on‐Hadoop engines are not created equal 
© 2014 by The 451 Group. All rights reserved 
Conclusion
Questions? Comments? 
matthew.aslett@451research.com 
@maslett 
© 2014 by The 451 Group. All rights reserved
Self Service Data Exploration with Apache Drill 
© 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 1
The MapR Distribution including Apache Hadoop 
© 2014 MapR Technologies 2 
Exponential 
Growth 
500+ 
Customers 
Premier 
Investors 
>2x annual bookings 
90% software licenses 
80% of accounts expand 3X 
< 1% lifetime churn 
> $1B in incremental revenue 
generated by 1 customer 
Big Data 
Riding the Wave with 
Hadoop 
The Big Data 
Platform 
of Choice
The Power of the Open Source Community 
Provisioning 
& 
coordination 
Savannah* 
Workflow 
& Data 
Governance 
Data 
Integration 
& Access 
Hue 
HttpFS 
Flume Knox* Falcon* Whirr 
MapR-FS MapR-DB 
© 2014 MapR Technologies 3 
Management 
APACHE HADOOP AND OSS ECOSYSTEM 
Streaming 
Storm* 
NoSQL & 
Search 
Solr 
MapR Data Platform 
Security 
SQL 
Drill 
Shark 
Impala 
YARN 
Batch 
Spark 
Cascading 
Pig 
Spark 
Streaming 
HBase 
Juju 
ML, Graph 
GraphX 
MLLib 
Mahout 
MapReduce 
v1 & v2 
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS 
Tez* 
Accumulo* 
Hive 
Sqoop Sentry* Oozie ZooKeeper 
* Certification/support planned for 2014
© 2014 MapR Technologies 4 
UNSTRUCTURED 
DATA 
Unstructured data will account 
for more than 80% of the data 
collected by organizations 
STRUCTURED DATA 
1980 1990 2000 2010 2020 
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data 
Total Data Stored
Today’s Data Comes in Different Shapes… 
© 2014 MapR Technologies 5 
Social Media 
Messages 
Audio 
Sensors 
Mobile Data 
Email 
Clickstream
© 2014 MapR Technologies 6 
Distance to Data 
Business 
(analysts, developers) 
“Plumbing” 
MapReduce development 
Business 
(analysts, developers) 
Modeling and 
transformations 
Hive and other 
SQL-on-Hadoop 
Existing approaches 
require a middleman (IT) 
Data 
Data
© 2014 MapR Technologies 7 
Distance to Data 
Business 
(analysts, developers) 
Existing approaches 
require a middleman (IT) 
“Plumbing” 
MapReduce development 
Hive and other 
SQL-on-Hadoop 
Business 
Data Agility (analysts, developers) 
Data 
Data 
Data 
Business 
(analysts, developers) 
Modeling and 
transformations
Improve time to value Redu2ce the burden on IT 
© 2014 MapR Technologies 8 
Why Improve Distance to Data? 
• Enable rapid data exploration and 
application development 
• IT should provide a valuable 
service without “getting in the way” 
• Can’t add DBAs to keep up with 
the exponential data growth 
• Minimize “unnecessary work” so IT 
can focus on value-added 
activities and become a partner to 
the business users
• Pioneering Data Agility for Hadoop 
• Apache open source project 
• Scale-out execution engine for low-latency queries 
• Unified SQL-based API for analytics & operational applications 
© 2014 MapR Technologies 9 
APACHE DRILL 
40+ contributors 
150+ years of experience building 
databases and distributed systems
Evolution Towards Self-Service Data Exploration 
© 2014 MapR Technologies 10 
Data Modeling and 
Transformation 
Data Visualization 
IT-driven 
IT-driven 
IT-driven 
Self-service 
IT-driven 
Self-service 
Not needed 
Self-service 
Traditional BI 
w/ RDBMS 
Self-Service BI 
w/ RDBMS SQL-on-Hadoop 
Self-Service 
Data Exploration 
Zero-day analytics
Optimized Data Architecture Machine Learning 
© 2014 MapR Technologies 11 
MapR Optimized Data Architecture 
Sources 
RELATIONAL, 
SAAS, 
MAINFRAME 
DOCUMENTS, 
EMAILS 
BLOGS, 
TWEETS, 
LINK DATA 
LOG FILES, 
CLICKSTREAMS 
SENSORS 
Streaming 
(Spark Streaming, Storm) 
Batch / Search 
(MR, Spark, Hive, Pig, …) 
NoSQL ODBMS 
(HBase, Accumulo, …) 
MapR Data Platform 
MapR-DB 
MAPR DISTRIBUTION FOR HADOOP 
MapR-FS 
MAPR DISTRIBUTION FOR HADOOP 
DATA WAREHOUSE 
Data Movement 
Data Access 
Analytics 
Search 
Schema-less 
data exploration 
BI, reporting 
Ad-hoc integrated 
analytics 
Data Transformation, Enrichment 
and Integration 
Operational Apps 
Recommendations 
Fraud Detection 
Logistics
© 2014 MapR Technologies 12 
(1) Self-Describing Data is Ubiquitous 
Flat files in DFS 
• Complex data (Thrift, Avro, protobuf) 
• Columnar data (Parquet, ORC) 
• Loosely defined (JSON) 
• Traditional files (CSV, TSV) 
Data stored in NoSQL stores 
• Relational-like (rows, columns) 
• Sparse data (NoSQL maps) 
• Embedded blobs (JSON) 
• Document stores (nested objects) 
{ 
name: { 
first: Michael, 
last: Smith 
}, 
hobbies: [ski, soccer], 
district: Los Altos 
}{ 
name: { 
first: Jennifer, 
last: Gates 
}, 
hobbies: [sing], 
preschool: CCLC 
}
RDBMS/SQL-on-Hadoop table 
Apache Drill table 
© 2014 MapR Technologies 13 
(2) Drill’s Data Model is Flexible 
Fixed schema Schema-less 
HBase 
JSON 
BSON 
CSV 
TSV 
Parquet 
Avro 
Flat 
Complex 
Flexibility 
Flexibility 
Name Gender Age 
Michael M 6 
Jennifer F 3 
{ 
name: { 
first: Michael, 
last: Smith 
}, 
hobbies: [ski, soccer], 
district: Los Altos 
}{ 
name: { 
first: Jennifer, 
last: Gates 
}, 
hobbies: [sing], 
preschool: CCLC 
}
(3) Drill Supports Schema Discovery On-The-Fly 
Schema Declared In Advance Schema2 Discovered On-The-Fly 
© 2014 MapR Technologies 14 
• Fixed schema 
• Leverage schema in centralized 
repository (Hive Metastore) 
• Fixed schema, evolving schema or 
schema-less 
• Leverage schema in centralized 
repository or self-describing data 
SCHEMA ON 
WRITE 
SCHEMA 
BEFORE READ 
SCHEMA ON THE 
FLY
Quick Tour 
Self-Service Data Exploration with Apache Drill 
© ©20 21041 M4 aMpaRp RTe Tcehcnhonloogloiegsies 15
Zero to Results in 2 Minutes (3 Commands) 
$ tar xzf apache-drill.tar.gz 
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local 
0: jdbc:drill:zk=local> 
SELECT count(*) AS incidents, columns[1] AS category 
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv` 
GROUP BY columns[1] 
ORDER BY incidents DESC; 
+------------+------------+ 
| incidents | category | 
+------------+------------+ 
| 8372 | LARCENY/THEFT | 
| 4247 | OTHER OFFENSES | 
| 3765 | NON-CRIMINAL | 
| 2502 | ASSAULT | 
... 
35 rows selected (0.847 seconds) 
Install 
Launch shell 
(embedded 
mode) 
Query 
Results 
© 2014 MapR Technologies 16
© 2014 MapR Technologies 17 
A storage engine instance 
- DFS 
- HBase 
- Hive Metastore/HCatalog 
A workspace 
- Sub-directory 
- Hive database 
A table 
- pathnames 
- HBase table 
- Hive table 
Data Source is in the Query 
SELECT timestamp, message 
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` 
WHERE errorLevel > 2
© 2014 MapR Technologies 18 
Query Directory Trees 
# Query file: How many errors per level in Jan 2014? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` 
GROUP BY errorLevel; 
# Query directory sub-tree: How many errors per level? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs` 
GROUP BY errorLevel; 
# Query some partitions: How many errors per level by month from 2012? 
SELECT errorLevel, count(*) 
FROM dfs.logs.`/AppServerLogs` 
WHERE dirs[1] >= 2012 
GROUP BY errorLevel, dirs[2];
Works with HBase and Embedded Blobs 
# Query an HBase table directly (no schemas) 
SELECT cf1.month, cf1.year 
FROM hbase.table1; 
# Embedded JSON value inside column profileBlob inside column family cf1 of 
the HBase table users 
SELECT profile.name, count(profile.children) 
FROM ( 
SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile 
FROM hbase.users 
) 
© 2014 MapR Technologies 19
Combine Data Sources on the Fly 
# Join log directory with JSON file (user profiles) to identify the name and email address for 
anyone associated with an error message. 
SELECT DISTINCT users.name, users.emails.work 
FROM dfs.logs.`/data/logs` logs, 
© 2014 MapR Technologies 20 
dfs.users.`/profiles.json` users 
WHERE logs.uid = users.id AND 
logs.errorLevel > 5; 
# Join a Hive table and an HBase table (without Hive metadata) to determine the number of 
tweets per user 
SELECT users.name, count(*) as tweetCount 
FROM hive.social.tweets tweets, 
hbase.users users 
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8') 
GROUP BY tweets.userId;
Summary 
• Enable rapid data exploration and application development while 
© 2014 MapR Technologies 21 
reducing the burden on IT 
• Apache Drill 0.5 available now 
• Get involved 
– Download and play: http://incubator.apache.org/drill/ 
– Ask questions: drill-user@incubator.apache.org 
– Contribute: http://github.com/apache/incubator-drill/ 
– Join the Drill team at MapR 
• Email jacques@mapr.com 
• www.mapr.com/careers

Future of-hadoop-analytics

  • 1.
    © 2014 MapRTechno©lo 2g0ie1s4 MapR Technologies 1
  • 2.
    Introducing the TotalData Warehouse Matthew Aslett Research Director, Data Management and Analytics, 451 Research © 2014 by The 451 Group. All rights reserved
  • 3.
     Matthew Aslett • Research Director, Data Platforms and Analytics  matthew.aslett@451research.com  www.twitter.com/maslett  Responsible for data management and analytics research agenda  Focus on operational and analytic databases, including NoSQL, NewSQL, and Hadoop © 2014 by The 451 Group. All rights reserved  With 451 Research since 2007
  • 4.
    © 2014 byThe 451 Group. All rights reserved Company Overview  One company with 3 operating divisions  Syndicated research, advisory, professional services, datacenter certification, and events  Global focus  270+ staff  1,500+ client organizations: enterprises, vendors, service providers, and investment firms  Organic and growth through acquisition
  • 5.
    Hadoop and thedata warehouse  The rise of Apache Hadoop has been driven largely by demand for more flexible approaches to data management and analytics  Overcoming the limitations of traditional analytic databases and their adherence © 2014 by The 451 Group. All rights reserved to strictly defined schema.  Hadoop is largely complementary to existing data warehouse deployments  However, there is clear evidence that at least some workloads are being migrated from existing enterprise data warehouses to Hadoop  E.g. Teradata’s CEO noted in October 2013 that, on average, 20% of the total ETL workload on Teradata data warehouses could potentially move to Hadoop (4‐8% of the total Teradata data warehouse workload)  That has driven many people to question the extent to which Hadoop will replace the data warehouse
  • 6.
    Describe the relationshipbetween Hadoop and the enterprise data warehouse within your organization Survey conducted: Sept/Oct 2013 © 2014 by The 451 Group. All rights reserved Sample: 98 Hadoop and the data warehouse Hadoop not yet used Hadoop for workloads not Hadoop replacing Permanently migrating workloads to Hadoop EDW Temporarily previously on EDW offloading workloads to Hadoop  Two‐thirds of Hadoop engagement is currently non‐threatening or additive to existing data warehouse deployments
  • 7.
    Hadoop replacing thedata warehouse?  Frames the question incorrectly  based on an assumption that a ‘data warehouse’ is by default based on an © 2014 by The 451 Group. All rights reserved analytic relational database  A data warehouse as an enterprise platform for storing, processing and analyzing data  could be based on an analytic database, Hadoop, or a combination of the two  Hadoop is primarily used to handle unstructured and semi‐structured data  not a good fit – in terms of economics and data formats – for analytic databases  The future analytic data‐processing landscape will be a hybrid of analytic databases and Hadoop  each used where appropriate for the individual analytic use case.
  • 8.
    Introducing the TotalData Warehouse  There are various phrases used to describe this hybrid landscape  in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse  The primary platforms in a Total Data Warehouse are expected to be analytic databases and Hadoop  However we also expect to see the Total Data Warehouse comprise other data storage and processing platforms  Exploratory analytics/discovery platforms  Search  Graph processing  Stream processing  Machine learning  Log processing  NoSQL databases  NewSQL databases © 2014 by The 451 Group. All rights reserved
  • 9.
     There arevarious phrases used to describe this hybrid landscape  in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse © 2014 by The 451 Group. All rights reserved The Total Data Warehouse PRE‐DEFINED REPORTING AD HOC ANALYTICS STATISTICAL ANALYTICS PREDICTIVE ANALYTICS MACHINE LEARNING MAPREDUCE SEARCH‐BASED ANALYTICS GRAPH ANALYTICS STREAM PROCESSING MULTI‐STRUCTURED DATA APPLICATIONS OPERATIONAL INTELLIGENCE NOSQL MULTI‐STRUCTURED DATA APPLICATIONS ANALYTIC DATABASE STRUCTURED DATA (NEW) SQL DATABASE STRUCTURED DATA APPLICATIONS HADOOP DISTRIBUTED FILE SYSTEM MULTI‐STRUCTURED DATA YARN LOG PROCESSING EXPLORATORY ANALYTICS PLATFORM MULTI‐STRUCTURED DATA
  • 10.
    Data gravity andthe Total Data Warehouse  ‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data  The balance of power is currently with the analytic database  However, Hadoop’s flexibility to support data‐processing engines beyond MapReduce could tip the balance in its favor in the long term  Apache YARN enables multiple versions of MapReduce, and for HDFS to support data‐processing frameworks in addition to MapReduce  Native SQL analytics  Stream processing  Graph processing  Bulk synchronous parallel computing  Machine learning  Apache Spark provides an in‐memory platform supporting high‐performance processing and multiple data processing engines © 2014 by The 451 Group. All rights reserved
  • 11.
    Example Total DataWarehouses  Teradata’s Unified Data Architecture and QueryGrid ‐ enables querying of data in Teradata Database, Aster Database and Hortonworks  Pivotal’s Big Data Suite ‐ HD Hadoop distribution/Greenplum Database/GemFire distributed data grid and HAWQ SQL‐on‐Hadoop query engine  Cirro offers a federated approach to performing joins and query processing across multiple sources of data including relational database and Hadoop  Microsoft PolyBase enables SQL Server 2012 PDW analysts to query data in Hadoop using Microsoft’s T‐SQL  PolyBase is only available as part of the Microsoft Analytics Platform System (APS)  APS is an appliance that combines SQL Server 2012 PDW with Microsoft’s HDInsight distribution of Apache Hadoop  APS is also the only way that customers can adopt SQL Server 2012 PDW data warehousing environment  For Microsoft at least, Hadoop is an integral part of the next‐generation data warehouse © 2014 by The 451 Group. All rights reserved
  • 12.
     SQL‐on‐Hadoop enginesclearly have a role to play in enabling the Total Data Warehouse  SQL‐based querying of data in HDFS  Federation of queries across multiple data platforms  SQL‐on‐Hadoop initiatives exploded in recent years as a means of uniting the large army of trained SQL analysts with the flexible data storage and processing capabilities of Hadoop  But SQL‐on‐Hadoop engines are not created equal  Batch SQL‐on‐Hadoop  Interactive SQL‐on‐Hadoop  SQL‐and‐Hadoop  Operational SQL‐on‐Hadoop  And the various offerings within those categories are differentiated © 2014 by The 451 Group. All rights reserved The role of SQL‐on‐Hadoop
  • 13.
    Approach Details Examples © 2014 by The 451 Group. All rights reserved SQL on/and Hadoop Batch SQL‐on‐Hadoop Native SQL‐like processing of data in HDFS (via MR/Tez) Hive on MapReduce Interactive SQL‐on‐Hadoop Specialist SQL‐based query engine running on Hadoop Apache Drill, Cloudera Impala, Hive on Tez, Spark SQL SQL‐and‐Hadoop Federated querying of data in Hadoop and RDBMS Teradata, Microsoft, Oracle, IBM Operational SQL‐on‐Hadoop Operational database that stores in in HDFS Splice Machine, Trafodion
  • 14.
    Approach Key features © 2014 by The 451 Group. All rights reserved SQL on Hadoop examples Hive on Tez Faster native querying than Hive on MapReduce, HiveQL compatibility, extreme‐scale data joins Apache Drill ANSI SQL, Hadoop, MongoDB, Cassandra, Riak, etc; consume JSON data, query hierarchical data Cloudera Impala High performance ad hoc processing, HiveQL compatibility, Parquet file format Spark SQL In‐memory SQL processing, Catalyst query optimizer, replacing Shark (Hive on Spark)
  • 15.
     Hadoop islargely complementary to existing data warehouse deployments  The future analytic data‐processing landscape will be a hybrid of analytic databases and Hadoop  we call this the Total Data Warehouse  ‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data  The balance of power is currently with the analytic database  Hadoop’s flexibility tip the balance in its favor in the long term  SQL‐on‐Hadoop engines clearly have a role to play in enabling the Total Data Warehouse  But SQL‐on‐Hadoop engines are not created equal © 2014 by The 451 Group. All rights reserved Conclusion
  • 16.
    Questions? Comments? matthew.aslett@451research.com @maslett © 2014 by The 451 Group. All rights reserved
  • 17.
    Self Service DataExploration with Apache Drill © 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 1
  • 18.
    The MapR Distributionincluding Apache Hadoop © 2014 MapR Technologies 2 Exponential Growth 500+ Customers Premier Investors >2x annual bookings 90% software licenses 80% of accounts expand 3X < 1% lifetime churn > $1B in incremental revenue generated by 1 customer Big Data Riding the Wave with Hadoop The Big Data Platform of Choice
  • 19.
    The Power ofthe Open Source Community Provisioning & coordination Savannah* Workflow & Data Governance Data Integration & Access Hue HttpFS Flume Knox* Falcon* Whirr MapR-FS MapR-DB © 2014 MapR Technologies 3 Management APACHE HADOOP AND OSS ECOSYSTEM Streaming Storm* NoSQL & Search Solr MapR Data Platform Security SQL Drill Shark Impala YARN Batch Spark Cascading Pig Spark Streaming HBase Juju ML, Graph GraphX MLLib Mahout MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Tez* Accumulo* Hive Sqoop Sentry* Oozie ZooKeeper * Certification/support planned for 2014
  • 20.
    © 2014 MapRTechnologies 4 UNSTRUCTURED DATA Unstructured data will account for more than 80% of the data collected by organizations STRUCTURED DATA 1980 1990 2000 2010 2020 Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data Total Data Stored
  • 21.
    Today’s Data Comesin Different Shapes… © 2014 MapR Technologies 5 Social Media Messages Audio Sensors Mobile Data Email Clickstream
  • 22.
    © 2014 MapRTechnologies 6 Distance to Data Business (analysts, developers) “Plumbing” MapReduce development Business (analysts, developers) Modeling and transformations Hive and other SQL-on-Hadoop Existing approaches require a middleman (IT) Data Data
  • 23.
    © 2014 MapRTechnologies 7 Distance to Data Business (analysts, developers) Existing approaches require a middleman (IT) “Plumbing” MapReduce development Hive and other SQL-on-Hadoop Business Data Agility (analysts, developers) Data Data Data Business (analysts, developers) Modeling and transformations
  • 24.
    Improve time tovalue Redu2ce the burden on IT © 2014 MapR Technologies 8 Why Improve Distance to Data? • Enable rapid data exploration and application development • IT should provide a valuable service without “getting in the way” • Can’t add DBAs to keep up with the exponential data growth • Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
  • 25.
    • Pioneering DataAgility for Hadoop • Apache open source project • Scale-out execution engine for low-latency queries • Unified SQL-based API for analytics & operational applications © 2014 MapR Technologies 9 APACHE DRILL 40+ contributors 150+ years of experience building databases and distributed systems
  • 26.
    Evolution Towards Self-ServiceData Exploration © 2014 MapR Technologies 10 Data Modeling and Transformation Data Visualization IT-driven IT-driven IT-driven Self-service IT-driven Self-service Not needed Self-service Traditional BI w/ RDBMS Self-Service BI w/ RDBMS SQL-on-Hadoop Self-Service Data Exploration Zero-day analytics
  • 27.
    Optimized Data ArchitectureMachine Learning © 2014 MapR Technologies 11 MapR Optimized Data Architecture Sources RELATIONAL, SAAS, MAINFRAME DOCUMENTS, EMAILS BLOGS, TWEETS, LINK DATA LOG FILES, CLICKSTREAMS SENSORS Streaming (Spark Streaming, Storm) Batch / Search (MR, Spark, Hive, Pig, …) NoSQL ODBMS (HBase, Accumulo, …) MapR Data Platform MapR-DB MAPR DISTRIBUTION FOR HADOOP MapR-FS MAPR DISTRIBUTION FOR HADOOP DATA WAREHOUSE Data Movement Data Access Analytics Search Schema-less data exploration BI, reporting Ad-hoc integrated analytics Data Transformation, Enrichment and Integration Operational Apps Recommendations Fraud Detection Logistics
  • 28.
    © 2014 MapRTechnologies 12 (1) Self-Describing Data is Ubiquitous Flat files in DFS • Complex data (Thrift, Avro, protobuf) • Columnar data (Parquet, ORC) • Loosely defined (JSON) • Traditional files (CSV, TSV) Data stored in NoSQL stores • Relational-like (rows, columns) • Sparse data (NoSQL maps) • Embedded blobs (JSON) • Document stores (nested objects) { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 29.
    RDBMS/SQL-on-Hadoop table ApacheDrill table © 2014 MapR Technologies 13 (2) Drill’s Data Model is Flexible Fixed schema Schema-less HBase JSON BSON CSV TSV Parquet Avro Flat Complex Flexibility Flexibility Name Gender Age Michael M 6 Jennifer F 3 { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 30.
    (3) Drill SupportsSchema Discovery On-The-Fly Schema Declared In Advance Schema2 Discovered On-The-Fly © 2014 MapR Technologies 14 • Fixed schema • Leverage schema in centralized repository (Hive Metastore) • Fixed schema, evolving schema or schema-less • Leverage schema in centralized repository or self-describing data SCHEMA ON WRITE SCHEMA BEFORE READ SCHEMA ON THE FLY
  • 31.
    Quick Tour Self-ServiceData Exploration with Apache Drill © ©20 21041 M4 aMpaRp RTe Tcehcnhonloogloiegsies 15
  • 32.
    Zero to Resultsin 2 Minutes (3 Commands) $ tar xzf apache-drill.tar.gz $ apache-drill/bin/sqlline -u jdbc:drill:zk=local 0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv` GROUP BY columns[1] ORDER BY incidents DESC; +------------+------------+ | incidents | category | +------------+------------+ | 8372 | LARCENY/THEFT | | 4247 | OTHER OFFENSES | | 3765 | NON-CRIMINAL | | 2502 | ASSAULT | ... 35 rows selected (0.847 seconds) Install Launch shell (embedded mode) Query Results © 2014 MapR Technologies 16
  • 33.
    © 2014 MapRTechnologies 17 A storage engine instance - DFS - HBase - Hive Metastore/HCatalog A workspace - Sub-directory - Hive database A table - pathnames - HBase table - Hive table Data Source is in the Query SELECT timestamp, message FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` WHERE errorLevel > 2
  • 34.
    © 2014 MapRTechnologies 18 Query Directory Trees # Query file: How many errors per level in Jan 2014? SELECT errorLevel, count(*) FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` GROUP BY errorLevel; # Query directory sub-tree: How many errors per level? SELECT errorLevel, count(*) FROM dfs.logs.`/AppServerLogs` GROUP BY errorLevel; # Query some partitions: How many errors per level by month from 2012? SELECT errorLevel, count(*) FROM dfs.logs.`/AppServerLogs` WHERE dirs[1] >= 2012 GROUP BY errorLevel, dirs[2];
  • 35.
    Works with HBaseand Embedded Blobs # Query an HBase table directly (no schemas) SELECT cf1.month, cf1.year FROM hbase.table1; # Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users SELECT profile.name, count(profile.children) FROM ( SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile FROM hbase.users ) © 2014 MapR Technologies 19
  • 36.
    Combine Data Sourceson the Fly # Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message. SELECT DISTINCT users.name, users.emails.work FROM dfs.logs.`/data/logs` logs, © 2014 MapR Technologies 20 dfs.users.`/profiles.json` users WHERE logs.uid = users.id AND logs.errorLevel > 5; # Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user SELECT users.name, count(*) as tweetCount FROM hive.social.tweets tweets, hbase.users users WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8') GROUP BY tweets.userId;
  • 37.
    Summary • Enablerapid data exploration and application development while © 2014 MapR Technologies 21 reducing the burden on IT • Apache Drill 0.5 available now • Get involved – Download and play: http://incubator.apache.org/drill/ – Ask questions: drill-user@incubator.apache.org – Contribute: http://github.com/apache/incubator-drill/ – Join the Drill team at MapR • Email jacques@mapr.com • www.mapr.com/careers