Future of Hadoop Analytics

Introducing the Total Data Warehouse
Matthew Aslett
Research Director, Data Management and Analytics, 451 Research
© 2014 by The 451 Group. All rights reserved

Matthew Aslett
• Research Director, Data Platforms and Analytics
• matthew.aslett@451research.com
• www.twitter.com/maslett
• Responsible for data management and analytics research agenda
• Focus on operational and analytic databases, including NoSQL, NewSQL, and Hadoop
• With 451 Research since 2007

Company Overview
• One company with 3 operating divisions
• Syndicated research, advisory, professional services, datacenter certification, and events
• Global focus
• 270+ staff
• 1,500+ client organizations: enterprises, vendors, service providers, and investment firms
• Organic growth and growth through acquisition

Hadoop and the data warehouse
• The rise of Apache Hadoop has been driven largely by demand for more flexible approaches to data management and analytics
  – Overcoming the limitations of traditional analytic databases and their adherence to strictly defined schema
• Hadoop is largely complementary to existing data warehouse deployments
• However, there is clear evidence that at least some workloads are being migrated from existing enterprise data warehouses to Hadoop
  – E.g. Teradata’s CEO noted in October 2013 that, on average, 20% of the total ETL workload on Teradata data warehouses could potentially move to Hadoop (4-8% of the total Teradata data warehouse workload)
• That has driven many people to question the extent to which Hadoop will replace the data warehouse
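The two Teradata figures can be related with a line of arithmetic. The sketch below assumes the 4-8% range is simply the 20% movable portion of ETL expressed as a share of the total workload; the function name is illustrative and not from the source.

```python
from fractions import Fraction

def implied_etl_share(movable_share_of_total, movable_share_of_etl):
    """ETL share of the total workload implied by the two quoted figures."""
    return movable_share_of_total / movable_share_of_etl

# Quoted figures: 20% of ETL could move, which is 4-8% of the total workload.
movable_of_etl = Fraction(20, 100)
low = implied_etl_share(Fraction(4, 100), movable_of_etl)
high = implied_etl_share(Fraction(8, 100), movable_of_etl)

print(f"ETL is roughly {float(low):.0%} to {float(high):.0%} of the total workload")
```

Under that assumption, ETL would make up between a fifth and two fifths of the work done on a typical Teradata data warehouse.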

Hadoop and the data warehouse
Survey question: Describe the relationship between Hadoop and the enterprise data warehouse within your organization
Survey conducted: Sept/Oct 2013
Sample: 98
[Chart: share of respondents per response]
• Hadoop not yet used
• Hadoop for workloads not previously on EDW
• Temporarily offloading workloads to Hadoop
• Permanently migrating workloads to Hadoop
• Hadoop replacing EDW
Takeaway: Two-thirds of Hadoop engagement is currently non-threatening or additive to existing data warehouse deployments

Hadoop replacing the data warehouse?
• Frames the question incorrectly
  – based on an assumption that a ‘data warehouse’ is by default based on an analytic relational database
• A data warehouse as an enterprise platform for storing, processing and analyzing data
  – could be based on an analytic database, Hadoop, or a combination of the two
• Hadoop is primarily used to handle unstructured and semi-structured data
  – not a good fit – in terms of economics and data formats – for analytic databases
• The future analytic data-processing landscape will be a hybrid of analytic databases and Hadoop
  – each used where appropriate for the individual analytic use case

Introducing the Total Data Warehouse
• There are various phrases used to describe this hybrid landscape
  – in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse
• The primary platforms in a Total Data Warehouse are expected to be analytic databases and Hadoop
• However, we also expect to see the Total Data Warehouse comprise other data storage and processing platforms
  – Exploratory analytics/discovery platforms
  – Search
  – Graph processing
  – Stream processing
  – Machine learning
  – Log processing
  – NoSQL databases
  – NewSQL databases

The Total Data Warehouse
[Diagram: analytic workloads running across the platforms of the Total Data Warehouse]
• Workloads: pre-defined reporting, ad hoc analytics, statistical analytics, predictive analytics, machine learning, MapReduce, search-based analytics, graph analytics, stream processing, log processing, operational intelligence
• Platforms:
  – Analytic database (structured data)
  – (New)SQL database (structured data, applications)
  – NoSQL (multi-structured data, applications)
  – Hadoop Distributed File System with YARN (multi-structured data)
  – Exploratory analytics platform (multi-structured data)

Data gravity and the Total Data Warehouse
• ‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data
• The balance of power is currently with the analytic database
• However, Hadoop’s flexibility to support data-processing engines beyond MapReduce could tip the balance in its favor in the long term
• Apache YARN enables multiple versions of MapReduce, and enables HDFS to support data-processing frameworks in addition to MapReduce
  – Native SQL analytics
  – Stream processing
  – Graph processing
  – Bulk synchronous parallel computing
  – Machine learning
• Apache Spark provides an in-memory platform supporting high-performance processing and multiple data processing engines

Example Total Data Warehouses
• Teradata’s Unified Data Architecture and QueryGrid: enables querying of data in Teradata Database, Aster Database and Hortonworks
• Pivotal’s Big Data Suite: HD Hadoop distribution, Greenplum Database, GemFire distributed data grid and HAWQ SQL-on-Hadoop query engine
• Cirro offers a federated approach to performing joins and query processing across multiple sources of data, including relational databases and Hadoop
• Microsoft PolyBase enables SQL Server 2012 PDW analysts to query data in Hadoop using Microsoft’s T-SQL
  – PolyBase is only available as part of the Microsoft Analytics Platform System (APS)
  – APS is an appliance that combines SQL Server 2012 PDW with Microsoft’s HDInsight distribution of Apache Hadoop
  – APS is also the only way that customers can adopt the SQL Server 2012 PDW data warehousing environment
  – For Microsoft at least, Hadoop is an integral part of the next-generation data warehouse

The role of SQL-on-Hadoop
• SQL-on-Hadoop engines clearly have a role to play in enabling the Total Data Warehouse
  – SQL-based querying of data in HDFS
  – Federation of queries across multiple data platforms
• SQL-on-Hadoop initiatives exploded in recent years as a means of uniting the large army of trained SQL analysts with the flexible data storage and processing capabilities of Hadoop
• But SQL-on-Hadoop engines are not created equal
  – Batch SQL-on-Hadoop
  – Interactive SQL-on-Hadoop
  – SQL-and-Hadoop
  – Operational SQL-on-Hadoop
• And the various offerings within those categories are differentiated

SQL on/and Hadoop
Approach | Details | Examples
Batch SQL-on-Hadoop | Native SQL-like processing of data in HDFS (via MR/Tez) | Hive on MapReduce
Interactive SQL-on-Hadoop | Specialist SQL-based query engine running on Hadoop | Apache Drill, Cloudera Impala, Hive on Tez, Spark SQL
SQL-and-Hadoop | Federated querying of data in Hadoop and RDBMS | Teradata, Microsoft, Oracle, IBM
Operational SQL-on-Hadoop | Operational database that stores data in HDFS | Splice Machine, Trafodion

SQL on Hadoop examples
Approach | Key features
Hive on Tez | Faster native querying than Hive on MapReduce, HiveQL compatibility, extreme-scale data joins
Apache Drill | ANSI SQL; Hadoop, MongoDB, Cassandra, Riak, etc.; consume JSON data, query hierarchical data
Cloudera Impala | High performance ad hoc processing, HiveQL compatibility, Parquet file format
Spark SQL | In-memory SQL processing, Catalyst query optimizer, replacing Shark (Hive on Spark)

Conclusion
• Hadoop is largely complementary to existing data warehouse deployments
• The future analytic data-processing landscape will be a hybrid of analytic databases and Hadoop
  – we call this the Total Data Warehouse
• ‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data
• The balance of power is currently with the analytic database
• Hadoop’s flexibility could tip the balance in its favor in the long term
• SQL-on-Hadoop engines clearly have a role to play in enabling the Total Data Warehouse
• But SQL-on-Hadoop engines are not created equal

Questions? Comments?
matthew.aslett@451research.com
@maslett

Self Service Data Exploration with Apache Drill
© 2014 MapR Technologies

The MapR Distribution including Apache Hadoop
• Big Data: Riding the Wave with Hadoop
• The Big Data Platform of Choice
• Exponential growth
• 500+ customers
• Premier investors
• >2x annual bookings
• 90% software licenses
• 80% of accounts expand 3X
• <1% lifetime churn
• >$1B in incremental revenue generated by 1 customer

The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem running on the MapR Data Platform (MapR-FS, MapR-DB), with management]
• Execution engines
  – Batch: MapReduce v1 & v2, YARN, Spark, Cascading, Pig, Tez*
  – Streaming: Storm*, Spark Streaming
  – NoSQL & Search: HBase, Accumulo*, Solr
  – SQL: Drill, Shark, Impala, Hive
  – ML, Graph: GraphX, MLLib, Mahout
• Data governance and operations
  – Provisioning & coordination: Savannah*, Juju, Whirr, ZooKeeper
  – Workflow & data governance: Oozie, Falcon*
  – Data integration & access: Hue, HttpFS, Flume, Sqoop
  – Security: Knox*, Sentry*
* Certification/support planned for 2014

[Chart: Total data stored, 1980-2020, structured data vs. unstructured data]
• Unstructured data will account for more than 80% of the data collected by organizations
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Today’s Data Comes in Different Shapes…
Social Media
Messages
Audio
Sensors
Mobile Data
Email
Clickstream

Distance to Data
[Diagram: the layers that sit between the business (analysts, developers) and the data]
• Existing approaches require a middleman (IT)
  – “Plumbing”: MapReduce development
  – Modeling and transformations: Hive and other SQL-on-Hadoop
• Data Agility: the business (analysts, developers) reaches the data directly

Why Improve Distance to Data?
Improve time to value
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
Reduce the burden on IT
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users

APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
• 40+ contributors
• 150+ years of experience building databases and distributed systems

Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration (zero-day analytics) | Not needed | Self-service

MapR Optimized Data Architecture
[Diagram: data flows from sources through the MapR Distribution for Hadoop to the data warehouse, analytics and operational applications]
• Sources: relational, SaaS, mainframe; documents, emails; blogs, tweets, link data; log files, clickstreams; sensors
• Data movement and data access into the MapR Data Platform (MapR-FS, MapR-DB)
• Processing on the MapR Distribution for Hadoop: Streaming (Spark Streaming, Storm); Batch / Search (MR, Spark, Hive, Pig, …); NoSQL ODBMS (HBase, Accumulo, …); Machine Learning
• Data transformation, enrichment and integration with the data warehouse
• Analytics: search, schema-less data exploration, BI and reporting, ad-hoc integrated analytics
• Operational apps: recommendations, fraud detection, logistics

(1) Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
Example records:
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}

(2) Drill’s Data Model is Flexible
[Chart: data models placed on two axes of flexibility, fixed schema to schema-less and flat to complex; formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
RDBMS/SQL-on-Hadoop table (flat, fixed schema):
Name | Gender | Age
Michael | M | 6
Jennifer | F | 3
Apache Drill table (complex, schema-less):
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
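To make the flat versus complex contrast concrete, here is a minimal Python sketch that reads nested fields from the two example records using dotted paths. It only illustrates the data shape; get_path is a hypothetical helper and says nothing about how Drill evaluates such paths internally.

```python
# Two records shaped like the slide's example; each carries its own field names.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

def get_path(record, dotted_path, default=None):
    """Follow a dotted path such as 'name.first' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return default  # the field is absent from this record
        value = value[key]
    return value

first_names = [get_path(r, "name.first") for r in records]
districts = [get_path(r, "district") for r in records]
hobby_counts = [len(get_path(r, "hobbies", [])) for r in records]
print(first_names, districts, hobby_counts)
```

A fixed-schema table would need a column for every field up front, whereas here a field missing from one record simply yields the default.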

(3) Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (schema on write, schema before read)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (schema on the fly)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
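As a rough illustration of discovering a schema from self-describing data, the sketch below scans records and collects every field path together with the value types seen. It is a simplified stand-in under stated assumptions, not Drill's actual mechanism; discover_schema and the sample records are illustrative.

```python
def discover_schema(records):
    """Return {dotted field path: set of type names} seen across the records."""
    schema = {}

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        visit(record, "")
    return schema

# Records with different fields: no schema is declared before reading them.
sample = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

schema = discover_schema(sample)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

The discovered schema grows as new fields appear, which is the "evolving schema" case listed above.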

Data Source is in the Query
SELECT `timestamp`, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2;
The table reference in the FROM clause has three parts:
• A storage engine instance (dfs1 above)
  – DFS
  – HBase
  – Hive Metastore/HCatalog
• A workspace (logs above)
  – Sub-directory
  – Hive database
• A table (the backquoted path above)
  – pathnames
  – HBase table
  – Hive table
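The three-part naming can be sketched in Python by splitting the table reference from the query above. The regular expression and the names split_table_reference, storage_engine, workspace and table are assumptions made for illustration; this is not Drill's parser, and it only covers the three-part, backquoted form.

```python
import re

# Assumed shape: <storage engine instance>.<workspace>.`<table path>`
TABLE_REFERENCE = re.compile(
    r"^(?P<storage_engine>\w+)\.(?P<workspace>\w+)\.`(?P<table>[^`]+)`$"
)

def split_table_reference(reference):
    """Split a three-part table reference into its named parts."""
    match = TABLE_REFERENCE.match(reference)
    if match is None:
        raise ValueError(f"not a three-part table reference: {reference!r}")
    return match.groupdict()

parts = split_table_reference("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`")
print(parts)
```

Because the source is named inside the query, pointing the same SQL at a different file or store only means changing this reference.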

Query Directory Trees
-- Query file: How many errors per level in Jan 2014?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;
-- Query directory sub-tree: How many errors per level?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;
-- Query some partitions: How many errors per level by month from 2012?
SELECT errorLevel, dirs[2], count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
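The dirs[...] references treat directory names as queryable values. A minimal Python sketch of that idea follows, assuming dirs is a zero-based list of the directory names above each file, so that dirs[1] is the year and dirs[2] the month as in the query above. The 2011 and 2012 file paths are hypothetical.

```python
from pathlib import PurePosixPath

def dirs_for(path):
    """Directory names above a file, relative to the workspace root."""
    return list(PurePosixPath(path.lstrip("/")).parts[:-1])

# Only the 2014/Jan path appears on the slide; the others are made up.
files = [
    "/AppServerLogs/2011/Dec/part0001.parquet",
    "/AppServerLogs/2012/Jan/part0001.parquet",
    "/AppServerLogs/2014/Jan/part0001.parquet",
]

# Equivalent of: WHERE dirs[1] >= 2012
selected = [f for f in files if int(dirs_for(f)[1]) >= 2012]
# Equivalent of grouping by dirs[2], keeping the year for readability
months = [(dirs_for(f)[1], dirs_for(f)[2]) for f in selected]
print(selected)
print(months)
```

Filtering on directory names means files outside the requested years never need to be opened.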

Works with HBase and Embedded Blobs
-- Query an HBase table directly (no schemas)
SELECT cf1.month, cf1.year
FROM hbase.table1;
-- Embedded JSON value inside column profileBlob inside column family cf1 of
-- the HBase table users
SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
);
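The embedded-blob query can be mimicked in Python: decode a bytes value stored in a column as JSON, then read fields from the result. The row layout and sample values below are hypothetical, convert_from_json is an illustrative stand-in for CONVERT_FROM(..., 'json'), and counting children per profile is one reading of the query.

```python
import json

# Hypothetical HBase-like rows: row key -> column family -> qualifier -> bytes.
rows = {
    b"user1": {"cf1": {"profileBlob": b'{"name": "Michael", "children": ["Ann", "Bob"]}'}},
    b"user2": {"cf1": {"profileBlob": b'{"name": "Jennifer", "children": []}'}},
}

def convert_from_json(blob):
    """Decode a UTF-8 encoded JSON blob into Python objects."""
    return json.loads(blob.decode("utf-8"))

profiles = [convert_from_json(cols["cf1"]["profileBlob"]) for cols in rows.values()]
summary = [(p["name"], len(p["children"])) for p in profiles]
print(summary)
```

The store itself only sees opaque bytes; structure appears once the blob is decoded at query time.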

Combine Data Sources on the Fly
-- Join log directory with JSON file (user profiles) to identify the name and email address for
-- anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs,
     dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND
      logs.errorLevel > 5;
-- Join a Hive table and an HBase table (without Hive metadata) to determine the number of
-- tweets per user
SELECT users.name, count(*) AS tweetCount
FROM hive.social.tweets tweets,
     hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;
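The second join can be sketched in Python as a lookup join between two differently shaped sources: a list of tweet records and a key-value store whose row keys are bytes. All sample data is hypothetical, and the code illustrates the join logic only, not how Drill plans or executes it.

```python
from collections import Counter

# Hypothetical stand-ins for hive.social.tweets and hbase.users.
tweets = [
    {"userId": "u1", "text": "first"},
    {"userId": "u2", "text": "second"},
    {"userId": "u1", "text": "third"},
    {"userId": "u9", "text": "no matching user"},
]
users = {b"u1": {"name": "Michael"}, b"u2": {"name": "Jennifer"}}

# Equivalent of convert_from(users.rowkey, 'UTF-8'): decode the key before joining.
names_by_id = {rowkey.decode("utf-8"): cols["name"] for rowkey, cols in users.items()}

# Inner join on userId, then count tweets per user.
counts_by_id = Counter(t["userId"] for t in tweets if t["userId"] in names_by_id)
tweet_counts = sorted((names_by_id[uid], n) for uid, n in counts_by_id.items())
print(tweet_counts)
```

The tweet from u9 is dropped because an inner join keeps only rows with a match on both sides.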

Summary
• Enable rapid data exploration and application development while
reducing the burden on IT
• Apache Drill 0.5 available now
• Get involved
– Download and play: http://incubator.apache.org/drill/
– Ask questions: drill-user@incubator.apache.org
– Contribute: http://github.com/apache/incubator-drill/
– Join the Drill team at MapR
• Email jacques@mapr.com
• www.mapr.com/careers
