2. Introducing the Total Data Warehouse
Matthew Aslett
Research Director, Data Management and Analytics, 451 Research
© 2014 by The 451 Group. All rights reserved
3. Matthew Aslett
• Research Director, Data Platforms and Analytics
matthew.aslett@451research.com
www.twitter.com/maslett
Responsible for data management
and analytics research agenda
Focus on operational and analytic
databases, including NoSQL,
NewSQL, and Hadoop
With 451 Research since 2007
4. Company Overview
One company with 3 operating
divisions
Syndicated research, advisory,
professional services, datacenter
certification, and events
Global focus
270+ staff
1,500+ client organizations:
enterprises, vendors, service
providers, and investment firms
Organic growth and growth through acquisition
5. Hadoop and the data warehouse
The rise of Apache Hadoop has been driven largely by demand for more
flexible approaches to data management and analytics
Overcoming the limitations of traditional analytic databases and their adherence
to strictly defined schema.
Hadoop is largely complementary to existing data warehouse deployments
However, there is clear evidence that at least some workloads are being
migrated from existing enterprise data warehouses to Hadoop
E.g. Teradata’s CEO noted in October 2013 that, on average, 20% of the
total ETL workload on Teradata data warehouses could potentially move to
Hadoop (4‐8% of the total Teradata data warehouse workload)
That has driven many people to question the extent to which Hadoop will
replace the data warehouse
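The Teradata figures on this slide imply a share of ETL in the total warehouse workload that is not stated directly. A minimal arithmetic sketch, assuming the 4‐8% range is simply the movable 20% of ETL expressed against the total workload (the function name is illustrative, not from the deck):

```python
from fractions import Fraction


def implied_etl_share(movable_share_of_etl, movable_share_of_total):
    """Return the ETL share of total workload implied by two percentages.

    If a fraction m of the ETL workload can move, and that same work is a
    fraction t of the total workload, then ETL is t / m of the total.
    """
    return movable_share_of_total / movable_share_of_etl


movable_of_etl = Fraction(20, 100)  # 20% of ETL could move to Hadoop
low = implied_etl_share(movable_of_etl, Fraction(4, 100))
high = implied_etl_share(movable_of_etl, Fraction(8, 100))

print(f"ETL is {float(low):.0%} to {float(high):.0%} of total workload")
```

Under that assumption, ETL would account for roughly a fifth to two-fifths of the total Teradata workload.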
6. Describe the relationship between Hadoop and the enterprise
data warehouse within your organization
Survey conducted: Sept/Oct 2013
Sample: 98
Hadoop and the data warehouse
[Chart: survey responses by category]
• Hadoop not yet used
• Hadoop for workloads not previously on EDW
• Hadoop replacing EDW
• Permanently migrating workloads to Hadoop
• Temporarily offloading workloads to Hadoop
Two‐thirds of Hadoop engagement is currently non‐threatening or additive to existing data warehouse deployments
7. Hadoop replacing the data warehouse?
Frames the question incorrectly
based on an assumption that a ‘data warehouse’ is by default based on an
analytic relational database
A data warehouse as an enterprise platform for storing, processing and
analyzing data
could be based on an analytic database, Hadoop, or a combination of the two
Hadoop is primarily used to handle unstructured and semi‐structured data
not a good fit – in terms of economics and data formats – for analytic databases
The future analytic data‐processing landscape will be a hybrid of analytic
databases and Hadoop
each used where appropriate for the individual analytic use case.
8. Introducing the Total Data Warehouse
There are various phrases used to describe this hybrid landscape
in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse
The primary platforms in a Total Data Warehouse are expected to be analytic
databases and Hadoop
However we also expect to see the Total Data Warehouse comprise other data
storage and processing platforms
Exploratory analytics/discovery platforms
Search
Graph processing
Stream processing
Machine learning
Log processing
NoSQL databases
NewSQL databases
9. The Total Data Warehouse
There are various phrases used to describe this hybrid landscape
in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse
[Diagram: the Total Data Warehouse]
Workloads: pre‐defined reporting, ad hoc analytics, statistical analytics, predictive analytics, machine learning, MapReduce, search‐based analytics, graph analytics, stream processing, log processing, operational intelligence
Platforms:
• Analytic database (structured data)
• (New) SQL database (structured data applications)
• NoSQL (multi‐structured data applications)
• Hadoop Distributed File System with YARN (multi‐structured data)
• Exploratory analytics platform (multi‐structured data)
10. Data gravity and the Total Data Warehouse
‘Data gravity’ suggests that processing resources will migrate to the
platform that stores the most data, or perhaps the most important data
The balance of power is currently with the analytic database
However, Hadoop’s flexibility to support data‐processing engines beyond
MapReduce could tip the balance in its favor in the long term
Apache YARN enables multiple versions of MapReduce to run on the same cluster, and enables HDFS to
support data‐processing frameworks in addition to MapReduce
Native SQL analytics
Stream processing
Graph processing
Bulk synchronous parallel computing
Machine learning
Apache Spark provides an in‐memory platform supporting high‐performance
processing and multiple data processing engines
11. Example Total Data Warehouses
Teradata’s Unified Data Architecture and QueryGrid ‐ enables querying of
data in Teradata Database, Aster Database and Hortonworks
Pivotal’s Big Data Suite ‐ HD Hadoop distribution/Greenplum
Database/GemFire distributed data grid and HAWQ SQL‐on‐Hadoop query
engine
Cirro offers a federated approach to performing joins and query processing
across multiple sources of data including relational database and Hadoop
Microsoft PolyBase enables SQL Server 2012 PDW analysts to query data in
Hadoop using Microsoft’s T‐SQL
PolyBase is only available as part of the Microsoft Analytics Platform System
(APS)
APS is an appliance that combines SQL Server 2012 PDW with Microsoft’s
HDInsight distribution of Apache Hadoop
APS is also the only way that customers can adopt the SQL Server 2012 PDW data
warehousing environment
For Microsoft at least, Hadoop is an integral part of the next‐generation data
warehouse
12. The role of SQL‐on‐Hadoop
SQL‐on‐Hadoop engines clearly have a role to play in enabling the
Total Data Warehouse
SQL‐based querying of data in HDFS
Federation of queries across multiple data platforms
SQL‐on‐Hadoop initiatives exploded in recent years as a means of uniting
the large army of trained SQL analysts with the flexible data storage and
processing capabilities of Hadoop
But SQL‐on‐Hadoop engines are not created equal
Batch SQL‐on‐Hadoop
Interactive SQL‐on‐Hadoop
SQL‐and‐Hadoop
Operational SQL‐on‐Hadoop
And the various offerings within those categories are differentiated
13. SQL on/and Hadoop
Approach | Details | Examples
Batch SQL‐on‐Hadoop | Native SQL‐like processing of data in HDFS (via MR/Tez) | Hive on MapReduce
Interactive SQL‐on‐Hadoop | Specialist SQL‐based query engine running on Hadoop | Apache Drill, Cloudera Impala, Hive on Tez, Spark SQL
SQL‐and‐Hadoop | Federated querying of data in Hadoop and RDBMS | Teradata, Microsoft, Oracle, IBM
Operational SQL‐on‐Hadoop | Operational database that stores data in HDFS | Splice Machine, Trafodion
14. SQL on Hadoop examples
Approach | Key features
Hive on Tez | Faster native querying than Hive on MapReduce, HiveQL compatibility, extreme‐scale data joins
Apache Drill | ANSI SQL, Hadoop, MongoDB, Cassandra, Riak, etc; consume JSON data, query hierarchical data
Cloudera Impala | High performance ad hoc processing, HiveQL compatibility, Parquet file format
Spark SQL | In‐memory SQL processing, Catalyst query optimizer, replacing Shark (Hive on Spark)
15. Conclusion
Hadoop is largely complementary to existing data warehouse deployments
The future analytic data‐processing landscape will be a hybrid of analytic
databases and Hadoop
we call this the Total Data Warehouse
‘Data gravity’ suggests that processing resources will migrate to the
platform that stores the most data, or perhaps the most important data
The balance of power is currently with the analytic database
Hadoop’s flexibility could tip the balance in its favor in the long term
SQL‐on‐Hadoop engines clearly have a role to play in enabling the
Total Data Warehouse
But SQL‐on‐Hadoop engines are not created equal
17. Self Service Data Exploration with Apache Drill
© 2014 MapR Technologies
18. The MapR Distribution including Apache Hadoop
The Big Data Platform of Choice
Exponential Growth
500+ Customers
Premier Investors
• >2x annual bookings
• 90% software licenses
• 80% of accounts expand 3X
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer
Big Data
Riding the Wave with Hadoop
19. The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem on the MapR Data Platform]
Layers: Management; Execution engines; Data governance and operations; MapR Data Platform (MapR-FS, MapR-DB)
Categories: Batch; SQL; Streaming; NoSQL & Search; ML, Graph; Security; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access
Components: Spark, Cascading, Pig, MapReduce v1 & v2, Tez*, YARN, Drill, Shark, Impala, Hive, Storm*, Spark Streaming, Solr, HBase, Accumulo*, GraphX, MLLib, Mahout, Hue, HttpFS, Flume, Sqoop, Knox*, Sentry*, Falcon*, Oozie, ZooKeeper, Whirr, Savannah*, Juju
* Certification/support planned for 2014
20. Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, structured data vs. unstructured data, 1980 to 2020]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
21. Today’s Data Comes in Different Shapes…
Social Media
Messages
Audio
Sensors
Mobile Data
Email
Clickstream
22. Distance to Data
Existing approaches require a middleman (IT)
• MapReduce development: Business (analysts, developers), then “Plumbing”, then Data
• Hive and other SQL-on-Hadoop: Business (analysts, developers), then Modeling and transformations, then Data
23. Distance to Data
Existing approaches require a middleman (IT)
• MapReduce development: Business (analysts, developers), then “Plumbing”, then Data
• Hive and other SQL-on-Hadoop: Business (analysts, developers), then Modeling and transformations, then Data
• Data Agility: Business (analysts, developers), then Data
24. Why Improve Distance to Data?
Improve time to value
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
Reduce the burden on IT
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
25. APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
40+ contributors
150+ years of experience building databases and distributed systems
26. Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration (zero-day analytics) | Not needed | Self-service
27. MapR Optimized Data Architecture
[Diagram: MapR Optimized Data Architecture]
Sources: relational, SaaS, mainframe; documents, emails; blogs, tweets, link data; log files, clickstreams; sensors
MapR Distribution for Hadoop:
• Streaming (Spark Streaming, Storm)
• Batch / Search (MR, Spark, Hive, Pig, …)
• NoSQL ODBMS (HBase, Accumulo, …)
• Machine Learning
• Data Transformation, Enrichment and Integration
• MapR Data Platform (MapR-DB, MapR-FS)
Data warehouse: linked to the MapR distribution by Data Movement and Data Access
Analytics: Search; Schema-less data exploration; BI, reporting; Ad-hoc integrated analytics
Operational Apps: Recommendations; Fraud Detection; Logistics
28. (1) Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
29. (2) Drill’s Data Model is Flexible
[Diagram: data formats (HBase, JSON, BSON, CSV, TSV, Parquet, Avro) placed along two flexibility axes: fixed schema to schema-less, and flat to complex]
RDBMS/SQL-on-Hadoop table
Name | Gender | Age
Michael | M | 6
Jennifer | F | 3
Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
30. (3) Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
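To make "schema discovered on the fly" concrete, here is a minimal Python sketch that derives a schema from the two self-describing records shown on the earlier slides. It only illustrates the idea and is not how Drill implements discovery; the function names and the dotted-path convention are assumptions made for this example.

```python
import json

# The two self-describing records from the earlier slides, one per line.
RECORDS = """
{"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["ski", "soccer"], "district": "Los Altos"}
{"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["sing"], "preschool": "CCLC"}
"""


def discover_schema(value, prefix=""):
    """Return {dotted field path: type name} for one record."""
    schema = {}
    if isinstance(value, dict):
        for key, child in value.items():
            schema.update(discover_schema(child, f"{prefix}{key}."))
    else:
        schema[prefix.rstrip(".")] = type(value).__name__
    return schema


def merged_schema(lines):
    """Union the per-record schemas; no schema is declared in advance."""
    merged = {}
    for line in lines:
        if line.strip():
            for path, type_name in discover_schema(json.loads(line)).items():
                merged.setdefault(path, set()).add(type_name)
    return merged


schema = merged_schema(RECORDS.splitlines())
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Fields such as district and preschool appear in only one record each, yet both end up in the merged schema without any central repository being consulted.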
32. Zero to Results in 2 Minutes (3 Commands)
Install:
$ tar xzf apache-drill.tar.gz
Launch shell (embedded mode):
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local
Query:
0: jdbc:drill:zk=local>
SELECT count(*) AS incidents, columns[1] AS category
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
GROUP BY columns[1]
ORDER BY incidents DESC;
Results:
+------------+-----------------+
| incidents  | category        |
+------------+-----------------+
| 8372       | LARCENY/THEFT   |
| 4247       | OTHER OFFENSES  |
| 3765       | NON-CRIMINAL    |
| 2502       | ASSAULT         |
...
35 rows selected (0.847 seconds)
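For readers unfamiliar with the columns[1] notation, the following Python sketch mimics what the query computes: each CSV row is treated as an array of values, rows are grouped by the value at one position, and groups are sorted by descending count. The sample rows are hypothetical stand-ins for the SFPD file, and the sketch assumes the columns array is zero-based, so columns[1] is the second field.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real SFPD file has more columns and rows.
SAMPLE_CSV = """\
140001,LARCENY/THEFT,Monday
140002,ASSAULT,Monday
140003,LARCENY/THEFT,Tuesday
140004,NON-CRIMINAL,Tuesday
140005,LARCENY/THEFT,Wednesday
140006,ASSAULT,Thursday
"""


def count_by_column(csv_text, index):
    """Count rows per value of the column at position `index` (0-based)."""
    rows = csv.reader(io.StringIO(csv_text))
    counts = Counter(row[index] for row in rows if row)
    # most_common() sorts by descending count, like ORDER BY incidents DESC.
    return counts.most_common()


result = count_by_column(SAMPLE_CSV, 1)
for category, incidents in result:
    print(incidents, category)
```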
33. Data Source is in the Query
A storage engine instance (dfs1 in the query below)
- DFS
- HBase
- Hive Metastore/HCatalog
A workspace (logs in the query below)
- Sub-directory
- Hive database
A table (the backquoted path in the query below)
- pathnames
- HBase table
- Hive table
SELECT `timestamp`, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
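The three-part naming on this slide can be sketched as a small parser. This is a simplified illustration and not Drill's own parser: the regular expression, the function name, and the choice to treat the workspace as optional are assumptions made for the example.

```python
import re

# Simplified pattern: storage_engine[.workspace].`table`
_TABLE_REF = re.compile(
    r"^(?P<engine>\w+)(?:\.(?P<workspace>\w+))?\.`(?P<table>[^`]+)`$"
)


def split_table_ref(ref):
    """Split a table reference into (storage engine, workspace, table)."""
    match = _TABLE_REF.match(ref)
    if match is None:
        raise ValueError(f"unrecognized table reference: {ref!r}")
    return match.group("engine"), match.group("workspace"), match.group("table")


parts = split_table_ref("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`")
print(parts)
```

A reference without a workspace, such as the dfs.`/tmp/...csv` form used on the earlier install slide, yields None for the workspace in this sketch.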
34. Query Directory Trees
# Query file: How many errors per level in Jan 2014?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;
# Query directory sub-tree: How many errors per level?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;
# Query some partitions: How many errors per level by month from 2012?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
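The partition query can be pictured as filtering on directory names before any file is read. The Python sketch below illustrates that idea only; it assumes, as the example query implies, that dirs[1] and dirs[2] refer to the year and month directory levels under /AppServerLogs. The file list and error levels are hypothetical stand-ins for Parquet contents.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical files mapped to the error levels they contain.
FILES = {
    "/AppServerLogs/2011/Dec/part0001.parquet": [1, 3],
    "/AppServerLogs/2012/Jan/part0001.parquet": [2, 2, 5],
    "/AppServerLogs/2014/Jan/part0001.parquet": [3],
}


def directory_levels(path, root):
    """Directory names between `root` and the file, outermost first."""
    relative = PurePosixPath(path).relative_to(root)
    return relative.parts[:-1]


def errors_by_level_and_month(files, root, min_year):
    """Count errors per (level, month), skipping directories before min_year."""
    counts = Counter()
    for path, error_levels in files.items():
        year, month = directory_levels(path, root)[:2]
        if int(year) < min_year:
            continue  # the whole file is skipped based on its path alone
        for level in error_levels:
            counts[(level, month)] += 1
    return counts


counts = errors_by_level_and_month(FILES, "/AppServerLogs", 2012)
for (level, month), total in sorted(counts.items()):
    print(level, month, total)
```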
35. Works with HBase and Embedded Blobs
# Query an HBase table directly (no schemas)
SELECT cf1.month, cf1.year
FROM hbase.table1;
# Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
)
GROUP BY profile.name;
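The embedded-blob query can be read as two steps: decode each stored byte value as JSON, then aggregate over fields of the decoded object. A minimal Python sketch of those steps follows; the rows are hypothetical, and the helper only imitates the role CONVERT_FROM plays in the query rather than reproducing Drill's behavior.

```python
import json

# Hypothetical rows: a column family cf1 holding a JSON document as raw bytes.
ROWS = [
    {"cf1": {"profileBlob": b'{"name": "Ann", "children": ["Bo"]}'}},
    {"cf1": {"profileBlob": b'{"name": "Ann", "children": ["Cy", "Di"]}'}},
    {"cf1": {"profileBlob": b'{"name": "Raj"}'}},
]


def convert_from_json(blob):
    """Decode a UTF-8 byte value into a nested Python object."""
    return json.loads(blob.decode("utf-8"))


def count_children_by_name(rows):
    """Per name, count rows whose profile has a non-null children field."""
    counts = {}
    for row in rows:
        profile = convert_from_json(row["cf1"]["profileBlob"])
        name = profile["name"]
        counts.setdefault(name, 0)
        # SQL count(expr) counts only rows where expr is not NULL.
        if profile.get("children") is not None:
            counts[name] += 1
    return counts


result = count_children_by_name(ROWS)
print(result)
```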
36. Combine Data Sources on the Fly
# Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs,
dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND
logs.errorLevel > 5;
# Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
SELECT users.name, count(*) as tweetCount
FROM hive.social.tweets tweets,
hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;
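The first join on this slide combines flat log records with nested user profiles. The Python sketch below walks through the same logic on hypothetical data: match on uid, keep errors above a threshold, reach into the nested emails.work field, and de-duplicate. All names and sample values are illustrative assumptions.

```python
import json

# Hypothetical log records and user profiles.
LOGS = [
    {"uid": 1, "errorLevel": 7, "message": "disk full"},
    {"uid": 2, "errorLevel": 3, "message": "slow query"},
    {"uid": 1, "errorLevel": 9, "message": "node down"},
    {"uid": 3, "errorLevel": 6, "message": "timeout"},
]
PROFILES_JSON = """
{"id": 1, "name": "Ann", "emails": {"work": "ann@example.com"}}
{"id": 2, "name": "Raj", "emails": {"work": "raj@example.com"}}
{"id": 3, "name": "Li", "emails": {"home": "li@example.com"}}
"""


def users_with_errors(logs, profiles_text, min_level):
    """Distinct (name, work email) for users with an error above min_level."""
    users = {}
    for line in profiles_text.splitlines():
        if line.strip():
            user = json.loads(line)
            users[user["id"]] = user
    matches = set()  # a set plays the role of SELECT DISTINCT
    for entry in logs:
        user = users.get(entry["uid"])
        if user is not None and entry["errorLevel"] > min_level:
            # A missing nested field becomes None, like a SQL NULL.
            matches.add((user["name"], user["emails"].get("work")))
    return sorted(matches, key=lambda pair: pair[0])


result = users_with_errors(LOGS, PROFILES_JSON, 5)
print(result)
```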
37. Summary
• Enable rapid data exploration and application development while
reducing the burden on IT
• Apache Drill 0.5 available now
• Get involved
– Download and play: http://incubator.apache.org/drill/
– Ask questions: drill-user@incubator.apache.org
– Contribute: http://github.com/apache/incubator-drill/
– Join the Drill team at MapR
• Email jacques@mapr.com
• www.mapr.com/careers