Hadoop World 2011: The Blind Men and the Elephant - Matthew Aslett - The 451 Group

The Blind Men and the Elephant
Matthew Aslett, Senior Analyst, The 451 Group
Hadoop World, 8 November, 2011

© 2011 by The 451 Group. All rights reserved

Agenda
 Introduction and family history
 The Blind Men and the Elephant
 What is the point of Hadoop?
 Adoption trends
 Big data, total data
 Exploratory analytics
 Hadoop-related business strategies
 Contributors and their contributions
 A cautionary tale


The 451 Group
451 Research is focused on the business of enterprise IT
innovation. The company’s analysts provide critical and timely
insight into the competitive dynamics of innovation in
emerging technology segments.

Tier1 Research is a single-source research and advisory firm covering
the multi-tenant datacenter, hosting, IT and cloud-computing
sectors, blending the best of industry and financial research.

The Uptime Institute is ‘The Global Data Center Authority’ and a
pioneer in the creation and facilitation of end-user knowledge
communities to improve reliability and uninterruptible availability
in datacenter facilities.

TheInfoPro is a leading IT advisory and research firm that provides
real-world perspectives on the customer and market dynamics of the
enterprise information technology landscape, harnessing the
collective knowledge and insight of leading IT organizations
worldwide.

ChangeWave Research is a research firm that identifies and quantifies
‘change’ in consumer spending behavior, corporate purchasing, and
industry, company and technology trends.


451 Research
 Matthew Aslett
• Senior analyst, enterprise software
• With The 451 Group since 2007
• www.twitter.com/maslett

Information Management
 Operational databases
 Data warehousing
 Data caching
 Event processing
 Hadoop first properly covered in March 2009 report covering the formation of Apache Hadoop distributor Cloudera

Commercial Adoption of Open Source (CAOS)
 Open source projects
 Adoption of open source software
 Vendor strategies
 Hadoop first covered February 2008 as part of coverage of emerging open source data management projects


A family history?



“It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.”

John Godfrey Saxe (1872)



“After Hadoop finishes
filtering the data, the place
you want to put that data
is in Oracle Database.”

Larry Ellison (2011)


Oracle Big Data Appliance
Apache Hadoop

NoSQL Database

Oracle Tools

Oracle Database
Data Integrator for Oracle Database

Data Loader
R distribution

Big data integration


What is the point of Hadoop?

Big data storage    Big data integration    Big data analytics

 Yes, depending on who you ask (and when)


Example deployment

Orbitz
Processes millions of searches and transactions a day, resulting in hundreds of GBs of log data

Big data storage
 Early Hadoop adopter for long-term storage and processing of un/semi-structured data
 Too much data to store and process in data warehouse due to cost and space considerations

Big data analytics
 Adopted Hive for SQL-like query capabilities
 Also, machine learning to automate hotel ranking based on user behavior
 Hadoop provided repository to store and query search logs and MapReduce a more efficient data extraction process

Big data integration
 Creating data exports to R, and aggregating data to data warehouse
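
As a rough illustration of the kind of log extraction described above, the sketch below runs a MapReduce-style count over search log lines in plain Python. The tab-separated layout (timestamp, user, hotel ID) and all function names are assumptions made for the example, not Orbitz's actual log format or code.

```python
from collections import defaultdict

def map_search(line):
    """Map step: emit (hotel_id, 1) for one log line (hypothetical layout)."""
    timestamp, user, hotel_id = line.rstrip("\n").split("\t")
    yield hotel_id, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_count(hotel_id, counts):
    """Reduce step: total number of searches for one hotel."""
    return hotel_id, sum(counts)

def searches_per_hotel(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_search(line))
    return dict(reduce_count(key, values) for key, values in shuffle(pairs).items())
```

In Hive, an aggregation of this shape would typically be written as a SQL-like GROUP BY query instead of hand-written map and reduce functions.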


Vendor timeline – 451 Research coverage
[Timeline figure, MAR 09 to OCT 11. Vendors shown: MicroStrategy, Quest Software, Opera, EMC, NetApp, Dell, Pervasive, Platfora, Jaspersoft, Revolution, Hadapt, SAS, Appistry, Informatica, MapR, Amazon, SnapLogic, Cloudera, Tableau, Oracle, Pentaho, IBM, Hortonworks, Karmasphere, Kitenga, Talend, Microsoft, Datameer, DataStax, RainStor, Platform, ZettaSet, Gluster, Teradata, Composite]

The Apache Hadoop ecosystem
Big data analytics: Microsoft, IBM, Revolution, Platfora, Karmasphere, ZettaSet, MicroStrategy, Tableau, Pentaho, Kitenga, Datameer, Jaspersoft, Opera, SAS

Big data integration: RainStor, Platform, Pervasive, Informatica, Composite, Talend, IBM, Quest, Hadapt, SnapLogic, Oracle, Teradata, Microsoft, Cloudera

Hadoop distributors: Cloudera, Hortonworks, Microsoft, DataStax, IBM, MapR, EMC, Amazon

Big data storage: Appistry, EMC, Dell, IBM, Gluster, NetApp


Current data management trends

The amount of data to be stored, managed and analyzed is growing rapidly

Data processing capabilities have never been better

The value of data has never been better understood

Preliminary survey results – for illustration purposes
% Change: 2013 vs. 2011
Enterprise Data Warehouse            198%
Regional/Departmental Data Marts     169%
Exploratory Analytics Platform       183%
Hadoop Cluster                       115%
Data Archive                         394%
Operational Databases                703%
Searchable Data Platform             259%
Total Data Growth 2011-2013          180%

RISK    OPPORTUNITY

 The data deluge problem is also a big data opportunity


What is Big Data?
 More than just rising data volumes

 Big Data ≠ Volume


What is Big Data?
 Also variety of data types/sources and velocity of data updates

 Big Data = Volume Variety Velocity

 Preliminary survey results – for illustrative purposes:

“My organization’s existing data management architecture is suitable to meet its future demands for business intelligence”
Strongly Agree/Agree: 29%
Neutral: 34%
Disagree/Strongly Disagree: 37%


Current data management trends

The volume, variety and velocity of data is growing rapidly

‘Big Data’ covers a diverse set of products that can be applied to different problems

Data processing capabilities have never been better

The value of data has never been better understood

RISK    OPPORTUNITY

 ‘Big Data’ highlights the problem – volume/variety/velocity,
 and promises a solution – value,
 but doesn’t provide a path in between


What is Total Data?
 Not just another name for Big Data

 Inspired by ‘Total Football’ – a new approach to soccer that emerged in the late 1960s

 If your data is big, the way you manage it should be total

 Total Data is making the most efficient use of existing and new data management resources to deliver value from data


What is Total Data?
 Also the desire of the user to store and process all their data

 Value = (Volume Variety Velocity) x Totality

Big data
storage


What is Total Data?
 Within tolerable time frames

 Value = ((Volume Variety Velocity) x Totality) / Time

 Stream processing
 S4
 Storm
 Percolator

Hadoop


What is Total Data?
 And the desire to explore data for new value

 Value = ((Volume Variety Velocity) x (Totality + Exploration)) / Time

Big data
analytics


Data exploration
 Schema on write: Application, RDBMS with schema, queried with SQL
 Schema on read: Application, schema, Hadoop, processed with MapReduce
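
The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration only: the JSON record format and the field names (`user`, `hotel_id`) are hypothetical, and neither function reflects how an RDBMS or Hadoop is actually implemented.

```python
import json

REQUIRED_FIELDS = ("user", "hotel_id")  # hypothetical fixed schema

def write_with_schema(table, record):
    """Schema on write: validate before storing and reject records that do not fit."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Only the declared columns are kept; anything else is dropped at load time.
    table.append({field: record[field] for field in REQUIRED_FIELDS})

def read_with_schema(raw_lines, fields):
    """Schema on read: raw lines are stored untouched and structure is applied per query."""
    for line in raw_lines:
        document = json.loads(line)
        # Fields absent from a given line come back as None instead of failing the load.
        yield {field: document.get(field) for field in fields}
```

With schema on read, a later query can ask for fields nobody anticipated when the data was stored, which is what makes the approach attractive for exploration.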


Data exploration

 Exploratory Analytics Platform
 RDBMS + UDFs
 SQL-MapReduce
 Splunk
 HPCC Systems
 Dryad
 Tenzing
 Dremel
 Piccolo

[Diagram: Application, RDBMS with schema, Analytics; Application, loose schema, Hadoop, MapReduce]


Data platforms for different data types
Data type                           Enterprise Data Warehouse    Exploratory Analytics Platform    Hadoop
Customer Data                       59%                          5%                                11%
Transactional Data                  51%                          8%                                11%
Domain-specific Application Data    46%                          14%                               14%
Online Transaction Data             46%                          11%                               11%
Application Log Data                41%                          16%                               14%
Other Documents/Content             35%                          16%                               16%
Audio/Video/Graphics                30%                          14%                               24%
Network Log Data                    30%                          16%                               22%
Search Log                          27%                          19%                               22%
Other Log Files                     27%                          16%                               24%
Web Log Data                        27%                          19%                               22%
Social Media/Online Data            24%                          22%                               24%


Data platforms for different application workloads
Application workload                Enterprise Data Warehouse    Exploratory Analytics Platform    Hadoop
Data Consolidation                  49%                          11%                               14%
Data Storage for Compliance         49%                          11%                               16%
Financial Forecasting               49%                          16%                               8%
Decision Support                    49%                          22%                               8%
Data Sandboxing                     43%                          16%                               11%
Trend Analysis                      43%                          19%                               19%
Data Indexing/Search                41%                          16%                               19%
Ad Hoc, Iterative Analysis          41%                          22%                               16%
Customer Analysis                   38%                          22%                               14%
IT Data Analysis                    35%                          22%                               16%
Clickstream Analysis                30%                          22%                               19%


eBay’s Singularity platform
Analyze & Report    Discover & Explore

Data warehouse
 6+PB Teradata EDW
 Structured SQL analysis
 500+ concurrent users

Singularity
 40+PB Teradata appliance
 Semi-structured SQL
 150+ concurrent users

Hadoop
 20+PB Hadoop cluster
 Unstructured analysis
 5-10 concurrent users

 ‘soft data projection’ – apply structural patterns as the data is analyzed

 support for user-defined functions that go beyond standard SQL

 a SQL interface familiar to existing analysts


What is Total Data?
 While maximizing the investment in existing skills and resources

 Value = ((Volume Variety Velocity) x (Totality + Exploration)) / (Time x Skills and Resources)

Big data
integration


 Total Data is making the most efficient use of existing and new data
management resources to deliver value from data

 Inspired by ‘Total Football’


The old way

[Diagram components: Apps, relational databases, data cleansing/MDM, EDW, data marts, data archive, Reporting/BI]


The old way
Data archive    Operational database    Analytic database    Business intelligence


The new way

[Diagram components: Apps, cache, stream processing, relational databases, NoSQL database, NewSQL database, non-relational database, Datastructure, “Data Hub”, Hadoop, EDW, data mart, exploratory analytics platform, queryable archive, Reporting/BI; labelled Big data integration, Big data storage and Big data analytics]


The new way

[Diagram components: ‘Data Hub’, data archive, exploratory analytics, data cache/grid, non-relational database, Hadoop, Datastructure, NoSQL database, data warehouse, event stream processing, relational database]


Relevant reports
 Total Data
• Explaining the total data management approach to dealing with the impact of big data on the data management landscape
• Coming late 2011
• sales@the451group.com
 Free copy for completing our Total Data survey:
 www.bit.ly/451data


Hadoop-related business strategies

Apache Hadoop stack: Hadoop Common, HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Avro, Chukwa, Sqoop, Mahout, Flume, Whirr, Hama

Community and support subscription:
 Hortonworks
 Cloudera CDH
 IBM BigInsights

Management and support subscription:
 Cloudera Enterprise
 IBM BigInsights Enterprise
 Hortonworks Data Platform


Apache Hadoop contributors

Source: Datameer blog. http://datameer.com/blog/uncategorized/whose-hadoop-is-bigger-really-2.html


Key contributors

Source: Hortonworks blog. http://www.hortonworks.com/reality-check-contributions-to-apache-hadoop/


Key contributors

Source: Cloudera blog. http://www.cloudera.com/blog/2011/10/the-community-effect/



[Diagram: the Apache Hadoop stack with Management and Support subscription layers]

Default alternatives:
MapR/EMC – Direct Access NFS
DataStax – CassandraFS

Optional alternatives:
IBM – GPFS
Appistry – CloudIQ
Gluster – GlusterFS



[Diagram: the Apache Hadoop stack with Management and Support subscription layers]

Default alternatives:
MapR/EMC – JobTracker HA
Platform – Platform MapReduce


Hadoop component alternatives

 Concerns about JobTracker and NameNode as single points of failure (SPOF)

MapReduce: one JobTracker, six TaskTrackers
HDFS: one NameNode, six DataNodes


Apache Hadoop 0.23 and beyond
 NextGen MapReduce splits JobTracker into resource management
and application lifecycle management
NextGen MapReduce: one Resource Manager, five Node Managers, five App Masters
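
The split of responsibilities described above can be sketched as a toy model in Python. The class names echo the slide, but the methods (`allocate`, `release`, `submit`, `step`) and the slot-counting logic are assumptions made for illustration and are not the Apache Hadoop API.

```python
class ResourceManager:
    """Cluster-wide role: hands out container slots and knows nothing about tasks."""

    def __init__(self, slots_per_node):
        self.free = dict(slots_per_node)  # node name -> free container slots

    def allocate(self, count):
        """Grant up to `count` containers, returned as a list of node names."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, node):
        """Return one container slot on `node` to the pool."""
        self.free[node] += 1


class AppMaster:
    """Per-application role: tracks the lifecycle of one job's tasks."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager
        self.status = {}     # task -> "PENDING" | "RUNNING" | "DONE"
        self.placement = {}  # running task -> node

    def submit(self, tasks):
        for task in tasks:
            self.status[task] = "PENDING"

    def step(self):
        """Complete running tasks, then start pending ones in granted containers."""
        for task, node in list(self.placement.items()):
            self.status[task] = "DONE"
            self.rm.release(node)
            del self.placement[task]
        pending = [t for t, s in self.status.items() if s == "PENDING"]
        for task, node in zip(pending, self.rm.allocate(len(pending))):
            self.status[task] = "RUNNING"
            self.placement[task] = node
        return any(s != "DONE" for s in self.status.values())
```

Because each application has its own master, the failure of one master affects only that application's tasks, while the resource manager keeps serving the rest of the cluster.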


 NameNode HA adds a standby NameNode to enable warm and hot
standby for both planned and unplanned downtime
NameNode HA: one Active NameNode, one Standby NameNode, four DataNodes

 Does not preclude the use of alternatives, but does raise the bar for
‘enterprise-level’ capabilities in Apache Hadoop
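
The active/standby idea behind NameNode HA can be illustrated with a small Python toy. It assumes every metadata change is mirrored to the standby immediately; the slides do not describe the actual synchronization mechanism, so the class and its methods are hypothetical and are not HDFS code.

```python
class NameNodePair:
    """Toy active/standby pair; an illustration, not the HDFS implementation."""

    def __init__(self):
        self.active = {"name": "nn-a", "namespace": {}}
        self.standby = {"name": "nn-b", "namespace": {}}

    def add_file(self, path, blocks):
        """Apply a metadata change on the active node and mirror it to the standby."""
        for node in (self.active, self.standby):
            node["namespace"][path] = list(blocks)

    def locate(self, path):
        """Clients always ask whichever node is currently active."""
        return self.active["namespace"][path]

    def failover(self):
        """Promote the standby; the former active becomes the new standby."""
        self.active, self.standby = self.standby, self.active
        return self.active["name"]
```

Because the standby already holds the same namespace, lookups keep working after `failover()`, whether the switch is planned maintenance or an unplanned outage.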


A cautionary tale?


Survey details:
http://bit.ly/451data

matthew.aslett@the451group.com www.twitter.com/maslett

