What is the Point of Hadoop

Matthew Aslett
Research Director, 451 Research

© 2013 by The 451 Group. All rights reserved

 Matthew Aslett
• Research Director, Data Management and Analytics
• matthew.aslett@451research.com
• www.twitter.com/maslett
• Responsible for data management and analytics research agenda
• Focus on operational and analytic databases, including NoSQL, NewSQL, and Hadoop
• With 451 Research since 2007


Unique combination of research, analysis & data
Emerging tech market segment focus
Daily qualitative & quantitative insight
Analyst advisory & Go-to-market support
Global events


Company Overview

• One company with 3 operating divisions
• Syndicated research, advisory, professional services, datacenter certification, and events
• Global focus
• 200+ staff
• 1,300+ client organizations: enterprises, vendors, service providers, and investment firms
• Organic growth and growth through acquisition



Hadoop’s greatest asset is its
flexibility: it can be used for
multiple roles and use-cases

But that is also a challenge,
and can lead to confusion
and disillusionment

Each user and vendor has
their own perspective on
Hadoop’s role


The Blind Men and the Elephant

“It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.”

John Godfrey Saxe (1872)


The Blind Men and the Elephant

“After Hadoop finishes
filtering the data, the place
you want to put that data
is in Oracle Database.”

Larry Ellison (2011)


Oracle Big Data Appliance
[Diagram labels: Apache Hadoop; NoSQL Database; Oracle Tools; Oracle Database; Data Integrator for Oracle Database; Data Loader; R distribution]

Roles highlighted: Big data processing/integration, Big data analytics



Big data storage
Big data processing/integration
Big data analytics


Big Data
“Big data”: the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored because of the limitations of traditional data management technologies, described by the three Vs:

Volume
The volume of data is too large for traditional database software tools to cope with

Velocity
The data is being produced at a rate that is beyond the performance limits of traditional systems

Variety
The data lacks the structure to make it suitable for storage and analysis in traditional databases and data warehouses


Total Data
The adoption of non-traditional data processing technologies is also driven by the user’s particular data processing requirements.
• Inspired by ‘Total Football’, a new approach to soccer that emerged in Amsterdam in the late 1960s
• Total Data is making the most efficient use of existing and new data management resources to deliver value
• Not another name for Big Data: if your data is big, the way you manage it should be total


Big Data and Total Data

Big Data:
The growing volume, velocity and variety of data

Big Data Technologies:
New technologies being adopted to store and process that data

Total Data:
The user trends driving the adoption of Big Data Technologies to store and process Big Data, and to manage it alongside existing data management technologies.

[Diagram labels: Big Data, Big Data Technology, Total Data, DQ, Volume, Predictive analytics]


Total Data

Totality
The desire to process
and analyze data in
its entirety, rather
than analyzing a
sample of data and
extrapolating the
results.


Totality

Big data storage
Big data processing/integration

• Prior to adopting Hadoop, the company only had transactional and summarized non-transactional data stored in its EDW
• The vast majority of its log data was discarded as not valuable enough to be efficiently processed in an enterprise data warehouse
• It now uses Hadoop to process hundreds of GBs of log data produced by the millions of searches and transactions performed on its site each day
• It creates data exports to R, and aggregates data to its existing data warehouse for analysis
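
The pattern in this case study (process every log line, then pass a small aggregate on to the warehouse) can be sketched in a few lines. This is only an illustration in plain Python, not Hadoop code and not the company's pipeline: the log format, the sample lines and the function names `map_line`, `reduce_counts` and `summarize` are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical log format: "<date> <event_type> <detail>"
SAMPLE_LOG = [
    "2013-02-01 search hotels+chicago",
    "2013-02-01 search flights+nyc",
    "2013-02-01 booking hotel-123",
    "2013-02-02 search hotels+chicago",
    "malformed line",
]

Key = Tuple[str, str]


def map_line(line: str) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((date, event_type), 1) for each well-formed line."""
    parts = line.split(" ", 2)
    if len(parts) == 3:
        date, event_type, _detail = parts
        yield (date, event_type), 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Counter:
    """Reduce step: sum the emitted values per key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def summarize(lines: Iterable[str]) -> Counter:
    """Process every line (no sampling) and return a compact daily summary."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


# The summary is small enough to load into an existing warehouse table.
daily_summary = summarize(SAMPLE_LOG)
for (date, event_type), count in sorted(daily_summary.items()):
    print(date, event_type, count)
```

The point of the sketch is the shape of the flow: the full raw data is scanned, malformed lines are skipped rather than blocking the job, and only the aggregate moves on to the warehouse.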


Total Data

Totality
The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.

Exploration
The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.


Exploration
Traditional data warehouses: schema on write

Application → Schema → RDBMS → SQL

Hadoop: schema on read

Application → Hadoop → Schema → MapReduce
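
The difference between the two flows can be sketched as follows. This is a minimal illustration, assuming SQLite as a stand-in for the RDBMS and a list of JSON strings as a stand-in for raw files in Hadoop; the table, the field names and the helper `read_with_schema` are hypothetical.

```python
import json
import sqlite3

# Schema on write: the structure is fixed before any data is stored,
# and rows that do not fit are rejected at load time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings ("
    "customer TEXT NOT NULL, city TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.execute("INSERT INTO bookings VALUES (?, ?, ?)", ("alice", "Chicago", 120.0))
try:
    conn.execute("INSERT INTO bookings VALUES (?, ?, ?)", ("carol", None, 50.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the row violates the schema, so it never lands
write_total = conn.execute(
    "SELECT SUM(amount) FROM bookings WHERE city = 'Chicago'"
).fetchone()[0]

# Schema on read: raw records are stored as-is, whatever their shape.
RAW_RECORDS = [
    '{"customer": "alice", "city": "Chicago", "amount": 120.0}',
    '{"customer": "bob", "city": "Chicago", "amount": 80.0, "device": "mobile"}',
    '{"customer": "carol", "geo": {"city": "Boston"}}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema chosen by the query: keep only the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two different questions, two different schemas, same stored data.
chicago_total = sum(
    row["amount"] or 0
    for row in read_with_schema(RAW_RECORDS, ("city", "amount"))
    if row["city"] == "Chicago"
)
by_device = [
    row
    for row in read_with_schema(RAW_RECORDS, ("customer", "device"))
    if row["device"]
]
```

In the schema-on-read half, a field that was never planned for (`device`) can still be queried later, at the cost of handling missing values at query time.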


Exploration

Big data storage
Big data processing/integration
Big data analytics

• The company wanted to perform analysis on customer data in order to create geo-targeted advertising
• The required data was already present in its data warehouse but was modeled in a way that would not allow Orbitz to efficiently process the query
• Extracting the data into Hadoop enabled the company to query it in a way the data warehouse was never designed for


Hadoop adoption process

Big data storage
• Google File System research paper, published 2003

Big data processing/integration
• Google MapReduce research paper, published 2004

Big data analytics
• Google Dremel research paper, published 2010
• Google Tenzing research paper, published 2011

[Figure: diffusion of innovation curve, annotated with Storage, Processing and Analytics; segments shown: Innovators, Early Adopters]
Image source: http://en.wikipedia.org/wiki/File:DiffusionOfInnovation.png
Licensed under the Creative Commons Attribution 2.5 License.


Crossing the Chasm
• Hadoop as (just) a low-cost storage option is not fulfilling its potential
• Processing and integration is not the complete picture
• Hadoop-based analytics unlocks the value of previously ignored data
• Attempting to fast forward to analytics, missing out the processing/integration stage, creates silos and will result in disillusionment

[Figure: adoption curve (Innovators, Early Adopters, Early Majority, Late Majority, Laggards), annotated with Storage, Processing and Analytics]


Total Data

Totality
The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.

Exploration
The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.

Frequency
The desire to increase the rate of analysis in order to generate more accurate and timely business intelligence.


Frequency

• Formerly AT&T Advertising Solutions and AT&T Interactive
• Faced with increasing volume of traffic through distribution network
• Wanted to provide intra-day reporting, but faced days of report-lag due to loading multiple databases
• Moved data processing to Hadoop, enabling the creation of a single common data layer for all applications
• Report-lag reduced to hours, rather than days
• New insight enabled by more frequent analysis and being able to process all the data


Total Data

Totality
The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.

Exploration
The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.

Frequency
The desire to increase the rate of analysis in order to generate more accurate and timely business intelligence.

Dependency
The reliance on existing technologies and skills, and the need to balance investment in those existing technologies and skills with the adoption of new techniques.


SQL meets Hadoop

SQL on Hadoop
• Hive
  • Project Stinger
  • Apache Tez (proposed)
• Impala
  • Cloudera Enterprise RTQ
• Apache Drill
  • (incubating)
• Phoenix project
  • For HBase
• Lingual
  • For Cascading and Hadoop

RDBMS and Hadoop co-processing
• Hadapt Adaptive Analytic Platform
• Teradata Aster SQL-H
• Rainstor Big Data Analytics on Hadoop
• EMC Greenplum HAWQ
• Microsoft PolyBase
• Citus Data CitusDB
• IBM Big SQL

Operational SQL on Hadoop
• Drawn to Scale
  • Spire
• Splice Machine
  • Splice SQL Engine


Crossing the Chasm
• Project maturity
• Vendor ecosystem
• Mainstream interest
• Geographic adoption

[Figure: adoption curve (Innovators, Early Adopters, Early Majority, Late Majority, Laggards), annotated with Storage, Processing and Analytics]


Project maturity

Feb 2006 Dec 2012


Vendor ecosystem

Contributors by lines of code, by current employer

Hadoop core (70+ different companies, 200+ individuals)
Hortonworks   37%
The rest      29%
Cloudera      15%
Yahoo!        12%
Facebook       7%

All Hadoop projects (120+ different companies, 750+ individuals)
Hortonworks   27%
The rest      31%
Yahoo         16%
Cloudera      15%
Facebook      11%


Vendor ecosystem

Contributors by lines of code, by current employer and contributor type (all Hadoop projects)

Hadoop vendors        51%
Users                 38%
Other vendors          6%
Unknown/individuals    4%
Academia               1%


Mainstream interest

Source: Indeed.com February, 2013


Largest employers of Hadoop skills
[Bar chart: % of total LinkedIn profiles mentioning Hadoop, by current employer, on a scale of 0.0 to 3.5. Employers listed: Yahoo, Microsoft, Google, eBay, Amazon, IBM, LinkedIn, Oracle, EMC, Cisco, Cloudera]
Source: LinkedIn, August 2012


Largest employers of Hadoop skills
[Bar chart: % of total LinkedIn profiles mentioning Hadoop, by current employer, on a scale of 0.0 to 3.5. Employers listed: Yahoo, Microsoft, Google, Amazon, IBM, eBay, Oracle, LinkedIn, Tata, HP, Cisco]
Source: LinkedIn, February 2013


Geographic adoption
LinkedIn search result, share by location

Location    December 2011   August 2012   February 2013
Bay area    28.2%           24.9%         22.9%
India        9.7%           11.2%         13.5%
NYC          4.8%            4.7%          4.6%
Seattle      3.7%            3.9%          3.9%
China        3.6%            4.4%          4.8%
DC           3.5%            3.1%          3.1%
LA           3.0%            2.8%          2.7%
UK           3.0%            3.4%          3.4%


Geographic adoption
USA versus rest of world (ROW)

            December 2011   August 2012   February 2013
Total       9,079           22,178        38,049
USA         64.4%           60.4%         58.3%
ROW         35.6%           39.6%         41.7%
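
The totals and percentages on this slide can be turned into approximate absolute counts. The sketch below assumes the larger share in each period (64.4%, 60.4%, 58.3%) is the USA segment and the remainder is ROW, which is a reading of the chart rather than something the slide states in text; the names `SNAPSHOTS` and `split_counts` are illustrative.

```python
# (period, total, assumed USA share in percent)
SNAPSHOTS = [
    ("December 2011", 9079, 64.4),
    ("August 2012", 22178, 60.4),
    ("February 2013", 38049, 58.3),
]


def split_counts(total: int, usa_share_pct: float) -> tuple:
    """Return (usa, row) counts; ROW is the remainder so the parts sum to total."""
    usa = round(total * usa_share_pct / 100)
    return usa, total - usa


rows = [(label, *split_counts(total, pct)) for label, total, pct in SNAPSHOTS]
for label, usa, row in rows:
    print(f"{label}: USA ~{usa:,}, ROW ~{row:,}")

# Growth multiples between the first and last snapshot.
usa_growth = rows[-1][1] / rows[0][1]
row_growth = rows[-1][2] / rows[0][2]
print(f"USA grew {usa_growth:.1f}x, ROW grew {row_growth:.1f}x")
```

Under that assumption both groups grow in absolute terms, and the falling USA percentage reflects faster growth elsewhere rather than a decline.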


Conclusions
• Hadoop’s greatest asset is its flexibility, but that is also a challenge, and can lead to confusion and disillusionment among later adopters

• Hadoop is enabling greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies

• Moving from storage, through processing, to analysis of data is a progression that has enabled early adopters to understand Hadoop’s role in the wider landscape

• Attempting to fast forward to analytics, missing out the processing/integration stage, creates silos and will result in disillusionment

• The Hadoop ecosystem is vibrant, with strength in both depth and breadth

• Growing mainstream interest and geographic adoption mean Hadoop is well-positioned to cross the chasm into mainstream adoption


Questions? Comments?

