Hadoop Turns a Corner and Sees the Future

© 2013 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. or its affiliates. This publication may not be reproduced or distributed in any form without Gartner's prior written
permission. If you are authorized to access this publication, your use of it is subject to the Usage Guidelines for Gartner Services posted on gartner.com. The information contained in this publication has been obtained
from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information and shall have no liability for errors, omissions or inadequacies in such information.
This publication consists of the opinions of Gartner's research organization and should not be construed as statements of fact. The opinions expressed herein are subject to change without notice. Although Gartner
research may include a discussion of related legal issues, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner is a public company, and its shareholders
may include firms and funds that have financial interests in entities covered in Gartner research. Gartner's Board of Directors may include senior managers of these firms or funds. Gartner research is produced
independently by its research organization without input or influence from these firms, funds or their managers. For further information on the independence and integrity of Gartner research, see "Guiding Principles on
Independence and Objectivity."
Merv Adrian
Research Vice President, Information Management
Twitter: @merv
Blogs.gartner.com/merv-adrian
Hadoop — Entering Phase Two?

Nexus of Forces Drives Innovation
• Extreme Networking
• Pervasive Access
• Global-Class Delivery
• "Big," Rich Context

• Cameras and microphones widely deployed
• New routes to market via intelligent objects
• Content and services via connected products
• Everything has a URL
• Remote sensing of objects and environment
• Augmented reality
• Situational decision support
• Building and infrastructure management
• Over 50% of Internet connections are things:
- 2011: 15+ billion permanent, 50+ billion intermittent
- 2020: 30+ billion permanent, >200 billion intermittent
• Audio, GPRS, Wi-Fi, NFC, higher-resolution display, LTE, Flash

Gartner Definition of Big Data: High-volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Gartner Research Circle 2013 Big Data Survey
• 687 respondents worldwide
• $3.2B mean company size
• 5,100 mean employees
• 60% mainstream adopters
• 18% focused on running/maintaining

Are They Investing?
• 30%: Have
• 19%: Plan to within the next year
• 15%: Plan to within two years
• 31%: No plans at this time
• 5%: Don't know
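
As a quick check on the survey figures above, the five response shares can be tallied to confirm that they account for all respondents, and to see what share has invested or plans to within two years. This is a minimal sketch: the dictionary keys and variable names are illustrative, and the head counts are estimates obtained by applying the rounded percentages to the 687 respondents, not figures reported in the survey.

```python
# Response shares (%) as shown on the "Are They Investing?" slide.
RESPONDENTS = 687  # 2013 survey base, from the slide

shares = {
    "have_invested": 30,
    "plan_within_one_year": 19,
    "plan_within_two_years": 15,
    "no_plans": 31,
    "dont_know": 5,
}

# The shares should account for every respondent.
total = sum(shares.values())

# Share that has already invested or plans to within two years.
invested_or_planning = (
    shares["have_invested"]
    + shares["plan_within_one_year"]
    + shares["plan_within_two_years"]
)

# Assumption: estimate head counts by applying the rounded
# percentages to the survey base (these are not reported figures).
approx_counts = {name: round(RESPONDENTS * pct / 100) for name, pct in shares.items()}

print(f"Total share: {total}%")
print(f"Invested or planning within two years: {invested_or_planning}%")
for name, count in approx_counts.items():
    print(f"{name}: ~{count} respondents")
```

Under these assumptions, roughly two-thirds of respondents have invested or plan to do so within two years.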

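How Does That Compare to Last Year?
Note: Survey base increased from 473 in 2012 to 687 in 2013
[Chart: percentage of respondents by investment status, 2012 vs. 2013]
• Have invested: 27% (2012), 30% (2013)
• Within next year: 15% (2012), 19% (2013)
• Within two years: 16% (2012), 15% (2013)
• No plans: 31% (2013)
• Don't know: 11% (2012), 5% (2013)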

Things Are Done Differently in Silicon Valley …
Traditional IM
• Requirements based
• Top-down design
• Integration and reuse
• Technology consolidation
• World of DW and ECM
• Competence centers
• Better decisions
• Commercial software
"Big Data" Style
• Opportunity oriented
• Bottom-up experimentation
• Immediate use
• Tool proliferation
• "World of Hadoop"
• Hackathons
• Better business
• Open source

Introducing: The Open-Source Car!

Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network.
The standards steward, the Apache Software Foundation, manages and distributes many typical components of the "Hadoop" platform.
Many distributions exist. They are built and/or marketed by pure-play specialists or major vendors, and they include additional open-source and commercial components.

Clients Ask: Which Projects Are "Hadoop"?
• Minimum set (from Apache website):
- Apache HDFS
- Apache MapReduce
- Apache YARN
• Other independent Apache projects:
Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout,
Pig, ZooKeeper
- The virtuous circle of open-source community
• Apache Hadoop is at version 1.0. Version 2.0, which includes YARN, is in alpha.

Rich, Complex Set of Functional Choices
Ingest/Propagate
Persist
Describe, Develop
Monitor, Administer
Analytics, Machine Learning
Compute, Search

Ingest/Propagate
Apache Flume, Apache Kafka, Apache Sqoop, HDFS NFS, Informatica HParser, DBMS vendor utilities, Talend, WebHDFS
Import data into HDFS (or alternatives)
• Commercial DBMS, DI or OSS
• "Big data" ≠ Hadoop: import is not always required
- MapReduce inside DBMSs, HPCC, SAS, Splunk, others
Export data into RDBMS (or alternatives)
• NoSQL DBMS supported, or offer integration
• On same cluster (HBase), even same nodes (Hadapt)

Describe, Develop
Apache Crunch, Apache Hive, Apache Pig, Apache Tika, Cascading, Cloudera Hue, DataFu, Dataguise, IBM Jaql
• Also included here: "intercept-based" data remediation
• "Develop" refers to coding functions, as in Pig, for execution elsewhere, such as in MapReduce
• Metadata (Hive, HCatalog) describes data for other stack components and for external ones, e.g., DI and BI tools

Compute, Search
Apache Blur, Apache Drill, Apache Giraph, Apache Hama, Apache Lucene, Apache MapReduce, Apache Solr, Cloudera Impala, HP HAVEn, IBM Big SQL, IBM InfoSphere Streams, HStreaming, Pivotal HAWQ, SQLstream, Storm, Teradata SQL-H
• Runtime execution for programs created to run against HDFS or HBase data
• With YARN support in Apache Hadoop 2.0, MapReduce will begin to lose its exclusivity in "the basic stack"
• MapReduce was first, but others have emerged as additions, alternatives or supplements
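
For readers unfamiliar with the MapReduce style of computation named on this slide, the classic word-count example can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce steps only. It does not use the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster the map and reduce steps run in parallel on many nodes and the framework performs the shuffle. The sketch only shows the shape of the programming model.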

Persist
File system: Apache HDFS, IBM GPFS, Lustre, MapR Data Platform
Serialization: Apache Avro, RCFile (and ORCFile), SequenceFile, Text, Trevni
DBMS: Apache Accumulo, Apache Cassandra, Apache HBase, Google Dremel, Hadapt, HP Vertica, IBM DB2, Kognitio, Oracle, Oracle MySQL, RainStor, Teradata Aster, Teradata, others
• File system: Append only, access methods at OS level
• Database: Collected and structured to facilitate storage, retrieval, modification and deletion in online, not only batch, mode
• Serialized: Format that can be stored in a database, eliminating byte-ordering concerns and adding metadata
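
To illustrate what the "Serialized" note means by handling byte ordering and adding metadata, here is a minimal sketch using only the Python standard library. It is not Avro or any other format listed above. The layout (a length-prefixed JSON schema followed by a little-endian payload) and the function names are assumptions made for illustration.

```python
import json
import struct


def serialize(record):
    """Pack an (id, value) record with a fixed byte order and a metadata header."""
    # Metadata: a small schema describing the payload fields.
    schema = json.dumps({"fields": ["id:int32", "value:float64"]}).encode("utf-8")
    # "<" fixes little-endian order with no padding, so the bytes mean
    # the same thing on every machine that reads them.
    payload = struct.pack("<id", record["id"], record["value"])
    # Layout: 4-byte schema length, then the schema, then the payload.
    return struct.pack("<I", len(schema)) + schema + payload


def deserialize(blob):
    """Read the schema header, then unpack the payload using the same byte order."""
    (schema_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + schema_len].decode("utf-8"))
    rec_id, value = struct.unpack_from("<id", blob, 4 + schema_len)
    return schema, {"id": rec_id, "value": value}


schema, record = deserialize(serialize({"id": 7, "value": 2.5}))
print(schema["fields"], record)
```

Because the byte order and schema travel with the data, a reader on a different platform can decode the record without knowing how the writer's hardware lays out integers.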

Monitor, Administer
Apache Ambari, Apache Chukwa, Apache Falcon, Apache Oozie, Apache Whirr, Apache ZooKeeper, Cloudera Manager, Ganglia, Nagios, Pivotal Serengeti
• System health and administration
• Cloud configuration and connection to resources
• Virtualization and resource management
• Job management and orchestration

Analytics, Machine Learning
Apache Drill, Apache Hive, Apache Mahout, Datameer, IBM BigSheets, IBM Big SQL, Karmasphere, Microsoft Excel, Platfora, Revolution Analytics RHadoop, SAS, Skytree
• This is where the future is: it's not just "a part of the stack" but why the stack exists
• Machine learning, advanced statistical analysis, scenario modeling
• "BI for Hadoop": Statistical libraries for use in programs, spreadsheets, reporting, visualization tools

Go Ahead — Pick the Pieces You Need
Ingest/Propagate
Persist
Describe, Develop
Monitor, Administer
Compute, Search

Distribution Vendors Sort It Out for You
• Megavendors: Amazon, EMC Pivotal, IBM, Intel
• Megapartners: Dell, HP, NetApp, Microsoft, Oracle, Teradata
• Leading pure plays: Cloudera, Hortonworks, MapR
• Others: DataStax, LucidWorks, RainStor, Sqrrl, WANdisco, Zettaset

Hadoop's Great Leap Forward
Hadoop has moved to the next stage with Apache Hadoop 2.0.
• Mainstream vendors are all interested, contributing and adding value
• Skills development is ramping up rapidly
From → To
• Single-stack → YARN-based multistyle environment, supporting multiple engines
• Batch-only, file-based stack → Interactive capabilities with multiple optional databases
• SQL translation with Hive → "SQL in front of Hadoop": Cloudera Impala, IBM Big SQL, Pivotal HAWQ, Platfora, others
• Relatively unmanaged → Ambari-based beginnings of real management

What's Next?
• Search
• Advanced prebuilt analytic functions
• Cluster, appliance or cloud?
• Virtualization
• Graph processing
What's Still Needed?
Security
Data Warehousing Tools
Governance
Distributed Optimization
Subproject Optimization Skills

By 2015, big data demand will reach
4.4 million jobs worldwide,
but only one-third of those jobs will be filled.
[Chart: big data jobs by region (Americas, EMEA, APJ) and industry, axis range 0 to 2,500,000. Industries: Education; Wholesale Trade; Healthcare Providers; Transportation; Utilities; Retail; Insurance; Communications, Media & Services; Government; Banking & Securities; Manufacturing & Natural Resources]
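
The headline jobs projection implies a sizeable shortfall, which a short calculation makes concrete. This is a rough sketch that treats "one-third" as exactly 1/3, so the results are approximations rather than Gartner figures. The variable names are illustrative.

```python
# Figures from the slide: projected 2015 demand and the share expected to be filled.
DEMAND_JOBS = 4_400_000
FILL_RATE = 1 / 3  # assumption: "one-third" taken as exactly 1/3

filled = round(DEMAND_JOBS * FILL_RATE)
unfilled = DEMAND_JOBS - filled

print(f"Filled:   ~{filled:,}")    # ~1,466,667
print(f"Unfilled: ~{unfilled:,}")  # ~2,933,333
```

On these assumptions, close to three million of the projected positions would go unfilled.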

Recommendations
• Audit your data: find "dark data" and map it to business opportunities to identify pilot projects
• Familiarize yourself with the capabilities of available Hadoop distributions
• Build skills and recruit within the organization from early experimenters for a data science lab
• Consider cloud pilots to minimize capital expenditure

Thank you!
http://www.flickr.com/photos/orinrobertjohn/3267286885/sizes/o/in/photostream/
