6. MapReduce v1 Limitations
Scalability
Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000
Availability
JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce
Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services
Only MapReduce batch jobs, nothing else
7. HADOOP 1.0
Apache Hadoop 1.0: Single Use System (Batch Apps)
Diagram: Pig and Hive on top of MapReduce (cluster resource management and data processing), on top of HDFS (redundant, reliable storage)
10. YARN: Taking Hadoop Beyond Batch
Store DATA in one place
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
Diagram: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (DataTorrent) and GRAPH (Giraph) applications on top of YARN (cluster resource management), on top of HDFS2 (redundant, reliable storage)
11. Running all on the same Hadoop cluster to give applications access to all the same source data!
YARN: Applications
MapReduce v2, Stream Processing (Apache Storm), Master-Worker, Online, In-Memory
15. 6 Benefits of YARN
Scale
New programming models and services
Improved cluster utilization
Agility
Backwards compatible with MapReduce v1
Mixed workloads on the same source of data
17. Stinger: Interactive Query for Hive
Speed
Deliver interactive query through 100x performance increases as compared to Hive 10.
SQL
Support the broadest array of SQL semantics for analytic applications running against Hadoop.
Scale
The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.
18. HOYA: HBase (NoSQL) on YARN
Dynamic Scaling
On-demand cluster size. Increase and decrease the size with load.
Easier Deployment
APIs to create, start, stop and delete HBase clusters.
Availability
Recover from Region Server loss with a new container.
19. Microsoft REEF (Retainable Evaluator Execution Framework)
Machine Learning
Framework well suited for building machine learning jobs.
Scalable / Fault Tolerant
Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain State
Users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.
26. YARN: How It Works
Diagram: a Client submits an application to the ResourceManager (which hosts the Scheduler); an ApplicationMaster runs on one NodeManager and negotiates Containers that run on the other NodeManagers.
27. YARN: Example App Deployment
Diagram: the HOYA Client submits to the ResourceManager (Scheduler); the HOYA / HBase Master runs on one NodeManager, and Region Servers run in containers on the other NodeManagers.
28. Storm Vs. DataTorrent
Solution Matrix          DataTorrent   Apache Storm
Atomic Micro-batch       Yes           No
Events per Second        Billions      Thousands
Automated Parallelism    Yes           No
Dynamic Runtime Changes  Yes           No
Linear Scalability       Yes           No
State Checkpointing      Yes           No
31. YARN: Key Take-Aways
Backwards Compatible
YARN is Backwards Compatible for your existing MapReduce applications. You can get value from it right away.
Resource Management
YARN enables Fine Grained Resource Management for better cluster utilization.
One Source of Data
YARN allows you to interact with One Source of Data in multiple ways while maintaining Predictable Performance and Quality of Service.
Enabling Smart People
YARN is a flexible framework that gives smart people and companies the ability to do amazing things with data.
YARN will be the de-facto distributed operating system for Big Data.
32. Storm Vs. DataTorrent - Detailed
Solution Matrix                    DataTorrent         Apache Storm
Proprietary / Open Source          O                   O
Support for Hadoop 1.x             Yes                 Yes
Support for Hadoop 2.x             Yes                 Yes
Native YARN                        Yes                 No
Dashboard                          Yes                 No
Extensible via Modules             Yes                 Yes
Technical Support                  Yes                 Yes
Atomic Micro-batch                 Yes                 No
Events per Second                  Billions            Thousands
Automated Parallelism              Yes                 No
Dynamic Runtime Changes            Yes                 No
High Availability                  Yes                 Partial
Prog. Languages Supported          Java, Python, etc.  Java, Python, etc.
Log Analysis                       Yes                 No
Site Operations                    Yes                 No
MapReduce Diagnostics              Yes                 No
Open Source Operators Library      Yes                 Partial
Open Source Application Templates  Yes                 No
Complex Computations (DAG)         Yes                 No
Linear Scalability                 Yes                 No
Security                           Yes                 No
CLI and Macros                     Yes                 No
Configuration Based Specification  Yes                 No
State Checkpointing                Yes                 No
33. The 1st Generation Of Hadoop
Users forced to create data system silos for managing mixed workloads
Developers forced to abuse very specific MapReduce to fit their use cases
Diagram: separate silos, e.g. a Hadoop cluster and an HBase cluster
37. YARN: Why The De-Facto Distributed OS
Technology Adoption
100,000+ nodes - 400,000 jobs - 10M compute hours daily
Enables Innovation
Smart people and companies doing amazing things with data
Financial Backing
$568M+ invested in Hadoop contributing companies, nearly $400M in 2013 alone
39. HDFS Write Data Flow
Diagram (steps 1-7): the Client contacts the NameNode for block placement, then streams Block Bytes to the first DataNode, which pipelines them to the second and third; Acks flow back through the pipeline and the write finishes with Block Write Complete. DataNodes A, B and C then report the new block to the NameNode.
Editor's Notes
“Hadoop” has become synonymous with “Big Data”. While Hadoop is a big part of the Big Data movement, Hadoop itself is just a platform and a set of tools.
The architecture of MapReduce v1 came with its limitations.
Scalability
Even as server specs rose to accommodate more load, it still couldn't scale past the maximum number of concurrent tasks.
Availability
JT failure kills all queued and running jobs
After restart they have to be resubmitted and start from the beginning
Unable to start from where they left off
Can be a huge problem if you have long running batch jobs
Resource partitioning
Resources were broken up into distinct map and reduce slots which aren't "fungible". I love that word. Basically means they weren't interchangeable.
Map slots might be "full" while reduce slots remain empty and vice-versa.
This needed to be addressed to ensure the entire system could be used at max capacity for high utilization.
Lacks support
You were stuck using MapReduce
In Hadoop 1.0, all methods of accessing the data within the cluster were constrained to using MapReduce. Open-source Hadoop projects like Pig and Hive are built on top of MapReduce, and even though they make MapReduce more accessible, they still suffer from its limitations.
You have seen some distributions move outside of the Hadoop ecosystem, like Cloudera's Impala, to get around the limitations of MapReduce and improve performance. But then, unfortunately, it isn't community supported and lags behind in features because it doesn't have the backing of the innovative open source committers.
The crazy thing is even with these limitations, 90% of the use cases Nick spoke about yesterday are based on this.
So, what has Trace3 found out on its journey through Big Data about YARN?
Well first of all, we discovered that it’s not the type of yarn that cats play with.
YARN will be the de-facto distributed operating system for Big Data and by the end of this hour you are going to see why we believe it is and why companies like Cloudera, Hortonworks and MapR are banking on this.
YARN is taking Hadoop beyond batch
YARN has solved the limitations of MapReduce v1
YARN gives you the ability to store all your data in one place and have mixed workloads working with that data and still getting predictable performance and QoS.
YARN is moving Hadoop beyond just MapReduce and Batch into Interactive, Online, Streaming, Graph, In-Memory, etc
Here are some of the apps that are making up that compute time
HBase will be deployed on YARN
Which we will talk about more a bit later
Master-Worker applications
MapReduce has been moved out to its own application framework
Real-Time Streaming Analytics
This in my opinion is the most promising of the application types. I don’t want to steal my associate Rikin’s thunder, but he will be speaking a lot more in-depth around Real-Time Streaming Analytics in a session later today.
Graph Processing
YARN has enabled the ability to use iterative applications like Apache Giraph within your cluster where previously MapReduce v1 just wasn’t a viable option.
YARN is fairly new to the scene.
But that shouldn’t deter you from being confident in it.
It was conceived and architected by Yahoo!
And has gone through a very quick maturing process due to the open source community putting it through its paces
Currently YARN is running on over 100,000 nodes
Responsible for 400,000+ jobs and 10 million+ hours of compute time daily
Yes, I said 10 ‘millllllion’
So, what’s changed with YARN for it to be able to accomplish this?
YARN splits up the two major functions of the JobTracker into the ResourceManager and ApplicationMaster
Global ResourceManager
handles all of the cluster resources
Scheduler
performs its scheduling function based on the resource requirements of the applications
Per-node slave NodeManager
Responsible for launching application containers
Monitoring their resource usage
And reporting the same to the ResourceManager
Per-application ApplicationMaster
Responsible for negotiating the appropriate resource containers from the scheduler
Tracking their status
Monitoring for progress
Per-application Container running on a NodeManager
Let’s see how these all work together
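To make that concrete, here is a minimal, hypothetical sketch of submitting an application through the Hadoop 2.x YARN client API. The application name and AM command are placeholders (com.example.MyAppMaster does not exist), and a real client would also stage the AM jar and set up its classpath and local resources.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    // Connect to the ResourceManager, the central authority for cluster resources.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the RM for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app"); // hypothetical name

    // Describe the container that will launch our ApplicationMaster.
    // The command is a placeholder for a real AM launch command.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("java com.example.MyAppMaster"));
    appContext.setAMContainerSpec(amContainer);

    // The resource lease ("container") for the AM: 512 MB of memory, 1 vcore.
    appContext.setResource(Resource.newInstance(512, 1));

    // Submission goes through admission control, then the Scheduler allocates
    // a container and spawns the AM on a node in the cluster.
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted " + appId);
  }
}
```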
Scale
YARN is no longer limited to the 40,000 concurrent tasks that MapReduce v1 had
Today YARN is already handling over 10 million hours of compute time on a daily basis
New Programming Models and Services
You aren’t limited to just MapReduce
If your app can benefit from a distributed operating system then you can utilize it
Improved Cluster Utilization
YARN no longer has a hard partition of resources into map and reduce slots; it utilizes resource leases, aka "Containers", that aren't limited in functionality.
Agility
By moving MapReduce out and on top of YARN, customers gain more agility to make changes, upgrade, and run different versions of their framework without affecting the entire cluster.
Backwards Compatible
What you are currently doing with Hadoop 1.x and MapReduce v1 will work with YARN; see the WordCount sketch just after these notes.
Mixed workloads on the same data source
You can utilize the "data lake" architecture and run all your apps while still having predictable performance and quality of service.
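To ground the backwards-compatibility point, here is the canonical WordCount job written against the same org.apache.hadoop.mapreduce API that Hadoop 1.x applications use. On a YARN cluster the identical code is submitted through the ResourceManager as an MRv2 application; nothing in the job needs to change.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // On YARN this submission is negotiated with the ResourceManager;
    // on Hadoop 1.x the same code talked to the JobTracker.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```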
One of the projects that I'm keeping a close eye on is the Stinger project
Speed
100x speed increase from Hive 10
SQL
Improve HiveQL to make it more ANSI SQL-like
Scale
Ability to run queries on Terabytes to Petabytes of information
Another project to watch closely is HBase on YARN
Dynamic Scaling
Scales with usage. As load increases or decreases, so does the cluster size.
Easier Deployment
HBase cluster deployment can be somewhat complicated; they are looking to correct that by allowing you to do it utilizing built-in APIs
Availability
When a "Region Server" is lost, recovering it is just a matter of deploying another container within the cluster.
This is a project that lies outside of my wheelhouse, but from what I've learned about it, it's going to do amazing things for Machine Learning.
I’ve also highlighted this project to add even more credibility to YARN by showing you a company like Microsoft is dedicating internal time and resources to build applications to run on YARN.
Previously a NameNode had one classification of storage media available to it.
Now NameNodes, as of 2.3, have the ability to split up the storage media available to them.
Adding awareness of storage media can allow HDFS to make better decisions about the placement of block data with input from applications.
An application can choose the distribution of replicas based on its performance and durability requirements.
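As a hedged illustration of an application choosing replica placement, here is a sketch using the HDFS storage-policy API that shipped in later Hadoop 2.x releases (2.6 and up), not in 2.3 itself; the paths are hypothetical and the policy names come from that later release.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicySketch {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at an HDFS NameNode.
    FileSystem fs = FileSystem.get(new Configuration());
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Hot data favors SSD for one replica; archival data can sit on
    // dense disks. The NameNode uses these hints when placing blocks.
    dfs.setStoragePolicy(new Path("/data/hot"), "ONE_SSD");
    dfs.setStoragePolicy(new Path("/data/archive"), "COLD");
  }
}
```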
NodeManager Restart allows for a restart of the NM without losing jobs .. They will continue where they left off after restart
Dynamic Resource Config - Currently containers are static .. They allocate a certain amount of proc / memory to each process. Now processes will have the ability to scale up within a container if resources are available within that NodeManager.
Jobs are submitted to the ResourceManager via a public submission protocol and go through an admission control phase during which security credentials are validated and various checks are performed.
The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources for various competing applications in the cluster. Because it has a central and global view of the cluster resources, it can enforce properties such as fairness, capacity, and locality across nodes.
Accepted jobs are passed to the scheduler to be run. Once the scheduler has enough resources, the application is moved from accepted to running state. This involves allocating a resource lease, aka a container (a bound JVM), for the AM and spawning it on a node in the cluster. A record of accepted applications is written to persistent storage and recovered in case of RM failure.
The ApplicationMaster is the “head” of a job, managing all lifecycle aspects including dynamically increasing and decreasing resources consumption, managing the flow of execution and handling faults.
By delegating all these functions to AMs, YARN’s architecture gains a great deal of scalability, programming model flexibility, and improved upgrading/testing since multiple versions of the same framework can coexist.
The RM interacts with a special system daemon running on each node called the NodeManager (NM). Communications between RM and NMs are heartbeat based for scalability. NMs are responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing). The RM assembles its global view from these snapshots of NM state.
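Here is a sketch of the ApplicationMaster side of that lifecycle using the Hadoop 2.x AMRMClient API. The container counts and sizes are arbitrary, and error handling plus the NMClient step that actually launches the granted containers are omitted.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
  public static void main(String[] args) throws Exception {
    // The AM registers itself with the ResourceManager first.
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();
    rm.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL omitted

    // Negotiate resource leases (containers) from the Scheduler:
    // three containers of 1 GB memory and 1 vcore each.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 3; i++) {
      rm.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Heartbeat-style allocate() calls report progress and pick up newly
    // granted containers, which a real AM would launch via NMClient.
    int granted = 0;
    while (granted < 3) {
      AllocateResponse response = rm.allocate(granted / 3.0f);
      granted += response.getAllocatedContainers().size();
      Thread.sleep(100);
    }

    // On completion the AM unregisters so the RM can reclaim its resources.
    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
  }
}
```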
The Stinger project is tackling the speed portion by utilizing Apache Tez
Tez sits at the layer between MapReduce, Pig and Hive to optimize the execution of these applications.
MapReduce v1 consisted of 2 daemons / processes.
The JobTracker is a master node responsible for managing the cluster resources (map and reduce slots) and job scheduling.
The TaskTracker is a per-node agent and manages the map and reduce tasks.
Backwards Compatible
Whatever you are doing with Hadoop 1.0 and MapReduce today will work with YARN.
Even if you don't need all the capabilities of YARN right now, don't hesitate to move to it; as new tools and applications become available on YARN, your company will be able to utilize them.
One Source Of Data
YARN allows you to have that data lake with all of your data applications running against it.
While still maintaining predictable performance and quality of service
Resource Management
YARN accomplishes this by how it manages resources for better cluster utilization, which translates to "more bang for your buck".
Enabling Smart People
YARN is an extremely flexible framework that is giving smart people and companies the ability to do amazing things with data.
All these benefits add up to “YARN will be the de-facto distributed operating system for Big Data”
We see the innovation in Big Data happening on YARN and
We want to help you make the right choice now to avoid the headaches and costs that come along with making the wrong choice.
Before we can understand fully what YARN is solving we need to review what it’s replacing.
Hadoop 1.0
The initial design of Hadoop was focused on running massive MapReduce jobs to process web crawl data.
Although it did end up evolving beyond its initial use case and helped solve the "data silo problem", it ended up creating a different issue, something called the "data system silo problem".
Users were forced into creating data system silos due to mixed workloads.
HBase Example
Developers were forced to abuse the very specific MapReduce programming model to try to accommodate their use cases.
One of the biggest costs of a Hadoop cluster is copying data between clusters to try to accommodate mixed workloads
PMCs are the people that give oversight for the project roadmap and provide guidance to the committers.
One thing to highlight may be that Hortonworks is a spin-off of Yahoo!
The committers are the ones who actually submit code to the project.
The question may arise of how I can state that YARN will be the de-facto distributed operating system of Big Data. Here are the arguments for my conclusion / prediction.
HDFS Write Data Flow
1 – 7
Connect to the NN to establish block placement
Writing to the DNs
Once 1 copy of the data is placed, the client gets an acknowledgement
The first DN copies the file to the second DN
The second DN copies to the third DN
The DNs acknowledge to the other DNs that the copy has been completed
A, B and C
Once the files are written to the DNs the information about the new block is sent in the block report / heartbeat.
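To tie the write flow back to code, here is a minimal sketch of the client side using the standard FileSystem API; the path is hypothetical, and the NameNode lookup, DataNode pipelining, and acks described above all happen behind the returned output stream.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // create() asks the NameNode where to place the block; the stream then
    // pipelines the bytes through the chosen DataNodes (steps 2-7 above).
    try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"), true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }
    // After close(), the DataNodes report the new block to the NameNode
    // in their block reports / heartbeats (A, B and C above).
  }
}
```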