6/19/2014
Prepared for:
Presented by:
“Big Data Joe” Rossi
@bigdatajoerossi
Hadoop
Past, Present and Future
Roadmap
~45mins
1- What Makes Up Hadoop 1.x?
2- What’s New In Hadoop 2.x?
3- The Future Of Hadoop …
What Makes Up Hadoop 1.x?
Hadoop 1.0: HDFS + MapReduce
NameNode
DataNode / TaskTracker DataNode / TaskTracker
DataNode / TaskTracker DataNode / TaskTracker
JobTracker
Client
1-1 1-2 1-3
Hadoop 1.0: HDFS + MapReduce
NameNode
DataNode / TaskTracker DataNode / TaskTracker
DataNode / TaskTracker DataNode / TaskTracker
JobTracker
Client
1-1 1-2
1-3
Reduce Map
2-1 3-2 3-3 4-1
2-3 4-2 2-2 3-1 4-3
Reduce Map
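The Map and Reduce phases shown in the diagram above can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count workload are assumptions made for the example, and everything runs in one process rather than across TaskTrackers.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(blocks):
    """Run map over every block, shuffle, then reduce each key."""
    mapped = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Two hypothetical input blocks, as if stored on two DataNodes.
counts = run_job(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster each block would be mapped on the node that stores it, and the shuffle would move data over the network between Map and Reduce tasks.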
MapReduce v1 Limitations
Scalability
Maximum cluster size is about 4,000 nodes, with a maximum of about 40,000 concurrent tasks
Availability
JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce
Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services
Only MapReduce batch jobs, nothing else
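The utilization problem caused by hard slot partitioning can be shown with a small calculation. The sketch below is a toy model with made-up numbers (a node pool with 6 map slots and 4 reduce slots, and a job that currently has only map tasks waiting); the function names are hypothetical and no real scheduler works this simply.

```python
def slots_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1-style: map tasks may only use map slots, reduce tasks only reduce slots."""
    running = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return running / (map_slots + reduce_slots)


def pooled_utilization(containers, map_tasks, reduce_tasks):
    """Pooled model: any free container can run any kind of task."""
    running = min(containers, map_tasks + reduce_tasks)
    return running / containers


# Assumed workload: 10 map tasks waiting, no reduce tasks yet.
mrv1 = slots_utilization(map_slots=6, reduce_slots=4, map_tasks=10, reduce_tasks=0)
pooled = pooled_utilization(containers=10, map_tasks=10, reduce_tasks=0)
print(mrv1, pooled)
```

With the fixed split, the four reduce slots sit idle while map tasks queue; with a shared pool of the same size, all ten tasks run at once.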
HADOOP 1.0
Single Use System
Batch Apps
Apache Hadoop 1.0: Single Use System
HDFS
(redundant, reliable storage)
MapReduce
(cluster resource management and data
processing)
Pig Hive
What’s New In Hadoop 2.x?
YARN Replaces
MapReduce
Yet Another Resource Negotiator
YARN
YARN will be the de-facto distributed
operating system for Big Data
Store DATA in one place
YARN: Taking Hadoop Beyond Batch
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
HDFS2
(redundant, reliable storage)
YARN
(cluster resource management)
BATCH
(MapReduce)
INTERACTIVE
(Tez)
ONLINE
(HBase)
STREAMING
(DataTorrent)
GRAPH
(Giraph)
Running all on the same Hadoop cluster to give
applications access to all the same source data!
YARN: Applications
MapReduce v2
Stream Processing
Master-Worker
Online
In-Memory
Apache Storm
2010
2011
2012
2013
2014
Today
YARN: Moving Quickly
Conceived at Yahoo!
Alpha Releases – 2.0
Beta Releases – 2.1
GA Released – 2.2
100,000+ nodes, 400,000+ jobs daily
10 million+ hours of compute daily
Version 2.3
Version 2.4
YARN: Dr. Evil Approved
YARN: What Has Changed?
YARN MRv1
RM: ResourceManager
AM: ApplicationMaster
JT: JobTracker
Scheduler Scheduler
NM: NodeManager
TT: TaskTracker
Container
Map
Reduce
ResourceManager
Scheduler
JobTracker
Scheduler
NodeManager
ApplicationMaster
TaskTracker
Map Reduce
NodeManager
Container Container
TaskTracker
Map Reduce
Scale
New programming models
and services
Improved cluster utilization
Agility
Backwards compatible with
MapReduce v1
Mixed workloads on the
same source of data
6 Benefits of YARN
The Future of Hadoop
Projects and Roadmap
Speed
Deliver interactive query through 100x
performance increases compared to Hive 0.10.
Stinger: Interactive Query for Hive
SQL
Support the broadest array of SQL semantics
for analytic applications running against
Hadoop.
Scale
The only SQL interface to Hadoop designed for
queries that scale from Terabytes to Petabytes.
Dynamic Scaling
On-demand cluster size. Increase and decrease
the size with load.
HOYA: HBase (NoSQL) on YARN
Easier Deployment
APIs to create, start, stop and delete HBase
clusters.
Availability
Recover from Region Server loss with a new
container.
Machine Learning
Framework well suited for building machine
learning jobs.
Microsoft REEF
Scalable / Fault Tolerant
Makes it easy to implement scalable, fault-
tolerant runtime environments for a range of
computational models.
Maintain State
Users can build jobs that utilize data from
where it’s needed and also maintain state after
jobs are done.
Retainable
Evaluator
Execution
Framework
Heterogeneous Storages in HDFS
NameNode
Storage
NameNode
SATA SSD
Fusion
IO
Apache Hadoop 2.4
ResourceManager HA / Auto Failover
HDFS Rolling Upgrades
Apache Hadoop 2.5
NodeManager Restart w/o disruption
Dynamic Resource Configuration
Hadoop Roadmap
RELEASED
EARLY
Q2 2014
MID
Q2 2014
I Know You Have
Questions …
No such thing as a stupid question.
Hadoop: Past, Present and Future
SD Big Data Meetup
One Last Thing …
meetup.com/sdbigdata
2nd Wednesday Of The Month
Next: July 9th @ 5:45 PM
Thank You!
Hadoop: Past, Present and Future
Big Data Joe Rossi
http://bigdatajoe.io/
@bigdatajoerossi
Supporting Slides
Slides with information that may be asked about
YARN: How It Works
ResourceManager
NodeManager
ApplicationMaster
NodeManager
NodeManager NodeManager
Scheduler
Container
Container Container
Client
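The flow in this diagram (a client submits an application, the ResourceManager's scheduler grants a container for the ApplicationMaster, and the ApplicationMaster then asks for further containers on the NodeManagers) can be sketched as a toy simulation. All class and method names below are illustrative stand-ins, not the YARN client API, and the first-fit, memory-only scheduler is an assumption. In real YARN the ApplicationMaster contacts the NodeManager to launch a granted container; here the ResourceManager does it directly to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, mb):
        """Reserve memory on this node and start a container for the app."""
        self.free_mb -= mb
        container = f"{app_id}@{self.name}#{len(self.containers)}"
        self.containers.append(container)
        return container


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mb):
        """Toy scheduler: the first node with enough free memory wins."""
        for node in self.nodes:
            if node.free_mb >= mb:
                return node.launch(app_id, mb)
        return None  # cluster is full


class ApplicationMaster:
    def __init__(self, app_id, rm):
        self.app_id, self.rm = app_id, rm

    def run(self, tasks, mb_per_task):
        """Negotiate one container per task from the ResourceManager."""
        return [self.rm.allocate(self.app_id, mb_per_task) for _ in range(tasks)]


rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 2048)])
am_container = rm.allocate("app1", 1024)  # client submission starts the AM
granted = ApplicationMaster("app1", rm).run(tasks=3, mb_per_task=1024)
print(am_container, granted)
```

Because the ApplicationMaster is per application, a framework other than MapReduce only needs to supply its own ApplicationMaster logic to run on the same cluster.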
YARN: Example App Deployment
ResourceManager
NodeManager
HOYA / HBase Master
NodeManager
NodeManager NodeManager
Scheduler
Region Server
Region Server Region Server
HOYA Client
Storm Vs. DataTorrent
Solution Matrix DataTorrent Apache Storm
Atomic Micro-batch 1 3
Events per Second Billions Thousands
Automated Parallelism 3
Dynamic Runtime Changes 3
Linear Scalability 3
State Checkpointing 3
Apache Spark + Shark
HDFS2
(redundant, reliable storage)
YARN
(cluster resource management)
Apache Spark
Shark
Hive
(sql)
Hadoop 2.x – YARN + HDFS
NameNode
DataNode / NodeManager DataNode / NodeManager
DataNode / NodeManager DataNode / NodeManager
Standby
NameNode /
ResourceManager
Container Container
Container Container
Container Container
Container Container
Backwards Compatible
YARN is Backwards Compatible for your
existing MapReduce applications. You
can get value from it right away.
YARN: Key Take-Aways
Resource Management
YARN enables Fine Grained Resource
Management for better cluster
utilization.
One Source of Data
YARN allows you to interact with One
Source of Data in multiple ways while
maintaining Predictable Performance
and Quality of Service.
Enabling Smart People
YARN is a flexible framework that is
enabling smart people and companies to
do amazing things with data.
YARN will be the de-facto distributed operating
system for Big Data
Storm Vs. DataTorrent - Detailed
Solution Matrix                       DataTorrent          Apache Storm
Proprietary / Open Source             O                    O
Support for Hadoop 1.x                1                    1
Support for Hadoop 2.x                1                    1
Native YARN                           1                    3
Dashboard                             1                    3
Extensible via Modules                1                    1
Technical Support                     1                    1
Atomic Micro-batch                    1                    3
Events per Second                     Billions             Thousands
Automated Parallelism                 1                    3
Dynamic Runtime Changes               1                    3
High Availability                     1                    2
Prog. Languages Supported             Java, Python, etc.   Java, Python, etc.
Log Analysis                          1                    3
Site Operations                       1                    3
MapReduce Diagnostics                 1                    3
Open Source Operators Library         1                    2
Open Source Application Templates     1                    3
Complex Computations (DAG)            1                    3
Linear Scalability                    1                    3
Security                              1                    3
CLI and Macros                        1                    3
Configuration Based Specification     1                    3
State Checkpointing                   1                    3
Users forced to
create data system
silos for managing
mixed workloads
Developers forced
to abuse very
specific
MapReduce to fit
their use cases
The 1st Generation Of Hadoop
Hadoop
HBase
Apache Spark
HDFS2
(redundant, reliable storage)
YARN
(cluster resource management)
Apache Spark
Shark
Hive
(sql)
Spark
Streaming
MLlib
(machine learning)
Project Mgt Committee Members
[Bar chart by company: Hortonworks, Others, Cloudera, Yahoo!, Facebook. Bar values as extracted: 7, 6, 3, 15, 11]
Project Committers
[Bar chart by company: Hortonworks, Others, Cloudera, Yahoo!, Facebook. Bar values as extracted: 24, 24, 11, 11, 5]
YARN: Why The De-Facto Distributed OS
Technology Adoption
100,000+ nodes, 400,000+ jobs and 10 million+ compute hours daily
Enables Innovation
Enables smart people and companies to do amazing things with data
Financial Backing
568m+ invested in Hadoop-contributing companies, nearly 400m in
2013 alone
Apache Storm Topology
Bolt
(Filter)
Spout
Stream
(Data Source)
Spout
Stream
(Data Source)
Bolt
(RDBMS Writes)
Bolt
(Calculation)
Bolt
(HDFS Writes)
RDBMS
HDFS
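A topology like the one above (spouts feeding a chain of bolts that filter, calculate and write to HDFS and an RDBMS) can be approximated with Python generators. This is a single-process sketch and does not use the Storm API. The wiring between bolts, the event fields (`id`, `value`) and the two list-based sinks are assumptions made for the example, since the slide's exact connections are not recoverable from the text.

```python
def spout(events):
    """Spout: the data source that emits a stream of events."""
    for event in events:
        yield event


def filter_bolt(stream, min_value):
    """Bolt (Filter): drop events below a threshold."""
    for event in stream:
        if event["value"] >= min_value:
            yield event


def calculation_bolt(stream):
    """Bolt (Calculation): attach a running total to each event."""
    total = 0
    for event in stream:
        total += event["value"]
        yield {"id": event["id"], "running_total": total}


def write_bolt(stream, sink):
    """Bolt (Writes): append each event to a sink and pass it downstream."""
    for event in stream:
        sink.append(event)
        yield event


rdbms, hdfs = [], []  # stand-ins for the real storage systems
source = spout([{"id": 1, "value": 5}, {"id": 2, "value": 1}, {"id": 3, "value": 7}])
filtered = write_bolt(filter_bolt(source, min_value=3), hdfs)
for _ in write_bolt(calculation_bolt(filtered), rdbms):
    pass  # drain the topology so every event flows through all bolts
print(hdfs, rdbms)
```

Each event is processed as it arrives rather than in a batch, which is the property that distinguishes this model from a MapReduce job.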
HDFS Write Data Flow
NameNode
Client
DataNode DataNode DataNode
1
2
4 5
6 7
3
Block Bytes
Block Bytes Block Bytes
Block Write Complete
Ack Ack
Ack
A
B
C
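The write path in this diagram follows the usual HDFS replication pipeline: the client asks the NameNode where to write, streams the block bytes to the first DataNode, each DataNode forwards them to the next, and acknowledgements travel back in reverse before the write is reported complete. The sketch below models that sequence in memory. The class and method names are hypothetical, the mapping of the slide's step numbers to calls is an assumption, and real HDFS streams packets rather than whole blocks.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, then ack upstream."""
        self.blocks[block_id] = data
        acks = downstream[0].write(block_id, data, downstream[1:]) if downstream else []
        return acks + [self.name]  # the last node in the pipeline acks first


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # block id -> DataNodes that confirmed the write

    def allocate_block(self, path):
        """Pick target DataNodes for a new block (toy policy: first N nodes)."""
        block_id = f"{path}#blk{len(self.block_map)}"
        self.block_map[block_id] = []
        return block_id, self.datanodes[: self.replication]

    def complete(self, block_id, acked):
        """Record which replicas were acknowledged."""
        self.block_map[block_id] = acked


def client_write(namenode, path, data):
    block_id, pipeline = namenode.allocate_block(path)
    acks = pipeline[0].write(block_id, data, pipeline[1:])
    namenode.complete(block_id, acks)
    return block_id, acks


nodes = [DataNode("A"), DataNode("B"), DataNode("C")]
nn = NameNode(nodes)
block_id, acks = client_write(nn, "/logs/a.txt", b"block bytes")
print(block_id, acks)
```

Note that the NameNode only handles metadata here; the block bytes themselves never pass through it.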

Hadoop - Past, Present and Future - v1.1

  • 1. 6/19/2014 Prepared for: Presented by: “Big Data Joe” Rossi @bigdatajoerossi Hadoop Past, Present and Future
  • 2. Roadmap ~45mins 1- What Makes Up Hadoop 1.x? 2- What’s New In Hadoop 2.x? 3- The Future Of Hadoop …
  • 3. What Makes Up Hadoop 1.x?
  • 4. Hadoop 1.0: HDFS + MapReduce NameNode DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker JobTracker Client 1-1 1-2 1-3
  • 5. Hadoop 1.0: HDFS + MapReduce NameNode DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker JobTracker Client 1-1 1-2 1-3 Reduce Map 2-1 3-2 3-3 4-1 2-3 4-2 2-2 3-1 4-3 Reduce Map
  • 6. MapReduce v1 Limitations Scalability Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000 Availability JobTracker failure kills all queued and running jobs Resources Partitioned into Map and Reduce Hard partitioning of Map and Reduce slots led to low resource utilization No Support for Alternate Paradigms / Services Only MapReduce batch jobs, nothing else
  • 7. HADOOP 1.0 Single Use System Batch Apps Apache Hadoop 1.0: Single Use System HDFS (redundant, reliable storage) MapReduce (cluster resource management and data processing) Pig Hive
  • 8. What’s New In Hadoop 2.x?
  • 9. YARN Replaces MapReduce Yet Another Resource Negotiator YARN YARN will be the de-facto distributed operating system for Big Data
  • 10. Store DATA in one place YARN: Taking Hadoop Beyond Batch Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service Applications Run Natively IN Hadoop HDFS2 (redundant, reliable storage) YARN (cluster resource management) BATCH (MapReduce) INTERACTIVE (Tez) ONLINE (HBase) STREAMING (DataTorrent) GRAPH (Giraph)
  • 11. Running all on the same Hadoop cluster to give applications access to all the same source data! YARN: Applications MapReduce v2 Stream Processing Master-Worker Online In-Memory Apache Storm
  • 12. 2010 2011 2012 2013 2014 Today YARN: Moving Quickly Conceived at Yahoo! Alpha Releases – 2.0 Beta Releases – 2.1 GA Released – 2.2 100,000+ nodes, 400,000+ jobs daily 10 million+ hours of compute daily Version 2.3 Version 2.4
  • 13. YARN: Dr. Evil Approved
  • 14. YARN: What Has Changed? YARN MRv1 RM ResourceManager AM ApplicationMaster JT JobTracker Scheduler Scheduler NM NodeManager TT TaskTracker Container Map Reduce ResourceManager Scheduler JobTracker Scheduler NodeManager ApplicationMaster TaskTracker Map Reduce NodeManager Container Container TaskTracker Map Reduce
  • 15. Scale New programming models and services Improved cluster utilization Agility Backwards compatible with MapReduce v1 Mixed workloads on the same source of data 6 Benefits of YARN
  • 16. The Future of Hadoop Projects and Roadmap
  • 17. Speed Deliver interactive query through 100x performance increases as compared to Hive 10. Stinger: Interactive Query for Hive SQL Support the broadest array of SQL semantics for analytic applications running against Hadoop. Scale The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.
  • 18. Dynamic Scaling On-demand cluster size. Increase and decrease the size with load. HOYA: HBase (NoSQL) on YARN Easier Deployment APIs to create, start, stop and delete HBase clusters. Availability Recover from Region Server loss with a new container.
  • 19. Machine Learning Framework well suited for building machine learning jobs. Microsoft REEF Scalable / Fault Tolerant Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models. Maintain State Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done. Retainable Evaluator Execution Framework
  • 20. Heterogeneous Storages in HDFS NameNode Storage NameNode SATA SSD Fusion IO
  • 21. Apache Hadoop 2.4 ResourceManager HA / Auto Failover HDFS Rolling Upgrades Apache Hadoop 2.5 NodeManager Restart w/o disruption Dynamic Resource Configuration Hadoop Roadmap RELEASED EARLY Q2 2014 MID Q2 2014
  • 22. I Know You Have Questions … No such thing as a stupid question. Hadoop: Past, Present and Future
  • 23. SD Big Data Meetup One Last Thing … meetup.com/sdbigdata 2nd Wednesday Of The Month Next: July 9th @ 5:45P
  • 24. Thank You! Hadoop: Past, Present and Future Big Data Joe Rossi http://bigdatajoe.io/ @bigdatajoerossi
  • 25. Supporting Slides Slides with information that may be asked
  • 26. YARN: How It Works ResourceManager NodeManager ApplicationMaster NodeManager NodeManager NodeManager Scheduler Container Container Container Client
  • 27. YARN: Example App Deployment ResourceManager NodeManager HOYA / HBase Master NodeManager NodeManager NodeManager Scheduler Region Server Region Server Region Server HOYA Client
  • 28. Storm Vs. DataTorrent Solution Matrix DataTorrent Apache Storm Atomic Micro-batch 1 3 Events per Second Billions Thousands Automated Parallelism 3 Dynamic Runtime Changes 3 Linear Scalability 3 State Checkpointing 3
  • 29. Apache Spark + Shark HDFS2 (redundant, reliable storage) YARN (cluster resource management) Apache Spark Shark Hive (sql)
  • 30. Hadoop 2.x – YARN + HDFS NameNode DataNode / NodeManager DataNode / NodeManager DataNode / NodeManager DataNode / NodeManager Standby NameNode / ResourceManager Container Container Container Container Container Container Container Container
  • 31. Backwards Compatible YARN is Backwards Compatible for your existing MapReduce applications. You can get value from it right away. YARN: Key Take-Aways Resource Management YARN enables Fine Grained Resource Management for better cluster utilization. One Source of Data YARN allows you to interact with One Source of Data in multiple ways while maintaining Predictable Performance and Quality of Service. Enabling Smart People YARN is a flexible framework that is giving smart people and companies the ability to do amazing things with data. YARN will be the de-facto distributed operating system for Big Data
  • 32. Storm Vs. DataTorrent - Detailed Solution Matrix DataTorrent Apache Storm Proprietary / Open Source O O Support for Hadoop 1.x 1 1 Support for Hadoop 2.x 1 1 Native YARN 1 3 Dashboard 1 3 Extensible via Modules 1 1 Technical Support 1 1 Atomic Micro-batch 1 3 Events per Second Billions Thousands Automated Parallelism 1 3 Dynamic Runtime Changes 1 3 High Availability 1 2 Prog. Languages Supported Java, Python, etc. Java, Python, etc. Log Analysis 1 3 Site Operations 1 3 MapReduce Diagnostics 1 3 Open Source Operators Library 1 2 Open Source Application Templates 1 3 Complex Computations (DAG) 1 3 Linear Scalability 1 3 Security 1 3 CLI and Macros 1 3 Configuration Based Specification 1 3 State Checkpointing 1 3
  • 33. Users forced to create data system silos for managing mixed workloads Developers forced to abuse very specific MapReduce to fit their use cases The 1st Generation Of Hadoop Hadoop HBase
  • 34. Apache Spark HDFS2 (redundant, reliable storage) YARN (cluster resource management) Apache Spark Shark Hive (sql) Spark Streaming MLlib (machine learning)
  • 35. Project Mgt Committee Members 0 2 4 6 8 10 12 14 16 Hortonworks Others Cloudera Yahoo! Facebook 7 6 3 15 11
  • 36. Project Committers 0 5 10 15 20 25 30 Hortonworks Others Cloudera Yahoo! Facebook 24 24 11 11 5
  • 37. YARN: Why The De-Facto Distributed OS Technology Adoption 100,000 nodes+ - 400,000 jobs - 10m compute hours daily Enables Innovation Smart people and companies to do amazing things to data Financial Backing 568m+ invested in Hadoop contributing companies, nearly 400m in 2013 alone
  • 38. Apache Storm Topology Bolt (Filter)Spout Stream (Data Source) Spout Stream (Data Source) Bolt (RDBMS Writes) Bolt (Calculation) Bolt (HDFS Writes) RDBMS HDFS
  • 39. HDFS Write Data Flow NameNode Client DataNode DataNode DataNode 1 2 4 5 6 7 3 Block Bytes Block Bytes Block Bytes Block Write Complete Ack Ack Ack A B C
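
The "hard partitioning of Map and Reduce slots" limitation on slide 6, and the move to Containers on slide 14, can be illustrated with a small sketch. This is a toy model and not Hadoop code: the function names and the slot and task counts are hypothetical, and real YARN containers are sized by resources such as memory rather than counted as interchangeable units.

```python
def running_with_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """MRv1-style node: a map task may only use a map slot, a reduce task a reduce slot."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def running_with_containers(map_tasks, reduce_tasks, containers):
    """YARN-style node (simplified): any container can host any kind of task."""
    return min(map_tasks + reduce_tasks, containers)


def utilization(running, capacity):
    """Fraction of the node's task capacity that is actually busy."""
    return running / capacity


# Hypothetical node with capacity for 10 tasks, during a map-heavy phase.
pending_maps, pending_reduces = 10, 0

slots_busy = running_with_slots(pending_maps, pending_reduces, map_slots=6, reduce_slots=4)
containers_busy = running_with_containers(pending_maps, pending_reduces, containers=10)

print(utilization(slots_busy, 10))       # 4 reduce slots sit idle
print(utilization(containers_busy, 10))  # every container is usable
```

With the same pending work, the slot-based node leaves its reduce slots idle while map tasks queue, which is the low-utilization problem the slide describes.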

Editor's Notes

  1. “Hadoop” has become synonymous with “Big Data” .. While Hadoop is a big part of the Big Data movement .. Hadoop itself is just a platform and the tools
  2. The architecture of MapReduce came with its limitations. Scalability: Even as server specs rose to accommodate more load, it still couldn’t scale past the maximum number of concurrent tasks. Availability: A JT failure kills all queued and running jobs. After a restart they have to be resubmitted and start from the beginning; they are unable to resume from where they left off. This can be a huge problem if you have long-running batch jobs. Resource partitioning: Resources were broken up into distinct map and reduce slots which aren’t “fungible”. I love that word. It basically means they weren’t interchangeable. Map slots might be “full” while reduce slots remain empty, and vice versa. This needed to be addressed to ensure the entire system could be used at max capacity for high utilization. Lack of support for alternate paradigms: You were stuck using MapReduce.
  3. In Hadoop 1.0, all methods of accessing the data within the cluster were constrained to using MapReduce. Open-source Hadoop projects like Pig and Hive are built on top of MapReduce, and even though they make MapReduce more accessible, they still suffer from its limitations. You have seen some distributions move outside of the Hadoop ecosystem, like Cloudera’s Impala, to get around the limitations of MapReduce and improve performance. But then, unfortunately, it isn’t community supported and lags behind in features because it doesn’t have the backing of the innovative open source committers. The crazy thing is that even with these limitations, 90% of the use cases Nick spoke about yesterday are based on this.
  4. So, what has Trace3 found out on its journey through Big Data about YARN? Well, first of all, we discovered that it’s not the type of yarn that cats play with. YARN will be the de-facto distributed operating system for Big Data, and by the end of this hour you are going to see why we believe it is and why companies like Cloudera, Hortonworks and MapR are banking on this.
  5. YARN is taking Hadoop beyond batch. YARN has solved the limitations of MapReduce v1. YARN gives you the ability to store all your data in one place and have mixed workloads working with that data while still getting predictable performance and QoS. YARN is moving Hadoop beyond just MapReduce and Batch into Interactive, Online, Streaming, Graph, In-Memory, etc.
  6. Here are some of the apps that are making up that compute time. HBase will be deployed on YARN, which we will talk about a bit more later. Master-Worker applications: MapReduce has been moved out to its own application framework. Real-Time Streaming Analytics: This, in my opinion, is the most promising of the application types. I don’t want to steal my associate Rikin’s thunder, but he will be speaking a lot more in-depth about Real-Time Streaming Analytics in a session later today. Graph Processing: YARN has enabled the use of iterative applications like Apache Giraph within your cluster, where previously MapReduce v1 just wasn’t a viable option.
  7. YARN is fairly new to the scene, but that shouldn’t deter you from being confident in it. It was conceived and architected by Yahoo! and has gone through a very quick maturing process due to the open source community putting it through its paces. Currently YARN is running on over 100,000 nodes and is responsible for 400,000+ jobs and 10 million+ hours of compute time daily.
  8. Yes, I said 10 ‘millllllion’
  9. So, what’s changed with YARN for it to be able to accomplish this? YARN splits up the two major functions of the JobTracker into the ResourceManager and the ApplicationMaster. The global ResourceManager handles all of the cluster resources; its Scheduler performs its scheduling function based on the resource requirements of the applications. The per-node slave NodeManager is responsible for launching application containers, monitoring their resource usage, and reporting the same to the ResourceManager. The per-application ApplicationMaster is responsible for negotiating the appropriate resource containers from the Scheduler, tracking their status, and monitoring for progress. Per-application Containers run on a NodeManager. Let’s see how these all work together.
  10. Scale: YARN is no longer limited by the 40,000 concurrent tasks that MapReduce v1 had. Today YARN is already handling over 10 million hours of compute time on a daily basis. New Programming Models and Services: You aren’t limited to just MapReduce. If your app can benefit from a distributed operating system, then you can utilize it. Improved Cluster Utilization: YARN no longer has a hard partition of resources into map and reduce slots; it utilizes resource leases, aka “Containers”, that aren’t limited in functionality. Agility: Moving MapReduce out and on top of YARN gives customers more agility to make changes, upgrade and have different versions of their framework running, so they don’t have to affect the entire cluster. Backwards Compatible: What you are currently doing with Hadoop 1.x and MapReduce v1 will work with YARN. Mixed workloads on the same data source: You can utilize the “data lake” architecture and run all your apps while still having predictable performance and quality of service.
  11. One of the projects that I’m keeping a close eye on is the Stinger project. Speed: 100x speed increase over Hive 10. SQL: Improve HiveSQL to make it more ANSI SQL-like. Scale: Ability to run queries on Terabytes to Petabytes of information.
  12. Another project to watch closely is HBase on YARN. Dynamic Scaling: Scales with usage; the cluster size increases and decreases with load. Easier Deployment: HBase cluster deployment can be somewhat complicated, and they are looking to correct that by allowing you to do it using built-in APIs. Availability: When a “RegionServer” is lost, recovering it is just a matter of deploying another container within the cluster.
  13. This is a project that lies outside of my wheelhouse, but from what I’ve learned about it, it’s going to do amazing things for Machine Learning. I’ve also highlighted this project to add even more credibility to YARN by showing you that a company like Microsoft is dedicating internal time and resources to building applications that run on YARN.
  14. Previously a NameNode had one classification of storage media available to it. Now, as of 2.3, NameNodes have the ability to split up the storage media available to them. Adding awareness of storage media can allow HDFS to make better decisions about the placement of block data with input from applications. An application can choose the distribution of replicas based on its performance and durability requirements.
  15. NodeManager Restart allows for a restart of the NM without losing jobs. They will continue where they left off after the restart. Dynamic Resource Config: Currently containers are static; they allocate a certain amount of proc / memory to each process. Now processes will have the ability to scale up within a container if resources are available within that NodeManager.
  16. Jobs are submitted to the ResourceManager via a public submission protocol and go through an admission control phase during which security credentials are validated and various checks are performed. The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources for various competing applications in the cluster. Because it has a central and global view of the cluster resources, it can enforce properties such as fairness, capacity, and locality across nodes. Accepted jobs are passed to the scheduler to be run. Once the scheduler has enough resources, the application is moved from the accepted to the running state. This involves allocating a resource lease, aka a container (bound JVM), for the AM and spawning it on a node in the cluster. A record of accepted applications is written to persistent storage and recovered in case of RM failure. The ApplicationMaster is the “head” of a job, managing all lifecycle aspects including dynamically increasing and decreasing resource consumption, managing the flow of execution and handling faults. By delegating all these functions to AMs, YARN’s architecture gains a great deal of scalability, programming model flexibility, and improved upgrading/testing since multiple versions of the same framework can coexist. The RM interacts with a special system daemon running on each node called the NodeManager (NM). Communications between RM and NMs are heartbeat based for scalability. NMs are responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing). The RM assembles its global view from these snapshots of NM state.
  17. Jobs are submitted to the ResourceManager via a public submission protocol and go through an admission control phase during which security credentials are validated and various checks are performed. The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources for various competing applications in the cluster. Because it has a central and global view of the cluster resources, it can enforce properties such as fairness, capacity, and locality across nodes. Accepted jobs are passed to the scheduler to be run. Once the scheduler has enough resources, the application is moved from the accepted to the running state. This involves allocating a resource lease, aka a container (bound JVM), for the AM and spawning it on a node in the cluster. A record of accepted applications is written to persistent storage and recovered in case of RM failure. The ApplicationMaster is the “head” of a job, managing all lifecycle aspects including dynamically increasing and decreasing resource consumption, managing the flow of execution and handling faults. By delegating all these functions to AMs, YARN’s architecture gains a great deal of scalability, programming model flexibility, and improved upgrading/testing since multiple versions of the same framework can coexist. The RM interacts with a special system daemon running on each node called the NodeManager (NM). Communications between RM and NMs are heartbeat based for scalability. NMs are responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing). The RM assembles its global view from these snapshots of NM state.
  18. The Stinger project is tackling the speed portion by utilizing Apache Tez. Tez sits at the layer between MapReduce, Pig and Hive to optimize the execution of these applications.
  19. MapReduce version 1 consisted of 2 daemons / processes. The JobTracker is a master node responsible for managing the cluster resources (map and reduce slots) and job scheduling. The TaskTracker is a per-node agent and manages the map and reduce tasks.
  20. Backwards Compatible: Whatever you are doing with Hadoop 1.0 and MapReduce today will work with YARN. Even though you don’t need all the capabilities of YARN right now, don’t hesitate to move to it, and as new tools and applications become available on YARN your company will be able to utilize them. One Source Of Data: YARN allows you to have that data lake with all of your data applications running against it, while still maintaining predictable performance and quality of service. Resource Management: YARN accomplishes this through how it manages resources for better cluster utilization, which translates to “more bang for your buck”. Enabling Smart People: YARN is an extremely flexible framework that is giving smart people and companies the ability to do amazing things with data. All these benefits add up to “YARN will be the de-facto distributed operating system for Big Data”. We see the innovation in Big Data happening on YARN, and we want to help you make the right choice now to avoid the headaches and costs that come along with making the wrong choice.
  21. Before we can fully understand what YARN is solving, we need to review what it’s replacing. Hadoop 1.0: The initial design of Hadoop was focused on running massive MapReduce jobs to process web crawls. Although it did end up evolving outside of its initial use case and helped solve the “data silo problem”, it ended up creating a different issue, something called the “data system silo problem”. Users were forced into creating data system silos due to mixed workloads. HBase Example: Developers were forced to abuse the very specific MapReduce programming model to try to accommodate their use cases. One of the biggest costs of a Hadoop cluster is copying data between clusters to try to accommodate mixed workloads.
  22. PMCs are the people that give oversight for the project roadmap and provide guidance to the committers. One thing to highlight may be that Hortonworks is a spin-off of Yahoo!
  23. The committers are the ones who actually submit code to the project. One thing to highlight may be that Hortonworks is a spin-off of Yahoo!
  24. The question may arise of how I can state that YARN will be the de-facto distributed operating system of Big Data. Here are the arguments for my conclusion / prediction.
  25. HDFS Write Data Flow. 1 – 7: Connect to the NN to establish block placement. Writing to the DNs. Once 1 copy of the data is placed, the client gets an acknowledgement. The first DN copies the file to the second DN. The second DN copies to the third DN. The DNs acknowledge to the other DNs that the copy has been completed. The DNs acknowledge to the other DNs that the copy has been completed. A, B and C: Once the files are written to the DNs, the information about the new block is sent in the block report / heartbeat.
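
The write flow in the last note can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the HDFS client API: the `NameNode`, `DataNode` and `client_write` names are hypothetical, placement simply picks the first three DataNodes, and acknowledgements are modelled as travelling back up the replication pipeline from the last DataNode to the first.

```python
class DataNode:
    """Toy DataNode: stores a block, forwards it downstream, then acknowledges."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        self.blocks[block_id] = data  # store the local replica
        acks = []
        if downstream:
            # Copy the block to the next DataNode in the pipeline.
            acks = downstream[0].write(block_id, data, downstream[1:])
        # Acknowledge only after everything downstream has acknowledged.
        return acks + [self.name]


class NameNode:
    """Toy NameNode: chooses block placement and records block reports."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # block id -> names of DataNodes holding a replica

    def choose_targets(self):
        # Assumption: naive placement, no rack awareness.
        return self.datanodes[: self.replication]

    def block_report(self, datanode):
        for block_id in datanode.blocks:
            self.block_map.setdefault(block_id, set()).add(datanode.name)


def client_write(namenode, block_id, data):
    targets = namenode.choose_targets()                   # ask the NN for placement
    acks = targets[0].write(block_id, data, targets[1:])  # write through the pipeline
    for dn in targets:                                    # DNs report the new block
        namenode.block_report(dn)
    return acks


datanodes = [DataNode(f"dn{i}") for i in range(1, 5)]
namenode = NameNode(datanodes)
ack_order = client_write(namenode, "blk_1", b"block bytes")
print(ack_order)
print(sorted(namenode.block_map["blk_1"]))
```

The client only talks to the NameNode for placement and to the first DataNode for data; the DataNodes replicate among themselves, which is the reason the diagram shows a chain rather than three separate client writes.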