© Hortonworks Inc. 2014
Apache Hadoop YARN
Present and Future
Vinod Kumar Vavilapalli
vinodkv [at] apache.org
@tshooter
Jian He
jianhe [at] apache.org
Who are we?
• Vinod Kumar Vavilapalli
– 7 Hadoop-years old
– Previously @Yahoo!, now @Hortonworks
– Hadoop MapReduce and YARN Development lead & Architect at Hortonworks
– Apache Hadoop YARN project lead
– Apache Hadoop PMC, Apache Member
– 99% + code in Apache, Hadoop
• Jian He
– Software Engineer @ Hortonworks
– Apache Hadoop Committer
– Master's degree from Brown University
– Focus on YARN/MapReduce
Architecting the Future of Big Data
A quick show of hands..
• Hadoop 1
• Hadoop 2 & YARN
• YARN for MapReduce2
• YARN for beyond MR2
Agenda
• Apache Hadoop 2 : Overview
• Community
• Present
• Future
Apache Hadoop 2
Next Generation Architecture
YARN: the Data Operating System
• Resource Management Platform
• MapReduce v2
• Beyond MapReduce with Tez, Storm, Spark; in Hadoop!
• Did I mention Services like HBase, Accumulo on YARN with Apache Slider?
Why?
• 2.0 >= 2 * 1.0
– YARN: Next generation architecture
• Scale
• Agility
• Return on Investment: 2x throughput on same hardware!
• Ready for improvements in hardware
• Not convinced? Let’s see what others are saying!
Yahoo!
• Leader/Visionary on all things Hadoop!
• On YARN (0.23.x)
• Moving fast to 2.x
http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
Twitter
Talk: “Hadoop 2 @Twitter, Elephant Scale”
By: Lohit Vijayarenu & Gera Shegalov
eBay
• Has one of the largest Hadoop clusters in the industry, with tens to hundreds of petabytes of data
• Migrated production clusters to Hadoop-2
YARN Community
At Apache Software Foundation
YARN contributions
[Chart: YARN contributions by release line (2.0.x, 2.1.x, 2.2.x, 2.3.x, 2.4.x, 2.x, trunk); YARN Releases - 06/02/14; vertical axis 0 to 400]
Contributors
• 104 and counting
• Few ‘big’ contributors
• And a long tail
Present
Apache Hadoop releases
Apache Hadoop 2.2
• 15 October, 2013
• The 1st GA release of Apache Hadoop 2.x
• YARN
– First stable and supported release of YARN
– YARN level APIs solidified for the future
– Binary Compatibility for MapReduce applications built on hadoop-1.x
– Performance
– Scale!
• Support for running Hadoop on Microsoft Windows
• Substantial amount of integration testing with rest of projects in the ecosystem
– Pig, Hive, Oozie, HBase..
Apache Hadoop releases (contd)
Apache Hadoop 2.3
• 24 February, 2014
• First post GA release for the year 2014
• Alpha features in YARN
– ResourceManager High Availability
– Application History Server
– Will be covered in detail in the 2.4 section
• Number of bug-fixes, enhancements
Apache Hadoop releases (contd)
Apache Hadoop 2.4
• 7 April, 2014
• Most recent release
• Stabilizing features in YARN
– Details follow
– ResourceManager HA
– YARN Timeline Server (beyond history server)
– Preemption in YARN CapacityScheduler
– Container-preserving AM recovery.
ResourceManager High Availability
• RM – single point of failure
• Goal : Downtime invisible to end-users
– Apps not required to be re-submitted
– NMs to rebind with newly started RM
• Two stories:
– Recovery of state
– Failover
ResourceManager High Availability
• Active/Standby
– Leader election (ZooKeeper)
• Standby, on transition to Active, loads all the state from the state store
• NMs, AMs and clients redirect to the new RM
– RMProxy library
Talk: Highly Available Resource Management for YARN
By: Karthik Kambatla, Xuan Gong
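The client-side redirect on this slide can be pictured with a small simulation. This is only a sketch under stated assumptions, not the actual RMProxy code: `StandbyError`, `FakeRM` and `FailoverProxy` are hypothetical names invented here, and a real client would also handle retries, timeouts and connection errors.

```python
class StandbyError(Exception):
    """Raised by a (simulated) ResourceManager that is not currently active."""


class FakeRM:
    """Hypothetical stand-in for one ResourceManager endpoint."""

    def __init__(self, name, active=False):
        self.name = name
        self.active = active

    def submit(self, app):
        if not self.active:
            raise StandbyError(self.name)
        return f"{app} accepted by {self.name}"


class FailoverProxy:
    """Tries each configured RM in turn and remembers the one that answered."""

    def __init__(self, rms):
        self.rms = list(rms)
        self.current = 0

    def submit(self, app):
        for _ in range(len(self.rms)):
            rm = self.rms[self.current]
            try:
                return rm.submit(app)
            except StandbyError:
                # Move on to the next configured RM, wrapping around.
                self.current = (self.current + 1) % len(self.rms)
        raise RuntimeError("no active ResourceManager found")


rm1, rm2 = FakeRM("rm1", active=True), FakeRM("rm2")
proxy = FailoverProxy([rm1, rm2])
first = proxy.submit("app-1")

# Simulate a failover: rm1 drops to standby and rm2 becomes active.
rm1.active, rm2.active = False, True
second = proxy.submit("app-2")
```

The caller never names an RM directly, which is the point of the design: the application does not have to be re-submitted or reconfigured when the active RM changes.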
YARN Timeline Server
• Few MR specific implementations: History and web-UI
• YARN: Not just MR anymore!
• Previous state
– MapReduce specific Job History Server
– YARN level ‘History’ lost beyond ResourceManager Restart
YARN Timeline Server (contd)
• Entity and Event collection
– RM and Applications periodically send events to the Timeline Server
• Pluggable store
– Depending on site requirements
• REST APIs or RPC
– Applications and user-interfaces can access information via REST/RPC
• Visualizations
– Users can build tools and visualizations using the APIs
• Apps and System
– Applications as well as the system entities/events
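To make the entity and event model more concrete, here is a minimal sketch of what one entity with two events might look like as JSON. The field names (`entity`, `entitytype`, `starttime`, `events`, `timestamp`, `eventtype`, `eventinfo`) are assumptions modeled on the Timeline Server's JSON format and should be checked against the Hadoop version in use; the entity type `MY_APP_RUN`, the event names and the helper `make_entity` are purely illustrative. No request is sent.

```python
import json


def make_entity(entity_id, entity_type, start_ms, events):
    """Build one timeline entity as a plain dict.

    Field names are assumptions modeled on the Timeline Server JSON format.
    """
    return {
        "entity": entity_id,
        "entitytype": entity_type,
        "starttime": start_ms,
        "events": [
            {"timestamp": ts, "eventtype": kind, "eventinfo": info}
            for ts, kind, info in events
        ],
    }


entity = make_entity(
    "run_0001",
    "MY_APP_RUN",  # hypothetical application-defined entity type
    1401667200000,
    [
        (1401667200000, "RUN_STARTED", {"user": "alice"}),
        (1401667260000, "RUN_FINISHED", {"status": "SUCCEEDED"}),
    ],
)

# The body an application could send to the server's REST endpoint.
payload = json.dumps({"entities": [entity]}, sort_keys=True)
```

Because the payload is generic JSON rather than a MapReduce-specific record, any framework on YARN can publish its own entity types and any tool can read them back.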
YARN Timeline Server (contd)
[Diagram: App1, App2 and the RM send Events to the YARN Timeline Server; AMBARI and a custom app-monitoring client access it via RPC and the REST API]
Talk: “Analyzing Historical Data of Applications on Hadoop
YARN: for Fun and Profit”
By: Zhijie Shen, Mayank Bansal
Capacity Scheduler Preemption
• Enforce SLAs
• Preempt across queues
STEP 1: Gather Queue State
• Current Capacity
• Guaranteed Capacity
STEP 2: Identify preemptions
• Select applications to preempt: over-capacity queues
STEP 3: Issue preemptions
• Issue preemptions for containers to the application
STEP 4: Kill containers
• Track containers that have been issued a preemption but have not yet executed it
• Forcibly kill these containers after a timeout
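The four steps can be sketched as a toy simulation. This is a simplified illustration under stated assumptions, not the CapacityScheduler's actual policy: capacity is counted in whole containers, the most recently started containers are picked first, and `Queue`, `identify_preemptions`, `issue_preemptions` and `kill_expired` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    guaranteed: int                                  # containers the queue is entitled to
    containers: list = field(default_factory=list)   # ids of running containers

    @property
    def current(self):
        return len(self.containers)


def identify_preemptions(queues):
    """Steps 1 and 2: compare current vs guaranteed capacity, pick containers."""
    to_preempt = {}
    for q in queues:
        excess = q.current - q.guaranteed
        if excess > 0:
            # Naive choice: reclaim the most recently started containers.
            to_preempt[q.name] = q.containers[-excess:]
    return to_preempt


def issue_preemptions(to_preempt, now):
    """Step 3: record when each preemption request was sent."""
    return {c: now for cs in to_preempt.values() for c in cs}


def kill_expired(pending, released, now, timeout):
    """Step 4: containers not released within the timeout are killed."""
    return sorted(
        c for c, issued in pending.items()
        if c not in released and now - issued >= timeout
    )


queues = [
    Queue("prod", guaranteed=3, containers=["p1", "p2"]),
    Queue("adhoc", guaranteed=2, containers=["a1", "a2", "a3", "a4"]),
]
targets = identify_preemptions(queues)
pending = issue_preemptions(targets, now=0)

# The application gives up a3 on its own; a4 is still running at the timeout.
killed = kill_expired(pending, released={"a3"}, now=15, timeout=15)
```

Separating the request (step 3) from the forced kill (step 4) gives a well-behaved application time to checkpoint and release containers itself before anything is lost.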
Capacity Scheduler Preemption (Contd)
[Diagram: the Scheduler sends preemptions to the Application; the Application releases resources; containers that are not released are killed forcibly after a timeout]
Container-preserving AM restart
• Problem
– Containers are killed when AM goes down.
– New AM needs to know where the previous containers are running
– Previous containers need to know about the new AM. (WIP)
[Diagram: AM1 restarts as AM2 while Container1, Container2 and Container3 keep running]
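A toy model may help show what "the new AM needs to know where the previous containers are running" means in practice. It is a sketch under stated assumptions: `TrackingRM`, `allocate` and `register_am` are hypothetical names, not YARN APIs, and the `keep_containers` flag merely stands in for whatever setting enables the feature.

```python
class TrackingRM:
    """Hypothetical RM that remembers which containers each app still has."""

    def __init__(self):
        self.running = {}   # app id -> {container id: node}
        self.attempt = {}   # app id -> current attempt number

    def allocate(self, app, container, node):
        self.running.setdefault(app, {})[container] = node

    def register_am(self, app, keep_containers):
        """Called by each new AM attempt; returns the surviving containers."""
        self.attempt[app] = self.attempt.get(app, 0) + 1
        if not keep_containers and self.attempt[app] > 1:
            # Old behaviour: containers are killed when the AM goes down.
            self.running[app] = {}
        return dict(self.running.get(app, {}))


rm = TrackingRM()
rm.register_am("app_1", keep_containers=True)    # AM1 starts
rm.allocate("app_1", "container_1", "node-a")
rm.allocate("app_1", "container_2", "node-b")

# AM1 dies; AM2 registers and learns where the old containers are.
recovered = rm.register_am("app_1", keep_containers=True)
```

The remaining half of the problem, letting the running containers find the new AM, is not modeled here; the slide marks it as work in progress.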
Apache Hadoop releases (contd)
Apache Hadoop 2.5.x
• Next releases
– 2.4.1
– 2.5.x
• YARN
– Details follow in the Future section
– ResourceManager work-preserving restart for High Availability
– YARN Timeline Server security & enhancement.
– Lots more
Future
Future: Operational enhancements
• Rolling upgrades
– No/minimal impact to users
– Ideal: Always rolling!
• HDFS: rolling-upgrades effort is already in
• YARN
– RM restart
– NM restart
– Upgrades
Talk: “Hadoop Rolling Upgrades – Taking Availability to the Next Level”
By: Suresh Srinivas, Hortonworks & Jason Lowe, Yahoo!
Future: Enabling apps
• Beyond MapReduce
– Apache Tez, Apache Slider, Apache Storm.
• Discussing next
– Long running services
– Multi-dimensional resource scheduling
– Isolation
– Web services
Future: Long running services
• You can run them already!
• Few enhancements needed
– Logs
– Security
– Management/monitoring
• Resource sharing across workload types
Talk: “Bring your Service to YARN”
By: Sumit Mohanty
Multi-resource scheduling
• Today – memory & cpu
– Physical memory / virtual memory
– CPU Cores – Virtual cores
• CPU scheduling: needs more bake-in
• Disks
– Space
– IOPS
• Network
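Scheduling on more than one resource means a request has to fit in every dimension at once. The sketch below illustrates that idea for the two dimensions available today, memory and virtual cores. It is a minimal illustration, not scheduler code: `Resource` and `place` are hypothetical names, and first-fit placement is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int

    def fits_in(self, other):
        """True only if every dimension fits, not just memory."""
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other):
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


def place(request, free_by_node):
    """Return the first node with room in all dimensions and update its free space."""
    for node, free in free_by_node.items():
        if request.fits_in(free):
            free_by_node[node] = free.minus(request)
            return node
    return None


free = {
    "node-a": Resource(memory_mb=8192, vcores=1),
    "node-b": Resource(memory_mb=4096, vcores=4),
}
# node-a has plenty of memory but too few vcores, so node-b is chosen.
chosen = place(Resource(memory_mb=2048, vcores=2), free)
```

Adding disk space, IOPS or network would extend `Resource` with further fields and `fits_in` with further comparisons, which is why each new dimension needs care before it is switched on.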
Fine-grain isolation for multi-tenancy
• Custom memory-monitoring
• Cgroups
• Linux Containers
• VMs
Other features
• Application SLAs
– Run my application at 6:00 AM tomorrow and guarantee capacity for me!
• Node labels
– Some of the nodes in my cluster have specialized hardware, give them to me!
• Node affinity/anti-affinity
– Get me on to the nodes where my data is
– Get me off of this node
• Better online queue-management
– Centralized
– Quality feedback
• Web-services
– RESTful APIs for submitting, monitoring and killing apps
– Beyond java-only clients
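Node labels and anti-affinity both come down to filtering the set of candidate nodes before placement. The following sketch shows one way that filter could look. It is purely illustrative: these features are listed as future work, so `candidate_nodes`, the label names and the node names are all hypothetical.

```python
def candidate_nodes(nodes, required_label=None, avoid=()):
    """Filter nodes by a required label and an anti-affinity list.

    `nodes` maps node name -> set of labels; all names here are illustrative.
    """
    return sorted(
        name
        for name, labels in nodes.items()
        if (required_label is None or required_label in labels)
        and name not in avoid
    )


cluster = {
    "node-a": {"gpu"},
    "node-b": {"gpu", "ssd"},
    "node-c": {"ssd"},
}

# "Give me the specialized hardware, but get me off node-a."
gpu_nodes = candidate_nodes(cluster, required_label="gpu", avoid={"node-a"})
```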
YARN Ecosystem
Beyond the core YARN project: Briefly
Eco-system
Applications Powered by YARN
• Classic Apache Hadoop MapReduce – Batch
• Batch & Interactive
– Apache Tez
• Stream Processing
– Apache Storm
– Apache Samza
• Apache Spark – Iterative applications
• YARN Frameworks
– Apache Twill
– Microsoft REEF
• Existing apps
– Apache Slider
• Graph Processing
– Apache Giraph
There's an app for that... YARN App Marketplace!
Talk: “Apache Tez - A New Chapter in Hadoop Data Processing”
By: Bikas Saha, Hitesh Shah
Recap
Recap
• YARN helps Apache Hadoop 2 to be twice as good!
• Exciting journey with Hadoop for this decade…
– Hadoop is no longer a one-trick pony, err elephant
– Beyond just MapReduce
• Hadoop 2: Architecture for the future
– Centralized data, multiple apps
• Lots of exciting new features
– Exciting spectrum of application types, workloads and use-cases
Couple more things..
The Book is out!
Thank you!
Download Sandbox: Experience Apache Hadoop
Both 2.x and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/
Questions Time!


Editor's Notes

  • #36
    – Graph processing: Giraph, Hama
    – Stream processing: Samza, Storm, Spark, DataTorrent
    – MapReduce
    – Tez: fast query execution
    – Weave/REEF: frameworks to help with writing applications
    – This is a list of some of the applications that already support YARN in some form.
    – Samza, Storm, S4 and DataTorrent are streaming frameworks.
    – Giraph and Hama are graph processing systems.
    – There are also some GitHub projects: caching systems, on-demand web-server spin-up.
    – Weave and REEF are frameworks on top of YARN that make writing applications easier.