2. © Hortonworks Inc. 2014
Apache Hadoop YARN
Present and Future
Vinod Kumar Vavilapalli
vinodkv [at] apache.org
@tshooter
3. © Hortonworks Inc. 2014
A quick show of hands..
• Hadoop 2
Architecting the Future of Big Data
Real life Hadoop Logo
4. © Hortonworks Inc. 2014
Who am I?
• 6.75 Hadoop-years old
• Last thing at school: a two-node Tomcat cluster. Three months later, first thing on the job: brought down an 800-node cluster ;)
• Previously @Yahoo!
• Now @Hortonworks
• Two hats
– Hortonworks: Hadoop MapReduce and YARN Development lead
– Apache: Apache Hadoop YARN lead. Apache Hadoop PMC, Apache Member
• Worked/working on
– YARN, Hadoop MapReduce, HadoopOnDemand, CapacityScheduler, Hadoop security
– Apache Ambari: Kickstarted the project and its first release
– Stinger: High performance data processing with Hadoop/Hive
• Lots of troubleshooting on clusters
• 99% + code in Apache, Hadoop
5. © Hortonworks Inc. 2014
Agenda
• Apache Hadoop 2 : Overview
• Past
• Present
• Future
6. © Hortonworks Inc. 2014
Apache Hadoop 2
Next Generation Architecture
7. © Hortonworks Inc. 2014
What is YARN?
• Resource Management Platform
– MapReduce v2
– Beyond MapReduce with Tez, Storm, Spark; in Hadoop!
– Did I mention Services like HBase, Accumulo on YARN with HoYA/Slider?
• How is it different from Hadoop 1? ..
8. © Hortonworks Inc. 2014
Hadoop 1 vs Hadoop 2
• HADOOP 1.0: Single Use System (Batch Apps)
– MapReduce (cluster resource management & data processing)
– HDFS (redundant, reliable storage)
• HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
– MapReduce (data processing), Others
– YARN (cluster resource management)
– HDFS2 (redundant, highly-available & reliable storage)
9. © Hortonworks Inc. 2014
Key Benefits of YARN
• Scale
• New Programming Models & Services
• Improved cluster utilization
• Agility
• To infinity and beyond ..
10. © Hortonworks Inc. 2014
Why Migrate?
• 2.0 >= 2 * 1.0
– HDFS: Lots of ground-breaking features
– YARN: Next generation architecture
• Return on Investment: 2x throughput on same hardware!
• Ready for improvements in hardware
• Not convinced? Let’s see what others are saying!
11. © Hortonworks Inc. 2014
Yahoo!
• Leader/Visionary on all things Hadoop!
• On YARN (0.23.x)
• Moving fast to 2.x
http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
13. © Hortonworks Inc. 2014
eBay
• Has one of the largest Hadoop clusters in the industry, with many petabytes of data
• Migrated production clusters to Hadoop-2
• Go to Mayank’s talk
– “Hadoop-2 @ ebay”!
– Thursday, April 3
– Track : Deployment and Operations
• Should be convinced by now... No?
14. © Hortonworks Inc. 2014
YARN: the Data Operating System
16. © Hortonworks Inc. 2014
Apache Hadoop releases
Apache Hadoop 2.2
• 15 October, 2013
• The 1st GA release of Apache Hadoop 2.x
• YARN
– First stable and supported release of YARN
– Binary Compatibility for MapReduce applications built on hadoop-1.x
– YARN level APIs solidified for the future
– Performance
– Scale!
• HDFS
– High Availability for HDFS
– HDFS Federation
– HDFS Snapshots
– NFSv3 access to data in HDFS
• Support for running Hadoop on Microsoft Windows
• Substantial amount of integration testing with rest of projects in the ecosystem
17. © Hortonworks Inc. 2014
Apache Hadoop releases (contd)
Apache Hadoop 2.3
• 24 February, 2014
• First post GA release for the year 2014
• Alpha features in YARN
– ResourceManager HA
– Application History
– Will cover in the 2.4 content
• HDFS
– Details follow..
• Number of bug-fixes, enhancements
18. © Hortonworks Inc. 2014
HDFS: Heterogeneous Storage
19. © Hortonworks Inc. 2014
HDFS: DataNode caching
20. © Hortonworks Inc. 2014
Apache Hadoop releases (contd)
Apache Hadoop 2.4
• Very soon!
• YARN
– Details follow..
– ResourceManager restart and fail-over for high availability
– Preemption
– Application History and timeline
• HDFS
– FileSystem ACLs
– Rolling upgrades
21. © Hortonworks Inc. 2014
ResourceManager Restart and fail-over
ZooKeeper
22. © Hortonworks Inc. 2014
Capacity Scheduler Preemption
23. © Hortonworks Inc. 2014
Application History and Timeline
• Few MR specific implementations: History and web-UI
• Not just MR anymore!
• History
– MapReduce specific Job History Server
– Beyond ResourceManager Restart
• Timeline
– Framework specific event collection and UIs
• Run analytics on historical apps!
25. © Hortonworks Inc. 2014
Future: Operational enhancements
• Rolling upgrades
– No/minimal impact to users
– Ideal: Always rolling!
• HDFS in
• YARN
26. © Hortonworks Inc. 2014
Future: Enabling more apps
• Beyond MR
• Discussing next
– Long running services
– Isolation
– Multi-dimensional resource scheduling
27. © Hortonworks Inc. 2014
Future: Long running services
• You can run them already!
• Few enhancements needed
– Logs
– Security
– Management/monitoring
• Resource sharing across workload types
• Project Slider
28. © Hortonworks Inc. 2014
Fine-grain isolation for multi-tenancy
• Custom memory-monitoring
• Cgroups
• Linux Containers
• VMs
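To make the "custom memory-monitoring" bullet concrete, here is a minimal Python sketch of the kind of check a node-level monitor could run against each container. It is an illustration only, not the NodeManager's code: the `ContainerUsage` type, the `exceeded_limits` function, and the sample numbers are hypothetical, and the 2.1 virtual-to-physical ratio is an assumed default that a real cluster may configure differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContainerUsage:
    """Hypothetical snapshot of one container's memory use, in MB."""
    container_id: str
    pmem_limit_mb: int  # physical memory the container was granted
    pmem_used_mb: int   # physical memory its process tree currently uses
    vmem_used_mb: int   # virtual memory its process tree currently uses


def exceeded_limits(c: ContainerUsage, vmem_pmem_ratio: float = 2.1) -> List[str]:
    """Return which limits the container is over; an empty list means it is fine.

    The virtual-memory limit is derived from the physical grant times an
    assumed ratio (2.1 here), so only one number is requested per container.
    """
    reasons = []
    if c.pmem_used_mb > c.pmem_limit_mb:
        reasons.append("physical")
    if c.vmem_used_mb > c.pmem_limit_mb * vmem_pmem_ratio:
        reasons.append("virtual")
    return reasons


# Illustrative containers, each granted 1024 MB of physical memory.
healthy = ContainerUsage("c1", pmem_limit_mb=1024, pmem_used_mb=512, vmem_used_mb=1500)
leaky = ContainerUsage("c2", pmem_limit_mb=1024, pmem_used_mb=1100, vmem_used_mb=1500)
mapper = ContainerUsage("c3", pmem_limit_mb=1024, pmem_used_mb=900, vmem_used_mb=2500)

for c in (healthy, leaky, mapper):
    print(c.container_id, exceeded_limits(c) or "within limits")
```

A monitor like this only detects overuse after the fact; cgroups, Linux containers, and VMs enforce limits in the kernel or hypervisor instead, which is the progression the list above describes.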
29. © Hortonworks Inc. 2014
Multi-resource scheduling
• Today – memory & cpu
– Physical memory / virtual memory
– CPU cores / virtual cores
• CPU scheduling: needs more bake-in time
• Disks
– Space
– IOPS
• Network
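One way to reason about scheduling over memory and CPU together is dominant resource fairness: compare applications by the largest fraction of any single resource they hold. The slide does not say which policy is used, so the Python sketch below is an assumption chosen for illustration; `Resource`, `dominant_share`, `pick_next_app`, and all the numbers are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Resource:
    """A two-dimensional resource vector: memory and virtual cores."""
    memory_mb: int
    vcores: int


def dominant_share(used: Resource, cluster: Resource) -> float:
    """Largest fraction of any one cluster resource that `used` occupies."""
    return max(used.memory_mb / cluster.memory_mb, used.vcores / cluster.vcores)


def pick_next_app(usage: Dict[str, Resource], cluster: Resource) -> str:
    """Offer the next container to the app with the smallest dominant share."""
    return min(usage, key=lambda app: dominant_share(usage[app], cluster))


# Illustrative cluster and per-application usage.
cluster = Resource(memory_mb=100_000, vcores=50)
usage = {
    "memory_heavy": Resource(memory_mb=40_000, vcores=5),  # dominant resource: memory
    "cpu_heavy": Resource(memory_mb=10_000, vcores=15),    # dominant resource: vcores
}

print(pick_next_app(usage, cluster))  # cpu_heavy
```

Adding disk space, IOPS, or network to the scheduler would mean adding fields to the resource vector and terms to the `max`, which is why each new dimension needs its own measurement and enforcement story first.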
30. © Hortonworks Inc. 2014
Other features
• Application SLAs
• Node labels
• Node affinity/anti-affinity
• Better online queue-management
31. © Hortonworks Inc. 2014
YARN Ecosystem
Beyond the core YARN project: Briefly
32. © Hortonworks Inc. 2014
Eco-system
Applications Powered by YARN
• Apache Giraph – Graph Processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – Batch
• Apache Tez – Batch/Interactive
• Apache S4 – Stream Processing
• Apache Samza – Stream Processing
• Apache Storm – Stream Processing
• Apache Spark – Iterative applications
• HOYA – HBase on YARN
YARN Frameworks
• Apache Twill
• REEF by Microsoft
• Spring support for Hadoop 2
There's an app for that...
YARN App Marketplace!
33. © Hortonworks Inc. 2014
Apache Tez
• Moving beyond MR
• A data processing framework that can execute a complex DAG of tasks.
• “Apache Tez - A New Chapter in Hadoop Data Processing”
– By Siddharth Seth: YARN & Tez Committer/PMC Member
– Thursday, April 3 (4:20-5:00pm)
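The phrase "a complex DAG of tasks" can be illustrated with a small Python sketch: each task names the tasks it depends on, and a runner executes them in dependency order, passing upstream results along. This is not the Tez API; `run_dag`, the task names, and the toy data are hypothetical, and the sketch runs tasks one at a time in a single process rather than across a cluster.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Set, Tuple

Task = Callable[[Dict[str, Any]], Any]


def run_dag(tasks: Dict[str, Task],
            deps: Dict[str, Set[str]]) -> Tuple[List[str], Dict[str, Any]]:
    """Run every task after all of its upstream tasks have finished.

    `deps` maps a task name to the names it depends on. Each task receives
    a dict of its upstream results, keyed by upstream task name.
    """
    results: Dict[str, Any] = {}
    order: List[str] = []
    for name in TopologicalSorter(deps).static_order():
        inputs = {upstream: results[upstream] for upstream in deps.get(name, set())}
        results[name] = tasks[name](inputs)
        order.append(name)
    return order, results


# A toy four-task DAG: two independent scans feed a join, which feeds an aggregate.
tasks = {
    "scan_orders": lambda inputs: [3, 5],
    "scan_users": lambda inputs: [1, 2],
    "join": lambda inputs: inputs["scan_orders"] + inputs["scan_users"],
    "aggregate": lambda inputs: sum(inputs["join"]),
}
deps = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

order, results = run_dag(tasks, deps)
print(order[-1], results["aggregate"])  # aggregate 11
```

Expressing the same pipeline as chained MapReduce jobs would require writing intermediate output to storage between each stage; a DAG runner can hand results directly from one task to the next.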
35. © Hortonworks Inc. 2014
Recap
• Apache Hadoop 2 is, at least, twice as good!
• Exciting journey with Hadoop for this decade…
– Hadoop is no longer a one-trick pony, err elephant
– Beyond just HDFS & MapReduce
• Architecture for the future
– Centralized data
– Exciting spectrum of application types, workloads and use cases
36. © Hortonworks Inc. 2014
Couple more things..
39. © Hortonworks Inc. 2014
Thank you!
Download Sandbox: Experience Apache Hadoop
Both 2.x and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/
Questions Time!
Editor's Notes
• Graph processing – Giraph, Hama
• Stream processing – Samza, Storm, Spark, DataTorrent
• MapReduce
• Tez – fast query execution
• Weave/REEF – frameworks to help with writing applications
List of some of the applications which already support YARN, in some form.
• Samza, Storm, S4 and DataTorrent are streaming frameworks
• Various types of graph processing frameworks – Giraph and Hama are graph processing systems
• There are some GitHub projects – caching systems, on-demand web-server spin up
• Weave and REEF are frameworks on top of YARN to make writing applications easier