Your SlideShare is downloading. ×
堵俊平:Hadoop virtualization extensions
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

堵俊平:Hadoop virtualization extensions


Published on

BDTC 2013 Beijing China

BDTC 2013 Beijing China

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Big Data in Cloud 堵俊平 Apache Hadoop Committer Staff Engineer, VMware
  • 2. Bio 堵俊平 (Junping Du) - Join VMware in 2008 for cloud product first - Initiate earliest effort on big data within VMware since 2010 - Automate Hadoop deployment on vSphere which becomes Open Source project – Serengeti later - Start contributing to Apache Hadoop community since 2012 - Become Apache Hadoop committer recently only 1 in +8 timezone today
  • 3. Agenda - Virtualization, SDDC and Cloud - Trends from my observation in Big Data - YARN: resource hub for Big Data Applications - YARN in the Cloud
  • 4. What is Virtualization? - @see VMware’s vSphere Guest TCP/IP Guest Monitor File System Monitor Virtual NIC Physical Hardware Scheduler Memory Manager Virtual Switch File System NIC Drivers VMkernel Virtual SCSI I/O Drivers Monitor Emulates Physical Devices: CPU, Memory, I/O CPU is controlled by scheduler and virtualized by monitor Memory is allocated by the VMkernel and virtualized by the monitor Network and I/O devices are emulated and proxied though native device drivers
  • 5. Server Virtualization Adoption on Path to 80% Over Next 5 Years % Virtualized of x86 Workloads 80% Total x86 Workloads 200 100% 180 IDC 2012 to 2016 Change = +12 pts 90% 160 Gartner 2012 to 2016 Change = +22 pts 140 80% x86 % Physical Servers Unvirtualized 70% 百万 120 40% 100 60% IDC+ VMW Estimate: Workloads1 2012 to 2016 CAGR = 21% 50% 80 60 30% 40 20% 20 0% 40% 10% 2010 2011 2012 2013 2014 2015 2016 2017 2018 0% 2009 2010 2011 2012 2013 2014 2015 2016 Source(s): IDC: Annual Virtualization Forecast, Feb-13; Gartner: x86 Server Virtualization, Worldwide, 3Q12 Update; Gartner: Forecast x86 Server Virtualization, Worldwide, 2008-2018, Jul-11; VMware estimates, Note: Server workloads only 1 Installed Base totals assume 5-year refresh
  • 6. Apps on Traditional Infrastructure Windows Linux Databases Mission Critical HPC Big Data
  • 7. Apps on Software-Defined Data Center Windows Linux Mission Critical Databases HPC Big Data Software-Defined Data Center VDC VDC VDC VDC VDC Software-Defined Data Center Services Abstract Pool Automate
  • 8. Infrastructure for Traditional Apps Traditional Applications 2016 141M 70% Infrastructure for Traditional Enterprise Apps Existing Application bound to vendor specific HW 2012 83M Hardware-based Resiliency Hardware-based QOS Hard To automate Complex to scale
  • 9. Infrastructure for New Apps Infrastructure for New/Cloud/Data Apps Application Specific Network and Storage Next Gen Cloud Applications 2016 48M 700% 2012 6M Software-based Infrastructure Transformational Economics Automation and Agility Designed For Scale
  • 10. SDDC Delivers Single Architecture for New and Existing Apps Infrastructure for New/Cloud/Data Apps Application Specific Network and Storage Any Application Infrastructure for Existing Enterprise Apps Existing Application bound to vendor specific HW Any Hardware
  • 11. Let’s back to Big Data … New Trends of Big Data from my observation - Hadoop 2.0, YARN plays as key resource hub in big data ecosystem - MapReduce is not good enough, we need faster one, like: Tez, Spark, etc. - HDFS tries to support more scenarios, i.e. cache for low-latency apps, snapshot for disaster recovery, storage tiers awareness, etc. - More Hadoop-based SQL engines: Apache Drill, Impala, Stinger, Hawq, etc. - For enterprise-ready, more efforts are spent on Security, HA, QoS, Monitor & Management
  • 12. Hadoop MapReduce v1 (Classic) • JobTracker – Manage cluster resources and job scheduling • TaskTracker – Per node agent – Manage tasks
  • 13. MapReduce v1 Limitations • Scalability – Manage cluster resources and job scheduling • SPOF (Single Point Of Failure) • JobTracker failure cause all queued and running job failure – Restart is very tricky due to complex state • Hard partition of resources into map and reduce slots – Low resource utilization • Lacks support for alternate paradigms • Lack of wire-compatible protocols
  • 14. YARN Architecture • Splits up the two major functions of JobTracker – Resource Manager (RM) - Cluster resource management – Application Master (AM) - Task scheduling and monitoring • NodeManager (NM) - A new per-node slave – launching the applications’ containers – monitoring their resource usage (cpu, memory) and reporting to the Resource Manager. • YARN maintains compatibility with existing MapReduce application and support other applications
  • 15. YARN – Hub for Big Data Applications OpenMPI Impala HBase Distributed Shell Spark MapReduce Tez Storm YARN HDFS • App-specific AM • HOYA (Hbase On YArn) – Long running services (YARN-896) • LLAMA (Low Latency Application MAster) – Gang Scheduler (YARN-624)
  • 16. YARN and Cloud • Two different prospective: – YARN-centric prospective • YARN is the key platform to apps • YARN is independent of infrastructure, running on top of Cloud shows YARN’s generality – Cloud-centric prospective • YARN is an umbrella kind of applications • Supporting YARN shows Cloud’s generality
  • 17. YARN and Cloud: YARN-centric Prospective • YARN is “OS” Big Data Apps • Infrastructure (no matter physical or cloud) is “hardware” HBase Open MPI Distributed Shell Spark … Impala MapReduce Tez Storm YARN Infrastructure Bare-metal machines Cloud Infrastructure … VMware Open Stack …
  • 18. YARN and Cloud: Cloud-centric Prospective • Cloud Infrastructure is “OS” • YARN is a group of “process” Legacy Apps Other Big Data Apps YARN Apps Open MPI D.S Spark Impala … HBase MapReduce Tez Storm … YARN Cloud Infrastructure (VMware, Open Stack, etc.)
  • 19. YARN vs. Cloud • Similarity – Target to share resources across applications – Provide Global Resource Management • YARN vs. Cloud – YARN managing resource in OS layer vs. Cloud managing resources in Hypervisor (Not comparable, but Hypervisor is more powerful than OS in isolation) – Apps managed by YARN need specific AppMaster, Apps managed by Cloud is exactly the same as running on physical machines (Cloud +1) – YARN layer is closed to big data app, better understand/estimate app’s requirement (YARN +1) – Cloud layer is closed to hardware resources, easier to track real time and global resource utilization (Cloud +1)
  • 20. YARN + Cloud • Why YARN + Cloud? – Leverage virtualization in strong isolation, fine-grained resource sharing and other benefits – Uniform infrastructure to simplify IT in enterprise • What it looks like? – Running YARN NM inside of VMs managed by Cloud Infrastructure – Build communication channel between YARN RM and Cloud Resource Manager for coordination • How we do? – First thing above is very easy and smoothly – Second things to achieve in two ways • YARN can aware/manipulate Cloud resource change • YARN provide a generic resource notification mechanism so Cloud Manager can use when resource changing
  • 21. Elastic YARN Node in the Cloud Container Add/Remove Resources? Container Other Workload Virtual YARN Node NodeManager Datanode Virtualization Host Grow/Shrink resource of a VM VMDK Grow/Shrink by tens of GB in memory?
  • 22. Elastic YARN Node in the Cloud • VM’s resource boundary can be elastic – – – – CPU is easy – time slicing (with constraints) Memory is harder – page sharing and memory ballooning In case of contention, enforce limits and proportional sharing “Stealing” resources behind apps could cause bad performance (paging) – App aware resource management could address these issues • Hadoop YARN Resource Model – Dynamic with adding/removing nodes – But static for per node • In this case, shall we enable resource elasticity on VM? – If yes, low performance when resource contention happens. – If no, low utilization as physical boxes because free resources cannot be leveraged by other busy VMs • We need better answer .
  • 23. HVE provide the answer! • Hadoop Virtualization Extensions – A project initiated from VMware to enhance Hadoop running on virtualization – A “driver” for Hadoop “OS” running on cloud “hardware” • Goal: Make Hadoop Cloud-Ready – Provide Virtualization-awareness to Hadoop, i.e. virtual topology, virtual resources, etc. – Deliver generic utility that can be leveraged by virtualized platform • Independent of virtualization platform and cloud infrastructure • 100% contribute to Apache Hadoop Community
  • 24. HVE • Philosophy – make infrastructure related components abstract – deliver different implementations that can be configured properly • E.g. BlockPlacementPolicy (Abstract) BlockPlacementPolicy BlockPlacementPolicy Default BlockPlacementPolicy For Virtualization
  • 25. Elastic YARN Node in the Cloud • In this case, shall we enable resource elasticity on VM? • Yes, and we try to get rid of resource contention – Notify YARN that node’s resource get changed – YARN RM scheduler won’t schedule new tasks on nodes get congestion – YARN scheduler preempt low priority tasks if necessary – The work is addressed in YARN-291
  • 26. Implementation – YARN-291 (umbrella) • YARN-312 • YARN-311 – Core scheduler changes – AdminProtocol changes • REST API, JMX, etc. • YARN-313 • CLI Resource Manager Scheduler UpdateNodeResource() AdminService Cluster Resource Admin CLI yarn rmadmin -updateNodeResource <NodeId> <Resource> SchedulerNode RMContext RMNode Resource Tracker Service Heartbeat Node Manager Cloud Resource Manager
  • 27. Welcome contribution to Apache Hadoop! • Hadoop is the key platform – For architecting Big Data – Contribute a bit can change the world! • Open source project is a great platform – For people to share great ideas, works from different organizations – Community is a great work place • Companies and persons get credit – From work and resources they are putting – Also easy to build a ecosystem and show expertise • So many challenges in Big Data, like building Babel – Open source is the common language to make sure we can work together
  • 28. Key messages in today’s talk • SDDC and Cloud are the future for architecting enterprise IT • New trends in big data: YARN plays as a “OS” for big data apps • In VMware, we tries to support any “OS”, include “YARN” • HVE plays as “driver” to enable Hadoop on virtualization/cloud • Contribute to Apache Hadoop
  • 29. Reference • YARN MapReduce 2.0 – • HVE topology extension – • HVE topology extension for YARN – • HVE elastic resource configuration – • Gang Scheduling – • Long-lived services in YARN –