Junping Du (堵俊平): Hadoop Virtualization Extensions



BDTC 2013 Beijing China




    Presentation Transcript

    • Big Data in Cloud
      Junping Du (堵俊平), Apache Hadoop Committer, Staff Engineer, VMware
    • Bio: Junping Du (堵俊平)
      - Joined VMware in 2008, working first on cloud products
      - Initiated the earliest big data effort within VMware in 2010
      - Automated Hadoop deployment on vSphere, which later became the open source project Serengeti
      - Has been contributing to the Apache Hadoop community since 2012
      - Recently became an Apache Hadoop committer, currently the only one in the UTC+8 time zone
    • Agenda
      - Virtualization, SDDC and Cloud
      - Trends in Big Data from my observation
      - YARN: resource hub for Big Data applications
      - YARN in the Cloud
    • What is Virtualization? (see VMware’s vSphere)
      [Diagram labels: Guest, TCP/IP, File System, Monitor, Virtual NIC, Virtual SCSI, VMkernel, Scheduler, Memory Manager, Virtual Switch, File System, NIC Drivers, I/O Drivers, Physical Hardware]
      - The monitor emulates physical devices: CPU, memory, I/O
      - CPU is controlled by the scheduler and virtualized by the monitor
      - Memory is allocated by the VMkernel and virtualized by the monitor
      - Network and I/O devices are emulated and proxied through native device drivers
    • Server Virtualization Adoption on Path to 80% Over Next 5 Years
      [Charts: "% Virtualized of x86 Workloads" and "Total x86 Workloads" (in millions). Annotations: IDC 2012 to 2016 change = +12 pts; Gartner 2012 to 2016 change = +22 pts; IDC + VMware estimate of workloads, 2012 to 2016 CAGR = 21%]
      Source(s): IDC: Annual Virtualization Forecast, Feb-13; Gartner: x86 Server Virtualization, Worldwide, 3Q12 Update; Gartner: Forecast x86 Server Virtualization, Worldwide, 2008-2018, Jul-11; VMware estimates. Note: server workloads only; installed base totals assume a 5-year refresh.
    • Apps on Traditional Infrastructure: Windows, Linux, Databases, Mission Critical, HPC, Big Data
    • Apps on Software-Defined Data Center: Windows, Linux, Mission Critical, Databases, HPC, Big Data
      [Diagram: the apps run in VDCs on top of the Software-Defined Data Center. Software-Defined Data Center services: Abstract, Pool, Automate]
    • Infrastructure for Traditional Apps (traditional enterprise apps)
      - Traditional applications: 83M in 2012, 141M in 2016 (70%)
      - Existing applications bound to vendor-specific hardware
      - Hardware-based resiliency
      - Hardware-based QoS
      - Hard to automate
      - Complex to scale
    • Infrastructure for New Apps (new/cloud/data apps)
      - Next-gen cloud applications: 6M in 2012, 48M in 2016 (700%)
      - Application-specific network and storage
      - Software-based infrastructure
      - Transformational economics
      - Automation and agility
      - Designed for scale
    • SDDC Delivers Single Architecture for New and Existing Apps
      - Infrastructure for new/cloud/data apps: application-specific network and storage
      - Infrastructure for existing enterprise apps: existing applications bound to vendor-specific hardware
      - SDDC: any application, any hardware
    • Back to Big Data: new trends from my observation
      - Hadoop 2.0: YARN acts as the key resource hub in the big data ecosystem
      - MapReduce is not fast enough; we need faster engines such as Tez and Spark
      - HDFS is growing to support more scenarios, e.g. caching for low-latency apps, snapshots for disaster recovery, storage-tier awareness
      - More Hadoop-based SQL engines: Apache Drill, Impala, Stinger, Hawq, etc.
      - For enterprise readiness, more effort is being spent on security, HA, QoS, monitoring and management
    • Hadoop MapReduce v1 (Classic)
      • JobTracker
        – Manages cluster resources and job scheduling
      • TaskTracker
        – Per-node agent
        – Manages tasks
    • MapReduce v1 Limitations
      • Scalability
        – A single JobTracker manages both cluster resources and job scheduling
      • SPOF (Single Point Of Failure)
        – A JobTracker failure causes all queued and running jobs to fail
        – Restart is very tricky due to complex state
      • Hard partition of resources into map and reduce slots
        – Low resource utilization
      • Lacks support for alternate paradigms
      • Lacks wire-compatible protocols
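The low-utilization point about hard slot partitioning can be made concrete with a toy model. The sketch below is only an illustration: `SlotNode`, its methods, and the slot counts are hypothetical and not part of Hadoop. It assumes a node whose capacity is split into fixed map and reduce slots, and shows that during a map-heavy phase the reduce slots sit idle even though work is waiting.

```python
from dataclasses import dataclass


@dataclass
class SlotNode:
    """Hypothetical MRv1-style node with a hard split between slot types."""
    map_slots: int
    reduce_slots: int

    def runnable(self, pending_maps: int, pending_reduces: int) -> int:
        # Map tasks may only use map slots, reduce tasks only reduce slots.
        return (min(pending_maps, self.map_slots)
                + min(pending_reduces, self.reduce_slots))

    def utilization(self, pending_maps: int, pending_reduces: int) -> float:
        total = self.map_slots + self.reduce_slots
        return self.runnable(pending_maps, pending_reduces) / total


# Invented numbers: 8 map slots, 4 reduce slots.
node = SlotNode(map_slots=8, reduce_slots=4)

# Map-heavy phase: 20 map tasks are waiting, no reduce work exists yet.
map_phase = node.utilization(pending_maps=20, pending_reduces=0)
print(f"map phase utilization: {map_phase:.2f}")
```

With these numbers only 8 of the 12 slots can be used while 12 map tasks keep waiting, which is the kind of waste a single pool of generic resources is meant to avoid.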
    • YARN Architecture
      • Splits up the two major functions of the JobTracker
        – Resource Manager (RM): cluster resource management
        – Application Master (AM): task scheduling and monitoring
      • NodeManager (NM): a new per-node slave
        – Launches the applications’ containers
        – Monitors their resource usage (CPU, memory) and reports to the Resource Manager
      • YARN maintains compatibility with existing MapReduce applications and supports other applications
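The division of responsibilities described on this slide can be sketched as three small classes. This is a deliberately simplified toy, not the YARN API: the class and method names are hypothetical, memory is the only resource, and granting and launching a container are merged into one step for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: launches containers and tracks memory in use."""
    name: str
    capacity_mb: int
    used_mb: int = 0

    def launch(self, mb: int) -> bool:
        if self.used_mb + mb > self.capacity_mb:
            return False
        self.used_mb += mb
        return True


@dataclass
class ResourceManager:
    """Cluster-wide resource management only; it knows nothing about tasks."""
    nodes: list

    def allocate(self, mb: int):
        # First-fit placement, chosen only to keep the sketch short.
        for nm in self.nodes:
            if nm.launch(mb):
                return nm.name
        return None


@dataclass
class ApplicationMaster:
    """Per-application task scheduling: decides what to ask the RM for."""
    rm: ResourceManager
    placements: dict = field(default_factory=dict)

    def run_tasks(self, tasks: dict) -> None:
        for task, mb in tasks.items():
            self.placements[task] = self.rm.allocate(mb)


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
am = ApplicationMaster(rm)
am.run_tasks({"map-0": 2048, "map-1": 2048, "reduce-0": 3072})
print(am.placements)
```

The point of the split is visible in the types: the `ResourceManager` only hands out capacity, while all knowledge of tasks lives in the per-application `ApplicationMaster`, so a different kind of application only needs a different AM.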
    • YARN – Hub for Big Data Applications
      [Diagram: OpenMPI, Impala, HBase, Distributed Shell, Spark, MapReduce, Tez, Storm on top of YARN, on top of HDFS]
      • App-specific AM
      • HOYA (HBase On YArn)
        – Long-running services (YARN-896)
      • LLAMA (Low Latency Application MAster)
        – Gang scheduler (YARN-624)
    • YARN and Cloud
      • Two different perspectives:
        – YARN-centric perspective
          • YARN is the key platform for apps
          • YARN is independent of infrastructure; running on top of Cloud shows YARN’s generality
        – Cloud-centric perspective
          • YARN is an umbrella for one kind of application
          • Supporting YARN shows Cloud’s generality
    • YARN and Cloud: YARN-centric Perspective
      • YARN is the “OS” for Big Data apps
      • Infrastructure (whether physical or cloud) is the “hardware”
      [Diagram: HBase, Open MPI, Distributed Shell, Spark, Impala, MapReduce, Tez, Storm, … on top of YARN; YARN on top of infrastructure: bare-metal machines, … or cloud infrastructure (VMware, OpenStack, …)]
    • YARN and Cloud: Cloud-centric Perspective
      • Cloud infrastructure is the “OS”
      • YARN is a group of “processes”
      [Diagram: legacy apps, other Big Data apps, and YARN apps (Open MPI, Distributed Shell, Spark, Impala, HBase, MapReduce, Tez, Storm, … on YARN) all on top of cloud infrastructure (VMware, OpenStack, etc.)]
    • YARN vs. Cloud
      • Similarity
        – Both aim to share resources across applications
        – Both provide global resource management
      • Differences
        – YARN manages resources at the OS layer, while Cloud manages resources at the hypervisor layer (not directly comparable, but a hypervisor provides stronger isolation than an OS)
        – Apps managed by YARN need a specific AppMaster, while apps managed by Cloud run exactly as they do on physical machines (Cloud +1)
        – The YARN layer is close to the big data app, so it can better understand and estimate the app’s requirements (YARN +1)
        – The Cloud layer is close to the hardware resources, so it can more easily track real-time, global resource utilization (Cloud +1)
    • YARN + Cloud
      • Why YARN + Cloud?
        – Leverage virtualization for strong isolation, fine-grained resource sharing and other benefits
        – Uniform infrastructure simplifies enterprise IT
      • What does it look like?
        – YARN NMs run inside VMs managed by the Cloud infrastructure
        – A communication channel between the YARN RM and the Cloud Resource Manager provides coordination
      • How do we do it?
        – The first item above is easy and works smoothly
        – The second can be achieved in two ways
          • YARN becomes aware of, or manipulates, Cloud resource changes
          • YARN provides a generic resource notification mechanism that the Cloud manager can use when resources change
    • Elastic YARN Node in the Cloud
      [Diagram: a virtual YARN node (NodeManager, Datanode, containers, VMDK) shares a virtualization host with other workloads. Questions posed: Add/remove resources? Grow/shrink the resources of a VM? Grow/shrink by tens of GB in memory?]
    • Elastic YARN Node in the Cloud
      • A VM’s resource boundary can be elastic
        – CPU is easy: time slicing (with constraints)
        – Memory is harder: page sharing and memory ballooning
        – In case of contention, enforce limits and proportional sharing
        – “Stealing” resources behind the apps’ backs can cause bad performance (paging)
        – App-aware resource management could address these issues
      • Hadoop YARN resource model
        – Dynamic at the cluster level: nodes can be added and removed
        – But static per node
      • In this case, shall we enable resource elasticity on the VM?
        – If yes, performance is low when resource contention happens
        – If no, utilization is as low as on physical boxes, because free resources cannot be used by other busy VMs
      • We need a better answer.
    • HVE provides the answer!
      • Hadoop Virtualization Extensions
        – A project initiated by VMware to enhance Hadoop running on virtualization
        – A “driver” for the Hadoop “OS” running on cloud “hardware”
      • Goal: make Hadoop cloud-ready
        – Provide virtualization awareness to Hadoop, e.g. virtual topology, virtual resources, etc.
        – Deliver generic utilities that can be leveraged by virtualized platforms
      • Independent of virtualization platform and cloud infrastructure
      • 100% contributed to the Apache Hadoop community
    • HVE
      • Philosophy
        – Make infrastructure-related components abstract
        – Deliver different implementations that can be selected through configuration
      • E.g. BlockPlacementPolicy
        [Diagram: BlockPlacementPolicy (abstract) with two implementations, "BlockPlacementPolicy Default" and "BlockPlacementPolicy For Virtualization"]
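The "abstract component, configurable implementations" philosophy can be illustrated with a small sketch. Everything here is hypothetical: the class names, the `placement.policy` key, and the selection logic are stand-ins and not Hadoop's actual classes or configuration. The sketch assumes that a virtualization-aware policy should avoid placing two replicas on VMs that share a physical host, since one host failure would then lose both.

```python
from abc import ABC, abstractmethod


class PlacementPolicy(ABC):
    """Abstract, infrastructure-neutral interface (hypothetical)."""

    @abstractmethod
    def choose_targets(self, nodes: dict, replicas: int) -> list:
        """nodes maps a node name to the physical host it runs on."""


class DefaultPlacement(PlacementPolicy):
    def choose_targets(self, nodes: dict, replicas: int) -> list:
        # Topology-unaware: simply take the first candidates.
        return list(nodes)[:replicas]


class VirtualizationAwarePlacement(PlacementPolicy):
    def choose_targets(self, nodes: dict, replicas: int) -> list:
        # Use at most one VM per physical host.
        chosen, used_hosts = [], set()
        for name, host in nodes.items():
            if host not in used_hosts:
                chosen.append(name)
                used_hosts.add(host)
            if len(chosen) == replicas:
                break
        return chosen


# The implementation is picked from configuration, not hard-coded.
POLICIES = {
    "default": DefaultPlacement,
    "virtualization": VirtualizationAwarePlacement,
}


def load_policy(conf: dict) -> PlacementPolicy:
    return POLICIES[conf.get("placement.policy", "default")]()


nodes = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB", "vm4": "hostC"}
plain = load_policy({}).choose_targets(nodes, 3)
aware = load_policy({"placement.policy": "virtualization"}).choose_targets(nodes, 3)
print(plain, aware)
```

Callers depend only on the abstract interface, so switching between bare-metal and virtualized deployments is a configuration change rather than a code change.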
    • Elastic YARN Node in the Cloud
      • In this case, shall we enable resource elasticity on the VM?
      • Yes, and we try to get rid of resource contention
        – Notify YARN that the node’s resources have changed
        – The YARN RM scheduler won’t schedule new tasks on congested nodes
        – The YARN scheduler preempts low-priority tasks if necessary
        – The work is tracked in YARN-291
    • Implementation
      – YARN-291 (umbrella)
        • YARN-311: core scheduler changes
        • YARN-312: AdminProtocol changes
        • YARN-313: CLI
        • REST API, JMX, etc.
      – Admin CLI: yarn rmadmin -updateNodeResource <NodeId> <Resource>
      [Diagram components: Cloud Resource Manager, Admin CLI, AdminService, UpdateNodeResource(), Resource Manager, Scheduler, SchedulerNode, Cluster Resource, RMContext, RMNode, Resource Tracker Service, Heartbeat, Node Manager]
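The behavior described for elastic nodes (notify the scheduler of a node's new capacity, stop placing work on a congested node, preempt low-priority tasks if necessary) can be sketched as follows. This is a toy model under stated assumptions, not the YARN-291 implementation: the classes, the `update_node_resource` method, the memory-only resource model, and the convention that a larger number means higher priority are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    mb: int
    priority: int  # assumption: larger number = higher priority


@dataclass
class SchedulerNode:
    node_id: str
    total_mb: int
    tasks: list = field(default_factory=list)

    @property
    def available_mb(self) -> int:
        return self.total_mb - sum(t.mb for t in self.tasks)


class Scheduler:
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}

    def schedule(self, task: Task):
        # A node without enough free memory is skipped.
        for node in self.nodes.values():
            if node.available_mb >= task.mb:
                node.tasks.append(task)
                return node.node_id
        return None

    def update_node_resource(self, node_id: str, total_mb: int) -> list:
        """Apply a capacity change pushed by the cloud layer."""
        node = self.nodes[node_id]
        node.total_mb = total_mb
        preempted = []
        # If the node shrank below current usage, preempt lowest priority first.
        for task in sorted(node.tasks, key=lambda t: t.priority):
            if node.available_mb >= 0:
                break
            node.tasks.remove(task)
            preempted.append(task.name)
        return preempted


sched = Scheduler([SchedulerNode("vm1", 8192), SchedulerNode("vm2", 8192)])
sched.schedule(Task("etl", 4096, priority=1))
sched.schedule(Task("report", 3072, priority=5))

# The cloud manager shrinks vm1 from 8 GB to 4 GB.
preempted = sched.update_node_resource("vm1", 4096)
placed = sched.schedule(Task("adhoc", 2048, priority=3))
print(preempted, placed)
```

In this run the low-priority task is preempted to fit the smaller node, and the next request goes to the other node because the shrunken one no longer has room.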
    • Contributions to Apache Hadoop are welcome!
      • Hadoop is the key platform
        – For architecting Big Data
        – Contributing a little can change the world!
      • An open source project is a great platform
        – For people from different organizations to share great ideas and work
        – The community is a great workplace
      • Companies and individuals get credit
        – For the work and resources they put in
        – It is also easy to build an ecosystem and show expertise
      • There are many challenges in Big Data, like building Babel
        – Open source is the common language that ensures we can work together
    • Key messages in today’s talk
      • SDDC and Cloud are the future for architecting enterprise IT
      • New trend in big data: YARN acts as an “OS” for big data apps
      • At VMware, we try to support any “OS”, including “YARN”
      • HVE acts as a “driver” that enables Hadoop on virtualization/cloud
      • Contribute to Apache Hadoop
    • Reference
      • YARN MapReduce 2.0 – https://issues.apache.org/jira/browse/MAPREDUCE-279
      • HVE topology extension – https://issues.apache.org/jira/browse/HADOOP-8468
      • HVE topology extension for YARN – https://issues.apache.org/jira/browse/YARN-18
      • HVE elastic resource configuration – https://issues.apache.org/jira/browse/YARN-291
      • Gang Scheduling – https://issues.apache.org/jira/browse/YARN-624
      • Long-lived services in YARN – https://issues.apache.org/jira/browse/YARN-896