Big Data in Cloud
堵俊平
Apache Hadoop Committer
Staff Engineer, VMware
堵俊平:Hadoop virtualization extensions
Published on

BDTC 2013 Beijing China

Published in: Technology
  1. Big Data in Cloud
     堵俊平 (Junping Du), Apache Hadoop Committer, Staff Engineer, VMware
  2. Bio: 堵俊平 (Junping Du)
     - Joined VMware in 2008, working first on cloud products
     - Initiated the earliest big data effort within VMware in 2010
     - Automated Hadoop deployment on vSphere, which later became the open source project Serengeti
     - Has contributed to the Apache Hadoop community since 2012
     - Recently became an Apache Hadoop committer, currently the only one in the UTC+8 time zone
  3. Agenda
     - Virtualization, SDDC and Cloud
     - Trends I have observed in Big Data
     - YARN: resource hub for Big Data applications
     - YARN in the Cloud
  4. What is Virtualization? (see VMware's vSphere)
     [Diagram labels: Guest, TCP/IP, File System, Monitor, Virtual NIC, Virtual SCSI, VMkernel, Scheduler, Memory Manager, Virtual Switch, File System, NIC Drivers, I/O Drivers, Physical Hardware]
     - The monitor emulates physical devices: CPU, memory, I/O
     - CPU is controlled by the scheduler and virtualized by the monitor
     - Memory is allocated by the VMkernel and virtualized by the monitor
     - Network and I/O devices are emulated and proxied through native device drivers
  5. Server Virtualization Adoption on Path to 80% Over Next 5 Years
     [Charts: "% Virtualized of x86 Workloads" (heading toward 80%) and "Total x86 Workloads" (in millions), by year; axis tick values omitted]
     - IDC, 2012 to 2016: change = +12 pts
     - Gartner, 2012 to 2016: change = +22 pts
     - IDC + VMware estimate of workloads (1), 2012 to 2016: CAGR = 21%
     - Series label: x86 % physical servers unvirtualized
     Source(s): IDC: Annual Virtualization Forecast, Feb-13; Gartner: x86 Server Virtualization, Worldwide, 3Q12 Update; Gartner: Forecast x86 Server Virtualization, Worldwide, 2008-2018, Jul-11; VMware estimates
     Note: Server workloads only. (1) Installed base totals assume 5-year refresh.
  6. Apps on Traditional Infrastructure
     [Diagram labels: Windows, Linux, Databases, Mission Critical, HPC, Big Data]
  7. Apps on Software-Defined Data Center
     [Diagram labels: Windows, Linux, Mission Critical, Databases, HPC, Big Data, each on a VDC; below them, the Software-Defined Data Center]
     - Software-Defined Data Center services: Abstract, Pool, Automate
  8. Infrastructure for Traditional Apps
     - Traditional applications: 83M in 2012, 141M in 2016 (70% growth)
     - Infrastructure for traditional enterprise apps: existing applications bound to vendor-specific hardware
     - Hardware-based resiliency
     - Hardware-based QoS
     - Hard to automate
     - Complex to scale
  9. Infrastructure for New Apps
     - Next-generation cloud applications: 6M in 2012, 48M in 2016 (700% growth)
     - Infrastructure for new/cloud/data apps: application-specific network and storage
     - Software-based infrastructure
     - Transformational economics
     - Automation and agility
     - Designed for scale
  10. SDDC Delivers Single Architecture for New and Existing Apps
     - Infrastructure for new/cloud/data apps: application-specific network and storage
     - Infrastructure for existing enterprise apps: existing applications bound to vendor-specific hardware
     - Any application, any hardware
  11. Let's get back to Big Data: new trends I have observed
     - Hadoop 2.0: YARN acts as the key resource hub in the big data ecosystem
     - MapReduce is not fast enough; we need faster engines such as Tez and Spark
     - HDFS is extending to more scenarios, e.g. caching for low-latency apps, snapshots for disaster recovery, and storage tier awareness
     - More Hadoop-based SQL engines: Apache Drill, Impala, Stinger, Hawq, etc.
     - To be enterprise-ready, more effort is going into security, HA, QoS, and monitoring and management
  12. Hadoop MapReduce v1 (Classic)
     • JobTracker
       – Manages cluster resources and job scheduling
     • TaskTracker
       – Per-node agent
       – Manages tasks
  13. MapReduce v1 Limitations
     • Scalability
       – A single JobTracker manages both cluster resources and job scheduling
     • SPOF (Single Point Of Failure)
       – A JobTracker failure causes all queued and running jobs to fail
       – Restart is very tricky due to complex state
     • Hard partition of resources into map and reduce slots
       – Low resource utilization
     • Lacks support for alternate paradigms
     • Lacks wire-compatible protocols
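The low-utilization point about hard-partitioned slots can be illustrated with a little arithmetic. The following is a minimal sketch, not Hadoop code: the function names and the slot counts are made up for illustration, and it only compares a fixed map/reduce split against a single shared pool of the same size.

```python
def partitioned_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of slots in use when map and reduce slots are not interchangeable."""
    busy = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_slots, pending_maps, pending_reduces):
    """Fraction of slots in use when any slot can run any task."""
    busy = min(total_slots, pending_maps + pending_reduces)
    return busy / total_slots


# Hypothetical node: 8 map slots and 4 reduce slots, during a map-heavy phase.
static = partitioned_utilization(8, 4, pending_maps=20, pending_reduces=0)
pooled = pooled_utilization(12, pending_maps=20, pending_reduces=0)
print(f"partitioned: {static:.0%}, pooled: {pooled:.0%}")
```

With these made-up numbers the four reduce slots sit idle while map tasks queue, which is the situation a generic container model is meant to avoid.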
  14. YARN Architecture
     • Splits up the two major functions of the JobTracker
       – Resource Manager (RM): cluster resource management
       – Application Master (AM): task scheduling and monitoring
     • NodeManager (NM): a new per-node slave
       – Launches the applications' containers
       – Monitors their resource usage (CPU, memory) and reports it to the Resource Manager
     • YARN maintains compatibility with existing MapReduce applications and supports other applications
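The division of labor between RM, AM and NM can be sketched as a toy model. This is an illustrative simplification, not the YARN API: the class and method names are hypothetical, only memory is tracked, and heartbeats, queues and container launch are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class NodeManager:
    """Per-node agent: tracks how much memory its containers use."""
    node_id: str
    memory_mb: int
    used_mb: int = 0

    def free_mb(self) -> int:
        return self.memory_mb - self.used_mb


@dataclass
class ResourceManager:
    """Cluster-wide resource management only; it knows nothing about tasks."""
    nodes: Dict[str, NodeManager] = field(default_factory=dict)

    def register(self, nm: NodeManager) -> None:
        self.nodes[nm.node_id] = nm

    def allocate(self, mb: int) -> Optional[Tuple[str, int]]:
        """Reserve a container on the first node with enough free memory."""
        for nm in self.nodes.values():
            if nm.free_mb() >= mb:
                nm.used_mb += mb
                return (nm.node_id, mb)
        return None


class ApplicationMaster:
    """Per-application scheduling: decides what to ask the RM for."""

    def __init__(self, rm: ResourceManager, task_mb: int, num_tasks: int):
        self.rm, self.task_mb, self.num_tasks = rm, task_mb, num_tasks

    def request_containers(self) -> List[Tuple[str, int]]:
        granted = []
        for _ in range(self.num_tasks):
            container = self.rm.allocate(self.task_mb)
            if container is None:  # cluster is full; stop asking for now
                break
            granted.append(container)
        return granted


rm = ResourceManager()
rm.register(NodeManager("node1", memory_mb=2048))
rm.register(NodeManager("node2", memory_mb=1024))
granted = ApplicationMaster(rm, task_mb=1024, num_tasks=4).request_containers()
print(granted)
```

In this toy cluster the AM asks for four 1024 MB containers but only three fit, so the fourth request is left for a later round.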
  15. YARN – Hub for Big Data Applications
     [Diagram: OpenMPI, Impala, HBase, Distributed Shell, Spark, MapReduce, Tez and Storm on top of YARN, on top of HDFS]
     • App-specific AM
     • HOYA (HBase On YArn)
       – Long-running services (YARN-896)
     • LLAMA (Low Latency Application MAster)
       – Gang scheduler (YARN-624)
  16. YARN and Cloud
     • Two different perspectives:
       – YARN-centric perspective
         • YARN is the key platform for apps
         • YARN is independent of infrastructure; running on top of the Cloud shows YARN's generality
       – Cloud-centric perspective
         • YARN is an umbrella for one kind of application
         • Supporting YARN shows the Cloud's generality
  17. YARN and Cloud: YARN-centric Perspective
     • YARN is the "OS"
     • Infrastructure (whether physical or cloud) is the "hardware"
     [Diagram: Big Data Apps (HBase, Open MPI, Distributed Shell, Spark, Impala, MapReduce, Tez, Storm, …) on YARN, on Infrastructure: bare-metal machines or cloud infrastructure (VMware, OpenStack, …)]
  18. YARN and Cloud: Cloud-centric Perspective
     • Cloud infrastructure is the "OS"
     • YARN is a group of "processes"
     [Diagram: Legacy Apps, Other Big Data Apps, and YARN Apps (Open MPI, Distributed Shell, Spark, Impala, HBase, MapReduce, Tez, Storm, … on YARN), all on Cloud Infrastructure (VMware, OpenStack, etc.)]
  19. YARN vs. Cloud
     • Similarity
       – Both aim to share resources across applications
       – Both provide global resource management
     • YARN vs. Cloud
       – YARN manages resources in the OS layer; Cloud manages resources in the hypervisor (not directly comparable, but the hypervisor provides stronger isolation than the OS)
       – Apps managed by YARN need a specific AppMaster; apps managed by Cloud run exactly as they would on physical machines (Cloud +1)
       – The YARN layer is close to the big data app, so it can better understand and estimate the app's requirements (YARN +1)
       – The Cloud layer is close to the hardware resources, so it is easier to track real-time and global resource utilization (Cloud +1)
  20. YARN + Cloud
     • Why YARN + Cloud?
       – Leverage virtualization for strong isolation, fine-grained resource sharing and other benefits
       – Uniform infrastructure to simplify enterprise IT
     • What does it look like?
       – Run the YARN NM inside VMs managed by the cloud infrastructure
       – Build a communication channel between the YARN RM and the Cloud Resource Manager for coordination
     • How do we do it?
       – The first item above is easy and works smoothly
       – The second can be achieved in two ways
         • YARN becomes aware of, and can manipulate, cloud resource changes
         • YARN provides a generic resource notification mechanism that the Cloud Manager can use when resources change
  21. Elastic YARN Node in the Cloud
     [Diagram: a virtual YARN node (NodeManager, Datanode, containers, VMDK) next to other workloads on a virtualization host]
     - Add/remove resources?
     - Grow/shrink the resources of a VM
     - Grow/shrink by tens of GB of memory?
  22. Elastic YARN Node in the Cloud
     • A VM's resource boundary can be elastic
       – CPU is easy: time slicing (with constraints)
       – Memory is harder: page sharing and memory ballooning
       – In case of contention, enforce limits and proportional sharing
       – "Stealing" resources behind the apps' backs can cause bad performance (paging)
       – App-aware resource management could address these issues
     • Hadoop YARN resource model
       – Dynamic at the cluster level: nodes can be added and removed
       – But static per node
     • In this case, shall we enable resource elasticity on the VM?
       – If yes, performance is low when resource contention happens
       – If no, utilization is as low as on physical boxes, because free resources cannot be leveraged by other busy VMs
     • We need a better answer.
  23. HVE provides the answer!
     • Hadoop Virtualization Extensions
       – A project initiated by VMware to enhance Hadoop running on virtualization
       – A "driver" for the Hadoop "OS" running on cloud "hardware"
     • Goal: make Hadoop cloud-ready
       – Provide virtualization awareness to Hadoop, e.g. virtual topology, virtual resources, etc.
       – Deliver generic utilities that can be leveraged by virtualized platforms
     • Independent of virtualization platform and cloud infrastructure
     • 100% contributed to the Apache Hadoop community
  24. HVE
     • Philosophy
       – Make infrastructure-related components abstract
       – Deliver different implementations that can be selected through configuration
     • E.g. BlockPlacementPolicy
       [Diagram: BlockPlacementPolicy (abstract), implemented by BlockPlacementPolicyDefault and by a BlockPlacementPolicy for virtualization]
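The "abstract component plus configurable implementations" idea on this slide can be sketched as follows. This is an illustration of the pattern only, not HDFS code: the class names, the node dictionaries and the placement rules are simplified assumptions (the naive policy here just takes the first N nodes, and the virtualization-aware one merely avoids putting two replicas on VMs that share a physical host).

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PlacementPolicy(ABC):
    """Abstract policy: the rest of the system only depends on this interface."""

    @abstractmethod
    def choose_targets(self, nodes: List[Dict[str, str]], replicas: int) -> List[str]:
        ...


class NaivePolicy(PlacementPolicy):
    """Stand-in for a default policy: ignores which physical host a VM runs on."""

    def choose_targets(self, nodes, replicas):
        return [n["name"] for n in nodes[:replicas]]


class VirtualizationAwarePolicy(PlacementPolicy):
    """Never places two replicas on VMs that share the same physical host."""

    def choose_targets(self, nodes, replicas):
        chosen, used_hosts = [], set()
        for n in nodes:
            if len(chosen) == replicas:
                break
            if n["host"] not in used_hosts:
                chosen.append(n["name"])
                used_hosts.add(n["host"])
        return chosen


# The implementation is picked by a configuration value, not hard-coded.
POLICIES = {"naive": NaivePolicy, "virtualization": VirtualizationAwarePolicy}


def load_policy(configured_name: str) -> PlacementPolicy:
    return POLICIES[configured_name]()


vms = [
    {"name": "vm1", "host": "hostA"},
    {"name": "vm2", "host": "hostA"},
    {"name": "vm3", "host": "hostB"},
    {"name": "vm4", "host": "hostC"},
]
naive_targets = load_policy("naive").choose_targets(vms, 3)
aware_targets = load_policy("virtualization").choose_targets(vms, 3)
print(naive_targets, aware_targets)
```

The naive choice puts two replicas on hostA, so one host failure loses two copies; the virtualization-aware choice spreads them over three hosts without any caller having to change.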
  25. Elastic YARN Node in the Cloud
     • In this case, shall we enable resource elasticity on the VM?
     • Yes, and we try to get rid of resource contention
       – Notify YARN when a node's resources change
       – The YARN RM scheduler won't schedule new tasks on congested nodes
       – The YARN scheduler preempts low-priority tasks if necessary
       – This work is tracked in YARN-291
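The three behaviors listed on this slide (notify on change, skip congested nodes, preempt low-priority tasks) can be sketched in a few lines. This is a toy model under stated assumptions, not the YARN-291 implementation: the class and method names are hypothetical, only memory is modeled, and a larger priority number means a more important task.

```python
class Node:
    def __init__(self, node_id, capacity_mb):
        self.node_id = node_id
        self.capacity_mb = capacity_mb
        self.tasks = []  # list of (priority, mb) tuples

    @property
    def used_mb(self):
        return sum(mb for _, mb in self.tasks)

    @property
    def free_mb(self):
        return self.capacity_mb - self.used_mb


class Scheduler:
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}

    def schedule(self, priority, mb):
        """Place a task on the first node with room; congested nodes are skipped."""
        for node in self.nodes.values():
            if node.free_mb >= mb:
                node.tasks.append((priority, mb))
                return node.node_id
        return None

    def update_node_resource(self, node_id, new_capacity_mb):
        """Notification hook: the cloud layer grew or shrank this node.

        If the node is now over-committed, preempt lowest-priority tasks
        until the remaining tasks fit. Returns the preempted tasks.
        """
        node = self.nodes[node_id]
        node.capacity_mb = new_capacity_mb
        node.tasks.sort()  # lowest priority first
        preempted = []
        while node.used_mb > node.capacity_mb:
            preempted.append(node.tasks.pop(0))
        return preempted


sched = Scheduler([Node("vm1", 4096), Node("vm2", 4096)])
first = sched.schedule(priority=1, mb=2048)   # low-priority task
second = sched.schedule(priority=5, mb=2048)  # high-priority task
preempted = sched.update_node_resource("vm1", 2048)  # the VM shrinks
third = sched.schedule(priority=3, mb=1024)   # vm1 is full, so this goes elsewhere
print(first, second, preempted, third)
```

Growing a node works through the same hook: raising the capacity preempts nothing and simply makes the node eligible for new tasks again.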
  26. Implementation – YARN-291 (umbrella)
     • YARN-311
       – Core scheduler changes
     • YARN-312
       – AdminProtocol changes
     • YARN-313
       – CLI
     • REST API, JMX, etc.
     [Diagram labels: Admin CLI, AdminService, Resource Manager, Scheduler, UpdateNodeResource(), Cluster Resource, SchedulerNode, RMContext, RMNode, Resource Tracker Service, Heartbeat, Node Manager, Cloud Resource Manager]
     Admin CLI: yarn rmadmin -updateNodeResource <NodeId> <Resource>
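A cloud-side manager could drive the admin CLI shown on this slide by assembling the command line programmatically. The sketch below only builds the argument list and does not run anything. The helper name, the node address and the resource values are hypothetical, and how `<Resource>` is spelled on the command line (here passed through as opaque strings, for example a memory size and a vcore count) is an assumption that should be checked against the Hadoop release in use.

```python
import shlex
from typing import List


def build_update_command(node_id: str, resource_args: List[str]) -> List[str]:
    """Assemble `yarn rmadmin -updateNodeResource <NodeId> <Resource>` as argv.

    `resource_args` is passed through unchanged; this helper does not know
    or validate the resource format expected by a given Hadoop version.
    """
    if not node_id:
        raise ValueError("node_id must not be empty")
    return ["yarn", "rmadmin", "-updateNodeResource", node_id, *resource_args]


# Hypothetical node address and resource values, for illustration only.
cmd = build_update_command("vm1.example.com:8041", ["4096", "4"])
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later handed to `subprocess.run`.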
  27. Contributions to Apache Hadoop are welcome!
     • Hadoop is the key platform
       – For architecting Big Data
       – Even a small contribution can change the world!
     • An open source project is a great platform
       – For people from different organizations to share great ideas and work
       – The community is a great workplace
     • Companies and individuals get credit
       – For the work and resources they put in
       – It is also an easy way to build an ecosystem and show expertise
     • Big Data has so many challenges, like building Babel
       – Open source is the common language that makes sure we can work together
  28. Key messages in today's talk
     • SDDC and Cloud are the future for architecting enterprise IT
     • New trend in big data: YARN acts as an "OS" for big data apps
     • At VMware, we try to support any "OS", including YARN
     • HVE acts as a "driver" to enable Hadoop on virtualization/cloud
     • Contribute to Apache Hadoop
  29. Reference
     • YARN MapReduce 2.0
       – https://issues.apache.org/jira/browse/MAPREDUCE-279
     • HVE topology extension
       – https://issues.apache.org/jira/browse/HADOOP-8468
     • HVE topology extension for YARN
       – https://issues.apache.org/jira/browse/YARN-18
     • HVE elastic resource configuration
       – https://issues.apache.org/jira/browse/YARN-291
     • Gang scheduling
       – https://issues.apache.org/jira/browse/YARN-624
     • Long-lived services in YARN
       – https://issues.apache.org/jira/browse/YARN-896