Hadoop on Virtual Machines

Hadoop on Virtualization talk at Hadoop Summit 2012
Richard McDougall
Sanjay Radia

  • Hybrid storage: local disks, which retain the fault domains of individual disks
  • Data: can I read what I wrote, and is the service available?
  • When I asked one of the original authors of GFS whether there were any decisions they would revisit: random writers.
  • Simplicity is key.
  • Raw disk: file systems take time to stabilize; we can take advantage of ext4, xfs or zfs.

Presentation Transcript

  • Hadoop in Virtual Machines
    Richard McDougall, VMware
    Sanjay Radia, Hortonworks
    Hadoop Summit, 2012
  • Part 1
  • Say What?
    • VMs will just add overhead, due to I/O virtualization
    • VMs run on SAN; we're all about local disks
    • Hadoop does its own cluster management
    • It'll do resource management in 2.0
    • And even HA is coming to Hadoop
    • And… what is the point, anyway?
  • But you've been asking…
    • Can I virtualize my Hadoop, to make it easier and quicker to get a cluster up and running?
    • Is it possible to run Hadoop on those spare machine cycles I have on hundreds/thousands of nodes?
    • Can I make my system more available by using some of the standard HA features?
  • And the savvy are asking…
    • Can I avoid having to install special hardware for the master services, like name-node and job-tracker?
    • Can I dynamically change the size of the cluster to use more resources?
    • Can I use VM isolation to increase security or guard against resource-intensive neighbors?
    • Is it feasible to provision virtual clusters, giving out one each to a business unit?
  • Ok, so first what about the concerns?
    • Use your SAN? … if you want to.

                              SAN Storage          NAS Filers          Local Storage
      Cost                    $2 - $10/Gigabyte    $1 - $5/Gigabyte    $0.05/Gigabyte
      $1M gets: capacity      0.5 Petabytes        1 Petabyte          20 Petabytes
      $1M gets: IOPS          1,000,000 IOPS       400,000 IOPS        10,000,000 IOPS
      $1M gets: throughput    1 Gbyte/sec          2 Gbyte/sec         800 Gbytes/sec
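
The capacity row of the table can be reproduced with a short calculation. This is a minimal sketch, assuming decimal units (1 Petabyte = 1,000,000 Gigabytes) and assuming the slide used the low end of each quoted price range; the function name and the `prices` mapping are illustrative and not part of the talk.

```python
# Capacity that a fixed budget buys at a given price per gigabyte.
# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000
BUDGET_USD = 1_000_000

def petabytes_for_budget(budget_usd: float, usd_per_gb: float) -> float:
    """Return how many petabytes the budget buys at the given $/GB."""
    return budget_usd / usd_per_gb / GB_PER_PB

# Low end of each price range quoted on the slide (assumed to be what the slide used).
prices = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.05}

capacity = {name: petabytes_for_budget(BUDGET_USD, p) for name, p in prices.items()}
for name, pb in capacity.items():
    print(f"{name}: {pb:g} PB for $1M")
```

Under these assumptions the results match the slide: 0.5, 1 and 20 Petabytes. The IOPS and throughput rows depend on device counts that the slide does not state, so they are not modeled here.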
  • Hadoop Using Local Disks
    [Diagram: a virtualization host running a Hadoop virtual machine (Task Tracker and Datanode) next to other workloads; the VM's Ext4 file systems sit on VMDKs; the OS image VMDK is on shared storage]
  • Hadoop Perf in a VM (Ratio is elapsed time to physical, Lower Is Better)
    [Chart: ratio to native, axis from 0 to 1.2; series: 1 VM, 2 VMs]
  • Evolution of Hadoop on VMs
    • Current Hadoop: Combined Storage/Compute (Hadoop in VM)
      – VM lifecycle determined by Datanode
      – NOT Elastic
      – Limited to Hadoop Multi-Tenancy
    • Separate Storage
      – Separate compute from data
      – Elastic compute
      – Enable shared workloads
      – Raise utilization
    • Separate Compute Clusters
      – Separate virtual clusters per tenant
      – Stronger VM-grade security and resource isolation
      – Enable deployment of multiple Hadoop runtime versions
    [Diagram: VMs for each stage: combined storage/compute; compute separated from storage; per-tenant compute (T1, T2) over storage]
  • 1. Hadoop Task Tracker and Data Node in a VM
    • Add/Remove Slots?
    • Grow/Shrink by tens of GB?
    • Grow/Shrink of a VM is one approach
    [Diagram: a virtual Hadoop node (Task Tracker with slots, Datanode on a VMDK) next to other workloads on a virtualization host]
  • 2. Add/remove Virtual Nodes
    • Just add/remove more virtual nodes?
    [Diagram: two virtual Hadoop nodes, each with a Task Tracker (slots) and a Datanode on its own VMDK, next to other workloads on a virtualization host]
  • But State makes it hard to power-off a node
    • Powering off the Hadoop VM would in effect fail the datanode
    [Diagram: a virtual Hadoop node (Task Tracker with slots, Datanode on a VMDK) next to other workloads on a virtualization host]
  • Adding a node needs data…
    • Adding a node would require TBs of data replication
    [Diagram: two virtual Hadoop nodes, each with a Task Tracker (slots) and a Datanode on its own VMDK, next to other workloads on a virtualization host]
  • 2. Separated Compute and Data
    • Truly Elastic Hadoop: Scalable through virtual nodes
    [Diagram: several virtual Hadoop nodes running only a Task Tracker (slots), plus one virtual Hadoop node running the Datanode on VMDKs, next to other workloads on a virtualization host]
  • Dataflow with separated Compute/Data
    [Diagram: a Datanode virtual Hadoop node and a NodeManager virtual Hadoop node (with slots), each with a virtual NIC, connected through the virtualization host's virtual switch; VMDK and NIC drivers in the host]
  • Performance Analysis of Split
    • Combined: 1 Combined Compute/Datanode VM per Host
    • Split: 1 Datanode VM, 1 Compute node VM per Host
    • Workload: Teragen, Terasort, Teravalidate
    • HW Configuration: 8 cores, 96GB RAM, 16 disks per host x 2 nodes
    [Diagram: Node Manager and Datanode placement for the two configurations]
  • Performance Analysis of Split (Elapsed time: ratio to combined)
    [Chart: ratio axis from 0 to 1.2; series: Combined, Split; workloads: Teragen, Terasort, Teravalidate]
  • Tying it together: Elastic Hadoop
    [Diagram: Runtime Layer with virtual Hadoop clusters and a queue for tenants "Coke" and "Pepsi"; Data Layer with namespaces on a Distributed File System (HDFS, KFS, GPFS, MAPR, Isilon,…); six hosts underneath]
  • Demo: Shrink/Expand Cluster
  • Demo: Shrink/Expand Cluster
    • Setup: 1 Datanode, 2 NodeManagers and 2 web servers on each physical host
    [Diagram: four hosts, each with 2 Web Server VMs, 2 NodeManager VMs and 1 Datanode VM]
  • Demo: Shrink/Expand Cluster
    • When web load is high in daytime, we can suspend some NodeManagers and power on more Web servers.
    [Diagram: four hosts with Web Server, NodeManager and Datanode VMs]
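
The trade described on the demo slides can be sketched as a small per-host planning function. This is only an illustration under stated assumptions: the fixed per-host budget of compute VMs, the `HostLayout` type and the `rebalance` function are hypothetical and do not come from the talk or from any VMware or Hadoop API. The one property taken from the slides is that the Datanode VM stays up while NodeManagers are suspended.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HostLayout:
    datanodes: int      # storage VMs: left running so HDFS blocks stay available
    nodemanagers: int   # running Hadoop compute VMs
    web_servers: int    # running web-server VMs

def rebalance(layout: HostLayout, web_servers_wanted: int) -> HostLayout:
    """Trade NodeManager VMs for web-server VMs on one host.

    Assumption (not from the talk): the number of running compute VMs
    per host is fixed, so each extra web server suspends one NodeManager.
    """
    budget = layout.nodemanagers + layout.web_servers
    web = max(0, min(web_servers_wanted, budget))
    return HostLayout(layout.datanodes, budget - web, web)

# Setup from the slide: 1 Datanode, 2 NodeManagers, 2 web servers per host.
night = HostLayout(datanodes=1, nodemanagers=2, web_servers=2)
# Daytime: web load is high, so ask for one more web server on this host.
day = rebalance(night, web_servers_wanted=3)
print(day)
```

Because compute is separated from storage, shrinking the Hadoop side touches only NodeManagers, which is what avoids the "powering off would fail the datanode" problem from the earlier slides.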
  • Demo
  • Part 2
  • Expand Hadoop Ecosystem
    • Hortonworks goal
      – Expand Hadoop ecosystem
      – Provide first class support of various platforms
    • Hadoop should run well on VMs
      – VMs offer several advantages as presented earlier
    • Take advantage of vSphere for HA
  • VMware-Hortonworks Joint Engineering
    • First class support for VMs
      – Topology plugins (Hadoop-8468)
        • 2 VMs can be on same host
          – Pick closer data
          – Schedule tasks closer
          – Don't put two replicas on same host
      – MR-tmp on HDFS using block pools
        • Elastic Compute-VMs will not need local disk
      – Fast communications within VMs
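
The "don't put two replicas on same host" rule can be sketched as follows. This is a simplified illustration, not the Hadoop-8468 implementation: the `vm_to_host` mapping, the function name and the VM and host names are hypothetical, and the sketch ignores rack placement and load, which a real placement policy would also weigh.

```python
from typing import Dict, List

def choose_replica_vms(vm_to_host: Dict[str, str], writer_vm: str,
                       replicas: int = 3) -> List[str]:
    """Pick datanode VMs for one block so that no two replicas share a physical host."""
    chosen = [writer_vm]                   # first replica stays local to the writer
    used_hosts = {vm_to_host[writer_vm]}
    for vm in sorted(vm_to_host):          # deterministic order, for the example only
        if len(chosen) == replicas:
            break
        host = vm_to_host[vm]
        if host not in used_hosts:         # skip VMs co-located with an existing replica
            chosen.append(vm)
            used_hosts.add(host)
    return chosen

# Hypothetical topology: two datanode VMs share hostA, two share hostB.
vm_to_host = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB",
              "vm4": "hostB", "vm5": "hostC"}
placement = choose_replica_vms(vm_to_host, writer_vm="vm1")
print(placement)
```

Without the host check, a topology that sees only VMs could place replicas on vm1 and vm2, and a single physical host failure would then take out two copies of the block.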
  • Hadoop Total System Availability Architecture
    [Diagram: slave nodes of the Hadoop cluster running jobs; apps running outside the cluster; an HA cluster of servers for the master daemons (NN, JT) with N+K failover; on failover, the JT goes into safemode]
  • HA is coming in 1.0
    Using Total System Availability Architecture
    © Hortonworks Inc. 2011
  • HA in Hadoop 1 with HDP1
    • Total System Availability Architecture
      – Namenode
        • Clients pause automatically
        • JobTracker pauses automatically
      – Other Hadoop master services (JT, …) coming
    • Use industry proven HA framework
      – VMware vSphere-HA
        • Failover, fencing, …
        • Corner cases are tricky: if not addressed, corruption
      – Additional benefits:
        • N-N & N+K failover
        • Migration for maintenance
  • Hadoop NN/JT HA with vSphere
  • NameNode HA – Failover Times
    • NameNode Failover times with vSphere and LinuxHA
      – Failure detection + Failover: 0.5 to 2 minutes
      – OS bootup needed for vSphere: 10-20 seconds
      – Namenode Startup (exit safemode)
        • Small/Medium clusters: 1 to 2 minutes
        • Large cluster: 5 to 15 minutes
    • NameNode startup time measurements
      – 60 Nodes, 60K files, 6 million blocks, 300 TB raw storage: 40 sec
      – 180 Nodes, 200K files, 18 million blocks, 900 TB raw storage: 120 sec
    • Cold Failover is good enough for small/medium clusters
    • Failure Detection and Automatic Failover Dominates
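
Adding up the phases gives a rough end-to-end failover window. This is a back-of-the-envelope sketch that assumes the three phases run strictly one after another and that the OS bootup applies (the vSphere case); the constant and function names are illustrative, and the ranges are the ones quoted on the slide, converted to seconds.

```python
# (min_seconds, max_seconds) for each phase, taken from the slide.
FAILURE_DETECTION_AND_FAILOVER = (30, 120)   # 0.5 to 2 minutes
OS_BOOTUP = (10, 20)                         # needed for vSphere
NAMENODE_STARTUP = {
    "small/medium": (60, 120),               # 1 to 2 minutes
    "large": (300, 900),                     # 5 to 15 minutes
}

def total_failover_seconds(cluster_size: str) -> tuple:
    """Sum best-case and worst-case phase times (assumes phases do not overlap)."""
    phases = [FAILURE_DETECTION_AND_FAILOVER, OS_BOOTUP, NAMENODE_STARTUP[cluster_size]]
    return (sum(p[0] for p in phases), sum(p[1] for p in phases))

for size in NAMENODE_STARTUP:
    low, high = total_failover_seconds(size)
    print(f"{size}: {low} to {high} seconds")
```

Under these assumptions a small or medium cluster is back in roughly 100 to 260 seconds, with detection and failover taking up to 120 seconds of that, while a large cluster is dominated by NameNode startup.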
  • Demo
  • Summary
    • Advantages of Hadoop on VMs
      – Cluster Management
      – Cluster consolidation
      – Greater Elasticity in mixed environment
      – Alternate multi-tenancy to capacity scheduler's offerings
    • HA for Hadoop Master Daemons
      – vSphere based HA for NN, JT, … in Hadoop 1
      – Total System Availability Architecture