Hadoop on VMware
Hadoop on VMware, presented at Hadoop World 2011

  • 1. Hadoop as a Service: Hadoop on Virtualization. Jun Ping Du and Richard McDougall, VMware, Inc. Hadoop World, December 2011.
  • 2. Cloud: Big Shifts in Simplification and Optimization
    1. Reduce the complexity, to simplify operations and maintenance
    2. Dramatically lower costs, to redirect investment into value-add opportunities
    3. Enable flexible, agile IT service delivery, to meet and anticipate the needs of the business
  • 3. Infrastructure, Apps and now Data…
    • Simplify infrastructure with cloud
    • Simplify the app platform through PaaS
    • Next trend: simplify data
  • 4. Trend 1/3: New Data Growing at 60% Y/Y
    • Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030
    • Yes, you are part of the yotta generation…
    • (Figure: data sources accumulating over time, from audio, digital TV, digital photos, camera phones, RFID, medical imaging, and sensors to satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, and digital movies)
    • Source: The Information Explosion, 2009
  • 5. Trend 2/3: Big Data, Driven by Real-World Benefit
  • 6. Trend 3/3: Value from Data Exceeds Hardware Cost
    • Value from the intelligence of data analytics now outstrips the cost of hardware
    • Hadoop enables the use of lower-cost hardware
    • Hardware cost is halving every 18 months
    • (Chart: value vs. cost. Big iron: $40k/CPU; commodity cluster: $1k/CPU)
  • 7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
    • The trend is "not just Hadoop" for big data: Hadoop is often combined with other technologies (big SQL, NoSQL, etc.)
    • Unify the infrastructure platform for all: big SQL, NoSQL, Hadoop, and DSS clusters on one unified big-data infrastructure, private or public
    • Common hardware base: eliminate the hardware/driver/testing phase; use the existing team for ordering, diagnosis, and capacity management of the hardware farm
  • 8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
    • "I want my Hadoop cluster now!"
    • Instant cluster provisioning: provision Hadoop clusters on demand
    • Automatable using provisioning engines/scripts, e.g. Whirr
  • 9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
    • Increase utilization: the Hadoop cluster uses only the resources it needs; extra resources can be used by other applications when not in use
    • Eliminate single points of failure: use vSphere HA for the NameNode and JobTracker
    • Use VM isolation: create separate clusters with defensible security; enables multiple versions of Hadoop on the same infrastructure; extends to Hadoop and Linux environments
    • Leverage resource management: control/assign resources through resource pools, e.g. use spare cycles for Hadoop processing through priority control
  • 10. What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine.
  • 11. Performance Test: Cluster Configuration
    • (Photo: AMAX ClusterMax nodes, each 2X X5650 with 96 GB and 12X SATA 500 GB, a Mellanox 10 GbE adapter per node, connected to a Mellanox 10 GbE switch)
  • 12. Cluster Configuration
    • Hardware: AMAX ClusterMax, 7 nodes; 2X X5650 2.67 GHz hex-core, 96 GB memory; 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4; Mellanox ConnectX VPI (MT26418), 10 GbE; Mellanox Vantage 6048, 10 GbE
    • OS/hypervisor: RHEL 6.1 x86_64 (native and guest); ESX 5.0 RTM with a development Mellanox driver
    • VMs (HT off/on): 1 VM: 92000 MB, 12/24 vCPUs, 10 PRDM disks; 2 VMs: 46000 MB, 6/12 vCPUs, 5 PRDM disks; 4 VMs (HT on only): 2 small (18400 MB, 5 vCPUs, 2 disks) and 2 large (27600 MB, 7 vCPUs, 3 disks)
  • 13. Hadoop Configuration
    • Distribution: Cloudera CDH3u0, based on Apache open-source 0.20.2
    • Parameters: dfs.datanode.max.xcievers=4096; dfs.replication=2; dfs.block.size=134217728; io.file.buffer.size=131072; JVM options "-Xmx2048m -Xmn512m" (native), "-Xmx1900m -Xmn512m" (virtual)
    • Network topology: Hadoop uses this info for reliability and performance; with multiple VMs per host, each host is a "rack"
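In a standard Hadoop 0.20-era deployment, the HDFS parameters listed above would live in hdfs-site.xml (io.file.buffer.size belongs in core-site.xml). A minimal sketch with the slide's values, file placement assumed per stock Hadoop conventions:

```xml
<!-- hdfs-site.xml: values copied from the slide above -->
<configuration>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB blocks -->
  </property>
</configuration>
```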
  • 14. Benchmarks
    • Derived from the test apps included in the distro
    • Pi: direct-exec Monte Carlo estimation of pi (π ≈ 4*R/(R+G)); # map tasks = # logical processors; 1.68 T samples
    • TestDFSIO: streaming write and read; 1 TB; more tasks than processors
    • Terasort: 3 phases (teragen, terasort, teravalidate); 10 B or 35 B records of 100 bytes each (1 TB, 3.5 TB); more tasks than processors; exercises CPU, networking, and storage I/O
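The Pi benchmark's Monte Carlo method can be sketched in a few lines of plain Python (vastly fewer samples than the 1.68 T used in the tests; R counts points landing inside the quarter circle, G those outside, matching the slide's formula):

```python
import random

def estimate_pi(samples, seed=42):
    """Monte Carlo estimate of pi: throw darts at the unit square and
    count how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    r = 0  # inside the quarter circle
    g = 0  # outside
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            r += 1
        else:
            g += 1
    return 4.0 * r / (r + g)  # the slide's formula: pi ~ 4*R/(R+G)

print(estimate_pi(1_000_000))  # converges toward 3.14159...
```

Hadoop's bundled Pi example parallelizes exactly this counting across map tasks, which is why the slide sets one map task per logical processor.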
  • 15. Performance of Hadoop for Several Workloads
    • (Chart: ratio of elapsed time to native for 1 VM and 2 VMs per host; lower is better)
  • 16. Architecting Hadoop as a Service Using Virtualization
    • Goals: make it fast and easy to provision new Hadoop clusters on demand; leverage virtual machines to provide isolation (especially for multi-tenancy); optimize Hadoop's performance based on virtual topologies; make the system reliable based on virtual topologies
    • Leveraging virtualization: elastic scale in/out; use high availability to protect the NameNode and JobTracker; resource controls and sharing (reuse underutilized memory and CPU); prioritize workloads (limit or guarantee resource usage in a mixed environment)
  • 17. Provisioning
    • Leverage the vSphere APIs to auto-deploy a cluster: Whirr, HOD, or custom tooling using Ruby, Chef, etc.
    • Use linked clones to rapidly fork many nodes
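Apache Whirr, mentioned above, drives cluster provisioning from a properties file. A minimal sketch (cluster name and node counts are illustrative; Whirr's stock providers target public clouds, so a vSphere deployment would need custom glue as the slide implies):

```properties
# hadoop.properties: a minimal Apache Whirr cluster definition (illustrative values)
whirr.cluster-name=hadoop-demo
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
```

A cluster defined this way is brought up with `whirr launch-cluster --config hadoop.properties` and torn down with `whirr destroy-cluster --config hadoop.properties`.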
  • 18. Fast Provisioning
    • From a "seed" node to a cluster
    • Thin provisioning: 60 GB => 3.5 GB
    • Linked clone: ~6 seconds
  • 19. SAN, NAS or Local Disk?
    • Shared storage (SAN or NAS): easy to provision; automated cluster rebalancing
    • Hybrid storage: SAN for boot images, VMs, and other workloads; local disk for HDFS; scalable bandwidth and lower cost/GB
  • 20. Enable Automatic Rack Awareness Through vSphere
    • Important for a robust Hadoop cluster
    • Automatic network topology detection, an important vSphere feature
    • The rack script is generated automatically
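Hadoop resolves rack locations by invoking a configurable topology script (set via topology.script.file.name in 0.20-era releases) that receives hostnames or IPs as arguments and prints one rack path per line. A minimal hand-written sketch of the kind of script the slide says vSphere can generate, with an illustrative host-to-rack table:

```python
#!/usr/bin/env python
import sys

# Illustrative mapping: in the virtualized setup above, every VM on the
# same physical host is placed in the same "rack".
HOST_TO_RACK = {
    "10.0.0.11": "/rack-host1",
    "10.0.0.12": "/rack-host1",
    "10.0.0.21": "/rack-host2",
}

def rack_for(host):
    """Return the rack path for a host; unknown hosts get a default rack."""
    return HOST_TO_RACK.get(host, "/default-rack")

if __name__ == "__main__":
    # Hadoop passes one or more hostnames/IPs and reads one rack per line.
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Generating this table from the hypervisor's view of VM-to-host placement is what makes the rack script "automatic" in the virtualized case.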
  • 21. Multi-Tenant: Share the Cluster or Not?
    • One shared big cluster: high performance, large scale, pre-job provisioning
    • Isolated small clusters: secure, flexible, post-job provisioning
    • In practice, a combination, as customers' requirements differ
  • 22. Elastic Hadoop Cluster
    • Traditional Hadoop cluster: easy to scale out (fast-provision new Hadoop nodes and join them into the existing cluster), but hard to scale in:
        while (clusterIsTooLarge) {
            choose node k;
            kill(node k);
            wait(k's data blocks are re-replicated);
            if necessary, hadoop.rebalance();
        }
    • Elastic Hadoop cluster: normal nodes run the NameNode/JobTracker, DataNode, and TaskTracker; elastic nodes run only a TaskTracker, so they can come and go without moving HDFS data
  • 23. Replica Placement
    • Second replica: on a different rack; rack awareness required
    • Third replica: same rack, different physical host; in a virtualized environment, nodes may share a host
  • 24. Demo
  • 25. Performance
    • Create more, smaller VMs: makes Hadoop scale better; allows easier/faster adjustment of VM packing across hosts by vSphere (including through DRS)
    • Sizing/configuration of storage is critical: plan on ~50 MB/s of bandwidth per core; SANs are typically configured by default for IOPS, not bandwidth; ensure the SAN ports/switch topology allow the required aggregate bandwidth; test/size the performance of the backend storage; local disks give ~100-140 MB/s per disk, so pick the correct controller
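The sizing rule of thumb above reduces to simple arithmetic; a quick sketch for one node of the 12-core, 10-data-disk test cluster (the 50 MB/s-per-core figure and the 100-140 MB/s-per-disk range are the slide's own estimates; 120 MB/s is taken as the midpoint):

```python
MB_PER_CORE = 50         # slide's planning figure: ~50 MB/s per core
MB_PER_LOCAL_DISK = 120  # midpoint of the slide's ~100-140 MB/s per disk

def required_bandwidth(cores):
    """Aggregate storage bandwidth (MB/s) the node's cores can consume."""
    return cores * MB_PER_CORE

def local_disk_bandwidth(disks):
    """Aggregate bandwidth (MB/s) the node's local data disks can supply."""
    return disks * MB_PER_LOCAL_DISK

# Per node in the test cluster: 12 cores, 10 local disks for Hadoop data.
demand = required_bandwidth(12)   # 600 MB/s
supply = local_disk_bandwidth(10) # 1200 MB/s
print(demand, supply, supply >= demand)
```

The same check applied to a SAN means summing port and switch-uplink bandwidth across the whole cluster, which is why the slide warns that IOPS-oriented default SAN configurations often fall short.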
  • 26. Summary
    • Hadoop does work well in a virtual environment
    • Plan a virtual cluster; enable other big-data solutions on the same infrastructure
    • Leverage the recipes to automate your configuration and deployment