Hadoop on VMware


Presented at Hadoop World 2011

  1. Hadoop as a Service: Hadoop on Virtualization. Junping Du and Richard McDougall, VMware, Inc. Hadoop World, December 2011.
  2. Cloud: Big Shifts in Simplification and Optimization
     1. Reduce the complexity, to simplify operations and maintenance
     2. Dramatically lower costs, to redirect investment into value-add opportunities
     3. Enable flexible, agile IT service delivery, to meet and anticipate the needs of the business
  3. Infrastructure, Apps, and Now Data…
     •  Simplify infrastructure with cloud: build, run, manage; private and public
     •  Simplify the app platform through PaaS
     •  Next trend: simplify data
  4. Trend 1/3: New Data Growing at 60% Y/Y
     •  Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030
     •  Yes, you are part of the yotta generation…
     •  [Figure: timeline of data sources, from audio, digital TV, digital photos, camera phones, and RFID through medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, and digital movies]
     •  Source: The Information Explosion, 2009
  5. Trend 2/3: Big Data – Driven by Real-World Benefit
  6. Trend 3/3: Value from Data Exceeds Hardware Cost
     •  Value from the intelligence of data analytics now outstrips the cost of hardware
     •  Hadoop enables the use of lower-cost hardware
     •  Hardware cost is halving every 18 months
     •  Big iron: $40k/CPU; commodity cluster: $1k/CPU
  7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
     •  The trend is "not just Hadoop" for big data: Hadoop is often combined with other technologies (Big SQL, NoSQL, etc.); unify the infrastructure platform for all
     •  [Diagram: SQL, Big SQL, NoSQL, Hadoop, and DSS clusters running on a unified big data infrastructure, private and public]
     •  Common hardware base: eliminate the hardware/driver/testing phase; use the existing team for ordering, diagnosis, and capacity management of the hardware farm
  8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
     •  I WANT MY HADOOP CLUSTER NOW!
     •  Instant cluster provisioning: provision Hadoop clusters instantly
     •  Automatable using provisioning engines/scripts, e.g. Whirr
  9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
     •  Increase utilization: a Hadoop cluster uses only the resources it needs; extra resources can be used by other applications when the cluster is idle
     •  Eliminate single points of failure: use vSphere HA for the NameNode and JobTracker
     •  Use VM isolation: create separate clusters with defensible security; run multiple versions of Hadoop on the same infrastructure; extends to Hadoop and Linux environments
     •  Leverage resource management: control/assign resources through resource pools, e.g. use spare cycles for Hadoop processing through priority control
  10. What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine.
  11. Performance Test: Cluster Configuration
     •  [Diagram: AMAX ClusterMax nodes, each 2X X5650 with 96 GB, 12X SATA 500 GB, and a Mellanox 10 GbE adapter, connected by a Mellanox 10 GbE switch]
  12. Cluster Configuration
     •  Hardware: AMAX ClusterMax, 7 nodes; 2X X5650 2.67 GHz hex-core, 96 GB memory; 12X SATA 500 GB 7200 RPM (10 for Hadoop data), ext4; Mellanox ConnectX VPI (MT26418), 10 GbE; Mellanox Vantage 6048 10 GbE switch
     •  OS/hypervisor: RHEL 6.1 x86_64 (native and guest); ESX 5.0 RTM with a development Mellanox driver
     •  VMs (HT off/on): 1 VM: 92000 MB, 12/24 vCPUs, 10 PRDM disks; 2 VMs: 46000 MB, 6/12 vCPUs, 5 PRDM disks; 4 VMs (HT on only): 2 small (18400 MB, 5 vCPUs, 2 disks) and 2 large (27600 MB, 7 vCPUs, 3 disks)
  13. Hadoop Configuration
     •  Distribution: Cloudera CDH3u0, based on Apache open-source 0.20.2
     •  Parameters: dfs.datanode.max.xcievers=4096; dfs.replication=2; dfs.block.size=134217728; io.file.buffer.size=131072; mapred.child.java.opts="-Xmx2048m -Xmn512m" (native) or "-Xmx1900m -Xmn512m" (virtual)
     •  Network topology: Hadoop uses the topology info for reliability and performance; with multiple VMs per host, each host is a "rack"
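The parameters above live in Hadoop's standard configuration files; a minimal sketch of how they map onto those files in a CDH3-era (0.20.x) deployment (values taken from the slide, file split per Hadoop convention):

```xml
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <!-- note: the historic misspelling "xcievers" is the real 0.20-era key -->
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128 MB blocks -->
  </property>
</configuration>

<!-- core-site.xml -->
<configuration>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>

<!-- mapred-site.xml (virtual-cluster heap shown; use -Xmx2048m natively) -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1900m -Xmn512m</value>
  </property>
</configuration>
```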
  14. Benchmarks
     •  Derived from test apps included in the distro
     •  Pi: direct-exec Monte Carlo estimation of pi; # map tasks = # logical processors; 1.68 T samples [figure: dartboard illustration, π ≈ 4*R/(R+G) = 22/7]
     •  TestDFSIO: streaming write and read; 1 TB; more tasks than processors
     •  Terasort: 3 phases (teragen, terasort, teravalidate); 10B or 35B records of 100 bytes each (1 TB, 3.5 TB); more tasks than processors; exercises CPU, networking, and storage I/O
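As a standalone illustration of what the Pi benchmark computes, here is a single-process sketch of the same Monte Carlo estimate (with far fewer samples than the 1.68 T run on the slide; Hadoop's version splits the sampling across map tasks):

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo pi: the fraction of random points in the unit
    square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(1_000_000))  # close to 3.1416
```

More samples shrink the error roughly as 1/sqrt(n), which is why the benchmark uses trillions of them.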
  15. Performance of Hadoop for Several Workloads
     •  [Chart: ratio of time taken relative to native, for 1-VM and 2-VM configurations across the benchmarks; y-axis 0 to 1.2; lower is better]
  16. Architecting Hadoop as a Service Using Virtualization
     •  Goals: make it fast and easy to provision new Hadoop clusters on demand; leverage virtual machines to provide isolation (especially for multi-tenancy); optimize Hadoop's performance based on virtual topologies; make the system reliable based on virtual topologies
     •  Leveraging virtualization: elastic scale in/out; use high availability to protect the NameNode/JobTracker; resource controls and sharing (re-use underutilized memory and CPU); prioritize workloads (limit or guarantee resource usage in a mixed environment)
  17. Provisioning
     •  Leverage the vSphere APIs to auto-deploy a cluster: Whirr, HOD, or custom tooling using Ruby, Chef, etc.
     •  Use linked clones to rapidly fork many nodes
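For the Whirr route, a cluster is described declaratively in a properties file and launched with one command; a minimal sketch (the cluster name and node counts are invented, and the provider/credential settings depend entirely on your environment):

```properties
# hadoop.properties: hypothetical 6-node Apache Whirr cluster layout
whirr.cluster-name=demo-hadoop
# one master VM running the NameNode + JobTracker, five worker VMs
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
# provider and credentials vary by environment
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

The cluster is then brought up with `whirr launch-cluster --config hadoop.properties` and torn down with `whirr destroy-cluster --config hadoop.properties`.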
  18. Fast Provisioning
     •  From a "seed" node to a cluster
     •  Thin provisioning: 60 GB => 3.5 GB
     •  Linked clone: ~6 seconds
  19. SAN, NAS, or Local Disk?
     •  Shared storage (SAN or NAS): easy to provision; automated cluster rebalancing
     •  Hybrid storage: SAN for boot images, VMs, and other workloads; local disk for HDFS; scalable bandwidth, lower cost/GB
     •  [Diagram: Hadoop VMs and other VMs spread across hosts, comparing the shared-storage and hybrid-storage layouts]
  20. Enable Automatic Rack Awareness Through vSphere
     •  Important for a robust Hadoop cluster
     •  Automatic network topology detection is an important vSphere feature
     •  The rack script is generated automatically
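Hadoop learns rack placement from a topology script (configured via `topology.script.file.name` in this Hadoop generation): it is invoked with IPs or hostnames as arguments and must print one rack path per argument. A hand-written sketch of what such a generated script might look like, assuming the "each hypervisor host is a rack" scheme from slide 13 (the IP-to-host table here is made up):

```python
#!/usr/bin/env python
"""Toy Hadoop topology script: map datanode IPs to 'racks'.

In the virtualized setup each physical ESX host is treated as a
rack; the table below is an invented example, not vSphere output.
"""
import sys

# hypothetical mapping: VM IP -> hypervisor host acting as the "rack"
IP_TO_RACK = {
    "10.0.1.11": "/esx-host-1",
    "10.0.1.12": "/esx-host-1",
    "10.0.2.21": "/esx-host-2",
}

DEFAULT_RACK = "/default-rack"  # Hadoop's conventional fallback

def rack_for(ip: str) -> str:
    return IP_TO_RACK.get(ip, DEFAULT_RACK)

if __name__ == "__main__":
    # Hadoop passes one or more addresses; print one rack per line
    for arg in sys.argv[1:]:
        print(rack_for(arg))
```

Two VMs mapping to the same rack path tells Hadoop they share a failure domain, which is exactly what replica placement needs in a virtualized cluster.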
  21. Multi-Tenant: Share the Cluster or Not?
     •  Shared big cluster: high performance; large scale; pre-job provisioning
     •  Isolated small clusters: secure; flexible; post-job provisioning
     •  Combination of both, as customers' requirements differ
  22. Elastic Hadoop Cluster
     •  Traditional Hadoop cluster: easy to scale out (fast-provision new Hadoop nodes and join them into the existing cluster), but hard to scale in:

            while (cluster is too large) {
                choose node k;
                kill(node k);
                wait(k's data blocks are recovered);
                if necessary, hadoop.rebalance();
            }

     •  Elastic Hadoop cluster: [diagram: NN/JT on a dedicated node; normal nodes run both a TaskTracker and a DataNode; elastic nodes run only a TaskTracker]
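The scale-in loop above can be sketched with stubbed-out cluster operations (every class and helper name here is a made-up stand-in; a real implementation would decommission DataNodes via HDFS's exclude file rather than kill them outright, but the control flow mirrors the slide):

```python
"""Sketch of the slide's scale-in loop over a mock cluster."""

class Cluster:
    def __init__(self, nodes, target_size):
        self.nodes = list(nodes)
        self.target_size = target_size

    def is_too_large(self):
        return len(self.nodes) > self.target_size

    def choose_node(self):
        # stub policy: pick the last-joined node
        return self.nodes[-1]

    def kill(self, node):
        self.nodes.remove(node)

    def wait_for_block_recovery(self, node):
        pass  # stub: block until the node's replicas are re-created elsewhere

    def rebalance_if_needed(self):
        pass  # stub: run the Hadoop balancer when data is skewed

def scale_in(cluster):
    """Shrink the cluster one node at a time, as on the slide."""
    while cluster.is_too_large():
        k = cluster.choose_node()
        cluster.kill(k)
        cluster.wait_for_block_recovery(k)
        cluster.rebalance_if_needed()

cluster = Cluster(nodes=["dn1", "dn2", "dn3", "dn4"], target_size=2)
scale_in(cluster)
print(cluster.nodes)  # -> ['dn1', 'dn2']
```

The wait-for-recovery step is what makes naive scale-in slow, and it is exactly what compute-only elastic nodes (TaskTracker without a DataNode) avoid.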
  23. Replica Placement
     •  Second replica: different rack; rack awareness required
     •  Third replica: same rack as the second, but a different physical host; nodes can share a host in a virtualized environment
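The placement rule can be sketched as a small function over nodes annotated with their rack and physical host (the sample topology below is invented; real HDFS implements this inside the NameNode's block-placement policy):

```python
"""Sketch of the slide's replica-placement rule.

Rule: 2nd replica on a different rack than the 1st; 3rd replica on
the 2nd's rack but a different physical host, since in a virtual
cluster two nodes on the same rack may share a hypervisor.
"""
from collections import namedtuple

Node = namedtuple("Node", "name rack host")

def place_replicas(first, candidates):
    # second replica: any candidate on a different rack
    second = next(n for n in candidates if n.rack != first.rack)
    # third replica: same rack as the second, different physical host
    third = next(n for n in candidates
                 if n.rack == second.rack and n.host != second.host)
    return first, second, third

nodes = [
    Node("vm-a1", rack="/rack1", host="esx1"),
    Node("vm-b1", rack="/rack2", host="esx2"),
    Node("vm-b2", rack="/rack2", host="esx2"),  # shares esx2 with vm-b1
    Node("vm-b3", rack="/rack2", host="esx3"),
]

r1, r2, r3 = place_replicas(nodes[0], nodes[1:])
print(r1.name, r2.name, r3.name)  # -> vm-a1 vm-b1 vm-b3
```

Note that vm-b2 is skipped for the third replica even though it is on the right rack, because it shares a physical host with vm-b1.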
  24. Demo
  25. Performance
     •  Create more, smaller VMs: makes Hadoop scale better; allows easier/faster adjustment of VM packing across hosts by vSphere (including through DRS)
     •  Sizing/configuration of storage is critical: plan on ~50 MB/s of bandwidth per core; SANs are typically configured by default for IOPS, not bandwidth; ensure SAN ports and switch topology allow the required aggregate bandwidth; test/size the performance of the backend storage; local disks give ~100-140 MB/s per disk, so pick the correct controller
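These rules of thumb reduce to a quick sizing calculation; a sketch using the slide's own figures (the 12-core example matches the test cluster's nodes, but any counts can be substituted):

```python
import math

def disks_needed(cores_per_node: int,
                 mb_per_core: float = 50.0,
                 mb_per_disk: float = 100.0) -> int:
    """Local disks per node needed to feed all cores, using the
    slide's ~50 MB/s-per-core demand and a conservative ~100 MB/s
    per local disk (the slide quotes 100-140 MB/s)."""
    return math.ceil(cores_per_node * mb_per_core / mb_per_disk)

# a 12-core node (2x hex-core, as in the test cluster) wants
# ~600 MB/s of storage bandwidth
print(disks_needed(12))  # -> 6
```

The same arithmetic applies to a SAN: a 16-node cluster of such machines needs roughly 16 × 600 MB/s ≈ 9.6 GB/s of aggregate fabric bandwidth, which is why default IOPS-oriented SAN configurations fall short.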
  26. Summary
     •  Hadoop does work well in a virtual environment
     •  Plan a virtual cluster; enable other big-data solutions on the same infrastructure
     •  Leverage the recipes to automate your configuration and deployment