
Best Practices for Virtualizing Hadoop


This presentation discusses best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of virtual machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand the design, configuration and deployment of Hadoop in virtual infrastructures.

Published in: Technology

  1. Best Practices for Virtualizing Hadoop
     March 21st, 2013, 2:20 – 3:00pm
     George Trujillo
     © Hortonworks Inc. 2012
  2. George Trujillo
     • Master Principal Big Data Specialist - Hortonworks
     • Tier One Big Data/BCA Specialist – VMware Center of Excellence
     • VMware Certified Professional
     • VMware Certified Instructor
     • MySQL Certified DBA
     • Sun Microsystems Ambassador for Java Platforms
     • Author of Linux Administration and Advanced Linux Administration Video Training
     • Oracle Double ACE
     • Served on Oracle Fusion Council & Oracle Beta Leadership Council, Independent Oracle Users Group (IOUG) Board of Directors
     • Recognized as one of the “Oracles of Oracle” by IOUG
  3. Agenda
     Title: Best Practices for Virtualizing Hadoop
     Summary: This presentation will discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of virtual machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand the design, configuration and deployment of Hadoop in virtual infrastructures.
     Agenda:
       A. Hypervisors today
       B. Building an enterprise virtual platform
       C. Virtualizing Master and Slave servers
       D. Key best practices
  4. Grand Junction, Colorado
  5. Hypervisors Today: Faster/Less Overhead
     • VMware vSphere, Microsoft Hyper-V Server, Citrix XenServer and Red Hat RHEV

     Hypervisor   Performance benchmark                          % Overhead
     VMware       1M IOPS with 1 microsecond of latency (5.1)    2 – 10%
     KVM          1M transactions/minute (IBM hardware, RHEL)    10%

     Hypervisor performance, VMware vSphere 5.1
     vCPUs                        64
     RAM per VM / RAM per host    1TB / 2TB
     Network                      36 GB/s
     IOPS                         1,000,000

     vSphere 5.1 – 1M IOPS with 1 μs latency
  6. Why Virtualize Hadoop
     • Virtual Servers offer advantages over Physical Servers
     • Enabling Hadoop as a service in a public or private cloud
     • Cloud providers are making it easy to deploy Hadoop for POCs, dev and test environments
     • Running a Consistent, Highly Reliable Hardware Environment
     • Standardizing on a Single Common Hardware Platform (software stack)
     • Virtualization is a natural step towards the cloud
     • Cloud and virtualization vendors are offering elastic MapReduce solutions
  7. Virtualization Features
     Faster provisioning               Live Cloning
     Live migrations                   Templates
     Live storage migrations           Distributed Resource Scheduling
     High Availability                 Hot CPU and Memory add
     Live Cloning                      VM Replication
     Network isolation using VXLANs    Multi-VM trust zones
     VM Backups                        Distributed Power Management
     Elasticity                        Multi-tenancy
     Storage I/O Control               Network I/O Control
     16Gb FC Support                   iSCSI Jumbo Frame Support
  8. Building an Enterprise Virtual Platform
     Hortonworks Data Platform (stack diagram, top layer first):
     • Data Extraction and Load: Sqoop (DB Transfer), Talend (Data Transfer), WebHDFS (REST API), FlumeNG (Data Transfer), “Others” (Vendors and 3rd party tools)
     • Management and Monitoring: Ambari (Management), Oozie (Workflow), Ganglia (Monitoring), Nagios (Alerts)
     • Hadoop Essentials: WebHCatalog (Rest-like APIs), Pig (Scripting), Mahout (Machine Learning), HCatalog (Metadata Mgmt), Hive (Query), HBase (Column DB), Zookeeper (Coordination)
     • Core Hadoop (kernel): Distributed Processing (MapReduce), Distributed Storage (HDFS)
     • Linux, Windows
     • Hypervisor
     • Hardware
  9. Virtualizing Master Servers
     • Virtualize the master servers (NameNode, JobTracker, HBase Master, Secondary NameNode)
       – Ganglia, Nagios, Ambari, Active Directory, Metadata databases
     • Virtualizing Hadoop master servers offers:
       – Virtualization features to master servers
       – VMware High Availability
     • Develop a hybrid storage solution
  10. Protecting Master Servers
      • Shared storage is required to fully leverage virtualization features.
      • A virtual enterprise platform will provide:
        – Less down time (Live migrations, cloning, …)
        – A more reliable software stack
        – A higher Quality of Service
        – Reduced CapEx and OpEx
        – Increased operational flexibility
  11. Configure Hardware Properly
      • Do not overcommit SLA or production environments
      • Understand how NUMA affects your VMs
        – Try to keep the VM size within the NUMA node
        – Look at disabling node interleaving (leave NUMA enabled)
        – Maintain memory locality
      • Leverage hyperthreading
        – Make sure there is hardware and BIOS support
        – Hyper-Threading can improve performance up to 20%
      • Do not set memory limits on production servers.
      • Let the hypervisor control power management by setting the BIOS to “OS Controlled Mode”
      • Enable C1E in BIOS
  12. Configure Hardware Properly (2)
      • Run the latest version of BIOS and VMware Tools
      • Verify that BIOS settings enable all populated processor sockets and all cores in each socket.
      • Enable “Turbo Boost” in BIOS if processors support it.
      • Disabling unused hardware devices (in BIOS) can free interrupt resources.
        – COM and LPT ports, USB controllers, floppy drives, network interfaces, optical drives, storage controllers, etc.
      • Enable virtualization features in BIOS (VT-x, AMD-V, EPT, RVI)
      • Initially leave memory scrubbing rate at manufacturer’s default setting.
  13. More Best Practices
      • Use the latest hypervisor and virtual tools
      • Configure an OS kernel as a single-core or multi-core kernel based on the number of vCPUs being used.
      • Size virtual machines to avoid entering the host “soft” memory state and the likely breaking of host large pages into small pages. Leaving at least 6% of memory for the hypervisor and VM memory overhead is a conservative rule.
        – If free memory drops below minFree (“soft” memory state), memory will be reclaimed through ballooning and other memory management techniques. All these techniques require breaking host large pages into small pages.
      • Have a very good reason for using CPU affinity; otherwise avoid it like the plague
  14. Linux Best Practices
      • Kernel parameters:
        – nofile=16384
        – nproc=32000
        – Mount with the noatime and nodiratime options (access-time updates disabled)
        – File descriptors set to 65535
        – File system read-ahead buffer should be increased to 1,024 or 2,048.
        – Epoll file descriptor limit should be increased to 4096
      • Turn off swapping
      • Use ext4 or xfs (mount noatime)
        – ext4 can be about 5% better on reads than xfs
        – XFS can be 12-25% better on writes (and auto defrags in the background)
      • Linux 2.6.30+ can give 60% better energy consumption.
  15. Networking Best Practices
      • Separate VM traffic from live migration and management traffic
        – Separate NICs with separate vSwitches
      • Leverage NIC teaming (at least 2 NICs per vSwitch)
      • Leverage latest drivers for hypervisor vendor
      • Avoid multi-queue networking: Hadoop drives a high packet rate, but not high enough to justify the overhead of multi-queue.
      • Network:
        – Channel bonding two GbE ports can give better I/O performance
        – 8 Queues per port
  16. Networking Best Practices (2)
      • Evaluate these features with network adapters to leverage hardware features:
        – Checksum offload
        – TCP segmentation offload (TSO)
        – Jumbo frames (JF)
        – Large receive offload (LRO)
        – Ability to handle high-memory DMA (that is, 64-bit DMA addresses)
        – Ability to handle multiple Scatter Gather elements per Tx frame
      • Optimize 10 Gigabit Ethernet network adapters
        – Features like NetQueue can significantly improve performance of 10 Gigabit Ethernet network adapters in virtualized environments.
  17. Storage Best Practices
      • Make good storage decisions, i.e. VMFS or Raw Device Mappings (RDM)
        – VMDK – leverages all features of virtualization
        – RDM – leverages features of storage vendors (replication, snapshots, …)
        – Run in Advanced Host Controller Interface (AHCI) mode.
        – Enable Native Command Queuing (NCQ)
      • Use multiple vSCSI adapters and evenly distribute target devices
      • Use eagerzeroedthick for VMDK files or uncheck the Windows “Quick Format” option
      • Make sure there is block alignment for storage
  18. Hadoop Virtualized Extensions (HVE)
      • HVE is a new feature that extends the Hadoop topology awareness mechanism to support rack and node groups with hosts containing VMs.
        – Data locality-related policies maintained within a virtual layer
      • HVE merged into branch-1
        – Available in Apache Hadoop 1.2
      • Extensions include:
        – Block placement and removal policies
        – Balancer policies
        – Task scheduling
        – Network topology awareness
  19. HVE: Virtualization Topology Awareness
      • HVE is a new feature that extends the Hadoop topology awareness mechanism to support rack and node groups with hosts containing VMs.
        – Data locality-related policies maintained within a virtual layer.
      [Diagram: a topology hierarchy of Data Center, racks (Rack1, Rack2), node groups (NodeG 1 to NodeG 4), hosts (Host1 to Host8), and multiple VMs on each host]
  20. HVE: Replica Policies
      Multiple replicas are not placed on the same node with standard or extension replica placement/removal policies. Rules are maintained for the balancer.

      Standard Replica Policies
      • 1st replica is on the local (closest) node of the writer
      • 2nd replica is on a separate rack from the 1st replica
      • 3rd replica is on the same rack as the 2nd replica
      • Remaining replicas are placed randomly across racks to meet the minimum restriction.

      Extension Replica Policies
      • Multiple replicas are not placed on the same node or on nodes under the same node group
      • 1st replica is on the local node or local node group of the writer
      • 2nd replica is on a remote rack of the 1st replica
  21. Follow Virtualization Best Practices
      Hardware          Validate virtualization and Hadoop configurations with vendor hardware compatibility lists.
      Hadoop            Follow recommended Hadoop reference architectures.
      Virtualization    Follow virtualization vendors' best practices, deployment guides and workload characterizations.
      Storage           Review storage vendor recommendations.
      Internal          Validate internal best practices for configuring and managing VMs.
  22. Best Practices for Virtualizing Hadoop
      Thank You For Attending
      George Trujillo
      Blog: http://cloud-dba-journey.blogspot.com
      Twitter: GeorgeTrujillo
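
The sizing advice on the "Configure Hardware Properly" and "More Best Practices" slides (do not overcommit production hosts, keep each VM within a NUMA node, leave at least 6% of memory for the hypervisor) can be turned into a quick sanity check. The following is only a rough sketch under stated assumptions: the function names, the tuple layout for a VM, and the example host figures are all hypothetical and are not taken from the slides or from any VMware tool. Only the 6% figure comes from the deck.

```python
from typing import List, Tuple

# Fraction of host RAM left for the hypervisor and VM overhead.
# The 6% value is the "conservative" figure quoted on the slide.
HYPERVISOR_RESERVE = 0.06


def vm_memory_budget_gb(host_ram_gb: float, reserve: float = HYPERVISOR_RESERVE) -> float:
    """RAM (GB) available to VMs after holding back the hypervisor reserve."""
    return host_ram_gb * (1.0 - reserve)


def fits_numa_node(vcpus: int, ram_gb: float,
                   cores_per_node: int, ram_per_node_gb: float) -> bool:
    """True when a VM's vCPUs and RAM both fit inside one NUMA node."""
    return vcpus <= cores_per_node and ram_gb <= ram_per_node_gb


def check_plan(host_ram_gb: float, cores_per_node: int, ram_per_node_gb: float,
               vms: List[Tuple[str, int, float]]) -> List[str]:
    """Return warnings for a hypothetical plan of (name, vcpus, ram_gb) VMs."""
    warnings = []
    budget = vm_memory_budget_gb(host_ram_gb)
    total = sum(ram for _, _, ram in vms)
    if total > budget:
        warnings.append(
            f"VM memory {total:.1f} GB exceeds budget {budget:.1f} GB (overcommitted)")
    for name, vcpus, ram in vms:
        if not fits_numa_node(vcpus, ram, cores_per_node, ram_per_node_gb):
            warnings.append(f"{name} does not fit within a single NUMA node")
    return warnings


if __name__ == "__main__":
    # Hypothetical host: 256 GB RAM, NUMA nodes with 8 cores and 128 GB each.
    plan = [("namenode", 4, 64.0), ("datanode1", 8, 96.0), ("datanode2", 8, 96.0)]
    for line in check_plan(256.0, 8, 128.0, plan):
        print(line)
```

A check like this only covers static sizing; it says nothing about ballooning behaviour or large-page breakage at run time, which the slides treat as something to avoid by staying under the budget in the first place.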
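
The extension replica rules from the "HVE: Replica Policies" slide can also be expressed as a small validator. This is a minimal illustration, not Hadoop's implementation: it assumes a made-up `/rack/nodegroup/node` path format for each replica location, and the names `Location`, `parse_location` and `extension_policy_violations` are hypothetical. It checks only the two rules stated on the slide that can be tested from locations alone: no two replicas in the same node group (which also rules out the same node), and the 2nd replica on a different rack from the 1st.

```python
from typing import List, NamedTuple


class Location(NamedTuple):
    rack: str
    node_group: str
    node: str


def parse_location(path: str) -> Location:
    """Split a hypothetical '/rack/nodegroup/node' topology path."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /rack/nodegroup/node, got {path!r}")
    return Location(*parts)


def extension_policy_violations(replica_paths: List[str]) -> List[str]:
    """List violations of the slide's extension replica rules, in replica order."""
    replicas = [parse_location(p) for p in replica_paths]
    problems = []

    # Rule: replicas must not share a node or a node group.
    seen_groups = set()
    for r in replicas:
        key = (r.rack, r.node_group)
        if key in seen_groups:
            problems.append(
                f"more than one replica in node group {r.rack}/{r.node_group}")
        seen_groups.add(key)

    # Rule: the 2nd replica goes on a remote rack relative to the 1st.
    if len(replicas) >= 2 and replicas[0].rack == replicas[1].rack:
        problems.append("2nd replica is on the same rack as the 1st")

    return problems


if __name__ == "__main__":
    ok = ["/rack1/ng1/vm1", "/rack2/ng3/vm9", "/rack2/ng4/vm13"]
    bad = ["/rack1/ng1/vm1", "/rack1/ng1/vm2"]
    print(extension_policy_violations(ok))
    print(extension_policy_violations(bad))
```

The point of the node-group level is that two VMs on the same physical host share a failure domain, so treating them as separate nodes would let all replicas of a block land on one piece of hardware.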