Best Practices for Virtualizing Apache Hadoop

Join this webinar to discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn about the flexibility and operational advantages of virtual machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotion, increased stability of the entire software stack, High Availability and Fault Tolerance. This is a can't-miss presentation for anyone who wants to understand the design, configuration and deployment of Hadoop in virtual infrastructures.


Transcript

  • 1. Best Practices: Virtualizing Hadoop. George Trujillo. © Hortonworks Inc. 2013
  • 2. George Trujillo
      § Master Principal Big Data Specialist, Hortonworks
      § Tier One Big Data/BCA Specialist, VMware Center of Excellence
      § VMware Certified Instructor (VMware Certified Professional)
      § MySQL Certified DBA
      § Sun Microsystems Ambassador for Java Platforms
      § Author of Linux Administration and Advanced Linux Administration Video Training
      § Recognized Oracle Double ACE by Oracle Corporation
      § Served on Oracle Fusion Council & Oracle Beta Leadership Council and the Independent Oracle Users Group (IOUG) Board of Directors; recognized as one of the “Oracles of Oracle” by IOUG
  • 3. Agenda
      • Hypervisors today
      • Building an enterprise virtual platform
      • Virtualizing Master and Slave servers
      • Best practices
      • Deploying Hadoop in public and private clouds
  • 4. Hypervisors Today: Faster/Less Overhead
      • VMware vSphere, Microsoft Hyper-V Server, Citrix XenServer and Red Hat RHEV
      • Hypervisor performance benchmarks and % overhead:
        – VMware (5.1): 1M IOPS with 1 microsecond of latency; overhead 2-10%
        – KVM (IBM hardware, RHEL): 1M transactions/minute; overhead < 10%
      • Hypervisor performance, VMware vSphere 5.1:
        – vCPUs: 64
        – RAM per VM / RAM per host: 1 TB / 2 TB
        – Network: 36 GB/s
        – IOPS/VM: 1,000,000
  • 5. Why Virtualize Hadoop?
      • Virtual servers offer advantages over physical servers
      • Standardization: on a single common software stack
      • Higher consistency and reliability due to abstracting the hardware environment
      • Operational flexibility with vMotion, Storage vMotion, Live Cloning, template deployments, hot memory and CPU add, Distributed Resource Scheduling, private VLANs, Storage and Network I/O control, etc.
      • Virtualization is a natural step towards the cloud
      • Enabling Hadoop as a service in a public or private cloud
      • Cloud providers are making it easy to deploy Hadoop for POCs, dev and test environments
      • Cloud and virtualization vendors are offering elastic MapReduce solutions
  • 6. Virtualization Features
      • Faster provisioning, Live Cloning, Live migrations, Templates
      • Live storage migrations, Distributed Resource Scheduling, High Availability, Hot CPU and Memory add
      • VM Replication, Network isolation using VXLANs, Multi-VM trust zones
      • VM Backups, Distributed Power Management, Elasticity, Multi-tenancy
      • Storage/Network I/O Control, Private virtual networks, 16Gb FC Support, iSCSI Jumbo Frame Support
      Note: Features/functionality dependent on the hypervisor
  • 7. Building an Enterprise Virtual Platform
      [Diagram: the Hortonworks Data Platform stack on top of a hypervisor. Layers shown: Hardware; Hypervisor; Linux and Windows; Core Hadoop (kernel); Hadoop Essentials; Data Extraction and Load; Management; Monitoring. Components shown: Distributed Storage (HDFS), Distributed Processing (MapReduce), Hive (Query), Pig (Scripting), HCatalog (Metadata Mgmt), Zookeeper (Coordination), HBase (Column DB), WebHCatalog (REST-like APIs), Ambari (Management), Mahout (Machine Learning), Oozie (Workflow), Ganglia (Monitoring), Nagios (Alerts), Sqoop (DB Transfer), FlumeNG (Data Transfer), WebHDFS (REST API), “Others” (Talend, Informatica, etc.)]
  • 8. Virtualizing Hadoop
      • The primary goal of virtualizing master and slave servers is the same: to maximize operational efficiency and leverage existing hardware.
      • However, the strategy for virtualizing Hadoop master servers differs from the strategy for virtualizing Hadoop slave servers.
        – Hadoop master servers can follow virtualization best practices and guidelines for tier 1 and business-critical environments.
        – Hadoop slave servers need to follow virtualization best practices and also use Hadoop Virtual Extensions so that a Hadoop cluster is “virtual aware”.
  • 9. Virtualizing Master Servers
      • Virtualize the master servers (NameNode, JobTracker, HBase Master, Secondary NameNode)
        – Consider any key management servers: Ganglia, Nagios, Ambari, Active Directory, metadata databases
      • Goals of a virtual enterprise Hadoop platform:
        – Less downtime (live migrations, cloning, …)
        – A more reliable software stack
        – A higher Quality of Service
        – Reduced CapEx and OpEx
        – Increased operational flexibility with virtualization features
        – VMware High Availability (with five clicks)
      • Shared storage for the Hadoop master servers is required to fully leverage virtualization features.
  • 10. Configure Environment Properly
      • Do not overcommit resources in SLA-bound or production environments.
      • Size virtual machines to avoid entering the host “soft” memory state and the likely breaking of host large pages into small pages. Leaving at least 6% of memory for the hypervisor and VM memory overhead is a conservative guideline.
        – If free memory drops below minFree (“soft” memory state), memory will be reclaimed through ballooning and other memory management techniques. All these techniques require breaking host large pages into small pages.
      • Leverage hyperthreading; make sure there is hardware and BIOS support.
        – Hyper-Threading can improve performance by up to 20%.
      • Do not set memory limits on production servers.
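The memory guideline on slide 10 can be turned into a quick sizing check. The sketch below is only an illustration: the 6% reserve comes from the slide, while the function names and the 256 GB host are hypothetical and not taken from any hypervisor API.

```python
# Minimal sizing sketch, assuming the slide's guideline of reserving at least
# 6% of host memory for the hypervisor and per-VM memory overhead.
# Function names and the example host size are hypothetical.

def vm_memory_budget_gb(host_memory_gb, reserve_fraction=0.06):
    """Return the memory (GB) left for VMs after the hypervisor reserve."""
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return host_memory_gb * (1 - reserve_fraction)


def fits_without_overcommit(host_memory_gb, vm_sizes_gb, reserve_fraction=0.06):
    """True if the configured VM memory fits inside the budget (no overcommit)."""
    return sum(vm_sizes_gb) <= vm_memory_budget_gb(host_memory_gb, reserve_fraction)


if __name__ == "__main__":
    host_gb = 256  # hypothetical host
    print(vm_memory_budget_gb(host_gb))
    print(fits_without_overcommit(host_gb, [64, 64, 64, 48]))
    print(fits_without_overcommit(host_gb, [64, 64, 64, 64]))
```

A check like this only covers configured memory; it does not model ballooning or the minFree threshold itself, which depend on the hypervisor.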
  • 11. Configure Environment Properly (2)
      • Run the latest version of the hypervisor, BIOS and virtual tools.
      • Verify that BIOS settings enable all populated processor sockets and all cores in each socket.
      • Enable “Turbo Boost” in BIOS if processors support it.
      • Disabling hardware devices (in BIOS) can free interrupt resources.
        – COM and LPT ports, USB controllers, floppy drives, network interfaces, optical drives, storage controllers, etc.
      • Enable virtualization features in BIOS (VT-x, AMD-V, EPT, RVI).
      • Initially leave the memory scrubbing rate at the manufacturer’s default setting.
  • 12. More Best Practices
      • Configure an OS kernel as a single-core or multi-core kernel based on the number of vCPUs being used.
      • Understand how NUMA affects your VMs; try to keep the VM size within the NUMA node.
        – Look at disabling node interleaving (leave NUMA enabled)
        – Maintain memory locality
      • Let the hypervisor control power management by setting the BIOS to “OS Controlled Mode”.
      • Enable C1E in BIOS.
      • Have a very good reason for using CPU affinity; otherwise avoid it like the plague.
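To make the NUMA advice on slide 12 concrete, here is a small sketch that checks whether a VM fits inside one NUMA node. The function name and the example host layout (8 cores and 128 GB per node) are assumptions for illustration only.

```python
# Minimal sketch, assuming a VM "fits" a NUMA node when both its vCPU count
# and its memory stay within what a single node provides.
# The host layout used in the example is hypothetical.

def fits_in_numa_node(vm_vcpus, vm_memory_gb, cores_per_node, memory_per_node_gb):
    """True if the VM can be scheduled entirely inside one NUMA node."""
    return vm_vcpus <= cores_per_node and vm_memory_gb <= memory_per_node_gb


if __name__ == "__main__":
    # Hypothetical two-socket host: 8 cores and 128 GB per NUMA node.
    print(fits_in_numa_node(8, 96, cores_per_node=8, memory_per_node_gb=128))
    print(fits_in_numa_node(12, 96, cores_per_node=8, memory_per_node_gb=128))
```

A VM that fails this check spans nodes, so part of its memory access becomes remote; that is the locality loss the slide warns about.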
  • 13. Linux Best Practices
      • Kernel parameters:
        – nofile=16384
        – nproc=32000
        – Mount with noatime and nodiratime (access-time updates disabled)
        – File descriptors set to 65535
        – File system read-ahead buffer should be increased to 1024 or 2048.
        – Epoll file descriptor limit should be increased to 4096
      • Turn off swapping
      • Use ext4 or xfs (mount noatime)
        – ext4 can be about 5% better on reads than xfs
        – xfs can be 12-25% better on writes (and auto-defrags in the background)
      • Linux 2.6.30+ can improve energy consumption by 60%.
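The mount recommendation on slide 13 is easy to verify by script. The sketch below parses text in the `/proc/mounts` format and reports which required options are missing for a mount point; the function name, the sample device and the `/hadoop/data1` path are hypothetical.

```python
# Minimal sketch: check that a data mount carries the options recommended on
# the slide. Input is text in /proc/mounts format:
#   device mount_point fs_type options dump pass
# The sample device and mount point are hypothetical.

def missing_mount_options(mounts_text, mount_point, required=("noatime",)):
    """Return the required options that are absent for mount_point."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
            return [opt for opt in required if opt not in options]
    raise LookupError(f"mount point not found: {mount_point}")


if __name__ == "__main__":
    sample = (
        "/dev/sdb1 /hadoop/data1 ext4 rw,noatime,nodiratime 0 0\n"
        "/dev/sdc1 /hadoop/data2 ext4 rw,relatime 0 0\n"
    )
    for point in ("/hadoop/data1", "/hadoop/data2"):
        print(point, missing_mount_options(sample, point, ("noatime", "nodiratime")))
```

On a real host the same function could be fed the contents of `/proc/mounts`.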
  • 14. Networking Best Practices
      • Separate VM traffic from live migration and management traffic
        – Separate NICs with separate vSwitches
      • Leverage NIC teaming (at least 2 NICs per vSwitch)
      • Leverage the latest adapters and drivers from the hypervisor vendor
      • Be careful with multi-queue networking: Hadoop drives a high packet rate, but not high enough to justify the overhead of multi-queue.
      • Network:
        – Channel bonding two GbE ports can give better I/O performance
        – 8 queues per port
  • 15. Networking Best Practices (2)
      • Evaluate these features with network adapters to leverage hardware features:
        – Checksum offload
        – TCP segmentation offload (TSO)
        – Jumbo frames (JF)
        – Large receive offload (LRO)
        – Ability to handle high-memory DMA (that is, 64-bit DMA addresses)
        – Ability to handle multiple Scatter Gather elements per Tx frame
      • Optimize 10 Gigabit Ethernet network adapters
        – Features like NetQueue can significantly improve performance of 10 Gigabit Ethernet network adapters in virtualized environments.
  • 16. Storage Best Practices
      • Make good storage decisions
        – e.g. VMFS or Raw Device Mappings (RDM)
        – VMDK: leverages all features of virtualization
        – RDM: leverages features of storage vendors (replication, snapshots, …)
        – Run in Advanced Host Controller Interface (AHCI) mode.
        – Enable Native Command Queuing (NCQ)
      • Use multiple vSCSI adapters and evenly distribute target devices
      • Use eagerzeroedthick for VMDK files, or uncheck the Windows “Quick Format” option
      • Make sure there is block alignment for storage
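The block-alignment point on slide 16 comes down to simple arithmetic: a partition is aligned when its starting byte offset is a multiple of the storage boundary. The sketch below assumes 512-byte sectors and a 1 MiB boundary; both values are illustrative assumptions, and the right boundary depends on the storage array and file system in use.

```python
# Minimal sketch: a partition is aligned when its starting byte offset is a
# multiple of the boundary. The 512-byte sector size and 1 MiB boundary are
# assumptions for illustration; check the storage vendor's guidance.

def is_aligned(start_sector, sector_size_bytes=512, boundary_bytes=1024 * 1024):
    """True if the partition start falls exactly on a boundary."""
    return (start_sector * sector_size_bytes) % boundary_bytes == 0


if __name__ == "__main__":
    print(is_aligned(2048))  # 2048 * 512 bytes = 1 MiB offset
    print(is_aligned(63))    # 63 * 512 bytes = 32256 bytes offset
```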
  • 17. Virtualizing Data Servers
      • HVE is a new feature that extends the Hadoop topology awareness mechanism to support rack and node groups with hosts containing VMs.
        – Data locality-related policies maintained within a virtual layer
      • HVE merged into branch-1
        – Available in Apache Hadoop 1.2, HDP 1.2
        – https://issues.apache.org/jira/browse/HADOOP-8817
      • Extensions include:
        – Block placement and removal policies
        – Balancer policies
        – Task scheduling
        – Network topology awareness
  • 18. HVE: Virtualization Topology Awareness
      [Diagram: a data center containing two racks (Rack1, Rack2), four node groups (NodeG 1 to NodeG 4), eight hosts (Host1 to Host8) and several VMs on each host]
      • HVE is a new feature that extends the Hadoop topology awareness mechanism to support rack and node groups with hosts containing VMs.
        – Data locality-related policies maintained within a virtual layer.
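One way to picture the extra topology layer is as one more level in a location path. The sketch below assumes each VM is described by a path of the form `/rack/nodegroup/vm` and measures distance as the number of hops through the closest common ancestor. The path layout and all names are illustrative assumptions, not the HVE implementation.

```python
# Minimal sketch of tree distance in a topology with a node-group layer.
# Assumption: every VM has a path "/rack/nodegroup/vm"; distance counts the
# hops up to the closest common ancestor and back down.
# All names below are hypothetical.

def topology_distance(path_a, path_b):
    """Return the hop count between two locations of equal depth."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    if len(a) != len(b):
        raise ValueError("paths must have the same depth")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 2 * (len(a) - common)


if __name__ == "__main__":
    vm1 = "/rack1/nodegroup1/vm1"
    print(topology_distance(vm1, vm1))                      # same VM
    print(topology_distance(vm1, "/rack1/nodegroup1/vm2"))  # same node group
    print(topology_distance(vm1, "/rack1/nodegroup2/vm5"))  # same rack
    print(topology_distance(vm1, "/rack2/nodegroup3/vm9"))  # different rack
```

Under these assumptions two VMs in the same node group are closer than two VMs that merely share a rack, which is the distinction a virtualization-aware topology adds.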
  • 19. HVE: Replica Policies
      • Standard replica policies:
        – 1st replica is on the local (closest) node of the writer
        – 2nd replica is on a separate rack from the 1st replica
        – 3rd replica is on the same rack as the 2nd replica
        – Remaining replicas are placed randomly across racks to meet the minimum restriction
      • Extension replica policies:
        – Multiple replicas are not placed on the same node or on nodes under the same node group
        – 1st replica is on the local node or local node group of the writer
        – 2nd replica is on a remote rack of the 1st replica
      Multiple replicas are not placed on the same node with standard or extension replica placement/removal policies. Rules are maintained for the balancer.
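The extension replica rules on slide 19 can be expressed as a small validation routine. The sketch below checks a proposed placement against the three extension rules listed on the slide; it is not Hadoop's block placement code, and the `Location` type and every name in it are hypothetical.

```python
# Minimal sketch that checks a replica placement against the extension rules
# listed on the slide. This is an illustration, not Hadoop's implementation.
# The Location type and all names are hypothetical.
from collections import namedtuple

Location = namedtuple("Location", ["rack", "node_group", "node"])


def placement_violations(writer, replicas):
    """Return a list of rule violations for replicas written by writer."""
    violations = []

    # Rule: no two replicas on the same node or under the same node group.
    groups = [(r.rack, r.node_group) for r in replicas]
    if len(set(groups)) != len(groups):
        violations.append("two replicas share a node group")

    # Rule: 1st replica on the writer's node or node group.
    if replicas and (replicas[0].rack, replicas[0].node_group) != (
        writer.rack,
        writer.node_group,
    ):
        violations.append("1st replica is outside the writer's node group")

    # Rule: 2nd replica on a remote rack of the 1st replica.
    if len(replicas) > 1 and replicas[1].rack == replicas[0].rack:
        violations.append("2nd replica is on the same rack as the 1st")

    return violations


if __name__ == "__main__":
    writer = Location("rack1", "ng1", "vm1")
    good = [writer, Location("rack2", "ng3", "vm9"), Location("rack2", "ng4", "vm13")]
    bad = [writer, Location("rack1", "ng1", "vm2")]
    print(placement_violations(writer, good))
    print(placement_violations(writer, bad))
```

Checking node groups rather than individual nodes is what keeps two replicas from landing on VMs that share one physical host.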
  • 20. Follow Virtualization Best Practices
      § Hardware: Validate virtualization and Hadoop configurations with vendor hardware compatibility lists.
      § Hadoop: Follow recommended Hadoop reference architectures.
      § Storage: Review storage vendor recommendations.
      § Virtualization: Follow virtualization vendors’ best practices, deployment guides and workload characterizations.
      § Internal: Validate internal guidelines and best practices for configuring and managing corporate VMs.
  • 21. Benefits of Running Hadoop in a Private Cloud
      • Elastic Hadoop
        – Create pool of cluster nodes
        – On-demand cluster scale up/down
      • Multi-tenant Hadoop
        – Better isolate workloads and enforce organizational security boundaries
      These lead to:
      • CapEx reduction
        – Better utilization of physical servers
        – Cluster ‘timeshare’
        – Promote responsible usage through chargeback/showback
      • OpEx reduction
        – Rapid provisioning & self-provisioning
        – Simplify cluster maintenance
  • 22. Hortonworks & Rackspace Partnership
      • Goal:
        – Enable Hadoop to run efficiently in OpenStack-based public and private cloud environments
      • Where we stand
        – Rackspace public cloud service available soon (Q3 CY13)
        – Continued work on enabling Hortonworks Data Platform to run efficiently on the Rackspace OpenStack private cloud platform
      • Project Savannah
        – Automate the deployment of Hadoop on enterprise-class OpenStack clouds.
  • 23. Final Thoughts
      • Virtualization features can provide operational advantages to a Hadoop cluster.
      • A lot of companies have expertise in virtualizing tier two/three platforms but not tier one. Be careful of growing pains.
      • Can your organization handle the jump of moving to Hadoop and managing an enterprise virtual infrastructure at the same time?
      • Give Hadoop Virtual Extensions time to bake.
      • Organizations are increasing their percentage of virtual servers and cloud deployments. They do not want to take a step back into physical servers unless they have to.
  • 24. Next Steps
      • Download Hortonworks Sandbox: www.hortonworks.com/sandbox
      • Download Hortonworks Data Platform: www.hortonworks.com/download
      • Register for Hadoop Series: www.hortonworks.com/webinars
  • 25. Hadoop Summit: Architecting the Future of Big Data
      • June 26-27, 2013, San Jose Convention Center
      • Co-hosted by Hortonworks & Yahoo!
      • Theme: Enabling the Next Generation Enterprise Data Platform
      • 90+ Sessions and 7 Tracks
      • Community Focused Event
        – Sessions selected by a Conference Committee
        – Community Choice allowed public to vote for sessions they want to see
      • Training classes offered pre-event
        – Apache Hadoop Essentials: A Technical Understanding for Business Users
        – Understanding Microsoft HDInsight and Apache Hadoop
        – Developing Solutions with Apache Hadoop: HDFS and MapReduce
        – Applying Data Science using Apache Hadoop
      hadoopsummit.org
  • 26. Thank You For Attending: Best Practices for Virtualizing Hadoop
      George Trujillo
      Blog: http://cloud-dba-journey.blogspot.com
      Twitter: GeorgeTrujillo