Best Practices for Virtualizing Hadoop

March 21st, 2013 2:20 – 3:00pm
George Trujillo

© Hortonworks Inc. 2012

George Trujillo
Master Principal Big Data Specialist - Hortonworks
Tier One Big Data/BCA Specialist – VMware Center of Excellence
VMware Certified Professional
VMware Certified Instructor
MySQL Certified DBA
Sun Microsystems Ambassador for Java Platforms
Author of Linux Administration and Advanced Linux Administration Video Training
Oracle Double ACE
Served on Oracle Fusion Council & Oracle Beta Leadership Council and the Independent Oracle Users Group (IOUG) Board of Directors
Recognized as one of the “Oracles of Oracle” by IOUG

Agenda
Title: Best Practices for Virtualizing Hadoop

Summary: This presentation will discuss best practices for designing and building
a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure.
Attendees will learn the flexibility and operational advantages of Virtual Machines
such as fast provisioning, cloning, high levels of standardization, hybrid
storage, vMotioning, increased stabilization of the entire software stack, High
Availability and Fault Tolerance. This is a can't-miss presentation for anyone
wanting to understand design, configuration and deployment of Hadoop in virtual
infrastructures.

Agenda:
A. Hypervisors today
B. Building an enterprise virtual platform
C. Virtualizing Master and Slave servers
D. Key best practices


Grand Junction, Colorado


Hypervisors Today: Faster/Less Overhead

• VMware vSphere, Microsoft Hyper-V Server, Citrix XenServer and Red Hat RHEV

Hypervisor   Performance Benchmarks                          % Overhead
VMware       1M IOPS with 1 microsecond of latency (5.1)     2 – 10%
KVM          1M transactions/minute (IBM hardware, RHEL)     10%

Hypervisor Performance       vSphere 5.1
VMware vCPUs                 64
RAM per VM / RAM per Host    1TB / 2TB
Network                      36 GB/s
IOPS                         1,000,000

vSphere 5.1 – 1M IOPS with 1 μs latency


Why Virtualize Hadoop
• Virtual Servers offer advantages over Physical Servers
• Enabling Hadoop as a service in a public or private
cloud
• Cloud providers are making it easy to deploy Hadoop
for POCs, dev and test environments
• Running a Consistent, Highly Reliable Hardware
Environment
• Standardizing on a Single Common Hardware Platform
(software stack)
• Virtualization is a natural step towards the cloud
• Cloud and virtualization vendors are offering elastic
MapReduce solutions


Virtualization Features

• Faster provisioning
• Live Cloning
• Live migrations
• Templates
• Live storage migrations
• Distributed Resource Scheduling
• High Availability
• Hot CPU and Memory add
• VM Replication
• Network isolation using VXLANs
• Multi-VM trust zones
• VM Backups
• Distributed Power Management
• Elasticity
• Multi-tenancy
• Storage I/O Control
• Network I/O Control
• 16Gb FC Support
• iSCSI Jumbo Frame Support


Building an Enterprise Virtual Platform
Hortonworks Data Platform (layers, top to bottom)
• Data Extraction and Load: Sqoop (DB Transfer), Talend (Data Transfer), WebHDFS (REST API), FlumeNG (Data Transfer), “Others” (Vendors and 3rd party tools)
• Management and Monitoring: Ambari (Management), Oozie (Workflow), Ganglia (Monitoring), Nagios (Alerts)
• Hadoop Essentials: WebHCatalog (REST-like APIs), Pig (Scripting), Mahout (Machine Learning), HCatalog (Metadata Mgmt), Hive (Query), HBase (Column DB), Zookeeper (Coordination)
• Core Hadoop (kernel): Distributed Processing (MapReduce), Distributed Storage (HDFS)
• Linux, Windows
• Hypervisor
• Hardware


Virtualizing Master Servers
• Virtualize the master servers (NameNode, JobTracker,
HBase Master, Secondary NameNode)
– Also supporting services: Ganglia, Nagios, Ambari, Active Directory, metadata databases
• Virtualizing Hadoop master servers offers:
– Virtualization features to master servers
– VMware High Availability
• Develop a hybrid storage solution


Protecting Master Servers
• Shared storage is required to fully leverage
virtualization features.
• A virtual enterprise platform will provide:
– Less down time (Live migrations, cloning, …)
– A more reliable software stack
– A higher Quality of Service
– Reduced CapEx and OpEx
– Increased operational flexibility


Configure Hardware Properly
• Do not overcommit resources in SLA-bound or production environments
• Understand how NUMA affects your VMs – try to keep
the VM size within the NUMA node
– Look at disabling node interleaving (leave NUMA enabled)
– Maintain memory locality
• Leverage hyperthreading – make sure there is
hardware and BIOS support
– Hyperthreading can improve performance by up to 20%
• Do not set memory limits on production servers.
• Let the hypervisor control power management by setting the BIOS to
“OS Controlled Mode”
• Enable C1E in BIOS
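
The NUMA sizing guidance above comes down to simple arithmetic. The following is a minimal sketch, assuming cores and memory are split evenly across NUMA nodes; the function name and the example host figures are hypothetical and only illustrate the check.

```python
def fits_in_numa_node(vm_vcpus, vm_mem_gb, host_cores, host_mem_gb, numa_nodes):
    """Return True if the VM fits inside a single NUMA node.

    Assumes cores and memory are divided evenly across NUMA nodes,
    which is an illustrative simplification, not a vendor rule.
    """
    cores_per_node = host_cores // numa_nodes
    mem_per_node_gb = host_mem_gb / numa_nodes
    return vm_vcpus <= cores_per_node and vm_mem_gb <= mem_per_node_gb


# Hypothetical host: 2 NUMA nodes, 16 cores and 256 GB RAM in total.
print(fits_in_numa_node(8, 96, host_cores=16, host_mem_gb=256, numa_nodes=2))
print(fits_in_numa_node(12, 96, host_cores=16, host_mem_gb=256, numa_nodes=2))
```

A VM that fails this check would span NUMA nodes and lose some memory locality, so it is a candidate for resizing.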


Configure Hardware Properly (2)
• Run latest version of BIOS and VMware Tools
• Verify BIOS settings enable all populated processor
sockets and enable all cores in each socket.
• Enable “Turbo Boost” in BIOS if processors support
it.
• Disabling unused hardware devices (in BIOS) can free
interrupt resources.
– COM and LPT ports, USB controllers, floppy drives, network
interfaces, optical drives, storage controllers, etc.
• Enable virtualization features in BIOS (VT-x, AMD-V,
EPT, RVI)
• Initially leave memory scrubbing rate at
manufacturer’s default setting.
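
On a Linux host, one quick way to see which of the virtualization features above the kernel reports is to look at the CPU flags in /proc/cpuinfo. The sketch below parses such text; it assumes the usual Linux flag names (vmx, svm, ept, npt), and the function name and sample line are illustrative. A reported flag shows what the kernel sees, so BIOS settings may still need to be confirmed separately.

```python
# Linux /proc/cpuinfo flag names mapped to the features listed above.
VIRT_FLAGS = {
    "vmx": "Intel VT-x",
    "svm": "AMD-V",
    "ept": "Intel EPT",
    "npt": "AMD RVI (nested page tables)",
}


def virtualization_features(cpuinfo_text):
    """Return the sorted virtualization features found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            found.update(VIRT_FLAGS[f] for f in flags if f in VIRT_FLAGS)
    return sorted(found)


# Illustrative sample line, not captured from a real machine.
sample = "flags\t\t: fpu vme vmx ept sse2"
print(virtualization_features(sample))
```

On a real host, the text could come from reading /proc/cpuinfo instead of the sample string.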

More Best Practices
• Use latest hypervisor and virtual tools
• Configure an OS kernel as a single-core or multi-core
kernel based on the number of vCPUs being used.
• Size virtual machines to avoid entering the host “soft”
memory state, which is likely to break host large
pages into small pages. Leaving at least 6% of memory
for the hypervisor and VM memory overhead is
conservative.
– If free memory drops below minFree (“soft” memory
state), memory will be reclaimed through ballooning and other
memory management techniques. All these techniques require
breaking host large pages into small pages.
• Have a very good reason for using CPU affinity;
otherwise avoid it like the plague.
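
The memory sizing rule above can be turned into a quick budget check. This is a minimal sketch that uses the 6% reserve from the slide as a default; the function names and the example VM sizes are hypothetical.

```python
def max_vm_memory_gb(host_mem_gb, reserve_fraction=0.06):
    """Memory left for VMs after reserving a fraction for the hypervisor
    and per-VM overhead. 0.06 is the 6% figure from the slide."""
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return host_mem_gb * (1 - reserve_fraction)


def is_overcommitted(vm_sizes_gb, host_mem_gb, reserve_fraction=0.06):
    """True if the configured VM memory exceeds the usable budget."""
    return sum(vm_sizes_gb) > max_vm_memory_gb(host_mem_gb, reserve_fraction)


# Hypothetical 256 GB host with four VMs.
print(is_overcommitted([60, 60, 60, 60], host_mem_gb=256))
print(is_overcommitted([64, 64, 64, 64], host_mem_gb=256))
```

Staying under this budget is one way to reduce the chance of the host dropping into the “soft” memory state.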

Linux Best Practices
• Kernel parameters:
– nofile=16384
– nproc=32000
– Mount with the noatime and nodiratime options (access-time updates disabled)
– File descriptors set to 65535
– File system read-ahead buffer should be increased to 1024 or 2048.
– Epoll file descriptor limit should be increased to 4096
• Turn off swapping
• Use ext4 or xfs (mount noatime)
– ext4 can be about 5% better on reads than xfs
– xfs can be 12-25% better on writes (and auto-defrags in the
background)
• Linux 2.6.30+ can give 60% better energy efficiency.
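
The nofile and nproc values above are normally applied as per-user limits in /etc/security/limits.conf, and the mount options can be checked against an option string from /etc/fstab. The sketch below shows both; the `hdfs` user name and the helper names are assumptions for illustration only.

```python
REQUIRED_MOUNT_OPTIONS = ("noatime", "nodiratime")


def limits_conf_lines(user, nofile=16384, nproc=32000):
    """Build limits.conf entries; '-' sets both the soft and hard limit."""
    return [
        f"{user} - nofile {nofile}",
        f"{user} - nproc {nproc}",
    ]


def missing_mount_options(options):
    """Return required options absent from a comma-separated option string."""
    present = set(options.split(","))
    return [opt for opt in REQUIRED_MOUNT_OPTIONS if opt not in present]


# "hdfs" is a hypothetical service account name.
print("\n".join(limits_conf_lines("hdfs")))
print(missing_mount_options("rw,noatime"))
```

The default values are the ones from the slide; adjust them to whatever your own environment settles on.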


Networking Best Practices
• Separate VM traffic from live migration and
management traffic
– Separate NICs with separate vSwitches
• Leverage NIC teaming (at least 2 NICs per vSwitch)
• Use the latest drivers from the hypervisor vendor
• Avoid multi-queue networking: Hadoop drives a high
packet rate, but not high enough to justify the
overhead of multi-queue.
• Network:
– Channel bonding two GbE ports can give better I/O performance
– 8 Queues per port


Networking Best Practices (2)
• Evaluate these features with network adapters to
leverage hardware features:
– Checksum offload
– TCP segmentation offload (TSO)
– Jumbo frames (JF)
– Large receive offload (LRO)
– Ability to handle high-memory DMA (that is, 64-bit DMA
addresses)
– Ability to handle multiple Scatter Gather elements per Tx frame
• Optimize 10 Gigabit Ethernet network adapters
– Features like NetQueue can significantly improve performance of
10 Gigabit Ethernet network adapters in virtualized environments.
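
On Linux, several of the adapter features above show up in the output of `ethtool -k <interface>`. The sketch below parses that kind of output into on/off values. The sample text is illustrative rather than captured from real hardware, and the helper name is hypothetical. Jumbo frames are an MTU setting and are not covered by this check.

```python
# Feature names as printed by `ethtool -k <interface>` on Linux.
FEATURES_OF_INTEREST = (
    "rx-checksumming",
    "tx-checksumming",
    "scatter-gather",
    "tcp-segmentation-offload",
    "large-receive-offload",
)


def parse_offload_features(ethtool_output):
    """Map each feature of interest to True (on) or False (off)."""
    state = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name in FEATURES_OF_INTEREST:
            # Values look like "on", "off" or "off [fixed]".
            state[name] = value.split()[:1] == ["on"]
    return state


SAMPLE = """\
Features for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
large-receive-offload: off [fixed]
"""
print(parse_offload_features(SAMPLE))
```

Comparing this result before and after a driver or firmware change is a simple way to evaluate which hardware features are actually in use.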


Storage Best Practices
• Make good storage decisions
– e.g. VMFS or Raw Device Mappings (RDM)
– VMDK – leverages all features of virtualization
– RDM – leverages features of storage vendors (replication,
snapshots, …)
– Run in Advanced Host Controller Interface (AHCI) mode.
– Enable Native Command Queuing (NCQ)
• Use multiple vSCSI adapters and evenly distribute
target devices
• Use eagerzeroedthick for VMDK files or uncheck
Windows “Quick Format” option
• Make sure there is block alignment for storage
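
Block alignment can be sanity-checked from a partition's starting sector. The following is a minimal sketch; the 512-byte sector size and the 1 MiB boundary are assumed defaults, and the boundary your storage vendor recommends may differ.

```python
def is_aligned(start_sector, sector_size=512, boundary=1024 * 1024):
    """True if the partition's starting byte offset is a multiple of boundary.

    The 1 MiB boundary and 512-byte sector size are assumed defaults;
    use the values your storage vendor recommends.
    """
    return (start_sector * sector_size) % boundary == 0


print(is_aligned(2048))  # 2048 * 512 bytes = 1 MiB
print(is_aligned(63))    # legacy MS-DOS style starting sector
```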


Hadoop Virtualized Extensions (HVE)
• HVE is a new feature that extends the Hadoop topology
awareness mechanism to support rack and node
groups with hosts containing VMs.
– Data locality-related policies maintained within a virtual layer
• HVE merged into branch-1
– Available in Apache Hadoop 1.2
• Extensions include:
– Block placement and removal policies
– Balancer policies
– Task scheduling
– Network topology awareness


HVE: Virtualization Topology Awareness
[Diagram: topology hierarchy, top to bottom]
Data Center
Rack1, Rack2
NodeG 1, NodeG 2, NodeG 3, NodeG 4
Host1, Host2, Host3, Host4, Host5, Host6, Host7, Host8
VMs (multiple per host)


HVE: Replica Policies
Multiple replicas are not placed on the same node with standard or extension
replica placement/removal policies. Rules are maintained for the balancer.

Standard Replica Policies
• 1st replica is on the local (closest) node of the writer.
• 2nd replica is on a separate rack from the 1st replica.
• 3rd replica is on the same rack as the 2nd replica.
• Remaining replicas are placed randomly across racks to meet the minimum restriction.

Extension Replica Policies
• Multiple replicas are not placed on the same node or on nodes under the same node group.
• 1st replica is on the local node or local node group of the writer.
• 2nd replica is on a remote rack of the 1st replica.
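
The extension rules above can be expressed as a small validation check. This is an illustrative sketch only, not Hadoop's own placement code: the `Replica` tuple, the function name and the rack/node group labels are all hypothetical.

```python
from collections import Counter, namedtuple

# Location of one replica in a rack / node group / node hierarchy.
Replica = namedtuple("Replica", "rack node_group node")


def violates_extension_policy(replicas):
    """Return reasons the replica set breaks the extension rules above."""
    problems = []
    groups = Counter((r.rack, r.node_group) for r in replicas)
    for (rack, group), count in groups.items():
        if count > 1:
            problems.append(f"{count} replicas share node group {rack}/{group}")
    if len(replicas) >= 2 and replicas[0].rack == replicas[1].rack:
        problems.append("2nd replica is not on a remote rack")
    return problems


good = [
    Replica("rack1", "ng1", "vm1"),
    Replica("rack2", "ng3", "vm9"),
    Replica("rack2", "ng4", "vm13"),
]
bad = [Replica("rack1", "ng1", "vm1"), Replica("rack1", "ng1", "vm2")]
print(violates_extension_policy(good))
print(violates_extension_policy(bad))
```

The point of the node group rule is visible in the `bad` case: two VMs in the same node group would hold both copies, so losing what that node group represents would lose both replicas.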


Follow Virtualization Best Practices

• Hardware: Validate virtualization and Hadoop configurations with vendor hardware compatibility lists.
• Hadoop: Follow recommended Hadoop reference architectures.
• Virtualization: Follow virtualization vendors' best practices, deployment guides and workload characterizations.
• Storage: Review storage vendor recommendations.
• Internal: Validate internal best practices for configuring and managing VMs.



Thank You for Attending
George Trujillo
Blog: http://cloud-dba-journey.blogspot.com
Twitter: GeorgeTrujillo
