As originally designed, the Hadoop framework is typically built on a native environment with commodity hardware. However, with the growing trend toward cloud computing, there is a stronger requirement to build Hadoop clusters on a public or private cloud so that customers can benefit from virtualization and multi-tenancy. This talk introduces some of the challenges of providing a Hadoop service on a virtualization platform (performance, rack awareness, job scheduling, memory overcommitment, etc.) and proposes some solutions.
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & Isilon (EMC)
Hadoop has made it into the enterprise mainstream as a Big Data technology. But what about Hadoop as a private or public cloud service on shared infrastructure? This session looks at a Hadoop solution with virtualization, shared storage, and multi-tenancy, and discusses how service providers can use Pivotal Hadoop Distribution, Isilon, and Serengeti to offer Hadoop-as-a-Service.
After this session you will be able to:
Objective 1: Understand Hadoop and its deployment challenges.
Objective 2: Understand the EMC HDaaS solution architecture and the use cases it addresses.
Objective 3: Understand Pivotal Hadoop Distribution, Serengeti, and Isilon's Hadoop features.
Hadoop as a Service (as offered by a handful of niche vendors today) is a cloud computing solution that makes medium- and large-scale data processing accessible, easy, fast, and inexpensive. This is achieved by eliminating the operational challenges of running Hadoop, so one can focus on business growth.
Deciding on the deployment model is critical when enterprises adopt Hadoop. Initially, the bare-metal model (an on-premises cluster of physical servers) was popular, to avoid I/O overhead in virtualized environments. These days, however, the cloud is also a contending option, with compelling cost savings and ease of operation. To aid in assessing the deployment options, Accenture Technology Labs developed the Accenture Data Platform Benchmark suite and a total cost of ownership (TCO) model, and tuned and compared the performance of bare-metal Hadoop clusters and a Hadoop cloud service. Interestingly, the study discovered that the price/performance ratio is not a critical factor in making a Hadoop deployment decision: employing empirical and systematic analyses, it found comparable price/performance from both bare-metal Hadoop clusters and Hadoop-as-a-Service. Moreover, cheaper purchasing options (e.g., long-term contracts) provide a better ratio than bare metal in many cases. This result debunks the idea that the cloud is not suitable for Hadoop MapReduce workloads because of their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration leaves ample headroom for performance tuning, and that cloud infrastructure enables even further tuning opportunities.
The TCO Calculator - Estimate the True Cost of Hadoop (MapR Technologies)
http://bit.ly/1wsAuRS - There are many hidden costs for Apache Hadoop, and they affect different Hadoop distributions differently. With the new MapR TCO calculator, organisations have a simple, reliable, fact-based tool to compare costs.
How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given some data volume estimates and a rough use case?
This presentation attempts to compare the different options available on AWS.
Part 2: Cloudera's Operational Database: Unlocking New Benefits in the Cloud (Cloudera, Inc.)
3 Things to Learn About:
*On-premises versus the cloud
*Design & benefits of real-time operational data in the cloud
*Best practices and architectural considerations
Intel and Cloudera: Accelerating Enterprise Big Data Success (Cloudera, Inc.)
The data center has gone through several inflection points in the past decades: adoption of Linux, migration from physical infrastructure to virtualization and Cloud, and now large-scale data analytics with Big Data and Hadoop.
Please join us to learn about how Cloudera and Intel are jointly innovating through open source software to enable Hadoop to run best on IA (Intel Architecture) and to foster the evolution of a vibrant Big Data ecosystem.
Part 1: Lambda Architectures: Simplified by Apache Kudu (Cloudera, Inc.)
3 Things to Learn About:
* The concept of lambda architectures
* The Hadoop ecosystem components involved in lambda architectures
* The advantages and disadvantages of lambda architectures
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environments (DataWorks Summit)
The scheduler of a container orchestration system such as YARN or K8s is a critical component that users rely on to plan resources and manage applications.
Assessing where we are today: YARN effectively has two powerful schedulers (the Fair and Capacity schedulers), and both serve many strong use cases in the big data ecosystem. YARN can scale up to 50k nodes per cluster, schedule 20k containers per second, and is extremely efficient at managing batch workloads.
The K8s default scheduler is an industry-proven solution for efficiently managing long-running services. But as more big data apps move to K8s and the cloud, many features (hierarchical queues for better multi-tenancy, fair resource sharing, preemption, etc.) are either missing or not yet mature enough to support big data apps running on K8s.
At this point, no solution exists that provides a unified resource scheduling experience across platforms. That makes it extremely difficult to manage workloads running in different environments, from on-premises to cloud.
Hence, evolving a common scheduler that builds on the proven capabilities of YARN and K8s, and improving it for the cloud, will focus on use cases like:
Better bin-packing scheduling (and gang scheduling)
Autoscale up and shrink policy management
Effectively running batch workloads and services with clear SLAs
In summary, as a separate initiative we are improving core, cloud-aware scheduling capabilities to manage both K8s and YARN clusters; the above use cases will be its core focus. More details of our work will be presented in this talk.
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVM (Amazon Web Services)
Are you ready to streamline your SAP processes?
The pace of change in business and IT is relentless. To stay on top, enterprise leaders must focus on agility. Yet, customizing and upgrading SAP environments can be time consuming and costly. Now SAP teams can innovate faster and more cost effectively while also maximizing availability by using SAP Landscape Virtualization Manager with NetApp Private Storage for AWS & NetApp Cloud ONTAP.
SAP and NetApp will show how to dramatically accelerate SAP development cycles by instantly activating the compute and storage resources needed for dev, test, training and conference room pilots. We’ll also show how this automated solution can reduce compute and storage costs by up to 50% and provide disaster recovery virtually for free. The solution demonstration will include how to clone multiple copies of an SAP system in 5 minutes using no extra storage. Along the way we’ll share cloud-readiness advice gained in jointly developing this solution.
The NetApp team will then review the NetApp-AWS solutions at the foundation of this SAP infrastructure and share new capabilities in our expanding joint portfolio that can help you:
• Gain cloud compute benefits for workloads that demand maximum performance, scale, availability, or control: NetApp Private Storage for AWS – uses Amazon EC2
• Enhance AWS cloud storage with the power of enterprise data management — with a software-only solution: NetApp Cloud ONTAP for AWS – uses Amazon EC2 & Amazon EBS
• Solve backup and archive headaches with cloud-integrated storage: NetApp SteelStore – uses Amazon S3 & Amazon Glacier.
• Store data across Amazon regions & decades with scalable, durable object storage across your premises and AWS: NetApp StorageGRID Webscale – uses Amazon S3
Doug Cutting discusses:
- A brief history of Spark and its rise in popularity across developers and enterprises
- Spark's advantages over MapReduce
- The One Platform Initiative and the roadmap for Spark
- The future of data processing in Hadoop
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration (EDB)
Big Data. Data Science. AI. It's all big business.
Once upon a time we succeeded in these fields by selectively storing, processing and learning from just the right data. This, of course, requires you to know what "the right data" is. We know there are valuable insights in data, so why not store the lot? It's the 21st century equivalent of "there's gold in them thar hills!"
So having spent years stashing away terabytes of your data in PostgreSQL, you want to start learning from that data. Queries. More queries. More complex queries. Lots of real-time queries. Lots of concurrent users. It might be tempting at this point to give up on PostgreSQL and stash your data into a different solution, more suited to purpose. Don't. PostgreSQL can perform very well in HTAP environments and performs even better with a little help.
In this presentation we dive into the current state of the art with regards to PostgreSQL in HTAP environments and expose how hardware acceleration can help squeeze as much knowledge as possible out of your data.
Challenges for running Hadoop on AWS - Advanced (AWS Meetup, Andrei Savu)
Nowadays we have all the tools we need to spin up and tear down clusters of hundreds of nodes in minutes, and this puts more pressure on the tools we use to configure and monitor our applications. The challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look at some of the challenges of creating and managing Hadoop clusters in AWS, discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention, and noisy neighbors), and touch on the future and how we should go about decoupling workload dispatch from cluster lifecycle.
It’s becoming clear that enterprises need more than one cloud. Hybrid enables enterprises to optimize how their business works – public cloud for elasticity and scale, multi-cloud for redundancy and choice, and on-premises for performance and privacy. Cloudera delivers a hybrid cloud solution that works where enterprises work, with the agility, security and governance enterprise IT needs, and the self-service analytics business people and enterprise data professionals demand. In this session, we will talk about how Cloudera helps deliver hybrid solutions for enterprises and will run a hands-on Cloudera PaaS demo to exhibit:
- Altus Environment Setup
- Configure Altus SDX
- Spin-up transient clusters with Altus
- Execute workload on Altus Data Engineering clusters
- Run interactive queries on object store with Altus Data Warehouse
- Job Analytics with Workload Experience Manager (WXM)
Speaker: Junaid Rao, Senior Cloud Sales Engineer, Cloudera
Time and again, research shows organisations are held back in their digital transformation because of a lack of skills. A recent IDC survey shows it's the case for nearly half of all organisations when it comes to specialist big data and data science skills. As an organisation, how do you know you're hiring the right people to close the gap? As an individual, how do you prove you know what you're doing?
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces (Cloudera, Inc.)
You like to use R, and you need to use big data. dplyr, one of the most popular packages for R, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala.
But there can be pitfalls: dplyr works differently with different data sources—and those differences can bite you if you don’t know what you’re doing.
Ian Cook is a data scientist, an R contributor, and a curriculum developer at Cloudera University. In this webinar, Ian will show you exactly what you need to know about sparklyr (from RStudio) and the package implyr (from Cloudera). He will show you how to write dplyr code that works across these different interfaces. And, he will solve mysteries:
Do I need to know SQL to use dplyr?
When is a “tbl” not a “tibble”?
Why is 1 not always equal to 1?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
3 things to learn:
Do I need to know SQL to use dplyr?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
The Cisco Open SDN Controller is a commercial distribution of OpenDaylight that delivers business agility through automation of standards-based network infrastructure.
Built as a highly scalable software-defined networking (SDN) platform, the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
The controller exposes REST APIs that allow other applications to take advantage of the controller's capabilities and unlock the power of the underlying network infrastructure, and Java APIs that allow for the creation of new network services.
This session will present the basic constructs of the controller and the capabilities of the REST and Java APIs, demonstrating how the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
When your databases support mission-critical applications, latency and outages can hurt your business. That’s why you need monitoring and management tools to help you keep your enterprise Postgres servers — and the applications they support — consistently available and consistently fast. In this webinar, you’ll learn:
The various tasks — monitoring, administration, etc. — required to keep a database server working well, and how they differ
Why it’s hard to monitor databases with general-purpose monitoring tools
The main tools available for enterprise Postgres needs
How these solutions differ, and when and why to choose each of them for specific cases
By the end of this session, you will have an understanding of how to avoid downtime and optimize the user experience with database monitoring tools.
This presentation will discuss best practices for designing and building a solid, robust, and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of virtual machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotion, increased stabilization of the entire software stack, High Availability, and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand the design, configuration, and deployment of Hadoop in virtual infrastructures.
Big Data and virtualization are two of the most exciting trends in the industry today. In this session you will learn about the components of Big Data systems, and how real-time, interactive, and distributed processing systems like Hadoop integrate with existing applications and databases. The combination of Big Data systems with virtualization gives Hadoop and other Big Data technologies the key benefits of cloud computing: elasticity, multi-tenancy, and high availability. A new open source project that VMware will announce at the Hadoop Summit will make it easy to deploy, configure, and manage Hadoop on a virtualized infrastructure. We will discuss reference architectures for key Hadoop distributions and the future directions of this new open source project.
Introduction to cloud computing data center and network issues, from the Internet Research Lab at NTU, Taiwan. Offers a definition of cloud computing and a comparison of the traditional IT warehouse with the modern cloud data center (PPT slides available for download). Takes an open-source data center management OS, OpenStack, as an example, and covers the underlying network issues inside a cloud DC.
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premises or on Azure.
Virtual machines are a mainstay in the enterprise, while Apache Hadoop is normally run on bare-metal machines. This talk walks through the convergence of the two and the use of virtual machines for running Apache Hadoop. We describe the results from various tests and benchmarks, which show that the overhead of using VMs is small: a small price to pay for the advantages offered by virtualization. The second half of the talk compares multi-tenancy with VMs against multi-tenancy with Hadoop's Capacity Scheduler. We follow with a comparison of resource management in vSphere and the finer-grained resource management and scheduling in NextGen MapReduce, which supports a general notion of a container (such as a process, JVM, or virtual machine) in which tasks are run. We compare the role of such first-class VM support in Hadoop.
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Management... (VMworld)
VMworld 2013
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London (Hentsū)
Slides from our recent workshop for hedge funds, with a review of cloud grid computing options. Includes live demos tackling 2 TB of full-depth market data using MATLAB on AWS, and Google BigQuery with Datalab.
2. Cloud: Big Shifts in Simplification and Optimization
1. Reduce the Complexity: simplify operations and maintenance
2. Dramatically Lower Costs: redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery: meet and anticipate the needs of the business
3. Infrastructure, Apps and now Data…
[Diagram: build, run, and manage across private and public clouds]
Simplify Infrastructure with Cloud; Simplify App Platform through PaaS
Next Trend: Simplify Data
4. Trend 1/3: New Data Growing at 60% Y/Y
Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030. Yes, you are part of the yotta generation…
[Chart of data sources: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies]
Source: The Information Explosion, 2009
6. Trend 3/3: Value from Data Exceeds Hardware Cost
Value from the intelligence of data analytics now outstrips the cost of hardware.
• Hadoop enables the use of lower cost hardware
• Hardware cost halving every 18 months
[Chart: value vs. cost, big iron at $40k/CPU vs. a commodity cluster at $1k/CPU]
7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
The trend is "not just Hadoop" for big data
• Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
• Unify the infrastructure platform for all
[Diagram: separate SQL, NoSQL, Hadoop, and DSS clusters consolidated onto a unified big data infrastructure (private or public) with a common hardware base]
• Eliminate the hardware/driver/testing phase
• Use the existing team for ordering, diagnosis, and capacity management of the hardware farm
8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
I WANT MY HADOOP CLUSTER NOW!
Instant Cluster Provisioning
• Provision Hadoop clusters instantly
• Automatable using provisioning engines/scripts, e.g. Whirr
9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
Increase Utilization
• Hadoop cluster only uses resources it needs
• Extra resources can be used by other applications when not in use
Eliminate single points of failure
• Use vSphere HA for Namenode and Jobtracker
Use VM Isolation
• Create separate clusters with defensible security
• Enables multiple versions of Hadoop on the same infrastructure
• Extends to Hadoop and Linux Environments
Leverage Resource Management
• Control/assign resources through resource pools
• E.g. Use spare cycles for Hadoop Processing through priority control
10. What? Hadoop in a VM? Really?
Actually, Hadoop performs well in a virtual machine
13. Hadoop Configuration
Distribution
• Cloudera CDH3u0
• Based on Apache open-source 0.20.2
Parameters
• dfs.datanode.max.xcievers=4096
• dfs.replication=2
• dfs.block.size=134217728
• io.file.buffer.size=131072
• mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
• mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
Network topology
• Hadoop uses topology info for reliability and performance
• Multiple VMs per host: each host is a "rack"
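To make the parameter list concrete, here is a minimal sketch of writing the HDFS values above into the cluster's config file; HADOOP_CONF_DIR is assumed to point at the active configuration directory, and the mapred.child.java.opts and io.file.buffer.size settings would go into mapred-site.xml and core-site.xml in the same way:

    # Sketch: persist the HDFS tuning values listed above (CDH3-era property names).
    # Assumes HADOOP_CONF_DIR points at the active configuration directory.
    cat > "$HADOOP_CONF_DIR/hdfs-site.xml" <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
      <property><name>dfs.replication</name><value>2</value></property>
      <property><name>dfs.block.size</name><value>134217728</value></property>
    </configuration>
    EOF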
14. Benchmarks
Derived from test apps included in the distro
Pi
• Direct-exec Monte-Carlo estimation of pi: pi ≈ 4*R/(R+G) (≈ 22/7)
• # map tasks = # logical processors
• 1.68 T samples
TestDFSIO
• Streaming write and read
• 1 TB
• More tasks than processors
Terasort
• 3 phases: teragen, terasort, teravalidate
• 10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
• More tasks than processors
• Exercises CPU, networking, and storage I/O
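For reference, a sketch of how these three benchmarks are typically invoked from the example jars shipped with the distribution; the jar paths are the usual CDH3 locations and the argument values are illustrative, not the exact sizes used in these tests:

    EXAMPLES=/usr/lib/hadoop/hadoop-examples.jar   # location varies by distro
    TESTS=/usr/lib/hadoop/hadoop-test.jar

    hadoop jar "$EXAMPLES" pi 16 1000000                    # Monte-Carlo pi: <maps> <samples per map>
    hadoop jar "$TESTS" TestDFSIO -write -nrFiles 100 -fileSize 10000   # streaming write, 100 x 10 GB = 1 TB
    hadoop jar "$TESTS" TestDFSIO -read  -nrFiles 100 -fileSize 10000   # streaming read of the same files
    hadoop jar "$EXAMPLES" teragen 10000000000 /tera/in     # 10B x 100-byte records = 1 TB
    hadoop jar "$EXAMPLES" terasort /tera/in /tera/out
    hadoop jar "$EXAMPLES" teravalidate /tera/out /tera/report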
15. Performance of Hadoop for Several Workloads
[Bar chart: ratio of time taken relative to native, lower is better (y-axis 0 to 1.2), with series for 1 VM and 2 VMs]
16. Architecting Hadoop as a Service using Virtualization
Goals
• Make it fast and easy to provision new Hadoop clusters on demand
• Leverage virtual machines to provide isolation (esp. for multi-tenant)
• Optimize Hadoop's performance based on virtual topologies
• Make the system reliable based on virtual topologies
Leveraging Virtualization
• Elastic scale in/out
• Use high availability to protect the namenode/job tracker
• Resource controls and sharing: re-use underutilized memory and CPU
• Prioritize workloads: limit or guarantee resource usage in a mixed environment
17. Provisioning
Leverage the vSphere APIs to auto-deploy a cluster
• Whirr, HOD, or custom using Ruby, Chef, etc.
Use linked clones to rapidly fork many nodes
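As an illustration of the Whirr route, a minimal sketch; the cluster name, instance counts, and provider are made-up example values, and the credentials are taken from the environment:

    # Sketch: provision a small Hadoop cluster with Apache Whirr.
    cat > hadoop.properties <<'EOF'
    whirr.cluster-name=demo-hadoop
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    EOF
    whirr launch-cluster --config hadoop.properties    # bring the cluster up
    whirr destroy-cluster --config hadoop.properties   # tear it down when done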
18. Fast Provisioning
From a "seed" node to a cluster
• Thin provisioning: 60 GB => 3.5 GB
• Linked clone: ~6 seconds
19. SAN, NAS or Local Disk?
Shared Storage (SAN or NAS)
• Easy to provision
• Automated cluster rebalancing
Hybrid Storage
• SAN for boot images, VMs, and other workloads
• Local disk for HDFS
• Scalable bandwidth, lower cost/GB
[Diagram: hosts running Hadoop VMs alongside other VMs, with HDFS on local disks and the other workloads on shared storage]
20. Enable Automatic Rack Awareness through vSphere
• Important for a robust Hadoop cluster
• Automatic network topology detection, an important vSphere feature
• The rack script is generated automatically
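For context, a hand-written sketch of the kind of topology script such tooling generates. Hadoop invokes the script named by topology.script.file.name with node addresses as arguments and expects one rack path per output line; the addresses and host assignments below are illustrative:

    #!/bin/bash
    # Map each DataNode address to the physical host ("rack") its VM runs on.
    declare -A RACK=(
      [10.0.0.11]=/esx-host-01  [10.0.0.12]=/esx-host-01
      [10.0.0.21]=/esx-host-02  [10.0.0.22]=/esx-host-02
    )
    for node in "$@"; do
      echo "${RACK[$node]:-/default-rack}"   # unknown nodes fall back to a default rack
    done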
21. Multi-tenant: Share a Cluster or Not?
Shared big cluster: high performance, large scale, pre-job provisioning
vs. isolated small clusters: secure, flexible, post-job provisioning
Use a combination, as customers' requirements differ
22. Elastic Hadoop Cluster
Traditional Hadoop cluster
• Easy to scale out: fast-provision new Hadoop nodes and join them into the existing cluster
• Hard to scale in:
    while (cluster is too large) {
      choose node k;
      kill node k;
      wait until k's data blocks are recovered;
      if necessary, rebalance the cluster;
    }
Elastic Hadoop cluster
[Diagram: NameNode and JobTracker alongside normal nodes (TaskTracker + DataNode) and elastic nodes (TaskTracker only)]
23. Replica Placement
Second replica
• Different rack
• Rack awareness required
Third replica
• Same rack, different physical host
• Nodes share a host (in a virtualized environment)
25. Performance
Create more, smaller VMs
• Makes Hadoop scale better
• Allows easier/faster adjustment of VM packing across hosts by vSphere (including through DRS)
Sizing/configuration of storage is critical
• Plan on ~50 MB/s of storage bandwidth per core
• SANs are typically configured by default for IOPS, not bandwidth
• Ensure SAN ports/switch topology allow the required aggregate bandwidth
• Performance of the backend storage should be tested/sized
• Local disks will give ~100-140 MB/s per disk: pick the correct controller
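To make the sizing rule concrete, a back-of-envelope check using only the figures above (the 16-core host is an assumed example):

    16 cores/host x 50 MB/s per core     ≈ 800 MB/s of storage bandwidth per host
    800 MB/s / ~120 MB/s per local disk  ≈ 7 local disks per host (or equivalent SAN bandwidth)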
26. Summary
Hadoop does work well in a virtual environment
Plan a virtual cluster, and enable other big-data solutions on the same infrastructure
Leverage the recipes to automate your configuration and deployment