3. vSphere Big Data Extensions

© 2009 VMware Inc. All rights reserved
vSphere Big Data Extensions Deep Dive
Lu Guang (路广)
Senior Manager, Big Data R&D
VMware China R&D Center

Get your Hadoop cluster in minutes
Manual process, takes days:
• Server preparation
• OS installation
• Network configuration
• Hadoop installation and configuration
Fully automated process, 10 minutes to get a Hadoop/HBase cluster from scratch:
• 1/1000 of the human effort
• Minimal Hadoop operations knowledge required
• Automated by Serengeti on vSphere with best practices

Serengeti deployment architecture
• Serengeti is packaged as a virtual appliance that can be easily deployed on VC.
• Serengeti works as a VC extension and establishes an SSL connection with VC.
• Serengeti clones VMs from a template and controls and configures them through VC.

Evolution of Hadoop on VMs – Data/Compute separation
Current Hadoop: combined storage/compute on each slave node
Hadoop in VM
• VM lifecycle determined by Datanode
• Limited elasticity
Separate Storage
• Separate compute from data
• Remove the elasticity constraint imposed by Datanode
• Elastic compute
• Raise utilization
Separate Compute Clusters
• Separate virtual compute
• Compute cluster per tenant
• Stronger VM-grade security and resource isolation
[Diagram: storage and compute VMs in each of the three stages; the last stage shows compute clusters for tenants T1 and T2 over shared storage.]

Elastic Scalability & Multi-Tenancy
Deploy separate compute clusters for different tenants sharing HDFS.
Commission/decommission compute nodes according to priority and
available resources
[Diagram: two tenants, Experimentation and Production (recommendation engine), each with its own Job Tracker and a set of Compute VMs in the compute layer. The Compute VMs are drawn from a dynamic resource pool and share one data layer. Everything runs on VMware vSphere + Serengeti.]

Serengeti architecture diagram

Rapid Deployment of a Hadoop/HBase Cluster with Serengeti
Step 1: Deploy the Serengeti virtual appliance on vSphere.
Step 2: A few clicks to stand up a Hadoop cluster.
Done

Customizing your Hadoop/HBase cluster with Serengeti
• Choice of distros
• Storage configuration
  - Choice of shared storage or local disk
• Resource configuration
• High availability option
• Number of nodes
…
"distro":"apache",
"groups":[
  { "name":"master",
    "roles":[
      "hadoop_namenode",
      "hadoop_jobtracker"],
    "storage": {
      "type": "SHARED",
      "sizeGB": 20},
    "instance_type":"MEDIUM",
    "instance_num":1,
    "ha":true},
  { "name":"worker",
    "roles":[
      "hadoop_datanode",
      "hadoop_tasktracker"
    ],
    "instance_type":"SMALL",
    "instance_num":5,
    "ha":false
…
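The spec fragment above can also be generated programmatically. The following is a minimal Python sketch, assuming only the field names visible on the slide ("distro", "groups", "name", "roles", "storage", "instance_type", "instance_num", "ha"). The helper names node_group and build_spec are hypothetical, and the parts of the spec elided by "…" are not reproduced.

```python
import json


def node_group(name, roles, instance_type, instance_num, ha, storage=None):
    """Build one node group entry (hypothetical helper, not a Serengeti API)."""
    group = {
        "name": name,
        "roles": roles,
        "instance_type": instance_type,
        "instance_num": instance_num,
        "ha": ha,
    }
    if storage is not None:
        group["storage"] = storage
    return group


def build_spec():
    """Assemble the master/worker layout shown on the slide."""
    return {
        "distro": "apache",
        "groups": [
            node_group(
                "master",
                ["hadoop_namenode", "hadoop_jobtracker"],
                "MEDIUM", 1, True,
                storage={"type": "SHARED", "sizeGB": 20},
            ),
            node_group(
                "worker",
                ["hadoop_datanode", "hadoop_tasktracker"],
                "SMALL", 5, False,
            ),
        ],
    }


spec = build_spec()
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Generating the file this way keeps values such as "MEDIUM" and "SMALL" quoted, so the result is always valid JSON.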

Cluster creation workflow – VM creation
[Diagram: a create-cluster request is sent from the UI or CLI to the Serengeti Web Service together with a cluster spec. The spec lists groups, each with a name, roles, and placementPolicies. The figure numbers four steps; the labeled actions are: query resources from VC, analyze the spec, calculate VM placement, and create VMs. VM creation clones the template VM onto each host, adds disks, and configures the VM, producing DataNode (DN) and TaskTracker (TT) VMs.]

Workflow - Hadoop Package Deployment
Components: Admin, Serengeti Server, Chef Server, Package Server, Hadoop Nodes
1) Download Hadoop tarballs or create a yum repo on the Package Server.
2) Configure tarball URLs or yum repo URLs for each distro in the manifest file.
3) Run 'cluster create' to create a cluster for a Hadoop distro; save the tarball URLs or yum repo URLs in the Chef Server.
4) Remotely SSH to the Hadoop nodes and execute chef-client.
chef-client:
5) Read the tarball URLs or yum repo URLs from the Chef Server, then download and extract the Hadoop tarballs to /usr/lib/hadoop/ or yum install the RPMs from the Package Server.
6) Generate Hadoop configuration files on all nodes.
7) Start Hadoop daemons on all nodes simultaneously, with synchronization between NN, DNs, JT, and TTs.
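Step 7 starts every daemon at once but still orders them through synchronization. A minimal Python sketch of that idea follows. It assumes a dependency order that the slides do not spell out: DataNodes and the JobTracker wait for the NameNode, and TaskTrackers wait for the JobTracker. The function start_cluster and the node names are hypothetical; the real synchronization happens inside the Chef cookbooks and is not shown here.

```python
import threading


def start_cluster(datanodes, tasktrackers):
    """Launch all daemons concurrently; dependents block until their master is up."""
    started = []                          # order in which daemons came up
    lock = threading.Lock()
    namenode_up = threading.Event()
    jobtracker_up = threading.Event()

    def start(name, wait_for=None, signal=None):
        if wait_for is not None:
            wait_for.wait()               # block until the required daemon is up
        with lock:
            started.append(name)          # stand-in for launching the real daemon
        if signal is not None:
            signal.set()                  # announce that this daemon is up

    threads = [
        threading.Thread(target=start, args=("namenode", None, namenode_up)),
        threading.Thread(target=start, args=("jobtracker", namenode_up, jobtracker_up)),
    ]
    threads += [threading.Thread(target=start, args=(dn, namenode_up)) for dn in datanodes]
    threads += [threading.Thread(target=start, args=(tt, jobtracker_up)) for tt in tasktrackers]

    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return started


order = start_cluster(["dn1", "dn2"], ["tt1", "tt2"])
print(order)
```

All threads are started together, so no node waits for an operator; the events only enforce the assumed master-before-worker ordering.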

Cluster creation workflow – Software installation
Cluster Spec for Ironfan (sent with the software bootstrap request):
"cluster_data": {
  "rack_topology_policy": "NONE",
  "groups": [
    {
      "name": "ComputeMaster",
      "roles": [
        "hadoop_jobtracker"
      ],
      "instances": [
        {
          "name": "sample-ComputeMaster-0",
          ……}
    }
  "distro_package_repos": [
    "http://<server ip>mapr/2.1.3/mapr-m5.repo"
  ],
  ……
Components: Serengeti Web Service, Ironfan Thrift Service, Chef Server, Package Server, and a Chef Client on each node (for example DN1 and TT1)
1) Analyze the spec.
2) Create Chef nodes.
3) SSH to the nodes to start the chef client.
4) Log in to the Chef Server and download cookbooks (REST API).
5) Execute cookbooks (DataNode cookbook, TaskTracker cookbook).
6) Download bits from the Package Server (Hadoop binary, Pig, Hive, etc.).

Cluster creation workflow – Software installation - continued
Components: Serengeti Web Service, Ironfan Thrift Service, Chef Server, and a Chef Client on each node (for example DN1 and TT1)
7) Get properties (REST API).
8) Configure Hadoop and start Hadoop daemons with synchronization between NN, DNs, JT, and TTs.
Bootstrap status: the status is persisted during bootstrap, and the Serengeti Web Service retrieves it through bootstrap status queries.
Note: Software installation on all nodes is executed simultaneously.

Configure/reconfigure Hadoop with ease by Serengeti
Modify Hadoop cluster configuration from Serengeti
• Use the "configuration" section of the JSON spec file
• Specify Hadoop attributes in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, log4j.properties
• Apply new Hadoop configuration using the edited spec file
"configuration": {
  "hadoop": {
    "core-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html
    },
    "hdfs-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html
    },
    "mapred-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html
      "io.sort.mb": "300"
    },
    "hadoop-env.sh": {
      // "HADOOP_HEAPSIZE": "",
      // "HADOOP_NAMENODE_OPTS": "",
      // "HADOOP_DATANODE_OPTS": "",
…
> cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
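Each *-site.xml entry in the "configuration" section ends up as a Hadoop site file made of name/value property elements. The Python sketch below shows that mapping for illustration only. It is not how Serengeti or its Chef cookbooks render the files, and the function name render_site_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Mirrors the "configuration" section of the spec, without the elided parts.
configuration = {
    "hadoop": {
        "core-site.xml": {},
        "hdfs-site.xml": {},
        "mapred-site.xml": {"io.sort.mb": "300"},
    }
}

# Only the XML site files are rendered; hadoop-env.sh and log4j.properties
# use other formats and are left out of this sketch.
rendered = {
    filename: render_site_xml(props)
    for filename, props in configuration["hadoop"].items()
    if filename.endswith(".xml")
}
print(rendered["mapred-site.xml"])
```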

Workflow - Tuning Hadoop Configuration
Components: Admin, Serengeti Server, Chef Server, Hadoop Nodes
1) Run 'cluster export' to export the cluster spec, then set the Hadoop configuration parameters in the spec.
2) Run 'cluster config' to apply the new Hadoop configuration to the whole cluster or to a node group of the cluster.
3) Save the new Hadoop configuration into the Chef Server.
4) Remotely SSH to the Hadoop nodes and execute chef-client.
chef-client:
5) Read the Hadoop configuration from the Chef Server.
6) Generate new Hadoop configuration files on all nodes.
7) Restart the corresponding Hadoop daemons on all nodes simultaneously to apply the new configuration.

Rolling operation
A rolling operation works on one node at a time, so it does not disrupt job execution across the whole cluster.
Supported functions:
• Cluster scale up/down
• Cluster fix
Workflow
• The workflow for each node is similar to the whole-cluster operation.
• The next node starts only after the current node has finished all steps.
• Each node is restarted during the operation.
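The one-node-at-a-time rule can be sketched in a few lines of Python. This is an illustrative sketch only: rolling_apply, the step names, and the node names are hypothetical, and the step bodies are stand-ins for the real per-node workflow.

```python
def rolling_apply(nodes, steps):
    """Run every step on one node before touching the next node."""
    log = []
    for node in nodes:                    # strictly one node at a time
        for step_name, action in steps:
            action(node)
            log.append((node, step_name))
    return log


state = {}


def reconfigure(node):
    state[node] = "configured"            # stand-in for applying the change


def restart(node):
    state[node] = "restarted"             # stand-in for restarting the node


steps = [("reconfigure", reconfigure), ("restart", restart)]
log = rolling_apply(["worker-0", "worker-1", "worker-2"], steps)
print(log)
```

Because the outer loop is over nodes, at most one node is out of service at any moment, which is what keeps running jobs unaffected.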

One click to scale out your cluster with Serengeti

Easily scale out using Serengeti
• Use case:
  - When the cluster capacity is not big enough
  - New hardware is available
• Through Serengeti:
  - One click in the UI to scale out the cluster
[Diagram: NameNode (NN), JobTracker (JT), and worker VMs running on hosts on the virtualization platform.]

VC adapter
Leverage VLSI to connect to VC
Keep a VC object cache to improve VC query performance
Listen for VC events
• VM power on, VM power off, VM creation, etc.
• If a VM's status is changed in VC outside of Serengeti, cluster list immediately shows the change
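The combination of an object cache and event listening can be sketched as follows. This Python sketch does not use VLSI or any real vSphere API: the class VcObjectCache, the fetch callable, and the on_event hook are all hypothetical stand-ins for the adapter's query and event paths.

```python
class VcObjectCache:
    """Cache VM state; events update entries so reads skip the VC query."""

    def __init__(self, fetch):
        self._fetch = fetch               # hypothetical callable that queries VC
        self._cache = {}
        self.queries = 0                  # number of real VC queries issued

    def get(self, vm_id):
        if vm_id not in self._cache:
            self.queries += 1
            self._cache[vm_id] = self._fetch(vm_id)
        return self._cache[vm_id]

    def on_event(self, vm_id, new_state):
        """Called by the event listener, e.g. on VM power on or power off."""
        self._cache[vm_id] = new_state


cache = VcObjectCache(fetch=lambda vm_id: "poweredOn")
first = cache.get("vm-1")
second = cache.get("vm-1")                # served from the cache
cache.on_event("vm-1", "poweredOff")      # change made outside Serengeti
third = cache.get("vm-1")
print(first, second, third, cache.queries)
```

Updating the cache from events is what lets a status change made directly in VC show up without another query.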

VM placement - Fine control of data/compute (DC) separation clusters
Constrain the number of nodes on each host
Group association:
• Put compute nodes close to data nodes

VM placement - Rack aware placement
Balance number of nodes across multiple racks
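One simple way to balance nodes across racks is round-robin assignment. The slides do not describe the algorithm Serengeti actually uses, so the Python sketch below is a stand-in; place_across_racks and the rack and node names are hypothetical.

```python
def place_across_racks(node_names, racks):
    """Assign nodes to racks round-robin so rack sizes differ by at most one."""
    placement = {rack: [] for rack in racks}
    for i, node in enumerate(node_names):
        placement[racks[i % len(racks)]].append(node)
    return placement


placement = place_across_racks(
    [f"worker-{i}" for i in range(5)],
    ["rack1", "rack2"],
)
print(placement)
```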

Disk placement
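• Even split on local disks
• Aggregate on shared storage
[Diagram: two hosts, each running a DN and a CN; one shows the data disks split evenly across local disks, the other shows them aggregated on shared storage.]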
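The two layouts differ only in how the requested capacity is divided. The Python sketch below illustrates the arithmetic under the assumption that capacity is requested in whole GB; split_evenly, aggregate, and the datastore names are hypothetical.

```python
def split_evenly(total_gb, datastores):
    """Spread total_gb across datastores; leftover GBs go to the first ones."""
    base, extra = divmod(total_gb, len(datastores))
    return {
        ds: base + (1 if i < extra else 0)
        for i, ds in enumerate(datastores)
    }


def aggregate(total_gb, datastore):
    """Place the whole capacity on one shared datastore."""
    return {datastore: total_gb}


local_layout = split_evenly(100, ["local-ds-1", "local-ds-2", "local-ds-3"])
shared_layout = aggregate(100, "shared-ds")
print(local_layout, shared_layout)
```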

Separated system disk
• Separated virtual system disks on specified local storage
• Separated virtual system disks on shared storage
[Diagram: two hosts, each running a DN and a CN, with the system disk kept separate from the data disks.]

VHM: Example Architecture
[Diagram: three ESX hosts managed by a VirtualCenter Management Server with DRS. Each host runs a DATA VM (Hadoop HDFS VMs) and several TT VMs (Hadoop Compute VMs); the JT, NN, and VHM run alongside them. Other labels: Local Disks, SAN/NAS, Non-Hadoop VMs.]
JT: JobTracker
TT: TaskTracker
NN: NameNode
VHM: Virtual Hadoop Manager

Virtual Hadoop Manager
[Diagram: the Virtual Hadoop Manager (VHM) sits between Serengeti, the Job Tracker, and vCenter Server.]
• Job Tracker to VHM: state and stats (slots used, pending work)
• VHM to Job Tracker: commands (decommission, recommission) for the Task Trackers
• vCenter Server (vCenter DB) to VHM: stats and VM configuration
• VHM to vCenter Server: power on/off
• Serengeti to VHM: configuration (manual/auto)
• Inside VHM: VC state and stats, Hadoop state and stats, and the cluster configuration feed the algorithms, which produce VC actions and Hadoop actions
