© 2009 VMware Inc. All rights reserved
vSphere Big Data Extensions Deep Dive
Lu Guang (路广)
Senior Manager, Big Data R&D
VMware China R&D Center
Published in: Technology, at the VMware Big Data Forum.

Speaker notes:
  • Simple description of the key modules:
    The web service, running on Tomcat, is the central controller of the cluster management workflow and leverages the Spring Batch library.
    The VM placement algorithm and the disk placement policy are processed in the VM placement module in the web service layer.
    Serengeti talks to VC through the VC adapter layer, which maintains several VC sessions to execute different VC tasks and listens for VC events.
    Serengeti is distro-neutral, so the Hadoop software is installed and configured after the VMs are created. The open-source projects Chef and Ironfan are leveraged to install and configure the Hadoop services. Chef is a popular distributed software configuration tool.
    The Runtime Manager is responsible for Hadoop cluster elasticity control. Serengeti talks to the VHM through RabbitMQ.
  • The Chef Server and Package Server are currently deployed in the same VM as the Serengeti Server. They can be deployed on separate VMs to support large-scale clusters (200+ nodes).
  • Step 4: the Chef client connects to the Chef Server and downloads cookbooks through the REST API.
  • Chef provides a flexible software deployment and configuration mechanism, so it is easy to add more services to Serengeti.
    During VM placement, several performance-improving configuration settings are embedded based on host and VM CPU/memory size.
  • At this stage, you will repeatedly configure and reconfigure your cluster to tune for optimal results. With Serengeti, this process is very simple. Taking the JSON spec file shown earlier, you can specify the various Hadoop attributes (grouped by configuration file) and apply the new configuration to the cluster. Serengeti automatically changes the Hadoop cluster according to your specification, and the changes are propagated to the entire cluster; you do not need to reconfigure one node at a time.

  • Sample Hadoop configuration:
    {
      … …
      // we suggest running convert-hadoop-conf.rb to generate the "configuration" section and pasting the output here
      "configuration": {
        "hadoop": {
          "core-site.xml": {
            // check for all settings at http://hadoop.apache.org/docs/stable/core-default.html
            // note: any value (int, float, boolean, string) must be enclosed in double quotes; here is a sample:
            // "io.file.buffer.size": "4096"
          },
          "hdfs-site.xml": {
            // check for all settings at http://hadoop.apache.org/docs/stable/hdfs-default.html
          },
          "mapred-site.xml": {
            // check for all settings at http://hadoop.apache.org/docs/stable/mapred-default.html
          },
          "hadoop-env.sh": {
            // "HADOOP_HEAPSIZE": "",
            // "HADOOP_NAMENODE_OPTS": "",
            // "HADOOP_DATANODE_OPTS": "",
            // "HADOOP_SECONDARYNAMENODE_OPTS": "",
            // "HADOOP_JOBTRACKER_OPTS": "",
            // "HADOOP_TASKTRACKER_OPTS": "",
            // "HADOOP_CLASSPATH": "",
            // "JAVA_HOME": "",
            // "PATH": ""
          },
          "log4j.properties": {
            // "hadoop.root.logger": "INFO,RFA",
            // "log4j.appender.RFA.MaxBackupIndex": "10",
            // "log4j.appender.RFA.MaxFileSize": "100MB",
            // "hadoop.security.logger": "DEBUG,DRFA"
          },
          "fair-scheduler.xml": {
            // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
            // "text": "the full content of fair-scheduler.xml in one line"
          },
          "capacity-scheduler.xml": {
            // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
          },
          "mapred-queue-acls.xml": {
            // check for all settings at http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
            // "mapred.queue.queue-name.acl-submit-job": "",
            // "mapred.queue.queue-name.acl-administer-jobs": ""
          }
        }
      }
    }
  • Which disk placement rule is applied is not user-configurable.
  • The separated system disk can be configured in the cluster spec at the node group level as follows:
      dsNames4System: <ds name used to put the system disk>
      dsNames4Data: <ds name used to put the data disk>
    If these attributes are not set, default values will be used.
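    As a sketch, a node group carrying these attributes might look like the following (the datastore names are hypothetical, and nesting the attributes under "storage" is an assumption based on the note above, not a confirmed schema):

    ```json
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instance_num": 5,
      "storage": {
        "type": "LOCAL",
        "dsNames4System": ["local-ds-system"],
        "dsNames4Data": ["local-ds-data"]
      }
    }
    ```

    When omitted, system and data disks fall back to the default placement behavior described above.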
Slide 1 (title): vSphere Big Data Extensions Deep Dive. Lu Guang (路广), Senior Manager of Big Data R&D, VMware China R&D Center.

Slide 2: Get your Hadoop cluster in minutes
• Manual process (Hadoop installation and configuration, network configuration, OS installation, server preparation): costs days.
• Automated by Serengeti on vSphere with best practices: a fully automated process that stands up a Hadoop/HBase cluster from scratch in 10 minutes, with roughly 1/1000 of the human effort and minimal Hadoop operations knowledge.

Slide 3: Serengeti deployment architecture
• Serengeti is packaged as a virtual appliance, which can be easily deployed on vCenter (VC).
• Serengeti works as a VC extension and establishes an SSL connection with VC.
• Serengeti clones VMs from a template and controls/configures the VMs through VC.

Slide 4: Evolution of Hadoop on VMs: data/compute separation
• Hadoop in a VM (current Hadoop: combined storage/compute in each slave node): VM lifecycle determined by the DataNode; limited elasticity.
• Separate storage: separate compute from data; remove the elasticity constraint imposed by the DataNode; elastic compute; raise utilization.
• Separate compute clusters (tenants T1, T2): separate virtual compute; a compute cluster per tenant; stronger VM-grade security and resource isolation.

Slide 5: Elastic scalability and multi-tenancy
• Deploy separate compute clusters for different tenants sharing HDFS.
• Commission/decommission compute nodes according to priority and available resources.
• Example: a production recommendation engine and an experimentation cluster each get their own compute layer (compute VMs with their own JobTracker) in dynamic resource pools, sharing one data layer on VMware vSphere + Serengeti.

Slide 6: Serengeti architecture diagram

Slide 7: Rapid deployment of a Hadoop/HBase cluster with Serengeti
• Step 1: Deploy the Serengeti virtual appliance on vSphere.
• Step 2: A few clicks to stand up the Hadoop cluster. Done.

Slide 8: Customizing your Hadoop/HBase cluster with Serengeti
• Choice of distros
• Storage configuration: choice of shared storage or local disk
• Resource configuration
• High availability option
• Number of nodes
Sample spec:
    "distro": "apache",
    "groups": [
      {
        "name": "master",
        "roles": ["hadoop_namenode", "hadoop_jobtracker"],
        "storage": {"type": "SHARED", "sizeGB": 20},
        "instance_type": "MEDIUM",
        "instance_num": 1,
        "ha": true
      },
      {
        "name": "worker",
        "roles": ["hadoop_datanode", "hadoop_tasktracker"],
        "instance_type": "SMALL",
        "instance_num": 5,
        "ha": false
    …

Slide 9: Cluster creation workflow: VM creation
• The UI or CLI sends a create-cluster request carrying a cluster spec:
      { "groups": [ { "name": …, "roles": …, "placementPolicies": { } } ] }
• 1) The Serengeti web service analyzes the spec. 2) It queries VC for resources. 3) The VM placement calculation decides the layout of nodes (e.g. DN/TT per host). 4) VMs are created on each host from the template VM: clone VM, add disks, configure VM.

Slide 10: Workflow: Hadoop package deployment
• 1) Admin downloads Hadoop tarballs or creates a yum repo on the Package Server.
• 2) Admin configures tarball URLs or yum repo URLs for each distro in the manifest file.
• 3) Admin runs 'cluster create' to create a cluster for a Hadoop distro; the tarball URLs or yum repo URLs are saved in the Chef Server.
• 4) The Serengeti Server remotely SSHes to the Hadoop nodes and executes chef-client.
• 5) chef-client reads the tarball URLs or yum repo URLs from the Chef Server, then downloads and extracts the Hadoop tarballs to /usr/lib/hadoop/, or yum-installs RPMs from the Package Server.
• 6) Hadoop configuration files are generated on all nodes.
• 7) Hadoop daemons are started on all nodes simultaneously, with synchronization between NN, DNs, JT, and TTs.

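Step 2 above edits a distro manifest file. As an illustrative sketch only (the field names, version, and path here are assumptions, not the exact Serengeti manifest schema), an entry might look like:

```json
[
  {
    "name": "apache",
    "version": "1.2.1",
    "packages": [
      {
        "roles": ["hadoop_namenode", "hadoop_jobtracker",
                  "hadoop_datanode", "hadoop_tasktracker"],
        "tarball": "apache/1.2.1/hadoop-1.2.1.tar.gz"
      }
    ]
  }
]
```

Per steps 3 and 5, whatever URLs the manifest carries are saved into the Chef Server at 'cluster create' time and consumed by chef-client on each node.
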
Slide 11: Cluster creation workflow: software installation
Cluster spec for Ironfan (software bootstrap request):
    "cluster_data": {
      "rack_topology_policy": "NONE",
      "groups": [
        {
          "name": "ComputeMaster",
          "roles": ["hadoop_jobtracker"],
          "instances": [
            { "name": "sample-ComputeMaster-0", …… }
    }
    "distro_package_repos": ["http://<server ip>/mapr/2.1.3/mapr-m5.repo"],
    ……
Steps:
• 1) The Serengeti web service analyzes the spec.
• 2) The Ironfan Thrift service creates Chef nodes on the Chef Server.
• 3) It SSHes to each node (e.g. DN1, TT1) to start chef-client.
• 4) chef-client logs in to the Chef Server.
• 5) chef-client downloads the cookbooks (DataNode cookbook, TaskTracker cookbook) through the REST API and executes them.
• 6) The cookbooks download bits (Hadoop binary, Pig, Hive, etc.) from the Package Server.

Slide 12: Cluster creation workflow: software installation (continued)
• 7) chef-client gets properties from the Chef Server through the REST API.
• 8) chef-client configures Hadoop and starts the Hadoop daemons, with synchronization between NN, DNs, JT, and TTs.
• The Ironfan Thrift service gets the bootstrap status and persists it; the Serengeti web service answers bootstrap status queries.
• Note: software installation on all nodes is executed simultaneously.

Slide 13: Configure/reconfigure Hadoop with ease by Serengeti
Modify the Hadoop cluster configuration from Serengeti:
• Use the "configuration" section of the JSON spec file.
• Specify Hadoop attributes in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, and log4j.properties.
• Apply the new Hadoop configuration using the edited spec file:
      "configuration": {
        "hadoop": {
          "core-site.xml": {
            // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html
          },
          "hdfs-site.xml": {
            // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html
          },
          "mapred-site.xml": {
            // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html
            "io.sort.mb": "300"
          },
          "hadoop-env.sh": {
            // "HADOOP_HEAPSIZE": "",
            // "HADOOP_NAMENODE_OPTS": "",
            // "HADOOP_DATANODE_OPTS": "",
            …
      > cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json

Slide 14: Workflow: tuning the Hadoop configuration
• 1) Admin runs 'cluster export' to export the cluster spec, then sets Hadoop conf params in the spec.
• 2) Admin runs 'cluster config' to apply the new Hadoop configuration to the whole cluster or to a node group of the cluster.
• 3) The new Hadoop configuration is saved into the Chef Server.
• 4) The Serengeti Server remotely SSHes to the Hadoop nodes and executes chef-client.
• 5) chef-client reads the Hadoop configuration from the Chef Server.
• 6) New Hadoop configuration files are generated on all nodes.
• 7) The corresponding Hadoop daemons are restarted on all nodes simultaneously to apply the new configuration.

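The spec edited in step 1 carries its overrides in the "configuration" section shown on the previous slide. For example, to tune the map-side sort buffer (the property names are standard Hadoop 1.x settings; the values are only illustrative):

```json
"configuration": {
  "hadoop": {
    "mapred-site.xml": {
      "io.sort.mb": "300",
      "mapred.tasktracker.map.tasks.maximum": "4"
    }
  }
}
```

Running 'cluster config' with this edited spec then propagates the change to every node and restarts the corresponding daemons, per steps 2 through 7.
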
Slide 15: Rolling operation
• A rolling operation works on one node at a time, so it does not impact job execution on the whole cluster.
• Supported functions: cluster scale up/down; cluster fix.
• Workflow: for each node, the workflow is similar to the whole-cluster operation; a node starts only after the previous node finishes all steps; nodes are restarted during the operation.

Slide 16: One click to scale out your cluster with Serengeti

Slide 17: Easily scale out using Serengeti
• Use case: when the cluster capacity is not big enough and new hardware is available.
• Through Serengeti: one click in the UI scales out the cluster, adding worker nodes across hosts on the virtualization platform alongside the existing NN and JT.

Slide 18: VC adapter
• Leverages VLSI to connect to VC.
• Keeps a VC object cache to improve VC query performance.
• Listens for VC events (VM power on, VM power off, VM creation, etc.); if a VM's status is changed in VC outside of Serengeti, 'cluster list' immediately shows the status change.

Slide 19: VM placement: fine control of a data/compute separation cluster
• Constrain the number of nodes on each host.
• Group association: put compute nodes close to data nodes.

Slide 20: VM placement: rack-aware placement
• Balance the number of nodes across multiple racks.

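Combining the host, rack, and group constraints from the last two slides, a node group's "placementPolicies" section (the empty stanza in the slide 9 spec) can be sketched as follows; the field names and values here are recalled from the BDE cluster spec and should be treated as assumptions:

```json
"placementPolicies": {
  "instancePerHost": 2,
  "groupRacks": {
    "type": "ROUNDROBIN",
    "racks": ["rack1", "rack2"]
  },
  "groupAssociations": [
    { "reference": "data", "type": "STRICT" }
  ]
}
```

Here "instancePerHost" caps the number of nodes per host, "groupRacks" balances nodes across the listed racks, and "groupAssociations" places this compute group close to the node group named "data".
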
Slide 21: Disk placement
• Even split on local disks (DN and CN VMs on each host).
• Aggregate on shared storage.

Slide 22: Separated system disk
• The virtual system disk is kept separate from the data disks on each DN/CN node.
• Variant 1: separated virtual system disks on specified local storage.
• Variant 2: separated virtual system disks on shared storage.

Slide 23: VHM: example architecture (diagram)
• Each ESX host runs Hadoop compute VMs (TTs), Hadoop HDFS data VMs (NN, DNs), and non-Hadoop VMs, backed by local disks and SAN/NAS; the JT runs in its own VM.
• The VirtualCenter management server runs DRS and the VHM.
• Legend: JT = JobTracker, TT = TaskTracker, NN = NameNode, VHM = Virtual Hadoop Manager.

Slide 24: Virtual Hadoop Manager (diagram)
• The VHM sits between Serengeti, vCenter Server, and the JobTracker.
• From the JobTracker it reads state and stats (slots used, pending work) and sends commands (decommission, recommission TaskTrackers).
• From vCenter it reads stats and VM configuration and issues manual/auto power on/off actions.
• The VHM's algorithms take Serengeti configuration, cluster configuration, VC state and stats, and Hadoop state and stats as inputs, and produce VC actions and Hadoop actions.

Slide 25: Q&A
