VMworld 2013
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
The Cisco Open SDN Controller is a commercial distribution of OpenDaylight that delivers business agility through automation of standards-based network infrastructure.
Built as a highly scalable software-defined networking (SDN) platform, the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
The controller exposes REST APIs that allow other applications to take advantage of the controller's capabilities and unlock the power of the underlying network infrastructure, and Java APIs that allow for the creation of new network services.
This session will present the basic constructs of the controller and the capabilities of the REST and Java APIs, demonstrating how the Open SDN Controller improves service delivery and reduces operating costs.
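As a sketch of what calling the controller's REST APIs might look like, the snippet below builds an authenticated request against a RESTCONF topology endpoint in the OpenDaylight style. The hostname, port, and credentials are illustrative assumptions, not values from this session.

```python
# Hypothetical sketch: querying an OpenDaylight-based controller's RESTCONF
# API for the operational network topology. Host, port, and credentials
# are assumptions for illustration only.
import base64
import urllib.request

def topology_request(host: str, port: int = 8181,
                     user: str = "admin",
                     password: str = "admin") -> urllib.request.Request:
    """Build an authenticated GET request for the operational topology."""
    url = (f"http://{host}:{port}"
           "/restconf/operational/network-topology:network-topology")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })

req = topology_request("controller.example.com")
print(req.full_url)
# urllib.request.urlopen(req) would return the topology as JSON
# when pointed at a live controller.
```

The request is only constructed here, not sent; against a real deployment the same pattern applies to the other REST resources the controller exposes.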
Introduction to HDInsight, the Hadoop on Windows Azure service, including using the interactive console with JavaScript and running WordCount via other methods (Streaming, Hive, etc.)
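The WordCount example mentioned above can be sketched in the Hadoop Streaming style, where a mapper emits (word, 1) pairs and a reducer sums them. In a real streaming job these would read stdin and write stdout; here they are plain functions so the logic is easy to follow.

```python
# Minimal WordCount in the Hadoop Streaming style: mapper emits (word, 1)
# pairs, reducer sums counts per word.
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

def mapper(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Emit one (word, 1) pair per token, lowercased for grouping.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    # Sum the counts for each word.
    counts: Counter = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reducer(mapper(["Hello world", "hello HDInsight"]))
print(result)  # {'hello': 2, 'world': 1, 'hdinsight': 1}
```

The same mapper/reducer pair, split into two stdin-reading scripts, is what a Streaming job on an HDInsight cluster would run.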
YARN Containerized Services: Fading the Lines Between On-Prem and Cloud (DataWorks Summit)
Apache Hadoop YARN is the modern distributed operating system for big data applications. In Apache Hadoop 3.1.0, YARN added a service framework that supports long-running services. This new capability goes hand in hand with the recent improvements in YARN to support Docker containers. Together these features have made it significantly easier to bring new applications and services to YARN.
In this talk you will learn about the YARN service framework, its new containerization capabilities, and how it lays the foundation for a hybrid and uniform architecture for compute and storage across on-prem and multi-cloud environments. This will include examples highlighting how easy it is to bring applications to the YARN service framework as well as how to containerize applications.
Here's what to expect in this talk:
- Motivation for YARN service framework and containerization
- YARN service framework overview
- YARN service examples
- Containerization overview
- Containerization for Big Data and non-Big-Data workloads (wait, that's everything)
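The service framework outlined above accepts a JSON service definition (a "Yarnfile") describing components, container counts, and Docker artifacts. The sketch below builds one such definition in the Hadoop 3.1 YARN Services API shape; the image name, launch command, and resource sizes are illustrative assumptions. A real spec would be POSTed to the Resource Manager's `/app/v1/services` endpoint.

```python
# A sketch of a YARN service definition ("Yarnfile") in the shape used by
# the Hadoop 3.1 YARN Services API. Image, command, and sizes are
# illustrative assumptions.
import json

service_spec = {
    "name": "sleeper-service",
    "version": "1.0.0",
    "components": [
        {
            "name": "sleeper",
            "number_of_containers": 2,
            # Docker artifact enables the containerization path described
            # in the talk.
            "artifact": {"id": "library/ubuntu:latest", "type": "DOCKER"},
            "launch_command": "sleep 900000",
            "resource": {"cpus": 1, "memory": "256"},
        }
    ],
}

payload = json.dumps(service_spec, indent=2)
print(payload)
```

Submitting `payload` to the Resource Manager would ask YARN to keep two such containers running as a long-lived service.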
Key trends in Big Data and new reference architecture from Hewlett Packard En... (Ontico)
The rapid development of Big Data processing tools is giving rise to new approaches to improving performance. Key new technologies in Hadoop 2.0, such as YARN labeling and storage tiering, are already in use at Yahoo and eBay. These new technologies open the way to serious efficiency gains in IT infrastructure for Hadoop, achieving performance improvements of several tens of percent while simultaneously reducing memory and power consumption.
HP's reference architecture for Hadoop, the HP Big Data Reference Architecture, proposes using specialized HP Moonshot "microservers" together with high-density HP Apollo storage nodes to achieve the best hardware utilization figures available for Hadoop today.
Big-Data-as-a-Service (BDaaS) in an enterprise environment requires meeting the often contradictory goals of (1) providing your data scientists, analysts, and data engineers with a self-service consumption model; (2) delivering agile and scalable on-demand infrastructure for the rapidly evolving ecosystem of big data frameworks and application software; while (3) ensuring enterprise-grade capabilities for isolation, security, monitoring, etc.
In this presentation at our BDaaS meetup in Santa Clara, Tom Phelan (chief architect and co-founder of BlueData) reviewed these goals and how to resolve the potential contradictions. He also discussed the infrastructure, application, user experience, security, and maintainability considerations required before selecting (or designing and building) a Big-Data-as-a-Service platform for an enterprise big data deployment.
More info on this BDaaS meetup can be found at: http://www.meetup.com/Big-Data-as-a-Service/events/233999817
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud (Cloudera, Inc.)
3 Things to Learn About:
*On-premises versus the cloud
*Design & benefits of real-time operational data in the cloud
*Best practices and architectural considerations
Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements brought to Hive by the Stinger initiative.
What the Enterprise Requires - Business Continuity and Visibility (Cloudera, Inc.)
Cloudera Enterprise BDR delivers centralized disaster recovery for data and metadata, enabling you to prepare for disaster by moving data to your secondary site automatically. Cloudera Navigator 1.0 provides data governance capabilities such as verifying access privileges and auditing access to all data stored in Hadoop, which are critical for customers that are in highly regulated industries and have stringent compliance requirements.
This presentation will teach you how to:
- Centrally configure and manage replication workflows for files (HDFS) and metadata (Hive)
- Consistently meet or exceed SLAs and RTOs through simplified management and process automation
- Track access permissions and actual accesses to all data objects in Hive, HBase, and HDFS
- Answer the questions:
- Who has access to which data object(s)?
- Which data objects were accessed by a user?
- When was a data object accessed, and by whom?
- What data assets were accessed using a service?
- Which device was used for access?
The Fundamentals Guide to HDP and HDInsight (Gert Drapers)
This session will give you an architectural overview and an introduction to the inner workings of HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight. The world has embraced the Hadoop toolkit to solve data problems ranging from ETL and data warehouses to event processing pipelines. As Hadoop consists of many components, services, and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it... (DataWorks Summit)
DeepLearning4J (DL4J) is a powerful open-source distributed framework that brings deep learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure, and Kotlin programmers). It can be used on distributed GPUs and CPUs, and it is integrated with Hadoop and Apache Spark. ND4J is an open-source, distributed, GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J, and Spark is a powerful combination, but the overall cluster configuration can present some unexpected issues that compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present some best practices to prevent them. The use cases presented will cover DL4J and ND4J on different Spark deployment modes (standalone, YARN, Kubernetes). The reference programming language for any code example will be Scala, but no prior Scala knowledge is required to follow the presented topics.
There is increased interest in using Kubernetes, the open-source container orchestration system for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services (such as front-end UIs and simple, content-centric experiences) are often great candidates for stateless applications, since HTTP is stateless by nature. There is no dependency on local container storage for a stateless workload.
Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop and Spark, and to a lesser extent databases such as Cassandra, MongoDB, Postgres, and MySQL, are great examples. They require some form of persistent storage that will survive service restarts...
Speakers
Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director Solutions, BlueData
John Sing's Edge 2013 presentation, detailing when, where, and how external storage products and/or system software (e.g. GPFS) can be effectively used in a Hadoop storage environment. Many Hadoop situations absolutely require direct-attached storage. However, there are many situations where shared external storage may make sense in a Hadoop environment. This presentation details how, why, and where, and promotes taking an intelligent, Hadoop-aware approach to deciding between internal storage and external shared storage. Full awareness of Hadoop considerations is essential when selecting either internal or external shared storage in a Hadoop environment.
VMworld 2013
Chris Greer, FedEx
Richard McDougall, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations having either jumped on the cloud bandwagon or started planning their expansion into the ecosystem, it is imperative to explore how Hadoop conforms to the cloud paradigm. With the coming of age of some very useful cloud paradigms, and the high seasonality of Big Data workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To implement effective solutions for Big Data in the cloud, it is imperative that you understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritty of implementing Big Data in the cloud and the various options therein. Big Data + Cloud is definitely a powerful combination.
These slides provide highlights of my book HDInsight Essentials. Book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
HDInsight Essentials: Hadoop on the Microsoft Platform (nvvrajesh)
This book gives a quick introduction to Hadoop-like problems, and gives a primer on the real value of HDInsight. Next, it will show how to set up your HDInsight cluster.
Then, it will take you through the four stages: collect, process, analyze, and report.
For each of these stages you will see a practical example with the working code.
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat... (Amazon Web Services)
Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop.
This webinar will show you examples of how to use Amazon EMR with the MapR Distribution for Hadoop. You will learn how you can free yourself from the heavy lifting required to run Hadoop on-premises, and gain the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.
What we'll learn:
• See a live demonstration of how you can quickly and easily launch your first Hadoop cluster in a few steps.
• Hear examples of real-world applications and customer successes in production.
• Learn best practices for maximizing the benefits of using MapR with AWS.
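Launching a cluster as demonstrated above is typically done through the EMR API. The sketch below builds the parameter dictionary one might pass to boto3's `run_job_flow` call; the release label, instance types, and log bucket are assumptions, and selecting the MapR distribution would be an additional cluster setting not shown here.

```python
# A sketch of the parameters for launching a small EMR cluster via
# boto3.client("emr").run_job_flow(**params). Release label, instance
# types, and bucket name are illustrative assumptions.
def emr_cluster_params(name: str, log_bucket: str,
                       core_nodes: int = 2) -> dict:
    """Build a run_job_flow-style parameter dict for a small cluster."""
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/logs/",
        "ReleaseLabel": "emr-5.36.0",
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
            # Keep the cluster alive after steps finish, for interactive use.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "Applications": [{"Name": "Hadoop"}],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

params = emr_cluster_params("demo-cluster", "my-emr-bucket")
print(params["LogUri"])  # s3://my-emr-bucket/logs/
# boto3.client("emr").run_job_flow(**params) would launch the cluster.
```

Building the parameters as a pure function keeps the cluster shape reviewable and testable before any AWS call is made.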
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend (OW2)
ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ... (Alluxio, Inc.)
Alluxio Tech Talk
Dec 10, 2019
Chris Crosbie and Roderick Yao from the Google Dataproc team and Dipti Borkar of Alluxio will demo how to set up Google Cloud Dataproc with Alluxio so jobs can seamlessly read from and write to Cloud Storage. They’ll also show how to run Dataproc Spark against a remote HDFS cluster.
For more Alluxio events: https://www.alluxio.io/events/
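To illustrate the setup described above: once a Cloud Storage bucket is mounted into the Alluxio namespace, job paths can be rewritten from `gs://` URIs to the corresponding `alluxio://` URIs. The helper below is a hypothetical sketch; the master host, port, and mount point are assumptions, not values from the talk.

```python
# Hypothetical helper: rewrite a gs:// URI to the alluxio:// URI it would
# have once the bucket is mounted under `mount_point` in the Alluxio
# namespace. Master host/port and mount point are assumptions.
def to_alluxio_uri(gs_uri: str,
                   alluxio_master: str = "alluxio-master:19998",
                   mount_point: str = "/gcs") -> str:
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {gs_uri}")
    bucket_and_path = gs_uri[len(prefix):]
    return f"alluxio://{alluxio_master}{mount_point}/{bucket_and_path}"

print(to_alluxio_uri("gs://my-bucket/data/events.parquet"))
# alluxio://alluxio-master:19998/gcs/my-bucket/data/events.parquet
```

A Spark job on Dataproc would then read the `alluxio://` path instead of the `gs://` path, letting Alluxio serve cached data.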
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study (VMworld)
VMworld 2013
Jayanth Gummaraju, VMware
Sasha Kipervarg, Identified, Inc.
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Virtualized Big Data Platform at VMware Corp IT @ VMworld 2015 (Rajit Saha)
At the VMware Corporate IT Data Solution and Delivery team, we have built an enterprise advanced data analytics platform on top of vSphere 6.0 with VMware Big Data Extensions, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2, and Alpine Data Lab.
Build Big Data Enterprise solutions faster on Azure HDInsight (DataWorks Summit)
Hadoop and Spark are big data frameworks used to extract useful insights across a variety of scenarios: ingestion, data prep, data management, processing, analysis, and visualization. Each step requires specialized toolsets to be productive. In this talk I will share examples of solutions from the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku, running on Microsoft’s Azure HDInsight that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, giving you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Fundamentals of Big Data, Hadoop project design, and a case study or use case.
General planning considerations and the essentials of the Hadoop ecosystem and Hadoop projects.
This will provide the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with a use case of Wi-Fi log analysis as a real-life example.
HP Converged Systems and Hortonworks - Webinar Slides (Hortonworks)
Our experts will walk you through some key design considerations when deploying a Hadoop cluster in production. We'll also share practical best practices around HP and Hortonworks Data Platform to get you started on building your modern data architecture.
Learn how to:
- Leverage best practices for deployment
- Choose a deployment model
- Design your Hadoop cluster
- Build a Modern Data Architecture and vision for the Data Lake
Similar to VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Management, and Virtualization Extensions
VMworld 2015: Monitoring and Managing Applications with vRealize Operations 6... (VMworld)
This year, VMware vSphere 6 combined with vRealize Operations 6.1 (vR Ops 6) adds critical features to increase technical agility in the infrastructure and reduce mean time to repair. With the new automated remediation action framework in vR Ops, vSphere 6’s ability to vMotion physical Raw Device Mappings (RDMs), and a complete management pack ecosystem for monitoring from infrastructure to applications, administrators have the tools needed to maintain five-nines uptime, shorten mean time to repair (MTTR), and predict capacity requirements as and when the business requires. This session will be a deep technical explanation and live demonstration of these tools. It will give administrators a solid understanding of how they can use these tools to monitor and manage their application clusters, keep applications running during infrastructure maintenance, and get deep, holistic visibility into the entire application ecosystem, from storage to networking.
VMworld 2015: Advanced SQL Server on vSphere (VMworld)
Microsoft SQL Server is one of the most widely deployed “apps” in the market today and is used as the database layer for a myriad of applications, ranging from departmental content repositories to large enterprise OLTP systems. Typical SQL Server workloads are somewhat trivial to virtualize; however, business critical SQL Servers require careful planning to satisfy performance, high availability, and disaster recovery requirements. It is the design of these business critical databases that will be the focus of this breakout session. You will learn how to build high-performance SQL Server virtual machines through proper resource allocation, database file management, and use of all-flash storage like XtremIO. You will also learn how to protect these critical systems using a combination of SQL Server and vSphere high availability features. For example, did you know you can vMotion shared-disk Windows Failover Cluster nodes? You can in vSphere 6! Finally, you will learn techniques for rapid deployment, backup, and recovery of SQL Server virtual machines using an all-flash array.
VMworld 2015: Virtualize Active Directory, the Right Way!VMworld
Active Directory Domain Services (ADDS) allows organizations to deploy a scalable and secure directory service for managing users, resources, and applications. Virtualization of ADDS has been supported for many years now; however, it has required careful management to avoid pitfalls around replication, time management, and access. Windows Server 2012 provides greater support for virtualization by including virtualization-safe features and support for rapid domain controller deployment.
VMworld 2015: Site Recovery Manager and Policy Based DR Deep Dive with Engine... (VMworld)
Policy-based management greatly simplifies the work of IT administrators, making it easy to ensure that applications and VMs receive the resources, protection, and functionality required. Learn about the latest enhancements of Site Recovery Manager in this space, which represent a huge step towards providing policy-based DR. In this session we'll dive deep into how this approach works and how to work with it.
Not content to simply describe the Virtual Volume (VVOL) framework, this session instead examines practical use cases: How different configurations and workloads benefit from VVOLs. Learn how Storage Policy Based Management (SPBM) couples with VVOLs to provide VM configuration options not previously available. We demonstrate a handful of real-life scenarios, specifically covering how VVOLs benefits oversubscribed systems, disaster recovery preparation and multi-tenant requirements for customers. Specific configuration options and constraints are covered in detail, including how they work with underlying storage.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Key Trends Shaping the Future of Infrastructure.pdf
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Management, and Virtualization Extensions
1. Big Data Platform Building Blocks: Serengeti, Resource Management, and Virtualization Extensions
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
VAPP5762
#VAPP5762
2. Agenda
Big Data, Hadoop, and What It Means to You
The VMware Big Data Platform
• Operate Clusters Simply
• Share Infrastructure Efficiently
• Leverage Existing Investment
Pivotal and VMware: Partnering to Virtualize Hadoop
Conclusion and Q&A
4. What is Hadoop?
A framework that allows for distributed processing of large data sets across clusters of commodity servers:
• Store large amounts of data
• Process the large amounts of data stored
Inspired by Google's MapReduce and Google File System (GFS) papers.
An Apache open source project:
• Initial work done at Yahoo! starting in 2005
• Open sourced in 2009; there is now a very active open source community
5. What is Hadoop?
Storage & compute in one framework.
Open source project of the Apache Software Foundation.
Java-intensive programming required.
Two core components:
• HDFS: scalable storage in the Hadoop Distributed File System
• MapReduce: compute via the MapReduce distributed processing platform
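The map/shuffle/reduce split described above can be sketched as a toy word count in Python (an illustration of the programming model only, not Hadoop code):

```python
from collections import defaultdict

# Toy illustration of the MapReduce model: a map phase emits
# (key, value) pairs, a shuffle groups them by key, and a reduce
# phase aggregates each group. In Hadoop this runs in parallel
# across many nodes; here it runs in-process.

def map_phase(document):
    """Emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key to get the final counts."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data on vsphere", "big data big clusters"]
intermediate = []
for doc in documents:  # each document stands in for an input split
    intermediate.extend(map_phase(doc))
counts = reduce_phase(shuffle(intermediate))
print(counts["big"])  # 3
```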
6. Why Hadoop?
• HDFS provides cheap and reliable storage on commodity hardware
• In-place data analysis, rather than moving from file systems to data warehouses
• Ability to analyze structured and unstructured data
• Enables better business decisions from more types of data at higher speeds and lower costs
7. Use Case: Data Warehouse Augmentation / Offload
Challenges
• Existing EDW used for low value and resource consuming ETL process
• Planned growth will far exceed compute capacity
• Hard to do analytics or even basic reporting on EDW system
Objectives
• Reduce EDW Total Cost of Ownership
• Enable longer data retention to enable analytics and accelerate time to market
• Migrate ETL off EDW to free up compute resources
8. Use Case: Retailer Trend Analysis
Deep historical reporting for retail trends:
• A credit card company loads 10 years of data for all retailers (100's of TBs)
• Run MapReduce jobs to develop a historical picture of retailers in a specific area
• Load results from Hadoop into the data warehouse and analyze further with standard BI/statistics packages
Why do this in Hadoop?
• Ability to store years of data cost-effectively
• Data available for immediate recall (not on tapes or flat files)
• No need to ETL/normalize the data
• Data exists in its valuable, original format
• Offload intensive computation from the DW
• Ability to combine structured and unstructured data
14. The Big Data Journey in the Enterprise
Stage 1: Hadoop Piloting
• Often starts with a line of business
• Try 1 or 2 use cases to explore the value of Hadoop
Stage 2: Hadoop Production
• Serve a few departments
• More use cases
• Growing number and size of clusters
• Core Hadoop + components
Stage 3: Cloud Analytics Platform
• Serve many departments
• Often part of mission-critical workflows
• Fully integrated with analytics/BI tools
[Diagram: scale grows from 0 nodes to 10's to 100's; the platform evolves from standalone to integrated]
15. Getting from Here to There
[Diagram: hosts running a virtualization layer; a data layer (shared file system) and a compute layer host a Hadoop test/dev cluster]
16. Getting from Here to There
[Diagram: the same virtualized hosts now run Hadoop test/dev, two Hadoop production clusters, and a Hadoop experimentation cluster on the shared data and compute layers]
17. Getting from Here to There
[Diagram: the virtualized hosts now also run HBase, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other workloads (Spark, Shark, Solr, Platfora) alongside Hadoop test/dev and production]
18. Benefits of Virtualization at Each Stage
Stage 1: Hadoop Piloting
• Rapid deployment
• On-the-fly cluster resizing
• Flexible configuration
• Automation of cluster lifecycle
Stage 2: Hadoop Production
• High availability
• Consolidation
• Tiered SLAs
• Elastic scaling
Stage 3: Cloud Analytics Platform
• Mixed workloads
• Right tool at the right time
• Flexible and elastic infrastructure
[Diagram: scale grows from 0 nodes to 10's to 100's; the platform evolves from standalone to integrated]
20. Big Data Initiatives at VMware
• Serengeti: open source project; a tool to simplify virtualized Hadoop deployment & operations
• vSphere Resource Management: advanced resource management on vSphere; Big Data application-specific extensions to DRS
• Hadoop Virtualization Extensions: virtualization changes for core Hadoop, contributed back to Apache Hadoop
21. Clustered Workload Management: The Next Frontier
[Diagram: Serengeti provides Hadoop management on top of the virtualization layer — vCenter and ESXi]
Source: http://www.conferencebike.com/image/generated/792.png
22. Serengeti Project History
Releases: Serengeti 0.5 (June 2012) → 0.6 (August 2012) → 0.7 (October 2012) → 0.8 (April 2013) → 0.9 / BDE Beta (June 2013)
Milestones along the way: Hadoop in 10 minutes; Highly Available Hadoop; time to insight; configuring Hadoop; compute elasticity; configuring placement and topology; HBase, MapR, and CDH4 support; performance best practices
Serengeti 0.9 / BDE Beta: integrated GUI; automatic elasticity; YARN / Pivotal HD
23. Big Data Extensions: Serengeti-vCenter Integration
[Diagram: Hadoop management through Big Data Extensions integrated into vCenter, on top of ESXi virtualization]
26. What Does Nick Think About Hadoop?
• "I don't want to be the bottleneck when it comes to provisioning Hadoop clusters."
• "I need sizing flexibility, because my Hadoop users don't know how large of a cluster they need."
• "I want to establish a repeatable process for deploying Hadoop clusters."
• "I don't really know that much about Hadoop."
• "I want to better manage the jumble of LOB Hadoop clusters in my enterprise."
Source: http://www.smartdraw.com/solutions/information-technology/images/nick.png
27. Choose Your Own Adventure
Source: http://www.vintagecomputing.com/wp-content/images/retroscan/supercomputer_cyoa_large.jpg
29. Deploy Hadoop Clusters in Minutes
From a manual process — server preparation, OS installation, network configuration, Hadoop installation and configuration — to fully automated, using the GUI.
30. How It Works
• BDE is packaged as a virtual appliance, which can be easily deployed on vCenter
• BDE works as a vCenter extension and establishes an SSL connection with vCenter
• BDE clones VMs from a template and controls/configures the VMs through vCenter
[Diagram: a vCenter management server and the BDE virtual appliance clone Hadoop node VMs from a template across hosts on the virtualization platform]
31. User-specified Customizations Using a Cluster Specification File
• Storage configuration: choice of shared or local
• High availability option
• Number of nodes and resource configuration
• VM placement policies
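As a rough illustration, a cluster specification covering these options might look like the sketch below. The field names are approximations for illustration only, not the exact Serengeti schema — consult the product documentation for the real JSON format.

```python
# Illustrative sketch of the kind of cluster specification BDE consumes.
# Field names are approximations, not the exact Serengeti schema.
cluster_spec = {
    "nodeGroups": [
        {
            "name": "master",
            "roles": ["hadoop_namenode", "hadoop_jobtracker"],
            "instanceNum": 1,
            "storage": {"type": "SHARED", "sizeGB": 50},  # shared storage
            "haFlag": "on",  # protect the master with vSphere HA
        },
        {
            "name": "worker",
            "roles": ["hadoop_datanode", "hadoop_tasktracker"],
            "instanceNum": 8,
            "storage": {"type": "LOCAL", "sizeGB": 200},  # cheap local disks
            "haFlag": "off",
        },
    ]
}

# e.g. total number of VMs this spec would produce
total_nodes = sum(g["instanceNum"] for g in cluster_spec["nodeGroups"])
print(total_nodes)  # 9
```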
36. Challenges of Running Hadoop in the Enterprise
[Diagram: Dept A (recommendation engine) and Dept B (ad targeting) each run their own production, test, and experimentation clusters over transaction data, log files, social data, and historical customer behavior; NoSQL, real-time, and SQL workloads are on the horizon]
Pain points:
1. Cluster sprawl
2. Redundant common data in separate clusters
3. Inefficient use of resources: some clusters could be running at capacity while other clusters sit idle
38. What Other Things Does Nick Think About Hadoop?
• "I want to scale out when my workload requires it."
• "My Hadoop users ask for large Hadoop clusters, which end up underutilized."
• "I want to offer Hadoop-as-a-Service in my private cloud."
• "I want to get all Hadoop clusters into a centralized environment to minimize spend."
Source: http://www.smartdraw.com/solutions/information-technology/images/nick.png
39. Achieving Multi-tenancy
Resource Isolation
• Control the greedy, noisy neighbor
• Reserve resources to meet needs
Version Isolation
• Allow concurrent OS, app, and distro versions
Security Isolation
• Provide privacy between users/groups
• Runtime and data privacy required
[Diagram: hosts running VMware vSphere + Serengeti]
40. Separating Hadoop Data and Compute for Elasticity
Combined storage/compute:
• Hadoop in a VM
• VM lifecycle determined by the DataNode
• Limited elasticity
• Limited to Hadoop multi-tenancy
Separate storage:
• Separate compute from data
• Elastic compute
• Enable shared workloads
• Raise utilization
Separate compute tenants:
• Separate virtual clusters per tenant
• Stronger VM-grade security and resource isolation
• Enable deployment of multiple Hadoop runtime versions
41. Dynamic Hadoop Scaling
Deploy separate compute clusters for different tenants sharing HDFS.
Commission/decommission compute nodes according to priority and available resources.
[Diagram: experimentation and production compute clusters, each with its own JobTracker and many compute VMs, draw on a dynamic resource pool over a shared data layer on VMware vSphere + Serengeti]
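The commission/decommission idea can be sketched as a simple priority-based allocator — a hypothetical illustration of the concept, not Serengeti's actual scheduling logic:

```python
# Hypothetical sketch: hand out a limited pool of compute-VM slots to
# tenant clusters in priority order, capped by each tenant's requested
# cluster size. Lower-priority tenants get whatever is left over and
# would be decommissioned first when resources tighten.
def allocate_compute_vms(available_vms, tenants):
    """tenants: list of (name, priority, requested); lower priority number wins."""
    allocation = {}
    for name, _priority, requested in sorted(tenants, key=lambda t: t[1]):
        granted = min(requested, available_vms)
        allocation[name] = granted
        available_vms -= granted
    return allocation

tenants = [
    ("experimentation", 2, 8),   # low priority, wants 8 compute VMs
    ("production", 1, 10),       # high priority, wants 10 compute VMs
]
print(allocate_compute_vms(15, tenants))
# {'production': 10, 'experimentation': 5}
```

When the pool shrinks, re-running the allocator shows the experimentation cluster losing nodes before production does.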
46. What Is Nick Still Thinking About Hadoop?
• "I want to use my existing infrastructure, not buy new hardware."
• "I want to leverage the tools I already have."
• "Hadoop on Amazon is costing too much."
• "My data is in shared storage; do I have to move it?"
• "I want a low-risk way of trying Hadoop."
Source: http://www.smartdraw.com/solutions/information-technology/images/nick.png
47. Use Storage That Meets Your Needs
• SAN storage: $2-$10/gigabyte; $1M gets 0.5 petabytes, 200,000 IOPS, 8 GB/sec
• NAS filers: $1-$5/gigabyte; $1M gets 1 petabyte, 200,000 IOPS, 10 GB/sec
• Local storage: $0.05/gigabyte; $1M gets 10 petabytes, 400,000 IOPS, 250 GB/sec
48. Leveraging Isilon as External HDFS
• Time to results: analysis of data in place
• Lower risk using vSphere with Isilon
• Scale storage and compute independently
[Diagram: an elastic virtual compute layer over a data layer of Hadoop on Isilon]
49. Hybrid Storage Model to Get the Best of Both Worlds
Master nodes:
• NameNode and JobTracker on shared storage
• Leverage vSphere vMotion, HA, and FT
Slave nodes:
• TaskTracker and DataNode on local storage
• Lower cost, scalable bandwidth
50. Achieving HA for the Entire Hadoop Stack
• Battle-tested HA technology
• Single mechanism to achieve HA for the entire Hadoop stack
• Simple to enable HA/FT
[Diagram: the Hadoop stack — HDFS (Hadoop Distributed File System), HBase (key-value store), MapReduce (job scheduling/execution system), Pig (data flow), Hive (SQL), HCatalog, ZooKeeper (coordination), management server, plus ETL and BI/reporting tools — with the HA-protected components highlighted: NameNode, JobTracker, Hive metastore DB, HCatalog metadata DB server, and the RDBMS]
51. Leveraging Other VMware Assets
Monitoring with vCenter Operations Manager:
• Gain comprehensive visibility
• Eliminate manual processes with intelligent automation
• Proactively manage operations
Future: vCloud Automation Center, software-defined storage
52. Get Maximum Value from Existing Tools and Infrastructure
[Diagram: virtualized hosts running Hadoop test/dev and production alongside HBase, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other workloads (Spark, Shark, Solr, Platfora) on shared data and compute layers]
54. Virtualization Benefits
• Multi-tenancy (users, business units) with strong vSphere-based isolation
• Multiple big data applications and compute engines can access common HDFS data
• Agility to scale Hadoop nodes at run time
• Provide on-demand Hadoop / Hadoop as a Service
55. Busting Myths About Virtual Hadoop
• Myth: Virtualization will add significant performance overhead. Reality: virtual Hadoop performance is comparable to bare metal.
• Myth: Hadoop cannot work with shared storage. Reality: shared storage is a valid choice, especially for smaller clusters.
• Myth: Virtualization necessitates the use of shared storage. Reality: shared storage is useful for HA, but virtual Hadoop on DAS is very common.
• Myth: Hadoop distribution vendors don't support virtual implementations. Reality: Pivotal HD is jointly tested, certified, and supported on vSphere.
Source: http://www.psychologytoday.com/files/u637/good-grief-charlie-brown.jpg, http://images2.wikia.nocookie.net/__cb20101130042247/peanuts/images/6/6d/Joe-cool-1-.jpg
58. You Need Hadoop Virtual Extensions
Topology Extensions:
• Enable Hadoop to recognize the additional virtualization layer for read/write/balancing and proper replica placement
• Enable compute/data node separation without losing locality
Elasticity Extensions:
• Ability to dynamically adjust resources allocated (CPU, memory, map/reduce slots) to compute nodes
• Enables runtime elasticity of Hadoop nodes
59. Hadoop Virtual Extensions
Topology Extensions:
• Enable Hadoop to recognize the additional virtualization layer for read/write/balancing and proper replica placement
• Enable compute/data node separation without losing locality
Elasticity Extensions:
• Ability to dynamically adjust resources allocated (CPU, memory, map/reduce slots) to compute nodes
• Enables runtime elasticity of Hadoop nodes
60. Current Hadoop Network Topology Not Virtualization Aware
[Diagram: a topology tree rooted at /, with data center D1, racks R1-R4, and hosts H1-H12]
• D = data center
• R = rack
• H = host
Multiple replicas may end up on the same physical host in virtual environments.
61. HVE Adds a New Layer in the Hadoop Network Topology
[Diagram: a topology tree rooted at /, with data centers D1-D2, racks R1-R4, node groups NG1-NG8, and nodes N1-N13]
• D = data center
• R = rack
• NG = node group
• N = node
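The effect of the extra layer shows up in the topology paths themselves. This small sketch (the path strings are made up for illustration) checks whether two nodes share a node group, i.e. sit on the same physical host:

```python
# HVE extends topology paths from /datacenter/rack/host to
# /datacenter/rack/nodegroup/node. Two VMs in the same node group
# share one physical host, so replicas must not land on both.

def same_node_group(path_a, path_b):
    """True if two nodes share a node group (the same physical host)."""
    # a path looks like "/D1/R1/NG1/N1"; stripping the last component
    # leaves the node-group prefix "/D1/R1/NG1"
    return path_a.rsplit("/", 1)[0] == path_b.rsplit("/", 1)[0]

print(same_node_group("/D1/R1/NG1/N1", "/D1/R1/NG1/N2"))  # True
print(same_node_group("/D1/R1/NG1/N1", "/D1/R1/NG2/N3"))  # False
```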
62. "Virtualization Aware" Replica Placement Policy During Write
Updated policies:
• No replicas are placed on the same node, or on nodes under the same node group
• 1st replica is on the local node, or on one of the nodes under the writer's node group
• 2nd replica is on a rack remote from the 1st replica
• 3rd replica is on the same rack as the 2nd replica
• Remaining replicas are placed randomly across racks to meet the minimum restriction
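The rules above can be captured in a short sketch. This is a simplified illustration with made-up node names; the real HDFS placement code also handles fallbacks and the minimum-restriction case for additional replicas:

```python
import random

# Nodes are (node_id, node_group, rack) tuples. The pick() helper
# enforces the first rule: never reuse a node group, so no two
# replicas share a physical host.
def place_replicas(writer, nodes):
    replicas, used_groups = [], set()

    def pick(candidates):
        choice = random.choice([n for n in candidates if n[1] not in used_groups])
        replicas.append(choice)
        used_groups.add(choice[1])
        return choice

    # 1st replica: on the writer's node or a node under its node group
    first = pick([n for n in nodes if n[1] == writer[1]])
    # 2nd replica: on a rack remote from the 1st
    second = pick([n for n in nodes if n[2] != first[2]])
    # 3rd replica: on the same rack as the 2nd (different node group)
    pick([n for n in nodes if n[2] == second[2]])
    return replicas

nodes = [("N1", "NG1", "R1"), ("N2", "NG1", "R1"), ("N3", "NG2", "R1"),
         ("N4", "NG3", "R2"), ("N5", "NG4", "R2")]
print(place_replicas(("N1", "NG1", "R1"), nodes))
```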
63. Hadoop Virtual Extensions
Topology Extensions:
• Enable Hadoop to recognize the additional virtualization layer for read/write/balancing and proper replica placement
• Enable compute/data node separation without losing locality
Elasticity Extensions:
• Ability to dynamically adjust resources allocated (CPU, memory, map/reduce slots) to compute nodes
• Enables runtime elasticity of Hadoop nodes
64. HVE Achieves Vertical Scaling of Hadoop Nodes
A VM's boundary is elastic already:
• VM resource settings: reservation (lower limit) and maximum (upper limit)
• If a resource is tight, VMs compete for it (between reservation and maximum) based on shares
• "Stealing" resources without notifying applications can cause very bad performance
• Thus, we need a way to make resource changes application-aware
Current Hadoop resource schedulers are static:
• MRv1 – slots
• YARN – resources (memory for now; YARN-2 will include CPUs)
HVE elasticity patches:
• Enable a flexible resource model for each Hadoop node
• Change resources at runtime
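The reservation/shares/maximum model can be illustrated with a simplified single-resource allocator. This is a sketch of the idea only, not the actual DRS algorithm, which is iterative and multi-resource:

```python
# Each VM is guaranteed its reservation; contended headroom above the
# reservations is split in proportion to shares, capped at each VM's
# maximum (limit). Simplified single-resource, single-pass version.
def allocate(capacity, vms):
    """vms: list of dicts with 'reservation', 'limit', 'shares'."""
    alloc = [vm["reservation"] for vm in vms]      # guarantees first
    spare = capacity - sum(alloc)                  # contended headroom
    total_shares = sum(vm["shares"] for vm in vms)
    for i, vm in enumerate(vms):
        extra = spare * vm["shares"] / total_shares
        alloc[i] = min(vm["limit"], alloc[i] + extra)  # cap at the limit
    return alloc

vms = [
    {"reservation": 2.0, "limit": 8.0, "shares": 1000},
    {"reservation": 2.0, "limit": 8.0, "shares": 3000},
]
print(allocate(12, vms))  # [4.0, 8.0]
```

With three times the shares, the second VM takes three quarters of the spare capacity until its limit caps it, which is exactly the competition between reservation and maximum the slide describes.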
65. Pivotal HD Is Best Suited for Virtualization
The only distribution that ships with VMware Hadoop Virtual Extensions (HVE):
• Fully tested
• Ensures proper HDFS replica placement on vSphere
• Improves MapReduce performance through better data locality on vSphere
• Allows dynamic scaling of Hadoop compute nodes
Certified on vSphere.
VMware Serengeti deploys and scales Pivotal HD on vSphere out of the box:
• The only YARN-based distribution supported by Serengeti
71. Big Data Platform Building Blocks: Serengeti, Resource Management, and Virtualization Extensions
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
VAPP5762
#VAPP5762