Big Data Platform Building Blocks: Serengeti, Resource Management, and Virtualization Extensions
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
VMworld 2013, Session VAPP5762
Big Data Platform Building Blocks: Serengeti, Resource Management, and Virtualization Extensions
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
VAPP5762 #VAPP5762

Agenda
 Big Data, Hadoop, and What It Means to You
 The VMware Big Data Platform
• Operate Clusters Simply
• Share Infrastructure Efficiently
• Leverage Existing Investment
 Pivotal and VMware: Partnering to Virtualize Hadoop
 Conclusion and Q&A

Big Data, Hadoop, and What It Means to You
What is Hadoop?
 Framework that allows for distributed processing of large data sets across clusters of commodity servers
• Store large amounts of data
• Process the large amounts of data stored
 Inspired by Google's MapReduce and Google File System (GFS) papers
 Apache Open Source Project
• Initial work done at Yahoo! starting in 2005
• Open sourced in 2009; there is now a very active open source community

What is Hadoop?
 Storage & Compute in One Framework
 Open Source Project of the Apache Software Foundation
 Java-intensive programming required
 Two Core Components:
• HDFS: scalable storage in the Hadoop Distributed File System
• MapReduce: compute via the MapReduce distributed processing platform
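The map/shuffle/reduce flow behind those two core components can be sketched in miniature, with plain Python standing in for Hadoop's Java API (function names here are illustrative, not Hadoop's):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: map emits (word, 1), reduce sums the counts.
def wc_mapper(line):
    for word in line.split():
        yield word, 1

def wc_reducer(word, counts):
    return sum(counts)

lines = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
# counts == {"big": 3, "data": 2, "clusters": 1}
```

In real Hadoop the input records come from HDFS blocks and the map and reduce tasks run in parallel across the cluster; the data flow is the same.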
Why Hadoop?
 HDFS provides cheap and reliable storage on commodity hardware
 In-place data analysis, rather than moving data from file systems to data warehouses
 Ability to analyze structured and unstructured data
Enables better business decisions from more types of data at higher speeds and lower costs

Use Case: Data Warehouse Augmentation / Offload
 Challenges
• Existing EDW used for low-value and resource-consuming ETL processes
• Planned growth will far exceed compute capacity
• Hard to do analytics or even basic reporting on the EDW system
 Objectives
• Reduce EDW Total Cost of Ownership
• Enable longer data retention to enable analytics and accelerate time to market
• Migrate ETL off the EDW to free up compute resources

Use Case: Retailer Trend Analysis
 Deep Historical Reporting for Retail Trends:
• Credit card company loads 10 years of data for all retailers (hundreds of TBs)
• Runs MapReduce jobs to develop a historical picture of retailers in a specific area
• Loads results from Hadoop into a data warehouse for further analysis with standard BI/statistics packages
 Why do this in Hadoop?
• Ability to store years of data cost effectively
• Data available for immediate recall (not on tapes or flat files)
• No need to ETL/normalize the data
• Data exists in its valuable, original format
• Offload intensive computation from the DW
• Ability to combine structured and unstructured data
Pivotal HD
 Apache components: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, YARN (Resource Management & Workflow), ZooKeeper

Pivotal HD Enterprise adds, atop the Apache stack (built up over three slides):
 Command Center: deploy, configure, monitor, manage
 Hadoop Virtualization Extensions (HVE)
 Data Loader
 HAWQ, Advanced Database Services: ANSI SQL + Analytics, Query Optimizer, Dynamic Pipelining, Catalog Services, Xtension Framework
 Integrations: Spring XD, Pivotal Analytics, Pivotal Chorus & Alpine Miner, MoreVRP
The VMware Big Data Platform

The Big Data Journey in the Enterprise
 Stage 1: Hadoop Piloting (standalone)
• Often starts with a line of business
• Try 1 or 2 use cases to explore the value of Hadoop
 Stage 2: Hadoop Production (10s of nodes)
• Serve a few departments
• More use cases
• Growing number and size of clusters
• Core Hadoop + components
 Stage 3: Cloud Analytics Platform (integrated, scaling to 100s of nodes)
• Serve many departments
• Often part of mission-critical workflow
• Fully integrated with analytics/BI tools
Getting from Here to There
[Diagram, built up over three slides: a row of virtualized hosts carrying a shared-file-system data layer and a compute layer; first just Hadoop test/dev, then Hadoop production and experimentation clusters alongside it, then HBase, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other engines (Spark, Shark, Solr, Platfora) on the same platform]
Benefits of Virtualization at Each Stage
 Stage 1: Hadoop Piloting
• Rapid deployment
• On-the-fly cluster resizing
• Flexible config
• Automation of cluster lifecycle
 Stage 2: Hadoop Production
• High availability
• Consolidation
• Tiered SLAs
• Elastic scaling
 Stage 3: Cloud Analytics Platform
• Mixed workloads
• Right tool at the right time
• Flexible and elastic infrastructure
A Brief History of Project Serengeti (and Big Data at VMware)

Big Data Initiatives at VMware
 Serengeti
• Open source project
• Tool to simplify virtualized Hadoop deployment & operations
 vSphere Resource Management
• Advanced resource management on vSphere
• Big Data application-specific extensions to DRS
 Hadoop Virtualization Extensions
• Virtualization changes for core Hadoop
• Contributed back to Apache Hadoop

Clustered Workload Management: The Next Frontier
[Diagram: Hadoop management by Serengeti, sitting on vCenter for virtualization management and ESXi for virtualization]
Source: http://www.conferencebike.com/image/generated/792.png
Serengeti Project History
 Releases: Serengeti 0.5 (June 2012), 0.6 (August 2012), 0.7 (October 2012), 0.8 (April 2013), 0.9 / BDE Beta (June 2013)
 Capabilities added along the way: Hadoop in 10 min; highly available Hadoop; time to insight; configuring Hadoop; compute elasticity; configuring placement and topology; HBase; MapR; CDH4; performance best practices; integrated GUI; automatic elasticity; YARN / Pivotal HD
Big Data Extensions: Serengeti-vCenter Integration
[Diagram: Hadoop management through Big Data Extensions + vCenter, with ESXi providing virtualization]

Why Virtualize Hadoop
 Operate Clusters Simply
 Share Infrastructure Efficiently
 Leverage Existing Investment

Operate Clusters Simply
Serengeti
What Does Nick Think About Hadoop?
• "I don't want to be the bottleneck when it comes to provisioning Hadoop clusters"
• "I need sizing flexibility, because my Hadoop users don't know how large of a cluster they need"
• "I want to establish a repeatable process for deploying Hadoop clusters"
• "I don't really know that much about Hadoop"
• "I want to better manage the jumble of LOB Hadoop clusters in my enterprise"
Source: http://www.smartdraw.com/solutions/information-technology/images/nick.png

Choose Your Own Adventure
Source: http://www.vintagecomputing.com/wp-content/images/retroscan/supercomputer_cyoa_large.jpg

Big Data Extensions Demo
Deploy Hadoop Clusters in Minutes
 From a manual process: server preparation, OS installation, network configuration, Hadoop installation and configuration
 To fully automated, using the GUI

How It Works
 BDE is packaged as a virtual appliance, which can be easily deployed on vCenter
 BDE works as a vCenter extension and establishes an SSL connection with vCenter
 BDE clones VMs from a template and controls/configures the VMs through vCenter
[Diagram: the vCenter Management Server and the BDE virtual appliance cloning a template VM into Hadoop node VMs across hosts on the virtualization platform]

User-specified Customizations Using a Cluster Specification File
 Storage configuration: choice of shared or local
 High availability option
 Number of nodes and resource configuration
 VM placement policies
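A cluster specification file covering those customizations is a JSON document. A hedged sketch follows: the field names reflect the Serengeti spec format as I recall it and may differ across versions, and the sizes and counts are invented for illustration:

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {"type": "SHARED", "sizeGB": 50},
      "haFlag": "on"
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 10,
      "cpuNum": 2,
      "memCapacityMB": 3748,
      "storage": {"type": "LOCAL", "sizeGB": 100},
      "haFlag": "off"
    }
  ]
}
```

Each customization from the slide maps to a field: shared vs. local storage per node group (`storage.type`), the HA option (`haFlag`), node counts and resource sizing (`instanceNum`, `cpuNum`, `memCapacityMB`), with VM placement expressed through additional placement-policy fields.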
Deploy, Manage, Run Virtual Hadoop with BDE
Deploy → Customize → Load data → Execute jobs → Tune configuration → Scale → …

Agility and Operational Simplicity
[Diagram: virtualized hosts with a shared-file-system data layer and a Hadoop test/dev cluster in the compute layer]
Share Infrastructure Efficiently
vSphere Resource Management

Adult Supervision Required
CLUSTERS OVER 10 NODES
Challenges of Running Hadoop in the Enterprise
[Diagram: Dept A (recommendation engine) and Dept B (ad targeting), each running separate production, test, and experimentation clusters over transaction data, log files, social data, and historical customer behavior, with NoSQL and real-time SQL on the horizon]
Pain points:
1. Cluster sprawl
2. Redundant common data in separate clusters
3. Inefficient use of resources: some clusters could be running at capacity while other clusters sit idle

The Virtualization Advantage
One physical platform to support multiple virtual big data clusters: production, test, and experimentation for both the recommendation engine and ad targeting on shared infrastructure

What Other Things Does Nick Think About Hadoop?
• "I want to scale out when my workload requires it"
• "My Hadoop users ask for large Hadoop clusters, which end up underutilized"
• "I want to offer Hadoop-as-a-Service in my private cloud"
• "I want to get all Hadoop clusters into a centralized environment to minimize spend"
Source: http://www.smartdraw.com/solutions/information-technology/images/nick.png
Achieving Multi-tenancy
 Resource Isolation
• Control the greedy, noisy neighbor
• Reserve resources to meet needs
 Version Isolation
• Allow concurrent OS, app, and distro versions
 Security Isolation
• Provide privacy between users/groups
• Runtime and data privacy required

Separating Hadoop Data and Compute for Elasticity
 Combined storage/compute VM (Hadoop in a VM)
• VM lifecycle determined by the DataNode
• Limited elasticity
• Limited to Hadoop multi-tenancy
 Separate storage and compute VMs
• Separate compute from data
• Elastic compute
• Enable shared workloads
• Raise utilization
 Separate compute tenants
• Separate virtual clusters per tenant
• Stronger VM-grade security and resource isolation
• Enable deployment of multiple Hadoop runtime versions

Dynamic Hadoop Scaling
 Deploy separate compute clusters for different tenants sharing HDFS
 Commission/decommission compute nodes according to priority and available resources
[Diagram: on VMware vSphere + Serengeti, a shared data layer beneath production (recommendation engine) and experimentation compute clusters of compute VMs, each with its own JobTracker, drawing on a dynamic resource pool]
Elastic Hadoop Demo

Virtual Hadoop Manager (VHM)
[Diagram: the VHM's Resource Management Module combines Hadoop state and stats from the JobTracker (slots used, pending work) with VC state and stats from the vCenter Server and DB, applies its algorithms and the cluster configuration held by Serengeti, and issues Hadoop actions (decommission/recommission TaskTrackers) and VC actions (manual/auto power on/off of compute VMs)]
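The commission/decommission loop in the VHM diagram can be approximated as follows. This is a deliberately simplified sketch, not VHM's actual algorithm: the function name, input shape, and thresholds are all invented for illustration.

```python
def plan_elasticity(compute_vms, min_powered_on=1):
    """Decide which compute VMs to power on or off from JobTracker stats.

    compute_vms: dict of VM name -> {"powered_on": bool, "pending_work": int}
    Returns (to_power_on, to_power_off) lists of VM names.
    """
    powered_on = [n for n, s in compute_vms.items() if s["powered_on"]]
    powered_off = [n for n, s in compute_vms.items() if not s["powered_on"]]
    pending = sum(s["pending_work"] for s in compute_vms.values())

    if pending > 0 and powered_off:
        # Work is queued: recommission spare compute VMs, one per pending task.
        return powered_off[:pending], []
    if pending == 0 and len(powered_on) > min_powered_on:
        # Cluster is idle: decommission down to the floor, freeing resources
        # for other tenants in the shared resource pool.
        return [], powered_on[min_powered_on:]
    return [], []

vms = {"c1": {"powered_on": True, "pending_work": 0},
       "c2": {"powered_on": True, "pending_work": 0}}
plan_elasticity(vms)  # ([], ["c2"]) — an idle cluster shrinks to one node
```

The real VHM also weighs vCenter-side signals (host contention, shares, reservations), which is why it sits between the JobTracker and vCenter rather than inside either.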
Combining Elasticity and Multi-tenancy
[Diagram: virtualized hosts with a shared-file-system data layer; the compute layer runs Hadoop test/dev, two Hadoop production clusters, and Hadoop experimentation side by side]

Leverage Existing Investment

What Is Nick Still Thinking About Hadoop?
• "I want to use my existing infrastructure, not buy new hardware"
• "I want to leverage the tools I already have"
• "Hadoop on Amazon is costing too much"
• "My data is in shared storage; do I have to move it?"
• "I want a low-risk way of trying Hadoop"
Source: http://www.smartdraw.com/solutions/information-technology/images/nick.png
Use Storage That Meets Your Needs
 SAN Storage: $2-$10/GB; $1M gets 0.5 petabytes, 200,000 IOPS, 8 GB/sec
 NAS Filers: $1-$5/GB; $1M gets 1 petabyte, 200,000 IOPS, 10 GB/sec
 Local Storage: $0.05/GB; $1M gets 10 petabytes, 400,000 IOPS, 250 GB/sec

Leveraging Isilon as External HDFS
 Time to results: analysis of data in place
 Lower risk using vSphere with Isilon
 Scale storage and compute independently
[Diagram: an elastic virtual compute layer over a data layer of Hadoop on Isilon]

Hybrid Storage Model to Get the Best of Both Worlds
 Master nodes: NameNode, JobTracker on shared storage
• Leverage vSphere vMotion, HA, and FT
 Slave nodes: TaskTracker, DataNode on local storage
• Lower cost, scalable bandwidth

Achieving HA for the Entire Hadoop Stack
 Battle-tested HA technology
 Single mechanism to achieve HA for the entire Hadoop stack: NameNode, JobTracker, Hive MetaDB, HCatalog MDB server, and the management server
 Simple to enable HA/FT
[Diagram: the stack — HDFS, HBase (key-value store), MapReduce (job scheduling/execution), Pig (data flow), Hive (SQL), HCatalog, ZooKeeper (coordination), with ETL tools, BI reporting, and an RDBMS alongside]
Leveraging Other VMware Assets
 Monitoring with vCenter Operations Manager
• Gain comprehensive visibility
• Eliminate manual processes with intelligent automation
• Proactively manage operations
 Future: vCloud Automation Center, software-defined storage

Get Maximum Value from Existing Tools and Infrastructure
[Diagram: virtualized hosts with a shared-file-system data layer; the compute layer runs Hadoop test/dev, HBase, Hadoop production, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and others (Spark, Shark, Solr, Platfora)]
Pivotal and VMware: Partnering to Virtualize Hadoop

Virtualization Benefits
 Multi-tenancy (users, business units) with strong vSphere-based isolation
 Multiple big data applications and compute engines can access common HDFS data
 Agility to scale Hadoop nodes at run time
 Provide on-demand Hadoop / Hadoop as a Service

Busting Myths About Virtual Hadoop
 Myth: virtualization will add significant performance overhead
• Reality: virtual Hadoop performance is comparable to bare metal
 Myth: Hadoop cannot work with shared storage
• Reality: shared storage is a valid choice, especially for smaller clusters
 Myth: virtualization necessitates the use of shared storage
• Reality: shared storage is useful for HA, but virtual Hadoop on DAS is very common
 Myth: Hadoop distribution vendors don't support virtual implementations
• Reality: Pivotal HD is jointly tested, certified, and supported on vSphere
Sources: http://www.psychologytoday.com/files/u637/good-grief-charlie-brown.jpg, http://images2.wikia.nocookie.net/__cb20101130042247/peanuts/images/6/6d/Joe-cool-1-.jpg
Native versus Virtual Platforms, 32 hosts, 16 disks/host
Source: http://www.vmware.com/resources/techresources/10360

Harness the Flexibility of Virtualization
Hadoop Virtualization Extensions
You Need Hadoop Virtual Extensions
 Topology Extensions:
• Enable Hadoop to recognize the additional virtualization layer for read/write/balancing and proper replica placement
• Enable compute/data node separation without losing locality
 Elasticity Extensions:
• Ability to dynamically adjust the resources allocated to compute nodes (CPU, memory, map/reduce slots)
• Enables runtime elasticity of Hadoop nodes
Current Hadoop Network Topology Is Not Virtualization Aware
 Topology tree: / → D1 (data center) → R1-R4 (racks) → H1-H12 (hosts); D = data center, R = rack, H = host
 Multiple replicas may end up on the same physical host in virtual environments, because Hadoop sees each VM as a distinct host

HVE Adds a New Layer in Hadoop Network Topology
 Topology tree: / → D1, D2 (data centers) → R1-R4 (racks) → NG1-NG8 (node groups) → N1-N13 (nodes)
 Legend: D = data center, R = rack, NG = node group, N = node

"Virtualization Aware" Replica Placement Policy During Write
Updated policies:
• No replicas are placed on the same node or on nodes under the same node group
• The 1st replica is on the local node, or on one of the nodes under the same node group as the writer
• The 2nd replica is on a rack remote from the 1st replica
• The 3rd replica is on the same rack as the 2nd replica
• Remaining replicas are placed randomly across racks to meet the minimum restriction
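The updated policies can be checked mechanically against HVE-style topology paths of the form /D/R/NG/N. The toy validator below is written for illustration only; it is not HDFS code, and the function names are invented:

```python
def parse(path):
    """Split an HVE topology path '/D1/R1/NG1/N1' into its layers."""
    dc, rack, nodegroup, node = path.strip("/").split("/")
    return dc, rack, nodegroup, node

def placement_ok(writer, replicas):
    """Check an ordered replica set against the virtualization-aware policies."""
    locs = [parse(p) for p in replicas]
    # No two replicas on the same node or in the same node group.
    nodegroups = [(dc, rack, ng) for dc, rack, ng, _ in locs]
    if len(set(nodegroups)) != len(nodegroups):
        return False
    # 1st replica: the writer's node, or a node in the writer's node group.
    if parse(writer)[:3] != locs[0][:3]:
        return False
    # 2nd replica: a rack remote from the 1st.
    if locs[1][:2] == locs[0][:2]:
        return False
    # 3rd replica: the same rack as the 2nd.
    if len(locs) >= 3 and locs[2][:2] != locs[1][:2]:
        return False
    return True

writer = "/D1/R1/NG1/N1"
good = ["/D1/R1/NG1/N1", "/D1/R2/NG3/N5", "/D1/R2/NG4/N6"]
bad  = ["/D1/R1/NG1/N1", "/D1/R1/NG1/N2", "/D1/R2/NG3/N5"]  # two in NG1
placement_ok(writer, good)  # True
placement_ok(writer, bad)   # False
```

The `bad` set is exactly the failure mode the previous slide warns about: two replicas in the same node group, i.e. two VMs on the same physical host.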
HVE Achieves Vertical Scaling of Hadoop Nodes
 A VM's boundary is already elastic
• VM resource settings: reservation (lower limit) and maximum (upper limit)
• If resources are tight, VMs compete for resources between the reservation and the maximum, based on shares
• "Stealing" resources without notifying the application can cause very bad performance
• Hence the need for a way to make resource changes application-aware
 Current Hadoop resource schedulers are static
• MRv1: slots
• YARN: resources (memory for now; YARN-2 will include CPUs)
 HVE elasticity patches
• Enable a flexible resource model for each Hadoop node
• Change resources at runtime
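The "flexible resource model" amounts to letting a node's advertised capacity change at runtime instead of being fixed when the daemon starts. A toy MRv1-style illustration (class and method names invented; not the actual patch):

```python
class TaskTrackerModel:
    """Toy model of a TaskTracker whose slot counts can change at runtime."""

    def __init__(self, map_slots, reduce_slots):
        self.map_slots = map_slots
        self.reduce_slots = reduce_slots

    def set_slots(self, map_slots, reduce_slots):
        # With HVE-style elasticity the Hadoop scheduler is told about the
        # change, rather than resources being "stolen" underneath the app.
        self.map_slots = map_slots
        self.reduce_slots = reduce_slots

    def capacity(self):
        return self.map_slots + self.reduce_slots

tt = TaskTrackerModel(map_slots=4, reduce_slots=2)
tt.set_slots(map_slots=2, reduce_slots=1)  # scale the node down at runtime
tt.capacity()  # 3
```

Because the scheduler sees the reduced capacity, it stops assigning tasks the node can no longer run, avoiding the performance cliff described above.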
Pivotal HD Is Best Suited for Virtualization
 The only distribution that ships with VMware Hadoop Virtualization Extensions (HVE)
• Fully tested
• Ensures proper HDFS replica placement on vSphere
• Improves MapReduce performance through better data locality on vSphere
• Allows dynamic scaling of Hadoop compute nodes
 Certified on vSphere
 VMware Serengeti deploys and scales Pivotal HD on vSphere out of the box
• The only YARN-based distribution supported by Serengeti
Conclusion

Big Data Platform Building Blocks and Key Benefits
 Serengeti
 vSphere Resource Management
 Hadoop Virtualization Extensions
 Partnership

Thank You
projectserengeti.org
gopivotal.com/pivotal-products/pivotal-data-fabric/pivotal-hd
Kevin Leong, kleong@vmware.com
Abhishek Kashyap, akashyap@gopivotal.com
