VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

VMworld 2013

Jayanth Gummaraju, VMware
Sasha Kipervarg, Identified, Inc.

Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare

Published in: Technology

VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

  1. Big Data Extensions: Advanced Features and Customer Case Study
     Jayanth Gummaraju, VMware
     Sasha Kipervarg, Identified, Inc.
     VAPP5484 #VAPP5484
  2. Data Is Exploding & Hadoop Is Driving Growth
     [Chart: "Unstructured data driving growth", structured vs. unstructured data, 2011-2020. Complex unstructured data forecasted to outpace structured relational data by 10x by 2020.]
     [Chart: "Hadoop adoption is ramping": Evaluating 53%, In production 23%, Piloting 18%, Testing 2%, Don't know 2%, Other 2%.]
     Source: Forrester Survey of 60 CIOs, September 2011
     • Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy
     • Hadoop’s ability to process raw data at cost presents intriguing value proposition
  3. Agenda
     • Big Data Extensions Overview
     • Virtualized Hadoop at Identified Inc.
     • Advanced Features
  4. Questions for Audience
     • Familiarity with Hadoop
       1. New to Hadoop
       2. Reasonably familiar
       3. Expert
     • Hadoop cluster sizes
       1. < 10 nodes
       2. 10-50 nodes
       3. > 50 nodes
     • Virtualizing Hadoop
       1. Never virtualized
       2. Actively exploring virtualization
       3. Running virtualized Hadoop in test-dev/production
  5. Big Data on vSphere: Value Proposition
     • Basic Features
       • Fast provisioning
         • Minutes/hours instead of days
       • Workload Consolidation
         • Multiple virtual clusters co-exist on same physical hardware
       • High Availability
         • Not limited to NameNode, JobTracker
     • Advanced Features
       • Auto-elasticity
         • High Resource Utilization
       • True multi-tenancy
         • VM-grade security, performance, and configuration isolation
  6. vSphere Big Data Extensions: Program Highlights
     • Serengeti
       • Open source project
       • Tool to simplify virtualized Hadoop deployment & operations
     • Hadoop Virtualization Extensions
       • Virtualization changes for core Hadoop
       • Contributed back to Apache Hadoop
     • vSphere Resource Management
       • Advanced resource management on vSphere
       • Big Data applications-specific extension to DRS
  7. What is Hadoop?
     Distributed processing of large data sets across clusters of computers
     [Diagram: Input Data → Split → map() emits [k1, v1] → Sort by k1 → Merge into [k1, [v1, v2, v3,…]] → reduce() → Output Data]
     Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
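The map, sort, merge, and reduce stages named on the slide above can be illustrated with a minimal single-process sketch. This is not Hadoop code: `map_reduce`, `word_count_map`, and `word_count_reduce` are hypothetical names chosen for the example, and word counting is assumed as the job.

```python
from collections import defaultdict

def map_reduce(splits, map_fn, reduce_fn):
    """Run a toy MapReduce job in one process (illustration only)."""
    # Map: each input split yields a list of (k1, v1) pairs.
    pairs = []
    for split in splits:
        pairs.extend(map_fn(split))
    # Sort by k1, then merge into k1 -> [v1, v2, v3, ...].
    pairs.sort(key=lambda kv: kv[0])
    merged = defaultdict(list)
    for key, value in pairs:
        merged[key].append(value)
    # Reduce: collapse each key's value list into one output value.
    return {key: reduce_fn(key, values) for key, values in merged.items()}

def word_count_map(split):
    # Hypothetical mapper: emit (word, 1) for every word in the split.
    return [(word.lower(), 1) for word in split.split()]

def word_count_reduce(key, values):
    # Hypothetical reducer: sum the counts for one word.
    return sum(values)

counts = map_reduce(["big data", "big clusters"], word_count_map, word_count_reduce)
print(counts)
```

In a real cluster the map calls run in parallel on different nodes and the sort/merge step moves data across the network; the sketch only shows the order of the stages.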
  8. Tasks Are Scheduled Where Data Resides
     [Diagram: a Job is submitted to the JobTracker; the Input File is divided into Splits 1-3 (64MB each); Slave Nodes 1-3 each run a TaskTracker and a DataNode; the DataNodes hold Blocks 1-3 (64MB each) and the TaskTrackers run Tasks 1-3; a NameNode is also shown.]
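The idea in the slide above, that a task should run on a node that already stores its input block, can be sketched as a small assignment function. This is an assumption-laden illustration, not the JobTracker's actual scheduler: `schedule_tasks`, the replica map, and the free-slot counts are all hypothetical.

```python
from typing import Dict, List

def schedule_tasks(replicas: Dict[str, List[str]],
                   free_slots: Dict[str, int]) -> Dict[str, str]:
    """Assign each split to a node, preferring nodes that hold its block.

    replicas:   split name -> nodes storing that split's block (hypothetical input)
    free_slots: node name  -> number of free task slots (hypothetical input)
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for split, holders in replicas.items():
        # Data-local candidates first; fall back to any node with a free slot.
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            continue  # no capacity anywhere; the split stays unscheduled
        node = max(candidates, key=lambda n: slots[n])
        slots[node] -= 1
        assignment[split] = node
    return assignment

plan = schedule_tasks(
    {"split1": ["node1"], "split2": ["node2"], "split3": ["node3"]},
    {"node1": 1, "node2": 1, "node3": 1},
)
print(plan)
```

With one free slot per node and one block per node, every task lands on the node holding its data, which is the situation the diagram depicts.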
  9. Myth: Virtual Performance Is Sub-optimal
     [Chart not reproduced; lower is better]
     Configuration: 32 hosts / 3.6GHz 8 cores / 15K RPM 146GB SAS disks / 10GbE / 72-96GB RAM
     Source: http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013
  10. Agenda
      • Big Data Extensions Overview
      • Virtualized Hadoop at Identified Inc.
      • Advanced Features
  11. Agenda
      • Big Data Extensions Overview
      • Virtualized Hadoop at Identified Inc.
      • Advanced Features
  12. Compute-Data Separation
      • Hadoop in VM
        • VM lifecycle determined by Datanode
        • Limited elasticity
        • Limited to Hadoop Multi-Tenancy
      • Separate Storage
        • Separate compute from data
        • Elastic compute
        • Enable shared workloads
        • Raise utilization
      • Separate Compute Tenants
        • Separate virtual clusters per tenant
        • Stronger VM-grade security and resource isolation
        • Enable deployment of multiple Hadoop runtime versions
      [Diagram labels: Slave Node; Combined Storage/Compute VM; Storage VM and Compute VM; Storage with tenant (T1, T2) compute VMs]
  13. Dataflow with Separated Compute/Data
      [Diagram: an ESX Host with Virtual Hadoop Nodes; one node runs a DataNode backed by a VMDK, another runs a TaskTracker with task Slots; each node has a Virtual NIC attached to the Virtual Switch, above the NIC Drivers.]
  14. Elastic Scalability & Multi-Tenancy
      • Deploy separate compute clusters for different tenants sharing HDFS.
      • According to priority and available resources, power-on/off compute VMs
      [Diagram: on VMware vSphere + Big Data Extensions, a shared Data layer sits under a Compute layer with two tenants, Experimentation and Production (recommendation engine); each tenant has its own Job Tracker and a set of Compute VMs drawn from a Dynamic resource pool.]
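Powering compute VMs on or off "according to priority and available resources", as the slide above puts it, can be sketched as a simple priority-ordered allocation. The policy shown (serve tenants strictly in priority order) is an assumption for illustration, as are the function name `plan_compute_vms`, the tenant tuples, and the capacity figure; the deck does not specify the actual policy.

```python
def plan_compute_vms(tenants, capacity):
    """Decide how many compute VMs each tenant may keep powered on.

    tenants:  list of (name, priority, wanted_vms); higher priority wins (assumed)
    capacity: total compute VMs the shared hosts can run (hypothetical figure)
    """
    plan = {}
    remaining = capacity
    # Serve tenants from highest to lowest priority.
    for name, _priority, wanted in sorted(tenants, key=lambda t: t[1], reverse=True):
        granted = min(wanted, remaining)
        plan[name] = granted      # VMs beyond this count would be powered off
        remaining -= granted
    return plan

plan = plan_compute_vms(
    [("experimentation", 1, 8), ("production", 10, 6)],
    capacity=10,
)
print(plan)
```

Here the production tenant gets all 6 VMs it asks for and experimentation is limited to the 4 that remain; both tenants still read from the same HDFS data layer.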
  15. Auto-elastic Hadoop in Action
      [Diagram: three ESX hosts running Non-Hadoop VMs, Hadoop Compute VMs (JT, TT), and Hadoop HDFS VMs (NN, Data VMs), with storage on Local Disks and SAN/NAS; DRS and the VHM are shown with the VirtualCenter Management Server.]
      Legend: JT: JobTracker; TT: TaskTracker; NN: NameNode; VHM: Virtual Hadoop Manager
  16. Advanced Resource Management using Virtual Hadoop Manager
      [Diagram: the Virtual Hadoop Manager (VHM) connects vCenter Server (with vCenter DB), Serengeti, and the Hadoop Job Tracker with its Task Trackers.
       Labeled flows: State, stats (Slots used, Pending work); Commands (Decommission, Recommission); Stats and VM configuration; Manual/Auto Power on/off; Cluster Configuration.
       VHM components: VC actions; Hadoop actions; Serengeti Configuration; VC state and stats; Hadoop state and stats; Auto-Scaling Algorithms.]
  17. Auto-Scaling Algorithms: 5 Key Insights
      ① Expand or Shrink clusters based on ambient data
        • Expand when there is work and no imminent contention
        • Shrink when there is contention
        • Predictable scaling for matching customer expectation, ease of testing, etc.
      ② Use contention detection as an input to scaling response
        • Contention reflects user's resource control settings and workload demands
      ③ Act as an extension to DRS for distributed applications spanning multiple VMs
        • A glue between DRS and Application-scheduler
        • Penalize few VMs heavily rather than all VMs lightly/uniformly
      ④ React only if there is true contention and in a timely manner
        • Actively used resources are deprived
        • Do not react to transients
      ⑤ Use Hysteresis and Control Theory concepts to guide decisions
        • E.g., transient windows and thresholds, feedback from previous actions, etc.
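Insights ①, ④, and ⑤ above (expand on pending work, shrink on contention, ignore transients) can be sketched with a small decision helper that acts only after a signal has persisted for a full window of samples. This is one possible reading, not the Virtual Hadoop Manager's algorithm: the class name `ScalingDecider`, the window length, and the boolean inputs are assumptions.

```python
from collections import deque

class ScalingDecider:
    """Turn per-sample signals into expand/shrink/hold decisions (sketch)."""

    def __init__(self, window=3):
        # Assumed transient window: samples that must agree before acting.
        self.window = window
        self.history = deque(maxlen=window)

    def observe(self, contention: bool, pending_work: bool) -> str:
        # Contention takes precedence: shrink when resources are deprived.
        if contention:
            signal = "shrink"
        elif pending_work:
            signal = "expand"
        else:
            signal = "hold"
        self.history.append(signal)
        # Act only when the whole window agrees, so transients are ignored.
        if len(self.history) == self.window and len(set(self.history)) == 1:
            if signal != "hold":
                # Reset after acting so the next action needs fresh evidence.
                self.history.clear()
            return signal
        return "hold"

decider = ScalingDecider(window=3)
decisions = [decider.observe(contention=True, pending_work=False) for _ in range(3)]
print(decisions)
```

A single contended sample between two idle ones never fills the window with the same signal, so it produces no action; only sustained contention or sustained pending work does.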
  18. Shrinking-related Metrics
      • CPU is being deprived
        • VC metric: CPU Ready
          • Time that vCPU is ready to run, but cannot be scheduled on a pCPU
      • Memory is being deprived
        • VC metrics:
          • Usage: Active Memory, Granted Memory
          • Reclamation: Memory Ballooning, Host Swap
          • Typically starts with ballooning then leads to host swapping
      • TaskTracker is dead or faulty
        • Hadoop metrics: Alive Nodes and Task Failures
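The three shrink triggers listed above can be combined into a single check per compute VM. The sketch below is illustrative only: the `VmSample` fields, the use of a CPU Ready percentage, and every threshold value are assumptions made for the example, not values taken from the deck or from vCenter defaults.

```python
from dataclasses import dataclass

@dataclass
class VmSample:
    """One hypothetical metrics sample for a compute VM."""
    cpu_ready_pct: float      # share of time the vCPU waited for a pCPU (assumed unit)
    ballooned_mb: float       # memory reclaimed by ballooning
    host_swapped_mb: float    # memory swapped out by the host
    tasktracker_alive: bool   # whether Hadoop still reports the TaskTracker

def shrink_reasons(sample, cpu_ready_limit_pct=10.0,
                   balloon_limit_mb=0.0, swap_limit_mb=0.0):
    """Return the reasons (possibly none) this VM is a shrink candidate."""
    reasons = []
    if sample.cpu_ready_pct > cpu_ready_limit_pct:
        reasons.append("cpu")          # CPU is being deprived
    if (sample.ballooned_mb > balloon_limit_mb
            or sample.host_swapped_mb > swap_limit_mb):
        reasons.append("memory")       # ballooning or host swap is active
    if not sample.tasktracker_alive:
        reasons.append("tasktracker")  # dead or faulty TaskTracker
    return reasons

print(shrink_reasons(VmSample(25.0, 512.0, 0.0, True)))
```

A production check would also use task failure counts and would smooth the samples over a window rather than reacting to a single reading.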
  19. Expansion-related Metrics
      • Jobs are present
        • Hadoop metrics: jobs_preparing, jobs_running
      • High slot usage
        • Hadoop metrics: map_slots_used, max_map_slots, reduce_slots_used, max_reduce_slots
      • High task throughput
        • Hadoop metrics: maps_completed, reduces_completed
      • No imminent contention
        • VC metrics: CPU Ready, Memory Ballooning
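The expansion conditions above can be sketched as a predicate over the Hadoop metric names the slide lists. The 0.8 slot-usage threshold, the function name `should_expand`, and the way the signals are combined are assumptions; the task-throughput metrics are left out to keep the example short.

```python
def should_expand(metrics, contention=False, slot_usage_threshold=0.8):
    """Return True when a cluster looks like it could use more compute VMs.

    metrics:    dict keyed by the Hadoop metric names from the slide
    contention: result of a separate CPU Ready / ballooning check (assumed input)
    """
    # Jobs are present.
    jobs_present = metrics["jobs_preparing"] + metrics["jobs_running"] > 0
    # High slot usage across map and reduce slots combined.
    total_slots = metrics["max_map_slots"] + metrics["max_reduce_slots"]
    used_slots = metrics["map_slots_used"] + metrics["reduce_slots_used"]
    high_usage = total_slots > 0 and used_slots / total_slots >= slot_usage_threshold
    # Expand only if there is work, slots are nearly full, and no contention.
    return jobs_present and high_usage and not contention

busy = {
    "jobs_preparing": 1, "jobs_running": 2,
    "map_slots_used": 18, "max_map_slots": 20,
    "reduce_slots_used": 8, "max_reduce_slots": 10,
}
print(should_expand(busy))
```

Passing `contention=True` for the same metrics returns False, matching the slide's requirement that expansion happens only when no contention is imminent.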
  20. Auto-elasticity Demo
  21. What’s Next?
      • Resource management enhancements
        • Algorithmic optimizations
        • Contention metrics related to Disk/Network IO
      • Auto-elasticity support for YARN and HBase
        • YARN: Hadoop 2.x
        • HBase: Hadoop database
      • Serengeti enhancements
        • Support for additional Hadoop distros
      • Hadoop extensions
        • Dynamic resource configuration
  22. Main Takeaways
      • Value proposition
        • Fast provisioning
        • Workload consolidation
        • Elasticity → better resource utilization
        • Multi-tenancy using VMs → differentiated service
      • Key technologies
        • Serengeti
        • Advanced Resource Management
        • Hadoop Virtualization Extensions
      [Diagram: three Hosts on the vSphere Platform]
      Make vSphere the platform of choice for running Big Data
  23. Questions?
      • Contact information
        • Jayanth Gummaraju jgummaraju@vmware.com
        • Sasha Kipervarg sasha@identified.com
      • Other related sessions
        • Breakout session (VAPP5402, VAPP5762)
        • Big Data Panel (VAPP5626)
        • Hands-on lab (HOL-SDC-1309)
      • For more information (including download information)
        • vSphere Big Data Extensions http://www.vmware.com/hadoop
        • Project Serengeti http://www.projectserengeti.org
  24. THANK YOU
  25. Big Data Extensions: Advanced Features and Customer Case Study
      Jayanth Gummaraju, VMware
      Sasha Kipervarg, Identified, Inc.
      VAPP5484 #VAPP5484
