Big data on virtualized infrastructure

Big Data and virtualization are two of the most exciting trends in the industry today. In this session you will learn about the components of Big Data systems, and how real-time, interactive and distributed processing systems like Hadoop integrate with existing applications and databases. The combination of Big Data systems with virtualization gives Hadoop and other Big Data technologies the key benefits of cloud computing: elasticity, multi-tenancy and high availability. A new open source project that VMware will announce at the Hadoop Summit will make it easy to deploy, configure and manage Hadoop on a virtualized infrastructure. We will discuss reference architectures for key Hadoop distributions and the future directions of this new open source project.

Published in: Technology

  1. Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand
     Richard McDougall, VMware, Inc
     @richardmcdougll
     © 2009 VMware Inc. All rights reserved
  2. Cloud: Big Shifts in Simplification and Optimization
     1. Reduce the Complexity: to simplify operations and maintenance
     2. Dramatically Lower Costs: to redirect investment into value-add opportunities
     3. Enable Flexible, Agile IT Service Delivery: to meet and anticipate the needs of the business
  3. A Holistic View of a Big Data System
     [Diagram labels:]
     • Real Time Streams
     • Real-Time Processing (S4, Storm)
     • Analytics
     • ETL
     • Real Time Structured Database (HBase, GemFire, Cassandra)
     • Big SQL (Greenplum, AsterData, etc.)
     • Batch Processing
     • Unstructured Data (HDFS)
  4. Common Infrastructure for Big Data
     [Diagram: separate MPP DB, HBase, and Hadoop clusters consolidated onto a shared virtualization platform]
     Cluster Sprawling: Single-purpose clusters for various business applications lead to cluster sprawl.
     Cluster Consolidation:
     § Simplify
        • Single hardware infrastructure
        • Unified operations
     § Optimize
        • Shared resources = higher utilization
        • Elastic resources = faster on-demand access
  5. Enterprise Challenges with Using Hadoop
     § Deployment
        • Slow to provision
        • Complex to keep running/tune
     § Single Points of Failure
        • Single point of failure with Name Node and Job Tracker
        • No HA for Hadoop framework components (Hive, HCatalog, etc.)
     § Low Utilization
        • Dedicated clusters to run Hadoop with low CPU utilization
        • No easy way to share resources between Hadoop and non-Hadoop workloads
        • Noisy neighbors, lack of resource containment
     § Need Multi-tenant Isolation, Resource Management, etc.
        • Noisy neighbor: no performance or security isolation between different tenants/users
        • Lack of configuration isolation: can't run multiple versions on the cluster
  6. Agenda
     I.   Market Overview & Insights
     II.  Virtualization + Hadoop
     III. Distribution & OSS Contribution
  7. Hadoop Runs Well on Virtualization
     Comparable performance to physical
     [Chart: ratio to native performance (y-axis, 0 to 1.2) for 1 VM and 2 VMs]
     Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf
  8. Use Local Disk Where It's Needed

                     SAN Storage         NAS Filers          Local Storage
     Cost            $2 - $10/Gigabyte   $1 - $5/Gigabyte    $0.05/Gigabyte
     $1M gets:
       Capacity      0.5 Petabytes       1 Petabyte          20 Petabytes
       IOPS          200,000             400,000             10,000,000
       Bandwidth     1 Gbyte/sec         2 Gbyte/sec         800 Gbytes/sec
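The capacity row of the table above follows from dividing the budget by the price per gigabyte. The sketch below reproduces that arithmetic. It assumes decimal units (1 PB = 1,000,000 GB) and takes the low end of each quoted price range, which is what the slide's round numbers appear to use; the function and variable names are illustrative only.

```python
# Capacity that a fixed budget buys at a given price per gigabyte.
GB_PER_PB = 1_000_000  # assumption: decimal units, as the slide's round numbers suggest


def petabytes_for_budget(budget_usd: float, usd_per_gb: float) -> float:
    """Return how many petabytes budget_usd buys at usd_per_gb."""
    return budget_usd / usd_per_gb / GB_PER_PB


budget = 1_000_000  # the slide's $1M
# Low end of each price range quoted on the slide (assumption for SAN and NAS).
prices = {"SAN": 2.00, "NAS": 1.00, "Local": 0.05}

capacity = {tier: petabytes_for_budget(budget, price) for tier, price in prices.items()}
for tier, pb in capacity.items():
    print(f"{tier}: {pb:g} PB per $1M")
```

At the high end of the SAN and NAS ranges ($10/GB and $5/GB) the same budget buys proportionally less, so the gap to local disk only widens.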
  9. Extend Virtual Storage Architecture to Include Local Disk
     § Shared Storage: SAN or NAS
        • Easy to provision
        • Automated cluster rebalancing
     § Hybrid Storage
        • SAN for boot images, VMs, other workloads
        • Local disk for Hadoop & HDFS
        • Scalable bandwidth, lower cost/GB
     [Diagram: hosts running a mix of Hadoop VMs and other VMs]
  10. Why Virtualize Hadoop?
      Simple to Operate
      § Rapid deployment
      § Unified operations across enterprise
      § Easy clone of cluster
      Highly Available
      § No more single point of failure
      § One click to set up
      § High availability for MR jobs
      Elastic Scaling
      § Shrink and expand cluster on demand
      § Resource guarantee
      § Independent scaling of compute and data
  11. Deploy a Hadoop Cluster in under 30 Minutes
      Step 1: Deploy the Serengeti virtual appliance on vSphere.
         • Deploy vHelper OVF to vSphere
      Step 2: A few simple commands to stand up a Hadoop cluster.
         • Select compute, memory, storage and network
         • Select configuration template
         • Automate deployment
         • Done
  12. A Tour Through Serengeti

      $ ssh serengeti@serengeti-vm
      $ serengeti
      serengeti>

  13. A Tour Through Serengeti

      serengeti> cluster create --name myElephant
      serengeti> cluster list --name myElephant

      name: myElephant, distro: cdh, status: RUNNING
        NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
        ---------------------------------------------------------------------------
        master  [hadoop_NameNode, hadoop_jobtracker]  1         2    7500     LOCAL 50

      name: myElephant, distro: cdh, status: RUNNING
        NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
        ---------------------------------------------------------------------------
        master  [hive, hadoop_client, pig]            1         1    3700     LOCAL 50

        NAME                HOST                             IP
        -----------------------------------------------------------------
        myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184

  14. A Tour Through Serengeti

      $ ssh rmc@rmc-elephant-009.eng.vmware.com
      $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
      …
  15. Serengeti Spec File

      [
        "distro":"apache",                      <- Choice of Distro
        {
          "name": "master",
          "roles": [
            "hadoop_NameNode",
            "hadoop_jobtracker"
          ],
          "instanceNum": 1,
          "instanceType": "MEDIUM",
          "ha":true,                            <- HA Option
        },
        {
          "name": "worker",
          "roles": [
            "hadoop_datanode",
            "hadoop_tasktracker"
          ],
          "instanceNum": 5,
          "instanceType": "SMALL",
          "storage": {                          <- Choice of Shared Storage or Local Disk
            "type": "LOCAL",
            "sizeGB": 10
          }
        },
      ]
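The spec on the slide above is an annotated excerpt rather than strictly valid JSON. To show what such a spec describes, here is a minimal sketch that rewrites it as valid JSON and summarizes it. The wrapping object and the "groups" key are assumptions made for this example, not Serengeti's documented schema; the field names inside each group are taken from the slide.

```python
import json

# Hypothetical, strictly valid rendering of the slide's spec. The outer
# object and the "groups" key are assumptions made for this sketch.
SPEC_TEXT = """
{
  "distro": "apache",
  "groups": [
    {"name": "master",
     "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
     "instanceNum": 1, "instanceType": "MEDIUM", "ha": true},
    {"name": "worker",
     "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instanceNum": 5, "instanceType": "SMALL",
     "storage": {"type": "LOCAL", "sizeGB": 10}}
  ]
}
"""


def summarize(spec: dict) -> dict:
    """Summarize VM count, HA-protected groups, and local disk requested."""
    groups = spec["groups"]
    return {
        "distro": spec["distro"],
        "total_vms": sum(g["instanceNum"] for g in groups),
        "ha_groups": [g["name"] for g in groups if g.get("ha", False)],
        # Local disk is counted per instance in each group that asks for it.
        "local_storage_gb": sum(
            g["instanceNum"] * g["storage"]["sizeGB"]
            for g in groups
            if g.get("storage", {}).get("type") == "LOCAL"
        ),
    }


summary = summarize(json.loads(SPEC_TEXT))
print(summary)
```

Read this way, the slide's example asks for six VMs: one HA-protected master and five workers with 10 GB of local disk each.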
  16. Configuring Distros

      {
        "name" : "cdh",
        "version" : "3u3",
        "packages" : [
          {
            "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                       "hadoop_tasktracker", "hadoop_datanode",
                       "hadoop_client"],
            "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
          },
          {
            "roles" : ["hive"],
            "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
          },
          {
            "roles" : ["pig"],
            "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
          }
        ]
      },
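The distro entry above maps roles to the tarball that provides them. As a rough illustration of how a deployment tool might use such a manifest, the sketch below resolves a node group's roles to the set of tarballs it needs. The function name and the lookup logic are hypothetical and are not taken from Serengeti's code; only the manifest data comes from the slide.

```python
# Manifest data copied from the slide's "cdh" entry.
DISTRO = {
    "name": "cdh",
    "version": "3u3",
    "packages": [
        {"roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
                   "hadoop_datanode", "hadoop_client"],
         "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
        {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
        {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"},
    ],
}


def tarballs_for_roles(distro: dict, roles: list) -> list:
    """Return the distinct tarballs needed for roles, in first-use order."""
    # Build a role -> tarball index from the manifest.
    index = {role: pkg["tarball"] for pkg in distro["packages"] for role in pkg["roles"]}
    missing = [role for role in roles if role not in index]
    if missing:
        raise KeyError(f"distro {distro['name']} has no package for roles: {missing}")
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(index[role] for role in roles))


master_tarballs = tarballs_for_roles(DISTRO, ["hadoop_NameNode", "hadoop_jobtracker"])
client_tarballs = tarballs_for_roles(DISTRO, ["hive", "hadoop_client", "pig"])
print(master_tarballs)
print(client_tarballs)
```

Both master roles resolve to the same Hadoop tarball, so it is listed once; the client group needs all three packages.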
  17. Serengeti Demo
      • Deploy Serengeti vApp on vSphere
      • Deploy a Hadoop cluster in 10 minutes
      • Run MapReduce
      • Scale out the Hadoop cluster
      • Create a customized Hadoop cluster
      • Use your favorite Hadoop distribution
  18. Why Virtualize Hadoop?
      Simple to Operate
      § Rapid deployment
      § Unified operations across enterprise
      § Easy clone of cluster
      Highly Available
      § No more single point of failure
      § One click to set up
      § High availability for MR jobs
      Elastic Scaling
      § Shrink and expand cluster on demand
      § Resource guarantee
      § Independent scaling of compute and data
  19. High Availability for the Hadoop Stack
      [Diagram labels:]
      • ETL Tools, BI Reporting, RDBMS
      • Pig (Data Flow), Hive (SQL), HCatalog
      • ZooKeeper (Coordination)
      • Hive MetaDB, HCatalog MDB
      • Management Server
      • MapReduce (Job Scheduling/Execution System)
      • HBase (Key-Value store)
      • Jobtracker, Namenode
      • HDFS (Hadoop Distributed File System)
      • Server
  20. Live Machine Migration Reduces Planned Downtime
      Description:
      Enables the live migration of virtual machines from one host to another with continuous service availability.
      Benefits:
         • Revolutionary technology that is the basis for automated virtual machine movement
         • Meets service level and performance goals
  21. vSphere High Availability (HA): protection against unplanned downtime
      Overview
         • Protection against host and VM failures
         • Automatic failure detection (host, guest OS)
         • Automatic virtual machine restart in minutes, on any available host in cluster
         • OS and application-independent, does not require complex configuration changes
  22. vSphere Fault Tolerance provides continuous protection
      Overview
         • Single identical VMs running in lockstep on separate hosts
         • Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
         • Integrated with VMware HA/DRS
         • No complex clustering or specialized hardware required
         • Single common mechanism for all applications and operating systems
      [Diagram: App/OS virtual machines protected by HA and FT across two VMware ESX hosts]
      Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters
  23. One click to HA
      § Easy to set up: one click is all you need
  24. Example HA Failover for Hadoop
      [Diagram labels: Serengeti; vSphere HA; Namenode (shown twice); Server; four worker nodes, each running TaskTracker, HDFS Datanode, Hive, and HBase]
  25. vSphere HA and Optionally FT
      § vSphere HA
         • Is application-aware: will auto-restart the NN if its heartbeat goes away
         • Is easy to configure
         • Has no performance overhead
      § vSphere FT
         • Has the added bonus of no pause time when there is a hardware failure
         • Is limited to one vCPU
         • Performance measurements: 2% overhead on the NN. Extrapolating from current measurements, this is good for a cluster of ~300 hosts.
      § HDFS 2 HA
         • Only covers the Namenode. What about the other 5+ master services?
         • Not available in Apache Hadoop 0.20
         • Not as battle-tested as vSphere HA
         • Is more complex to install and manage
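The "application-aware" behavior described above amounts to restarting a service when its heartbeat stops arriving. The following is a minimal sketch of that decision logic only. It is not the vSphere HA API; the class, the function, and the 30-second timeout are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MonitoredService:
    """A service that is expected to report heartbeats (hypothetical model)."""
    name: str
    last_heartbeat: float  # seconds, from any monotonic clock
    restarts: int = 0


def check_and_restart(svc: MonitoredService, now: float, timeout_s: float) -> bool:
    """Restart svc if no heartbeat arrived within timeout_s. Return True if restarted."""
    if now - svc.last_heartbeat <= timeout_s:
        return False  # heartbeat is recent enough, nothing to do
    svc.restarts += 1
    svc.last_heartbeat = now  # treat the restart as a fresh start for monitoring
    return True


nn = MonitoredService("namenode", last_heartbeat=0.0)
healthy_check = check_and_restart(nn, now=10.0, timeout_s=30.0)  # within the timeout
stale_check = check_and_restart(nn, now=45.0, timeout_s=30.0)    # heartbeat is stale
print(healthy_check, stale_check, nn.restarts)
```

The same check applies to any master service that emits a heartbeat, which is the contrast the slide draws with HDFS 2 HA covering only the Namenode.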
  26. High Availability for the Hadoop Stack
      [Diagram labels:]
      • ETL Tools, BI Reporting, RDBMS
      • Pig (Data Flow), Hive (SQL), HCatalog
      • ZooKeeper (Coordination)
      • Hive MetaDB, HCatalog MDB
      • Management Server
      • MapReduce (Job Scheduling/Execution System)
      • HBase (Key-Value store)
      • Jobtracker, Namenode
      • HDFS (Hadoop Distributed File System)
      • Server
  27. Why Virtualize Hadoop?
      Simple to Operate
      § Rapid deployment
      § Unified operations across enterprise
      § Easy clone of cluster
      Highly Available
      § No more single point of failure
      § One click to set up
      § High availability for MR jobs
      Elastic Scaling
      § Shrink and expand cluster on demand
      § Resource guarantee
      § Independent scaling of compute and data
  28. Elastic Scaling and Multi-tenancy of Hadoop on vSphere
      1. Hadoop in VM (current Hadoop: combined storage/compute in each VM)
         - Single tenant
         - Fixed resources
      2. Separate Compute and Data (compute VM and storage VM)
         - Single tenant
         - Elastic compute
      3. Multi. Clusters (compute VMs for tenants T1 and T2 over shared storage VMs)
         - Multiple tenants
         - Elastic compute
  29. "Time Share"
      [Diagram: hosts under VMware vSphere running Hadoop VMs alongside other VMs, with HDFS on each host, managed by vHelper]
      While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.
  30. Virtualization delivers VM level Multi-tenancy
      § Performance isolation
         • No more noisy neighbors: resource containers achieve guaranteed SLAs for different tenants/users/jobs
      § Configuration isolation
         • Support multiple Hadoop environments on the same physical clusters
         • Multiple Linux versions
         • Multiple Hadoop versions
      § Security isolation
         • Higher level of security
         • Compute VM can only access data VM through an Access Control List
      [Diagram: tenants "Coke" and "Pepsi" with virtual Hadoop clusters and a Hadoop queue in the runtime layer; data containers on HDFS in the data layer; six hosts underneath]
  31. Agenda
      I.   Market Overview & Insights
      II.  Virtualization + Hadoop
      III. Distribution & OSS Contribution
  32. Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
      [Logos: Commercial Vendors; Community Projects]
         • Support major distributions and multiple projects
         • Contribute Hadoop Virtualization Extension (HVE) to the open source community
  33. Hadoop Virtualization Extensions: Topology Awareness
  34. Virtual Topologies
  35. Proposed Topology Changes
      • HADOOP-8468 (Umbrella JIRA)
      • HADOOP-8469
      • HDFS-3495
      • MAPREDUCE-4310
      • HDFS-3498
      • MAPREDUCE-4309
      • HADOOP-8470
      • HADOOP-8472
  36. Spring for Apache Hadoop
      § Announced initial formation of Spring Data OSS project in 2010
         • Enables Spring-powered applications to use new data access technologies
         • Data project technologies around MongoDB, Neo4J, Riak, Redis, JDBC Extensions, JPA, REST, and Blob
      § Announcing additional contributions on GitHub:
         • Integration with Cascading library
         • HBase support
         • Hadoop security support
         • More examples
         • Administrative application, RESTful API to upload Hadoop jobs to schedule for batch execution, query status, etc.
         • Web HDFS support
  37. Big Data on Virtualized Infrastructure
      Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand
      Richard McDougall, VMware, Inc
      @richardmcdougll
      © 2009 VMware Inc. All rights reserved
