Big Data on Virtualized Infrastructure
Big Data and virtualization are two of the most exciting trends in the industry today. In this session you will learn about the components of Big Data systems, and how real-time, interactive and distributed processing systems like Hadoop integrate with existing applications and databases. The combination of Big Data systems with virtualization gives Hadoop and other Big Data technologies the key benefits of cloud computing: elasticity, multi-tenancy and high availability. A new open source project that VMware will announce at the Hadoop Summit will make it easy to deploy, configure and manage Hadoop on a virtualized infrastructure. We will discuss reference architectures for key Hadoop distributions and the future directions of this new open source project.

Presentation Transcript

  • Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand
    • Richard McDougall, VMware, Inc
    • @richardmcdougll
    • © 2009 VMware Inc. All rights reserved
  • Cloud: Big Shifts in Simplification and Optimization
    • 1. Reduce the Complexity to simplify operations and maintenance
    • 2. Dramatically Lower Costs to redirect investment into value-add opportunities
    • 3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business
  • A Holistic View of a Big Data System
    • Real Time Streams
    • Real-Time Processing (s4, storm)
    • Analytics
    • ETL
    • Real Time Structured Database (hBase, Gemfire, Cassandra)
    • Big SQL (Greenplum, AsterData, Etc…)
    • Batch Processing
    • Unstructured Data (HDFS)
  • Common Infrastructure for Big Data
    • Cluster Sprawling: single-purpose clusters for various business applications lead to cluster sprawl (separate MPP DB, HBase and Hadoop clusters)
    • Cluster Consolidation: MPP DB, HBase and Hadoop share one virtualization platform
    • Simplify
      • Single Hardware Infrastructure
      • Unified operations
    • Optimize
      • Shared Resources = higher utilization
      • Elastic resources = faster on-demand access
  • Enterprise Challenges with Using Hadoop
    • Deployment
      • Slow to provision
      • Complex to keep running and to tune
    • Single Points of Failure
      • Single point of failure with Name Node and Job Tracker
      • No HA for Hadoop Framework Components (Hive, HCatalog, etc.)
    • Low Utilization
      • Dedicated clusters to run Hadoop with low CPU utilization
      • No easy way to share resources between Hadoop and non-Hadoop workloads
      • Noisy neighbor, lack of resource containment
    • Need Multi-tenant Isolation, Resource Management, etc.
      • Noisy Neighbor: no performance or security isolation between different tenants/users
      • Lack of configuration isolation: can’t run multiple versions on the cluster
  • Agenda
    • I. Market Overview & Insights
    • II. Virtualization + Hadoop
    • III. Distribution & OSS Contribution
  • Hadoop Runs Well on Virtualization
    • Comparable performance to physical
    • [Chart: ratio to native performance, 0 to 1.2, for 1 VM and 2 VMs]
    • Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf
  • Use Local Disk where it’s Needed
    • SAN Storage: $2 - $10/Gigabyte; $1M gets 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec
    • NAS Filers: $1 - $5/Gigabyte; $1M gets 1 Petabyte, 400,000 IOPS, 2 Gbytes/sec
    • Local Storage: $0.05/Gigabyte; $1M gets 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
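The capacity figures on this slide follow from dividing the budget by the price per gigabyte. The sketch below reproduces that arithmetic; it assumes the slide used the low end of each quoted price range and decimal petabytes (1 PB = 1,000,000 GB), and the names `PRICE_PER_GB` and `capacity_pb` are illustrative only.

```python
# Price per gigabyte in USD. Assumption: the slide's "$1M gets" figures use the
# low end of each quoted range ($2, $1 and $0.05 per gigabyte).
PRICE_PER_GB = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.05}

GB_PER_PB = 1_000_000  # decimal petabyte, assumed


def capacity_pb(budget_usd: float, price_per_gb: float) -> float:
    """Return the raw capacity in petabytes that a budget buys at a given price."""
    return budget_usd / price_per_gb / GB_PER_PB


if __name__ == "__main__":
    for name, price in PRICE_PER_GB.items():
        print(f"{name}: {capacity_pb(1_000_000, price):g} PB per $1M")
```

Under those assumptions the results line up with the 0.5, 1 and 20 petabyte figures quoted on the slide.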
  • Extend Virtual Storage Architecture to Include Local Disk
    • Shared Storage: SAN or NAS
      • Easy to provision
      • Automated cluster rebalancing
    • Hybrid Storage
      • SAN for boot images, VMs, other workloads
      • Local disk for Hadoop & HDFS
      • Scalable Bandwidth, Lower Cost/GB
    • [Diagram: hosts running Hadoop VMs alongside other VMs]
  • Why Virtualize Hadoop?
    • Simple to Operate
      • Rapid deployment
      • Unified operations across enterprise
      • Easy Clone of Cluster
    • Highly Available
      • No more single point of failure
      • One click to setup
      • High availability for MR Jobs
    • Elastic Scaling
      • Shrink and expand cluster on demand
      • Resource Guarantee
      • Independent scaling of Compute and data
  • Deploy a Hadoop Cluster in under 30 Minutes
    • Step 1: Deploy Serengeti virtual appliance on vSphere.
      • Deploy vHelper OVF to vSphere
    • Step 2: A few simple commands to stand up Hadoop Cluster.
      • Select Compute, memory, storage and network
      • Select configuration template
      • Automate deployment
      • Done
  • A Tour Through Serengeti
    $ ssh serengeti@serengeti-vm
    $ serengeti
    serengeti>
  • A Tour Through Serengeti
    serengeti> cluster create --name myElephant
    serengeti> cluster list --name myElephant
    name: myElephant, distro: cdh, status: RUNNING
      NAME    ROLES                                   INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hadoop_NameNode, hadoop_jobtracker]    1         2    7500     LOCAL 50
    name: myElephant, distro: cdh, status: RUNNING
      NAME    ROLES                                   INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hive, hadoop_client, pig]              1         1    3700     LOCAL 50
      NAME                HOST                             IP
      -----------------------------------------------------------------
      myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184
  • A Tour Through Serengeti
    $ ssh rmc@rmc-elephant-009.eng.vmware.com
    $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
    …
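For a sense of scale, the first `teragen` argument is a row count, and TeraGen rows are 100 bytes each, so the command above writes about 100 GB of logical data. The sketch below works that out; the replication factor of 3 is the HDFS default and is only an assumption about this cluster, and the function names are illustrative.

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_logical_bytes(rows: int) -> int:
    """Logical size of the generated data set, before HDFS replication."""
    return rows * TERAGEN_ROW_BYTES


def hdfs_raw_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw disk consumed once HDFS stores `replication` copies of each block.

    Assumption: replication=3 is the HDFS default; the demo cluster's actual
    setting is not given in the slides.
    """
    return logical_bytes * replication


if __name__ == "__main__":
    logical = teragen_logical_bytes(1_000_000_000)
    print(f"logical: {logical / 1e9:.0f} GB")
    print(f"raw at 3x replication: {hdfs_raw_bytes(logical) / 1e9:.0f} GB")
```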
  • Serengeti Spec File
    [
      "distro":"apache",                              <- Choice of Distro
      {
        "name": "master",
        "roles": [ "hadoop_NameNode", "hadoop_jobtracker" ],
        "instanceNum": 1,
        "instanceType": "MEDIUM",
        "ha":true,                                    <- HA Option
      },
      {
        "name": "worker",
        "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
        "instanceNum": 5,
        "instanceType": "SMALL",
        "storage": {                                  <- Choice of Shared Storage or Local Disk
          "type": "LOCAL",
          "sizeGB": 10
        }
      },
    ]
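The spec on this slide is an annotated excerpt rather than strict JSON, so the sketch below restates its two node groups as Python data and runs a few sanity checks over them. The field names come from the slide; the list wrapper, the `REQUIRED_KEYS` rule, the `"SHARED"` storage value and both helper functions are assumptions made for illustration, not part of Serengeti.

```python
import json

# Node groups transcribed from the slide. Holding them in a plain list is an
# assumption of this sketch, not the documented Serengeti file layout.
NODE_GROUPS = [
    {
        "name": "master",
        "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
        "instanceNum": 1,
        "instanceType": "MEDIUM",
        "ha": True,
    },
    {
        "name": "worker",
        "roles": ["hadoop_datanode", "hadoop_tasktracker"],
        "instanceNum": 5,
        "instanceType": "SMALL",
        "storage": {"type": "LOCAL", "sizeGB": 10},
    },
]

# Hypothetical validation rules; the slide does not say which fields are mandatory.
REQUIRED_KEYS = {"name", "roles", "instanceNum", "instanceType"}
STORAGE_TYPES = {"LOCAL", "SHARED"}  # "SHARED" is assumed from the slide's callout


def check_group(group: dict) -> list:
    """Return a list of human-readable problems found in one node group."""
    problems = [f"missing field: {key}" for key in sorted(REQUIRED_KEYS - set(group))]
    storage = group.get("storage")
    if storage is not None and storage.get("type") not in STORAGE_TYPES:
        problems.append(f"unknown storage type: {storage.get('type')}")
    if group.get("instanceNum", 1) < 1:
        problems.append("instanceNum must be at least 1")
    return problems


def total_instances(groups: list) -> int:
    """Total number of VMs the spec would ask for."""
    return sum(group["instanceNum"] for group in groups)


if __name__ == "__main__":
    for group in NODE_GROUPS:
        print(group["name"], check_group(group) or "ok")
    print("VMs requested:", total_instances(NODE_GROUPS))
    print(json.dumps(NODE_GROUPS, indent=2))
```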
  • Configuring Distros
    {
      "name" : "cdh",
      "version" : "3u3",
      "packages" : [
        {
          "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
          "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
        },
        {
          "roles" : ["hive"],
          "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
        },
        {
          "roles" : ["pig"],
          "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
        }
      ]
    },
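The distro entry above maps roles to the tarball that provides them. As a rough illustration of how such an entry could be consumed, the sketch below parses the same JSON and looks up the tarball for a role. The helper `tarball_for_role` is hypothetical; how Serengeti itself resolves packages is not described in the slides.

```python
import json

# The distro entry from the slide, without the trailing comma that separates
# it from the next entry in the original file.
CDH_ENTRY = json.loads("""
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {
      "roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
                "hadoop_datanode", "hadoop_client"],
      "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
""")


def tarball_for_role(distro: dict, role: str) -> str:
    """Return the tarball path of the first package that lists `role`.

    Hypothetical helper for illustration; raises KeyError for unknown roles.
    """
    for package in distro["packages"]:
        if role in package["roles"]:
            return package["tarball"]
    raise KeyError(f"no package provides role {role!r}")


if __name__ == "__main__":
    for role in ("hadoop_datanode", "hive", "pig"):
        print(role, "->", tarball_for_role(CDH_ENTRY, role))
```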
  • Serengeti Demo
    • Deploy Serengeti vApp on vSphere
    • Deploy a Hadoop cluster in 10 Minutes
    • Run MapReduce
    • Scale out the Hadoop cluster
    • Create a Customized Hadoop cluster
    • Use Your Favorite Hadoop Distribution
  • Why Virtualize Hadoop?
    • Simple to Operate
      • Rapid deployment
      • Unified operations across enterprise
      • Easy Clone of Cluster
    • Highly Available
      • No more single point of failure
      • One click to setup
      • High availability for MR Jobs
    • Elastic Scaling
      • Shrink and expand cluster on demand
      • Resource Guarantee
      • Independent scaling of Compute and data
  • High Availability for the Hadoop Stack
    • ETL Tools, BI Reporting, RDBMS
    • Pig (Data Flow), Hive (SQL), HCatalog
    • ZooKeeper (Coordination)
    • Hive MetaDB, HCatalog MDB
    • Management Server
    • MapReduce (Job Scheduling/Execution System): Jobtracker
    • HBase (Key-Value store)
    • HDFS (Hadoop Distributed File System): Namenode
    • Server
  • Live Machine Migration Reduces Planned Downtime
    • Description: Enables the live migration of virtual machines from one host to another with continuous service availability.
    • Benefits:
      • Revolutionary technology that is the basis for automated virtual machine movement
      • Meets service level and performance goals
  • vSphere High Availability (HA): protection against unplanned downtime
    • Overview
      • Protection against host and VM failures
      • Automatic failure detection (host, guest OS)
      • Automatic virtual machine restart in minutes, on any available host in cluster
      • OS and application-independent, does not require complex configuration changes
  • vSphere Fault Tolerance provides continuous protection
    • Overview
      • Single identical VMs running in lockstep on separate hosts
      • Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
      • Integrated with VMware HA/DRS
      • No complex clustering or specialized hardware required
      • Single common mechanism for all applications and operating systems
    • Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters
    • [Diagram: App/OS virtual machines protected by HA and FT across two VMware ESX hosts]
  • One click to HA
    • Easy to set up, one click is all you need
  • Example HA Failover for Hadoop
    • [Diagram: Serengeti and vSphere HA protecting the Namenode server; four worker nodes, each running TaskTracker, HDFS Datanode, Hive and hBase]
  • vSphere HA and Optionally FT
    • vSphere HA
      • Is application-aware: will auto-restart the NameNode (NN) if its heartbeat goes away
      • Is easy to configure
      • Has no performance overhead
    • vSphere FT
      • Has the added bonus of no pause time when there is a hardware failure
      • Is limited to one vCPU
      • Performance measurements: 2% overhead on the NN; extrapolating from current measurements, this is good for a cluster of about 300 hosts
    • HDFS 2 HA
      • Only covers the Namenode; what about the other 5+ master services?
      • Not available in Apache Hadoop 0.20
      • Not as battle-tested as vSphere HA
      • Is more complex to install and manage
  • High Availability for the Hadoop Stack
    • ETL Tools, BI Reporting, RDBMS
    • Pig (Data Flow), Hive (SQL), HCatalog
    • ZooKeeper (Coordination)
    • Hive MetaDB, HCatalog MDB
    • Management Server
    • MapReduce (Job Scheduling/Execution System): Jobtracker
    • HBase (Key-Value store)
    • HDFS (Hadoop Distributed File System): Namenode
    • Server
  • Why Virtualize Hadoop?
    • Simple to Operate
      • Rapid deployment
      • Unified operations across enterprise
      • Easy Clone of Cluster
    • Highly Available
      • No more single point of failure
      • One click to setup
      • High availability for MR Jobs
    • Elastic Scaling
      • Shrink and expand cluster on demand
      • Resource Guarantee
      • Independent scaling of Compute and data
  • Elastic Scaling and Multi-tenancy of Hadoop on vSphere
    • Current Hadoop: combined storage/compute
    • 1. Hadoop in VM
      • Single Tenant
      • Fixed Resources
    • 2. Separate Compute and Data (compute VM and storage VM)
      • Single Tenant
      • Elastic Compute
    • 3. Multiple Clusters (compute VMs for tenants T1 and T2 over shared storage)
      • Multiple Tenants
      • Elastic Compute
  • “Time Share”
    • [Diagram: Hadoop VMs and other VMs sharing VMware vSphere hosts with vHelper; HDFS on each host]
    • While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.
  • Virtualization delivers VM-level Multi-tenancy
    • Performance isolation
      • No more noisy neighbors: resource container to achieve guaranteed SLA for different tenants/users/jobs
    • Configuration isolation
      • Support multiple Hadoop environments on the same physical clusters
      • Multiple Linux versions
      • Multiple Hadoop versions
    • Security isolation
      • Higher level of security
      • Compute VM can only access data VM through Access Control List
    • [Diagram: tenants Coke and Pepsi; Runtime Layer of virtual Hadoop clusters and a Hadoop queue; Data Layer of data containers on HDFS; six hosts]
  • Agenda
    • I. Market Overview & Insights
    • II. Virtualization + Hadoop
    • III. Distribution & OSS Contribution
  • Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
    • [Logos: Commercial Vendors, Community Projects]
    • Support major distributions and multiple projects
    • Contribute Hadoop Virtualization Extension (HVE) to Open Source Community
  • Hadoop Virtualization Extensions: Topology Awareness
  • Virtual Topologies
  • Proposed Topology Changes
    • HADOOP-8468 (Umbrella JIRA)
    • HADOOP-8469
    • HDFS-3495
    • MAPREDUCE-4310
    • HDFS-3498
    • MAPREDUCE-4309
    • HADOOP-8470
    • HADOOP-8472
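The transcript keeps only the titles of the topology slides, so the following is a sketch under stated assumptions rather than the proposal itself. It assumes the extension adds a "node group" layer (VMs on the same physical host) between rack and node, and it uses the usual tree-distance rule in which the distance between two nodes is the number of hops to their closest common ancestor, summed over both sides. All path names are hypothetical.

```python
def topology_distance(path_a: str, path_b: str) -> int:
    """Tree distance between two topology paths such as /dc1/rack1/host1/vm1.

    Each node contributes the number of levels between itself and the closest
    common ancestor; the distance is the sum of both contributions.
    """
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")
    common = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        common += 1
    return (len(parts_a) - common) + (len(parts_b) - common)


if __name__ == "__main__":
    # Hypothetical four-level paths: /datacenter/rack/nodegroup/node, where the
    # node group stands for one physical host (an assumption of this sketch).
    vm1 = "/dc1/rack1/host1/vm1"
    vm2 = "/dc1/rack1/host1/vm2"  # same physical host as vm1
    vm3 = "/dc1/rack1/host2/vm3"  # same rack, different host
    vm4 = "/dc1/rack2/host3/vm4"  # different rack
    for other in (vm1, vm2, vm3, vm4):
        print(vm1, other, topology_distance(vm1, other))
```

With the extra layer, two VMs on the same host come out closer than two VMs that merely share a rack, which is the distinction a flat rack/node topology cannot express.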
  • Spring for Apache Hadoop
    • Announced initial formation of Spring Data OSS project in 2010
      • Enables Spring-powered applications to use new data access technologies
      • Data project technologies around MongoDB, Neo4J, Riak, Redis, JDBC Extensions, JPA, REST, and Blob
    • Announcing additional contributions on GitHub:
      • Integration with Cascading library
      • HBase support
      • Hadoop security support
      • More examples
      • Administrative application, RESTful API to upload Hadoop jobs to schedule for batch execution, query status, etc.
      • Web HDFS support
  • Big Data on Virtualized Infrastructure: Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand
    • Richard McDougall, VMware, Inc
    • @richardmcdougll
    • © 2009 VMware Inc. All rights reserved