Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand


Hadoop talk at Europe Apachecon introducing Project Serengeti (projectserengeti.org)


  1. Elastic, Multi-tenant Hadoop on Demand. Richard McDougall (@richardmcdougll), Chief Architect, Application Infrastructure and Big Data, VMware, Inc. ApacheCon Europe, 2012. http://www.vmware.com/hadoop http://cto.vmware.com/ http://projectserengeti.org http://github.com/vmware-serengeti © 2009 VMware Inc. All rights reserved
  2. Broad Application of Hadoop Technology
     Horizontal use cases: log processing / click stream analytics; machine learning / sophisticated data mining; web crawling / text processing; Extract Transform Load (ETL) replacement; image / XML message processing; general archiving / compliance.
     Vertical use cases: financial services; internet retailer; pharmaceutical / drug discovery; mobile / telecom; scientific research; social media.
     Hadoop's ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
  3. How does Hadoop enable parallel processing?
     • A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
     Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
  4. Hadoop System Architecture
     • MapReduce: programming framework for highly parallel data processing
     • Hadoop Distributed File System (HDFS): distributed data storage
  5. Job Tracker Schedules Tasks Where the Data Resides
     [Diagram: a job's input file is divided into three 64 MB splits (Split 1, Split 2, Split 3). The Job Tracker hands Task 1, Task 2 and Task 3 to the Task Trackers on Host 1, Host 2 and Host 3. Each host also runs a DataNode holding one 64 MB block of the input file (Block 1, Block 2, Block 3).]
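The scheduling idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the JobTracker's actual algorithm: the function name `assign_tasks`, the host names and the single-slot-per-host setup are all assumptions made for the example. Each split is given to a host that already stores its block when that host has a free slot, and falls back to any free host (a remote read) otherwise.

```python
from typing import Dict, List, Optional


def assign_tasks(block_hosts: Dict[str, List[str]],
                 free_slots: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Assign one task per split, preferring hosts that store the split's block.

    block_hosts maps a split name to the hosts holding a replica of its block.
    free_slots maps a host name to its number of free task slots.
    Both structures are hypothetical stand-ins for JobTracker state.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment: Dict[str, Optional[str]] = {}
    for split, hosts in block_hosts.items():
        # First choice: a host that holds the data and has a free slot.
        chosen = next((h for h in hosts if slots.get(h, 0) > 0), None)
        if chosen is None:
            # Fallback: any host with a free slot; the task reads remotely.
            chosen = next((h for h, n in slots.items() if n > 0), None)
        if chosen is not None:
            slots[chosen] -= 1
        assignment[split] = chosen  # None means no slot was available
    return assignment


if __name__ == "__main__":
    blocks = {"split-1": ["host1"], "split-2": ["host2"], "split-3": ["host3"]}
    print(assign_tasks(blocks, {"host1": 1, "host2": 1, "host3": 1}))
```

With one block and one free slot per host, as in the diagram, every task lands on the host that stores its data.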
  6. Hadoop Distributed File System
  7. Hadoop Data Locality and Replication
  8. The Right Big Data Tools for the Right Job…
     [Layered diagram]
     • Real-time streams (social, sensors); machine learning (Mahout, etc.); real-time processing (S4, Storm, Spark); data visualization (Excel, Tableau)
     • ETL (Informatica, Talend, Spring Integration); real-time database (Shark, GemFire, HBase, Cassandra); interactive analytics (Impala, Greenplum, AsterData, Netezza…); Hive; batch processing (Map-Reduce)
     • Structured and unstructured data (HDFS, MapR)
     • Cloud infrastructure: compute, storage, networking
  9. So yes, there's a lot more than just Map-Reduce…
     [Diagram]
     • Compute layer: Hadoop batch analysis; HBase real-time queries; Big SQL: Impala; NoSQL: Cassandra, Mongo, etc.; Other: Spark, Shark, Solr, Platfora, etc.
     • Data layer: HDFS
     • Underneath: some sort of distributed, resource management OS + filesystem, running across a row of hosts
  10. Elasticity Enables Sharing of Resources
  11. Containers with Isolation are a Tried and Tested Approach
      [Diagram: Hungry Workload 1, Reckless Workload 2 and Sneaky Workload 3 sit on top of some sort of distributed, resource management OS + filesystem, running across a row of hosts.]
  12. Mixing Workloads: Three Big Types of Isolation are Required
      • Resource isolation
          • Control the greedy noisy neighbor
          • Reserve resources to meet needs
      • Version isolation
          • Allow concurrent OS, app, distro versions
      • Security isolation
          • Provide privacy between users/groups
          • Runtime and data privacy required
      [Diagram: some sort of distributed, resource management OS + filesystem, running across a row of hosts.]
  13. Community Activity in Isolation and Resource Management
      • YARN
          • Goal: support workloads other than M-R on Hadoop
          • Initial need is for MPI/M-R from Yahoo
          • Not quite ready for prime time yet?
          • Non-POSIX file system self-selects workload types
      • Mesos
          • Distributed resource broker
          • Mixed workloads with some RM
          • Active project, in use at Twitter
          • Leverages OS virtualization, e.g. cgroups
      • Virtualization
          • Virtual machine as the primary isolation, resource management and versioned deployment container
          • Basis for Project Serengeti
  14. Project Serengeti: Hadoop on Virtualization
      • Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster
      • Highly Available: no more single point of failure; one click to set up; high availability for MR jobs
      • Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data
      Serengeti is an open source project to automate deployment of Hadoop on virtual platforms. http://projectserengeti.org http://github.com/vmware-serengeti
  15. Common Infrastructure for Big Data
      [Diagram: separate MPP DB, HBase and Hadoop clusters, each on its own virtualization platform, are consolidated into one cluster running Hadoop, HBase and MPP DB on a shared virtualization platform.]
      • Cluster sprawl: single-purpose clusters for various business applications lead to cluster sprawl.
      • Simplify
          • Single hardware infrastructure
          • Unified operations
      • Optimize
          • Shared resources = higher utilization
          • Elastic resources = faster on-demand access
  16. Evolution of Hadoop on VMs
      [Diagram: three stages, from a single slave-node VM with combined storage/compute, to separate compute and storage VMs, to per-tenant compute VMs (T1, T2) over a shared storage VM.]
      • Hadoop in VM (current Hadoop: combined storage/compute)
          - VM lifecycle determined by Datanode
          - Limited elasticity
          - Limited to Hadoop multi-tenancy
      • Separate Storage
          - Separate compute from data
          - Elastic compute
          - Enable shared workloads
          - Raise utilization
      • Separate Compute Clusters
          - Separate virtual clusters per tenant
          - Stronger VM-grade security and resource isolation
          - Enable deployment of multiple Hadoop runtime versions
  17. In-house Hadoop as a Service, "Enterprise EMR" (Hadoop + Hadoop)
      [Diagram]
      • Compute layer: production ETL of log files; ad hoc data mining; production recommendation engine
      • Data layer: HDFS, HDFS
      • Virtualization platform across a row of hosts
  18. Integrated Hadoop and Webapps (Hadoop + Other Workloads)
      [Diagram]
      • Compute layer: short-lived Hadoop compute cluster; Hadoop compute cluster; web servers for ecommerce site
      • Data layer: HDFS
      • Virtualization platform across a row of hosts
  19. Integrated Big Data Production (Hadoop + Other Big Data)
      [Diagram]
      • Compute layer: Hadoop batch analysis; HBase real-time queries; Big SQL: Impala; NoSQL: Cassandra, Mongo, etc.; Other: Spark, Shark, Solr, Platfora, etc.
      • Data layer: HDFS
      • Virtualization across a row of hosts
  20. Deploy a Hadoop Cluster in under 30 Minutes
      Step 1: Deploy Serengeti virtual appliance on vSphere.
          • Deploy vHelper OVF to vSphere
      Step 2: A few simple commands to stand up a Hadoop cluster.
          • Select compute, memory, storage and network
          • Select configuration template
          • Automate deployment
          • Done
  21. A Tour Through Serengeti
      $ ssh serengeti@serengeti-vm
      $ serengeti
      serengeti>
  22. A Tour Through Serengeti
      serengeti> cluster create --name dcsep
      serengeti> cluster list
      name: dcsep, distro: apache, status: RUNNING
      NAME     ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      -----------------------------------------------------------------------------
      master   [hadoop_namenode, hadoop_jobtracker]  1         6    2048     LOCAL  10
      data     [hadoop_datanode]                     1         2    1024     LOCAL  10
      compute  [hadoop_tasktracker]                  8         2    1024     LOCAL  10
      client   [hadoop_client, pig, hive]            1         1    3748     LOCAL  10
  23. Serengeti Spec File
      Callouts on the slide: "distro" is the choice of distro; "ha" is the HA option; "storage" is the choice of shared storage or local disk.
      [
        "distro":"apache",
        {
          "name": "master",
          "roles": [ "hadoop_NameNode", "hadoop_jobtracker" ],
          "instanceNum": 1,
          "instanceType": "MEDIUM",
          "ha":true,
        },
        {
          "name": "worker",
          "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
          "instanceNum": 5,
          "instanceType": "SMALL",
          "storage": {
            "type": "LOCAL",
            "sizeGB": 10
          }
        },
      ]
  24. Fully Customizable Configuration Profile
      • Tune Hadoop cluster config in Serengeti spec file
          "configuration": { "hadoop": { "mapred-site.xml": { "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler" …
      • Control the placement of Hadoop nodes
          "placementPolicies": { "instancePerHost": 2, "groupRacks": { "type": "ROUNDROBIN", "racks": ["rack1", "rack2", "rack3"] …
      • Set up physical racks/hosts mapping topology
          > topology upload --fileName <topology file name>
          > topology list
      • Create Hadoop clusters using HVE topology
          > cluster create --name XXX --topology HVE --distro <HVE-supported_distro>
  25. Getting to Insights
      • Point compute-only cluster to existing HDFS
          … "externalHDFS": "hdfs://hostname-of-namenode:8020", …
      • Interact with HDFS from Serengeti CLI
          > fs ls /tmp
          > fs put --from /tmp/local.data --to /tmp/hdfs.data
      • Launch MapReduce/Pig/Hive jobs from Serengeti CLI
          > cluster target --name myHadoop
          > mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.PiEstimator --args "100 1000000000"
      • Deploy Hive Server for ODBC/JDBC services
          "name": "client", "roles": [ "hadoop_client", "hive", "hive_server", "pig" ], …
  26. Configuring Distros
      {
        "name" : "cdh",
        "version" : "3u3",
        "packages" : [
          {
            "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
            "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
          },
          {
            "roles" : ["hive"],
            "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
          },
          {
            "roles" : ["pig"],
            "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
          }
        ]
      },
  27. Serengeti Demo
      • Deploy Serengeti vApp on vSphere
      • Deploy a Hadoop cluster in 10 minutes
      • Run MapReduce
      • Scale out the Hadoop cluster
      • Create a customized Hadoop cluster
      • Use your favorite Hadoop distribution
  28. Serengeti Architecture (http://github.com/vmware-serengeti)
      [Diagram labels, in the order they appear on the slide]
      • Java; Serengeti CLI (spring-shell); Ruby
      • Serengeti Server: Serengeti web service; DB; Resource Mgr, Cluster Mgr, Network mgr, Task mgr, Distro mgr
      • Deploy Engine Proxy (bash script): shell command to trigger deployment; report deployment progress and summary
      • Knife cluster CLI; RabbitMQ (shared with Chef server)
      • Chef Orchestration Layer (Ironfan): service provisioning inside VMs
      • Cloud Manager: cluster provision engine (VM CRUD)
      • Package server: download packages
      • Chef server: cookbooks/roles, data bags; Chef bootstrap nodes; download cookbooks/recipes
      • Fog; vSphere cloud provider; resource services; vCenter
      • Four Hadoop nodes and a Client VM, each running Chef-client
  29. Use Local Disk Where it's Needed
                    SAN Storage         NAS Filers         Local Storage
      Cost          $2 - $10/Gigabyte   $1 - $5/Gigabyte   $0.05/Gigabyte
      $1M gets:     0.5 Petabytes       1 Petabyte         10 Petabytes
                    200,000 IOPS        200,000 IOPS       400,000 IOPS
                    8 Gbyte/sec         10 Gbyte/sec       250 Gbytes/sec
  30. Rules of Thumb: Sizing for Hadoop
      • Disk
          • Provide about 50 Mbytes/sec of disk bandwidth per core
          • If using SATA, that's about one disk per core
      • Network
          • Provide about 200 Mbits/sec of aggregate network bandwidth per core
      • Memory
          • Use a memory:core ratio of about 4 Gbytes:core
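The rules of thumb on this slide are simple per-core multipliers, so they are easy to turn into a small calculator. The sketch below is only an illustration: the function name `size_host` and the dictionary keys are made up for this example, and the constants are taken directly from the slide.

```python
def size_host(cores: int) -> dict:
    """Apply the slide's per-core sizing rules of thumb to one host."""
    return {
        "disk_bandwidth_mbytes_per_sec": 50 * cores,  # ~50 Mbytes/sec per core
        "sata_disks": cores,                          # ~one SATA disk per core
        "network_mbits_per_sec": 200 * cores,         # ~200 Mbits/sec per core
        "memory_gbytes": 4 * cores,                   # ~4 Gbytes per core
    }


if __name__ == "__main__":
    # An 8-core host, chosen only as an example input.
    for name, value in size_host(8).items():
        print(f"{name}: {value}")
```

For an 8-core host this gives about 400 Mbytes/sec of disk bandwidth (roughly 8 SATA disks), 1,600 Mbits/sec of network bandwidth and 32 Gbytes of memory.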
  31. Extend Virtual Storage Architecture to Include Local Disk
      • Shared Storage: SAN or NAS
          • Easy to provision
          • Automated cluster rebalancing
      • Hybrid Storage
          • SAN for boot images, VMs, other workloads
          • Local disk for Hadoop & HDFS
          • Scalable bandwidth, lower cost/GB
      [Diagram: two rows of hosts, each running a mix of Hadoop VMs and other VMs.]
  32. Hadoop Using Local Disks
      [Diagram: a virtualization host runs other workloads and a Hadoop virtual machine containing a Task Tracker and a Datanode. The Hadoop VM's Ext4 file systems map to VMDKs; the OS image VMDK is on shared storage.]
  33. Native versus Virtual Platforms, 24 hosts, 12 disks/host
      [Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs.]
  34. Local vs Various SAN Storage Configurations
      16 x HP DL380G7, EMC VNX 7500, 96 physical disks
      [Chart: elapsed time as a ratio to local disks (lower is better) for TeraGen, TeraSort and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5.]
  35. Hadoop Virtualization Extensions: Topology Awareness
  36. Virtual Topologies
  37. Hadoop Topology Changes for Virtualization
  38. Hadoop Virtualization Extensions for Topology
      • HVE: Task Scheduling Policy Extension; Balancer Policy Extension; Replica Choosing Policy Extension; Replica Placement Policy Extension; Replica Removal Policy Extension; Network Topology Extension
      • Hadoop: HDFS; MapReduce; Hadoop Common
      • JIRAs: HADOOP-8468 (umbrella JIRA), HADOOP-8469, HDFS-3495, MAPREDUCE-4310, HDFS-3498, MAPREDUCE-4309, HADOOP-8470, HADOOP-8472
      Terasort locality         Data Local   Node-group Local   Rack Local
      Normal                    392          -                  8
      Normal with HVE           397          2                  1
      D/C separation            0            -                  400
      D/C separation with HVE   0            400                0
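The three locality columns in the table can be read as levels in a topology path with one extra layer for the node group. The Python sketch below illustrates that reading and is not HVE's implementation: the `/rack/nodegroup/node` path format, the function name `locality` and the example paths are assumptions for this example, with a node group standing for the VMs that share one physical host.

```python
def locality(task_path: str, replica_path: str) -> str:
    """Classify how close a task is to a block replica.

    Paths use a hypothetical "/rack/nodegroup/node" layout, where the node
    group layer represents VMs running on the same physical host.
    """
    task = task_path.strip("/").split("/")
    replica = replica_path.strip("/").split("/")
    if task == replica:
        return "data-local"        # same VM holds the block
    if task[:2] == replica[:2]:
        return "node-group-local"  # different VM, same physical host
    if task[:1] == replica[:1]:
        return "rack-local"        # different host, same rack
    return "off-rack"


if __name__ == "__main__":
    # A compute VM and a data VM placed on the same physical host.
    print(locality("/rack1/host1/compute-vm", "/rack1/host1/data-vm"))
```

Under this reading, a compute VM that sits next to its data VM on one host is only rack-local without the extra layer, and node-group-local with it, which matches the shift from the "D/C separation" row to the "D/C separation with HVE" row.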
  39. Why Virtualize Hadoop?
      • Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster
      • Highly Available: no more single point of failure; one click to set up; high availability for MR jobs
      • Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data
  40. Live Machine Migration Reduces Planned Downtime
      Description: Enables the live migration of virtual machines from one host to another with continuous service availability.
      Benefits:
      • Revolutionary technology that is the basis for automated virtual machine movement
      • Meets service level and performance goals
  41. vSphere High Availability (HA): Protection Against Unplanned Downtime
      Overview
      • Protection against host and VM failures
      • Automatic failure detection (host, guest OS)
      • Automatic virtual machine restart in minutes, on any available host in cluster
      • OS and application-independent, does not require complex configuration changes
  42. Example HA Failover for Hadoop
      [Diagram: Serengeti Server; vSphere HA; two Namenode boxes; four worker nodes, each with a TaskTracker, an HDFS Datanode, Hive and HBase.]
  43. vSphere Fault Tolerance Provides Continuous Protection
      Overview
      • Single identical VMs running in lockstep on separate hosts
      • Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
      • Integrated with VMware HA/DRS
      • No complex clustering or specialized hardware required
      • Single common mechanism for all applications and operating systems
      Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters
      [Diagram: App/OS virtual machines on two VMware ESX hosts, protected by HA and FT.]
  44. High Availability for the Hadoop Stack
      [Stack diagram labels: ETL Tools; BI Reporting; RDBMS; Pig (Data Flow); Hive (SQL); HCatalog; ZooKeeper (Coordination); Hive and HCatalog MetaDB (MDB); Management Server; MapReduce (Job Scheduling/Execution System) with Jobtracker; HBase (Key-Value store); HDFS (Hadoop Distributed File System) with Namenode; Server.]
  45. Performance Effect of FT for Master Daemons
      • NameNode and JobTracker placed in separate uniprocessor (UP) VMs
      • Small overhead: enabling FT causes 2-4% slowdown for TeraSort
      • 8 MB case places similar load on NN & JT as >200 hosts with 256 MB
      [Chart: TeraSort elapsed time as a ratio to FT off (axis from 1 to 1.04) against HDFS block size in MB: 256, 64, 16, 8.]
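One plausible reading of the third bullet is that shrinking the HDFS block size multiplies the number of blocks the NameNode tracks and the number of map tasks the JobTracker schedules, so a small cluster can imitate the master-daemon load of a much larger one. The Python sketch below shows only that arithmetic. The 1 TiB input size and the function name `map_tasks` are assumptions for illustration; the slide does not state the data size or the size of the test cluster.

```python
import math


def map_tasks(input_bytes: int, block_mb: int) -> int:
    """Approximate the number of blocks (and map tasks) for a given block size."""
    return math.ceil(input_bytes / (block_mb * 1024 * 1024))


if __name__ == "__main__":
    one_tib = 2 ** 40  # hypothetical input size, not taken from the slide
    large = map_tasks(one_tib, 256)
    small = map_tasks(one_tib, 8)
    print(f"256 MB blocks: {large} tasks")
    print(f"8 MB blocks:   {small} tasks")
    print(f"ratio:         {small // large}x")
```

Going from 256 MB to 8 MB blocks multiplies the block and task count by 32 for the same input, whatever the input size.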
  46. Why Virtualize Hadoop?
      • Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster
      • Highly Available: no more single point of failure; one click to set up; high availability for MR jobs
      • Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data
  47. "Time Share"
      [Diagram: three hosts running VMware vSphere, each with HDFS, host a mix of other VMs and Hadoop VMs managed by Serengeti.]
      While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.
  48. Hadoop Task Tracker and Data Node in a VM
      [Diagram: a virtualization host runs other workloads and one virtual Hadoop node containing a Task Tracker with slots and a Datanode, backed by a VMDK. Callouts: add/remove slots? Grow/shrink by tens of GB?]
      Grow/shrink of a VM is one approach
  49. Add/remove Virtual Nodes
      [Diagram: a virtualization host runs other workloads and two virtual Hadoop nodes, each containing a Task Tracker with slots and a Datanode, each backed by a VMDK.]
      Just add/remove more virtual nodes?
  50. But State Makes it Hard to Power Off a Node
      [Diagram: a virtualization host runs other workloads and one virtual Hadoop node containing a Task Tracker with slots and a Datanode, backed by a VMDK.]
      Powering off the Hadoop VM would in effect fail the datanode
  51. Adding a Node Needs Data…
      [Diagram: a virtualization host runs other workloads and two virtual Hadoop nodes, each containing a Task Tracker with slots and a Datanode, each backed by a VMDK.]
      Adding a node would require TBs of data replication
  52. Separated Compute and Data
      [Diagram: a virtualization host runs other workloads, several virtual Hadoop nodes that contain only a Task Tracker with slots, and one virtual Hadoop node that contains the Datanode, backed by VMDKs.]
      Truly elastic Hadoop: scalable through virtual nodes
  53. Dataflow with Separated Compute/Data
      [Diagram: on one virtualization host, a virtual Hadoop node with slots (NodeManager) and a virtual Hadoop node with the Datanode each have a virtual NIC connected to the virtual switch; the host also provides the VMDK and NIC drivers.]
  54. Elastic Compute
      • Set number of active TaskTracker nodes
          > cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8
      • Enable all the TaskTrackers in the cluster
          > cluster unlimit --name myHadoop
  55. Performance Analysis of Separation
      • Combined mode: 1 combined Compute/Datanode VM per host
      • Split mode: 1 Datanode VM and 1 Compute node VM per host
      Workload: TeraGen, TeraSort, TeraValidate
      HW configuration: 8 cores, 96GB RAM, 16 disks per host x 2 nodes
  56. Performance Analysis of Separation
      Minimal performance impact with separation of compute and data
      [Chart: elapsed time as a ratio to combined mode for TeraGen, TeraSort and TeraValidate; series: Combined, Split.]
  57. Freedom of Choice and Open Source
      [Logo groups: Distributions; Community Projects]
      • Flexibility to choose from major distributions
          cluster create --name myHadoop --distro apache
      • Support for multiple projects
      • Open architecture to welcome industry participation
      • Contributing Hadoop Virtualization Extensions (HVE) to open source community
  58. Elastic, Multi-tenant Hadoop on Demand. Richard McDougall (@richardmcdougll), Chief Architect, Application Infrastructure and Big Data, VMware, Inc. ApacheCon Europe, 2012. http://www.vmware.com/hadoop http://cto.vmware.com/ http://projectserengeti.org http://github.com/vmware-serengeti