Inside the Hadoop Machine @ VMworld

Transcript

  • 1. APP-CAP2956: Inside the Hadoop Machine
    Jeff Buell, VMware, Inc.
    Richard McDougall, VMware, Inc.
    Sanjay Radia, Hortonworks
    #vmworldapps
  • 2. Disclaimer
    • This session may contain product features that are currently under development.
    • This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
    • Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
    • Technical feasibility and market demand will affect final delivery.
    • Pricing and packaging for any new technologies or features discussed or presented have not been determined.
  • 3. Broad Application of Hadoop technology
    • Horizontal Use Cases
        • Log Processing / Click Stream Analytics
        • Machine Learning / sophisticated data mining
        • Web crawling / text processing
        • Extract Transform Load (ETL) replacement
        • Image / XML message processing
        • General archiving / compliance
    • Vertical Use Cases
        • Financial Services
        • Internet Retailer
        • Pharmaceutical / Drug Discovery
        • Mobile / Telecom
        • Scientific Research
        • Social Media
    Hadoop's ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
  • 4. How does Hadoop enable parallel processing?
    • A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
    Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
  • 5. Hadoop System Architecture
    • MapReduce: Programming framework for highly parallel data processing
    • Hadoop Distributed File System (HDFS): Distributed data storage
  • 6. Job Tracker Schedules Tasks Where the Data Resides
    [Diagram: the Job Tracker divides a job's input file into Split 1, Split 2, and Split 3 (64 MB each) and hands Task 1, Task 2, and Task 3 to the Task Trackers on Host 1, Host 2, and Host 3. The DataNode on each host stores a block of the input file: Block 1, Block 2, and Block 3 (64 MB each).]
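The scheduling idea on slide 6 can be illustrated with a small Python sketch. It is not the JobTracker's actual algorithm: the function name `assign_splits`, the host names, and the one-slot-per-host setup are assumptions made for illustration. Each split goes to a host whose DataNode already holds the block when that host has a free slot, and falls back to any free host otherwise.

```python
def assign_splits(block_locations, free_slots):
    """Greedy, locality-first task placement (illustrative only).

    block_locations: dict split name -> list of hosts holding that block
    free_slots: dict host -> number of free task slots
    Returns dict split name -> (host, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is untouched
    assignment = {}
    for split, hosts in block_locations.items():
        # Prefer a host that already stores the block (data-local task).
        local = next((h for h in hosts if slots.get(h, 0) > 0), None)
        if local is not None:
            chosen, is_local = local, True
        else:
            # Fall back to any host with a free slot (remote read).
            chosen = next((h for h, n in slots.items() if n > 0), None)
            is_local = False
        if chosen is None:
            raise RuntimeError(f"no free slot for {split}")
        slots[chosen] -= 1
        assignment[split] = (chosen, is_local)
    return assignment


# Three 64 MB splits, one block per host, as in the slide's diagram.
blocks = {"split1": ["host1"], "split2": ["host2"], "split3": ["host3"]}
placement = assign_splits(blocks, {"host1": 1, "host2": 1, "host3": 1})
for split, (host, is_local) in placement.items():
    print(split, "->", host, "(local)" if is_local else "(remote)")
```

With one block per host and one free slot each, every task runs data-local. If two splits compete for the same host, the second one is placed remotely.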
  • 7. Hadoop Distributed File System
  • 8. Hadoop Data Locality and Replication
  • 9. Hadoop Topology Awareness
  • 10. Why Virtualize Hadoop?
    • Simple to Operate
        • Rapid deployment
        • Unified operations across enterprise
        • Easy Clone of Cluster
    • Highly Available
        • No more single point of failure
        • One click to setup
        • High availability for MR Jobs
    • Elastic Scaling
        • Shrink and expand cluster on demand
        • Resource Guarantee
        • Independent scaling of Compute and data
  • 11. Enterprise Challenges with Using Hadoop
    • Deployment
        • Slow to provision
        • Complex to keep running and tune
    • Single Points of Failure
        • Single point of failure with Name Node and Job Tracker
        • No HA for Hadoop framework components (Hive, HCatalog, etc.)
    • Low Utilization
        • Dedicated clusters to run Hadoop with low CPU utilization
        • No easy way to share resources between Hadoop and non-Hadoop workloads
        • Noisy neighbor, lack of resource containment
    • Need Multi-tenant Isolation, Resource Management, etc.
        • Noisy neighbor: no performance or security isolation between different tenants/users
        • Lack of configuration isolation: can't run multiple versions on the cluster
  • 12. Virtualization enables a Common Infrastructure for Big Data
    • Cluster Sprawling
        • Single purpose clusters (MPP DB, HBase, Hadoop) for various business applications lead to cluster sprawl.
    • Cluster Consolidation (Hadoop, HBase, MPP DB on one Virtualization Platform)
        • Simplify
            • Single Hardware Infrastructure
            • Unified operations
        • Optimize
            • Shared Resources = higher utilization
            • Elastic resources = faster on-demand access
  • 13. Deploy a Hadoop Cluster in under 30 Minutes
    • Step 1: Deploy Serengeti virtual appliance on vSphere.
        • Deploy vHelper OVF to vSphere
    • Step 2: A few simple commands to stand up Hadoop Cluster.
        • Select compute, memory, storage and network
        • Select configuration template
        • Automate deployment
    • Done
  • 14. A Tour Through Serengeti
    $ ssh serengeti@serengeti-vm
    $ serengeti
    serengeti>
  • 15. A Tour Through Serengeti
    serengeti> cluster create --name myElephant
    serengeti> cluster list --name myElephant
    name: myElephant, distro: cdh, status: RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hadoop_NameNode, hadoop_jobtracker]  1         2    7500     LOCAL 50
    name: myElephant, distro: cdh, status: RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hive, hadoop_client, pig]            1         1    3700     LOCAL 50
      NAME                HOST                             IP
      -----------------------------------------------------------------
      myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184
  • 16. A Tour Through Serengeti
    $ ssh rmc@rmc-elephant-009.eng.vmware.com
    $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
    …
  • 17. Serengeti Spec File
    [
      "distro": "apache",                  <- Choice of Distro
      {
        "name": "master",
        "roles": [
          "hadoop_NameNode",
          "hadoop_jobtracker"
        ],
        "instanceNum": 1,
        "instanceType": "MEDIUM",
        "ha": true                         <- HA Option
      },
      {
        "name": "worker",
        "roles": [
          "hadoop_datanode",
          "hadoop_tasktracker"
        ],
        "instanceNum": 5,
        "instanceType": "SMALL",
        "storage": {                       <- Choice of Shared Storage or Local Disk
          "type": "LOCAL",
          "sizeGB": 10
        }
      }
    ]
  • 18. Configuring Distros
    {
      "name" : "cdh",
      "version" : "3u3",
      "packages" : [
        {
          "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
                     "hadoop_datanode", "hadoop_client"],
          "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
        },
        {
          "roles" : ["hive"],
          "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
        },
        {
          "roles" : ["pig"],
          "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
        }
      ]
    },
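The distro entry on slide 18 maps Hadoop roles to the tarball that provides them. As a rough illustration of how such an entry could be consumed, the Python sketch below parses it and looks up the tarball for a role. The helper `tarball_for_role` is hypothetical and is not part of Serengeti; the trailing comma after the entry is dropped so that it parses on its own.

```python
import json

# The manifest entry from slide 18 (trailing comma dropped so it parses alone).
MANIFEST = json.loads("""
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {"roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
               "hadoop_datanode", "hadoop_client"],
     "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
""")


def tarball_for_role(manifest, role):
    """Return the tarball that provides `role`, or None if no package lists it."""
    for package in manifest["packages"]:
        if role in package["roles"]:
            return package["tarball"]
    return None


print(tarball_for_role(MANIFEST, "hive"))
print(tarball_for_role(MANIFEST, "hadoop_datanode"))
```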
  • 19. Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
    [Diagram: Commercial Vendors and Community Projects]
    • Support major distributions and multiple projects
    • Contribute Hadoop Virtualization Extension (HVE) to Open Source Community
  • 20. Use Local Disk where it's Needed
                         SAN Storage     NAS Filers      Local Storage
      Cost per gigabyte  $2 - $10        $1 - $5         $0.05
      $1M gets:
        Capacity         0.5 petabytes   1 petabyte      10 petabytes
        IOPS             200,000         200,000         400,000
        Bandwidth        8 Gbyte/sec     10 Gbyte/sec    250 Gbyte/sec
  • 21. Extend Virtual Storage Architecture to Include Local Disk
    • Shared Storage: SAN or NAS
        • Easy to provision
        • Automated cluster rebalancing
    • Hybrid Storage
        • SAN for boot images, VMs, other workloads
        • Local disk for Hadoop & HDFS
        • Scalable Bandwidth, Lower Cost/GB
    [Diagram: hosts running Hadoop VMs alongside other VMs]
  • 22. Hadoop has Significant Ephemeral Data
    [Diagram: data flow of a MapReduce job. Map tasks read DFS input data and write spills and logs (spill*.out, Map_*.out, file.out); the shuffle, sort, and combine stages produce intermediate data (Intermediate.out); reduce writes DFS output data to HDFS. Bandwidth annotations on the diagram: 75% of disk bandwidth, 12% of bandwidth, 12% of bandwidth.]
  • 23. Virtualized Hadoop Performance
    • Issues of interest
        • Native vs various virtual configurations
        • Local disks vs Fibre Channel SAN
        • Effect of protecting Hadoop master daemons with Fault Tolerance
        • Public cloud (renting) vs private cloud (buying)
    • Test bed
        • Arista 7124SX 10 GbE switch
        • 24x HP DL380 G7: 2x X5687, 72 GB, 16x SAS 146 GB, Broadcom 10 GbE adapter, Qlogic 8 Gb/s HBA
        • EMC VNX7500
  • 24. Configuration
    • Software
        • vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
        • RHEL 6.1 x86_64
        • Cloudera CDH3u4
        • Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
    • Hadoop VMs
        • Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host
        • Separate VMs for NameNode and JobTracker for storage and FT tests
    • Hadoop configuration
        • One map and one reduce task per vCPU (= logical thread)
        • Machines are highly loaded
        • 256 MB block size
        • FT tests: 8 to 256 MB block sizes to vary load on NN and JT
  • 25. Native versus Virtual Platforms, 24 hosts, 12 disks/host
    [Bar chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate. Series: Native, 1 VM, 2 VMs, 4 VMs.]
  • 26. Local vs Various SAN Storage Configurations
    16 x HP DL380G7, EMC VNX 7500, 96 physical disks
    [Bar chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate. Series: Local disks; SAN JBOD; SAN RAID-0, 16 KB page size; SAN RAID-0; SAN RAID-5.]
  • 27. Performance Effect of FT for Master Daemons
    • NameNode and JobTracker placed in separate UP (uniprocessor) VMs
    • Small overhead: Enabling FT causes 2-4% slowdown for TeraSort
    • 8 MB case places similar load on NN & JT as >200 hosts with 256 MB
    [Chart: TeraSort elapsed time ratio to FT off (axis from 1 to 1.04) versus HDFS block size in MB: 256, 64, 16, 8.]
  • 28. Different Clouds for Different Folks
    • Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
    • Google/MapR: SaaS on Google Compute Engine
    • vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4
    • Vastly different cluster sizes
        • Compare throughput (MB sorted per second) normalized with resources
    • Cost: rental or estimate of running continuously for 3 years
                   #cores  #disks  TeraSort, s  MB/s/core  MB/s/disk  cost
      Yahoo!       11680   5840    62           1.3        2.6        ~$7
      Google/MapR  5024    1256    80           2.4        9.5        $16
      vSphere 5.1  192     192     442          11.2       11.2       ~$2
      vSphere 5.1  192     288     359          13.8       9.2        ~$2
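The normalized throughput columns on slide 28 can be recomputed from the core counts, disk counts, and TeraSort times. The Python sketch below does this under two assumptions that are not stated on the slide but reproduce its figures: each run sorted 10^12 bytes, and "MB" means 2^20 bytes. The function name and the row labels for the two vSphere configurations are illustrative.

```python
TB_SORTED_BYTES = 10**12   # assumption: 1 TB sorted = 10^12 bytes
MB = 2**20                 # assumption: 1 MB = 2^20 bytes

# (cores, disks, TeraSort elapsed seconds) taken from the slide's table.
CLUSTERS = {
    "Yahoo!": (11680, 5840, 62),
    "Google/MapR": (5024, 1256, 80),
    "vSphere 5.1, 8 disks/host": (192, 192, 442),
    "vSphere 5.1, 12 disks/host": (192, 288, 359),
}


def normalized_throughput(cores, disks, seconds):
    """Return (MB/s per core, MB/s per disk), rounded to one decimal place."""
    mb_per_s = TB_SORTED_BYTES / MB / seconds
    return round(mb_per_s / cores, 1), round(mb_per_s / disks, 1)


for name, (cores, disks, seconds) in CLUSTERS.items():
    per_core, per_disk = normalized_throughput(cores, disks, seconds)
    print(f"{name}: {per_core} MB/s/core, {per_disk} MB/s/disk")
```

Normalizing by cores and disks is what makes a 24-host cluster comparable with clusters of more than a thousand hosts: absolute sort time favors the large clusters, while per-resource throughput favors the small one.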
  • 29. Why Virtualize Hadoop? (Repeats slide 10: Simple to Operate, Highly Available, Elastic Scaling.)
  • 30. VMware-Hortonworks Joint Engineering
    • Hortonworks goal
        • Expand Hadoop ecosystem
        • Provide first class support of various platforms
        • Hadoop should run well on VMs
        • VMs offer several advantages as presented earlier
        • Take advantage of vSphere for HA
    • First class support for VMs
        • Topology plugins (HADOOP-8468)
            • 2 VMs can be on same host
            • Pick closer data
            • Schedule tasks closer
            • Don't put two replicas on same host
        • MR-tmp on HDFS using block pools
            • Elastic Compute-VMs will not need local disk
        • Fast communications within VMs
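The "don't put two replicas on same host" rule from slide 30 can be sketched in a few lines of Python. This is only an illustration of the idea, not the topology plugin code from HADOOP-8468: the function `place_replicas`, the VM names, and the host names are all hypothetical.

```python
def place_replicas(vm_to_host, candidates, replicas=3):
    """Pick up to `replicas` VMs from `candidates`, at most one per physical host.

    vm_to_host: dict VM name -> physical host name
    candidates: VM names in preference order
    """
    chosen, used_hosts = [], set()
    for vm in candidates:
        host = vm_to_host[vm]
        if host in used_hosts:
            continue  # a replica already lives on this physical host
        chosen.append(vm)
        used_hosts.add(host)
        if len(chosen) == replicas:
            break
    return chosen


# Two VMs per physical host, as in the slide's "2 VMs can be on same host" case.
topology = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB",
            "vm4": "hostB", "vm5": "hostC", "vm6": "hostC"}
print(place_replicas(topology, ["vm1", "vm2", "vm3", "vm4", "vm5", "vm6"]))
```

Without the host check, a placement that only sees VMs could put two replicas on vm1 and vm2, and a single physical host failure would then take out both copies.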
  • 31. Hadoop Full-Stack High Availability
    [Diagram labels: Slave Nodes of Hadoop Cluster (running jobs); Apps Running Outside; Failover; JT into Safemode; HA Cluster for Master Daemons (NN and JT on servers, N+K failover).]
  • 32. HA is in HDP 1.0 Using Total System Availability Architecture
  • 33. HA in Hadoop 1 with HDP1
    • Full Stack High Availability
        • Namenode
        • Clients pause automatically
        • JobTracker pauses automatically
        • Other Hadoop master services (JT, …) coming
    • Use industry proven HA framework
        • VMware vSphere-HA
        • Failover, fencing, …
        • Corner cases are tricky: if not addressed, corruption
    • Additional benefits
        • N-N & N+K failover
        • Migration for maintenance
  • 34. Hadoop NN/JT HA with vSphere
  • 35. Namenode Failover Times
    • 60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
        • Failure detection and failover: 0.5 to 2 minutes
        • Namenode startup (exit safemode): 30 sec
    • 180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
        • Failure detection and failover: 0.5 to 2 minutes
        • Namenode startup (exit safemode): 110 sec
    For vSphere, OS bootup is needed; its 10-20 seconds are included above.
    Cold failover is good enough for small/medium clusters.
    Failure detection and automatic failover dominate.
  • 36. Summary
    • Advantages of Hadoop on VMs
        • Cluster Management
        • Cluster consolidation
        • Greater Elasticity in mixed environment
        • Alternate multi-tenancy to capacity scheduler's offerings
    • HA for Hadoop Master Daemons
        • vSphere based HA for NN, JT, … in Hadoop 1
        • Total System Availability Architecture
  • 37. Why Virtualize Hadoop? (Repeats slide 10: Simple to Operate, Highly Available, Elastic Scaling.)
  • 38. Elastic Scaling and Multi-tenancy of Hadoop on vSphere
    [Diagram: three deployment models built from VMs. Current Hadoop combines storage and compute in one VM; the other models use separate compute VMs (T1, T2) and storage VMs.]
    • 1. Hadoop in VM
        • Single Tenant
        • Fixed Resources
    • 2. Separate Compute and Data
        • Single Tenant
        • Elastic Compute
    • 3. Multiple Clusters
        • Multiple Tenants
        • Elastic Compute
  • 39. Separated Compute and Data
    [Diagram: on one virtualization host, several virtual Hadoop nodes each run a Task Tracker with task slots, next to another workload and a virtual Hadoop node running the Datanode on VMDKs.]
    Truly Elastic Hadoop: Scalable through virtual nodes
  • 40. References
    • www.projectserengeti.org
    • www.hortonworks.com
    • www.cloudera.com
    • Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
    • MapR/Google blog: www.mapr.com/blog/google-mapr
  • 41. Fill out a survey. Every complete survey is entered into a drawing for a $25 VMware company store gift certificate.