Hadoop World 2011: Hadoop as a Service in Cloud
The Hadoop framework was originally designed to run on native environments with commodity hardware. With the growing adoption of cloud computing, however, there is increasing demand to build Hadoop clusters on a public or private cloud so that customers can benefit from virtualization and multi-tenancy. This talk introduces the challenges of providing Hadoop as a service on a virtualization platform, such as performance, rack awareness, job scheduling, and memory overcommitment, and proposes solutions.

    Presentation Transcript

    • Hadoop as a Service. Jun Ping Du, Richard McDougall, VMware, Inc. © 2009 VMware Inc. All rights reserved.
    • Cloud: Big Shifts in Simplification and Optimization. 1. Reduce the complexity, to simplify operations and maintenance. 2. Dramatically lower costs, to redirect investment into value-add opportunities. 3. Enable flexible, agile IT service delivery, to meet and anticipate the needs of the business.
    • Infrastructure, Apps and now Data… Simplify infrastructure with cloud (build, run, manage; private and public); simplify the app platform through PaaS; the next trend: simplify data.
    • Trend 1/3: New Data Growing at 60% Y/Y. Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030. Yes, you are part of the yotta generation. Drivers include audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, and digital movies. Source: The Information Explosion, 2009.
    • Trend 2/3: Big Data, Driven by Real-World Benefit.
    • Trend 3/3: Value from Data Exceeds Hardware Cost. Value from the intelligence of data analytics now outstrips the cost of hardware. Hadoop enables the use of lower-cost hardware, and hardware cost is halving every 18 months: Big Iron at $40k/CPU vs. a commodity cluster at $1k/CPU.
    • Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware. The trend is "not just Hadoop" for big data: Hadoop is often combined with other technologies (Big SQL, NoSQL, etc.), so unify the infrastructure platform for all of them, with Big SQL, NoSQL, Hadoop, and DSS clusters on a unified big-data infrastructure, private or public. A common hardware base eliminates the hardware/driver/testing phase and lets the existing team handle ordering, diagnosis, and capacity management of the hardware farm.
    • Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning. "I want my Hadoop cluster now!" Instant cluster provisioning: provision Hadoop clusters instantly, automatable using provisioning engines/scripts, e.g. Whirr.
    • Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities. Increase utilization: the Hadoop cluster only uses the resources it needs, and extra resources can be used by other applications when not in use. Eliminate single points of failure: use vSphere HA for the NameNode and JobTracker. Use VM isolation: create separate clusters with defensible security, enabling multiple versions of Hadoop on the same infrastructure; this extends to Hadoop and Linux environments. Leverage resource management: control and assign resources through resource pools, e.g. use spare cycles for Hadoop processing through priority control.
    • What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine.
    • Performance Test: Cluster Configuration. AMAX ClusterMax nodes (2X X5650, 96 GB, 12X SATA 500 GB) with Mellanox 10 GbE adapters and a Mellanox 10 GbE switch.
    • Cluster Configuration. Hardware: AMAX ClusterMax, 7 nodes; 2X X5650 2.67 GHz hex-core, 96 GB memory; 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4; Mellanox ConnectX VPI (MT26418), 10 GbE; Mellanox Vantage 6048, 10 GbE. OS/Hypervisor: RHEL 6.1 x86_64 (native and guest); ESX 5.0 RTM with a development Mellanox driver. VMs (HT off/on): 1 VM with 92000 MB, 12/24 vCPUs, 10 PRDM disks; 2 VMs with 46000 MB, 6/12 vCPUs, 5 PRDM disks each; 4 VMs (HT on only): 2 small with 18400 MB, 5 vCPUs, 2 disks, and 2 large with 27600 MB, 7 vCPUs, 3 disks.
    • Hadoop Configuration. Distribution: Cloudera CDH3u0, based on Apache open-source 0.20.2. Parameters: dfs.datanode.max.xcievers=4096; dfs.replication=2; dfs.block.size=134217728; io.file.buffer.size=131072; mapred.child.java.opts="-Xmx2048m -Xmn512m" (native) or "-Xmx1900m -Xmn512m" (virtual). Network topology: Hadoop uses topology info for reliability and performance; with multiple VMs per host, each host is a "rack".
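For reference, the HDFS parameters above would sit in hdfs-site.xml in a stock 0.20.x deployment (io.file.buffer.size belongs in core-site.xml and mapred.child.java.opts in mapred-site.xml); a fragment sketch:

```xml
<!-- hdfs-site.xml (sketch): the HDFS settings listed on the slide -->
<configuration>
  <property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
  <property><name>dfs.replication</name><value>2</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property> <!-- 128 MB -->
</configuration>
```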
    • Benchmarks, derived from test apps included in the distro. Pi: direct-exec Monte Carlo estimation of pi, with # map tasks = # logical processors and 1.68 T samples; the estimate is ~ 4*R/(R+G) (≈ 22/7). TestDFSIO: streaming write and read, 1 TB, more tasks than processors. Terasort: 3 phases (teragen, terasort, teravalidate); 10B or 35B records of 100 bytes each (1 TB, 3.5 TB); more tasks than processors; exercises CPU, networking, and storage I/O.
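The Pi benchmark's estimator, 4*R/(R+G) where R points land inside the quarter circle and G outside, can be sketched single-machine in a few lines. This standalone sketch is illustrative only, not the Hadoop map-reduce implementation:

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo pi: throw points in the unit square, count hits
    inside the quarter circle (R) vs. outside (G); pi ~ 4*R/(R+G)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(1_000_000))  # close to 3.14
```

In the actual benchmark each map task generates its own batch of samples and a single reduce sums the R and G counts.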
    • Performance of Hadoop for Several Workloads. (Chart: ratio of time taken relative to native, lower is better, on a 0 to 1.2 scale, for 1-VM and 2-VM configurations across the benchmark workloads.)
    • Architecting Hadoop as a Service using Virtualization. Goals: make it fast and easy to provision new Hadoop clusters on demand; leverage virtual machines to provide isolation (especially for multi-tenant setups); optimize Hadoop's performance based on virtual topologies; make the system reliable based on virtual topologies. Leveraging virtualization: elastic scale in/out; high availability to protect the NameNode/JobTracker; resource controls and sharing to re-use underutilized memory and CPU; workload prioritization to limit or guarantee resource usage in a mixed environment.
    • Provisioning. Leverage the vSphere APIs to auto-deploy a cluster (Whirr, HOD, or custom tooling using Ruby, Chef, etc.). Use linked clones to rapidly fork many nodes.
    • Fast Provisioning. From a "seed" node to a cluster: thin provisioning plus linked clones take a 60 GB image down to 3.5 GB and clone a node in ~6 seconds.
    • SAN, NAS or Local Disk? Shared storage (SAN or NAS): easy to provision, with automated cluster rebalancing. Hybrid storage: SAN for boot images, VMs, and other workloads; local disk for HDFS; scalable bandwidth and lower cost/GB. (Diagram: Hadoop VMs and other VMs spread across hosts.)
    • Enable Automatic Rack Awareness through vSphere. Rack awareness is important for a robust Hadoop cluster; automatic network topology detection is an important vSphere feature, and the rack script is generated automatically.
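A generated topology script of the kind described might look like the following. Hadoop invokes the script configured via topology.script.file.name with DataNode addresses as arguments and expects one rack path per address; the IP-to-host table here is a hypothetical example of the talk's "each host is a rack" scheme:

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping from DataNode IP to the virtualization host it
# runs on; in the talk's scheme, each physical host is treated as a rack.
HOST_RACK = {
    "192.168.1.11": "/rack-host1",
    "192.168.1.12": "/rack-host1",
    "192.168.1.21": "/rack-host2",
}

def rack_of(ip: str) -> str:
    # Unknown nodes fall back to the default rack, as stock Hadoop does.
    return HOST_RACK.get(ip, "/default-rack")

if __name__ == "__main__":
    # Hadoop passes one or more addresses; answer one rack per address.
    print(" ".join(rack_of(ip) for ip in sys.argv[1:]))
```

Two VMs on the same host (192.168.1.11 and .12 above) resolve to the same rack, so HDFS treats them as co-located when placing replicas.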
    • Multi-tenant: Share the Cluster or Not? A shared big cluster offers high performance, large scale, and pre-job provisioning; isolated small clusters are secure and flexible, with post-job provisioning. In practice, a combination, as customers' requirements differ.
    • Elastic Hadoop Cluster. A traditional Hadoop cluster is easy to scale out (fast-provision new Hadoop nodes and join them into the existing cluster) but hard to scale in: while the cluster is too large, choose a node k, kill it, wait for its data blocks to be recovered, and rebalance if necessary. An elastic Hadoop cluster separates the roles: normal nodes run a TaskTracker and a DataNode, while elastic nodes run only a TaskTracker (with the NameNode and JobTracker alongside).
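The scale-in loop from the slide can be sketched against a toy cluster model; `decommission` here is a stand-in for HDFS's real exclude-and-wait decommissioning flow, and the node list is purely illustrative:

```python
def decommission(node: str) -> None:
    # Stand-in for the real flow: exclude the node from HDFS, wait
    # until its blocks are re-replicated elsewhere, then rebalance
    # if necessary.
    print(f"decommissioning {node}")

def scale_in(nodes: list[str], target_size: int) -> list[str]:
    """Shrink the cluster one node at a time, mirroring the slide's
    while-loop: while the cluster is too large, choose a node k,
    kill it, and wait for its data to recover."""
    nodes = list(nodes)
    while len(nodes) > target_size:
        victim = nodes.pop()    # choose node k
        decommission(victim)    # kill node k; wait for block recovery
    return nodes

print(scale_in(["node1", "node2", "node3", "node4"], 2))
```

The elastic design sidesteps this loop entirely: elastic nodes hold no HDFS blocks, so removing one never triggers re-replication.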
    • Replica Placement. Second replica: on a different rack, so rack awareness is required. Third replica: on the same rack but a different physical host; in a virtualized environment, nodes can share a host.
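With racks mapped to physical hosts as above, the placement rule on the slide can be expressed as a small predicate. This is a sketch of the stated policy, not HDFS's actual BlockPlacementPolicy code:

```python
def placement_ok(replicas) -> bool:
    """replicas: list of (rack, host) pairs for replicas 1..3.
    Rule from the slide: the 2nd replica goes on a different rack than
    the 1st; the 3rd goes on the same rack as the 2nd but a different
    physical host (two VMs sharing a host count as the same host)."""
    (r1, h1), (r2, h2), (r3, h3) = replicas
    return r2 != r1 and r3 == r2 and h3 != h2

# e.g. replica 1 on rackA/host1, replica 2 on rackB/host2,
# replica 3 on rackB/host3 satisfies the rule.
```

Without rack awareness, two VMs on one physical host would look like independent "nodes", and a host failure could take out multiple replicas at once; the predicate above makes that failure mode impossible by construction.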
    • Demo
    • Performance Create more smaller VMs • Makes Hadoop scale better • Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (including through DRS) Sizing/Configuration of storage is critical • Plan on ~50Mbytes/sec of bandwidth per core • SANs are typically configured by default for IOPS, not Bandwidth • Ensure SAN ports/switch topology allows required aggregate bandwidth • Performance of the backend storage should be tested/sized • Local disks will give ~100-140MBytes/sec per disk: pick correct controller25
    • Summary. Hadoop does work well in a virtual environment. Plan a virtual cluster, and enable other big-data solutions on the same infrastructure. Leverage the recipes to automate your configuration and deployment.