APP-CAP2956

Inside the Hadoop
Machine




Jeff Buell, VMware, Inc.

Richard McDougall, VMware, Inc.

Sanjay Radia, Hortonworks




                           #vmworldapps
Disclaimer

!  This session may contain product features that are
    currently under development.

!  This session/overview of the new technology represents
    no commitment from VMware to deliver these features in
    any generally available product.

!  Features are subject to change, and must not be included in
    contracts, purchase orders, or sales agreements of any kind.

!  Technical feasibility and market demand will affect final delivery.
!  Pricing and packaging for any new technologies or features
    discussed or presented have not been determined.




2
Broad Application of Hadoop technology

    Horizontal Use Cases
    •  Log Processing / Click Stream Analytics
    •  Machine Learning / sophisticated data mining
    •  Web crawling / text processing
    •  Extract Transform Load (ETL) replacement
    •  Image / XML message processing
    •  General archiving / compliance

    Vertical Use Cases
    •  Financial Services
    •  Internet Retailer
    •  Pharmaceutical / Drug Discovery
    •  Mobile / Telecom
    •  Scientific Research
    •  Social Media


    Hadoop’s ability to handle large unstructured data affordably and efficiently makes
     it a valuable tool kit for enterprises across a number of applications and fields.

3
How does Hadoop enable parallel processing?

!  A framework for distributed processing of large data sets across
    clusters of computers using a simple programming model.




                                  Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
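
To make the "simple programming model" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the example. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce calls in parallel on many hosts.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because each `map_phase` call sees only its own line and each `reduce_phase` call sees only one key, the framework is free to spread both steps across machines.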


4
Hadoop System Architecture




!  MapReduce: Programming
     framework for highly parallel data
     processing
!  Hadoop Distributed File System
     (HDFS): Distributed data storage




 5
Job Tracker Schedules Tasks Where the Data Resides
[Figure: A job's input file is divided into three 64 MB splits (Split 1, Split 2, Split 3). The Job Tracker hands Task 1, Task 2, and Task 3 to the Task Trackers on Host 1, Host 2, and Host 3. Each host also runs a DataNode that stores the matching 64 MB block of the input file (Block 1, Block 2, Block 3).]
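
The idea of scheduling tasks where the data resides can be sketched as a two-pass assignment: first give each split a slot on a host that already stores its block, then fall back to any host with a free slot. This is a simplified illustration and not the Job Tracker's actual algorithm (which hands out work as Task Trackers report in); `schedule` and its arguments are hypothetical names.

```python
def schedule(splits, free_slots):
    """Assign each split to a host, preferring hosts that hold its block.

    splits: dict mapping split id -> list of hosts storing a replica of its block
    free_slots: dict mapping host -> number of free task slots
    Returns a dict mapping split id -> (host, is_data_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    pending = []

    # Pass 1: data-local placement.
    for split_id, replica_hosts in splits.items():
        local = next((h for h in replica_hosts if slots.get(h, 0) > 0), None)
        if local is None:
            pending.append(split_id)
        else:
            slots[local] -= 1
            assignment[split_id] = (local, True)

    # Pass 2: remote placement for splits that found no local slot.
    for split_id in pending:
        remote = next((h for h, free in slots.items() if free > 0), None)
        if remote is None:
            break  # cluster is full; remaining splits wait for a free slot
        slots[remote] -= 1
        assignment[split_id] = (remote, False)

    return assignment


if __name__ == "__main__":
    blocks = {"split1": ["host1"], "split2": ["host2"], "split3": ["host3"]}
    print(schedule(blocks, {"host1": 1, "host2": 1, "host3": 1}))
```

With one block per host and one free slot per host, every task runs next to its data and no input block crosses the network.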




6
Hadoop Distributed File System




7
Hadoop Data Locality and Replication




8
Hadoop Topology Awareness




9
Why Virtualize Hadoop?



     Simple to Operate
!  Rapid deployment
!  Unified operations across enterprise
!  Easy Clone of Cluster

     Highly Available
!  No more single point of failure
!  One click to setup
!  High availability for MR Jobs

     Elastic Scaling
!  Shrink and expand cluster on demand
!  Resource Guarantee
!  Independent scaling of Compute and data




10
Enterprise Challenges with Using Hadoop

!  Deployment
  •  Slow to provision
  •  Complex to keep running/tune
!  Single Points of Failure
  •  Name Node and Job Tracker are single points of failure
  •  No HA for Hadoop framework components (Hive, HCatalog, etc.)
!  Low Utilization
  •  Dedicated Hadoop clusters run at low CPU utilization
  •  No easy way to share resources between Hadoop and non-Hadoop workloads
  •  Noisy neighbors, lack of resource containment
!  Need Multi-tenant Isolation, Resource Management, etc.
  •  Noisy neighbors: no performance or security isolation between different tenants/users
  •  Lack of configuration isolation: can't run multiple versions on the cluster




11
Virtualization enables a Common Infrastructure for Big Data


Cluster Sprawling
Single-purpose clusters (Hadoop, HBase, MPP DB) for various business applications lead to cluster sprawl.

Cluster Consolidation
Hadoop, HBase, and MPP DB run side by side on one virtualization platform.

!  Simplify
  •  Single Hardware Infrastructure
  •  Unified operations
!  Optimize
  •  Shared Resources = higher utilization
  •  Elastic resources = faster on-demand access
 12
Deploy a Hadoop Cluster in under 30 Minutes

Step 1: Deploy Serengeti virtual appliance on vSphere.
  •  Deploy vHelperOVF to vSphere

Step 2: A few simple commands to stand up Hadoop Cluster.
  •  Select compute, memory, storage and network
  •  Select configuration template
  •  Automate deployment
  •  Done


  13
A Tour Through Serengeti



$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>




14
A Tour Through Serengeti



serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hadoop_NameNode, hadoop_jobtracker] 1            2   7500     LOCAL   50

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hive, hadoop_client, pig]             1          1   3700     LOCAL   50

     NAME                HOST                              IP
     -----------------------------------------------------------------
     myElephant-client0 rmc-elephant-009.eng.vmware.com    10.0.20.184




15
A Tour Through Serengeti



$ ssh rmc@rmc-elephant-009.eng.vmware.com

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data

…




16
Serengeti Spec File
Callouts: "distro" is the choice of distro, "ha" is the HA option, and "storage" is the choice of shared storage or local disk.

{
  "distro": "apache",
  "nodeGroups": [
    {
      "name": "master",
      "roles": [
        "hadoop_NameNode",
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "instanceType": "MEDIUM",
      "ha": true
    },
    {
      "name": "worker",
      "roles": [
        "hadoop_datanode", "hadoop_tasktracker"
      ],
      "instanceNum": 5,
      "instanceType": "SMALL",
      "storage": {
        "type": "LOCAL",
        "sizeGB": 10
      }
    }
  ]
}
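
A spec like this is easy to sanity-check before handing it to the tool. The sketch below parses a spec and reports how many VMs each node group asks for. It is an illustration only: the `summarize` helper is hypothetical, the top-level "nodeGroups" key is an assumption about the file layout, and the embedded spec simply mirrors the values on the slide.

```python
import json

# Spec text mirroring the slide; the "nodeGroups" wrapper key is an assumption.
SPEC_TEXT = """
{
  "distro": "apache",
  "nodeGroups": [
    {"name": "master",
     "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
     "instanceNum": 1, "instanceType": "MEDIUM", "ha": true},
    {"name": "worker",
     "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instanceNum": 5, "instanceType": "SMALL",
     "storage": {"type": "LOCAL", "sizeGB": 10}}
  ]
}
"""

REQUIRED_KEYS = ("name", "roles", "instanceNum")


def summarize(spec_text):
    """Return {group name: instance count}, raising if a required key is missing."""
    spec = json.loads(spec_text)
    summary = {}
    for group in spec["nodeGroups"]:
        for key in REQUIRED_KEYS:
            if key not in group:
                raise ValueError(f"node group is missing required key: {key}")
        summary[group["name"]] = group["instanceNum"]
    return summary


if __name__ == "__main__":
    counts = summarize(SPEC_TEXT)
    print(counts, "total VMs:", sum(counts.values()))
```

Parsing with a strict JSON parser also catches curly quotes and trailing commas, which are easy to introduce when copying a spec from a slide.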

17
Configuring Distros


{
         "name" : "cdh",
         "version" : "3u3",
         "packages" : [
           {
              "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                         "hadoop_tasktracker", "hadoop_datanode",
                         "hadoop_client"],
              "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
           },
           {
              "roles" : ["hive"],
              "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
           },
           {
              "roles" : ["pig"],
              "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
           }
         ]
    },




18
Open Source of Serengeti, Spring Hadoop, Hadoop Extensions


          Commercial Vendors            Community Projects




•  Support major distribution and multiple projects
•  Contribute Hadoop Virtualization Extension (HVE) to Open
   Source Community



19
Use Local Disk where it’s Needed




                        SAN Storage       NAS Filers        Local Storage

  Cost per gigabyte     $2 - $10          $1 - $5           $0.05

  $1M gets:
    Capacity            0.5 petabytes     1 petabyte        10 petabytes
    IOPS                200,000           200,000           400,000
    Bandwidth           8 GB/s            10 GB/s           250 GB/s

20
Extend Virtual Storage Architecture to Include Local Disk

 !  Shared Storage: SAN or NAS
   •  Easy to provision
   •  Automated cluster rebalancing
 !  Hybrid Storage
   •  SAN for boot images, VMs, other workloads
   •  Local disk for Hadoop & HDFS
   •  Scalable Bandwidth, Lower Cost/GB

[Figure: six hosts, each running Hadoop VMs alongside other VMs.]




     21
Hadoop has Significant Ephemeral Data


[Figure: Data flow of a MapReduce job. Map tasks read input data from DFS (12% of bandwidth) and write spills and logs (spill*.out) and map output (file.out). The map output is shuffled (Map_*.out), combined (Intermediate.out), and sorted by the reduce tasks, which write the output data to DFS (12% of bandwidth). The spill, shuffle, and combine stages are labeled as 75% of disk bandwidth.]




22
Virtualized Hadoop Performance

!  Issues of interest
  •  Native vs various virtual configurations
  •  Local disks vs Fibre Channel SAN
  •  Effect of protecting Hadoop master daemons with Fault Tolerance
  •  Public cloud (renting) vs private cloud (buying)

!  Test cluster hardware
  •  Arista 7124SX 10 GbE switch
  •  24x HP DL380 G7, each with 2x X5687, 72 GB, 16x SAS 146 GB, Broadcom 10 GbE adapter, Qlogic 8 Gb/s HBA
  •  EMC VNX7500


23
Configuration

!  Software
 •  vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
 •  RHEL 6.1 x86_64
 •  Cloudera CDH3u4
 •  Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
!  Hadoop VMs
 •  Processors (16 logical threads), memory (72 GB), disks (12) partitioned among
     1, 2, or 4 VMs per host
 •  Separate VMs for NameNode and JobTracker for storage and FT tests
!  Hadoop configuration
 •  One map and one reduce task per vCPU (= logical thread)
     •  Machines are highly loaded
 •  256 MB block size
 •  FT tests: 8 – 256 MB block sizes to vary load on NN and JT
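
The block size and slot settings above determine how many map tasks a job has and how many rounds ("waves") the cluster needs to run them. The sketch below works through that arithmetic under several assumptions that the slides do not state: 1 TB means 10^12 bytes, "MB" means 2^20 bytes, one map task is created per HDFS block, and all 24 hosts run tasks. The function names are hypothetical.

```python
DATASET_BYTES = 10**12      # assumption: 1 TB = 10^12 bytes
MIB = 2**20                 # assumption: "MB" in the configuration means 2^20 bytes
HOSTS = 24                  # hosts in the test cluster
THREADS_PER_HOST = 16       # logical threads per host, one map slot each


def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def map_task_count(dataset_bytes, block_mb):
    """Number of map tasks, assuming one task per HDFS block."""
    return ceil_div(dataset_bytes, block_mb * MIB)


def map_waves(tasks, slots):
    """Rounds needed to run all map tasks with the given number of slots."""
    return ceil_div(tasks, slots)


if __name__ == "__main__":
    slots = HOSTS * THREADS_PER_HOST
    for block_mb in (256, 8):
        tasks = map_task_count(DATASET_BYTES, block_mb)
        print(f"{block_mb} MB blocks: {tasks} map tasks, "
              f"{map_waves(tasks, slots)} waves on {slots} slots")
```

Under these assumptions, shrinking the block size from 256 MB to 8 MB multiplies the number of blocks and tasks by about 32, which is why small block sizes are a convenient way to raise the load on the NameNode and JobTracker.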

24
Native versus Virtual Platforms, 24 hosts, 12 disks/host
[Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs; y-axis from 0 to 450.]

25
Local vs Various SAN Storage Configurations
16 x HP DL380G7, EMC VNX 7500, 96 physical disks

[Chart: elapsed time as a ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5; y-axis from 0 to 4.5.]

26
Performance Effect of FT for Master Daemons

!  NameNode and JobTracker placed in separate uniprocessor (UP) VMs
!  Small overhead: Enabling FT causes 2-4% slowdown for TeraSort
!  The 8 MB block size case places a similar load on the NN and JT as a cluster of more than 200 hosts using 256 MB blocks

[Chart: TeraSort elapsed time as a ratio to FT off, for HDFS block sizes of 256, 64, 16, and 8 MB; y-axis from 1 to 1.04.]


27
Different Clouds for Different Folks

!  Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
!  Google/MapR: SaaS on Google Compute Engine
!  vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host,
  CDH3u4
!  Vastly different cluster sizes
  •  Compare throughput (MB sorted per second) normalized with resources
!  Cost: rental or estimate of running continuously for 3 years

                  Cores    Disks    TeraSort (s)   MB/s/core   MB/s/disk   Cost
 Yahoo!           11680    5840     62             1.3         2.6         ~$7
 Google/MapR      5024     1256     80             2.4         9.5         $16
 vSphere 5.1      192      192      442            11.2        11.2        ~$2
 vSphere 5.1      192      288      359            13.8        9.2         ~$2
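
The normalized throughput columns can be recomputed from the core counts, disk counts, and sort times. The sketch below assumes that each run sorted 10^12 bytes and that "MB" means 2^20 bytes; neither is stated on the slide, but with those assumptions the results match the table to one decimal place. The names `RUNS` and `normalized` are hypothetical.

```python
SORTED_BYTES = 10**12   # assumption: 1 TB TeraSort input = 10^12 bytes
MIB = 2**20             # assumption: "MB" in the table means 2^20 bytes

# (label, cores, disks, TeraSort elapsed seconds), taken from the table
RUNS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1, 8 disks/host", 192, 192, 442),
    ("vSphere 5.1, 12 disks/host", 192, 288, 359),
]


def normalized(cores, disks, seconds):
    """Return (MB/s per core, MB/s per disk), each rounded to one decimal."""
    mb_per_second = SORTED_BYTES / MIB / seconds
    return round(mb_per_second / cores, 1), round(mb_per_second / disks, 1)


if __name__ == "__main__":
    for label, cores, disks, seconds in RUNS:
        per_core, per_disk = normalized(cores, disks, seconds)
        print(f"{label:28s} {per_core:5.1f} MB/s/core {per_disk:5.1f} MB/s/disk")
```

Normalizing this way is what lets a 24-host cluster be compared with clusters of more than a thousand hosts: the larger systems finish sooner in absolute terms but sort fewer megabytes per second for each core and disk they use.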

28
VMware-Hortonworks Joint Engineering

!  Hortonworks goal
 •  Expand Hadoop ecosystem
 •  Provide first class support of various platforms
 •  Hadoop should run well on VMs
 •  VMs offer several advantages as presented earlier
 •  Take advantage of vSphere for HA
!  First class support for VMs
 •  Topology plugins (HADOOP-8468)
    •  Two VMs can be on the same host, so Hadoop should:
      •  Pick closer data
      •  Schedule tasks closer
      •  Not put two replicas on the same host
 •  MR-tmp on HDFS using block pools
    •  Elastic Compute-VMs will not need local disk
 •  Fast communications within VMs
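
The "don't put two replicas on the same host" rule can be illustrated with a small placement routine that knows which physical host each VM runs on. This is a sketch of the idea only and not the HADOOP-8468 implementation; real HDFS placement also weighs racks and node load. `place_replicas` and `vm_to_host` are hypothetical names.

```python
def place_replicas(candidate_vms, vm_to_host, replication=3):
    """Choose VMs for a block's replicas so that no two share a physical host.

    candidate_vms: VMs in order of preference (for example, closest first)
    vm_to_host: dict mapping VM name -> physical host name
    """
    chosen = []
    used_hosts = set()
    for vm in candidate_vms:
        host = vm_to_host[vm]
        if host in used_hosts:
            continue  # a replica already lives on this physical host
        chosen.append(vm)
        used_hosts.add(host)
        if len(chosen) == replication:
            return chosen
    raise ValueError("not enough distinct physical hosts for the requested replication")


if __name__ == "__main__":
    topology = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB", "vm4": "hostC"}
    print(place_replicas(["vm1", "vm2", "vm3", "vm4"], topology))
```

Without the host mapping, vm1 and vm2 look like two independent nodes, and a single hardware failure on hostA could take out two of the three replicas at once.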
30
Hadoop Full-Stack High Availability


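[Figure: Jobs run on the slave nodes of the Hadoop cluster, and other apps run outside the cluster. The master daemons run in an HA cluster made up of an NN server, a JT server, and a further NN server that acts as the failover target (N+K failover). During NN failover the JT goes into safemode.]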

31
HA is in HDP 1.0
     Using Total System Availability Architecture




32
HA in Hadoop 1 with HDP1

!  Full Stack High Availability
  •  Namenode
     •  Clients pause automatically
     •  JobTracker pauses automatically
  •  Other Hadoop master services (JT, …) coming


!  Use industry proven HA framework
  •  VMware vSphere HA
     •  Failover, fencing, …
     •  Corner cases are tricky; if not addressed, they lead to corruption
  •  Additional benefits:
     •  N-N & N+K failover
     •  Migration for maintenance




33
Hadoop NN/JT HA with vSphere




34
Namenode Failover Times

!  60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
 •  Failure detection and failover: 0.5 to 2 minutes
 •  Namenode startup (exit safemode): 30 sec
!  180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
 •  Failure detection and failover: 0.5 to 2 minutes
 •  Namenode startup (exit safemode): 110 sec


 For vSphere, an OS bootup of 10-20 seconds is needed and is included above.


 Cold failover is good enough for small/medium clusters.
     Failure detection and automatic failover dominate the total time.




35
Summary

!  Advantages of Hadoop on VMs
 •  Cluster Management
 •  Cluster consolidation
 •  Greater Elasticity in mixed environment
 •  Multi-tenancy as an alternative to the capacity scheduler’s offerings
!  HA for Hadoop Master Daemons
 •  vSphere based HA for NN, JT, … in Hadoop 1
 •  Total System Availability Architecture




36
Elastic Scaling and Multi-tenancy of Hadoop on vSphere



1. Hadoop in VM (current Hadoop: combined storage/compute in one VM)
  •  Single tenant
  •  Fixed resources

2. Separate compute and data (a compute VM and a storage VM)
  •  Single tenant
  •  Elastic compute

3. Multiple clusters (tenant compute VMs T1 and T2 with a storage VM)
  •  Multiple tenants
  •  Elastic compute

     38
Separated Compute and Data

[Figure: A virtualization host runs other workloads alongside virtual Hadoop nodes. Several compute nodes each run a Task Tracker with its task slots, a separate virtual Hadoop node runs the Datanode, and the nodes are backed by VMDKs.]


Truly Elastic Hadoop:
Scalable through virtual nodes


  39
References

www.projectserengeti.org
www.hortonworks.com
www.cloudera.com


Fault Tolerance performance whitepaper:
www.vmware.com/resources/techresources/10301


MapR/Google blog: www.mapr.com/blog/google-mapr




40
FILL OUT
A SURVEY

EVERY COMPLETE SURVEY
        IS ENTERED INTO
         DRAWING FOR A
   $25 VMWARE COMPANY
 STORE GIFT CERTIFICATE
APP-CAP2956

Inside the Hadoop
Machine




Jeff Buell, VMware, Inc.

Richard McDougall, VMware, Inc.

Sanjay Radia, Hortonworks




                           #vmworldapps

More Related Content

What's hot

Cloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentCloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentGlusterFS
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGlusterFS
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHortonworks
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastuctureDataWorks Summit
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisScaleOut Software
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Open solaris customer presentation
Open solaris customer presentationOpen solaris customer presentation
Open solaris customer presentationxKinAnx
 
Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineDataWorks Summit
 
Postgres Plus Cloud Database
Postgres Plus Cloud DatabasePostgres Plus Cloud Database
Postgres Plus Cloud DatabaseGary Carter
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Steve Staso
 
Is Private Cloud Right for Your Organization
Is Private Cloud Right for Your OrganizationIs Private Cloud Right for Your Organization
Is Private Cloud Right for Your OrganizationDave Roberts
 
Hana Memory Scale out using the hecatonchire Project
Hana Memory Scale out using the hecatonchire ProjectHana Memory Scale out using the hecatonchire Project
Hana Memory Scale out using the hecatonchire ProjectBenoit Hudzia
 
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarWicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarCeph Community
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Open Stack
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 

What's hot (20)

Cloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentCloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and Deployment
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFS
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastucture
 
Google Compute and MapR
Google Compute and MapRGoogle Compute and MapR
Google Compute and MapR
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
 
cosbench-openstack.pdf
cosbench-openstack.pdfcosbench-openstack.pdf
cosbench-openstack.pdf
 
Cosbench apac
Cosbench apacCosbench apac
Cosbench apac
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Open solaris customer presentation
Open solaris customer presentationOpen solaris customer presentation
Open solaris customer presentation
 
Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmine
 
Postgres Plus Cloud Database
Postgres Plus Cloud DatabasePostgres Plus Cloud Database
Postgres Plus Cloud Database
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09
 
Is Private Cloud Right for Your Organization
Is Private Cloud Right for Your OrganizationIs Private Cloud Right for Your Organization
Is Private Cloud Right for Your Organization
 
Hana Memory Scale out using the hecatonchire Project
Hana Memory Scale out using the hecatonchire ProjectHana Memory Scale out using the hecatonchire Project
Hana Memory Scale out using the hecatonchire Project
 
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarWicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
 
Management server internals
Management server internalsManagement server internals
Management server internals
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 

Inside the Hadoop Machine @ VMworld

  • 1. APP-CAP2956: Inside the Hadoop Machine. Jeff Buell, VMware, Inc.; Richard McDougall, VMware, Inc.; Sanjay Radia, Hortonworks. #vmworldapps
  • 2. Disclaimer
      • This session may contain product features that are currently under development.
      • This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
      • Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
      • Technical feasibility and market demand will affect final delivery.
      • Pricing and packaging for any new technologies or features discussed or presented have not been determined.
  • 3. Broad Application of Hadoop technology
      • Horizontal Use Cases: Log Processing / Click Stream Analytics; Machine Learning / sophisticated data mining; Web crawling / text processing; Extract Transform Load (ETL) replacement; Image / XML message processing; General archiving / compliance
      • Vertical Use Cases: Financial Services; Internet Retailer; Pharmaceutical / Drug Discovery; Mobile / Telecom; Scientific Research; Social Media
      Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
  • 4. How does Hadoop enable parallel processing?
      • A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
      Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
  • 5. Hadoop System Architecture
      • MapReduce: Programming framework for highly parallel data processing
      • Hadoop Distributed File System (HDFS): Distributed data storage
  • 6. Job Tracker Schedules Tasks Where the Data Resides
      [Diagram: the Job Tracker takes a job whose input file is divided into Split 1, Split 2 and Split 3 (64 MB each). Host 1, Host 2 and Host 3 each run a Task Tracker (Task 1, Task 2, Task 3) alongside a DataNode that stores one block of the input file: Block 1, Block 2 and Block 3 (64 MB each).]
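The locality idea on this slide can be sketched in a few lines of Python. This is an illustrative model only, not the JobTracker's actual algorithm (which is driven by Task Tracker heartbeats); the names `Split`, `make_splits` and `schedule` are hypothetical, and the 64 MB split size is taken from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List

SPLIT_SIZE_MB = 64  # split and block size shown on the slide


@dataclass
class Split:
    index: int        # 1-based split number
    size_mb: int      # size of this split in MB
    hosts: List[str]  # hosts whose DataNode stores the matching block


def make_splits(file_size_mb: int, block_hosts: List[List[str]]) -> List[Split]:
    """Cut an input file into 64 MB splits, one per HDFS block."""
    count = -(-file_size_mb // SPLIT_SIZE_MB)  # ceiling division
    if len(block_hosts) != count:
        raise ValueError("need one host list per block")
    splits = []
    for i in range(count):
        size = min(SPLIT_SIZE_MB, file_size_mb - i * SPLIT_SIZE_MB)
        splits.append(Split(i + 1, size, block_hosts[i]))
    return splits


def schedule(splits: List[Split], free_slots: Dict[str, int]) -> Dict[int, str]:
    """Assign each split to a host with a free task slot, data-local hosts first."""
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for split in splits:
        local = [h for h in split.hosts if slots.get(h, 0) > 0]
        remote = [h for h, n in slots.items() if n > 0]
        if not local and not remote:
            raise RuntimeError("no free task slots")
        # Fall back to any host with a free slot only when no local host has one.
        host = local[0] if local else remote[0]
        slots[host] -= 1
        assignment[split.index] = host
    return assignment
```

With a 192 MB file whose three blocks live on host1, host2 and host3, `schedule(make_splits(192, [["host1"], ["host2"], ["host3"]]), {"host1": 1, "host2": 1, "host3": 1})` places each task on the host that stores its block, which is the situation the diagram depicts.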
  • 8. Hadoop Data Locality and Replication
  • 10. Why Virtualize Hadoop?
      • Simple to Operate
          • Rapid deployment
          • Unified operations across enterprise
          • Easy Clone of Cluster
      • Highly Available
          • No more single point of failure
          • One click to setup
          • High availability for MR Jobs
      • Elastic Scaling
          • Shrink and expand cluster on demand
          • Resource Guarantee
          • Independent scaling of Compute and data
  • 11. Enterprise Challenges with Using Hadoop
      • Deployment
          • Slow to provision
          • Complex to keep running/tune
      • Single Points of Failure
          • Single point of failure with NameNode and JobTracker
          • No HA for Hadoop Framework Components (Hive, HCatalog, etc.)
      • Low Utilization
          • Dedicated clusters to run Hadoop with low CPU utilization
          • No easy way to share resources between Hadoop and non-Hadoop workloads
          • Noisy neighbor, lack of resource containment
      • Need Multi-tenant Isolation, Resource Management, etc.
          • Noisy neighbor: no performance or security isolation between different tenants/users
          • Lack of configuration isolation: can't run multiple versions on the cluster
  • 12. Virtualization enables a Common Infrastructure for Big Data
      [Diagram: separate MPP DB, HBase and Hadoop clusters consolidated onto a shared virtualization platform]
      • Cluster Sprawling: Single purpose clusters for various business applications lead to cluster sprawl.
      • Cluster Consolidation
          • Simplify: Single Hardware Infrastructure; Unified operations
          • Optimize: Shared Resources = higher utilization; Elastic resources = faster on-demand access
  • 13. Deploy a Hadoop Cluster in under 30 Minutes
      Step 1: Deploy Serengeti virtual appliance on vSphere (deploy vHelperOVF to vSphere).
      Step 2: A few simple commands to stand up Hadoop Cluster: select compute, memory, storage and network; select configuration template; automate deployment; done.
  • 14. A Tour Through Serengeti
      $ ssh serengeti@serengeti-vm
      $ serengeti
      serengeti>
  • 15. A Tour Through Serengeti
      serengeti> cluster create --name myElephant
      serengeti> cluster list --name myElephant
      name: myElephant, distro: cdh, status:RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hadoop_NameNode, hadoop_jobtracker]  1         2    7500     LOCAL 50

      name: myElephant, distro: cdh, status:RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hive, hadoop_client, pig]            1         1    3700     LOCAL 50

      NAME                HOST                             IP
      -----------------------------------------------------------------
      myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184
  • 16. A Tour Through Serengeti
      $ ssh rmc@rmc-elephant-009.eng.vmware.com
      $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
      …
  • 17. Serengeti Spec File
      [
        "distro":"apache",
        {
          "name": "master",
          "roles": [
            "hadoop_NameNode",
            "hadoop_jobtracker"
          ],
          "instanceNum": 1,
          "instanceType": "MEDIUM",
          "ha":true,
        },
        {
          "name": "worker",
          "roles": [
            "hadoop_datanode",
            "hadoop_tasktracker"
          ],
          "instanceNum": 5,
          "instanceType": "SMALL",
          "storage": {
            "type": "LOCAL",
            "sizeGB": 10
          }
        },
      ]
      Slide callouts: "distro" is the choice of distro; "ha" is the HA option; "storage" is the choice of shared storage or local disk.
  • 18. Configuring Distro’s
      {
        "name" : "cdh",
        "version" : "3u3",
        "packages" : [
          {
            "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
            "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
          },
          {
            "roles" : ["hive"],
            "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
          },
          {
            "roles" : ["pig"],
            "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
          }
        ]
      },
  • 19. Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
      [Logo groups: Commercial Vendors; Community Projects]
      • Support major distributions and multiple projects
      • Contribute Hadoop Virtualization Extension (HVE) to Open Source Community
  • 20. Use Local Disk where it’s Needed
                          SAN Storage    NAS Filers     Local Storage
      Cost per gigabyte   $2 - $10       $1 - $5        $0.05
      $1M gets:
        Capacity          0.5 petabytes  1 petabyte     10 petabytes
        IOPS              200,000        200,000        400,000
        Bandwidth         8 Gbyte/sec    10 Gbyte/sec   250 Gbyte/sec
  • 21. Extend Virtual Storage Architecture to Include Local Disk
      • Shared Storage: SAN or NAS
          • Easy to provision
          • Automated cluster rebalancing
      • Hybrid Storage
          • SAN for boot images, VMs, other workloads
          • Local disk for Hadoop & HDFS
          • Scalable Bandwidth, Lower Cost/GB
      [Diagram: two groups of hosts, each host running Hadoop VMs alongside other VMs]
  • 22. Hadoop has Significant Ephemeral Data
      [Diagram: data flow of a job. Map tasks read DFS input data and write spills (spill*.out) and map output (Map_*.out); sort, shuffle and combine produce Intermediate.out; reduce writes the job output (file.out) as DFS output data. Labels: spills and logs, 75% of disk bandwidth; HDFS input data and output data, 12% of bandwidth each.]
  • 23. Virtualized Hadoop Performance
      • Issues of interest
          • Native vs various virtual configurations
          • Local disks vs Fibre Channel SAN
          • Effect of protecting Hadoop master daemons with Fault Tolerance
          • Public cloud (renting) vs private cloud (buying)
      [Test bed diagram: Arista 7124SX 10 GbE switch; 24x HP DL380 G7, each with 2x X5687, 72 GB, 16x SAS 146 GB, Broadcom 10 GbE adapter and Qlogic 8 Gb/s HBA; EMC VNX7500]
  • 24. Configuration
      • Software
          • vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
          • RHEL 6.1 x86_64
          • Cloudera CDH3u4
          • Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
      • Hadoop VMs
          • Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host
          • Separate VMs for NameNode and JobTracker for storage and FT tests
      • Hadoop configuration
          • One map and one reduce task per vCPU (= logical thread)
          • Machines are highly loaded
          • 256 MB block size
          • FT tests: 8 to 256 MB block sizes to vary load on NN and JT
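A rough back-of-the-envelope sketch of what this configuration implies, in Python. It assumes that 1 TB means 10^12 bytes, that a "256 MB" block is 256 × 2^20 bytes, and that one map task runs per HDFS block of input (per-file boundaries would add a few more); none of these are stated on the slide, and the function names are hypothetical.

```python
HOSTS = 24                     # hosts in the native/virtual tests
LOGICAL_THREADS_PER_HOST = 16  # from the slide
DATASET_BYTES = 10**12         # assumption: 1 TB = 10^12 bytes
MB = 2**20                     # assumption: block sizes use binary megabytes


def task_slots(hosts: int = HOSTS, threads: int = LOGICAL_THREADS_PER_HOST) -> int:
    """Map slots (and, equally, reduce slots) at one task per logical thread."""
    return hosts * threads


def block_count(block_size_mb: int, dataset_bytes: int = DATASET_BYTES) -> int:
    """Number of HDFS blocks needed to hold the dataset (ceiling division)."""
    return -(-dataset_bytes // (block_size_mb * MB))


def map_waves(block_size_mb: int) -> int:
    """Rounds of map tasks needed if every slot runs one task per round."""
    return -(-block_count(block_size_mb) // task_slots())
```

Under these assumptions the cluster offers 384 map slots, the 256 MB block size yields 3,726 blocks (about 10 waves of map tasks), and dropping to 8 MB blocks multiplies the block count by 32 to 119,210, which is why the small block sizes are used to load the NameNode and JobTracker in the FT tests.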
  • 25. Native versus Virtual Platforms, 24 hosts, 12 disks/host
      [Bar chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs]
  • 26. Local vs Various SAN Storage Configurations
      16 x HP DL380G7, EMC VNX 7500, 96 physical disks
      [Bar chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5]
  • 27. Performance Effect of FT for Master Daemons
      • NameNode and JobTracker placed in separate UP VMs
      • Small overhead: Enabling FT causes 2-4% slowdown for TeraSort
      • 8 MB case places similar load on NN & JT as >200 hosts with 256 MB
      [Chart: TeraSort elapsed time ratio to FT off (axis from 1 to 1.04) versus HDFS block size in MB (256, 64, 16, 8)]
  • 28. Different Clouds for Different Folks
      • Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
      • Google/MapR: SaaS on Google Compute Engine
      • vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4
      • Vastly different cluster sizes
          • Compare throughput (MB sorted per second) normalized with resources
      • Cost: rental or estimate of running continuously for 3 years

                     #cores  #disks  TeraSort, s  MB/s/core  MB/s/disk  cost
      Yahoo!         11680   5840    62           1.3        2.6        ~$7
      Google/MapR    5024    1256    80           2.4        9.5        $16
      vSphere 5.1    192     192     442          11.2       11.2       ~$2
      vSphere 5.1    192     288     359          13.8       9.2        ~$2
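The normalized throughput columns can be reproduced with a short Python calculation. The sketch assumes every run sorted 1 TB = 10^12 bytes and that "MB" means 2^20 bytes; the slide does not state either convention, but together they match every value in the table. The dictionary keys and the function name are illustrative.

```python
MB = 2**20              # assumption: MB in the table is 2^20 bytes
SORTED_BYTES = 10**12   # assumption: each run sorted 1 TB = 10^12 bytes

# (cores, disks, TeraSort elapsed seconds), copied from the table
RUNS = {
    "Yahoo!": (11680, 5840, 62),
    "Google/MapR": (5024, 1256, 80),
    "vSphere 5.1, 8 disks/host": (192, 192, 442),
    "vSphere 5.1, 12 disks/host": (192, 288, 359),
}


def normalized_throughput(cores: int, disks: int, seconds: float,
                          sorted_bytes: int = SORTED_BYTES):
    """Return (MB/s per core, MB/s per disk), rounded to one decimal place."""
    mb_per_second = sorted_bytes / MB / seconds
    return round(mb_per_second / cores, 1), round(mb_per_second / disks, 1)


for name, (cores, disks, seconds) in RUNS.items():
    per_core, per_disk = normalized_throughput(cores, disks, seconds)
    print(f"{name}: {per_core} MB/s/core, {per_disk} MB/s/disk")
```

Dividing by cores and by disks is what makes a 24-host cluster comparable with clusters of over a thousand hosts: the small cluster has a much longer elapsed time but far higher throughput per unit of hardware.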
  • 29. Why Virtualize Hadoop? (repeats slide 10)
  • 30. VMware-Hortonworks Joint Engineering
      • Hortonworks goal
          • Expand Hadoop ecosystem
          • Provide first class support of various platforms
          • Hadoop should run well on VMs
          • VMs offer several advantages as presented earlier
          • Take advantage of vSphere for HA
      • First class support for VMs
          • Topology plugins (HADOOP-8468)
          • 2 VMs can be on same host
          • Pick closer data
          • Schedule tasks closer
          • Don’t put two replicas on same host
          • MR-tmp on HDFS using block pools
          • Elastic Compute-VMs will not need local disk
          • Fast communications within VMs
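The topology points above (two VMs may share a physical host, prefer closer data, never put two replicas on the same host) can be illustrated with a small Python sketch. It is not the HADOOP-8468 implementation: the `Node` fields, the distance weights 0/2/4/6 and the greedy placement order are assumptions chosen only to show the idea of adding a physical-host layer between rack and node.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    rack: str
    host: str  # physical host; several VMs may share one
    vm: str    # the virtual machine running the DataNode or Task Tracker


def distance(a: Node, b: Node) -> int:
    """Illustrative distance: same VM < same host < same rack < other rack."""
    if a == b:
        return 0
    if (a.rack, a.host) == (b.rack, b.host):
        return 2
    if a.rack == b.rack:
        return 4
    return 6


def closest_replica(reader: Node, replicas: List[Node]) -> Node:
    """'Pick closer data': read from the replica nearest to the reader."""
    return min(replicas, key=lambda replica: distance(reader, replica))


def place_replicas(writer: Node, candidates: List[Node], count: int = 3) -> List[Node]:
    """Greedy placement that never puts two replicas on one physical host."""
    chosen = [writer]
    used_hosts = {(writer.rack, writer.host)}
    for node in candidates:
        if len(chosen) == count:
            break
        key = (node.rack, node.host)
        if key in used_hosts:
            continue  # a replica already lives on this physical host
        chosen.append(node)
        used_hosts.add(key)
    if len(chosen) < count:
        raise ValueError("not enough distinct physical hosts")
    return chosen
```

Without the host layer, two VMs on the same physical machine look like independent nodes, so a host failure could take out two replicas at once; keying placement on `(rack, host)` rather than on the VM avoids that.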
  • 31. Hadoop Full-Stack High Availability
      [Diagram labels: Slave Nodes of Hadoop Cluster (running jobs); Apps Running Outside; Failover; JT into Safemode; NN and JT on servers in an HA Cluster for Master Daemons, with N+K failover]
  • 32. HA is in HDP 1.0 Using Total System Availability Architecture
  • 33. HA in Hadoop 1 with HDP1
      • Full Stack High Availability
          • Namenode
          • Clients pause automatically
          • JobTracker pauses automatically
          • Other Hadoop master services (JT, …) coming
      • Use industry proven HA framework
          • VMware vSphere HA
          • Failover, fencing, …
          • Corner cases are tricky: if not addressed, corruption
          • Additional benefits:
          • N-N & N+K failover
          • Migration for maintenance
  • 34. Hadoop NN/JT HA with vSphere
  • 35. Namenode Failover Times
      • 60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1 to 3.5 minutes
          • Failure detection and failover: 0.5 to 2 minutes
          • Namenode startup (exit safemode): 30 sec
      • 180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2 to 4.5 minutes
          • Failure detection and failover: 0.5 to 2 minutes
          • Namenode startup (exit safemode): 110 sec
      For vSphere, an OS bootup is needed; its 10-20 seconds are included above.
      Cold failover is good enough for small/medium clusters.
      Failure detection and automatic failover dominate.
  • 36. Summary
      • Advantages of Hadoop on VMs
          • Cluster Management
          • Cluster consolidation
          • Greater Elasticity in mixed environment
          • Alternate multi-tenancy to capacity scheduler’s offerings
      • HA for Hadoop Master Daemons
          • vSphere based HA for NN, JT, … in Hadoop 1
          • Total System Availability Architecture
  • 37. Why Virtualize Hadoop? (repeats slide 10)
  • 38. Elastic Scaling and Multi-tenancy of Hadoop on vSphere
      1. Hadoop in VM (current Hadoop: combined storage/compute): single tenant, fixed resources
      2. Separate Compute and Data: single tenant, elastic compute
      3. Multiple Clusters (tenants T1 and T2 with separate compute VMs): multiple tenants, elastic compute
  • 39. Separated Compute and Data
      [Diagram: on one virtualization host, several virtual Hadoop nodes each run a Task Tracker with task slots, next to another workload and a virtual Hadoop node running the Datanode; storage is provided through VMDKs.]
      Truly Elastic Hadoop: Scalable through virtual nodes
  • 40. References
      • www.projectserengeti.org
      • www.hortonworks.com
      • www.cloudera.com
      • Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
      • MapR/Google blog: www.mapr.com/blog/google-mapr