Best Practices for Virtualizing Hadoop
March 21st, 2013          2:20 – 3:00pm
George Trujillo




© Hortonworks Inc. 2012
George Trujillo
Master Principal Big Data Specialist - Hortonworks
Tier One Big Data/BCA Specialist – VMware Center of Excellence
VMware Certified Professional
VMware Certified Instructor
MySQL Certified DBA
Sun Microsystems Ambassador for Java Platforms
Author of Linux Administration and Advanced Linux
 Administration Video Training
Oracle Double ACE
Served on Oracle Fusion Council & Oracle Beta Leadership
 Council, Independent Oracle Users Group (IOUG) Board of
 Directors, Recognized as one of the “Oracles of Oracle” by IOUG
Agenda
Title:    Best Practices for Virtualizing Hadoop

Summary: This presentation will discuss best practices for designing and building
a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure.
Attendees will learn the flexibility and operational advantages of Virtual Machines
such as fast provisioning, cloning, high levels of standardization, hybrid
storage, vMotioning, increased stabilization of the entire software stack, High
Availability and Fault Tolerance. This is a can't-miss presentation for anyone
wanting to understand design, configuration and deployment of Hadoop in virtual
infrastructures.

Agenda:
A. Hypervisors today
B. Building an enterprise virtual platform
C. Virtualizing Master and Slave servers
D. Key best practices



Grand Junction Colorado




Hypervisors Today: Faster/Less Overhead

• VMware vSphere, Microsoft Hyper-V Server, Citrix
  XenServer and Red Hat RHEV

Hypervisor    Performance Benchmarks                          % Overhead
VMware        1M IOPS with 1 microsecond of latency (5.1)     2 – 10%
KVM           1M transactions/minute (IBM hardware, RHEL)     10%

VMware Performance            vSphere 5.1
vCPUs                         64
RAM per VM / RAM per Host     1 TB / 2 TB
Network                       36 GB/s
IOPS                          1,000,000

vSphere 5.1 – 1M IOPS with 1 μs latency




Why Virtualize Hadoop
• Virtual Servers offer advantages over Physical Servers
• Enabling Hadoop as a service in a public or private
  cloud
• Cloud providers are making it easy to deploy Hadoop
  for POCs, dev and test environments
• Running a Consistent, Highly Reliable Hardware
  Environment
• Standardizing on a Single Common Hardware Platform
  (software stack)
• Virtualization is a natural step towards the cloud
• Cloud and virtualization vendors are offering elastic
  MapReduce solutions

Virtualization Features

Faster provisioning              Live Cloning
Live migrations                  Templates
Live storage migrations          Distributed Resource Scheduling
High Availability                Hot CPU and Memory add
VM Replication
Network isolation using VXLANs   Multi-VM trust zones
VM Backups                       Distributed Power Management
Elasticity                       Multi-tenancy
Storage I/O Control              Network I/O Control
16Gb FC Support                  iSCSI Jumbo Frame Support




Building an Enterprise Virtual Platform
Hortonworks Data Platform (layers, top to bottom)
• Data Extraction and Load
  – Sqoop (DB Transfer)
  – Talend (Data Transfer)
  – WebHDFS (REST API)
  – FlumeNG (Data Transfer)
  – “Others” (Vendors and 3rd party tools)
• Management and Monitoring
  – Ambari (Management)
  – Oozie (Workflow)
  – Ganglia (Monitoring)
  – Nagios (Alerts)
• Hadoop Essentials
  – WebHCatalog (REST-like APIs)
  – Pig (Scripting)
  – Mahout (Machine Learning)
  – HCatalog (Metadata Mgmt)
  – Hive (Query)
  – HBase (Column DB)
  – Zookeeper (Coordination)
• Core Hadoop (kernel)
  – Distributed Processing (MapReduce)
  – Distributed Storage (HDFS)
• Linux / Windows
• Hypervisor
• Hardware



Virtualizing Master Servers
• Virtualize the master servers (NameNode, JobTracker,
  HBase Master, Secondary NameNode)
  – Also virtualize supporting servers: Ganglia, Nagios, Ambari, Active Directory, metadata databases
• Virtualizing Hadoop master servers offers:
  – Virtualization features to master servers
  – VMware High Availability
• Develop a hybrid storage solution




Protecting Master Servers
• Shared storage is required to fully leverage
  virtualization features.
• A virtual enterprise platform will provide:
  – Less down time (Live migrations, cloning, …)
  – A more reliable software stack
  – A higher Quality of Service
  – Reduced CapEx and OpEx
  – Increased operational flexibility




Configure Hardware Properly
• Do not overcommit SLA or production environments
• Understand how NUMA affects your VMs – try to keep
  the VM size within the NUMA node
  – Look at disabling node interleaving (leave NUMA enabled)
  – Maintain memory locality
• Leverage hyperthreading – make sure there is
  hardware and BIOS support
  – Hyper-Threading can improve performance by up to 20%
• Do not set memory limits on production servers.
• Let the hypervisor control power management by setting
  the BIOS to “OS Controlled Mode”
• Enable C1E in BIOS
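
The NUMA guidance above can be turned into a quick sizing check. The following is a minimal sketch, not a vendor tool: the `NumaNode` type, the function name, and the example host (8 cores and 64 GB per node) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    cores: int        # physical cores in one NUMA node (assumed value)
    memory_gb: float  # memory attached to that node (assumed value)


def fits_in_numa_node(vcpus: int, vm_memory_gb: float, node: NumaNode) -> bool:
    """Return True if the VM's vCPUs and memory both fit inside one NUMA node."""
    return vcpus <= node.cores and vm_memory_gb <= node.memory_gb


# Hypothetical host: each NUMA node has 8 cores and 64 GB of local memory.
node = NumaNode(cores=8, memory_gb=64)
print(fits_in_numa_node(8, 48, node))   # VM stays within the node
print(fits_in_numa_node(12, 48, node))  # VM would span NUMA nodes
```

A VM that fails this check can still run, but its memory accesses may cross NUMA nodes, which is what the "maintain memory locality" bullet is warning about.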

Configure Hardware Properly (2)
• Run latest version of BIOS and VMware Tools
• Verify BIOS settings enable all populated processor
  sockets and enable all cores in each socket.
• Enable “Turbo Boost” in BIOS if processors support
  it.
• Disabling hardware devices (in BIOS) can free
  interrupt resources.
  – COM and LPT ports, USB controllers, floppy drives, network
    interfaces, optical drives, storage controllers, etc.
• Enable virtualization features in BIOS (VT-x, AMD-V,
  EPT, RVI)
• Initially leave memory scrubbing rate at
  manufacturer’s default setting.
More Best Practices
• Use latest hypervisor and virtual tools
• Configure an OS kernel as a single-core or multi-core
  kernel based on the number of vCPUs being used.
• Size virtual machines to avoid entering the host “soft”
  memory state and the likely breaking of host large
  pages into small pages. Leaving at least 6% of memory
  for the hypervisor and VM memory overhead is a
  conservative guideline.
  – If free memory drops below minFree (“soft” memory
    state), memory will be reclaimed through ballooning and other
    memory management techniques. All these techniques require
    breaking host large pages into small pages.
• Have a very good reason for using CPU affinity;
  otherwise avoid it like the plague
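
The 6% memory guideline above is simple arithmetic, and a small helper makes it easy to check a planned VM layout against it. This is a sketch under stated assumptions: the function names and the example host size (256 GB) are hypothetical, and the 6% default comes from the slide, not from a measured overhead.

```python
def max_vm_memory_gb(host_memory_gb: float, reserve_fraction: float = 0.06) -> float:
    """Memory left for VMs after reserving a fraction for the hypervisor and VM overhead."""
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return host_memory_gb * (1 - reserve_fraction)


def is_overcommitted(vm_sizes_gb, host_memory_gb: float,
                     reserve_fraction: float = 0.06) -> bool:
    """Return True if the configured VM memory exceeds what the reserve allows."""
    return sum(vm_sizes_gb) > max_vm_memory_gb(host_memory_gb, reserve_fraction)


# Hypothetical 256 GB host.
print(is_overcommitted([60, 60, 60, 60], 256))  # four 60 GB VMs leave the reserve intact
print(is_overcommitted([64, 64, 64, 64], 256))  # four 64 GB VMs consume the reserve
```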
Linux Best Practices
• Kernel parameters:
  – nofile=16384
  – nproc=32000
  – Mount with the noatime and nodiratime options (access-time updates disabled)
  – File descriptors set to 65535
  – File system read-ahead buffer should be increased to 1,024 or 2,048.
  – Epoll file descriptor limit should be increased to 4096
• Turn off swapping
• Use ext4 or XFS (mount noatime)
  – ext4 can be about 5% better on reads than XFS
  – XFS can be 12-25% better on writes (and auto defrags in the
    background)
• Linux 2.6.30+ can improve energy consumption by as much as 60%.
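
One way to audit the mount recommendation above is to scan the mount table for ext4 or XFS file systems that lack noatime. The sketch below parses text in the `/proc/mounts` layout (device, mount point, file system type, options); the function name and the sample devices and paths are hypothetical.

```python
def mounts_missing_noatime(mounts_text: str, fstypes=("ext4", "xfs")):
    """Return mount points of the given file system types that lack the noatime option."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing


# Sample text in /proc/mounts layout; devices and paths are made up.
sample = """\
/dev/sdb1 /data/1 ext4 rw,noatime,nodiratime 0 0
/dev/sdc1 /data/2 xfs rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""
print(mounts_missing_noatime(sample))
```

On a Linux host, the same function can be given the contents of `/proc/mounts` instead of the sample string.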

Networking Best Practices
• Separate VM traffic from live migration and
  management traffic
  – Separate NICs with separate vSwitches
• Leverage NIC teaming (at least 2 NICs per vSwitch)
• Leverage latest drivers for hypervisor vendor
• Avoid multi-queue networking: Hadoop drives a high
  packet rate, but not high enough to justify the
  overhead of multi-queue.
• Network:
  – Channel bonding two GbE ports can give better I/O performance
  – 8 Queues per port



Networking Best Practices (2)
• Evaluate these features with network adapters to
  leverage hardware features:
  – Checksum offload
  – TCP segmentation offload (TSO)
  – Jumbo frames (JF)
  – Large receive offload (LRO)
  – Ability to handle high-memory DMA (that is, 64-bit DMA
    addresses)
  – Ability to handle multiple Scatter Gather elements per Tx frame
• Optimize 10 Gigabit Ethernet network adapters
  – Features like NetQueue can significantly improve performance of
    10 Gigabit Ethernet network adapters in virtualized environments.



Storage Best Practices
• Make good storage decisions
  – e.g., VMFS or Raw Device Mappings (RDM)
      – VMDK – leverages all features of virtualization
      – RDM – leverages features of storage vendors (replication,
        snapshots, …)
  – Run in Advanced Host Controller Interface (AHCI) mode.
  – Enable Native Command Queuing (NCQ).
• Use multiple vSCSI adapters and evenly distribute
  target devices
• Use eagerzeroedthick for VMDK files or uncheck the
  Windows “Quick Format” option
• Make sure there is block alignment for storage
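
The block alignment bullet can be checked with a one-line calculation: a partition is aligned when its starting byte offset is a multiple of the storage boundary. This is a minimal sketch; the 512-byte sector size and 1 MiB boundary are assumptions chosen for illustration, so substitute the values your storage vendor recommends.

```python
def is_aligned(start_sector: int, sector_bytes: int = 512,
               boundary_bytes: int = 1024 * 1024) -> bool:
    """Return True if the partition's starting byte offset falls on the boundary."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


print(is_aligned(2048))  # offset 1,048,576 bytes, on a 1 MiB boundary
print(is_aligned(63))    # offset 32,256 bytes, not on a 1 MiB boundary
```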


Hadoop Virtualized Extensions (HVE)
• HVE is a new feature that extends the Hadoop topology
  awareness mechanism to support rack and node
  groups with hosts containing VMs.
  – Data locality-related policies maintained within a virtual layer
• HVE merged into branch-1
  – Available in Apache Hadoop 1.2
• Extensions include:
  – Block placement and removal policies
  – Balancer policies
  – Task scheduling
  – Network topology awareness
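
To illustrate what an extra node group layer adds to topology awareness, the sketch below computes a hop distance between two locations written as `/rack/nodegroup/node` paths. It is not Hadoop's implementation: the path layout, the function name, and the example names are assumptions, and the distance is simply the number of steps up to the closest common ancestor and back down.

```python
def distance(path_a: str, path_b: str) -> int:
    """Hops between two topology paths via their closest common ancestor."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


# Hypothetical /rack/nodegroup/node paths.
print(distance("/rack1/ng1/vm1", "/rack1/ng1/vm1"))  # same node
print(distance("/rack1/ng1/vm1", "/rack1/ng1/vm2"))  # same node group
print(distance("/rack1/ng1/vm1", "/rack1/ng2/vm3"))  # same rack, different node group
print(distance("/rack1/ng1/vm1", "/rack2/ng3/vm5"))  # different rack
```

With the node group level present, two VMs in the same node group are closer than two VMs that only share a rack, which is the distinction the locality policies rely on.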




HVE: Virtualization Topology Awareness
Data Center
  Rack1
    NodeG 1: Host1 (4 VMs), Host2 (4 VMs)
    NodeG 2: Host3 (4 VMs), Host4 (4 VMs)
  Rack2
    NodeG 3: Host5 (4 VMs), Host6 (4 VMs)
    NodeG 4: Host7 (4 VMs), Host8 (4 VMs)

HVE: Replica Policies
Multiple replicas are not placed on the same node with standard or extension
replica placement/removal policies. Rules are maintained for the balancer.

Standard Replica Policies
• 1st replica is on the local (closest) node of the writer
• 2nd replica is on a separate rack from the 1st replica
• 3rd replica is on the same rack as the 2nd replica
• Remaining replicas are placed randomly across racks to meet the
  minimum restriction

Extension Replica Policies
• Multiple replicas are not placed on the same node or on nodes under
  the same node group
• 1st replica is on the local node or local node group of the writer
• 2nd replica is on a remote rack of the 1st replica
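
The extension policies above can be expressed as a small validity check on a proposed set of replica locations. The following is a sketch only, not the HVE code: the `Location` type, the function name, and the rack, node group, and node names are hypothetical, and it checks just two of the rules listed (no shared node group, 2nd replica on a remote rack).

```python
from typing import List, NamedTuple


class Location(NamedTuple):
    rack: str
    node_group: str
    node: str


def extension_policy_violations(replicas: List[Location]) -> List[str]:
    """List the extension replica rules that a proposed placement breaks."""
    problems = []
    seen_groups = set()
    for r in replicas:
        key = (r.rack, r.node_group)
        if key in seen_groups:
            problems.append(f"more than one replica in node group {r.rack}/{r.node_group}")
        seen_groups.add(key)
    if len(replicas) >= 2 and replicas[1].rack == replicas[0].rack:
        problems.append("2nd replica is on the same rack as the 1st")
    return problems


good = [Location("rack1", "ng1", "vm1"),
        Location("rack2", "ng3", "vm9"),
        Location("rack2", "ng4", "vm13")]
bad = [Location("rack1", "ng1", "vm1"),
       Location("rack1", "ng1", "vm2")]
print(extension_policy_violations(good))
print(extension_policy_violations(bad))
```

The second placement puts two replicas on different VMs under one node group, which standard node-level rules would accept but the extension policies reject.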




Follow Virtualization Best Practices

Hardware         Validate virtualization and Hadoop configurations with
                 vendor hardware compatibility lists.

Hadoop           Follow recommended Hadoop reference architectures.

Virtualization   Follow virtualization vendors' best practices,
                 deployment guides and workload characterizations.

Storage          Review storage vendor recommendations.

Internal         Validate internal best practices for configuring and
                 managing VMs.



Best Practices for Virtualizing Hadoop




        Thank You
        For Attending
George Trujillo
Blog: http://cloud-dba-journey.blogspot.com
Twitter: GeorgeTrujillo

More Related Content

What's hot

From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld
 
Introduction to hazelcast
Introduction to hazelcastIntroduction to hazelcast
Introduction to hazelcastEmin Demirci
 
Oracle Cloud Infrastructure – Storage
Oracle Cloud Infrastructure – StorageOracle Cloud Infrastructure – Storage
Oracle Cloud Infrastructure – StorageMarketingArrowECS_CZ
 
Postgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuitePostgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuiteEDB
 
Deep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL UniverseDeep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL UniverseJignesh Shah
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Technical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASTechnical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASAshnikbiz
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterAltoros
 
Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011GlusterFS
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedInDataWorks Summit
 
Red Hat Storage Day Boston - Why Software-defined Storage Matters
Red Hat Storage Day Boston - Why Software-defined Storage MattersRed Hat Storage Day Boston - Why Software-defined Storage Matters
Red Hat Storage Day Boston - Why Software-defined Storage MattersRed_Hat_Storage
 
Benefity Oracle Cloudu (3/4): Compute
Benefity Oracle Cloudu (3/4): ComputeBenefity Oracle Cloudu (3/4): Compute
Benefity Oracle Cloudu (3/4): ComputeMarketingArrowECS_CZ
 
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringSpring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringEmrah Kocaman
 
Best Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisBest Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisJignesh Shah
 
Severalnines Self-Training: MySQL® Cluster - Part VI
Severalnines Self-Training: MySQL® Cluster - Part VISeveralnines Self-Training: MySQL® Cluster - Part VI
Severalnines Self-Training: MySQL® Cluster - Part VISeveralnines
 
NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5UniFabric
 
Check Point automatizace a orchestrace
Check Point automatizace a orchestraceCheck Point automatizace a orchestrace
Check Point automatizace a orchestraceMarketingArrowECS_CZ
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudKellyn Pot'Vin-Gorman
 

What's hot (20)

From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Introduction to hazelcast
Introduction to hazelcastIntroduction to hazelcast
Introduction to hazelcast
 
Oracle Cloud Infrastructure – Storage
Oracle Cloud Infrastructure – StorageOracle Cloud Infrastructure – Storage
Oracle Cloud Infrastructure – Storage
 
Postgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuitePostgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster Suite
 
Deep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL UniverseDeep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL Universe
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
Technical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASTechnical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPAS
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop Cluster
 
Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Red Hat Storage Day Boston - Why Software-defined Storage Matters
Red Hat Storage Day Boston - Why Software-defined Storage MattersRed Hat Storage Day Boston - Why Software-defined Storage Matters
Red Hat Storage Day Boston - Why Software-defined Storage Matters
 
Benefity Oracle Cloudu (3/4): Compute
Benefity Oracle Cloudu (3/4): ComputeBenefity Oracle Cloudu (3/4): Compute
Benefity Oracle Cloudu (3/4): Compute
 
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringSpring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
 
Best Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisBest Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on Solaris
 
Hadoop on Virtual Machines
Hadoop on Virtual MachinesHadoop on Virtual Machines
Hadoop on Virtual Machines
 
Severalnines Self-Training: MySQL® Cluster - Part VI
Severalnines Self-Training: MySQL® Cluster - Part VISeveralnines Self-Training: MySQL® Cluster - Part VI
Severalnines Self-Training: MySQL® Cluster - Part VI
 
NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5
 
Check Point automatizace a orchestrace
Check Point automatizace a orchestraceCheck Point automatizace a orchestrace
Check Point automatizace a orchestrace
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
 

Similar to Best Practices for Virtualizing Hadoop

Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopHortonworks
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudCloudera, Inc.
 
Building a Distributed Block Storage System on Xen
Building a Distributed Block Storage System on XenBuilding a Distributed Block Storage System on Xen
Building a Distributed Block Storage System on XenThe Linux Foundation
 
Private cloud server virtualization
Private cloud server virtualization Private cloud server virtualization
Private cloud server virtualization Pierre-Juan Labeyrie
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Richard McDougall
 
The Kubernetes WebLogic revival (part 1)
The Kubernetes WebLogic revival (part 1)The Kubernetes WebLogic revival (part 1)
The Kubernetes WebLogic revival (part 1)Simon Haslam
 
Ws08 r2 hyper v overview r2
Ws08 r2 hyper v overview r2Ws08 r2 hyper v overview r2
Ws08 r2 hyper v overview r2Omid Koushki
 
South jersey sql virtualization
South jersey sql virtualizationSouth jersey sql virtualization
South jersey sql virtualizationJoseph D'Antoni
 
VMWARE VS MS-HYPER-V
VMWARE VS MS-HYPER-VVMWARE VS MS-HYPER-V
VMWARE VS MS-HYPER-VDavid Ramirez
 
Making clouds: turning opennebula into a product
Making clouds: turning opennebula into a productMaking clouds: turning opennebula into a product
Making clouds: turning opennebula into a productCarlo Daffara
 
Making Clouds: Turning OpenNebula into a Product
Making Clouds: Turning OpenNebula into a ProductMaking Clouds: Turning OpenNebula into a Product
Making Clouds: Turning OpenNebula into a ProductNETWAYS
 
OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...
OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...
OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...OpenNebula Project
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopPOSSCON
 

Similar to Best Practices for Virtualizing Hadoop (20)

Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in Cloud
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Building a Distributed Block Storage System on Xen
Building a Distributed Block Storage System on XenBuilding a Distributed Block Storage System on Xen
Building a Distributed Block Storage System on Xen
 
Private cloud server virtualization
Private cloud server virtualization Private cloud server virtualization
Private cloud server virtualization
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 
The Kubernetes WebLogic revival (part 1)
The Kubernetes WebLogic revival (part 1)The Kubernetes WebLogic revival (part 1)
The Kubernetes WebLogic revival (part 1)
 
Ws08 r2 hyper v overview r2
Ws08 r2 hyper v overview r2Ws08 r2 hyper v overview r2
Ws08 r2 hyper v overview r2
 
South jersey sql virtualization
South jersey sql virtualizationSouth jersey sql virtualization
South jersey sql virtualization
 
Virtualization Smackdown
Virtualization SmackdownVirtualization Smackdown
Virtualization Smackdown
 
VMWARE VS MS-HYPER-V
VMWARE VS MS-HYPER-VVMWARE VS MS-HYPER-V
VMWARE VS MS-HYPER-V
 
Making clouds: turning opennebula into a product
Making clouds: turning opennebula into a productMaking clouds: turning opennebula into a product
Making clouds: turning opennebula into a product
 
Making Clouds: Turning OpenNebula into a Product
Making Clouds: Turning OpenNebula into a ProductMaking Clouds: Turning OpenNebula into a Product
Making Clouds: Turning OpenNebula into a Product
 
OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...
OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...
OpenNebulaConf 2013 - Making Clouds: Turning OpenNebula into a Product by Car...
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
 
What's New in RHEL 6 for Linux on System z?
What's New in RHEL 6 for Linux on System z?What's New in RHEL 6 for Linux on System z?
What's New in RHEL 6 for Linux on System z?
 
Sql saturday dc vm ware
Sql saturday dc vm wareSql saturday dc vm ware
Sql saturday dc vm ware
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber

Best Practices for Virtualizing Hadoop

  • 1. Best Practices for Virtualizing Hadoop March 21st, 2013 2:20 – 3:00pm George Trujillo © Hortonworks Inc. 2012
  • 2. George Trujillo
      Master Principal Big Data Specialist - Hortonworks
      Tier One Big Data/BCA Specialist – VMware Center of Excellence
      VMware Certified Professional
      VMware Certified Instructor
      MySQL Certified DBA
      Sun Microsystems Ambassador for Java Platforms
      Author of Linux Administration and Advanced Linux Administration Video Training
      Oracle Double ACE
      Served on Oracle Fusion Council & Oracle Beta Leadership Council, Independent Oracle Users Group (IOUG) Board of Directors
      Recognized as one of the “Oracles of Oracle” by IOUG
  • 3. Agenda
      Title: Best Practices for Virtualizing Hadoop
      Summary: This presentation will discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of Virtual Machines such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand design, configuration and deployment of Hadoop in virtual infrastructures.
      Agenda:
      A. Hypervisors today
      B. Building an enterprise virtual platform
      C. Virtualizing Master and Slave servers
      D. Key best practices
  • 4. Grand Junction, Colorado
  • 5. Hypervisors Today: Faster/Less Overhead
      • VMware vSphere, Microsoft Hyper-V Server, Citrix XenServer and Red Hat RHEV

      Hypervisor    Performance Benchmarks                         % Overhead
      VMware        1M IOPS with 1 microsecond of latency (5.1)    2 – 10%
      KVM           1M transactions/minute (IBM hardware RHEL)     10%

      Hypervisor Performance       vSphere 5.1 (VMware)
      vCPUs                        64
      RAM per VM / RAM per Host    1TB / 2TB
      Network                      36 GB/s
      IOPS                         1,000,000

      vSphere 5.1 – 1M IOPS with 1 μs latency
  • 6. Why Virtualize Hadoop
      • Virtual Servers offer advantages over Physical Servers
      • Enabling Hadoop as a service in a public or private cloud
      • Cloud providers are making it easy to deploy Hadoop for POCs, dev and test environments
      • Running a Consistent, Highly Reliable Hardware Environment
      • Standardizing on a Single Common Hardware Platform (software stack)
      • Virtualization is a natural step towards the cloud
      • Cloud and virtualization vendors are offering elastic MapReduce solutions
  • 7. Virtualization Features
      • Faster provisioning
      • Live Cloning
      • Live migrations
      • Templates
      • Live storage migrations
      • Distributed Resource Scheduling
      • High Availability
      • Hot CPU and Memory add
      • VM Replication
      • Network isolation using VXLANs
      • Multi-VM trust zones
      • VM Backups
      • Distributed Power Management
      • Elasticity
      • Multi-tenancy
      • Storage I/O Control
      • Network I/O Control
      • 16Gb FC Support
      • iSCSI Jumbo Frame Support
  • 8. Building an Enterprise Virtual Platform
      Hortonworks Data Platform (stack diagram, top layer first):
      • Data Extraction and Load: Sqoop (DB Transfer), Talend (Data Transfer), WebHDFS (REST API), FlumeNG (Data Transfer), “Others” (Vendors and 3rd party tools)
      • Management, Monitoring: Ambari (Management), Oozie (Workflow), Ganglia (Monitoring), Nagios (Alerts)
      • Hadoop Essentials: WebHCatalog (REST-like APIs), Pig (Scripting), Mahout (Machine Learning), HCatalog (Metadata Mgmt), Hive (Query), HBase (Column DB), Zookeeper (Coordination)
      • Core Hadoop (kernel): Distributed Processing (MapReduce), Distributed Storage (HDFS)
      • Linux, Windows
      • Hypervisor
      • Hardware
  • 9. Virtualizing Master Servers
      • Virtualize the master servers (NameNode, JobTracker, HBase Master, Secondary NameNode)
          – Ganglia, Nagios, Ambari, Active Directory, Metadata databases
      • Virtualizing Hadoop master servers offers:
          – Virtualization features to master servers
          – VMware High Availability
      • Develop a hybrid storage solution
  • 10. Protecting Master Servers
      • Shared storage is required to fully leverage virtualization features.
      • A virtual enterprise platform will provide:
          – Less downtime (Live migrations, cloning, …)
          – A more reliable software stack
          – A higher Quality of Service
          – Reduced CapEx and OpEx
          – Increased operational flexibility
  • 11. Configure Hardware Properly
      • Do not overcommit SLA or production environments
      • Understand how NUMA affects your VMs
          – Try to keep the VM size within the NUMA node
          – Look at disabling node interleaving (leave NUMA enabled)
          – Maintain memory locality
      • Leverage hyperthreading
          – Make sure there is hardware and BIOS support
          – Hyper-Threading can improve performance up to 20%
      • Do not set memory limits on production servers.
      • Let the hypervisor control power management by setting the BIOS to “OS Controlled Mode”
      • Enable C1E in BIOS
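The NUMA guidance on this slide (keep the VM size within the NUMA node) can be checked with a few lines of arithmetic. The following is a minimal sketch only; the function name and the host dimensions in the example are hypothetical and not taken from the presentation.

```python
def fits_in_numa_node(vm_vcpus, vm_mem_gb, cores_per_node, mem_per_node_gb):
    """Return True if the VM's vCPUs and memory both fit inside one NUMA node.

    A VM that fits can keep its memory local to the cores it runs on.
    """
    return vm_vcpus <= cores_per_node and vm_mem_gb <= mem_per_node_gb


# Hypothetical host: each NUMA node has 8 cores and 64 GB of RAM.
small_vm_ok = fits_in_numa_node(8, 48, cores_per_node=8, mem_per_node_gb=64)
wide_vm_ok = fits_in_numa_node(12, 48, cores_per_node=8, mem_per_node_gb=64)
print(small_vm_ok, wide_vm_ok)
```

A VM that fails this check spans NUMA nodes, so some of its memory accesses will be remote.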
  • 12. Configure Hardware Properly (2)
      • Run latest version of BIOS and VMware Tools
      • Verify BIOS settings enable all populated processor sockets and enable all cores in each socket.
      • Enable “Turbo Boost” in BIOS if processors support it.
      • Disabling hardware devices (in BIOS) can free interrupt resources.
          – COM and LPT ports, USB controllers, floppy drives, network interfaces, optical drives, storage controllers, etc.
      • Enable virtualization features in BIOS (VT-x, AMD-V, EPT, RVI)
      • Initially leave memory scrubbing rate at manufacturer’s default setting.
  • 13. More Best Practices
      • Use latest hypervisor and virtual tools
      • Configure an OS kernel as a single-core or multi-core kernel based on the number of vCPUs being used.
      • Size virtual machines to avoid entering host “soft” memory state and the likely breaking of host large pages into small pages. Leaving at least 6% of memory for the hypervisor and VM memory overhead is a conservative guideline.
          – If free memory drops below minFree (“soft” memory state), memory will be reclaimed through ballooning and other memory management techniques. All these techniques require breaking host large pages into small pages.
      • Have a very good reason for using CPU affinity; otherwise avoid it like the plague
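The 6% reservation on this slide translates into a simple sizing check. This is a rough sketch under stated assumptions: the function names and the 256 GB host are illustrative, and the real memory overhead depends on the hypervisor version and VM configuration.

```python
def vm_memory_budget_gb(host_mem_gb, reserve_fraction=0.06):
    """Memory left for VMs after reserving a fraction for the hypervisor
    and per-VM overhead (6% is the conservative figure from the slide)."""
    return host_mem_gb * (1.0 - reserve_fraction)


def is_overcommitted(host_mem_gb, vm_sizes_gb, reserve_fraction=0.06):
    """Return True if the configured VM memory exceeds the budget."""
    return sum(vm_sizes_gb) > vm_memory_budget_gb(host_mem_gb, reserve_fraction)


# Hypothetical 256 GB host: four 60 GB VMs fit, four 64 GB VMs do not.
print(is_overcommitted(256, [60, 60, 60, 60]))
print(is_overcommitted(256, [64, 64, 64, 64]))
```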
  • 14. Linux Best Practices
      • Kernel parameters:
          – nofile=16384
          – nproc=32000
          – Mount with the noatime and nodiratime options (access-time updates disabled)
          – File descriptors set to 65535
          – File system read-ahead buffer should be increased to 1024 or 2048.
          – Epoll file descriptor limit should be increased to 4096
      • Turn off swapping
      • Use ext4 or xfs (mount noatime)
          – Ext4 can be about 5% better on reads than xfs
          – XFS can be 12-25% better on writes (and auto defrags in the background)
      • Linux 2.6.30+ can give 60% better energy consumption.
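Two of the settings on this slide (the nofile/nproc limits and the noatime mount option) are easy to audit from a script. The sketch below is illustrative only: the helper names are hypothetical, the target values are the ones quoted on the slide, and `current_soft_limits` assumes a Linux host where Python's `resource` module is available.

```python
# Target values quoted on the slide.
RECOMMENDED_LIMITS = {"nofile": 16384, "nproc": 32000}


def missing_mount_options(proc_mounts_line, required=("noatime",)):
    """Given one line in /proc/mounts format, list required options that are absent.

    Fields are: device, mount point, fs type, options, dump, pass.
    Add "nodiratime" to `required` to check for it explicitly.
    """
    options = set(proc_mounts_line.split()[3].split(","))
    return [opt for opt in required if opt not in options]


def limit_shortfalls(current, recommended=RECOMMENDED_LIMITS):
    """Return {name: (current, target)} for every limit below its target."""
    return {
        name: (current.get(name, 0), target)
        for name, target in recommended.items()
        if current.get(name, 0) < target
    }


def current_soft_limits():
    """Read this process's soft limits (Linux only)."""
    import resource

    def soft(which):
        value, _hard = resource.getrlimit(which)
        # RLIM_INFINITY means "unlimited"; treat it as larger than any target.
        return float("inf") if value == resource.RLIM_INFINITY else value

    return {
        "nofile": soft(resource.RLIMIT_NOFILE),
        "nproc": soft(resource.RLIMIT_NPROC),
    }


# Example with a hypothetical data-disk mount line.
print(missing_mount_options("/dev/sdb1 /data/1 ext4 rw,relatime 0 0"))
print(limit_shortfalls({"nofile": 1024, "nproc": 32000}))
```

On a real node, `limit_shortfalls(current_soft_limits())` reports which limits still need raising for the user running the script.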
  • 15. Networking Best Practices
      • Separate VM traffic from live migration and management traffic
          – Separate NICs with separate vSwitches
      • Leverage NIC teaming (at least 2 NICs per vSwitch)
      • Leverage latest drivers for hypervisor vendor
      • Avoid multi-queue networking: Hadoop drives a high packet rate, but not high enough to justify the overhead of multi-queue.
      • Network:
          – Channel bonding two GbE ports can give better I/O performance
          – 8 Queues per port
  • 16. Networking Best Practices (2)
      • Evaluate these features with network adapters to leverage hardware features:
          – Checksum offload
          – TCP segmentation offload (TSO)
          – Jumbo frames (JF)
          – Large receive offload (LRO)
          – Ability to handle high-memory DMA (that is, 64-bit DMA addresses)
          – Ability to handle multiple Scatter Gather elements per Tx frame
      • Optimize 10 Gigabit Ethernet network adapters
          – Features like NetQueue can significantly improve performance of 10 Gigabit Ethernet network adapters in virtualized environments.
  • 17. Storage Best Practices
      • Make good storage decisions
          – i.e. VMFS or Raw Device Mappings (RDM)
          – VMDK – leverages all features of virtualization
          – RDM – leverages features of storage vendors (replication, snapshots, …)
          – Run in Advanced Host Controller Interface (AHCI) mode.
          – Native Command Queuing (NCQ) enabled
      • Use multiple vSCSI adapters and evenly distribute target devices
      • Use eagerzeroedthick for VMDK files or uncheck Windows “Quick Format” option
      • Make sure there is block alignment for storage
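The block-alignment bullet on this slide comes down to a modulo check on a partition's starting offset. The sketch below assumes a 1 MiB alignment boundary and 512-byte sectors, neither of which is stated in the presentation; the right boundary depends on the storage array and file system in use.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def is_aligned(start_sector, boundary_bytes=1024 * 1024, sector_bytes=SECTOR_BYTES):
    """Return True if a partition starting at `start_sector` begins on a
    multiple of `boundary_bytes` (1 MiB here, as an assumed boundary)."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


# A partition starting at sector 2048 begins exactly at 1 MiB;
# one starting at the legacy sector 63 does not.
print(is_aligned(2048))
print(is_aligned(63))
```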
  • 18. Hadoop Virtualized Extensions (HVE)
      • HVE is a new feature that extends the Hadoop topology awareness mechanism to support rack and node groups with hosts containing VMs.
          – Data locality-related policies maintained within a virtual layer
      • HVE merged into branch-1
          – Available in Apache Hadoop 1.2
      • Extensions include:
          – Block placement and removal policies
          – Balancer policies
          – Task scheduling
          – Network topology awareness
  • 19. HVE: Virtualization Topology Awareness
      • HVE is a new feature that extends the Hadoop topology awareness mechanism to support rack and node groups with hosts containing VMs.
          – Data locality-related policies maintained within a virtual layer.
      [Topology diagram: Data Center > Rack1, Rack2 > NodeG 1 to NodeG 4 > Host1 to Host8 > VMs]
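Hadoop learns its topology from a script that maps each host or address to a location path, and with node groups the path gains one extra level below the rack. The following is a minimal sketch of such a resolver. The host-to-node-group assignment (two hosts per node group, two node groups per rack), the names, and the default location are all assumptions made for illustration; the slide's diagram does not state them.

```python
# Hypothetical layout: two hosts per node group, two node groups per rack.
TOPOLOGY = {
    "host1": ("rack1", "nodegroup1"),
    "host2": ("rack1", "nodegroup1"),
    "host3": ("rack1", "nodegroup2"),
    "host4": ("rack1", "nodegroup2"),
    "host5": ("rack2", "nodegroup3"),
    "host6": ("rack2", "nodegroup3"),
    "host7": ("rack2", "nodegroup4"),
    "host8": ("rack2", "nodegroup4"),
}

# Assumed fallback for hosts that are not in the table.
DEFAULT_LOCATION = "/default-rack/default-nodegroup"


def resolve(host):
    """Map a host name to a '/rack/nodegroup' location path."""
    entry = TOPOLOGY.get(host)
    if entry is None:
        return DEFAULT_LOCATION
    rack, node_group = entry
    return "/{}/{}".format(rack, node_group)


def main(argv):
    """Print one location per host given as an argument, in order."""
    for host in argv:
        print(resolve(host))


main(["host1", "host6", "unknown-host"])
```

In a real deployment the table would come from the virtualization inventory rather than being hard-coded, so that VMs on the same physical host share a node group.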
  • 20. HVE: Replica Policies
      Multiple replicas are not placed on the same node with standard or extension replica placement/removal policies. Rules are maintained for the balancer.
      Standard Replica Policies:
          – 1st replica is on the local (closest) node of the writer
          – 2nd replica is on a separate rack from the 1st replica
          – 3rd replica is on the same rack as the 2nd replica
      Extension Replica Policies:
          – Multiple replicas are not placed on the same node or on nodes under the same node group
          – 1st replica is on the local node or local node group of the writer
          – 2nd replica is on a remote rack of the 1st replica
      Remaining replicas are placed randomly across racks to meet the minimum restriction.
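The extension rules on this slide can be expressed as a small validator over a proposed replica placement. This is a sketch of the rules as worded on the slide, not Hadoop's block placement code; the `Location` type and function name are hypothetical.

```python
from collections import namedtuple

# A replica location: rack, node group (physical host), and node (VM).
Location = namedtuple("Location", ["rack", "node_group", "node"])


def extension_policy_violations(replicas):
    """Return a list of rule violations for an ordered replica placement.

    Rules checked (from the slide):
      * no two replicas on the same node or under the same node group
      * the 2nd replica is on a remote rack relative to the 1st
    """
    violations = []
    seen_nodes = set()
    seen_groups = set()
    for loc in replicas:
        group = (loc.rack, loc.node_group)
        if loc in seen_nodes:
            violations.append("two replicas on node {}".format(loc.node))
        elif group in seen_groups:
            violations.append("two replicas in node group {}".format(loc.node_group))
        seen_nodes.add(loc)
        seen_groups.add(group)
    if len(replicas) >= 2 and replicas[0].rack == replicas[1].rack:
        violations.append("2nd replica is not on a remote rack")
    return violations


good = [
    Location("rack1", "ng1", "vm1"),
    Location("rack2", "ng3", "vm9"),
    Location("rack2", "ng4", "vm13"),
]
bad = [Location("rack1", "ng1", "vm1"), Location("rack1", "ng1", "vm2")]
print(extension_policy_violations(good))
print(extension_policy_violations(bad))
```

The second placement fails twice: both replicas sit on VMs of the same physical host, and the second replica shares a rack with the first.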
  • 21. Follow Virtualization Best Practices
      Hardware          Validate virtualization and Hadoop configurations with vendor hardware compatibility lists.
      Hadoop            Follow recommended Hadoop reference architectures.
      Virtualization    Follow virtualization vendors' best practices, deployment guides and workload characterizations.
      Storage           Review storage vendor recommendations.
      Internal          Validate internal best practices for configuring and managing VMs.
  • 22. Best Practices for Virtualizing Hadoop
      Thank You For Attending
      George Trujillo
      Blog: http://cloud-dba-journey.blogspot.com
      Twitter: GeorgeTrujillo