APOLLO GROUP




Hadoop Operations: Starting Out Small
So Your Cluster Isn't Yahoo-sized (yet)
Michael Arnold
Principal Systems Engineer
14 June 2012
Agenda

  Who
  What (Definitions)
  Decisions for Now
  Decisions for Later
  Lessons Learned




APOLLO GROUP             © 2012 Apollo Group        2




  Who




Who is Apollo?

        Apollo Group is a leading provider of higher
          education programs for working adults.




Who is Michael Arnold?

  Systems Administrator
  Automation geek
  13 years in IT
  I deal with:
      –Server hardware specification/configuration
      –Server firmware
      –Server operating system
      –Hadoop application health
      –Monitoring all the above






  What
  Definitions




Definitions

  Q: What is a tiny/small/medium/large cluster?
  A:
      –Tiny:          1-9
      –Small:         10-99
      –Medium:        100-999
      –Large:         1000+
      –Yahoo-sized:   4000
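
The size bands above can be written down as a small lookup. This is a minimal Python sketch; the function name `cluster_size_label` is hypothetical, and reading "Yahoo-sized: 4000" as "4000 nodes or more" is an assumption, since the slide gives only the single figure.

```python
def cluster_size_label(node_count: int) -> str:
    """Map a node count to the size bands listed on the slide."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count < 10:
        return "tiny"
    if node_count < 100:
        return "small"
    if node_count < 1000:
        return "medium"
    # Assumption: "Yahoo-sized: 4000" is treated as a floor, not an exact count.
    if node_count >= 4000:
        return "yahoo-sized"
    return "large"
```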




Definitions

  Q: What is a “headnode”?
  A: A server that runs one or more of the following
   Hadoop processes:
      –NameNode
      –JobTracker
      –Secondary NameNode
      –ZooKeeper
      –HBase Master








  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Now



Which Hadoop distribution?

  Amazon
  Apache
  Cloudera
  Greenplum
  Hortonworks
  IBM
  MapR
  Platform Computing



Should you virtualize?

  Can be OK for small clusters BUT
      –virtualization adds overhead
      –can cause performance degradation
      –cannot take advantage of Hadoop rack locality
  Virtualization can be good for:
      –functional testing of M/R job or workflow changes
      –evaluation of Hadoop upgrades




What sort of hardware should you be considering?

  Inexpensive
  Not “enterprisey” hardware
     –No RAID*
     –No Redundant power*
  Low power consumption
  No optical drives
     –get systems that can boot off the network



                                              * except in headnodes

Plan for capacity expansion

  Start at the bottom and
   work your way up
  Leave room in your
   cabinets for more
   machines




Plan for capacity expansion (cont.)

  Deploy your initial
   cluster in two cabinets
     –One headnode, one
      switch, and several
      (five) datanodes per
      cabinet




Plan for capacity expansion (cont.)

  Install a second cluster
   in the empty space in
   the upper half of the
   cabinet








  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Later



What size cluster?

  Depends upon your:
  Budget
  Data size
  Workload characteristics
  SLA




What size cluster? (cont.)

  Are your MapReduce jobs:
  compute-intensive?
  reading lots of data?

  http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
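
For the "data size" factor alone, a first-pass storage estimate might look like the sketch below. It is only a rough illustration: `datanodes_needed` and `scratch_fraction` are hypothetical names, the 25% scratch reserve is an assumed placeholder, and the only fixed fact used is that HDFS replicates each block three times by default. Compute-intensive workloads are more likely to be sized by CPU and RAM than by disk.

```python
import math

def datanodes_needed(data_tb: float,
                     disk_per_node_tb: float,
                     replication: int = 3,
                     scratch_fraction: float = 0.25) -> int:
    """Estimate how many datanodes are needed to store data_tb of user data.

    replication=3 is the HDFS default. scratch_fraction is an assumed share
    of each node's disk kept free for MapReduce intermediate output.
    """
    if not 0 <= scratch_fraction < 1:
        raise ValueError("scratch_fraction must be in [0, 1)")
    raw_tb = data_tb * replication
    usable_per_node_tb = disk_per_node_tb * (1 - scratch_fraction)
    return math.ceil(raw_tb / usable_per_node_tb)
```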




Should you implement rack awareness?


        If more than one switch in the cluster:

                           YES
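
Rack awareness is driven by a topology script: Hadoop calls it with one or more node addresses as arguments and reads back one rack path per address. In Hadoop 1.x-era releases the script is named by the `topology.script.file.name` property. The sketch below is a minimal Python version; the subnet-to-cabinet table is entirely hypothetical and assumes each switch serves its own /24 subnet.

```python
#!/usr/bin/env python3
import sys

# Hypothetical layout: one switch (and one subnet) per cabinet.
RACK_BY_SUBNET = {
    "10.1.1": "/cabinet1",
    "10.1.2": "/cabinet2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(address: str) -> str:
    """Return the rack path for an IPv4 address, keyed on its first three octets."""
    subnet = address.rsplit(".", 1)[0]
    return RACK_BY_SUBNET.get(subnet, DEFAULT_RACK)

def main(argv):
    # Hadoop may pass several addresses at once and expects
    # whitespace-separated rack paths, in the same order, on stdout.
    for address in argv:
        print(rack_for(address))

if __name__ == "__main__":
    main(sys.argv[1:])
```

Anything the table does not recognise, including a bare hostname, falls back to `/default-rack` rather than failing.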




Should you use automation?

       If not in the beginning, then as soon as
                        possible.

  Boot disks will fail.
  Automated OS and application installs:
      –Save time
      –Reduce errors
          •Cobbler/Spacewalk/Foreman/xCAT/etc.
          •Puppet/Chef/CFEngine/shell scripts/etc.





  Lessons Learned




Keep It Simple

            Don't add redundancy and features
         (server/network) that will make things more
                 complicated and expensive.

               Hadoop has built-in redundancies.

                     Don't overlook them.




Automate the Hardware

  Twelve hours of manual work in the datacenter is
   not fun.
  Make sure all server firmware is configured
   identically.
      –HP SmartStart Scripting Toolkit
      –Dell OpenManage Deployment Toolkit
      –IBM ServerGuide Scripting Toolkit




Rolling upgrades are possible

               (Just not of the Hadoop software.)

   Datanodes can be decommissioned, patched, and
       added back into the cluster without service
                      downtime.
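
Decommissioning is usually started by adding the datanode to the file named by the `dfs.hosts.exclude` property and then asking the NameNode to re-read it. The Python sketch below covers only the file edit; the helper name `add_to_excludes` is hypothetical, and the refresh command shown is the Hadoop 1.x / CDH3 form.

```python
from pathlib import Path

def add_to_excludes(excludes_file: str, hostname: str) -> bool:
    """Append hostname to the HDFS excludes file unless it is already listed.

    Returns True if the file was changed.
    """
    path = Path(excludes_file)
    text = path.read_text() if path.exists() else ""
    if hostname in text.split():
        return False
    # Make sure the new entry starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(separator + hostname + "\n")
    return True

# After the file changes, the NameNode has to be told to reload it.
# On Hadoop 1.x / CDH3 this is run as the HDFS superuser:
REFRESH_COMMAND = ["hadoop", "dfsadmin", "-refreshNodes"]
```

Once the node reports as decommissioned it can be patched, removed from the excludes file, and brought back with a second refresh.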




The smallest thing can have a big impact on the cluster


  Bad NIC/switchport can cause cluster slowness.

  Slow disks can cause intermittent job slowdowns.




HDFS blocks are weird

  On ext3/ext4:
      –Small blocks are not padded to the HDFS block size;
       they occupy only the actual size of the data.
      –Each HDFS block is actually two files on the
       datanode's filesystem:
          •The actual data and
          •A metadata/checksum file

 # ls -l blk_1058778885645824207*
 -rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
 -rw-r--r-- 1 hdfs hdfs   283 May 14 01:26 blk_1058778885645824207_19155994.meta
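
The size of the .meta file in the listing above can be worked out from the size of the data file. The Python sketch below assumes the usual defaults of that Hadoop generation: one 4-byte CRC32 per 512 bytes of data (`io.bytes.per.checksum`) and a 7-byte header (2-byte version, 1-byte checksum type, 4-byte bytes-per-checksum). The function name is hypothetical.

```python
def expected_meta_size(data_bytes: int,
                       bytes_per_checksum: int = 512,
                       checksum_bytes: int = 4,
                       header_bytes: int = 7) -> int:
    """Estimate the size of a block's .meta file from the block's data size."""
    # Ceiling division: a partial final chunk still gets its own checksum.
    chunks = -(-data_bytes // bytes_per_checksum)
    return header_bytes + chunks * checksum_bytes

# The 35094-byte block from the listing needs 69 checksums:
# 7 + 69 * 4 = 283 bytes, which matches the .meta file shown.
print(expected_meta_size(35094))
```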



Do not prematurely optimize

  Be careful tuning your datanode filesystems.
      • mkfs -t ext4 -T largefile4 ... (probably bad)
      • mkfs -t ext4 -i 131072 -m 0 ... (better)

 /etc/mke2fs.conf
 [fs_types]
  hadoop = {
         features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
         inode_ratio = 131072
         blocksize = -1
         reserved_ratio = 0
         default_mntopts = acl,user_xattr
  }
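
Why `-T largefile4` is "probably bad" comes down to inode arithmetic. mke2fs creates roughly one inode per `inode_ratio` bytes, largefile4 sets that ratio to 4 MiB, and every HDFS block consumes two inodes. The Python sketch below is an approximation under those assumptions: it ignores directory inodes and mke2fs's per-block-group rounding, and the function names are hypothetical.

```python
TIB = 2 ** 40
LARGEFILE4_RATIO = 4 * 2 ** 20  # inode_ratio selected by "-T largefile4"
SLIDE_RATIO = 131072            # inode_ratio from the "hadoop" fs_type above

def approx_inodes(filesystem_bytes: int, inode_ratio: int) -> int:
    """Roughly one inode is created per inode_ratio bytes of filesystem."""
    return filesystem_bytes // inode_ratio

def approx_max_hdfs_blocks(filesystem_bytes: int, inode_ratio: int) -> int:
    """Each HDFS block needs two inodes: the data file and its .meta file."""
    return approx_inodes(filesystem_bytes, inode_ratio) // 2

def min_average_block_bytes(inode_ratio: int) -> int:
    """Average block size below which inodes run out before disk space does."""
    return 2 * inode_ratio
```

On a 2 TiB disk this gives about 262,144 HDFS blocks with largefile4 against about 8.4 million with a ratio of 131072. Put another way, with largefile4 the disk runs out of inodes before space whenever the average block file is smaller than 8 MiB; with 131072 the threshold drops to 256 KiB.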

Use DNS-friendly names for services

       hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
         mapred.delta.hadoop.apollogrp.edu:8021
      http://oozie.delta.hadoop.apollogrp.edu:11000/
      hiveserver.delta.hadoop.apollogrp.edu:10000



   Yes, the names are long, but I bet you can figure out how to
                    connect to Bravo Cluster.
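
The naming scheme is regular enough to generate. In the Python sketch below the ports and the `delta` names are taken from the slide; the helper `service_endpoints` is hypothetical, and the names it produces for any other cluster (such as Bravo) are inferred from the pattern, not confirmed by the slides.

```python
DOMAIN = "hadoop.apollogrp.edu"

def service_endpoints(cluster: str) -> dict:
    """Build the per-service endpoints for a cluster from its short name."""
    base = f"{cluster.lower()}.{DOMAIN}"
    return {
        "hdfs": f"hdfs://hdfs.{base}:8020/",
        "mapred": f"mapred.{base}:8021",
        "oozie": f"http://oozie.{base}:11000/",
        "hiveserver": f"hiveserver.{base}:10000",
    }
```

Because clients refer to a service name and not to a host, a service can move to different hardware with only a DNS change.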




Use a parallel, remote execution tool

  pdsh/Cluster SSH/mussh/etc

                 SSH in a for loop is so 2010

  FUNC/MCollective
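
To show what these tools add over a serial loop, here is a minimal Python sketch of fanning one command out to many hosts at once. It is not a substitute for pdsh or MCollective; `run_everywhere` and `ssh_runner` are hypothetical names, and it assumes key-based SSH logins are already in place.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_runner(host: str, command: str) -> str:
    """Run command on host over SSH and return its standard output."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout

def run_everywhere(hosts, command, runner=ssh_runner, max_workers=16):
    """Run command on every host concurrently and return {host: output}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(runner, host, command) for host in hosts}
        return {host: future.result() for host, future in futures.items()}
```

The `runner` parameter exists so the fan-out logic can be exercised with a stand-in function without touching a real host.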




Make your log directories as large as you can.

  20-100GB /var/log
      –Implement log purging cronjobs or your log
       directories will fill up.


  Beware: M/R jobs can fill up /tmp as well.
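
A purge job can be as small as the Python sketch below, run from cron. The function name and the age threshold are assumptions; pick a retention period that suits your own debugging needs, and test it on a scratch directory first because it deletes files.

```python
import os
import time

def purge_old_logs(directory, max_age_days, now=None):
    """Delete regular files under directory older than max_age_days.

    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks so the purge never follows a link out of the tree.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```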




Insist on IPMI 2.0 for out-of-band management of server hardware.

  Serial Over LAN is awesome when booting a
   system.
  Standardized hardware/temperature monitoring.
  Simple remote power control.
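
Each of the three points above maps onto an `ipmitool` invocation. The Python sketch below only builds the argument lists; the helper names, BMC hostname, and user are hypothetical, while the `ipmitool` options used (`-I lanplus`, `-H`, `-U`, `-E`) and subcommands are standard ones.

```python
def ipmi_command(bmc_host: str, user: str, *action: str) -> list:
    """Build an ipmitool argument list for an IPMI 2.0 (lanplus) session."""
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment
    # variable, which keeps it out of the process list.
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", *action]

def serial_console(bmc_host: str, user: str) -> list:
    """Serial Over LAN: attach to the server's serial console."""
    return ipmi_command(bmc_host, user, "sol", "activate")

def temperature_sensors(bmc_host: str, user: str) -> list:
    """List the temperature sensors the BMC exposes."""
    return ipmi_command(bmc_host, user, "sdr", "type", "Temperature")

def power(bmc_host: str, user: str, state: str) -> list:
    """Query or change chassis power; state is one of status/on/off/cycle."""
    if state not in ("status", "on", "off", "cycle"):
        raise ValueError(f"unsupported power state: {state}")
    return ipmi_command(bmc_host, user, "chassis", "power", state)
```

The resulting lists can be handed to `subprocess.run` as they are.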




Spanning-tree is the devil

  Enable portfast on your server switch ports or the
   BMCs may never get a DHCP lease.




Apollo has rebuilt its cluster four times.

               You may end up doing so as well.




Apollo Timeline

  First build
  Cloudera Professional Services helped install CDH
  Four nodes
  OS built manually via USB CD-ROM.
  CDH2




Apollo Timeline

  Second build
  Cobbler
  All software deployment is via kickstart. Very little
   is in puppet. Config files are deployed via wget.
  CDH2




Apollo Timeline

  Third build
  OS filesystem partitioning needed to change.
  Most software deployment still via kickstart.
  CDH3b2




Apollo Timeline

  Fourth build
  HDFS filesystem inodes needed to be increased.
  Full puppet automation.
  Added redundant/hotswap enterprise hardware for
   headnodes.
  CDH3u1




Cluster failures at Apollo

  Hardware
      –disk failures (40+)
      –disk cabling (6)
      –RAM (2)
      –switch port (1)
  Software
      –Cluster
          •NFS (NN -> 2NN metadata)
      –Job
          •TT java heap
          •Running out of /tmp or /var/log/hadoop
          •Running out of HDFS space
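
The local "running out of space" failures are cheap to catch early. The Python sketch below flags any monitored path whose free space has dropped under a threshold; the function name and the 10% default are assumptions, and it covers local filesystems only, not HDFS capacity.

```python
import shutil

def low_space_paths(paths, min_free_fraction=0.10):
    """Return the paths whose filesystem has less free space than the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            continue  # pseudo-filesystems can report a size of zero
        if usage.free / usage.total < min_free_fraction:
            low.append(path)
    return low
```

Called with something like `["/tmp", "/var/log/hadoop"]` from an existing monitoring system, it turns a failed job into an alert that arrives beforehand.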

Know your workload

  You can spend all the time in the world trying to get
   the best CPU/RAM/HDD/switch/cabinet
   configuration, but you are running on pure luck
   until you understand your cluster's workload.








  Questions?





Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)

  • 1. APOLLO GROUP Hadoop Operations: Starting Out Small So Your Cluster Isn't Yahoo-sized (yet) Michael Arnold Principal Systems Engineer 14 June 2012
  • 2. Agenda Who What (Definitions) Decisions for Now Decisions for Later Lessons Learned APOLLO GROUP © 2012 Apollo Group 2
  • 3. Who
  • 4. Who is Apollo? Apollo Group is a leading provider of higher education programs for working adults.
  • 5. Who is Michael Arnold? Systems Administrator. Automation geek. 13 years in IT. I deal with:
      –Server hardware specification/configuration
      –Server firmware
      –Server operating system
      –Hadoop application health
      –Monitoring all the above
  • 6. What: Definitions
  • 7. Definitions. Q: What is a tiny/small/medium/large cluster? A:
      –Tiny: 1-9
      –Small: 10-99
      –Medium: 100-999
      –Large: 1000+
      –Yahoo-sized: 4000
  • 8. Definitions. Q: What is a “headnode”? A: A server that runs one or more of the following Hadoop processes:
      –NameNode
      –JobTracker
      –Secondary NameNode
      –ZooKeeper
      –HBase Master
  • 9. What decisions should you make now and which can you postpone for later? Decisions for Now
  • 10. Which Hadoop distribution? Amazon, Apache, Cloudera, Greenplum, Hortonworks, IBM, MapR, Platform Computing
  • 11. Should you virtualize? Can be OK for small clusters BUT
      –virtualization adds overhead
      –can cause performance degradation
      –cannot take advantage of Hadoop rack locality
    Virtualization can be good for:
      –functional testing of M/R job or workflow changes
      –evaluation of Hadoop upgrades
  • 12. What sort of hardware should you be considering? Inexpensive. Not “enterprisey” hardware:
      –No RAID*
      –No redundant power*
    Low power consumption. No optical drives:
      –get systems that can boot off the network
    * except in headnodes
  • 13. Plan for capacity expansion. Start at the bottom and work your way up. Leave room in your cabinets for more machines.
  • 14. Plan for capacity expansion (cont.) Deploy your initial cluster in two cabinets:
      –One headnode, one switch, and several (five) datanodes per cabinet
  • 15. Plan for capacity expansion (cont.) Install a second cluster in the empty space in the upper half of the cabinet.
  • 16. What decisions should you make now and which can you postpone for later? Decisions for Later
  • 17. What size cluster? Depends upon your: budget, data size, workload characteristics, SLA.
  • 18. What size cluster? (cont.) Are your MapReduce jobs compute-intensive? Reading lots of data? See http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
  • 19. Should you implement rack awareness? If more than one switch in the cluster: YES
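
Rack awareness is usually enabled by pointing Hadoop at a topology script (the `topology.script.file.name` property in Hadoop 1.x, `net.topology.script.file.name` in later releases). Hadoop passes one or more addresses as arguments and reads back one rack path per argument. The following is a minimal sketch of such a script; the subnets and rack names are hypothetical and would need to match your own cabinets and switches.

```shell
#!/bin/sh
# Hypothetical mapping: one subnet per cabinet/switch. Adjust to your network.
rack_for() {
  case "$1" in
    10.1.1.*) echo "/cabinet1" ;;
    10.1.2.*) echo "/cabinet2" ;;
    *)        echo "/default-rack" ;;   # fallback for unknown addresses
  esac
}

# Hadoop may pass several addresses in a single call;
# print exactly one rack path per argument, in order.
resolve_racks() {
  for node in "$@"; do
    rack_for "$node"
  done
}

resolve_racks "$@"
```

Keeping the mapping in a plain `case` statement means the script has no external dependencies, which matters because it runs on the headnode every time a node registers.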
  • 20. Should you use automation? If not in the beginning, then as soon as possible. Boot disks will fail. Automated OS and application installs:
      –Save time
      –Reduce errors
          •Cobbler/Spacewalk/Foreman/xCat/etc
          •Puppet/Chef/Cfengine/shell scripts/etc
  • 21. Lessons Learned
  • 22. Keep It Simple. Don't add redundancy and features (server/network) that will make things more complicated and expensive. Hadoop has built-in redundancies. Don't overlook them.
  • 23. Automate the Hardware. Twelve hours of manual work in the datacenter is not fun. Make sure all server firmware is configured identically.
      –HP SmartStart Scripting Toolkit
      –Dell OpenManage Deployment Toolkit
      –IBM ServerGuide Scripting Toolkit
  • 24. Rolling upgrades are possible (just not of the Hadoop software). Datanodes can be decommissioned, patched, and added back into the cluster without service downtime.
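
One way to script the decommission/re-add cycle is to manage the file named by `dfs.hosts.exclude` and then ask the NameNode to re-read it with `hadoop dfsadmin -refreshNodes`. The sketch below assumes an exclude file path (`/etc/hadoop/conf/dfs.exclude`) that is purely illustrative; use whatever your `hdfs-site.xml` points at.

```shell
#!/bin/sh
# Assumed path: replace with the file your dfs.hosts.exclude property names.
EXCLUDE_FILE="${EXCLUDE_FILE:-/etc/hadoop/conf/dfs.exclude}"

# Add a host to the exclude file, but only if it is not already listed.
exclude_node() {
  grep -qxF "$1" "$EXCLUDE_FILE" 2>/dev/null || echo "$1" >> "$EXCLUDE_FILE"
}

# Remove a host from the exclude file once it has been patched.
include_node() {
  # grep -v exits non-zero when nothing is left to print; that is fine here.
  grep -vxF "$1" "$EXCLUDE_FILE" > "$EXCLUDE_FILE.tmp" || true
  mv "$EXCLUDE_FILE.tmp" "$EXCLUDE_FILE"
}

# Tell the NameNode to re-read its include/exclude lists.
refresh_nodes() {
  hadoop dfsadmin -refreshNodes
}
```

A typical cycle would be `exclude_node <host>`, `refresh_nodes`, wait until `hadoop dfsadmin -report` shows the node as decommissioned, patch it, then `include_node <host>` and `refresh_nodes` again.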
  • 25. The smallest thing can have a big impact on the cluster. Bad NIC/switchport can cause cluster slowness. Slow disks can cause intermittent job slowdowns.
  • 26. HDFS blocks are weird. On ext3/ext4:
      –Small blocks are not padded to the HDFS block size; they occupy only the actual size of the data.
      –Each HDFS block is actually two files on the datanode's filesystem:
          •The actual data and
          •A metadata/checksum file
    # ls -l blk_1058778885645824207*
    -rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
    -rw-r--r-- 1 hdfs hdfs   283 May 14 01:26 blk_1058778885645824207_19155994.meta
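
The size of the `.meta` file in the listing can be reproduced with a little arithmetic. The sketch below assumes the checksum file consists of a 7-byte header followed by one 4-byte CRC32 per 512 bytes of data (512 being the default `io.bytes.per.checksum`); those format details are stated here as assumptions rather than taken from the slides.

```shell
#!/bin/sh
# Estimate a block's .meta file size from the size of its data file.
# Assumptions: 7-byte header, 4-byte CRC32 per 512-byte chunk of data.
meta_size() {
  data_bytes=$1
  chunk=512
  # Round up: a partial final chunk still gets its own checksum.
  chunks=$(( (data_bytes + chunk - 1) / chunk ))
  echo $(( 7 + 4 * chunks ))
}

meta_size 35094   # data file size from the listing; prints 283
```

Under these assumptions the 35094-byte block needs 69 checksums (276 bytes) plus the header, which matches the 283 bytes shown by `ls`.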
  • 27. Do not prematurely optimize. Be careful tuning your datanode filesystems.
      •mkfs -t ext4 -T largefile4 ... (probably bad)
      •mkfs -t ext4 -i 131072 -m 0 ... (better)
    /etc/mke2fs.conf
    [fs_types]
        hadoop = {
            features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
            inode_ratio = 131072
            blocksize = -1
            reserved_ratio = 0
            default_mntopts = acl,user_xattr
        }
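
A rough calculation suggests why `largefile4` is risky. `mke2fs` creates about one inode per `inode_ratio` bytes of filesystem, and the stock `largefile4` type uses a ratio of 4194304 (4 MiB). Since every HDFS block costs two inodes (data file plus `.meta` file), a disk holding many small blocks can run out of inodes long before it runs out of space. The sketch below compares the two ratios for a hypothetical 2 TiB data disk; real counts differ slightly because `mke2fs` rounds per block group.

```shell
#!/bin/sh
# Approximate inode count for a filesystem of a given size and inode ratio.
inode_count() {
  fs_bytes=$1
  bytes_per_inode=$2
  echo $(( fs_bytes / bytes_per_inode ))
}

# Hypothetical 2 TiB data disk (illustrative size, not from the slides).
disk=$(( 2 * 1024 * 1024 * 1024 * 1024 ))

inode_count "$disk" 4194304   # -T largefile4: prints 524288
inode_count "$disk" 131072    # -i 131072:     prints 16777216
```

At two inodes per block, the `largefile4` layout in this example tops out at roughly 262,000 HDFS blocks per disk, while the 131072 ratio allows about 8.4 million.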
  • 28. Use DNS-friendly names for services:
      hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
      mapred.delta.hadoop.apollogrp.edu:8021
      http://oozie.delta.hadoop.apollogrp.edu:11000/
      hiveserver.delta.hadoop.apollogrp.edu:10000
    Yes, the names are long, but I bet you can figure out how to connect to Bravo Cluster.
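
The payoff of this naming scheme is that endpoints can be derived rather than looked up. As a small sketch, the helper functions below (hypothetical names, not part of any Hadoop tooling) build the hostnames from a service and a cluster name, following the `<service>.<cluster>.hadoop.apollogrp.edu` pattern visible in the examples.

```shell
#!/bin/sh
# Build a service hostname from the pattern <service>.<cluster>.hadoop.apollogrp.edu
service_host() {
  echo "$1.$2.hadoop.apollogrp.edu"
}

# HDFS URI for a given cluster, using port 8020 as in the slide's example.
hdfs_uri() {
  echo "hdfs://$(service_host hdfs "$1"):8020/"
}

hdfs_uri bravo   # prints hdfs://hdfs.bravo.hadoop.apollogrp.edu:8020/
```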
  • 29. Use a parallel, remote execution tool: pdsh/Cluster SSH/mussh/etc. SSH in a for loop is so 2010. FUNC/MCollective.
  • 30. Make your log directories as large as you can. 20-100GB /var/log
      –Implement log purging cronjobs or your log directories will fill up.
    Beware: M/R jobs can fill up /tmp as well.
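
A log purging job can be as small as one `find` invocation. The sketch below assumes GNU or BSD `find` (the `-delete` action is not in POSIX); the retention period and the cron schedule in the comment are illustrative values, not recommendations from the talk.

```shell
#!/bin/sh
# Delete regular files under a log directory that have not been
# modified for more than the given number of days.
purge_logs() {
  log_dir=$1
  keep_days=$2
  find "$log_dir" -type f -mtime +"$keep_days" -delete
}

# Example cron entry (hypothetical script path and schedule):
#   0 3 * * * /usr/local/sbin/purge-hadoop-logs.sh
# where the script sources this function and calls, for instance:
#   purge_logs /var/log/hadoop 14
```

The same function can be pointed at `/tmp` subdirectories used by M/R jobs, though it is worth checking first that no running job still needs the files.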
  • 31. Insist on IPMI 2.0 for out-of-band management of server hardware. Serial Over LAN is awesome when booting a system. Standardized hardware/temperature monitoring. Simple remote power control.
  • 32. Spanning-tree is the devil. Enable portfast on your server switch ports or the BMCs may never get a DHCP lease.
  • 33. Apollo has rebuilt its cluster four times. You may end up doing so as well.
  • 34. Apollo Timeline: First build. Cloudera Professional Services helped install CDH. Four nodes. OS built manually via USB CD-ROM. CDH2.
  • 35. Apollo Timeline: Second build. Cobbler. All software deployment is via kickstart. Very little is in puppet. Config files are deployed via wget. CDH2.
  • 36. Apollo Timeline: Third build. OS filesystem partitioning needed to change. Most software deployment still via kickstart. CDH3b2.
  • 37. Apollo Timeline: Fourth build. HDFS filesystem inodes needed to be increased. Full puppet automation. Added redundant/hotswap enterprise hardware for headnodes. CDH3u1.
  • 38. Cluster failures at Apollo
    Hardware:
      –disk failures (40+)
      –disk cabling (6)
      –RAM (2)
      –switch port (1)
    Software:
      –Cluster
          •NFS (NN -> 2NN metadata)
      –Job
          •TT java heap
          •Running out of /tmp or /var/log/hadoop
          •Running out of HDFS space
  • 39. Know your workload. You can spend all the time in the world trying to get the best CPU/RAM/HDD/switch/cabinet configuration, but you are running on pure luck until you understand your cluster's workload.
  • 40. Questions?