SlideShare a Scribd company logo
1 of 40
Download to read offline
APOLLO GROUP




Hadoop Operations: Starting Out Small
So Your Cluster Isn't Yahoo-sized (yet)
Michael Arnold
Principal Systems Engineer
14 June 2012
Agenda

  Who
  What (Definitions)
  Decisions for Now
  Decisions for Later
  Lessons Learned




APOLLO GROUP             © 2012 Apollo Group        2
APOLLO GROUP




  Who




APOLLO GROUP Apollo Group
          © 2012            3
Who is Apollo?

        Apollo Group is a leading provider of higher
          education programs for working adults.




APOLLO GROUP              © 2012 Apollo Group                4
Who is Michael Arnold?

  Systems Administrator
  Automation geek
  13 years in IT
  I deal with:
      –Server hardware specification/configuration
      –Server firmware
      –Server operating system
      –Hadoop application health
      –Monitoring all the above


APOLLO GROUP              © 2012 Apollo Group        5
APOLLO GROUP




  What
  Definitions




APOLLO GROUP Apollo Group
          © 2012            6
Definitions

  Q: What is a tiny/small/medium/large cluster?
  A:
      –Tiny:          1-9
      –Small:         10-99
      –Medium:        100-999
      –Large:         1000+
      –Yahoo-sized:   4000




APOLLO GROUP              © 2012 Apollo Group             7
Definitions

  Q: What is a “headnode”?
  A: A server that runs one or more of the following
   Hadoop processes:
      –NameNode
      –JobTracker
      –Secondary NameNode
      –ZooKeeper
      –HBase Master




APOLLO GROUP            © 2012 Apollo Group             8
APOLLO GROUP




  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Now



APOLLO GROUP Apollo Group
          © 2012              9
Which Hadoop distribution?

  Amazon
  Apache
  Cloudera
  Greenplum
  Hortonworks
  IBM
  MapR
  Platform Computing



APOLLO GROUP            © 2012 Apollo Group   10
Should you virtualize?

  Can be OK for small clusters BUT
      –virtualization adds overhead
      –can cause performance degradation
      –cannot take advantage of Hadoop rack locality
  Virtualization can be good for:
      –functional testing of M/R job or workflow changes
      –evaluation of Hadoop upgrades




APOLLO GROUP              © 2012 Apollo Group              11
What sort of hardware should you be
                                      considering?

  Inexpensive
  Not “enterprisey” hardware
     –No RAID*
     –No Redundant power*
  Low power consumption
  No optical drives
     –get systems that can boot off the network



                                              * except in headnodes

APOLLO GROUP            © 2012 Apollo Group                       12
Plan for capacity expansion

  Start at the bottom and
   work your way up
  Leave room in your
   cabinets for more
   machines




APOLLO GROUP            © 2012 Apollo Group    13
Plan for capacity expansion (cont.)

  Deploy your initial
   cluster in two cabinets
     –One headnode, one
      switch, and several
      (five) datanodes per
      cabinet




APOLLO GROUP            © 2012 Apollo Group      14
Plan for capacity expansion (cont.)

  Install a second cluster
   in the empty space in
   the upper half of the
   cabinet




APOLLO GROUP             © 2012 Apollo Group     15
APOLLO GROUP




  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Later



APOLLO GROUP Apollo Group
          © 2012              16
What size cluster?

  Depends upon your:
  Budget
  Data size
  Workload characteristics
  SLA




APOLLO GROUP           © 2012 Apollo Group                    17
What size cluster? (cont.)

  Are your MapReduce jobs:
  compute-intensive?
  reading lots of data?

  http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/




APOLLO GROUP                   © 2012 Apollo Group                      18
Should you implement rack awareness?


        If more than one switch in the cluster:

                           YES




APOLLO GROUP            © 2012 Apollo Group       19
Should you use automation?

       If not in the beginning, then as soon as
                        possible.

  Boot disks will fail.
  Automated OS and application installs:
      –Save time
      –Reduce errors
          •Cobbler/Spacewalk/Foreman/xCat/etc
          •Puppet/Chef/Cfengine/shell scripts/etc

APOLLO GROUP              © 2012 Apollo Group       20
APOLLO GROUP




  Lessons Learned




APOLLO GROUP Apollo Group
          © 2012            21
Keep It Simple

            Don't add redundancy and features
         (server/network) that will make things more
                 complicated and expensive.

               Hadoop has built-in redundancies.

                     Don't overlook them.




APOLLO GROUP                © 2012 Apollo Group                22
Automate the Hardware

  Twelve hours of manual work in the datacenter is
   not fun.
  Make sure all server firmware is configured
   identically.
      –HP SmartStart Scripting Toolkit
      –Dell OpenManage Deployment Toolkit
      –IBM ServerGuide Scripting Toolkit




APOLLO GROUP            © 2012 Apollo Group           23
Rolling upgrades are possible

               (Just not of the Hadoop software.)

   Datanodes can be decommissioned, patched, and
       added back into the cluster without service
                      downtime.




APOLLO GROUP                © 2012 Apollo Group      24
The smallest thing can have a big impact on the
                                             cluster


  Bad NIC/switchport can cause cluster slowness.

  Slow disks can cause intermittent job slowdowns.




APOLLO GROUP           © 2012 Apollo Group            25
HDFS blocks are weird

  On ext3/ext4:
      –Small blocks are not padded to the HDFS block-
       size, but rather the actual size of the data.
      –Each HDFS block is actually two files on the
       datanode's filesystem:
          •The actual data and
          •A metadata/checksum file

 # ls -l blk_1058778885645824207*
 -rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
 -rw-r--r-- 1 hdfs hdfs   283 May 14 01:26 blk_1058778885645824207_19155994.meta



APOLLO GROUP                        © 2012 Apollo Group                        26
Do not prematurely optimize

  Be careful tuning your datanode filesystems.
      • mkfs -t ext4 -T largefile4 ... (probably bad)
      • mkfs -t ext4 -i 131072 -m 0 ... (better)

 /etc/mke2fs.conf
 [fs_types]
  hadoop = {
         features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,
  extra_isize
         inode_ratio = 131072
         blocksize = -1
         reserved_ratio = 0
         default_mntopts = acl,user_xattr
  }

APOLLO GROUP                       © 2012 Apollo Group                          27
Use DNS-friendly names for services

       hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
         mapred.delta.hadoop.apollogrp.edu:8021
      http://oozie.delta.hadoop.apollogrp.edu:11000/
      hiveserver.delta.hadoop.apollogrp.edu:10000



   Yes, the names are long, but I bet you can figure out how to
                    connect to Bravo Cluster.




APOLLO GROUP                © 2012 Apollo Group                   29
Use a parallel, remote execution tool

  pdsh/Cluster SSH/mussh/etc

                 SSH in a for loop is so 2010

  FUNC/MCollective




APOLLO GROUP               © 2012 Apollo Group     30
Make your log directories as large as you can.

  20-100GB /var/log
      –Implement log purging cronjobs or your log
       directories will fill up.


  Beware: M/R jobs can fill up /tmp as well.




APOLLO GROUP              © 2012 Apollo Group        31
Insist on IPMI 2.0 for out of band management of
                                     server hardware.

  Serial Over LAN is awesome when booting a
   system.
  Standardized hardware/temperature monitoring.
  Simple remote power control.




APOLLO GROUP            © 2012 Apollo Group         33
Spanning-tree is the devil

  Enable portfast on your server switch ports or the
   BMCs may never get a DHCP lease.




APOLLO GROUP            © 2012 Apollo Group            34
Apollo has re-built it's cluster four times.

               You may end up doing so as well.




APOLLO GROUP               © 2012 Apollo Group     35
Apollo Timeline

  First build
  Cloudera Professional Services helped install CDH
  Four nodes
  Manually build OS via USB CDROM.
  CDH2




APOLLO GROUP           © 2012 Apollo Group                 36
Apollo Timeline

  Second build
  Cobbler
  All software deployment is via kickstart. Very little
   is in puppet. Config files are deployed via wget.
  CDH2




APOLLO GROUP              © 2012 Apollo Group                 37
Apollo Timeline

  Third build
  OS filesystem partitioning needed to change.
  Most software deployment still via kickstart.
  CDH3b2




APOLLO GROUP            © 2012 Apollo Group                 38
Apollo Timeline

  Fourth build
  HDFS filesystem inodes needed to be increased.
  Full puppet automation.
  Added redundant/hotswap enterprise hardware for
   headnodes.
  CDH3u1




APOLLO GROUP          © 2012 Apollo Group                 39
Cluster failures at Apollo

  Hardware
      –disk failures (40+)
      –disk cabling (6)
      –RAM (2)
      –switch port (1)
  Software
      –Cluster
          •NFS (NN -> 2NN metadata)
      –Job
          •TT java heap
          •Running out of /tmp or /var/log/hadoop
          •Running out of HDFS space

APOLLO GROUP                  © 2012 Apollo Group        40
Know your workload

  You can spend all the time in the world trying to get
   the best CPU/RAM/HDD/switch/cabinet
   configuration, but you are running on pure luck
   until you understand your cluster's workload.




APOLLO GROUP             © 2012 Apollo Group               41
APOLLO GROUP




  Questions?




APOLLO GROUP Apollo Group
          © 2012            42

More Related Content

What's hot

Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopDataWorks Summit
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0enissoz
 
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storyApache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storySunil Govindan
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsDataWorks Summit
 
Running a container cloud on YARN
Running a container cloud on YARNRunning a container cloud on YARN
Running a container cloud on YARNDataWorks Summit
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoBig Data Joe™ Rossi
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureDataWorks Summit
 
CBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSCBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSDataWorks Summit
 
Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Manish Chopra
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopMike Pittaro
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityDinesh Chitlangia
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopCloudera, Inc.
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valleymarkgrover
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges DataWorks Summit
 

What's hot (20)

Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Cross-DC Fault-Tolerant ViewFileSystem @ Twitter
Cross-DC Fault-Tolerant ViewFileSystem @ TwitterCross-DC Fault-Tolerant ViewFileSystem @ Twitter
Cross-DC Fault-Tolerant ViewFileSystem @ Twitter
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing Hadoop
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
 
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storyApache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration story
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
 
Running a container cloud on YARN
Running a container cloud on YARNRunning a container cloud on YARN
Running a container cloud on YARN
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and Future
 
CBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSCBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFS
 
Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for Hadoop
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalability
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache Hadoop
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges
 

Viewers also liked

Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
 
Tutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reduceTutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reducemudassar mulla
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopVigen Sahakyan
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlKhanderao Kand
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.Natalino Busa
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreCloudera, Inc.
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonJoe Stein
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013Nathan Bijnens
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
Apache hadoop pig overview and introduction
Apache hadoop pig overview and introductionApache hadoop pig overview and introduction
Apache hadoop pig overview and introductionBigClasses Com
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Edureka!
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Edureka!
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindEMC
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 

Viewers also liked (20)

Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 
HBase introduction talk
HBase introduction talkHBase introduction talk
HBase introduction talk
 
Tutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reduceTutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reduce
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosql
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Apache hadoop pig overview and introduction
Apache hadoop pig overview and introductionApache hadoop pig overview and introduction
Apache hadoop pig overview and introduction
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 

Similar to Hadoop operations

Automated Configuration of Firmware
Automated Configuration of FirmwareAutomated Configuration of Firmware
Automated Configuration of FirmwareMichael Arnold
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonDataWorks Summit/Hadoop Summit
 
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics WorkbenchPivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics WorkbenchEMC
 
Nagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPi
Nagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPiNagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPi
Nagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPiNagios
 
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...Markus Michalewicz
 
Building Your Own Drupal Distribution
Building Your Own Drupal DistributionBuilding Your Own Drupal Distribution
Building Your Own Drupal DistributionAniket Maithani
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...Alluxio, Inc.
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Platform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle WorldPlatform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle WorldSimon Haslam
 
GlassFish in Production Environments
GlassFish in Production EnvironmentsGlassFish in Production Environments
GlassFish in Production EnvironmentsBruno Borges
 
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?  Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You? EMC
 
What's new in MySQL 5.6
What's new in MySQL 5.6What's new in MySQL 5.6
What's new in MySQL 5.6Shlomi Noach
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confSujee Maniyam
 
Harnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop SeriesHarnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop SeriesCloudera, Inc.
 
Intalio create and cloudfoudry - short
Intalio create and cloudfoudry - shortIntalio create and cloudfoudry - short
Intalio create and cloudfoudry - shorthmalphettes
 
Enterprise software needs a PaaS
Enterprise software needs a PaaSEnterprise software needs a PaaS
Enterprise software needs a PaaShmalphettes
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Docker 101 - An introduction to docker
Docker 101 - An introduction to dockerDocker 101 - An introduction to docker
Docker 101 - An introduction to dockerRichard Banks
 

Similar to Hadoop operations (20)

Automated Configuration of Firmware
Automated Configuration of FirmwareAutomated Configuration of Firmware
Automated Configuration of Firmware
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
 
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics WorkbenchPivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
 
Nagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPi
Nagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPiNagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPi
Nagios Conference 2012 - Dave Williams - Embedding Nagios using RaspberyPi
 
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
 
EMC config Hadoop
EMC config HadoopEMC config Hadoop
EMC config Hadoop
 
Building Your Own Drupal Distribution
Building Your Own Drupal DistributionBuilding Your Own Drupal Distribution
Building Your Own Drupal Distribution
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Platform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle WorldPlatform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle World
 
GlassFish in Production Environments
GlassFish in Production EnvironmentsGlassFish in Production Environments
GlassFish in Production Environments
 
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?  Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
 
What's new in MySQL 5.6
What's new in MySQL 5.6What's new in MySQL 5.6
What's new in MySQL 5.6
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
 
Harnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop SeriesHarnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop Series
 
Intalio create and cloudfoudry - short
Intalio create and cloudfoudry - shortIntalio create and cloudfoudry - short
Intalio create and cloudfoudry - short
 
Enterprise software needs a PaaS
Enterprise software needs a PaaSEnterprise software needs a PaaS
Enterprise software needs a PaaS
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Docker 101 - An introduction to docker
Docker 101 - An introduction to dockerDocker 101 - An introduction to docker
Docker 101 - An introduction to docker
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Recently uploaded (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Hadoop operations

  • 1. APOLLO GROUP Hadoop Operations: Starting Out Small So Your Cluster Isn't Yahoo-sized (yet) Michael Arnold Principal Systems Engineer 14 June 2012
  • 2. Agenda Who What (Definitions) Decisions for Now Decisions for Later Lessons Learned APOLLO GROUP © 2012 Apollo Group 2
  • 3. APOLLO GROUP Who APOLLO GROUP Apollo Group © 2012 3
  • 4. Who is Apollo? Apollo Group is a leading provider of higher education programs for working adults. APOLLO GROUP © 2012 Apollo Group 4
  • 5. Who is Michael Arnold? Systems Administrator Automation geek 13 years in IT I deal with: –Server hardware specification/configuration –Server firmware –Server operating system –Hadoop application health –Monitoring all the above APOLLO GROUP © 2012 Apollo Group 5
  • 6. APOLLO GROUP What Definitions APOLLO GROUP Apollo Group © 2012 6
  • 7. Definitions Q: What is a tiny/small/medium/large cluster? A: –Tiny: 1-9 –Small: 10-99 –Medium: 100-999 –Large: 1000+ –Yahoo-sized: 4000 APOLLO GROUP © 2012 Apollo Group 7
  • 8. Definitions Q: What is a “headnode”? A: A server that runs one or more of the following Hadoop processes: –NameNode –JobTracker –Secondary NameNode –ZooKeeper –HBase Master APOLLO GROUP © 2012 Apollo Group 8
  • 9. APOLLO GROUP What decisions should you make now and which can you postpone for later? Decisions for Now APOLLO GROUP Apollo Group © 2012 9
  • 10. Which Hadoop distribution? Amazon Apache Cloudera Greenplum Hortonworks IBM MapR Platform Computing APOLLO GROUP © 2012 Apollo Group 10
  • 11. Should you virtualize? Can be OK for small clusters BUT –virtualization adds overhead –can cause performance degradation –cannot take advantage of Hadoop rack locality Virtualization can be good for: –functional testing of M/R job or workflow changes –evaluation of Hadoop upgrades APOLLO GROUP © 2012 Apollo Group 11
  • 12. What sort of hardware should you be considering? Inexpensive Not “enterprisey” hardware –No RAID* –No Redundant power* Low power consumption No optical drives –get systems that can boot off the network * except in headnodes APOLLO GROUP © 2012 Apollo Group 12
  • 13. Plan for capacity expansion Start at the bottom and work your way up Leave room in your cabinets for more machines APOLLO GROUP © 2012 Apollo Group 13
  • 14. Plan for capacity expansion (cont.) Deploy your initial cluster in two cabinets –One headnode, one switch, and several (five) datanodes per cabinet APOLLO GROUP © 2012 Apollo Group 14
  • 15. Plan for capacity expansion (cont.) Install a second cluster in the empty space in the upper half of the cabinet APOLLO GROUP © 2012 Apollo Group 15
  • 16. APOLLO GROUP What decisions should you make now and which can you postpone for later? Decisions for Later APOLLO GROUP Apollo Group © 2012 16
  • 17. What size cluster? Depends upon your: Budget Data size Workload characteristics SLA APOLLO GROUP © 2012 Apollo Group 17
  • 18. What size cluster? (cont.) Are your MapReduce jobs: compute-intensive? reading lots of data? http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/ APOLLO GROUP © 2012 Apollo Group 18
  • 19. Should you implement rack awareness? If more than one switch in the cluster: YES APOLLO GROUP © 2012 Apollo Group 19
  • 20. Should you use automation? If not in the beginning, then as soon as possible. Boot disks will fail. Automated OS and application installs: –Save time –Reduce errors •Cobbler/Spacewalk/Foreman/xCat/etc •Puppet/Chef/Cfengine/shell scripts/etc APOLLO GROUP © 2012 Apollo Group 20
  • 21. APOLLO GROUP Lessons Learned APOLLO GROUP Apollo Group © 2012 21
  • 22. Keep It Simple Don't add redundancy and features (server/network) that will make things more complicated and expensive. Hadoop has built-in redundancies. Don't overlook them. APOLLO GROUP © 2012 Apollo Group 22
  • 23. Automate the Hardware Twelve hours of manual work in the datacenter is not fun. Make sure all server firmware is configured identically. –HP SmartStart Scripting Toolkit –Dell OpenManage Deployment Toolkit –IBM ServerGuide Scripting Toolkit APOLLO GROUP © 2012 Apollo Group 23
  • 24. Rolling upgrades are possible (Just not of the Hadoop software.) Datanodes can be decommissioned, patched, and added back into the cluster without service downtime. APOLLO GROUP © 2012 Apollo Group 24
  • 25. The smallest thing can have a big impact on the cluster Bad NIC/switchport can cause cluster slowness. Slow disks can cause intermittent job slowdowns. APOLLO GROUP © 2012 Apollo Group 25
  • 26. HDFS blocks are weird On ext3/ext4: –Small blocks are not padded to the HDFS block- size, but rather the actual size of the data. –Each HDFS block is actually two files on the datanode's filesystem: •The actual data and •A metadata/checksum file # ls -l blk_1058778885645824207* -rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207 -rw-r--r-- 1 hdfs hdfs 283 May 14 01:26 blk_1058778885645824207_19155994.meta APOLLO GROUP © 2012 Apollo Group 26
  • 27. Do not prematurely optimize Be careful tuning your datanode filesystems. • mkfs -t ext4 -T largefile4 ... (probably bad) • mkfs -t ext4 -i 131072 -m 0 ... (better) /etc/mke2fs.conf [fs_types] hadoop = { features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink, extra_isize inode_ratio = 131072 blocksize = -1 reserved_ratio = 0 default_mntopts = acl,user_xattr } APOLLO GROUP © 2012 Apollo Group 27
  • 28. Use DNS-friendly names for services hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/ mapred.delta.hadoop.apollogrp.edu:8021 http://oozie.delta.hadoop.apollogrp.edu:11000/ hiveserver.delta.hadoop.apollogrp.edu:10000 Yes, the names are long, but I bet you can figure out how to connect to Bravo Cluster. APOLLO GROUP © 2012 Apollo Group 29
  • 29. Use a parallel, remote execution tool pdsh/Cluster SSH/mussh/etc SSH in a for loop is so 2010 FUNC/MCollective APOLLO GROUP © 2012 Apollo Group 30
  • 30. Make your log directories as large as you can. 20-100GB /var/log –Implement log purging cronjobs or your log directories will fill up. Beware: M/R jobs can fill up /tmp as well. APOLLO GROUP © 2012 Apollo Group 31
  • 31. Insist on IPMI 2.0 for out of band management of server hardware. Serial Over LAN is awesome when booting a system. Standardized hardware/temperature monitoring. Simple remote power control. APOLLO GROUP © 2012 Apollo Group 33
  • 32. Spanning-tree is the devil Enable portfast on your server switch ports or the BMCs may never get a DHCP lease. APOLLO GROUP © 2012 Apollo Group 34
  • 33. Apollo has re-built it's cluster four times. You may end up doing so as well. APOLLO GROUP © 2012 Apollo Group 35
  • 34. Apollo Timeline First build Cloudera Professional Services helped install CDH Four nodes Manually build OS via USB CDROM. CDH2 APOLLO GROUP © 2012 Apollo Group 36
  • 35. Apollo Timeline Second build Cobbler All software deployment is via kickstart. Very little is in puppet. Config files are deployed via wget. CDH2 APOLLO GROUP © 2012 Apollo Group 37
  • 36. Apollo Timeline Third build OS filesystem partitioning needed to change. Most software deployment still via kickstart. CDH3b2 APOLLO GROUP © 2012 Apollo Group 38
  • 37. Apollo Timeline Fourth build HDFS filesystem inodes needed to be increased. Full puppet automation. Added redundant/hotswap enterprise hardware for headnodes. CDH3u1 APOLLO GROUP © 2012 Apollo Group 39
  • 38. Cluster failures at Apollo Hardware –disk failures (40+) –disk cabling (6) –RAM (2) –switch port (1) Software –Cluster •NFS (NN -> 2NN metadata) –Job •TT java heap •Running out of /tmp or /var/log/hadoop •Running out of HDFS space APOLLO GROUP © 2012 Apollo Group 40
  • 39. Know your workload You can spend all the time in the world trying to get the best CPU/RAM/HDD/switch/cabinet configuration, but you are running on pure luck until you understand your cluster's workload. APOLLO GROUP © 2012 Apollo Group 41
  • 40. APOLLO GROUP Questions? APOLLO GROUP Apollo Group © 2012 42