Hadoop admin


Published on

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Hadoop admin

  1. 1. BalajiRajan Meetup.com/DevOps-Bangalore str.balaji@gmail.com / balajirajan.com
  2. 2. Some Numbers...... # 1k => 1000 bytes # 1kb => 1024 bytes # 1m => 1000000 bytes # 1mb => 1024*1024 bytes # 1g => 1000000000 bytes # 1gb => 1024*1024*1024 bytes # 1T => 1000000000000 bytes #Tb =>1024*1024*1024*1024 bytes # 1Petabytes , Exabytes, zettabytes... etc Max data in memory (RAM): 64GB Max data per computer (disk): 24TB Data processed by Google every month: 400PB… in 2007 Average job size: 180GB Time: 180GB of data would take to read sequentially off a single disk drive: approximately 45 minutes 15/12/13
  3. 3. Data Access Speed is the Bottleneck We can process data very quickly, but we can only read/write it very slowly Solution: parallel reads – 1 HDD = 75MB/sec – 1,000 HDDs = 75GB/sec – Far more acceptable 15/12/13
  4. 4. Moving to a Cluster of Machines * In the late 1990s, Google decided to design its architecture using clusters of low-cost machines – Rather than fewer, more powerful machines * Creating an architecture around low-cost, unreliable hardware presents a number of challenges 15/12/13
  5. 5. System Requirements * System should support partial failure * System should support data recoverability * System should be consistent * System should be scalable 15/12/13
  6. 6. Hadoop's Origins Google created an architecture which answers these (and other) requirements Released two White Papers 1. 2003: Description of the Google File System (GFS) – A method for storing data in a distributed, reliable fashion 2. 2004: Description of distributed MapReduce – A method for processing data in a parallel fashion 15/12/13
  7. 7. So Hadoop was based on these White Papers 15/12/13
  8. 8. Hadoop Cluster 15/12/13
  9. 9. HDFS Features * Operates ‘on top of’ an existing filesystem * Files are stored as ‘blocks’ – Much larger than for most filesystems – Default is 64MB * Provides reliability through replication – Each block is replicated across multiple DataNodes – Default replication factor is 3 * Single NameNode daemon stores metadata and co-ordinates access – Provides simple, centralized management * Blocks are stored on slave nodes – Running the DataNode daemon 15/12/13
  10. 10. 15/12/13
  11. 11. HDFS: Block Diagram
  12. 12. The NameNode    The NameNode stores all metadata – Information about file locations in HDFS – Information about file ownership and permissions – Names of the individual blocks – Locations of the blocks Metadata is stored on disk and read when the NameNode daemon starts up – Filename is fsimage When changes to the metadata are required, these are made in RAM – Changes are also written to a log file on disk called edits – Full details later
  13. 13. The NameNode: Memory Allocation  When the NameNode is running, all meta data is held in RAM for fast response  Each ‘item’ consumes 150-200 bytes of RAM  Items: – Filename, permissions, etc. – Block information for each block
  14. 14. The NameNode: Memory Allocation  Why HDFS prefers fewer, larger files: – Consider 1GB of data, HDFS block size 128MB – Stored as 1 x 1GB file – Name: 1 item – Blocks: 8 x 3 = 24 items – Total items: 25 – Stored as 1000 x 1MB files – Names: 1000 items – Blocks: 1000 x 3 = 3000 items – Total items: 4000
  15. 15. The Slave Nodes Actual contents of the files are stored as blocks on the slave nodes  Blocks are simply files on the slave nodes’ underlying filesystem – Named blk_xxxxxxx – Nothing on the slave node provides information about what underlying file the block is a part of – That information is only stored in the NameNode’s metadata  Each block is stored on multiple different nodes for redundancy – Default is three replicas  Each slave node runs a DataNode daemon – Controls access to the blocks – Communicates with the NameNode 
  16. 16. Secondary Name Node
  17. 17. The Secondary NameNode:   The Secondary NameNode is not a failover NameNode! – It performs memory-intensive administrative functions for the NameNode – NameNode keeps information about files and blocks (the metadata) in memory – NameNode writes metadata changes to an editlog – Secondary NameNode periodically combines a prior filesystem snapshot and editlog into a new snapshot – New snapshot is transmitted back to the NameNode Secondary NameNode should run on a separate machine in a large installation – It requires as much RAM as the NameNode
  18. 18. Writing Files to HDFS
  19. 19. Anatomy of a File Write 1. Client connects to the NameNode 2. NameNode places an entry for the file in its metadata, returns the block name and list of DataNodes to the client 3. Client connects to the first DataNode and starts sending data 4. As data is received by the first DataNode, it connects to the second and starts sending data 5. Second DataNode similarly connects to the third 6. ack packets from the pipeline are sent back to the client 7. Client reports to the NameNode when the block is written
  20. 20. Reading Files from HDFS
  21. 21. Anatomy of a File Read     Client connects to the NameNode NameNode returns the name and locations of the first few blocks of the file – Block locations are returned closest-first. Client connects to the first of the DataNodes, and reads the block If the DataNode fails during the read, the client will seamlessly connect to the next one in the list to read the block
  22. 22. The NameNode Is Not A Bottleneck  Note: the data never travels via the NameNode – For writes – For reads – During re-replication
  23. 23. Dealing With Data Corruption  As the DataNode is reading the block, it also calculates the checksum.  ‘Live’ checksum is compared to the checksum created when the block was stored.  If they differ, the client reads from the next DataNode in the list – The NameNode is informed that a corrupted version of the block has been found. – The NameNode will then re-replicate that block elsewhere.  The DataNode verifies the checksums for blocks on a regular basis to avoid ‘bit rot’ – Default is every three weeks after the block was created
  24. 24. Data Reliability and Recovery  DataNodes send heartbeats to the NameNode – Every three seconds  After a period without any heartbeats, a DataNode is assumed to be lost – NameNode determines which blocks were on the lost node. – NameNode finds other DataNodes with copies of these blocks. – These DataNodes are instructed to copy the blocks to other nodes. – Three-fold replication is actively maintained.
  25. 25. Hadoop is ‘Rack-aware’
  26. 26. Hadoop is ‘Rack-aware’  Hadoop understands the concept of ‘rack awareness’ – The idea of where nodes are located, relative to one another – Helps the JT to assign tasks to nodes closest to the data – Helps the NN determine the ‘closest’ block to a client during reads – In reality, this should perhaps be described as being ‘switchaware’  HDFS replicates data blocks on nodes on different racks – Provides extra data security in case of catastrophic hardware failure  Rack-awareness is determined by a user-defined script
  27. 27. Rack-aware Script <property> <name>topology.script.file.name</name> <value>/etc/hadoop/topology.sh</value> </property> Script create a file which contains a server and rack informaton: ============ /rack1 /rack1 /rack1 /rack2 /rack2 /rack2 /rack3 /rack3 /rack3 =============
  28. 28. Datacenter
  29. 29. HDFS File Permissions   Files in HDFS have an owner, a group, and permissions – Very similar to Unix file permissions HDFS permissions are designed to stop good people doing foolish things
  30. 30. What Is MapReduce?     MapReduce is a method for distributing a task across multiple nodes Each node processes data stored on that node Consists of two developer-created phases – Map – Reduce In between Map and Reduce is the shuffle and sort – Sends data from the Mappers to the Reducers
  31. 31. What Is MapReduce?
  32. 32. MapReduce: Basic Concepts        Each Mapper processes a single input split from HDFS Hadoop passes the developer’s Map code one record at a time Each record has a key and a value Intermediate data is written by the Mapper to local disk During the shuffle and sort phase, all the values associated with the same intermediate key are transferred to the same Reducer – The developer specifies the number of Reducers Reducer is passed each key and a list of all its values – Keys are passed in sorted order Output from the Reducers is written to HDFS
  33. 33. MapReduce: A Simple Example
  34. 34. MapReduce: A Simple Example 15/12/13
  35. 35. MapReduce: A Simple Example 15/12/13
  36. 36. Some MapReduce Terminology * A user runs a client program on a client computer * The client program submits a job to Hadoop – The job consists of a mapper, a reducer, and a list of inputs * The job is sent to the JobTracker * Each Slave Node runs a process called the TaskTracker * The JobTracker instructs TaskTrackers to run and monitor tasks – A Map or Reduce over a piece of data is a single task * A task attempt is an instance of a task running on a slave node – Task attempts can fail, in which case they will be restarted (more later) – There will be at least as many task attempts as there are tasks which need be performed 15/12/13 to
  37. 37. Aside: The Job Submission Process When a job is submitted, the following happens: – The client requests and receives a new unique Job ID from the JobTracker (includes JobTracker start time and a sequence number) – The client calculates the input splits for the job – How the input data will be split up between Mappers – The client turns the job configuration information into an XML file – The client places the XML file and the job jar into a temporary directory in HDFS (the Job ID is included in the path) – The client contacts the JobTracker with the location of the XML and jar files, and the list of input splits – The JobTracker takes over the job from this point on 15/12/13
  38. 38. MapReduce: High Level 15/12/13
  39. 39. MapReduce Failure Recovery Task processes send heartbeats to the TaskTracker TaskTrackers send heartbeats to the JobTracker Any task that fails to report in 10 minutes is assumed to have failed – Its JVM is killed by the TaskTracker Any task that throws an exception is said to have failed Failed tasks are reported to the JobTracker by the TaskTracker The JobTracker reschedules any failed tasks – It tries to avoid rescheduling the task on the same TaskTracker where it previously failed If a task fails four times, the whole job fails 15/12/13
  40. 40. MapReduce Failure Recovery Any TaskTracker that fails to report in 10 minutes is assumed to have crashed – All tasks on the node are restarted elsewhere – Any TaskTracker reporting a high number of failed tasks is blacklisted, to prevent the node from blocking the entire job – There is also a ‘global blacklist’, for TaskTrackers which fail on multiple jobs. The JobTracker manages the state of each job – Partial results of failed tasks are ignored 15/12/13
  41. 41. The Apache Hadoop Project Hadoop is a ‘top-level’ Apache project – Created and managed under the auspices of the Apache Software Foundation Several other projects exist that rely on some or all of Hadoop – Typically either both HDFS and MapReduce, or just HDFS Ecosystem projects are often also top-level Apache projects – Some are ‘Apache incubator’ projects – Some are not managed by the Apache Software Foundation Ecosystem projects include Hive, Pig, Sqoop, Flume, HBase,Oozie, … 15/12/13
  42. 42. Hive Hive is a high-level abstraction on top of MapReduce – Initially created by a team at Facebook – Avoids having to write Java MapReduce code – Data in HDFS is queried using a language very similar to SQL – Known as HiveQL HiveQL queries are turned into MapReduce jobs by the Hive interpreter – ‘Tables’ are just directories of files stored in HDFS – A Hive Metastore contains information on how to map a file to a table structure 15/12/13
  43. 43. Planning Your Hadoop Cluster * What issues to consider when planning your Hadoop cluster 1. What types of hardware are typically used for Hadoop nodes 2. How to optimally configure your network topology 3. How to select the right operating system and Hadoop distribution
  44. 44. Cluster Growth Based on Storage Capacity • Basing your cluster growth on storage capacity is often a good method to use • Example: – Data grows by approximately 1TB per week – HDFS set up to replicate each block three times – Therefore, 3TB of extra storage space required per week – Plus some overhead – say, 30% – Assuming machines with 4 x 1TB hard drives, this equates to a new machine required each week – Alternatively: Two years of data – 100TB – will require approximately 100 machines
  45. 45. Classifying Nodes • Nodes can be classified as either ‘slave nodes’ or ‘master nodes’ • Slave node runs DataNode plus TaskTracker daemons • Master node runs either a NameNode daemon, a Secondary NameNode Daemon, or a JobTracker daemon – On smaller clusters, NameNode and JobTracker are often run on the same machine – Sometimes even Secondary NameNode is on the same machine as the NameNode and JobTracker – Important that at least one copy of the NameNode’s metadata is stored on a separate machine (see later)
  46. 46. Slave Nodes: Recommended Configuration • Typical ‘base’ configuration for a slave Node – 4 x 1TB or 2TB hard drives, in a JBOD* configuration – Do not use RAID! (See later) – 2 x Quad-core CPUs – 24-32GB RAM – Gigabit Ethernet • Multiples of (1 hard drive + 2 cores + 6-8GB RAM) tend to work well for many types of applications – Especially those that are I/O bound
  47. 47. Slave Nodes: More Details (CPU) • Quad-core CPUs are now standard • Hex-core CPUs are becoming more prevalent – But are more expensive • Hyper-threading should be enabled • Hadoop nodes are seldom CPU-bound – They are typically disk- and network-I/O bound – Therefore, top-of-the-range CPUs are usually not necessary
  48. 48. Slave Nodes: More Details (RAM) • Slave node configuration specifies the maximum number of Map and Reduce tasks that can run simultaneously on that node • Each Map or Reduce task will take 1GB to 2GB of RAM • Slave nodes should not be using virtual memory • Ensure you have enough RAM to run all tasks, plus overhead for the DataNode and TaskTracker daemons, plus the operating system • Rule of thumb: Total number of tasks = 1.5 x number of processor cores -- This is a starting point, and should not be taken as a definitive setting for all clusters
  49. 49. Slave Nodes: More Details (Disk) • In general, more spindles (disks) is better • In practice, we see anywhere from four to 12 disks per node • Use 3.5" disks – Faster, cheaper, higher capacity than 2.5" disks • 7,200 RPM SATA drives are fine – No need to buy 15,000 RPM drives • 8 x 1.5TB drives is likely to be better than 6 x 2TB drives – Different tasks are more likely to be accessing different disks • A good practical maximum is 24TB per slave node – More than that will result in massive network traffic if a node dies and block re-replication must take place
  50. 50. Slave Nodes: Why Not RAID? • Slave Nodes do not benefit from using RAID* storage – HDFS provides built-in redundancy by replicating blocks across multiple nodes – RAID striping (RAID 0) is actually slower than the JBOD configuration used by HDFS – RAID 0 read and write operations are limited by the speed of the slowest disk in the RAID array – Disk operations on JBOD are independent, so the average speed is greater than that of the slowest disk – One test by Yahoo showed JBOD performing between 10% and 30% faster than RAID 0, depending on the operations being performed
  51. 51. What About Virtualization? • Virtualization is usually not worth considering – Multiple virtual nodes per machine hurts performance – Hadoop runs optimally when it can use all the disks at once What About Blade Servers? Blade servers are not recommended – Failure of a blade chassis results in many nodes being unavailable – Individual blades usually have very limited hard disk capacity – Network interconnection between the chassis and top-of-rack switch can become a bottleneck
  52. 52. Master Nodes: Single Points of Failure • Slave nodes are expected to fail at some point – This is an assumption built into Hadoop – NameNode will automatically re-replicate blocks that were on the failed node to other nodes in the cluster, retaining the 3x replication requirement – JobTracker will automatically re-assign tasks that were running on failed nodes • Master nodes are single points of failure – If the NameNode goes down, the cluster is inaccessible – If the JobTracker goes down, no jobs can run on the cluster – All currently running jobs will fail • Spend more money on your master nodes!
  53. 53. Master Node Hardware Recommendations • Carrier-class hardware – Not commodity hardware • Dual power supplies • Dual Ethernet cards – Bonded to provide failover • RAIDed hard drives • At least 32GB of RAM
  54. 54. General Network Considerations • Hadoop is very bandwidth-intensive! – Often, all nodes are communicating with each other at the same time • Use dedicated switches for your Hadoop cluster • Nodes are connected to a top-of-rack switch • Nodes should be connected at a minimum speed of 1Gb/sec • For clusters where large amounts of intermediate data is generated, consider 10Gb/sec connections – Expensive – Alternative: bond two 1Gb/sec connections to each node
  55. 55. General Network Considerations (cont’d) • Racks are interconnected via core switches • Core switches should connect to top-of-rack switches at 10Gb/ sec or faster • Beware of over-subscription in top-of-rack and core switches • Consider bonded Ethernet to mitigate against failure • Consider redundant top-of-rack and core switches
  56. 56. Operating System Recommendations • Choose an OS you’re comfortable administering • CentOS: geared towards servers rather than individual workstations – Conservative about package versions – Very widely used in production • RedHat Enterprise Linux (RHEL): RedHat-supported analog to CentOS – Includes support contracts, for a price • In production, we often see a mixture of RHEL and CentOS machines – Often RHEL on master nodes, CentOS on slaves
  57. 57. Configuring The System • Do not use Linux’s LVM (Logical Volume Manager) to make all your disks appear as a single volume • – As with RAID 0, this limits speed to that of the slowest disk • Check the machines’ BIOS* settings – BIOS settings may not be configured for optimal performance – For example, if you have SATA drives make sure IDE emulation is not enabled • Test disk I/O speed with hdparm -t – Example: hdparm -t /dev/sda1 – You should see speeds of 70MB/sec or more Anything less is an indication of possible problems
  58. 58. Configuring The System Hadoop has no specific disk partitioning requirements – Use whatever partitioning system makes sense to you Mount disks with the noatime option Common directory structure for data mount points: /data/<n>/dfs/nn /data/<n>/dfs/dn /data/<n>/dfs/snn /data/<n>/mapred/local Reduce the swappiness of the system – Set vm.swappiness to 0 or 5 in /etc/sysctl.conf 15/12/13
  59. 59. Filesystem Considerations • Cloudera recommends the ext3 and ext4 filesystems – ext4 is now becoming more commonly used • XFS provides some performance benefit during kickstart – It formats in 0 seconds, vs several minutes for each disk with ext3 • XFS has some performance issues – Slow deletes in some versions – Some performance improvements are available; see e.g., http://everything2.com/index.pl?node_id=1479435 – Some versions had problems when a machine runs out of memory
  60. 60. Operating System Parameters • Increase the nofile ulimit for the mapred and hdfs users to at least 32K – Setting is in /etc/security/limits.conf • Disable IPv6 • Disable SELinux • Install and configure the ntp daemon – Ensures the time on all nodes is synchronized – Important for HBase – Useful when using logs to debug problems
  61. 61. Java Virtual Machine (JVM) Recommendations • Always use the official Oracle JDK (http://java.com/) – Hadoop is complex software, and often exposes bugs in other JDK implementations • Version 1.6 is required – Avoid 1.6.0u18 – This version had significant bugs • Hadoop is not yet production-tested with Java 7 (1.7) • Recommendation: don’t upgrade to a new version as soon as it is released – Wait until it has been tested for some time
  62. 62. Cloudara Manager
  63. 63. For easy installation  Cloudera has released Cloudera Manager (CM), a tool for easy deployment and configuration of Hadoop clusters  The free version, Cloudera Manager Free Edition, can manage up to 50 nodes – The version supplied with Cloudera Enterprise supports an unlimited number of nodes
  64. 64. Using Cloudera Manager Free Edition
  65. 65. Typical Configuration Parameters
  66. 66. Hadoop's Configuration Files  Each machine in the Hadoop cluster has its own set of configuration files  Configuration files all reside in Hadoop’s conf directory – Typically /etc/hadoop/conf  Primary configuration files are written in XML
  67. 67. Sample Configuration File Sample configuration file (mapred-site.xml) <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:8021</value> </property> </configuration>
  68. 68. Core-site.xml
  69. 69. hdfs-site.xml  The single most important configuration value on your entire cluster, set on the NameNode: * Loss of the NameNode’s metadata will result in the effective loss of all the data on the cluster – Although the blocks will remain, there is no way of reconstructing the original files without the metadata * This must be at least two disks (or a RAID volume) on the NameNode, plus an NFS mount elsewhere on the network – Failure to set this correctly will result in eventual loss of your cluster’s data
  70. 70. Mapred-site.xml
  71. 71. Additional Configuration Files There are several more configuration files in /etc/hadoop/conf – hadoop-env.sh: environment variables for Hadoop daemons – HDFS and MapReduce include/exclude files * Controls who can connect to the NameNode and JobTracker – masters, slaves: hostname lists for ssh control – hadoop-policy.xml: Access control policies – log4j.properties: logging (covered later in the course) – fair-scheduler.xml: Scheduler (covered later in the course) – hadoop-metrics.properties: Monitoring (covered later in the course) 
  72. 72. Environment Setup: hadoop-env.sh HADOOP_HEAPSIZE – Controls the heap size for Hadoop daemons – Default 1GB – Comment this out, and set the heap for individual daemons HADOOP_NAMENODE_OPTS – Java options for the NameNode – At least 4GB: -Xmx4g HADOOP_JOBTRACKER_OPTS – Java options for the JobTracker – At least 4GB: -Xmx4g HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS – Set to 1GB each: -Xmx1g 15/12/13
  73. 73. Host 'include' and 'exclude' Files  Optionally, specify dfs.hosts in hdfs-site.xml to point to a file listing hosts which are allowed to connect to the NameNode and act as DataNodes – Similarly, mapred.hosts points to a file which lists hosts allowedto connect as TaskTrackers  Both files are optional – If omitted, any host may connect and act as a DataNode/ TaskTracker – This is a possible security/data integrity issue  NameNode can be forced to reread the dfs.hosts file with hadoop dfsadmin -refreshNodes – No such command for the JobTracker, which has to be restarted to re-read the mapred.hosts file, so many System Administrators only create a dfs.hosts file
  74. 74. Managing and Scheduling Jobs
  75. 75. Displaying Running Jobs • To view all jobs running on the cluster, use # hadoop job –list
  76. 76. Displaying All Jobs • To display all jobs including completed jobs, use # hadoop job -list all
  77. 77. Killing a Job • It is important to note that once a user has submitted a job, they can not stop it just by hitting CTRL-C on their terminal – This stops job output appearing on the user’s console – The job is still running on the cluster!
  78. 78. Killing a Job To kill a job use hadoop job -kill <job_id> 15/12/13
  79. 79. Demo!!!
  80. 80. Reference: 1. Cloudera.com 2. Bradhedlund.com
  81. 81. ???