About Me!
• Current: Hadoop Engineering, Yahoo! – Performance, Utilization & HDFS core group.
• Recent past: Javasoft & J2EE Group, Sun – JVM performance, SIP container, XML & Web Services.
My Contributions to Hadoop
• NameNode memory improvements
• Developed tools to understand cluster utilization and performance at scale
• NameNode & JobTracker garbage collector tunings
• Disk Fail Inplace (DFIP)
Impact of a Single Disk Failure

                         Old-generation grids            New grids
Disks per node           6 x 1.5TB drives, 12 slots      12 x 3TB drives, 24 slots
10PB, 3-replica grid     3777 nodes                      944 nodes
One disk failure         Loss of 0.02% of grid           Loss of 0.1% of grid storage,
                         storage                         i.e. a 5x magnified loss
One disk failure         Loss of 0.02% of grid           Loss of 0.1% of grid compute
                         compute capacity                capacity, i.e. a 5x magnified loss
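The percentages above follow from the pre-DFIP behavior described later in this deck: a single failed disk takes the whole DataNode offline, so one disk failure costs 1/N of an N-node grid's storage and compute. A minimal sketch of that arithmetic (node counts from the slide; the class and method names are illustrative, not Hadoop code):

```java
// Sketch: fraction of grid capacity lost when one disk failure
// takes down a whole node (pre-DFIP behavior). Node counts are
// taken from the slide: 3777 old-generation nodes vs. 944 new nodes.
public class DiskFailureImpact {
    static double lossPercent(int totalNodes) {
        // Losing one node out of totalNodes loses 1/totalNodes of
        // both storage and compute capacity.
        return 100.0 / totalNodes;
    }

    public static void main(String[] args) {
        double oldGrid = lossPercent(3777); // ~0.026%, rounded to 0.02% on the slide
        double newGrid = lossPercent(944);  // ~0.106%, rounded to 0.1% on the slide
        System.out.printf("old grid: %.3f%%, new grid: %.3f%%%n", oldGrid, newGrid);
    }
}
```

The slide's "5 times" comes from comparing the rounded figures (0.1% vs. 0.02%); the exact node-count ratio is 3777/944, i.e. about 4x.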
Node Statistics

Total     Active         Blacklisted   Excluded
30242     28436 (94%)    65 (0.2%)     1741 (6%)

Breakout of blacklisted nodes across all grids:
• Ethernet link failure: 11 (16% of failures)
• Disk failure: 54 (83% of failures)
What is DFIP?
• DFIP – Disk Fail Inplace
• Keep Hadoop running when disks fail, up to a configurable threshold.
• Primarily affects the DataNode and TaskTracker.
• We took a holistic approach to solving the disk-failure problem.
Why Now?
• Trend toward high-density disk configurations (36TB per node)
  – The cost of losing a node is high
• To increase operational efficiency
  – Utilization
  – Scaling data
  – Various other benefits
Where to Inject a Failure?
• Complete stack analysis for disk failures:
    DataNode / TaskTracker
    JVM
    Linux
    SCSI
    Device driver
Lab Setup
• 40-node cluster on two racks
• Kickstart and TFTP server
• Kerberos server
Lab Setup (cont…)
• PXE boot, TFTP server, DHCP server & Kerberos server.
  [Diagram: Kerberos server, PXE server, Hadoop nodes]
Operational Improvement
• With DFIP, we completely changed the Hadoop deployment layout.
• Linux re-imaging used to take 4 hours on a 12-disk system.
• Improvement: we reduced the re-image time to 20 minutes (12x better).
File Management
• Separate user and system files
• RAID1 for system files
• System files – kernel files, Hadoop binaries, pids and logs & JDK
• User files – HDFS data, task logs and output & distributed cache, etc.
DataNode Impact
• Separation of system and user files
• DataNode logs on RAID1
• DataNode doesn't honor volumes tolerated
  – Jira: HDFS-1592
• DataNode process doesn't exit when disks fail
  – Jira: HDFS-1692
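The "volumes tolerated" setting referenced by HDFS-1592 is the `dfs.datanode.failed.volumes.tolerated` property in hdfs-site.xml, which lets the DataNode keep running until the failure threshold is reached rather than exiting on the first bad disk. A typical entry (the value shown is illustrative):

```xml
<!-- hdfs-site.xml: allow the DataNode to keep serving with up to
     this many failed data volumes before it shuts down.
     The default of 0 exits on the first disk failure;
     the value 3 here is illustrative only. -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>3</value>
</property>
```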
DataNode: HDFS-1692
• DataNode process doesn't exit when disks fail
  – Runtime issue (secure mode).
TaskTracker Impact
• Separation of system and user files
• TaskTracker logs on RAID1
• TaskTracker should handle disk failures at both startup and runtime
  – Jira: MAPREDUCE-2413
• Distribute task userlogs across multiple disks
  – Jira: MAPREDUCE-2415
• Components impacted: Linux task controller, default task controller, health check script, security, and most other TaskTracker components.
TaskTracker: MAPREDUCE-2413
• TaskTracker should handle disk failures at both startup and runtime.
  – Keep track of good disks at all times.
  – Pass the list of good disks to components such as DefaultTaskController and LinuxTaskController.
  – Periodically check for disk failures.
  – If a disk failure happens, re-initialize the TaskTracker.
  – Modified health check scripts.
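The steps above can be sketched roughly as follows. This is a hypothetical illustration of the approach, not the actual MAPREDUCE-2413 code; the class and method names are invented:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of the MAPREDUCE-2413 approach: maintain the list of good
// local dirs, re-scan it periodically, and signal a TaskTracker
// re-initialization when the set of good disks changes so the task
// controllers can be handed the updated list.
public class DiskChecker {
    private List<String> goodDirs = new ArrayList<>();

    // A dir counts as "good" if it exists (or can be created) and is writable.
    static boolean isHealthy(File dir) {
        return (dir.isDirectory() || dir.mkdirs()) && dir.canWrite();
    }

    // Re-scan all configured local dirs; return true if the set of
    // good dirs changed, meaning the TaskTracker (and the controllers
    // it feeds) must be re-initialized.
    boolean checkDirs(List<String> configuredDirs) {
        List<String> nowGood = new ArrayList<>();
        for (String d : configuredDirs) {
            if (isHealthy(new File(d))) {
                nowGood.add(d);
            }
        }
        boolean changed = !nowGood.equals(goodDirs);
        goodDirs = nowGood;
        return changed;
    }

    List<String> getGoodDirs() {
        return goodDirs;
    }
}
```

In the real system the periodic scan runs on a timer and feeds into the health check script; here only the bookkeeping is shown.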
TaskTracker: MAPREDUCE-2415
• Distribute task userlogs across multiple disks.
  – A single userlog disk is a single point of failure.
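One way to remove that single point of failure is to spread each task attempt's userlog directory across the currently good disks, for example by hashing the attempt id. This is an illustrative sketch of the idea, not the actual MAPREDUCE-2415 implementation, and the names are invented:

```java
import java.util.List;

// Sketch: pick a userlog directory per task attempt by hashing the
// attempt id over the currently good disks, so no single disk holds
// (or loses) every task's logs.
public class UserlogPlacement {
    static String pickLogDir(List<String> goodDirs, String attemptId) {
        // Mask the sign bit rather than using Math.abs, which can
        // overflow for Integer.MIN_VALUE.
        int idx = (attemptId.hashCode() & 0x7fffffff) % goodDirs.size();
        return goodDirs.get(idx) + "/userlogs/" + attemptId;
    }
}
```

Hashing keeps placement deterministic, so the same attempt's logs can be located again without extra bookkeeping.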
Rigorous Testing
• RandomWriter benchmark (with failures)
• TeraSort benchmark (with failures)
• GridMix v3 benchmark (with failures)
• Passed 950 QA tests
• Tested with Valgrind for memory leaks