A Cluster Is Only As Strong As its Weakest Link



Early detection and correction of cluster health issues is a vital part of daily cluster management, no matter the size. Building and managing a healthy cluster is the best way to meet service level agreements and avoid prolonged troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early. Fortunately, deploying simple tools and processes prevents minor problems from becoming major headaches. This talk covers how we developed, tested, and deployed a comprehensive health process based on real-life events and experiences. The table-driven health check runs a full scan in ~2 seconds and includes: a checklist, ‘positive’ error-pattern matching, enabling and disabling node blacklisting, logging, validating file systems, processing very large log files, trapping in-rack network faults (adding 5 seconds to accurately detect packet loss), and recommissioning nodes into production.

  • Dan Romike, Hadoop Tooling Engineer / Configuration Manager, Twitter, Inc. Dan Romike started with Hadoop in the summer of 2008 at Yahoo!, Inc., on their Hadoop data warehouse and site operations teams, and received a ‘You Rock’ award for a very large data management project. He has since worked in Hadoop operations at eBay, Inc. and now at Twitter as a Hadoop Reliability Engineer. He recently gave a presentation at the 2011 Summit discussing Hadoop automation and has an extensive background building and managing Unix-based production environments.
  • Early detection and correction of cluster health issues is a vital part of daily cluster management, no matter the size. Building and managing a healthy cluster is the best cure for meeting service level agreements and avoiding elongated troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early.
  • Deploying simple tools and processes prevents minor problems from becoming major headaches. This talk covers how Twitter's Hadoop Reliability team developed, tested, and deployed a broad-spectrum cluster health check that detects problems quickly and early.
  • Clusters run at full efficiency when all LIVE nodes are working at their peak. During node failures, partial or full, the cluster may behave in unexpected ways, causing a weak link. Finding a small problem across thousands of nodes is time consuming. What we deployed is an internal check that can effect a change in the cluster’s behavior and blacklist failing nodes, preventing new tasks from starting in a failed condition.
  • We start with a high-level review of the Hadoop environment at Twitter. We are a very small operational team: we need the ability to manage a large Hadoop environment from installation to production, and we try to avoid losing time troubleshooting issues that affect the cluster. Our team's effort is to build the missing layers in the health and management pyramid that will provide meaningful and simple interfaces for the Hadoop admins.
  • Each cluster has a primary use. We run close to maximum storage and processing on most clusters, so it is important to test and evaluate all releases and production changes to prevent failures on the large clusters. These clusters comprise thousands of nodes and 10s of petabytes in multiple datacenters, running a large number of jobs per day.
  • The Site Operations team manages our infrastructure and corrects node failures (after the nodes are withdrawn from the clusters). We file one ticket per failure, and they quickly and accurately correct the issues and return the node ready for commissioning. Their support is invaluable: without it, we would not have been able to grow as quickly as we have in the last year. Some of the issues discovered and resolved by Site Operations are discussed.
  • Our nodes belong to roles managed by an internal Configuration Manager. All nodes must belong to a role, each node has inherited attributes, and we may effect a role-wide operation by executing commands through the manager.
  • To ensure that our code and configurations are accurate, we have a rigorous process that includes: Peer reviews, review boards, staging, validations, canary, restaging, production. The code and configurations are checked in and distributed to the nodes via Puppet, without exception.
  • Reliability covers many aspects of cluster management and is part of the daily maintenance, outages, preventative care, and health evaluation that every cluster, irrespective of size, requires. Our focus is the HEALTH aspect of Hadoop: managing failures without intervention. We do so with a comprehensive health process that has simple roots; it isolates node issues, and reported failures are rolled up into the monitoring system, which is an independent function.
  • Hadoop is highly dependent on a healthy cluster, be it 10 or 1000 nodes. A cluster may exhibit failed behaviors from minor issues on a single node, so discovering the issue and immediately blacklisting the node is important. Listed here are most of the weak links that will cause data and job issues.
  • This section covers what we wanted to achieve to obtain full cluster health. We realized early that the health process plays an important role in node health as well as in validation, ensuring that returning nodes enter the cluster fully functional. The same script is able to perform multiple tasks with no code changes.
  • What are some of the best methods for building and deploying a check? There is a limited amount of time, seconds, to run checks and scan for other issues, so a full log-body scan is neither reasonable nor necessarily accurate. Here are some of the aspects we sought.
  • After the script was deployed, we needed to verify these goals. Though difficult to track, we used our workload and the number of people required to manage our clusters as a primary indicator. We are pleased with the results.
  • Each Hadoop cluster has three primary columns of health; we created two and one is provided:
    – The health check finds issues collecting in logs and process states, based on thresholds and time.
    – Our monitoring system notifies us of issues over time using aggregation.
    – Finally, Hadoop manages heartbeats for both datanodes and tasks; these provide critical information on the node’s status. Should a heartbeat be delayed too long, the cluster will automatically take corrective action.
    The administrator takes manual action to exclude or include nodes in the cluster; however, in some cases, nodes have to be excluded to kill an issue.
  • To install a health check, update these properties, as described.
  • The actual process of the health check is to return a result message to the job manager. An ‘ERROR’ indicates the node is to be taken out of circulation, but running task attempts are allowed to finish. Any other terms may be used to indicate to automation that the node PASSed or that other issues exist and actions are required. Because tasks finish instead of being terminated, the blacklist gives us time to evaluate the problem and take corrective actions such as fail-tasking.
    Stories:
    – Full file systems from errant jobs filled node storage; the health check caught it during a brownout and the nodes showed as blacklisted
    – Errant jars caused Full GCs in the TT: updated health to count Full GCs over 999 records and restart the TT
    – Racks lost packets: added same-rack packet loss detection to blacklist the rack, and wrote a crawler for inter-rack losses
    – Predictive disk failures in the controller: detected and blacklisted
    – Kickstart installed root on the wrong disk: detected
    – High load averages slow down jobs: blacklist immediately
    – Memory shortfall: detect and blacklist nodes
    The check table (type, name, source, level, threshold, owner, pattern):
    bin    fsused       fsused             $E  $FAULTS_DF  root    $ERRS_DF
    sbin   mkfswrong    fsused             $W  $FAULTS_DF  root    $WARN_DF
    sbin   diskwrong    fsused             $W  $FAULTS_DW  root    $WARN_DW
    file   mounts       $proc/mounts       $E  1           root    ^\/dev\/
    file   fstab        $etc/fstab         $E  1           root    ^LABEL=
    file   loadavg      $proc/loadavg      $W  70.0        root    [0-9.]+
    proc   datanode     $dnpid             $E  1           hadoop  $PROC_DN
    proc   tasktracker  $ttpid             $W  1           hadoop  $PROC_TT
    proc   regionserver $rspid             $W  1           hadoop  $PROC_RS
    proc   monit        $log/monit.log     $W  1           root    $ubin/monit
    proc   syslogd      $run/syslog-ng.pid $W  1           root    syslog-ng
    proc   scribed      $run/scribe.pid    $W  1           root    $usbin/scribed
    log    syslog-dev   $log/syslog        $E  $FAULTS_DV  root    $ERRS_DV
    log    syslog-hw    $log/syslog        $E  $FAULTS_HW  root    $ERRS_HW
    log    mcelog-hw    $log/mcelog        $E  $FAULTS_MC  root    $ERRS_MC
    log    ttlog        $ttlog             $W  $FAULTS_TT  hadoop  $ERRS_TT
    log    dnlog        $dnlog             $W  $FAULTS_DN  hadoop  $ERRS_DN
    log    rslog        $rslog             $W  $FAULTS_RS  hadoop  $ERRS_RS
    log    scribe       $sclog             $W  $FAULTS_SC  hadoop  $ERRS_SC
    log    shortmem     $proc/meminfo      $W  $FAULTS_SM  root    $ERRS_SM
    log    bonding      $bonding           $F  $FAULTS_EB  root    $ERRS_EB
    toggle blacklisted  $bllog             $E  $FAULTS_BL  hadoop  $ERRS_BL
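Hadoop's health-check protocol is simple: the TaskTracker runs the configured script on an interval, and any output line beginning with ERROR marks the node unhealthy so it can be blacklisted. A minimal sketch of two table-driven check types follows; the thresholds, and the idea of feeding df output on stdin, are illustrative assumptions, not the talk's production code:

```shell
#!/bin/bash
# Sketch of two check types. A line starting with "ERROR" tells the
# TaskTracker to report the node unhealthy; anything else (e.g.
# "PASS") leaves it in service. Thresholds are illustrative.

# check_df: read "mountpoint used%" pairs on stdin and flag any
# file system at or over the limit (percent used).
check_df() {
    local limit=$1 fs pct use
    while read -r fs pct; do
        use=${pct%\%}
        if [ "${use:-0}" -ge "$limit" ]; then
            echo "ERROR fs ${fs} is ${use}% full"
            return 0
        fi
    done
    echo "PASS"
}

# check_load: compare a 1-minute load average against a threshold.
check_load() {
    local limit=$1 load=$2
    if awk -v l="$load" -v max="$limit" 'BEGIN {exit !(l > max)}'; then
        echo "ERROR load average ${load} exceeds ${limit}"
    else
        echo "PASS"
    fi
}

# Wire the checks to live data, as the real script would:
df -P | awk 'NR > 1 {print $6, $5}' | check_df 95
if [ -r /proc/loadavg ]; then
    check_load 70.0 "$(awk '{print $1}' /proc/loadavg)"
fi
```

Keeping each check a small function driven by arguments is what lets one script serve the whole table without code changes.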
  • Detecting faults is based on real-life experiences and is usually taught by errors and failures. This section describes some of the faults we scan for, which provide the basis of a health check. We also induce faults into the health system to perform maintenance operations: it is easier to do maintenance by blacklisting than by excluding.
  • Managing a node’s expected performance is a major concern and ‘weakness’ in working with large clusters. A single node issue may cause cascading problems that extend job run times. Some of the possible issues are network losses, speed reductions, and issues caused by manual interventions as part of general maintenance. The health process needs to trap and blacklist nodes that are not meeting specifications.
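The same-rack packet-loss trap can be sketched as below; the ping burst (roughly the 5 seconds the abstract mentions), the peer host name, and the 2% threshold are all illustrative assumptions:

```shell
# Probe a same-rack peer with a short ping burst and compare the
# measured loss percentage against a threshold. Peer name, burst
# size, and threshold are illustrative assumptions.

# packet_loss COUNT HOST: emit just the measured loss percentage.
packet_loss() {
    ping -q -c "$1" -i 0.2 "$2" 2>/dev/null |
        sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# check_rack_net LIMIT LOSS: an unreachable peer (empty LOSS)
# counts as 100% loss.
check_rack_net() {
    local limit=$1 loss=$2
    if awk -v l="${loss:-100}" -v max="$limit" 'BEGIN {exit !(l > max)}'; then
        echo "ERROR in-rack packet loss ${loss:-100}% exceeds ${limit}%"
    else
        echo "PASS"
    fi
}

# Hypothetical usage against a same-rack neighbor:
# check_rack_net 2 "$(packet_loss 25 peer-in-rack.example.com)"
```

Probing a neighbor in the same rack keeps the measurement off the core network, so a bad top-of-rack switch shows up as loss on every node under it.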
  • The Hadoop storage system may be difficult to maintain as the cluster grows. With 10s of petabytes spinning on 1000s of nodes, storage issues have caused major problems in the past. However, with improvements in disks and controllers, storage has been far less of an issue. We are now focused on performance gains and storage efficiency:
    – Running the latest file systems
    – Reducing inodes to recover 3% in storage
    – Improved build, fsck, and Kickstart times from reduced inodes (Hadoop may use only 1-2% of inodes; fsck time dramatically improves)
    – Old kernels have security issues
    – When a datanode goes down, the TT needs to blacklist
    – Task tracker failures on disk full
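The inode reduction described above works by raising ext4's bytes-per-inode ratio at mkfs time; HDFS stores a few large block files per disk, so far fewer inodes are needed than the mke2fs default of roughly one per 16 KiB. A sketch, in which the 1 MiB ratio and the device name are illustrative assumptions:

```shell
# Fewer inodes means less metadata to build and fsck. ext4's -i flag
# sets bytes-per-inode; a larger value yields fewer inodes. The
# 1 MiB ratio and device name are illustrative assumptions:
#
#   mkfs.ext4 -i 1048576 -m 0 /dev/sdX1   # ~1 inode per MiB, no root reserve
#
# inode_count FS_BYTES BYTES_PER_INODE: rough resulting inode count.
inode_count() {
    echo $(( $1 / $2 ))
}

# A 2 TiB data disk at the ~16 KiB default vs. a 1 MiB ratio:
inode_count $(( 2048 * 1024 * 1024 * 1024 )) 16384     # ~134 million inodes
inode_count $(( 2048 * 1024 * 1024 * 1024 )) 1048576   # ~2 million inodes
```

Two orders of magnitude fewer inodes is where the faster fsck and the reclaimed ~3% of storage come from.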
  • Bottom-up log scans are an effective method of limiting the amount of data to process and of locating just the recent issues. Some logs are large and may be stale, so keeping the information fresh and current prevents brownouts and unnecessary blacklisting. We also use ‘positive exception’ matching logic via egrep, a choice driven by receiving many false positives. We match the majority of the pattern directly with a positive column match [123]; on the ‘not’ side, we negate only a column match [^123]. We want to match what we are looking for, not what we weren’t looking for.
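The bottom-up, threshold-counted scan can be sketched as follows; the tail depth, threshold, and pattern are illustrative assumptions, since the real check drives these values from the table:

```shell
# Bottom-up scan sketch: only the newest lines of a possibly huge,
# possibly stale log are examined, and matches are counted against a
# threshold rather than alerting on the first hit. Tail depth,
# threshold, and pattern below are illustrative assumptions.

# scan_log TAIL_LINES THRESHOLD LOGFILE PATTERN
scan_log() {
    local hits
    hits=$(tail -n "$1" "$3" | grep -Ec "$4")
    if [ "$hits" -ge "$2" ]; then
        echo "ERROR ${hits} matches of '${4}' in ${3}"
    else
        echo "PASS"
    fi
}

# Positive matching: spell out most of the pattern you DO want, and
# negate at most a single column class ([^123]) rather than
# inverting the whole match with grep -v. Hypothetical usage:
# scan_log 5000 3 /var/log/hadoop/datanode.log 'ERROR.*DataXceiver'
```

tail bounds the work regardless of log size, which is what keeps the full scan within the ~2-second budget.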
  • We have two layers remaining in our health strategy, or pyramid. Underway is a management shell that will ease the process of managing lists and faults, and reduce recovery time.
  • We are currently wrapping up our management CLI that assists the Hadoop Administrators to start and stop clusters, manage lists, and perform break/fix actions, to name a few. Our goal is to reduce the time to manage and recover a cluster:
    – Improve recovery time from a crash
    – Reduce node time to repair
    – Reduce recovery from brownouts
    – Improve the ability to manage nodes based on state, without a SQL database
  • The management shell is a BASH CLI that eases the administrative functions for large clusters.
  • The last part of the pyramid is the future of integrating automation and tools where the health process provides an essential role.
  • Hadoop clusters have increased efficiency because fewer tasks fail due to node hardware faults. Long tails rarely occur due to node issues. On-call incidents have also declined because we spend less time troubleshooting stuck jobs. We are looking forward to hearing your comments.
  • Questions.

    1. 1. A cluster is only as strong as its weakest link. @DanRomike Hadoop Tooling Engineer / Configuration Manager @Twitter 1#HadoopSummit
    2. 2. Introduction • Hadoop health at Twitter: – Scope of our operation – What are some of our weak links? – What is in our checkup? – Where does our health check run? – Which faults are meaningful to us? – What is our future health strategy? – Summary of our achievements 2#HadoopSummit
    3. 3. Cluster Health Pyramid Us Tools and Jenkins A Cluster Management Shell Health Scans Management of 1000s/Nodes, 10s/Clusters 3#HadoopSummit
    4. 4. MANAGING HADOOP What we support 4#HadoopSummit
    5. 5. The Health Pyramid Us Tools and Jenkins A Cluster Management Shell Health Scans Management of 1000s/Nodes, 10s/Clusters 5#HadoopSummit
    6. 6. Clusters Data Warehouse / HBase Large number of computing jobs: 10’sk/ day High storage consumption Tripled in Size Processing Large number of computing jobs: 10’sk/ day Doubled in Size Backups HDFS Storage Doubled in Size Test Test releases Evaluate jobs 6#HadoopSummit
    7. 7. Site Operations Central Site Operations Team • Ticket based • Short repair times • Infrastructure Generally, what breaks? • PSU, LOM, BIOS, Wiring • Network Bonding • Disks, Controllers • TOR Switches • Rack Power 7#HadoopSummit
    8. 8. Our Configuration Manager Role Run Attribute 8#HadoopSummit
    9. 9. Automation Refined processes Source Control Repository Config Mgmt Puppet 9#HadoopSummit
    10. 10. Cluster Reliability Team 10 Manage Build, grow, and migrate On-boarding Migrate distcp harness Configuration Optimized properties heartbeats.in.seconds Set to cluster size Reliability Data integrity Failures, under- rep, 3-reps fsck, -report, metasave Violated,MISSING Balance Balancer rack-topology.sh Nodes LIVE, DEAD, B-LIST Break/fix Recommission HEALTH Scan Isolate issues Report failures #HadoopSummit
    11. 11. Weak Links Node Issues • Performance loss, slow • Storage failures • High CPU usage • Memory failures • Onboard network failures • Power On/Off Infrastructure Issues • Changes, adds and moves • Site power maintenance • Rack issues • Unscheduled changes • Cooling • Network infrastructure 11#HadoopSummit
    12. 12. CLUSTER HEALTH Health checks for Hadoop production environments 12#HadoopSummit
    13. 13. The Health Pyramid Us Tools and Jenkins A Cluster Management Shell Health Scans Management of 1000s/Nodes, 10s/Clusters 13#HadoopSummit
    14. 14. Health Check Mission Create and deploy a comprehensive health check that reports failing nodes, reduces impact to performance, and uses common standard tools. Fast: logs may grow quickly, avoid timeouts Adjustable: setting the right thresholds Reliable: must not cause issues or ‘brownouts’ Reusable: new tools will use status and results 14#HadoopSummit
    15. 15. Health Goals Reduce on-call incidents Reduce troubleshooting Prevent cascading failures Verify after maintenance Facilitate change and growth 15#HadoopSummit
    16. 16. Early Detection Health 1-3mins Thresholds Preset Level Blacklist ERROR,Exclude Notify Alert Monitor Threshold Alert Alerts Email Page On-Call Heartbeats It’s Alive Delays Performance Datanodes 0-3secs Tasks 0-5secs 16#HadoopSummit
    17. 17. mapred-site.xml <name>mapred.healthChecker.script.path</name> <value>/etc/hadoop/conf/healthcheck2</value> <name>mapred.healthChecker.interval</name> <value>180000</value> <name>mapred.healthChecker.script.timeout</name> <value>45000</value> 17#HadoopSummit
    18. 18. Healthy to Blacklisted PASS ERROR WARN Configure Execute Evaluate FAIL Health 18#HadoopSummit
    19. 19. FAULTS What to scan for 19#HadoopSummit
    20. 20. Faults to Detect • Network – Speed decrease – Partial rack power outages, loss of services – Rack switch packet loss – Errors/drops/retries bursts • Reported memory vs. installed memory • Induced fault: for node maintenance 20#HadoopSummit
    21. 21. More Faults • Storage – Full – Incorrect disk installed – Correct inodes per file system – File system type: ext4 – HW disk controller issues • Kernel is too old • High CPU spikes with high loads • Datanode failure 21#HadoopSummit
    22. 22. Log Checking • Which logs to check – System logs – Datanode logs – Tasktracker logs • How to check – Relevant records – Bottom up scan – Positive Pattern Matching – Use of fault counters and scan thresholds 22#HadoopSummit
    23. 23. FUTURE STRATEGY Reduce recovery time by building a management shell 23#HadoopSummit
    24. 24. The Health Pyramid Us Tools and Jenkins A Cluster Management Shell Health Scans Management of 1000s/Nodes, 10s/Clusters 24#HadoopSummit
    25. 25. Management Shell • Health Shell (CLI) maintains a working list – Refines the list as node state changes – Interactive BASH Shell is the CLI – Concurrent execution functions – Interfaces to all Hadoop admin functions – Familiar interface 25#HadoopSummit
    26. 26. Today’s Health Pyramid Us Tools and Jenkins A Cluster Management Shell Health Scans Management of 1000s/Nodes, 10s/Clusters 26#HadoopSummit
    27. 27. CONCLUSION Change weak links into strong links 27#HadoopSummit
    28. 28. Achievements • Failing nodes are blacklisted • New cluster validations • Fewer Job tails • Less intervention • Increased job throughput • Improved health 28#HadoopSummit
    29. 29. #ThankYou @DanRomike 29#HadoopSummit