
A Cluster Is Only As Strong As its Weakest Link

Early detection and correction of cluster health issues is a vital part of daily cluster management, whatever the size of the cluster. Building and managing a healthy cluster is the best way to meet service level agreements and to avoid prolonged troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early. Fortunately, simple tools and processes can keep minor problems from becoming major headaches. This talk covers how we developed, tested, and deployed a comprehensive health process based on real-life events and experience. The table-driven health check runs a full scan in ~2 seconds and includes: a checklist, ‘positive’ error pattern matching, enabling and disabling node blacklisting, logging, validating file systems, processing very large log files, trapping in-rack network faults (which adds 5 seconds to detect packet loss accurately), and recommissioning nodes into production.


  1. A cluster is only as strong as its weakest link. @DanRomike, Hadoop Tooling Engineer / Configuration Manager, @Twitter. #HadoopSummit
  2. Introduction
     • Hadoop health at Twitter:
       – Scope of our operation
       – What are some of our weak links?
       – What is in our checkup?
       – Where does our health check run?
       – Which faults are meaningful to us?
       – What is our future health strategy?
       – Summary of our achievements
  3. Cluster Health Pyramid: Us; Tools and Jenkins; A Cluster Management Shell; Health Scans; Management of 1000s of nodes, 10s of clusters
  4. MANAGING HADOOP: What we support
  5. The Health Pyramid: Us; Tools and Jenkins; A Cluster Management Shell; Health Scans; Management of 1000s of nodes, 10s of clusters
  6. Clusters
     • Data Warehouse / HBase: large number of computing jobs (10s of thousands per day); high storage consumption; tripled in size
     • Processing: large number of computing jobs (10s of thousands per day); doubled in size
     • Backups: HDFS storage; doubled in size
     • Test: test releases; evaluate jobs
  7. Site Operations
     • Central Site Operations Team
       – Ticket based
       – Short repair times
       – Infrastructure
     • Generally, what breaks?
       – PSU, LOM, BIOS, wiring
       – Network bonding
       – Disks, controllers
       – TOR switches
       – Rack power
  8. Our Configuration Manager: Role; Run; Attribute
  9. Automation: refined processes; Source Control: repository; Config Mgmt: Puppet
  10. Cluster Reliability Team
     • Manage: build, grow, and migrate; on-boarding; migrate; distcp harness
     • Configuration: optimized properties; heartbeats.in.seconds; set to cluster size
     • Reliability: data integrity; failures, under-rep, 3-reps; fsck, -report, metasave; Violated, MISSING
     • Balance: balancer; rack-topology.sh
     • Nodes: LIVE, DEAD, B-LIST; break/fix; recommission
     • HEALTH: scan; isolate issues; report failures
  11. Weak Links
     • Node Issues
       – Performance loss, slow
       – Storage failures
       – High CPU usage
       – Memory failures
       – Onboard network failures
       – Power On/Off
     • Infrastructure Issues
       – Changes, adds and moves
       – Site power maintenance
       – Rack issues
       – Unscheduled changes
       – Cooling
       – Network infrastructure
  12. CLUSTER HEALTH: Health checks for Hadoop production environments
  13. The Health Pyramid: Us; Tools and Jenkins; A Cluster Management Shell; Health Scans; Management of 1000s of nodes, 10s of clusters
  14. Health Check Mission: Create and deploy a comprehensive health check that reports failing nodes, reduces impact to performance, and uses common standard tools.
     • Fast: logs may grow quickly, avoid timeouts
     • Adjustable: setting the right thresholds
     • Reliable: must not cause issues or ‘brownouts’
     • Reusable: new tools will use status and results
  15. Health Goals
     • Reduce on-call incidents
     • Reduce troubleshooting
     • Prevent cascading failures
     • Verify after maintenance
     • Facilitate change and growth
  16. Early Detection
     • Health: 1-3 mins
       – Thresholds: preset level
       – Blacklist: ERROR, exclude
       – Notify: alert
     • Monitor: threshold alert
       – Alerts: email
       – Page: on-call
     • Heartbeats: it’s alive
       – Delays: performance
       – Datanodes: 0-3 secs
       – Tasks: 0-5 secs
  17. mapred-site.xml
     <name>mapred.healthChecker.script.path</name> <value>/etc/hadoop/conf/healthcheck2</value>
     <name>mapred.healthChecker.interval</name> <value>180000</value>
     <name>mapred.healthChecker.script.timeout</name> <value>45000</value>
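
Slide 17 points Hadoop's node health checker at a script. The checker runs that script at the configured interval (180000 ms, i.e. every 3 minutes here) and treats the node as unhealthy when the script prints a line beginning with `ERROR`. The following is a minimal Python sketch of a table-driven script in that spirit. The check names, functions, and thresholds are illustrative assumptions, not the contents of `healthcheck2`, and Python is used only for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a table-driven node health script (all checks are hypothetical)."""
import os


def check_load(max_per_cpu=4.0):
    """Return an error message if the 1-minute load per CPU is too high."""
    load1, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    if load1 / cpus > max_per_cpu:
        return f"load {load1:.1f} exceeds {max_per_cpu} per CPU"
    return None


def check_tmp_writable(path="/tmp"):
    """Return an error message if the directory is not writable."""
    if not os.access(path, os.W_OK):
        return f"{path} is not writable"
    return None


# The table: one row per check. Adding a check means adding a row.
CHECKS = [
    ("load", check_load),
    ("tmp", check_tmp_writable),
]


def run_checks(checks):
    """Run every check and return a list of (name, message) failures."""
    failures = []
    for name, func in checks:
        try:
            message = func()
        except Exception as exc:  # a crashing check counts as a failure
            message = f"check raised {exc!r}"
        if message:
            failures.append((name, message))
    return failures


def main():
    failures = run_checks(CHECKS)
    for name, message in failures:
        # Hadoop marks the node unhealthy on a line starting with ERROR.
        print(f"ERROR {name}: {message}")
    if not failures:
        print("OK")


if __name__ == "__main__":
    main()
```

Keeping the checks in a table means thresholds can be tuned, or a check disabled, by editing one row rather than the control flow.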
  18. Healthy to Blacklisted. Health: Configure, Execute, Evaluate. Outcomes: PASS, WARN, ERROR, FAIL
  19. FAULTS: What to scan for
  20. Faults to Detect
     • Network
       – Speed decrease
       – Partial rack power outages, loss of services
       – Rack switch packet loss
       – Errors/drops/retries bursts
     • Reported memory vs. installed memory
     • Induced fault: for node maintenance
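
Two of the faults on slide 20 can be sketched against Linux `/proc` text: reported memory versus the expected installed memory, and bursts of interface errors or drops between two samples. This is a hedged Python sketch, not the talk's implementation. The function names, the 5% tolerance, and the threshold of 100 new errors are assumptions made for illustration.

```python
def parse_mem_total_kb(meminfo_text):
    """Extract MemTotal (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def memory_fault(meminfo_text, expected_kb, tolerance=0.05):
    """Return a message if reported memory is well below the expected size."""
    reported = parse_mem_total_kb(meminfo_text)
    if reported < expected_kb * (1 - tolerance):
        return f"reported {reported} kB, expected about {expected_kb} kB"
    return None


def parse_net_dev(text):
    """Map interface name to (rx_errs, rx_drop, tx_errs, tx_drop)
    from /proc/net/dev-style text."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        stats[name.strip()] = (
            int(fields[2]), int(fields[3]),    # receive errs, drop
            int(fields[10]), int(fields[11]),  # transmit errs, drop
        )
    return stats


def network_fault(before, after, max_new_errors=100):
    """Compare two samples and list interfaces whose counters jumped."""
    faults = []
    for name, new in after.items():
        old = before.get(name, new)
        grown = sum(n - o for n, o in zip(new, old))
        if grown > max_new_errors:
            faults.append(f"{name}: {grown} new errors/drops")
    return faults
```

A caller would read `/proc/meminfo` once, and read `/proc/net/dev` twice a few seconds apart, then pass the text to these functions. Comparing two samples rather than absolute counters is what distinguishes a burst from old history.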
  21. More Faults
     • Storage
       – Full
       – Incorrect disk installed
       – Correct inodes per file system
       – File system type: ext4
       – HW disk controller issues
     • Kernel is too old
     • High CPU spikes with high loads
     • Datanode failure
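
Several of the storage and kernel checks on slide 21 map onto standard system calls. The Python sketch below shows one possible shape, assuming a Linux host. The helper names and the minimum kernel version passed in by the caller are hypothetical and not taken from the talk.

```python
import os


def usage_percent(path):
    """Return (block usage %, inode usage %) for the file system at path."""
    st = os.statvfs(path)
    blocks = 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks if st.f_blocks else 0.0
    inodes = 100.0 * (st.f_files - st.f_ffree) / st.f_files if st.f_files else 0.0
    return blocks, inodes


def fs_type(mount_point, mounts_text):
    """Look up a mount point's file system type in /proc/mounts-style text."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            found = fields[2]  # a later entry overrides an earlier one
    return found


def kernel_tuple(release):
    """Turn a release string such as '3.10.0-957.el7' into (3, 10, 0)."""
    numbers = []
    for part in release.split("-")[0].split("."):
        if not part.isdigit():
            break
        numbers.append(int(part))
    return tuple(numbers)


def kernel_too_old(release, minimum):
    """True if the release is older than the minimum version tuple."""
    return kernel_tuple(release) < minimum
```

On a live node, `release` would come from `platform.release()` and `mounts_text` from reading `/proc/mounts`. A result other than `ext4` from `fs_type` for a data disk would flag the file system type fault listed on the slide.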
  22. Log Checking
     • Which logs to check
       – System logs
       – Datanode logs
       – Tasktracker logs
     • How to check
       – Relevant records
       – Bottom up scan
       – Positive Pattern Matching
       – Use of fault counters and scan thresholds
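
The log-checking approach on slide 22 can be sketched as a bottom-up scan that reads only the tail of a log, counts matches per pattern, and compares the counts to thresholds. In this Python sketch, "positive pattern matching" is taken to mean matching only explicitly listed error patterns, which is an assumed reading. The pattern table and thresholds are hypothetical examples.

```python
import os
import re


def tail_lines(path, max_lines, block_size=65536):
    """Return up to max_lines last lines of a file, newest first,
    reading backwards in blocks so a large log is never read whole."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        position = handle.tell()
        buffer = b""
        while position > 0 and len(lines) < max_lines:
            step = min(block_size, position)
            position -= step
            handle.seek(position)
            buffer = handle.read(step) + buffer
            parts = buffer.split(b"\n")
            # The first piece may be an incomplete line; keep it for the next round.
            buffer = parts[0]
            for raw in reversed(parts[1:]):
                if raw:
                    lines.append(raw.decode("utf-8", "replace"))
        if buffer and len(lines) < max_lines:
            lines.append(buffer.decode("utf-8", "replace"))
    return lines[:max_lines]


# Hypothetical pattern table: fault counter name -> compiled pattern.
PATTERNS = {
    "io_error": re.compile(r"I/O error"),
    "disk_full": re.compile(r"No space left on device"),
}


def count_faults(lines, patterns):
    """Count the lines matching each named pattern."""
    counts = {name: 0 for name in patterns}
    for line in lines:
        for name, regex in patterns.items():
            if regex.search(line):
                counts[name] += 1
    return counts


def over_threshold(counts, thresholds):
    """Return the names of fault counters that reached their threshold."""
    return sorted(n for n, c in counts.items() if c >= thresholds.get(n, 1))
```

Reading from the end keeps the scan time bounded by `max_lines` rather than by the size of the log, which matters when a script timeout applies.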
  23. FUTURE STRATEGY: Reduce recovery time by building a management shell
  24. The Health Pyramid: Us; Tools and Jenkins; A Cluster Management Shell; Health Scans; Management of 1000s of nodes, 10s of clusters
  25. Management Shell
     • Health Shell (CLI) maintains a working list
       – Refines the list as node state changes
       – Interactive BASH Shell is the CLI
       – Concurrent execution functions
       – Interfaces to all Hadoop admin functions
       – Familiar interface
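
The working-list idea on slide 25 (run a function across many nodes concurrently, then narrow the list by node state) can be illustrated in a few lines. The talk's shell is an interactive BASH CLI; this Python sketch only illustrates the pattern, and `run_on_nodes`, `refine`, and the state names passed in are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_nodes(nodes, action, max_workers=16):
    """Apply action(node) to every node concurrently; return {node: result}."""
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(action, nodes))
    return dict(zip(nodes, results))


def refine(working_list, states, wanted):
    """Keep only the nodes whose current state is in the wanted set."""
    return [node for node in working_list if states.get(node) in wanted]
```

For example, `refine(nodes, run_on_nodes(nodes, probe), {"DEAD"})` would leave only the nodes for which a caller-supplied `probe` function returned `"DEAD"`, ready for the next command.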
  26. Today’s Health Pyramid: Us; Tools and Jenkins; A Cluster Management Shell; Health Scans; Management of 1000s of nodes, 10s of clusters
  27. CONCLUSION: Change weak links into strong links
  28. Achievements
     • Failing nodes are blacklisted
     • New cluster validations
     • Fewer Job tails
     • Less intervention
     • Increased job throughput
     • Improved health
  29. #ThankYou @DanRomike
