Hw09 Monitoring Best Practices

Published in: Technology
Transcript of "Hw09 Monitoring Best Practices"

  1. 1. How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters
  2. 2. Relevant Hadoop Information
      • From 3 to 3000 Nodes
      • Hardware/Software failures “common”
      • Redundant Components: DataNode, TaskTracker
      • Non-redundant Components: NameNode, JobTracker, SecondaryNameNode
      • Fast Evolving Technology (Best Practices?)
  3. 3. Monitoring Software
      • Nagios
          • Red Yellow Green Alerts, Escalations
          • De facto Standard, Widely deployed
          • Text-based configuration
          • Web Interface
          • Pluggable with shell scripts/external apps
              • Return 0 - OK
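The "Return 0 - OK" bullet refers to the exit-code convention that Nagios plugins follow. The slide shows only the OK case; the other three codes below are the standard Nagios plugin values, and the function names and threshold scheme are illustrative assumptions rather than anything from the deck.

```python
# Standard Nagios plugin exit codes; the slide only shows "Return 0 - OK".
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measurement to a Nagios exit code (higher value = worse)."""
    if value is None:
        return UNKNOWN  # the measurement could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_result(label, value, warn, crit):
    """Return (exit_code, status_line) in the usual 'STATUS - text' shape."""
    code = classify(value, warn, crit)
    return code, f"{STATUS_NAMES[code]} - {label}={value}"
```

A real plugin would print the status line and then call `sys.exit(code)`, since Nagios reads the process exit status to pick the alert colour.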
  4. 4. Cacti
      • Performance Graphing System
      • RRD/RRA Front End
      • Slick Web Interface
      • Template System for Graph Types
      • Pluggable
          • SNMP input
          • Shell script / external program
  5. 6. hadoop-cacti-jtg
      • JMX Fetching Code w/ (kick off) scripts
      • Cacti templates for Hadoop
      • Premade Nagios Check Scripts
      • Helper/Batch/automation scripts
      • Apache License
  6. 7. Hadoop JMX
  7. 8. Sample Cluster P1
      • NameNode & SecNameNode
          • Hardware RAID
          • 8 GB RAM
          • 1x Quad Core
          • DerbyDB (Hive) on SecNameNode
      • JobTracker
          • 8 GB RAM
          • 1x Quad Core
  8. 9. A Sample Cluster P2
      • Slave (hadoopdata1-XXXX)
          • JBOD 8x 1TB SATA Disk
          • RAM 16 GB
          • 2x Quad Core
  9. 10. Prerequisites
      • Nagios (install): DAG RPMs
      • Cacti (install): Several RPMs
      • Liberal network access to the cluster
  10. 11. Alerts & Escalations
      • X nodes * Y services = less sleep
      • Define a policy
          • Wake Me Up’s (SMS)
          • Don’t Wake Me Up’s (EMAIL)
          • Review (Daily, Weekly, Monthly)
  11. 12. Wake Me Up’s
      • NameNode
          • Disk Full (Big Big Headache)
          • RAID Array Issues (failed disk)
      • JobTracker
      • SecNameNode
          • You do not realize it is not working until too late
  12. 13. Don’t Wake Me Up’s
      • Or ‘Wake someone else up’
      • DataNode
          • Warning: currently a failed disk will take down the DataNode (see Jira)
      • TaskTracker
      • Hardware
          • Bad Disk (Start RMA)
      • Slaves are expendable (up to a point)
  13. 14. Monitoring Battle Plan
      • Start With the Basics
          • Ping, Disk
      • Add Hadoop Specific Alarms
          • check_data_node
      • Add JMX Graphing
          • NameNodeOperations
      • Add JMX Based Alarms
          • FilesTotal > 1,000,000 or LiveNodes < 50%
  14. 15. The Basics: Nagios
      • Nagios (All Nodes)
          • Host up (Ping check)
          • Disk % Full
          • SWAP > 85%
      • * Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
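As a rough illustration of the "SWAP > 85%" check, the sketch below computes swap usage from text in the Linux `/proc/meminfo` format and applies that threshold. The function names are made up for this example, and in practice a stock Nagios plugin would normally do this job; the point is only to show how simple the underlying arithmetic is.

```python
def swap_used_percent(meminfo_text):
    """Compute swap usage (%) from text in /proc/meminfo format."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured, so nothing can be in use
    return 100.0 * (total - fields.get("SwapFree", 0)) / total


def check_swap(meminfo_text, threshold=85.0):
    """Return a Nagios-style (exit_code, message); 0 is OK, 2 is CRITICAL."""
    used = swap_used_percent(meminfo_text)
    if used > threshold:
        return 2, f"CRITICAL - swap {used:.1f}% used"
    return 0, f"OK - swap {used:.1f}% used"
```

On a live Linux node the input would come from `open("/proc/meminfo").read()`.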
  15. 16. The Basics: Cacti
      • Cacti (All Nodes)
          • CPU (full CPU)
          • RAM/SWAP
          • Network
          • Disk Usage
  16. 17. Disk Utilization
  17. 18. RAID Tools
      • Hpacucli (not a Street Fighter move)
          • Alerts on RAID events (NameNode)
              • Disk failed
              • Rebuilding
          • JBOD (DataNode)
              • Failed Drive
              • Drive Errors
      • Dell, SUN, Vendor Specific Tools
  18. 19. Before you jump in
      • X Nodes * Y Checks = Lots of work
      • About 3 Nodes into the process …
          • Wait!!! I need some interns!!!
      • Solution: S.I.C.C.T. (Semi-Intelligent-Configuration-Cloning-Tools)
          • (I made that up)
          • (for this presentation)
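One way to picture a configuration-cloning tool is a small script that stamps out one Nagios service definition per host from a shared template. This is only a sketch of the idea, not the tooling shipped with hadoop-cacti-jtg: the host names assume a `hadoopdataN` naming pattern, and `check_remote_datanode` is a hypothetical command name modelled on the NameNode check shown later in the deck.

```python
# Shared template; doubled braces are literal braces in str.format().
SERVICE_TEMPLATE = """define service {{
    service_description  {command}
    use                  generic-service
    host_name            {host}
    check_command        {command}!{port}
}}
"""


def clone_services(hosts, command, port):
    """Render one Nagios service definition per host from the template."""
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, command=command, port=port)
        for host in hosts
    )


# Hypothetical slave names following a hadoopdataN pattern.
hosts = [f"hadoopdata{n}" for n in range(1, 4)]
config_text = clone_services(hosts, "check_remote_datanode", 50075)
```

Writing `config_text` to a file under the Nagios configuration directory replaces the node-by-node copy and paste that the slide complains about.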
  19. 20. Nagios
      • Answers “IS IT RUNNING?”
      • Text-based configuration
  20. 21. Cacti
      • Answers “HOW WELL IS IT RUNNING?”
      • Web-based configuration
          • php-cli tools
  21. 22. Monitoring Battle Plan Thus Far
      • Start With the Basics
          • Ping, Disk !!!!!!Done!!!!!!
      • Add Hadoop Specific Alarms
          • check_data_node
      • Add JMX Graphing
          • NameNodeOperations
      • Add JMX Based Alarms
          • FilesTotal > 1,000,000 or LiveNodes < 50%
  22. 23. Add Hadoop Specific Alarms
      • Hadoop Components with a Web Interface
          • NameNode 50070
          • JobTracker 50030
          • TaskTracker 50060
          • DataNode 50075
      • check_http + regex = simple + effective
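The "check_http + regex" idea amounts to fetching a component's status page and confirming that an expected string appears in it. The sketch below shows the same logic in plain Python for illustration; it does not replace the Nagios `check_http` plugin, and the function names are assumptions made for this example.

```python
import re
import urllib.request


def page_matches(body, pattern):
    """True when the regular expression is found anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_web_ui(host, port, path, pattern, timeout=10):
    """Fetch http://host:port/path and return a Nagios-style (code, message)."""
    url = f"http://{host}:{port}/{path.lstrip('/')}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return 2, f"CRITICAL - {url} unreachable: {exc}"
    if page_matches(body, pattern):
        return 0, f"OK - {url} matched /{pattern}/"
    return 2, f"CRITICAL - {url} did not match /{pattern}/"
```

For the NameNode this would be called as `check_web_ui("hadoopname1", 50070, "dfshealth.jsp", "NameNode")`, using the host, port, and page that appear in the deck's Nagios configuration.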
  23. 24. nagios_check_commands.cfg
      • Component Failure
      • (Future) Newer Hadoop will have XML status

      # OK when the page served on port $ARG1$ contains the text "NameNode"
      define command {
          command_name  check_remote_namenode
          command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
      }

      # Attach the check to the NameNode host, passing the web UI port as $ARG1$
      define service {
          service_description  check_remote_namenode
          use                  generic-service
          host_name            hadoopname1
          check_command        check_remote_namenode!50070
      }
  24. 25. Monitoring Battle Plan
      • Start With the Basics
          • Ping, Disk (Done)
      • Add Hadoop Specific Alarms
          • check_data_node (Done)
      • Add JMX Graphing
          • NameNodeOperations
      • Add JMX Based Alarms
          • FilesTotal > 1,000,000 or LiveNodes < 50%
  25. 26. JMX Graphing
      • Enable JMX
      • Import Templates
  26. 27. JMX Graphing
  27. 28. JMX Graphing
  28. 29. JMX Graphing
  29. 31. Standard Java JMX
  30. 32. Monitoring Battle Plan Thus Far
      • Start With the Basics !!!!!!Done!!!!!
          • Ping, Disk
      • Add Hadoop Specific Alarms !Done!
          • check_data_node
      • Add JMX Graphing !Done!
          • NameNodeOperations
      • Add JMX Based Alarms
          • FilesTotal > 1,000,000 or LiveNodes < 50%
  31. 33. Add JMX Based Alarms
      • hadoop-cacti-jtg is flexible
          • extend fetch classes
          • Don’t call output()
          • Write your own check logic
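"Write your own check logic" can be as small as comparing two fetched values against the thresholds from the battle plan (FilesTotal > 1,000,000 or LiveNodes < 50%). The sketch below assumes the values have already been fetched into a plain dict; the dict keys, the `expected_nodes` parameter (which must be greater than zero), and the function name are assumptions for illustration, and the real project implements its checks in Java.

```python
def check_namenode(metrics, expected_nodes, max_files=1_000_000, min_live_ratio=0.5):
    """Apply the two example thresholds to already-fetched NameNode values.

    `metrics` is a plain dict standing in for values fetched over JMX.
    Returns a Nagios-style (exit_code, message): 0 = OK, 2 = CRITICAL.
    """
    problems = []
    if metrics["FilesTotal"] > max_files:
        problems.append(f"FilesTotal {metrics['FilesTotal']} > {max_files}")
    live_ratio = metrics["LiveNodes"] / expected_nodes
    if live_ratio < min_live_ratio:
        problems.append(
            f"LiveNodes {metrics['LiveNodes']}/{expected_nodes} below {min_live_ratio:.0%}"
        )
    if problems:
        return 2, "CRITICAL - " + "; ".join(problems)
    return 0, "OK - FilesTotal and LiveNodes within limits"
```

Both thresholds are parameters because the right file-count ceiling depends on NameNode memory, which differs from cluster to cluster.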
  32. 34. Quick JMX Base Walkthrough
      • url, user, pass, object specified from CLI
      • wantedVariables, wantedOperations by inheritance
      • fetch() output() provided
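The structure described above (connection details supplied by the caller, wanted variables declared by subclasses, `fetch()` and `output()` inherited) can be sketched as follows. This is a hypothetical Python analogue, not the project's Java code: the class names, the `read_attribute` hook, the canned values, and the connection strings are all placeholders, and no real JMX connection is made.

```python
class JmxBase:
    """Hypothetical Python analogue of the base class described on the slide."""

    wanted_variables = []  # subclasses list the attributes they want

    def __init__(self, url, user, password, object_name):
        self.url, self.user, self.password = url, user, password
        self.object_name = object_name
        self.values = {}

    def read_attribute(self, name):
        """Stand-in for a real JMX read; a real version would query self.url."""
        raise NotImplementedError

    def fetch(self):
        """Collect every wanted variable into self.values."""
        self.values = {name: self.read_attribute(name) for name in self.wanted_variables}
        return self.values

    def output(self):
        """Format values as space-separated name:value pairs for graphing."""
        return " ".join(f"{name}:{value}" for name, value in self.values.items())


class FakeNameNode(JmxBase):
    """Subclass that only declares what it wants; the data here is canned."""

    wanted_variables = ["FilesTotal", "BlocksTotal"]
    canned = {"FilesTotal": 1200, "BlocksTotal": 3400}

    def read_attribute(self, name):
        return self.canned[name]


# Placeholder connection details; nothing is contacted in this sketch.
node = FakeNameNode("service:jmx:rmi:///jndi/rmi://hadoopname1:10001/jmxrmi",
                    "user", "pass", "placeholder:object=name")
node.fetch()
line = node.output()
```

A graphing script would print `line`, while an alarm subclass would call `fetch()`, skip `output()`, and run its own comparison on `values`, which matches the "Don't call output()" advice on the previous slide.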
  33. 35. Extend for NameNode
  34. 36. Extend for Nagios
  35. 37. Monitoring Battle Plan
      • Start With the Basics !DONE!
          • Ping, Disk
      • Add Hadoop Specific Alarms !DONE!
          • check_data_node
      • Add JMX Graphing !DONE!
          • NameNodeOperations
      • Add JMX Based Alarms !DONE!
          • FilesTotal > 1,000,000 or LiveNodes < 50%
  36. 38. Review
      • File System Growth
          • Size
          • Number of Files
          • Number of Blocks
          • Ratios
      • Utilization
          • CPU/Memory
          • Disk
      • Email (nightly)
          • FSCK
          • DFSADMIN
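The nightly email could be assembled along these lines: run the two commands, collect their output, and wrap it in a mail message. The slide names only FSCK and DFSADMIN, so the exact command lines (`hadoop fsck /` and `hadoop dfsadmin -report`), the subject text, and the function names are assumptions for this sketch, and sending is left to the caller.

```python
from email.message import EmailMessage
import subprocess

# Commands assumed from the slide's "FSCK" and "DFSADMIN" bullets.
REPORT_COMMANDS = [["hadoop", "fsck", "/"], ["hadoop", "dfsadmin", "-report"]]


def run_command(argv):
    """Run one command and return its combined stdout/stderr as text."""
    done = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return done.stdout


def build_report(sender, recipient, runner=run_command):
    """Build (but do not send) an email holding each command's output."""
    msg = EmailMessage()
    msg["From"], msg["To"] = sender, recipient
    msg["Subject"] = "Nightly Hadoop review: fsck and dfsadmin"
    sections = [f"$ {' '.join(argv)}\n{runner(argv)}" for argv in REPORT_COMMANDS]
    msg.set_content("\n\n".join(sections))
    return msg
```

From a nightly cron job the message could then be handed to `smtplib.SMTP(...).send_message(msg)`; the `runner` parameter exists so the report can be built in a test without a Hadoop installation.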
  37. 39. The Future
      • JMX Coming to JobTracker and TaskTracker (0.21)
          • Collect and Graph Jobs Running
          • Collect and Graph Map / Reduce per node
          • Profile Specific Jobs in Cacti?