Hw09   Monitoring Best Practices
Published in: Technology
Transcript

  • 1. How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters
  • 2. Relevant Hadoop Information
    • From 3 to 3,000 nodes
    • Hardware/software failures are “common”
    • Redundant components: DataNode, TaskTracker
    • Non-redundant components: NameNode, JobTracker, SecondaryNameNode
    • Fast Evolving Technology (Best Practices?)
  • 3. Monitoring Software
    • Nagios
      • Red/Yellow/Green alerts, escalations
      • De facto standard, widely deployed
      • Text-based configuration
      • Web Interface
      • Pluggable with shell scripts/external apps
        • Return 0 - OK
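The slide only shows the first return code. As a minimal sketch of how an external check script might report status, the block below assumes the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names `classify` and `report` and the example thresholds are illustrative, not taken from the presentation.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a numeric reading to a status code; larger readings are worse."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(label, value, warn, crit):
    """Return (exit_code, one-line status text) the way a plugin would print it."""
    code = classify(value, warn, crit)
    return code, f"{STATUS_NAMES[code]} - {label}={value}"
```

A real plugin would print the status line and then call `sys.exit(code)` so that Nagios picks up the result.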
  • 4. Cacti
    • Performance Graphing System
    • RRD/RRA Front End
    • Slick Web Interface
    • Template System for Graph Types
    • Pluggable
      • SNMP input
      • Shell script /external program
  • 5.  
  • 6. hadoop-cacti-jtg
    • JMX Fetching Code w/ (kick off) scripts
    • Cacti templates For Hadoop
    • Premade Nagios Check Scripts
    • Helper/Batch/automation scripts
    • Apache License
  • 7. Hadoop JMX
  • 8. Sample Cluster P1
    • NameNode & SecNameNode
      • Hardware RAID
      • 8 GB RAM
      • 1x QUAD CORE
      • DerbyDB (hive) on SecNameNode
    • JobTracker
      • 8GB RAM
      • 1x QUAD CORE
  • 9. Sample Cluster P2
    • Slave (hadoopdata1-XXXX)
      • JBOD 8x 1TB SATA Disk
      • RAM 16GB
      • 2x Quad Core
  • 10. Prerequisites
    • Nagios (install) DAG RPMs
    • Cacti (install) Several RPMs
    • Liberal network access to the cluster
  • 11. Alerts & Escalations
    • X nodes * Y services = less sleep
    • Define a policy
      • Wake Me Up’s (SMS)
      • Don’t Wake Me Up’s (EMAIL)
      • Review (Daily, Weekly, Monthly)
  • 12. Wake Me Up’s
    • NameNode
      • Disk Full (Big Big Headache)
      • RAID Array Issues (failed disk)
    • JobTracker
    • SecNameNode
      • You do not realize it is not working until it is too late
  • 13. Don’t Wake Me Up’s
    • Or ‘Wake someone else up’
    • DataNode
      • Warning: currently a failed disk will take down the DataNode (see Jira)
    • TaskTracker
    • Hardware
      • Bad Disk (Start RMA)
    • Slaves are expendable (up to a point)
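The two policy slides above amount to a routing table. The sketch below is one hypothetical way to encode it: the set names, the `notification_channel` function, and the 50% escalation threshold for "up to a point" are assumptions made for illustration, not something the slides specify.

```python
# Hypothetical policy table built from the two slides above.
WAKE_ME_UP = {"NameNode", "JobTracker", "SecNameNode"}
DONT_WAKE_ME_UP = {"DataNode", "TaskTracker", "Hardware"}


def notification_channel(component, dead_slave_fraction=0.0, escalate_at=0.5):
    """Pick 'sms' (wake me up) or 'email' (review later) for a failed component."""
    if component in WAKE_ME_UP:
        return "sms"
    # Slaves are expendable only up to a point: escalate when too many are down.
    if component in DONT_WAKE_ME_UP and dead_slave_fraction >= escalate_at:
        return "sms"
    return "email"
```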
  • 14. Monitoring Battle Plan
    • Start With the Basics
      • Ping, Disk
    • Add Hadoop Specific Alarms
      • check_data_node
    • Add JMX Graphing
      • NameNodeOperations
    • Add JMX Based alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 15. The Basics Nagios
    • Nagios (All Nodes)
      • Host up (Ping check)
      • Disk % Full
      • SWAP > 85 %
    • Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
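As a rough illustration of the swap alarm, the sketch below computes swap usage from the text of Linux `/proc/meminfo` and applies the 85% threshold from the slide. It assumes a Linux node and the usual `SwapTotal` / `SwapFree` fields; the function names are made up for this example.

```python
def swap_used_percent(meminfo_text):
    """Compute swap usage (percent) from the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - free) / total


def check_swap(meminfo_text, crit=85.0):
    """Return (exit_code, message) using Nagios-style codes: 0 OK, 2 CRITICAL."""
    used = swap_used_percent(meminfo_text)
    if used > crit:
        return 2, f"SWAP CRITICAL - {used:.1f}% used"
    return 0, f"SWAP OK - {used:.1f}% used"
```

On a node this could be fed with `open("/proc/meminfo").read()`.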
  • 16. The Basics Cacti
    • Cacti (All Nodes)
      • CPU (full CPU)
      • RAM/SWAP
      • Network
      • Disk Usage
  • 17. Disk Utilization
  • 18. RAID Tools
    • Hpacucli – not a Street Fighter move
      • Alerts on RAID events (NameNode)
        • Disk failed
        • Rebuilding
      • JBOD (DataNode)
        • Failed Drive
        • Drive Errors
    • Dell, SUN, Vendor Specific Tools
  • 19. Before you jump in
    • X nodes * Y checks = lots of work
    • About 3 Nodes into the process …
      • Wait!!! I need some interns!!!
    • Solution: S.I.C.C.T. (Semi-Intelligent Configuration Cloning Tools)
      • (I made that up)
      • (for this presentation)
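One way to picture configuration cloning for a text-configured tool like Nagios is to stamp out one service definition per host from a template. This is only a sketch of the idea, not the presenter's tooling: the template layout, the `clone_services` helper, and the host names and check command in the usage note are hypothetical.

```python
SERVICE_TEMPLATE = """define service {{
    service_description  {description}
    use                  generic-service
    host_name            {host}
    check_command        {command}
}}
"""


def clone_services(hosts, description, command):
    """Render one Nagios service block per host from a single template."""
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, description=description, command=command)
        for host in hosts
    )
```

For example, `clone_services([f"hadoopdata{i}" for i in range(1, 4)], "check_remote_datanode", "check_remote_datanode!50075")` yields three blocks that can be written to a `.cfg` file.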
  • 20. Nagios
    • Answers “IS IT RUNNING?”
    • Text based Configuration
  • 21. Cacti
    • Answers “HOW WELL IS IT RUNNING?”
    • Web Based configuration
      • php-cli tools
  • 22. Monitoring Battle Plan Thus Far
    • Start With the Basics
      • Ping, Disk !!!!!!Done!!!!!!
    • Add Hadoop Specific Alarms
      • check_data_node
    • Add JMX Graphing
      • NameNodeOperations
    • Add JMX Based alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 23. Add Hadoop Specific Alarms
    • Hadoop Components with a Web Interface
      • NameNode 50070
      • JobTracker 50030
      • TaskTracker 50060
      • DataNode 50075
    • check_http + regex = simple + effective
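To make the "check_http + regex" idea concrete, here is a small Python sketch of the same logic: fetch a component's status page and look for an expected string. It is an illustration of what the check does, not a replacement for the `check_http` plugin. The function names are hypothetical; the ports and the `/dfshealth.jsp` path come from the slides.

```python
import re
import urllib.request


def body_matches(body, pattern):
    """True if the page body contains a match for the regular expression."""
    return re.search(pattern, body) is not None


def check_web_ui(host, port, path, pattern, timeout=10):
    """Fetch http://host:port/path and return a Nagios-style (code, message)."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # covers URLError, HTTPError, timeouts, refused connections
        return 2, f"CRITICAL - {url} unreachable: {exc}"
    if body_matches(body, pattern):
        return 0, f"OK - {url} matched /{pattern}/"
    return 2, f"CRITICAL - {url} did not match /{pattern}/"
```

A NameNode check would then look like `check_web_ui("hadoopname1", 50070, "/dfshealth.jsp", "NameNode")`.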
  • 24. nagios_check_commands.cfg
    • Component Failure
    • (Future) Newer Hadoop will have XML status
    define command {
        command_name  check_remote_namenode
        command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
    }
    define service {
        service_description  check_remote_namenode
        use                  generic-service
        host_name            hadoopname1
        check_command        check_remote_namenode!50070
    }
  • 25. Monitoring Battle Plan
    • Start With the Basics
      • Ping, Disk (Done)
    • Add Hadoop Specific Alarms
      • check_data_node (Done)
    • Add JMX Graphing
      • NameNodeOperations
    • Add JMX Based alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 26. JMX Graphing
    • Enable JMX
    • Import Templates
  • 27. JMX Graphing
  • 28. JMX Graphing
  • 29. JMX Graphing
  • 30.  
  • 31. Standard Java JMX
  • 32. Monitoring Battle Plan Thus Far
    • Start With the Basics !!!!!!Done!!!!!
      • Ping, Disk
    • Add Hadoop Specific Alarms !Done!
      • check_data_node
    • Add JMX Graphing !Done!
      • NameNodeOperations
    • Add JMX Based alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 33. Add JMX based Alarms
    • hadoop-cacti-jtg is flexible
      • extend fetch classes
      • Don’t call output()
      • Write your own check logic
  • 34. Quick JMX Base Walkthrough
    • url, user, pass, object specified from CLI
    • wantedVariables, wantedOperations by inheritance
    • fetch() output() provided
  • 35. Extend for NameNode
  • 36. Extend for Nagios
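The transcript does not include the code from these slides. The sketch below is a Python analogy of the structure they describe, not the project's actual Java classes: a base class provides `fetch()` and `output()`, a subclass supplies the wanted variables, and a Nagios-oriented subclass skips `output()` and applies its own check logic. The class names, the `reader` callable, and the `TotalNodes` attribute are assumptions; the thresholds are the ones from the battle plan.

```python
class JMXBase:
    """Base fetcher: subclasses list the variables they want."""
    wanted_variables = ()

    def __init__(self, reader):
        # `reader` is a hypothetical callable mapping attribute name -> value,
        # standing in for a real JMX connection.
        self.reader = reader

    def fetch(self):
        return {name: self.reader(name) for name in self.wanted_variables}

    def output(self, values):
        # Cacti script inputs accept space-separated name:value pairs.
        return " ".join(f"{name}:{value}" for name, value in values.items())


class NameNodeStatus(JMXBase):
    wanted_variables = ("FilesTotal", "LiveNodes", "TotalNodes")


class NameNodeNagios(NameNodeStatus):
    def check(self, max_files=1_000_000, min_live_fraction=0.5):
        """Fetch values, skip output(), and return a Nagios-style (code, message)."""
        values = self.fetch()
        problems = []
        if values["FilesTotal"] > max_files:
            problems.append(f"FilesTotal={values['FilesTotal']}")
        total = values["TotalNodes"]
        if total and values["LiveNodes"] / total < min_live_fraction:
            problems.append(f"LiveNodes={values['LiveNodes']}/{total}")
        if problems:
            return 2, "CRITICAL - " + "; ".join(problems)
        return 0, "OK - NameNode within thresholds"
```

The same fetched values can feed Cacti through `output()` and Nagios through `check()`, which is the reuse the slides point at.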
  • 37. Monitoring Battle Plan
    • Start With the Basics !DONE!
      • Ping, Disk
    • Add Hadoop Specific Alarms !DONE!
      • check_data_node
    • Add JMX Graphing !DONE!
      • NameNodeOperations
    • Add JMX Based alarms !DONE!
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 38. Review
    • File System Growth
      • Size
      • Number of Files
      • Number of Blocks
      • Ratios
    • Utilization
      • CPU/Memory
      • Disk
    • Email (nightly)
      • FSCK
      • DFSADMIN
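The slide does not say which ratios to review. As one hedged possibility, the sketch below derives three from the size, file count, and block count listed above; the function name and the choice of ratios are assumptions.

```python
def growth_ratios(total_bytes, files, blocks):
    """Derive simple file system ratios; zero denominators yield 0.0."""
    return {
        "avg_file_bytes": total_bytes / files if files else 0.0,
        "blocks_per_file": blocks / files if files else 0.0,
        "avg_block_bytes": total_bytes / blocks if blocks else 0.0,
    }
```

Tracking these over time (for example, in the nightly email) can show whether the cluster is filling up with many small files.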
  • 39. The Future
    • JMX Coming to JobTracker and TaskTracker (0.21)
      • Collect and Graph Jobs Running
      • Collect and Graph Map / Reduce per node
      • Profile Specific Jobs in Cacti?
