Hw09 Monitoring Best Practices
 

Hw09 Monitoring Best Practices Presentation Transcript

  • 1. How to Monitor the $H!T out of Hadoop: developing a comprehensive, open approach to monitoring Hadoop clusters
  • 2. Relevant Hadoop Information
    • From 3 to 3,000 nodes
    • Hardware/software failures are “common”
    • Redundant components: DataNode, TaskTracker
    • Non-redundant components: NameNode, JobTracker, SecondaryNameNode
    • Fast-evolving technology (best practices?)
  • 3. Monitoring Software
    • Nagios
      • Red/Yellow/Green alerts and escalations
      • De facto standard – widely deployed
      • Text-based configuration
      • Web interface
      • Pluggable with shell scripts/external apps
        • Exit code 0 = OK (see the sketch below)
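Nagios reads a plugin's exit code as the check result: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. A minimal sketch of that contract in Java (the language of hadoop-cacti-jtg), using a disk-usage check as the example; the default path and the 85/95% thresholds are illustrative, not from the deck:

    import java.io.File;

    // Minimal Nagios-style plugin: the exit code carries the result.
    // 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    public class CheckDiskFull {
        public static void main(String[] args) {
            File volume = new File(args.length > 0 ? args[0] : "/");
            long total = volume.getTotalSpace();
            if (total == 0) { // path missing or unreadable
                System.out.println("DISK UNKNOWN - cannot stat " + volume);
                System.exit(3);
            }
            long pctUsed = 100 - (volume.getUsableSpace() * 100 / total);
            if (pctUsed >= 95) {
                System.out.println("DISK CRITICAL - " + pctUsed + "% used");
                System.exit(2);
            } else if (pctUsed >= 85) {
                System.out.println("DISK WARNING - " + pctUsed + "% used");
                System.exit(1);
            }
            System.out.println("DISK OK - " + pctUsed + "% used");
            System.exit(0);
        }
    }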
  • 4. Cacti
    • Performance graphing system
    • RRD/RRA front end
    • Slick web interface
    • Template system for graph types
    • Pluggable
      • SNMP input
      • Shell script/external program (see the sketch below)
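For the shell-script/external-program input, Cacti runs the command and parses its stdout; for multiple data sources the expected format is whitespace-separated name:value pairs on one line. A hedged sketch of such a feeder in Java (the fields it reports are made up for illustration):

    // Feeder for a Cacti data input method: print name:value pairs to stdout.
    // The field names (heapUsed, threads) are illustrative only.
    public class CactiJvmFeeder {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long heapUsed = rt.totalMemory() - rt.freeMemory();
            int threads = Thread.activeCount();
            // Cacti splits this line into fields on whitespace,
            // then each field into name and value on the colon.
            System.out.println("heapUsed:" + heapUsed + " threads:" + threads);
        }
    }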
  • 5. (image-only slide)
  • 6. hadoop-cacti-jtg
    • JMX-fetching code with kick-off scripts
    • Cacti templates for Hadoop
    • Premade Nagios check scripts
    • Helper/batch/automation scripts
    • Apache License
  • 7. Hadoop JMX
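Slide 7 itself was a screenshot of the MBeans a Hadoop daemon publishes. As a rough stand-in, the sketch below lists every MBean visible over a JMX remote connection. The host name reuses hadoopname1 from the Nagios example later in the deck; the port, and JMX remote access being enabled on the daemon at all, are assumptions that depend on the JVM options the daemon was started with.

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // List every MBean a daemon exposes over JMX remote.
    // Host and port are assumptions; the port must match the
    // com.sun.management.jmxremote.port setting on the daemon.
    public class ListHadoopMBeans {
        public static void main(String[] args) throws Exception {
            String url = "service:jmx:rmi:///jndi/rmi://hadoopname1:8004/jmxrmi";
            JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(url));
            try {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                Set<ObjectName> names = conn.queryNames(null, null); // null = match all
                for (ObjectName name : names) {
                    System.out.println(name);
                }
            } finally {
                jmxc.close();
            }
        }
    }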
  • 8. Sample Cluster (Part 1)
    • NameNode & SecNameNode
      • Hardware RAID
      • 8 GB RAM
      • 1x quad-core CPU
      • DerbyDB (Hive) on the SecNameNode
    • JobTracker
      • 8 GB RAM
      • 1x quad-core CPU
  • 9. Sample Cluster (Part 2)
    • Slaves (hadoopdata1-XXXX)
      • JBOD: 8x 1 TB SATA disks
      • 16 GB RAM
      • 2x quad-core CPUs
  • 10. Prerequisites
    • Nagios (install from the DAG RPMs)
    • Cacti (install from several RPMs)
    • Liberal network access to the cluster
  • 11. Alerts & Escalations
    • X nodes * Y services = less sleep
    • Define a policy
      • Wake-me-ups (SMS)
      • Don’t-wake-me-ups (email)
      • Review (daily, weekly, monthly)
  • 12. Wake-Me-Ups
    • NameNode
      • Disk full (big, big headache)
      • RAID array issues (failed disk)
    • JobTracker
    • SecNameNode
      • Don’t find out too late that it has stopped working
  • 13. Don’t-Wake-Me-Ups
    • Or “wake someone else up”
    • DataNode
      • Warning: currently a single failed disk takes down the whole DataNode (see Jira)
    • TaskTracker
    • Hardware
      • Bad disk (start the RMA)
    • Slaves are expendable (up to a point)
  • 14. Monitoring Battle Plan
    • Start With the Basics
      • Ping, Disk
    • Add Hadoop-Specific Alarms
      • check_data_node
    • Add JMX Graphing
      • NameNodeOperations
    • Add JMX-Based Alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 15. The Basics: Nagios
    • Nagios (All Nodes)
      • Host up (ping check)
      • Disk % full
      • Swap > 85%
    • * Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
  • 16. The Basics: Cacti
    • Cacti (All Nodes)
      • CPU (full CPU)
      • RAM/SWAP
      • Network
      • Disk Usage
  • 17. Disk Utilization
  • 18. RAID Tools
    • hpacucli – not a Street Fighter move
      • Alerts on RAID events (NameNode)
        • Disk failed
        • Rebuilding
      • JBOD (DataNode)
        • Failed Drive
        • Drive Errors
    • Dell, Sun, and other vendor-specific tools
  • 19. Before you jump in
    • X nodes * Y checks = lots of work
    • About 3 nodes into the process…
      • Wait!!! I need some interns!!!
    • Solution: S.I.C.C.T., Semi-Intelligent Configuration-Cloning Tools
      • (I made that up)
      • (for this presentation)
  • 20. Nagios
    • Answers “IS IT RUNNING?”
    • Text-based configuration
  • 21. Cacti
    • Answers “HOW WELL IS IT RUNNING?”
    • Web-based configuration
      • php-cli tools
  • 22. Monitoring Battle Plan Thus Far
    • Start With the Basics
      • Ping, Disk !!!!!!Done!!!!!!
    • Add Hadoop-Specific Alarms
      • check_data_node
    • Add JMX Graphing
      • NameNodeOperations
    • Add JMX-Based Alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 23. Add Hadoop-Specific Alarms
    • Hadoop Components with a Web Interface
      • NameNode 50070
      • JobTracker 50030
      • TaskTracker 50060
      • DataNode 50075
    • check_http + regex = simple + effective
  • 24. nagios_check_commands.cfg
    • Component Failure
    • (Future) Newer Hadoop will have XML status
    define command {
        command_name check_remote_namenode
        command_line $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
    }

    define service {
        service_description check_remote_namenode
        use                 generic-service
        host_name           hadoopname1
        check_command       check_remote_namenode!50070
    }
  • 25. Monitoring Battle Plan
    • Start With the Basics
      • Ping, Disk (Done)
    • Add Hadoop-Specific Alarms
      • check_data_node (Done)
    • Add JMX Graphing
      • NameNodeOperations
    • Add JMX-Based Alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 26. JMX Graphing
    • Enable JMX (e.g. via the com.sun.management.jmxremote.* JVM options in hadoop-env.sh)
    • Import the Cacti templates
  • 27. JMX Graphing (screenshot)
  • 28. JMX Graphing (screenshot)
  • 29. JMX Graphing (screenshot)
  • 30. (image-only slide)
  • 31. Standard Java JMX
  • 32. Monitoring Battle Plan Thus Far
    • Start With the Basics !!!!!!Done!!!!!
      • Ping, Disk
    • Add Hadoop-Specific Alarms !Done!
      • check_data_node
    • Add JMX Graphing !Done!
      • NameNodeOperations
    • Add JMX-Based Alarms
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 33. Add JMX-Based Alarms
    • hadoop-cacti-jtg is flexible
      • Extend the fetch classes
      • Don’t call output()
      • Write your own check logic
  • 34. Quick JMX Base Walkthrough
    • url, user, pass, and object are specified on the CLI
    • wantedVariables and wantedOperations are set by inheritance
    • fetch() and output() are provided by the base class (see the sketch below)
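The real base class ships with hadoop-cacti-jtg, and its exact API is not shown in the transcript, so the following is a simplified stand-in that mirrors only the shape the slide describes: connection details arrive from the CLI, subclasses declare wantedVariables, and fetch()/output() are provided (credentials and wantedOperations are omitted for brevity):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Simplified stand-in for the hadoop-cacti-jtg fetch base class.
    public abstract class JmxFetchBase {
        protected final Map<String, Object> results = new LinkedHashMap<String, Object>();

        // Subclasses declare which MBean attributes they want fetched.
        protected abstract String[] wantedVariables();

        public void fetch(String url, String object) throws Exception {
            JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(url));
            try {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                ObjectName name = new ObjectName(object);
                for (String attr : wantedVariables()) {
                    results.put(attr, conn.getAttribute(name, attr));
                }
            } finally {
                jmxc.close();
            }
        }

        // Cacti-friendly output: whitespace-separated name:value pairs.
        public void output() {
            StringBuilder line = new StringBuilder();
            for (Map.Entry<String, Object> e : results.entrySet()) {
                if (line.length() > 0) line.append(' ');
                line.append(e.getKey()).append(':').append(e.getValue());
            }
            System.out.println(line);
        }
    }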
  • 35. Extend for NameNode
  • 36. Extend for Nagios
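Slides 35 and 36 were code screenshots. Following slide 33's recipe (extend the fetch class, don't call output(), write your own check logic), here is a hedged sketch against the stand-in base above; the MBean object name, the location of the FilesTotal attribute, and the threshold are illustrative:

    // Nagios-style check built on the stand-in fetch base: fetch the
    // attribute, apply a threshold, and exit with a Nagios code
    // instead of calling output().
    public class CheckNameNodeFiles extends JmxFetchBase {
        protected String[] wantedVariables() {
            return new String[] { "FilesTotal" };
        }

        public static void main(String[] args) throws Exception {
            CheckNameNodeFiles check = new CheckNameNodeFiles();
            // args[0] = JMX service URL, args[1] = MBean name,
            // e.g. hadoop:service=NameNode,name=FSNamesystemState
            check.fetch(args[0], args[1]);
            long filesTotal = ((Number) check.results.get("FilesTotal")).longValue();
            if (filesTotal > 1000000L) {
                System.out.println("NAMENODE CRITICAL - FilesTotal=" + filesTotal);
                System.exit(2); // CRITICAL
            }
            System.out.println("NAMENODE OK - FilesTotal=" + filesTotal);
            System.exit(0); // OK
        }
    }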
  • 37. Monitoring Battle Plan
    • Start With the Basics !DONE!
      • Ping, Disk
    • Add Hadoop-Specific Alarms !DONE!
      • check_data_node
    • Add JMX Graphing !DONE!
      • NameNodeOperations
    • Add JMX Based alarms !DONE!
      • FilesTotal > 1,000,000 or LiveNodes < 50%
  • 38. Review
    • File System Growth
      • Size
      • Number of Files
      • Number of Blocks
      • Ratios
    • Utilization
      • CPU/Memory
      • Disk
    • Email (nightly)
      • FSCK (hadoop fsck /)
      • DFSADMIN (hadoop dfsadmin -report)
  • 39. The Future
    • JMX coming to the JobTracker and TaskTracker (0.21)
      • Collect and graph running jobs
      • Collect and graph map/reduce tasks per node
      • Profile specific jobs in Cacti?