70a monitoring & troubleshooting

Published in: Technology

  1. Monitoring and Troubleshooting (7/6/2012). © 2012 MapR Technologies
  2. Monitoring & Troubleshooting Agenda
     • Cluster Monitoring Tools
     • Troubleshooting MapReduce Jobs
     • Troubleshooting Scenarios
     • Working with MapR Support
     • Things to Avoid
  3. Monitoring & Troubleshooting Objectives
     At the end of this module you will be able to:
     • Identify the tools you can use to monitor your cluster
     • Explain how MapR central logging can help you monitor MapReduce jobs
     • Describe several common troubleshooting scenarios and how to resolve issues based on these scenarios
     • List the tools you can use to work with MapR Support
  4. Cluster Monitoring Tools
  5. Monitoring Tools
     • Built-in tools
       – MapR Control System
       – MapR Metrics
     • 3rd-party tools
       – Nagios
       – Ganglia
  6. MapR Control System
     • Dashboard with cluster overview
       – Node health
       – MapR-FS and available disks
       – Resource utilization: bandwidth, disk space, CPU
       – MapReduce job status
       – Alarms
  7. MapR Control System
  8. MapR Metrics
     • View performance information about Hadoop jobs
       – Predict cluster usage
       – Measure which jobs consume resources
       – Troubleshoot failures and performance issues
     • Metrics provided on
       – Cumulative CPU/memory usage
       – Number of running/failed tasks/attempts
       – Speed of input, output, and shuffle
       – Duration of task attempts
       – Data read, written, or shuffled
       – Memory in use
       – Number of records skipped/spilled
  9. MapR Metrics
  10. 3rd-Party Tools
      • Nagios
        – Configuration script generator
      • Ganglia
        – The CLDB reports the metrics
        – Uses MapRGangliaContext
        – gmond is only needed on the CLDB node
  11. MapR Service Logs
      • Located in /opt/mapr/logs
      • For example:
        – CLDB
        – Warden
        – FileServer (mfs)
        – NFS
  12. Troubleshooting MapReduce Jobs
  13. Central Logging
      • MapR 2.0 introduces central logging
        – Log files are written to a "local" volume on MapR-FS
          • Replication factor = 1
        – I/O is confined to the node
        – /var/mapr/local/<host>/logs/mapred/userlogs
        – Configurable via the JobTracker variable
          • mapr.localvolumes.path
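
The path pattern on the slide above can be expressed as a small helper. This is only a sketch: the host name node01 is made up, and the default base directory is the one shown on the slide, which a cluster may override through mapr.localvolumes.path.

```python
from pathlib import PurePosixPath

# Base directory as shown on the slide; a real cluster may use a different
# value, set through the mapr.localvolumes.path variable.
DEFAULT_LOCAL_VOLUMES_PATH = "/var/mapr/local"


def userlogs_dir(host: str, base: str = DEFAULT_LOCAL_VOLUMES_PATH) -> PurePosixPath:
    """Return the MapReduce userlogs directory for one TaskTracker host."""
    return PurePosixPath(base) / host / "logs" / "mapred" / "userlogs"


# "node01" is a hypothetical host name used only for illustration.
print(userlogs_dir("node01"))
```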
  14. Central Logging
      • New CLI for MapReduce logs
        maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ]
        – Creates a job-centric view of all logs on all involved TaskTracker nodes
        – Creates the following structure under <maprfsDir> for every <jobid> matching <jobPattern>
          • <jobid>/hosts/<host>/ : symbolic links to log directories of tasks executed for <jobid> on <host>
          • <jobid>/mappers/ : symbolic links to log directories of all map task attempts for <jobid> across the cluster
          • <jobid>/reducers/ : symbolic links to log directories of all reduce task attempts for <jobid> across the cluster
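
The command line and directory layout on the slide above might be assembled from a script roughly as follows. This sketch uses only the three options the slide lists (-jobid, -todir, -jobconf); the job pattern, target directory, and function names are hypothetical, and nothing is executed here.

```python
import shlex
from typing import List, Optional


def linklogs_command(job_pattern: str, to_dir: str,
                     job_conf: Optional[str] = None) -> List[str]:
    """Build the argument list for `maprcli job linklogs` as shown on the slide."""
    cmd = ["maprcli", "job", "linklogs", "-jobid", job_pattern, "-todir", to_dir]
    if job_conf is not None:
        cmd += ["-jobconf", job_conf]
    return cmd


def expected_layout(job_id: str, hosts: List[str]) -> List[str]:
    """List the directories the slide says are created under the target for one job."""
    dirs = [f"{job_id}/hosts/{host}/" for host in hosts]
    dirs += [f"{job_id}/mappers/", f"{job_id}/reducers/"]
    return dirs


# Hypothetical job pattern and MapR-FS directory, for illustration only.
cmd = linklogs_command("job_201207060001_*", "/myvolume/joblogs")
print(shlex.join(cmd))
```

On a node where maprcli is installed, the returned list could be handed to subprocess.run(cmd, check=True); building a list rather than a single string avoids shell expansion of the job pattern.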
  15. Troubleshooting Scenarios
  16. Troubleshooting Scenarios
      • Slow nodes
      • Out of memory
      • Out of disk space
      • Time skew
      • No ZooKeeper quorum
      • Contention for resources
      • Requirements not met
  17. Identifying Slow Nodes
      • Before installation:
        – Use dd to benchmark read/write speed
          • dd bs=4M if=/dev/zero of=/dev/sd<x>
        – Compare performance across nodes to test network throughput:
          • dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/foo'
      • After installation:
        – Look at task start and completion times
        – Look in system logs for memory or CPU problems
        – Look at the performance of writes to the local volume (where intermediate data goes)
      • Slow disks are identified based on a threshold in mfs.conf
        – The real cause may be a slow NIC
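
Once dd results have been collected from every node, comparing them can be automated. The sketch below flags nodes whose measured throughput falls well below the cluster median. The 50% cutoff and the sample figures are made-up illustrations, not MapR recommendations and not the mfs.conf threshold.

```python
from statistics import median
from typing import Dict, List


def slow_nodes(throughput_mb_s: Dict[str, float], fraction: float = 0.5) -> List[str]:
    """Return nodes whose throughput is below `fraction` of the cluster median."""
    if not throughput_mb_s:
        return []
    cutoff = fraction * median(throughput_mb_s.values())
    return sorted(node for node, rate in throughput_mb_s.items() if rate < cutoff)


# Hypothetical MB/s figures, e.g. copied by hand from dd output on each node.
measured = {"node01": 118.0, "node02": 121.5, "node03": 47.2, "node04": 115.9}
print(slow_nodes(measured))  # -> ['node03']
```

Comparing against the median rather than the mean keeps a single very slow node from dragging the cutoff down with it.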
  18. Out of Memory
      • Make sure there is enough swap space
      • See if a memory-intensive job is running
      • Use ulimit to make sure there are no limits on the number of file descriptors, resource usage, and the number of processes
      • Garbage collection can result in out-of-memory errors
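
The ulimit and swap checks above can also be made from a script. The following is a rough sketch for Linux nodes, using Python's standard resource module; which limits to inspect and the labels given to them are choices made here for illustration, not something the slide specifies.

```python
import resource

# Limits corresponding to the slide's "file descriptors, resource usage,
# processes"; this selection is an assumption for illustration.
LIMITS = {
    "open files": resource.RLIMIT_NOFILE,
    "processes": resource.RLIMIT_NPROC,
    "address space": resource.RLIMIT_AS,
}


def describe(value: int) -> str:
    """Render a limit value the way `ulimit` does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Return {name: (soft, hard)} for the current process."""
    return {name: tuple(describe(v) for v in resource.getrlimit(res))
            for name, res in LIMITS.items()}


def swap_total_kb(meminfo_text: str) -> int:
    """Extract SwapTotal (in kB) from the text of /proc/meminfo; 0 if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return 0


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

The limits reported are those of the process running the script, so it should be run as the same user that runs the MapR services.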
  19. Out of Disk Space
      • MapR logs go to /opt/mapr/logs
        – If this partition is too small, space can run out
        – Set up a cron job to clean out old logs
        – Move to a larger partition
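
A cron-driven cleanup could start from something like the sketch below, which only lists files older than a given age so the result can be reviewed before anything is deleted. The function name and the 7-day figure in the usage note are assumptions, not MapR guidance.

```python
import time
from pathlib import Path
from typing import List, Optional


def old_log_files(log_dir: str, max_age_days: float,
                  now: Optional[float] = None) -> List[Path]:
    """Return regular files under log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(p for p in Path(log_dir).rglob("*")
                  if p.is_file() and p.stat().st_mtime < cutoff)
```

For example, old_log_files("/opt/mapr/logs", 7) returns the candidates older than a week; a cron job could then call unlink() on each one after checking that no running service still holds the file open.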
  20. Time Skew
      • NTP is your friend
      • The maximum allowed difference between node clocks is 20 seconds
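
As an illustration of the 20-second rule, the sketch below compares clock readings gathered from several nodes at roughly the same moment and reports every pair that differs by more than the limit. How the readings are collected (for example by running date +%s over ssh) is left out, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

MAX_SKEW_SECONDS = 20.0  # limit stated on the slide


def skewed_pairs(clock_readings: Dict[str, float],
                 max_skew: float = MAX_SKEW_SECONDS) -> List[Tuple[str, str, float]]:
    """Return (node_a, node_b, skew) for each pair whose clocks differ by more than max_skew."""
    nodes = sorted(clock_readings)
    pairs = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            skew = abs(clock_readings[a] - clock_readings[b])
            if skew > max_skew:
                pairs.append((a, b, skew))
    return pairs


# Hypothetical epoch-second readings taken at about the same instant.
readings = {"node01": 1000.0, "node02": 1005.0, "node03": 1030.0}
print(skewed_pairs(readings))
```

Because the readings are never taken at exactly the same instant, the collection delay itself adds a little apparent skew, so results close to the limit deserve a second look.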
  21. No ZooKeeper Quorum
      • Not enough ZooKeepers running
      • configure.sh run improperly
        – Different ZooKeeper or CLDB nodes specified
      • Network problem
        – Hostname resolution
        – Physical connection down
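
"Not enough ZooKeepers running" has a precise meaning: a ZooKeeper ensemble needs a strict majority of its configured servers to be up. The sketch below works out that majority and adds a basic hostname-resolution check for the network case. The function names are made up for this example.

```python
import socket
from typing import Iterable, List


def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if `running` servers are enough for a majority."""
    return running >= quorum_size(ensemble_size)


def unresolvable(hosts: Iterable[str]) -> List[str]:
    """Return the host names that fail to resolve from this machine."""
    failed = []
    for host in hosts:
        try:
            socket.gethostbyname(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


print(quorum_size(3), quorum_size(5))  # -> 2 3
print(has_quorum(3, 1))                # -> False
```

With three ZooKeepers configured, losing two of them leaves the ensemble without a quorum even though one server is still running.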
  22. Contention for Resources
      • Make sure there's no limit on file descriptors or processes
      • Make sure the service layout follows good guidelines
        – Don't run ZooKeeper with CLDB or JobTracker
        – Use fewer task slots when running TaskTracker with CLDB or ZooKeeper
        – Avoid running the active JobTracker on the primary CLDB node
      • Don't run other random things on cluster nodes
      • Don't mix distributions
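
The layout guidelines above lend themselves to a simple automated check. The sketch below takes a mapping from node name to the set of services on that node and returns a warning for each guideline that appears to be violated. The role labels ("cldb", "zookeeper", "jobtracker", "tasktracker") are labels chosen for this example, and since the input does not say which JobTracker is active or which CLDB is primary, the third rule can only flag co-location for manual review.

```python
from typing import Dict, List, Set


def layout_warnings(roles_by_node: Dict[str, Set[str]]) -> List[str]:
    """Return human-readable warnings for nodes that break the slide's guidelines."""
    warnings = []
    for node in sorted(roles_by_node):
        roles = roles_by_node[node]
        if "zookeeper" in roles and roles & {"cldb", "jobtracker"}:
            warnings.append(f"{node}: ZooKeeper shares the node with CLDB or JobTracker")
        if "tasktracker" in roles and roles & {"cldb", "zookeeper"}:
            warnings.append(f"{node}: TaskTracker with CLDB or ZooKeeper, use fewer task slots")
        if "jobtracker" in roles and "cldb" in roles:
            warnings.append(f"{node}: JobTracker and CLDB together, check active/primary roles")
    return warnings


# Hypothetical three-node layout.
layout = {
    "node01": {"cldb", "zookeeper"},
    "node02": {"tasktracker", "fileserver"},
    "node03": {"tasktracker", "zookeeper"},
}
for warning in layout_warnings(layout):
    print(warning)
```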
  23. Requirements Not Met
      • Use Sun Java JDK
      • Same users/groups with the same UID/GID numbers on all nodes
      • Proper licensing
      • Host resolution between all nodes
        – DNS or /etc/hosts
      • Keyless ssh between all nodes for the root user
      • All necessary ports open
        – Watch out for iptables and SELinux
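
The UID requirement is easy to get wrong on hand-built clusters. As a rough sketch, the check below takes the contents of /etc/passwd from each node (gathering them is not shown) and reports users whose UID differs between nodes. It does not flag users that are missing from a node entirely, and the same approach would apply to /etc/group for GIDs.

```python
from collections import defaultdict
from typing import Dict


def parse_passwd(text: str) -> Dict[str, int]:
    """Map user name to UID from /etc/passwd-formatted text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip malformed or NIS-style entries that carry no numeric UID.
        if len(fields) >= 3 and fields[2].isdigit():
            users[fields[0]] = int(fields[2])
    return users


def inconsistent_uids(passwd_by_node: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Return {user: {node: uid}} for users whose UID is not the same on every node."""
    seen = defaultdict(dict)
    for node, text in passwd_by_node.items():
        for user, uid in parse_passwd(text).items():
            seen[user][node] = uid
    return {user: nodes for user, nodes in seen.items()
            if len(set(nodes.values())) > 1}
```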
  24. Working with MapR Support
  25. Working with MapR Support
      • mapr-support-collect and mapr-support-dump
      • fsck and gfsck
  26. Things to Avoid
  27. Things to Avoid
      • Removing ZooKeeper data manually
      • Formatting disks (unless you are sure)
      • Running configure.sh incorrectly
      • Using dd on an installed node
      • Modifying configuration files
        – Without a good reason
        – Inconsistently
  28. Questions
