
70a monitoring & troubleshooting


Published in: Technology

  1. Monitoring and Troubleshooting (7/6/2012, © 2012 MapR Technologies)
  2. Monitoring & Troubleshooting Agenda
       • Cluster Monitoring Tools
       • Troubleshooting MapReduce Jobs
       • Troubleshooting Scenarios
       • Working with MapR Support
       • Things to Avoid
  3. Monitoring & Troubleshooting Objectives
     At the end of this module you will be able to:
       • Identify the tools you can use to monitor your cluster
       • Explain how MapR central logging can help you monitor MapReduce jobs
       • Describe several common troubleshooting scenarios and how to resolve issues based on these scenarios
       • List the tools you can use to work with MapR Support
  4. Cluster Monitoring Tools
  5. Monitoring Tools
       • Built-In Tools
           – MapR Control System
           – MapR Metrics
       • 3rd Party Tools
           – Nagios
           – Ganglia
  6. MapR Control System
       • MapR Control System
           – Dashboard with cluster overview
               • Node health
               • MapR-FS and available disks
               • Resource utilization
                   – bandwidth
                   – disk space
                   – CPU
               • MapReduce job status
               • Alarms
  7. MapR Control System
  8. MapR Metrics
       • MapR Metrics
           – View performance information about Hadoop jobs
               • Predict cluster usage
               • Measure which jobs consume resources
               • Troubleshoot failures & performance issues
           – Metrics provided on
               • Cumulative CPU/memory usage
               • Number of running/failed tasks/attempts
               • Speed of input, output, and shuffle
               • Duration of task attempts
               • Data read, written, or shuffled
               • Memory in use
               • Number of records skipped/spilled
  9. MapR Metrics
  10. 3rd Party Tools
       • Nagios
           – Configuration script generator
       • Ganglia
           – The CLDB reports the metrics
           – MapRGangliaContext
           – gmond is only needed on the CLDB node
  11. MapR Service Logs
       • /opt/mapr/logs
       • For example:
           – CLDB
           – Warden
           – FileServer (mfs)
           – NFS
  12. Troubleshooting MapReduce Jobs
  13. Central Logging
       • MapR 2.0 introduces central logging
           – Log files written to “local” volume on MapR-FS
               • replication factor = 1
           – I/O confined to node
           – /var/mapr/local/<host>/logs/mapred/userlogs
           – Configurable via JobTracker variable
               • mapr.localvolumes.path
  14. Central Logging
       • New CLI for MapReduce logs:
         maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [-jobconf <pathToJobXml>]
           – Creates a job-centric view of all logs on all involved TaskTracker nodes
           – Creates the following structure under <maprfsDir> for every <jobid> matching <jobPattern>:
               • <jobid>/hosts/<host>/: symbolic links to log directories of tasks executed for <jobid> on <host>
               • <jobid>/mappers/: symbolic links to log directories of all map task attempts for <jobid> across the cluster
               • <jobid>/reducers/: symbolic links to log directories of all reduce task attempts for <jobid> across the cluster
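The linklogs invocation and the directory layout it produces can be sketched in Python as follows. This is only an illustration: it uses the three flags shown on the slide and nothing else, and the helper names (`linklogs_command`, `expected_layout`), the job ID, and the target directory are hypothetical.

```python
import shlex


def linklogs_command(job_pattern, to_dir, jobconf=None):
    """Build the argument list for `maprcli job linklogs`.

    Only the flags shown on the slide are used: -jobid, -todir and the
    optional -jobconf.
    """
    args = ["maprcli", "job", "linklogs", "-jobid", job_pattern, "-todir", to_dir]
    if jobconf is not None:
        args += ["-jobconf", jobconf]
    return args


def expected_layout(to_dir, job_id, hosts):
    """List the directories the slide says appear under <maprfsDir> for one job."""
    base = f"{to_dir.rstrip('/')}/{job_id}"
    per_host = [f"{base}/hosts/{host}/" for host in hosts]
    return per_host + [f"{base}/mappers/", f"{base}/reducers/"]


if __name__ == "__main__":
    # Hypothetical job ID and MapR-FS directory, for illustration only.
    cmd = linklogs_command("job_201207061200_0001", "/myvolume/joblogs")
    print(shlex.join(cmd))
    for path in expected_layout("/myvolume/joblogs", "job_201207061200_0001", ["nodeA"]):
        print(path)
```

The returned list can be passed to `subprocess.run` on a node where maprcli is installed; building it as a list avoids shell quoting problems in the job pattern.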
  15. Troubleshooting Scenarios
  16. Troubleshooting Scenarios
       • Slow nodes
       • Out of memory
       • Out of disk space
       • Time skew
       • No ZooKeeper quorum
       • Contention for resources
       • Requirements not met
  17. Identifying Slow Nodes
       • Before installation:
           – Use dd to benchmark write speed (this overwrites the target disk, so run it only before installation)
               • dd bs=4M if=/dev/zero of=/dev/sd<x>
           – Compare performance across nodes to test network throughput:
               • dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/foo'
       • After installation:
           – Look at task starting and completion times
           – Look in system logs for memory or CPU problems
           – Look at the performance of writes to the local volume (where intermediate data goes)
       • Slow disks are identified based on a threshold in mfs.conf
           – The real cause may be a slow NIC
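One way to act on "look at task starting and completion times" is to compare average task duration per node. The sketch below assumes the start and end times have already been extracted from the job logs; the function name, the input shape, and the 1.5x cutoff are all assumptions made for illustration, not MapR behavior.

```python
from statistics import mean, median


def slow_nodes(task_times, factor=1.5):
    """Return nodes whose mean task duration exceeds factor * the median node mean.

    task_times maps a node name to a list of (start, end) pairs in seconds.
    The factor is an arbitrary illustrative cutoff.
    """
    node_means = {
        node: mean(end - start for start, end in times)
        for node, times in task_times.items()
        if times
    }
    if not node_means:
        return []
    cutoff = factor * median(node_means.values())
    return sorted(node for node, avg in node_means.items() if avg > cutoff)


if __name__ == "__main__":
    # Hypothetical timings for three nodes.
    sample = {
        "nodeA": [(0, 10), (5, 15)],
        "nodeB": [(0, 11)],
        "nodeC": [(0, 40)],
    }
    print(slow_nodes(sample))
```

A node flagged this way is only a candidate: as the slide notes, the underlying cause may be the disk, memory, CPU, or the NIC.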
  18. Out of Memory
       • Make sure there is enough swap space
       • See if a memory-intensive job is running
       • Use ulimit to make sure there are no limits on the number of file descriptors, resource usage, and the number of processes
       • Garbage collection can result in out-of-memory errors
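The ulimit check can also be scripted. A minimal sketch using Python's standard `resource` module (Unix only) is shown here; it reports the limits for the current process, so it would need to run as the same user that runs the MapR services. The function name and the output shape are illustrative choices.

```python
import resource


def current_limits():
    """Report soft and hard limits for open files and processes.

    Covers two of the limits that `ulimit -n` and `ulimit -u` show.
    """
    checks = {
        "open_files": resource.RLIMIT_NOFILE,
        "processes": resource.RLIMIT_NPROC,
    }
    report = {}
    for label, limit in checks.items():
        soft, hard = resource.getrlimit(limit)
        report[label] = {
            "soft": soft,
            "hard": hard,
            "unlimited": soft == resource.RLIM_INFINITY,
        }
    return report


if __name__ == "__main__":
    for label, values in current_limits().items():
        print(label, values)
```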
  19. Out of Disk Space
       • MapR logs go to /opt/mapr/logs
           – If this partition is too small, space can run out
           – Set up a cron job to clean out old logs
           – Move to a larger partition
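A log cleanup job of the kind suggested above might look like the following sketch, which a cron entry could call. The seven-day retention period and the function names are assumptions; pick a retention period that still leaves enough history for troubleshooting, and list the candidates before deleting anything.

```python
import os
import time


def old_log_files(log_dir, max_age_days, now=None):
    """Return files under log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)


def remove_files(paths):
    """Delete the given files; kept separate so the list can be reviewed first."""
    for path in paths:
        os.remove(path)


if __name__ == "__main__":
    # Dry run: print the candidates instead of deleting them.
    for path in old_log_files("/opt/mapr/logs", max_age_days=7):
        print(path)
```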
  20. Time Skew
       • NTP is your friend
       • A 20-second differential is the maximum allowed
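The 20-second rule can be checked across nodes once a clock reading has been collected from each of them at roughly the same moment. The sketch below assumes that collection has already happened (for example over ssh); using the median reading as the reference is an illustrative choice, not something the slide specifies.

```python
from statistics import median

MAX_SKEW_SECONDS = 20  # maximum differential stated on the slide


def skewed_nodes(node_clocks, max_skew=MAX_SKEW_SECONDS):
    """Return {node: offset} for nodes more than max_skew seconds from the median.

    node_clocks maps a node name to its clock reading in epoch seconds.
    """
    if not node_clocks:
        return {}
    reference = median(node_clocks.values())
    return {
        node: clock - reference
        for node, clock in node_clocks.items()
        if abs(clock - reference) > max_skew
    }


if __name__ == "__main__":
    # Hypothetical readings: nodeC is 25 seconds ahead of the median.
    readings = {"nodeA": 1000.0, "nodeB": 1005.0, "nodeC": 1030.0}
    print(skewed_nodes(readings))
```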
  21. No ZooKeeper Quorum
       • Not enough ZooKeepers running
       • configure.sh run improperly
           – Different ZooKeeper or CLDB nodes specified
       • Network problem
           – Hostname resolution
           – Physical connection down
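"Not enough ZooKeepers running" has a precise meaning: ZooKeeper needs a strict majority of the configured ensemble to be up. A small sketch of that arithmetic follows; the function names are illustrative.

```python
def has_quorum(ensemble_size, running):
    """True when a strict majority of the configured ZooKeeper servers is running."""
    return running > ensemble_size // 2


def tolerated_failures(ensemble_size):
    """Number of servers that can be down while a majority remains."""
    return (ensemble_size - 1) // 2


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, "servers tolerate", tolerated_failures(size), "failure(s)")
```

The arithmetic also shows why odd ensemble sizes are usual: four servers tolerate no more failures than three.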
  22. Contention for Resources
       • Make sure there’s no limit on file descriptors, processes
       • Make sure the service layout follows good guidelines
           – Don’t run ZooKeeper with CLDB or JobTracker
           – Fewer task slots when running TaskTracker with CLDB or ZooKeeper
           – Avoid running the active JobTracker on the primary CLDB node
       • Don’t run other random things on cluster nodes
       • Don’t mix distributions
  23. Requirements Not Met
       • Use Sun Java JDK
       • Same users/groups with same UID/GID numbers on all nodes
       • Proper licensing
       • Host resolution between all nodes
           – DNS or /etc/hosts
       • Keyless ssh between all nodes for the root user
       • All necessary ports open
           – Watch out for iptables and selinux
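The UID/GID requirement is easy to verify once the IDs have been gathered from each node (for example by running `id` over ssh). The sketch below assumes that data is already in hand; the function name, the input shape, and the user names in the example are hypothetical.

```python
def inconsistent_ids(ids_by_node):
    """Return users whose (uid, gid) differs between nodes or is missing on some node.

    ids_by_node maps a node name to {user: (uid, gid)}. The result maps each
    problem user to {node: (uid, gid) or None}.
    """
    users = set()
    for mapping in ids_by_node.values():
        users.update(mapping)
    problems = {}
    for user in sorted(users):
        seen = {node: mapping.get(user) for node, mapping in ids_by_node.items()}
        if len(set(seen.values())) > 1:
            problems[user] = seen
    return problems


if __name__ == "__main__":
    # Hypothetical data: the second user has a different UID on node2.
    cluster = {
        "node1": {"mapr": (2000, 2000), "etl": (500, 500)},
        "node2": {"mapr": (2000, 2000), "etl": (501, 500)},
    }
    print(inconsistent_ids(cluster))
```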
  24. Working with MapR Support
  25. Working with MapR Support
       • mapr-support-collect and mapr-support-dump
       • fsck and gfsck
  26. Things to Avoid
  27. Things to Avoid
       • Removing ZooKeeper data manually
       • Formatting disks (unless you are sure)
       • Running configure.sh incorrectly
       • Using dd on an installed node
       • Modifying configuration files
           – Without a good reason
           – Inconsistently
  28. Questions
