    70a monitoring & troubleshooting Presentation Transcript

    • Monitoring and Troubleshooting, 7/6/2012, © 2012 MapR Technologies
    • Monitoring & Troubleshooting Agenda
        • Cluster Monitoring Tools
        • Troubleshooting MapReduce Jobs
        • Troubleshooting Scenarios
        • Working with MapR Support
        • Things to Avoid
    • Monitoring & Troubleshooting Objectives
        At the end of this module you will be able to:
        • Identify the tools you can use to monitor your cluster
        • Explain how MapR central logging can help you monitor MapReduce jobs
        • Describe several common troubleshooting scenarios and how to resolve issues based on these scenarios
        • List the tools you can use to work with MapR Support
    • Cluster Monitoring Tools
    • Monitoring Tools
        • Built-In Tools
            – MapR Control System
            – MapR Metrics
        • 3rd Party Tools
            – Nagios
            – Ganglia
    • MapR Control System
        – Dashboard with cluster overview
            • Node health
            • MapR-FS and available disks
            • Resource utilization
                – bandwidth
                – disk space
                – CPU
            • MapReduce job status
            • Alarms
    • MapR Control System
    • MapR Metrics
        – View performance information about Hadoop jobs
            • Predict cluster usage
            • Measure which jobs consume resources
            • Troubleshoot failures & performance issues
        – Metrics provided on
            • Cumulative CPU/memory usage
            • Number of running/failed tasks/attempts
            • Speed of input, output, and shuffle
            • Duration of task attempts
            • Data read, written, or shuffled
            • Memory in use
            • Number of records skipped/spilled
    • MapR Metrics
    • 3rd Party Tools
        • Nagios
            – Configuration script generator
        • Ganglia
            – The CLDB reports the metrics
            – MapRGangliaContext
            – gmond is only needed on the CLDB node
    • MapR Service Logs
        • /opt/mapr/logs
        • For example:
            – CLDB
            – Warden
            – FileServer (mfs)
            – NFS
    • Troubleshooting MapReduce Jobs
    • Central Logging
        • MapR 2.0 introduces central logging
            – Log files written to “local” volume on MapR-FS
                • replication factor = 1
            – I/O confined to node
            – /var/mapr/local/<host>/logs/mapred/userlogs
            – Configurable via JobTracker variable
                • mapr.localvolumes.path
    • Central Logging
        • New CLI for MapReduce logs
            maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ]
            – Creates a job-centric view of all logs on all involved TaskTracker nodes
            – Creates the following structure under <maprfsDir> for every <jobid> matching <jobPattern>
                • <jobid>/hosts/<host>/ – symbolic links to log directories of tasks executed for <jobid> on <host>
                • <jobid>/mappers/ – symbolic links to log directories of all map task attempts for <jobid> across the cluster
                • <jobid>/reducers/ – symbolic links to log directories of all reduce task attempts for <jobid> across the cluster
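
The linklogs syntax above can be sketched as a small Python helper that assembles the command line and lists the directories the slide says will appear. This is only an illustration: the helper names, the job ID, and the target directory are made up, and nothing here actually runs maprcli.

```python
from typing import List, Optional


def linklogs_command(job_pattern: str, todir: str,
                     jobconf: Optional[str] = None) -> List[str]:
    """Build the argv for `maprcli job linklogs` using only the options shown on the slide."""
    argv = ["maprcli", "job", "linklogs", "-jobid", job_pattern, "-todir", todir]
    if jobconf is not None:
        # -jobconf is optional in the documented syntax.
        argv += ["-jobconf", jobconf]
    return argv


def expected_layout(todir: str, jobid: str, hosts: List[str]) -> List[str]:
    """Directories the slide describes under <maprfsDir> for a single job."""
    base = f"{todir.rstrip('/')}/{jobid}"
    dirs = [f"{base}/hosts/{host}/" for host in hosts]
    dirs += [f"{base}/mappers/", f"{base}/reducers/"]
    return dirs


if __name__ == "__main__":
    # Hypothetical job ID, host names, and target directory.
    print(" ".join(linklogs_command("job_201207060001_0001", "/joblogs")))
    for path in expected_layout("/joblogs", "job_201207060001_0001", ["node1", "node2"]):
        print(path)
```

Passing the returned list to a process launcher, rather than joining it into a shell string, avoids quoting problems if the pattern contains shell metacharacters.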
    • Troubleshooting Scenarios
    • Troubleshooting Scenarios
        • Slow nodes
        • Out of memory
        • Out of disk space
        • Time skew
        • No ZooKeeper quorum
        • Contention for resources
        • Requirements not met
    • Identifying Slow Nodes
        • Before installation:
            – Use dd to benchmark read/write speed (destructive: this overwrites the target disk)
                • dd bs=4M if=/dev/zero of=/dev/sd<x>
            – Compare performance across nodes to test network throughput:
                • dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/foo'
        • After installation:
            – Look at task starting and completion times
            – Look in system logs for memory or CPU problems
            – Look at the performance of writes to the local volume (where intermediate data goes)
        • Slow disks are identified based on a threshold in mfs.conf
            – The real cause may be a slow NIC
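
One way to act on "look at task starting and completion times" is to compare per-node task durations. The sketch below is an assumption-laden illustration, not a MapR tool: the input format, the function name, and the 1.5x threshold are all invented, and the durations would have to be gathered from job history or task logs by some other means.

```python
from statistics import median
from typing import Dict, List


def slow_nodes(durations: Dict[str, List[float]], factor: float = 1.5) -> List[str]:
    """Return nodes whose median task duration exceeds `factor` times
    the median of all per-node medians."""
    # Ignore nodes that reported no tasks.
    per_node = {node: median(values) for node, values in durations.items() if values}
    if not per_node:
        return []
    cluster_median = median(per_node.values())
    return sorted(node for node, m in per_node.items() if m > factor * cluster_median)


if __name__ == "__main__":
    # Hypothetical task durations in seconds, keyed by node name.
    sample = {
        "node1": [10, 12, 11],
        "node2": [11, 10, 13],
        "node3": [30, 28, 35],
    }
    print(slow_nodes(sample))
```

Using medians rather than means keeps a single straggling task from marking an otherwise healthy node as slow.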
    • Out of Memory
        • Make sure there is enough swap space
        • See if a memory-intensive job is running
        • Use ulimit to make sure there are no limits on the number of file descriptors, resource usage, and the number of processes
        • Garbage collection can result in out-of-memory errors
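
The ulimit and swap checks above can also be scripted. The following is a minimal sketch for a Linux node, using Python's standard `resource` module for the limits of the current user and a small parser for `/proc/meminfo`-style text; the function names are illustrative, and the limits that matter are those of the user running the MapR services.

```python
import resource
from typing import Dict, Tuple


def current_limits() -> Dict[str, Tuple[int, int]]:
    """Return (soft, hard) limits on open files and processes for the current user."""
    return {
        "open_files": resource.getrlimit(resource.RLIMIT_NOFILE),
        "processes": resource.getrlimit(resource.RLIMIT_NPROC),
    }


def describe(limit: int) -> str:
    """Render a limit the way `ulimit` does for unlimited values."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def swap_total_kb(meminfo_text: str) -> int:
    """Extract SwapTotal (in kB) from /proc/meminfo-style text; 0 if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return 0


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={describe(soft)} hard={describe(hard)}")
    with open("/proc/meminfo") as handle:
        print("swap kB:", swap_total_kb(handle.read()))
```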
    • Out of Disk Space
        • MapR logs go to /opt/mapr/logs
            – If this partition is too small, space can run out
            – Set up a cron job to clean out old logs
            – Move to a larger partition
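
A cron job for cleaning out old logs could call something like the sketch below. The retention period, the function names, and the choice to look only at regular files directly under the directory are assumptions; it defaults to a dry run so nothing is deleted until the list has been reviewed.

```python
import os
import tempfile
import time
from typing import List, Optional


def old_logs(log_dir: str, max_age_days: float, now: Optional[float] = None) -> List[str]:
    """Return regular files directly under log_dir not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in os.scandir(log_dir):
        # Skip directories and symlinks; only plain files are candidates.
        if entry.is_file(follow_symlinks=False) and \
                entry.stat(follow_symlinks=False).st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)


def clean(log_dir: str, max_age_days: float, dry_run: bool = True) -> List[str]:
    """List stale files and delete them unless dry_run is set."""
    stale = old_logs(log_dir, max_age_days)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale


if __name__ == "__main__":
    # Demonstrate on a scratch directory rather than a real log directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        path = os.path.join(demo_dir, "old.log")
        open(path, "w").close()
        ten_days_ago = time.time() - 10 * 86400
        os.utime(path, (ten_days_ago, ten_days_ago))
        print(clean(demo_dir, max_age_days=7))
```

Before pointing this at /opt/mapr/logs, it is worth confirming that the retention period still leaves enough history for any open support case.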
    • Time Skew
        • NTP is your friend
        • A differential of 20 seconds is the maximum allowed
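
The 20-second rule can be checked once each node's clock reading has been collected. In this sketch the readings are assumed to be epoch seconds gathered at roughly the same moment (how they are gathered is left open), and comparing against a single reference node is an assumption about how the limit is applied.

```python
from typing import Dict, List

MAX_SKEW_SECONDS = 20  # limit stated on the slide


def max_skew(clocks: Dict[str, float]) -> float:
    """Largest difference between any two node clocks, in seconds."""
    if not clocks:
        return 0.0
    return max(clocks.values()) - min(clocks.values())


def skewed_nodes(clocks: Dict[str, float], reference: str,
                 limit: float = MAX_SKEW_SECONDS) -> List[str]:
    """Nodes whose clock differs from the reference node by more than `limit` seconds."""
    ref_time = clocks[reference]
    return sorted(node for node, t in clocks.items() if abs(t - ref_time) > limit)


if __name__ == "__main__":
    # Hypothetical readings in epoch seconds.
    readings = {"node1": 1000.0, "node2": 1005.0, "node3": 1030.0}
    print(max_skew(readings), skewed_nodes(readings, "node1"))
```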
    • No ZooKeeper Quorum
        • Not enough ZooKeepers running
        • configure.sh run improperly
            – Different ZooKeeper or CLDB nodes specified
        • Network problem
            – Hostname resolution
            – Physical connection down
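
"Not enough ZooKeepers running" has a precise meaning: a ZooKeeper ensemble needs a strict majority of its configured servers to be up. A small sketch of that arithmetic (function names are illustrative):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of running servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if `running` servers out of `ensemble_size` configured ones form a quorum."""
    return running >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for size in (1, 3, 5):
        print(size, "servers need", quorum_size(size), "running")
```

A three-node ensemble therefore tolerates one failure and a five-node ensemble tolerates two, which is why even ensemble sizes add no extra fault tolerance.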
    • Contention for Resources
        • Make sure there’s no limit on file descriptors or processes
        • Make sure the service layout follows good guidelines
            – Don’t run ZooKeeper with CLDB or JobTracker
            – Configure fewer task slots when running TaskTracker with CLDB or ZooKeeper
            – Avoid running the active JobTracker on the primary CLDB node
        • Don’t run unrelated software on cluster nodes
        • Don’t mix distributions
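
The first two layout guidelines can be checked mechanically from a description of which services run where. This is a sketch only: the service labels, the input format, and the warning wording are made up, and the third guideline (active JobTracker versus primary CLDB) is left out because it depends on runtime state rather than on the static layout.

```python
from typing import Dict, List, Set


def layout_warnings(layout: Dict[str, Set[str]]) -> List[str]:
    """Flag nodes whose service mix conflicts with the guidelines on the slide."""
    warnings = []
    for node in sorted(layout):
        services = layout[node]
        # Guideline: don't run ZooKeeper with CLDB or JobTracker.
        if "zookeeper" in services:
            for other in ("cldb", "jobtracker"):
                if other in services:
                    warnings.append(f"{node}: zookeeper shares a node with {other}")
        # Guideline: fewer task slots when TaskTracker runs with CLDB or ZooKeeper.
        if "tasktracker" in services and services & {"cldb", "zookeeper"}:
            warnings.append(
                f"{node}: reduce task slots (tasktracker runs with cldb or zookeeper)")
    return warnings


if __name__ == "__main__":
    # Hypothetical three-node layout.
    cluster = {
        "node1": {"zookeeper", "cldb"},
        "node2": {"tasktracker", "fileserver"},
        "node3": {"tasktracker", "cldb"},
    }
    for warning in layout_warnings(cluster):
        print(warning)
```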
    • Requirements Not Met
        • Use the Sun Java JDK
        • Same users/groups with the same UID/GID numbers on all nodes
        • Proper licensing
        • Host resolution between all nodes
            – DNS or /etc/hosts
        • Passwordless (key-based) ssh between all nodes for the root user
        • All necessary ports open
            – Watch out for iptables and SELinux
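
The UID/GID requirement is easy to verify once the contents of /etc/passwd have been collected from each node. The sketch below assumes that text is already in hand (collection is not shown), uses invented user and node names, and compares only the UID and primary GID fields of the passwd file, not /etc/group.

```python
from typing import Dict, List, Optional, Tuple

Ids = Optional[Tuple[int, int]]


def parse_passwd(text: str) -> Dict[str, Tuple[int, int]]:
    """Map user name to (uid, gid) from /etc/passwd-style text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 4:
            users[fields[0]] = (int(fields[2]), int(fields[3]))
    return users


def id_mismatches(passwd_by_node: Dict[str, str],
                  users: List[str]) -> Dict[str, Dict[str, Ids]]:
    """For each listed user, report per-node (uid, gid) if the nodes disagree.
    A user missing from a node shows up as None."""
    parsed = {node: parse_passwd(text) for node, text in passwd_by_node.items()}
    result = {}
    for user in users:
        ids = {node: table.get(user) for node, table in parsed.items()}
        if len(set(ids.values())) > 1:
            result[user] = ids
    return result


if __name__ == "__main__":
    # Hypothetical passwd excerpts from two nodes.
    nodes = {
        "node1": "mapr:x:2000:2000::/home/mapr:/bin/bash\nalice:x:1001:1001::/home/alice:/bin/bash",
        "node2": "mapr:x:2000:2000::/home/mapr:/bin/bash\nalice:x:1002:1001::/home/alice:/bin/bash",
    }
    print(id_mismatches(nodes, ["mapr", "alice"]))
```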
    • Working with MapR Support
    • Working with MapR Support
        • mapr-support-collect and mapr-support-dump
        • fsck and gfsck
    • Things to Avoid
    • Things to Avoid
        • Removing ZooKeeper data manually
        • Formatting disks (unless you are sure)
        • Running configure.sh incorrectly
        • Using dd on an installed node
        • Modifying configuration files
            – Without a good reason
            – Inconsistently
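
Inconsistent configuration edits can be spotted by hashing the same file from every node and grouping nodes by digest. This sketch assumes the file contents have already been fetched and that the file is meant to be identical everywhere; the function names and sample contents are illustrative.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_by_digest(contents_by_node: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group node names by the SHA-256 digest of their copy of a file."""
    groups = defaultdict(list)
    for node in sorted(contents_by_node):
        digest = hashlib.sha256(contents_by_node[node]).hexdigest()
        groups[digest].append(node)
    return dict(groups)


def is_consistent(contents_by_node: Dict[str, bytes]) -> bool:
    """True if every node holds byte-identical content."""
    return len(group_by_digest(contents_by_node)) <= 1


if __name__ == "__main__":
    # Hypothetical copies of one configuration file from three nodes.
    copies = {"node1": b"x=1\n", "node2": b"x=1\n", "node3": b"x=2\n"}
    for digest, nodes in group_by_digest(copies).items():
        print(digest[:12], nodes)
```

When more than one group appears, the smallest group usually points at the node that was edited by hand.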
    • Questions