Profile Your Hadoop Jobs

  1. Profiling Hadoop Applications, Basant Verma
  2. Agenda
     • Profiling general background
     • Available options
     • Profile using free and open source tools
     • Profile using YourKit
     • Other troubleshooting tools
  3. What does Profiling Provide?
     • Profiling runtime / CPU usage:
       – which lines of code the program spends the most time in
       – which call/invocation paths were used to reach these lines (naturally represented as tree structures)
     • Profiling memory usage:
       – what kinds of objects are sitting on the heap
       – where they were allocated
       – who is pointing to them now
       – memory leaks
  4. Profiler Types and Components
     • Components needed for profiling
       – Profiling agent: collects profiled data (samples, traces, exceptions, etc.)
       – Analysis tool: provides an interface for analyzing profiled data and helps the user identify potential problems
     • Types of profilers
       – insertion
       – sampling
       – instrumenting
  5. Available Options
     • Sun JDK tools
       – hprof: profiler (uses JVMTI)
       – jmap: provides a memory map (heap dump)
       – jhat: analyzes a heap dump
       – jstack: provides a thread dump
       – jvisualvm: GUI-based profile data analyzer
     • Open source
       – VisualVM (same as jvisualvm, but downloaded as an independent app)
         • Uses HPROF internally for profiling. Provides a GUI for analyzing heap dumps and profiler output
       – NetBeans Profiler
         • Similar to VisualVM, but integrated into the IDE
       – Eclipse MAT (Memory Analyzer Tool)
         • Can load .hprof files
     • Commercial
       – YourKit
       – JProfiler
  7. Official hprof Documentation
     usage: java -Xrunhprof:[help]|[<option>=<value>, ...]

     Option Name and Value  Description                     Default
     ---------------------  -----------                     -------
     heap=dump|sites|all    heap profiling                  all
     cpu=samples|times|old  CPU usage                       off
     monitor=y|n            monitor contention              n
     format=a|b             text(txt) or binary output      a
     file=<file>            write data to file              java.hprof[.txt]
     depth=<size>           stack trace depth               4
     interval=<ms>          sample interval in ms           10
     cutoff=<value>         output cutoff point             0.0001
     lineno=y|n             line number in traces?          y
     thread=y|n             thread in traces?               n
     doe=y|n                dump on exit?                   y
     msa=y|n                Solaris micro state accounting  n
     force=y|n              force output to <file>          y
     verbose=y|n            print messages about dumps      y
  8. Sample hprof usage
     • To measure CPU usage, try the following:
         java -Xrunhprof:cpu=samples,depth=6,heap=dump
     • Settings:
       – Takes samples of CPU execution
       – Records call traces that include the last 6 levels of the stack
       – Dumps the heap (bigger file size, but helps in finding problems)
     • Creates the file java.hprof.txt in the current directory
  9. HPROF with Hadoop
     • Hadoop uses hprof as the default profiler
     • Profiling-related parameters:

     Purpose                             JobConf API              Command-line parameter
     ----------------------------------  -----------------------  --------------------------
     Enable profiling                    setProfileEnabled(true)  mapred.task.profile=true
     Additional parameters for profiler  setProfileParams(…)      mapred.task.profile.params
     Range of sampled tasks to profile   setProfileTaskRange      mapred.task.profile.maps
                                                                  mapred.task.profile.reduces
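To make the task-range parameters concrete, the sketch below expands a range string such as "0-2" into the task indices it selects. `TaskRange` and its methods are hypothetical names written for illustration; this is not Hadoop's own parser, and it assumes a spec made of comma-separated `start-end` pairs or single indices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: expands a range string such as "0-2,5" into task indices. */
public class TaskRange {

    /** Returns every index covered by comma-separated "start-end" or single-index terms. */
    public static List<Integer> expand(String spec) {
        List<Integer> indices = new ArrayList<>();
        for (String term : spec.split(",")) {
            String t = term.trim();
            if (t.isEmpty()) {
                continue; // tolerate stray commas
            }
            int dash = t.indexOf('-');
            int start = Integer.parseInt(dash < 0 ? t : t.substring(0, dash));
            int end = dash < 0 ? start : Integer.parseInt(t.substring(dash + 1));
            for (int i = start; i <= end; i++) {
                indices.add(i);
            }
        }
        return indices;
    }

    /** True if the task with the given index falls inside the spec. */
    public static boolean isProfiled(String spec, int taskIndex) {
        return expand(spec).contains(taskIndex);
    }

    public static void main(String[] args) {
        System.out.println("maps 0-2    -> " + expand("0-2"));
        System.out.println("reduces 0-1 -> " + expand("0-1"));
        System.out.println("map 3 profiled? " + isProfiled("0-2", 3));
    }
}
```

Read this way, mapred.task.profile.maps=0-2 would profile only the first three map tasks, which keeps profiling overhead away from the rest of the job.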
  10. Example
     • Using the Java API:
         jobConf.setProfileEnabled(true);
         jobConf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites"
             + ",depth=4,thread=y,file=%s");
         jobConf.setProfileTaskRange(true, "0-2");   // map tasks 0-2
         jobConf.setProfileTaskRange(false, "0-1");  // reduce tasks 0-1
     • Using command-line parameters:
         hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
           -Dmapred.task.profile=true \
           -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,file=%s \
           -Dmapred.task.profile.maps=0-2 \
           -Dmapred.task.profile.reduces=0-1 \
           input output
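Because the profiler parameter string is easy to mistype (a stray space or a curly quote is enough to break it), one option is to assemble it from option/value pairs. The sketch below is a minimal illustration; `HprofParams` is a hypothetical helper, not part of Hadoop or the JDK, and it only joins strings without validating the options.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that assembles an hprof agent string from option/value pairs. */
public class HprofParams {

    /** Joins options as "-agentlib:hprof=k1=v1,k2=v2,..." in insertion order. */
    public static String build(Map<String, String> options) {
        StringJoiner joined = new StringJoiner(",", "-agentlib:hprof=", "");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joined.add(e.getKey() + "=" + e.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("cpu", "samples");
        options.put("heap", "sites");
        options.put("depth", "4");
        options.put("thread", "y");
        options.put("file", "%s"); // placeholder kept literally, as in the example above
        System.out.println(build(options));
    }
}
```

The resulting string could then be passed to setProfileParams(…) or to -Dmapred.task.profile.params, with the %s placeholder left intact as in the slide's example.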
  11. Collecting Profiler Output
     • The Hadoop JobClient automatically downloads profile logs from all profiled tasks
       – If the output format is not specified, hprof creates profile output in text format (format=a)
     • Profiler output is also available via the History web UI
     • You can also download profile output using curl
       – curl -o attempt_201305161037_0004_m_000000_0.hprof " 201305161037_0004_m_000000_0&filter=profile"
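Once a text-format profile (format=a) has been downloaded, the hottest methods can be pulled out of its CPU SAMPLES section with a few lines of code. The sketch below assumes the section is delimited by "CPU SAMPLES BEGIN" and "CPU SAMPLES END" lines and that each row has the columns rank, self, accum, count, trace, method; check this against your own output before relying on it. `HprofCpuSamples` is a hypothetical class, and the sample input in `main` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: extracts sample counts per method from the CPU SAMPLES section of hprof text output. */
public class HprofCpuSamples {

    /** One row of the table: sample count and fully qualified method name. */
    public static final class Row {
        public final int count;
        public final String method;

        public Row(int count, String method) {
            this.count = count;
            this.method = method;
        }
    }

    /** Reads rows between "CPU SAMPLES BEGIN" and "CPU SAMPLES END" (assumed markers). */
    public static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        boolean inSection = false;
        for (String line : lines) {
            if (line.startsWith("CPU SAMPLES BEGIN")) {
                inSection = true;
            } else if (line.startsWith("CPU SAMPLES END")) {
                break;
            } else if (inSection) {
                // Assumed columns: rank self accum count trace method
                String[] cols = line.trim().split("\\s+");
                if (cols.length == 6 && cols[0].matches("\\d+")) {
                    rows.add(new Row(Integer.parseInt(cols[3]), cols[5]));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration only; not taken from a real job.
        List<String> sample = Arrays.asList(
            "CPU SAMPLES BEGIN (total = 10)",
            "rank   self  accum   count trace method",
            "   1 70.00% 70.00%       7 300001 com.example.WordMapper.map",
            "   2 30.00% 100.00%      3 300002 com.example.SumReducer.reduce",
            "CPU SAMPLES END");
        for (Row r : parse(sample)) {
            System.out.println(r.count + " samples in " + r.method);
        }
    }
}
```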
  12. Task User Log
  13. Analyze Profiler Output
     • You can use VisualVM, the NetBeans profiler, or YourKit to analyze the profiling data.
       – These tools support only the binary format of hprof output (i.e. option format=b)
     • Example
       – Run the profiler with a Hadoop job:
           hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
             -Dmapred.task.profile=true \
             -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,format=b,file=%s \
             input output
       – Load the profiler output using the VisualVM menu option
  14. Analyze Profile Output in VisualVM
  15. Object Query Language
     • VisualVM and jhat support a special query language (OQL) for querying the Java heap.
       – Example: select all Strings with length 1K or more
           select s from java.lang.String s where s.count >= 1024
     • More information about OQL is available at
  16. Analyze Profile Output in Eclipse MAT
  17. Profiling Pig Jobs
     • Use the Hadoop command-line parameters:
         pig -Dmapred.task.profile=true \
           -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,thread=y,verbose=n \
           -Dmapred.task.profile.maps=0-2 \
           -Dmapred.task.profile.reduces=0-0 \
           mypigscript.pig
     • More information about Pig job profiling is available at the Pig Wiki
  18. Profiling Hive Queries
     • Set the appropriate Hadoop parameters before submitting the queries:
         hive> set mapred.task.profile=true;
         hive> set mapred.task.profile.params=-agentlib:hprof=heap=dump,format=b,file=%s;
         hive> set mapred.task.profile.maps=0-2;
         hive> set mapred.task.profile.reduces=0-0;
         hive> <hive query>
  20. YourKit Profiler - Summary
     • Commercial Java profiling tool
       – Free trial and open source licenses are available
     • Used by many open source projects, including Hadoop, Pig, and Hive
     • Features
       – On-demand profiling
       – CPU, memory, and concurrency profiling methods
       – IDE integration (Eclipse, NetBeans, IntelliJ)
       – Above all, relatively low performance overhead
  21. Using YourKit Profiler
     • You will need to install the YourKit profiler (just the profiler library) on each TaskTracker
     • Tell Hadoop to use a different profiler:
         hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
           -Dmapred.task.profile=true \
           -Dmapred.task.profile.params=-agentpath:<yourkit_path>/libyjpagent.jnilib=dir=/tmp/yourkit_snapnshot,sampling,disablej2ee \
           -Dmapred.task.profile.maps=0-2 \
           -Dmapred.task.profile.reduces=0-1 \
           input output
     • Theoretically, you can also use DistributedCache to make the binaries available on the TaskTracker machines
       – Though I did not have success with this
  22. Small Glitch
     • Hadoop JobClient.waitforCompletion(…) throws an error because the profile logs are not available in the default directory.
     • However, the job itself continues to run successfully.
     • To avoid this, you can instead use option to specify the profiling parameters
  23. YourKit to Analyze Jobs
     • Can analyze profile output from both the YourKit profiler and hprof/jmap.
  24. OTHER TOOLS
  25. Using other Tools
     • JDK tool 'jmap'
       – Can capture a heap dump of a running Java process for later analysis in VisualVM or YourKit
         • $ jmap -dump:live,format=b,file=xyz.hprof <jvm-pid>
         • Don't run jmap with the -histo:live option on the JobTracker (JT) or NameNode (NN)
       – A Java process can also be instructed to generate an hprof heap dump in case of an OutOfMemoryError
         • -XX:+HeapDumpOnOutOfMemoryError
     • JDK tool 'jhat'
       – Can read a heap dump in hprof format and provides a lightweight web interface for analyzing it
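As a complement to jmap, a JVM can also write an hprof-format heap dump of itself from code. The sketch below is a different technique from the jmap command above: it uses the HotSpot-specific `HotSpotDiagnosticMXBean` and therefore assumes a HotSpot-based JDK. `HeapDumper` and the output file name are hypothetical choices for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;

/** Sketch: writes an hprof-format heap dump of the current JVM (HotSpot only). */
public class HeapDumper {

    /** Dumps the heap to the given path; liveOnly=true keeps only reachable objects. */
    public static File dump(String path, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, liveOnly); // throws IOException if the file already exists
        return new File(path);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location; newer JDKs expect the ".hprof" suffix.
        String path = System.getProperty("java.io.tmpdir") + File.separator
            + "sketch-" + System.nanoTime() + ".hprof";
        File out = dump(path, true);
        System.out.println("Wrote " + out.length() + " bytes to " + out);
    }
}
```

The resulting file can be opened in the same tools as a jmap dump. Dumping only live objects typically triggers a garbage collection first, so the same caution about busy master processes applies.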
  26. Other Tools (Cont…)
     • Hadoop Vaidya (simple diagnostic tool)
       – Identifies common performance problems in Hadoop jobs (unbalanced partitioning, granularity of tasks, combiners, etc.)
       – Works only at the level of the Hadoop job (does not understand the specifics of Hive/Pig)
  27. Other Recommendation
     • If possible, try running Hadoop (MR/Pig/Hive) in local mode using LocalJobRunner
       – LocalJobRunner runs the entire MapReduce job in a single JVM
       – It simplifies profiling and log collection
       – It can also be used for attaching a debugger from an IDE
  28. Resources
     • Troubleshooting Java applications
     • Profile Hadoop Job (Chapter 5, "Hadoop: The Definitive Guide")
       – 1974/tuning-a-job/id3545664
     • Profiling Pig jobs
     • hprof official documentation
     • YourKit Profiler