Debugging Apache Hadoop YARN Cluster in Production
Jian He, Junping Du and Xuan Gong
Hortonworks YARN Team
06/30/2016
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

  1. Debugging Apache Hadoop YARN Cluster in Production. Jian He, Junping Du and Xuan Gong, Hortonworks YARN Team, 06/30/2016
  2. Who are We
     - Junping Du
       - Apache Hadoop Committer and PMC Member
       - Dev Lead in Hortonworks YARN team
     - Xuan Gong
       - Apache Hadoop Committer and PMC Member
       - Software Engineer
     - Jian He
       - Apache Hadoop Committer and PMC Member
       - Staff Software Engineer
  3. Today’s Agenda
     - YARN in a Nutshell
     - Trouble-shooting Process and Tools
     - Case Study
     - Enhanced YARN Log Tool Demo
     - Summary and Future
  4. Agenda: YARN in a Nutshell
  5. YARN Architecture
     - ResourceManager
     - NodeManager
     - ApplicationMaster
     - Other daemons:
       - Application History/Timeline Server
       - Job History Server (for MR only)
       - Proxy Server
       - Etc.
  6. RM and NM in a nutshell
  7. Agenda: YARN in a Nutshell, Trouble-shooting Process and Tools
  8. “Troubles” that start a troubleshooting effort on a YARN cluster
     - Applications failed
     - Applications hang or run slowly
     - YARN configuration doesn’t work
     - YARN APIs (CLI, WebService, etc.) don’t work
     - YARN daemons crashed (OOM issue, etc.)
     - YARN daemons’ logs have errors/warnings
     - YARN cluster monitoring tools (like Ambari) raise alerts
     - [Chart: Problem Type Distribution. Categories: Configuration, Executing Jobs, Cluster Administration, Installation, Application Development, Performance]
  9. Process: Phenomenon -> Root Cause -> Solution
     - Phenomenon:
       - Application failed
     - Root cause:
       - Container launch failures
         - Classpath issue
         - Resource localization failures
       - Too many attempt failures
         - Network connection issue
         - NM disk issues
         - AM failure caused by a node restart
       - Application logic issue
         - Container failed with OOM, etc.
       - Security issue
         - Token-related issues
     - Solution:
       - Infrastructure/hardware issue
         - Replace disks
         - Fix network
       - Mis-configuration
         - Fix configuration
         - Enhance documentation
       - Setup issue
         - Fix setup
         - Restart services
       - Application issue
         - Update application
         - Workaround
       - A YARN bug
         - Report/fix it in the Apache community!
  10. Iceberg of troubleshooting: Case Study
     - Reported error: “java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in system)”
     - This was actually caused by too many open TCP connections.
  11. Iceberg of troubleshooting: Dig Deeply
     - Most connections are from the local NM to DNs
       - LogAggregationService
       - ResourceLocalizationService
     - The root cause was a thread leak in the NM LogAggregationService:
       - YARN-4697 NM aggregation thread pool is not bound by limits
       - YARN-4325 Purge app state from NM state-store should cover more LOG_HANDLING cases
       - YARN-4984 LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread leak.
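One way to spot the "most connections go from the NM to DNs" pattern is to group established TCP connections by remote port. The sketch below is illustrative only: it assumes text in the column layout printed by `ss -tn`, and the sample addresses are made up. Port 50010 is the Hadoop 2.x default DataNode transfer port; a given cluster may use a different one.

```python
from collections import Counter


def count_by_peer_port(ss_output: str) -> Counter:
    """Count established TCP connections per remote port in `ss -tn`-style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue  # skips the header row and non-established sockets
        peer_port = fields[4].rsplit(":", 1)[-1]
        counts[peer_port] += 1
    return counts


# Hypothetical sample captured on a NodeManager host (not real data).
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.5:41522      10.0.0.7:50010
ESTAB  0      0      10.0.0.5:41523      10.0.0.8:50010
ESTAB  0      0      10.0.0.5:41600      10.0.0.9:50010
ESTAB  0      0      10.0.0.5:52011      10.0.0.2:8031
"""

if __name__ == "__main__":
    for port, n in count_by_peer_port(SAMPLE).most_common():
        print(port, n)
```

A remote port that dominates the count points at the service worth investigating first.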
  12. Lessons Learned for Trouble-shooting on a Production Cluster
     - What is meant by a “Production” cluster?
       - Cannot afford to stop/restart the cluster for troubleshooting
       - Most operations on the cluster are “Read Only”
       - The cluster sits in a fenced network, so debugging is done remotely with the local cluster admin
     - Lessons learned:
       1. Get as much related info as you can (screenshots, log files, jstack, memory heap dump, etc.)
       2. Work closely with the end user to gain an understanding of the issue and symptoms
       3. Set up a knowledge base for comparing against previous cases
       4. If possible, reproduce the issue on a test/backup cluster, where it is easier to troubleshoot and verify
       5. Version your configuration!
  13. Handy Tools for YARN Troubleshooting
     - Log
     - UI
     - Historic info
       - JobHistoryServer (for MR only)
       - Application Timeline Service (v1, v1.5, v2.0)
     - Monitoring tools, like Ambari
     - Runtime info
       - Memory dump
       - Jstack
       - System metrics
  14. Log
     - Log CLI
       - yarn logs -applicationId <application ID> [OPTIONS]
       - Discussed in more detail later
     - Enable debug log
       - When daemons are NOT running
         - Put a log level setting such as export YARN_ROOT_LOGGER="DEBUG,console" into yarn-env.sh
         - Start the daemons
       - When daemons are running
         - Change the log level dynamically via the daemon’s logLevel UI/CLI
         - CLI:
           - yarn daemonlog [-getlevel <host:httpPort> <classname>]
           - yarn daemonlog [-setlevel <host:httpPort> <classname> <level>]
       - For the YARN client side
         - Same setting as for daemons that are not running
  15. Runtime Log Level settings in YARN UI
     - RM: http://<rm_addr>:8088/logLevel
     - NM: http://<nm_addr>:8042/logLevel
     - ATS: http://<ats_addr>:8188/logLevel
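The logLevel URLs and the `yarn daemonlog` invocation from the two slides above can be assembled with a small helper. This is a sketch, not an official tool: the helper names and the example host are hypothetical, and the ports are the ones shown on the slide, which a cluster may override.

```python
# HTTP ports as listed on the slide; real clusters may configure others.
DEFAULT_HTTP_PORTS = {"rm": 8088, "nm": 8042, "ats": 8188}


def log_level_url(daemon: str, host: str) -> str:
    """Return the logLevel page URL for a daemon ('rm', 'nm' or 'ats')."""
    return f"http://{host}:{DEFAULT_HTTP_PORTS[daemon]}/logLevel"


def daemonlog_set_level(daemon: str, host: str, classname: str, level: str) -> list:
    """Build the argument list for `yarn daemonlog -setlevel ...`."""
    target = f"{host}:{DEFAULT_HTTP_PORTS[daemon]}"
    return ["yarn", "daemonlog", "-setlevel", target, classname, level]


if __name__ == "__main__":
    # rm.example.com is a placeholder host name.
    print(log_level_url("rm", "rm.example.com"))
    print(" ".join(daemonlog_set_level(
        "rm", "rm.example.com",
        "org.apache.hadoop.yarn.server.resourcemanager", "DEBUG")))
```

Returning an argument list rather than a single string makes it straightforward to hand the command to `subprocess.run` without shell quoting problems.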
  16. UI (Ambari and YARN)
  17. Job History Server
  18. Memory dump analysis
  19. Hadoop metrics
     - RPC metrics
       - RpcQueueTimeAvgTime
       - ReceivedBytes
       - …
     - JVM metrics
       - MemHeapUsedM
       - ThreadsBlocked
       - …
     - Documentation:
       - http://s.apache.org/UwSu
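Hadoop daemons publish these metrics as JSON through their `/jmx` HTTP endpoint, so a few lines of parsing are enough to pull out the ones named above. The sketch below works on an already downloaded payload; the bean names and values in the sample are illustrative assumptions, not output from a real cluster.

```python
import json


def pick_metrics(jmx_json: str, wanted) -> dict:
    """Return {bean name: {metric: value}} for beans exposing any wanted metric."""
    found = {}
    for bean in json.loads(jmx_json).get("beans", []):
        hits = {key: bean[key] for key in wanted if key in bean}
        if hits:
            found[bean.get("name", "<unnamed>")] = hits
    return found


# Hypothetical payload in the {"beans": [...]} shape served by /jmx.
SAMPLE_JMX = json.dumps({
    "beans": [
        {"name": "Hadoop:service=ResourceManager,name=RpcActivityForPort8032",
         "RpcQueueTimeAvgTime": 12.5, "ReceivedBytes": 1048576},
        {"name": "Hadoop:service=ResourceManager,name=JvmMetrics",
         "MemHeapUsedM": 2048.0, "ThreadsBlocked": 3},
        {"name": "java.lang:type=Runtime", "Uptime": 1000},
    ]
})

if __name__ == "__main__":
    wanted = ["RpcQueueTimeAvgTime", "ReceivedBytes", "MemHeapUsedM", "ThreadsBlocked"]
    for name, values in pick_metrics(SAMPLE_JMX, wanted).items():
        print(name, values)
```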
  20. YARN top
     - A top-like command line view of application stats and queue stats
  21. Agenda: YARN in a Nutshell, Trouble-shooting Process and Tools, Case Study
  22. Why is my job hung?
     - A job can be stuck in one of 3 states:
       - NEW_SAVING: waiting for the app to be persisted in the state-store
         - Connection error with the state-store (ZooKeeper etc.)
       - ACCEPTED: waiting for the ApplicationMaster container to be allocated
         - Low max-AM-resource-percentage config
       - RUNNING: waiting for containers to be allocated?
         - Are there resources available for the app?
         - Otherwise, an application-side issue, e.g. stuck on a socket read/write
     - [Figure: App states]
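The three stuck states and their first checks can be kept as a small lookup table, for example in a runbook script. This is only a restatement of the slide in code form; the function name and the fallback message are hypothetical.

```python
# First things to check for each state an application can hang in,
# taken from the slide above.
CHECKS = {
    "NEW_SAVING": [
        "Connection errors between the RM and the state-store (ZooKeeper etc.)",
    ],
    "ACCEPTED": [
        "max-AM-resource-percentage configured too low for the queue",
    ],
    "RUNNING": [
        "Resources available for the app (queue capacity, user-limit, headroom)",
        "Application-side hang, e.g. stuck on a socket read/write",
    ],
}


def next_checks(state: str) -> list:
    """Return the checklist for an application state (case-insensitive)."""
    return CHECKS.get(state.upper(), ["State not covered by this checklist"])


if __name__ == "__main__":
    for item in next_checks("accepted"):
        print("-", item)
```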
  23. Case Study
     - Friday evening, the customer experiences cluster outages.
     - A large number of jobs get stuck.
     - There are resources available in the cluster.
     - Restarting the ResourceManager resolves the issue temporarily.
     - But after several hours, the cluster goes back to the bad state.
  24. Case Study
     - Are there any resources available in the queue?
  25. Case Study
     - Are there any resources available for the app?
       - Sometimes, even if the cluster has resources, users may still not be able to run their applications because they hit the user-limit.
       - The user-limit controls how many resources a single user can use.
       - Check user-limit info on the scheduler UI
       - Check application headroom on the application UI
  26. [No transcribed text]
  27. Case study
     - Not a problem of resource contention.
     - Use the yarn logs command to get the logs of the hung applications.
       - Found the app waiting for containers to be allocated.
     - Problem: the cluster has free resources, but the app is not able to use them.
  28. Case study
     - May be a scheduling issue.
     - Analyze the scheduler log. (Most difficult)
       - Users are not very familiar with the scheduler log.
       - The RM log is huge, which makes text searching in it hard.
       - It gets worse when debug logging is enabled.
     - Dump the scheduling log into a separate file
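A simple way to get the scheduler log into its own file after the fact is to filter the RM log by logger name. The sketch below assumes each log line carries the fully qualified class name of the logger, which depends on the log4j pattern in use; the sample lines are invented for illustration.

```python
# Package that holds the RM schedulers (capacity, fair, FIFO).
SCHEDULER_PKG = "org.apache.hadoop.yarn.server.resourcemanager.scheduler"


def scheduler_lines(lines):
    """Keep only the log lines emitted by scheduler classes."""
    return [line for line in lines if SCHEDULER_PKG in line]


# Hypothetical RM log excerpt (not real log output).
SAMPLE_LOG = [
    "2016-06-30 18:00:01,100 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: app submitted",
    "2016-06-30 18:00:01,200 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: skipping app",
    "2016-06-30 18:00:01,300 INFO org.apache.hadoop.ipc.Server: IPC handler started",
]

if __name__ == "__main__":
    for line in scheduler_lines(SAMPLE_LOG):
        print(line)
```

Continuation lines such as stack traces do not repeat the logger name and would be dropped by this filter, so it suits a first pass rather than a complete extraction.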
  29. Case study
     - The scheduler log shows that several apps are skipped for scheduling.
     - Pick one of the applications and go to the application attempt UI.
     - Check the resource requests table and notice that the application asks for billions of containers.
     - [Screenshot: resource requests table; visible value: 8912124]
  30. Case study
     - Killed the misbehaving jobs, and the cluster recovered.
     - Find the user who submitted those jobs and stop them from doing that.
     - Big achievement so far: the cluster is unblocked.
     - Offline debugging then found a product bug.
     - Surprisingly, the scheduler used int for measuring memory size.
     - The misbehaving app asked for too many resources, which caused an integer overflow in the scheduler.
     - YARN-4844: replace int with long for the resource memory API.
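The overflow is easy to reproduce with plain arithmetic. Python integers do not overflow, so the sketch below emulates the wrap-around of a 32-bit signed Java `int` with a helper; the container count and container size are hypothetical numbers chosen only to cross the 2^31 - 1 limit.

```python
INT32_MAX = 2**31 - 1


def as_int32(value: int) -> int:
    """Wrap an integer the way a 32-bit signed Java int wraps on overflow."""
    return (value + 2**31) % 2**32 - 2**31


# Hypothetical request: 3,000,000 containers of 1024 MB each.
containers = 3_000_000
mb_per_container = 1024

true_total_mb = containers * mb_per_container   # what a long would hold
int_total_mb = as_int32(true_total_mb)          # what an int would hold

if __name__ == "__main__":
    print("true total MB:", true_total_mb)               # 3072000000
    print("exceeds int range:", true_total_mb > INT32_MAX)  # True
    print("as 32-bit int:", int_total_mb)                # -1222967296
```

A negative "requested memory" is the kind of value that can silently break comparisons inside a scheduler, which is why widening the type to long removes the failure mode.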
  31. What we learned
     - Restarting a service can solve many problems.
       - Thanks to work-preserving RM and NM recovery (YARN-556 & YARN-1336).
     - Denial of Service: poorly written or accidentally misconfigured workloads can cause component outages.
       - Carefully code against DoS scenarios.
       - Example: a user RPC method (getQueueInfo) holds the scheduler lock
     - UI enhancement: small change, big impact.
       - Example: the resource requests table on the application UI was very useful in this case.
     - Alerts
       - Alert users when an application asks for too many containers.
  32. Case study 2
     - 10% of the jobs fail every day.
     - When re-run, the jobs sometimes finish successfully.
     - No resource contention while the jobs are running
     - Logs contain a lot of mysterious connection errors (unable to read call parameters)
  33. Case study 2
     - Initial attempt:
       - Dig deeper into the code to see under what conditions this exception may be thrown.
       - Not able to figure it out.
  34. Case study 2
     - Requested more failed application logs
     - Identified a pattern across these applications
     - Finally, we realized that all apps failed on a certain set of nodes.
     - Asked the customer to exclude those nodes. Jobs ran fine after that.
     - The customer checked “/var/log/messages” and found disk issues on those nodes.
     - Takeaway: when dealing with mysterious connection failures or hangs, try to find a correlation between failed apps and nodes.
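The correlation step amounts to counting failures per node once the (application, node) pairs have been pulled out of the logs. The sketch below assumes that extraction has already happened; the application IDs and host names are made up.

```python
from collections import Counter


def failures_per_node(failed_attempts):
    """Count failed attempts per node from (app_id, node) pairs."""
    return Counter(node for _app_id, node in failed_attempts)


# Hypothetical (app_id, node) pairs collected from failed application logs.
FAILED = [
    ("application_0001", "node17"),
    ("application_0002", "node17"),
    ("application_0003", "node42"),
    ("application_0004", "node17"),
    ("application_0005", "node42"),
    ("application_0006", "node03"),
]

if __name__ == "__main__":
    for node, count in failures_per_node(FAILED).most_common():
        print(node, count)
```

Nodes that account for a disproportionate share of failures are the first candidates to exclude and to inspect at the OS level.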
  35. Agenda: YARN in a Nutshell, Trouble-shooting Process and Tools, Case Study, Enhanced YARN Log Tool Demo
  36. Enhanced YARN Log CLI (YARN-4904)
     - Useful Log CLIs
       - Get container logs for running apps
         - yarn logs -applicationId ${appId}
       - Get a specific container log
         - yarn logs -applicationId ${appId} -containerId ${containerId}
       - Get AM container logs
         - yarn logs -applicationId ${appId} -am 1
       - Get a specific log file
         - yarn logs -applicationId ${appId} -logFiles syslog
         - Supports Java regular expressions
       - Get the log file's first 'n' bytes or last 'n' bytes
         - yarn logs -applicationId ${appId} -size 100
       - Dump the application/container logs
         - yarn logs -applicationId ${appId} -out ${local_dir}
       - List application/container log information
         - yarn logs -applicationId ${appId} -show_application_log_info
         - yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info
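When these commands are scripted, building the argument list in one place avoids quoting mistakes. The helper below is a hypothetical convenience wrapper, not part of YARN; it uses the option names exactly as they appear on the slide, which may differ in other Hadoop releases, so check the CLI help of the installed version.

```python
def yarn_logs_cmd(app_id, container_id=None, am=None,
                  log_files=None, size=None, out_dir=None):
    """Build a `yarn logs` argument list using the options shown on the slide."""
    cmd = ["yarn", "logs", "-applicationId", app_id]
    if container_id:
        cmd += ["-containerId", container_id]
    if am is not None:
        cmd += ["-am", str(am)]            # AM attempt number, e.g. 1
    if log_files:
        cmd += ["-logFiles", *log_files]   # e.g. ["syslog", "stderr"]
    if size is not None:
        cmd += ["-size", str(size)]        # the slides show both 100 and -1000
    if out_dir:
        cmd += ["-out", out_dir]           # local directory for dumped logs
    return cmd


if __name__ == "__main__":
    # Application ID taken from the backup slides of this deck.
    print(" ".join(yarn_logs_cmd("application_1467090861129_0001",
                                 am=1, log_files=["stderr"], size=-1000)))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` on a host where the `yarn` client is installed.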
  37. [No transcribed text]
  38. Agenda: YARN in a Nutshell, Trouble-shooting Process and Tools, Case Study, Enhanced YARN Log Tool Demo, Summary and Future
  39. Summary and Future
     - Summary
       - Methodology and tools for troubleshooting on YARN
       - Case Study
       - Enhanced YARN Log CLI
         - YARN-4904
     - Future Enhancement
       - ATS (Application Timeline Service) v2
         - YARN-2928
         - #hs16sj “How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at Scale”
       - New ResourceManager UI
         - YARN-3368
  40. New RM UI (YARN-3368)
  41. Thank You
  42. Backup Slides
  43. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -am 1
  44. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002
  45. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr
  46. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000
  47. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -out ${localDir}
  48. YARN Log Command example (screenshot):
     - yarn logs -applicationId application_1467090861129_0001 -show_application_log_info
     - yarn logs -applicationId application_1467090861129_0001 -show_container_log_info
