Debugging Apache Hadoop YARN Cluster in Production
- 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who We Are
Junping Du
– Apache Hadoop Committer and PMC Member
– Dev Lead, Hortonworks YARN team
Xuan Gong
– Apache Hadoop Committer and PMC Member
– Software Engineer
Jian He
– Apache Hadoop Committer and PMC Member
– Staff Software Engineer
- 3.
Today’s Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
Summary and Future
- 4.
Agenda
YARN in a Nutshell
- 5.
YARN Architecture
ResourceManager
NodeManager
ApplicationMaster
Other daemons:
– Application History/Timeline Server
– Job History Server (for MR only)
– Proxy Server
– Etc.
- 6.
RM and NM in a nutshell
- 7.
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
- 8.
“Troubles” that start a troubleshooting effort on a YARN cluster
Applications Failed
Applications Hang/Slow
YARN configuration doesn’t work
YARN APIs (CLI, web services, etc.) don’t work
YARN daemons crashed (OOM issues, etc.)
YARN daemon logs contain errors/warnings
YARN cluster monitoring tools (like Ambari) raise alerts
Problem Type Distribution
Configuration
Executing Jobs
Cluster Administration
Installation
Application Development
Performance
- 9.
Process: Phenomenon -> Root Cause -> Solution
Phenomenon:
– Application failed
Root cause:
– Container launch failures
• Classpath issue
• Resource localization failures
– Too many attempt failures
• Network connection issue
• NM disk issues
• AM failure caused by node restart
– Application logic issue
• Container failed with OOM, etc.
– Security issue
• Token-related issues
Solution:
– Infrastructure/hardware issue
• Replace disks
• Fix network
– Misconfiguration
• Fix configuration
• Enhance documentation
– Setup issue
• Fix setup
• Restart services
– Application issue
• Update application
• Workaround
– A YARN bug
• Report/fix it in the Apache community!
- 10.
Iceberg of troubleshooting – Case Study
"java.lang.RuntimeException: java.io.FileNotFoundException:
/etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in
system)"
That was actually due to too many open TCP connections
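A symptom like this can be triaged from the shell before reading any code: count the process's open file descriptors and see whether TCP sockets dominate. A minimal sketch, assuming a Linux node; `NM_PID` would be the NodeManager's pid on the affected host (the current shell's pid is used here only so the commands run anywhere):

```shell
NM_PID=$$   # stand-in; use the real NodeManager pid on the affected node
# How many file descriptors does the process hold? (Linux /proc)
fd_count=$(ls "/proc/$NM_PID/fd" | wc -l)
echo "open fds: $fd_count"
# If the count is near the ulimit, check whether TCP sockets dominate:
if command -v ss >/dev/null; then
  ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
fi
```

If most descriptors turn out to be sockets in ESTABLISHED or CLOSE_WAIT, the "too many open files" message is really a connection (or thread) leak, as in this case.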
- 11.
Iceberg of troubleshooting – Dig Deeply
Most connections are from local NM to DNs
– LogAggregationService
– ResourceLocalizationService
We found the root cause: a thread leak in the NM LogAggregationService:
– YARN-4697
NM aggregation thread pool is not bound by limits
– YARN-4325
Purge app state from NM state-store should cover more LOG_HANDLING cases
– YARN-4984
LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread
leak.
- 12.
Lessons Learned for Troubleshooting on a Production Cluster
What does a “production” cluster mean here?
– Cannot afford to stop/restart the cluster for troubleshooting
– Most operations on the cluster are “read only”
– Fenced network: remote debugging together with the local cluster admin
Lessons learned:
1. Gather as much related info (screenshots, log files, jstack, memory heap dump, etc.) as you can
2. Work closely with the end user to gain an understanding of the issue and symptoms
3. Set up a knowledge base to compare against previous cases
4. If possible, reproduce the issue on a test/backup cluster – much easier to troubleshoot and verify
5. Version your configuration!
- 13.
Handy Tools for YARN Troubleshooting
Log
UI
Historic Info
– JobHistoryServer (for MR only)
– Application Timeline Service (v1, v1.5, v2.0)
Monitoring tools, like: AMBARI
Runtime info
– Memory Dump
– Jstack
– System Metrics
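For the runtime info above, the stock JDK tools are usually enough. A sketch, assuming `jstack`/`jmap` are on the PATH and you run as the same user as the daemon (both assumptions about your environment):

```shell
# Find the ResourceManager JVM (assumes a single RM process on this host).
RM_PID=$(pgrep -f ResourceManager | head -1)
ts=$(date +%Y%m%d-%H%M%S)
if [ -n "$RM_PID" ]; then
  # Thread dump: cheap, safe to take several times in a row to spot stuck threads.
  jstack "$RM_PID" > "rm-jstack-$ts.txt"
  # Heap dump: stop-the-world and large; take only when chasing an OOM.
  jmap -dump:live,format=b,file="rm-heap-$ts.hprof" "$RM_PID"
fi
```

Taking two or three jstack snapshots a few seconds apart makes deadlocks and hot loops stand out; a single snapshot rarely does.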
- 14.
Log
Log CLI
– yarn logs -applicationId <application ID> [OPTIONS]
– Discussed in more detail later
Enable Debug log
– When daemons are NOT running
• Put a log level setting like export YARN_ROOT_LOGGER="DEBUG,console" in yarn-env.sh
• Start the daemons
– When Daemons are running
• Dynamic change log level via daemon’s logLevel UI/CLI
• CLI:
– yarn daemonlog [-getlevel <host:httpPort> <classname>]
– yarn daemonlog [-setlevel <host:httpPort> <classname> <level>]
– For the YARN client side
• Same settings as for daemons that are not running
- 15.
Runtime Log Level settings in YARN UI
RM: http://<rm_addr>:8088/logLevel
NM: http://<nm_addr>:8042/logLevel
ATS: http://<ats_addr>:8188/logLevel
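These pages are backed by Hadoop's /logLevel servlet, so the same change can be scripted. A sketch with hypothetical host names (the class name is just an example target):

```shell
RM=rm-host:8088   # hypothetical ResourceManager web address
CLS=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
# Read the current level for one class:
curl -s "http://$RM/logLevel?log=$CLS" || true   # || true: host here is hypothetical
# Raise it to DEBUG at runtime (reverts when the daemon restarts):
curl -s "http://$RM/logLevel?log=$CLS&level=DEBUG" || true
```

The change is in-memory only, which is exactly what you want on a production cluster: nothing to roll back besides setting the level back.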
- 19.
Hadoop metrics
RPC metrics
– RpcQueueTimeAvgTime
– ReceivedBytes
…
JVM metrics
– MemHeapUsedM
– ThreadsBlocked
…
Documentation:
– http://s.apache.org/UwSu
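These metrics are also exposed as JSON through each daemon's /jmx servlet, which is handy when no monitoring system is wired up. A sketch with a hypothetical RM address; bean names can vary by version, so start with the unfiltered query and then narrow down:

```shell
RM=rm-host:8088   # hypothetical ResourceManager web address
# Dump every Hadoop metrics bean (JVM, RPC, queues, ...):
curl -s "http://$RM/jmx?qry=Hadoop:*" || true   # || true: host here is hypothetical
# Narrow to one bean once you know its name, e.g. the JVM metrics:
curl -s "http://$RM/jmx?qry=Hadoop:service=ResourceManager,name=JvmMetrics" || true
```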
- 20.
YARN top
A top-like command-line view of application and queue stats
- 21.
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
- 22.
Why is my job hung?
A job can be stuck in three states:
NEW_SAVING: waiting for the app to be persisted in the state store
- Connection error with the state store (ZooKeeper, etc.)
ACCEPTED: waiting for the ApplicationMaster container to be allocated
- maximum-am-resource-percent configured too low
RUNNING: waiting for containers to be allocated
- Are there resources available for the app?
- Otherwise an application-land issue, e.g. stuck on socket read/write
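The first question is therefore which of these states the app is in, and the `yarn application` CLI answers it without touching the UI (the application id below is hypothetical):

```shell
if command -v yarn >/dev/null; then
  # Current YARN state (NEW_SAVING / ACCEPTED / RUNNING / ...) plus diagnostics:
  yarn application -status application_1467090861129_0001
  # Everything stuck before RUNNING, cluster-wide:
  yarn application -list -appStates NEW_SAVING,ACCEPTED
fi
```

Many apps parked in NEW_SAVING points at the state store; many in ACCEPTED points at AM resource limits or queue capacity.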
- 23.
Case Study
Friday evening: the customer experiences cluster outages.
A large number of jobs are getting stuck.
There are resources available in the cluster.
Restarting the ResourceManager resolves the issue temporarily,
but after several hours the cluster goes back into the bad state.
- 24.
Case Study
Are there any resources available in the queue?
- 25.
Case Study
Are there any resources available for the app?
– Sometimes, even if the cluster has resources, a user may still not be able to run their
applications because they hit the user limit.
– The user limit controls how many resources a single user can use.
– Check the user-limit info on the scheduler UI.
– Check the application headroom on the application UI.
- 27.
Case study
Not a problem of resource contention.
Use the yarn logs command to get the hung application's logs.
– Found the app waiting for containers to be allocated.
Problem: the cluster has free resources, but the app is not able to use them.
- 28.
Case study
Maybe a scheduling issue.
Analyze the scheduler log. (Most difficult.)
– Users are not very familiar with the scheduler log.
– The RM log is huge; text searching in it is hard.
– It gets even worse with debug logging enabled.
Dump the scheduling log into a separate file
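One low-tech way to get that separate file, assuming the CapacityScheduler and a typical RM log location (both are assumptions about your deployment): filter the scheduler's own class name out of the main log.

```shell
# Typical RM log path (assumption; adjust for your layout).
RM_LOG="/var/log/hadoop-yarn/yarn-yarn-resourcemanager-$(hostname).log"
# Pull only the scheduler's lines into a separate, greppable file.
grep -h 'CapacityScheduler' "$RM_LOG" > /tmp/scheduler.log 2>/dev/null || true
# Quick signal: how many allocation events landed in this window?
grep -c 'assignedContainer' /tmp/scheduler.log || true
```

A near-zero allocation count while the cluster has free resources is exactly the "scheduler skipping apps" pattern described on the next slide.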
- 29.
Case study
The scheduler log shows several apps being skipped for scheduling.
Pick one of the applications, go to the application attempt UI,
and check the resource requests table (see below): the application
is asking for billions of containers.
- 30.
Case study
Killed the misbehaving jobs; the cluster went back to normal.
Found the user who submitted those jobs and stopped them from doing that.
Big achievement so far: the cluster is unblocked.
Offline debugging then found a product bug:
Surprisingly, the scheduler used int for measuring memory size.
The misbehaving app asked for so many resources that it caused an integer overflow in the
scheduler.
YARN-4844: replace int with long in the resource memory API.
- 31.
What We Learned
Rebooting the service can solve many problems.
– Thanks to work-preserving RM and NM recovery (YARN-556 & YARN-1336).
Denial of service: poorly written workloads or accidental configurations can cause
component outages.
– Carefully code against DoS scenarios.
– Example: a user RPC method (getQueueInfo) holds the scheduler lock.
UI enhancements
– Small change, big impact.
– Example: the resource requests table on the application page was very useful in this case.
Alerts
– Alert users when an app asks for too many containers.
- 32.
Case study 2
10% of jobs are failing every day.
After a re-run, the jobs sometimes finish successfully.
No resource contention when the jobs are running.
Logs contain a lot of mysterious connection errors ("unable to read call parameters").
- 33.
Case study 2
Initial attempt:
– Dig deeper into the code to see under what conditions this exception may be thrown.
– Could not figure it out.
- 34.
Case study 2
Requested more failed application logs.
Identified a pattern across these applications:
finally, we realized all the apps failed on a certain set of nodes.
Asked the customer to exclude those nodes; jobs ran fine after that.
The customer checked /var/log/messages and found disk issues on those nodes.
When dealing with mysterious connection failures or hung problems, try to find a
correlation between failed apps and nodes.
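A rough way to look for that correlation from aggregated logs: extract the node each failed container ran on and count failures per host. The app ids and the `grep` pattern below are hypothetical; match whatever your container diagnostics actually say.

```shell
# Hypothetical app ids; in practice loop over the day's failed apps.
for app in application_1467090861129_0001 application_1467090861129_0002; do
  yarn logs -applicationId "$app" 2>/dev/null
done |
  grep -oE 'on host: [^ ]+' |   # hypothetical pattern: adapt to your logs
  sort | uniq -c | sort -rn     # failures per node, worst first
```

If a handful of nodes dominate the list, as in this case, the problem is almost certainly below YARN: disks, NICs, or the OS on those machines.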
- 35.
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
- 36.
Enhanced YARN Log CLI (YARN-4904)
Useful Log CLIs
– Get container logs for running apps
• yarn logs -applicationId ${appId}
– Get a specific container log
• yarn logs -applicationId ${appId} -containerId ${containerId}
– Get AM container logs
• yarn logs -applicationId ${appId} -am 1
– Get a specific log file
• yarn logs -applicationId ${appId} -logFiles syslog
• Supports Java regular expressions
– Get the log file's first 'n' bytes or last 'n' bytes
• yarn logs -applicationId ${appId} -size 100
– Dump the application/container logs
• yarn logs -applicationId ${appId} -out ${local_dir}
– List application/container log information
• yarn logs -applicationId ${appId} -show_application_log_info
• yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info
- 38.
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
Summary and Future
- 39.
Summary and Future
Summary
– Methodology and Tools for trouble-shooting on YARN
– Case Study
– Enhanced YARN Log CLI
• YARN-4904
Future Enhancement
– ATS (Application Timeline Service) v2
• YARN-2928
• #hs16sj “How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at
Scale”
– New ResourceManager UI
• YARN-3368
- 40.
New RM UI (YARN-3368)
- 43.
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1
- 44.
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002
- 45.
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr
- 46.
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000
- 47.
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -out ${localDir}
- 48.
YARN Log Command example Screenshot
yarn logs -applicationId application_1467090861129_0001 -show_application_log_info
yarn logs -applicationId application_1467090861129_0001 -show_container_log_info