Hadoop is used to run large-scale jobs that are sub-divided into many tasks executed across multiple machines, with complex dependencies between those tasks. At scale, there can be tens to thousands of tasks running on hundreds to thousands of machines, which makes it hard to make sense of their performance. Pipelines of such jobs, logically composing a business workflow, add another level of complexity. That still does not account for other users competing for resources on the same Hadoop cluster, which can have an even more drastic impact on your job's performance and your SLAs. No wonder the question of why Hadoop jobs run slower than expected remains a perennial source of grief for developers. In this talk, we draw on our experience debugging and analyzing Hadoop jobs to describe some methodical approaches to the problem, and present current and new tracing and tooling ideas that can help semi-automate parts of it.
Why is my Hadoop* job slow?
Bikas Saha
@bikassaha
*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive,
HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper,
Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the
Apache Software Foundation.
Hitesh Shah
It is now possible to infer which application/job did what in HDFS.
Files created can be tracked down to the MR or Tez job, and the specific task attempt, that created them.
Using simple string manipulation and aggregation, you can find the jobs inducing high load on the NameNode.
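The aggregation above can be sketched in a few lines. This is a minimal sketch, assuming the key=value layout of HDFS audit log entries with the callerContext field enabled (HDFS-9184); the sample lines below are illustrative, not verbatim NameNode output:

```python
import re
from collections import Counter

# Illustrative audit log lines; real entries also carry a timestamp and
# logger prefix, but this sketch only relies on the key=value fields and
# the callerContext field (which requires HDFS caller-context support).
AUDIT_LINES = [
    "allowed=true ugi=bob ip=/10.0.0.5 cmd=create src=/out/part-0 dst=null "
    "proto=rpc callerContext=mr_attempt_1464484887407_0007_m_000000_0",
    "allowed=true ugi=bob ip=/10.0.0.5 cmd=getfileinfo src=/out dst=null "
    "proto=rpc callerContext=mr_attempt_1464484887407_0007_m_000000_0",
    "allowed=true ugi=alice ip=/10.0.0.9 cmd=listStatus src=/data dst=null "
    "proto=rpc callerContext=tez_ta_1464484887407_0009_1_00_000001_0",
]

CALLER = re.compile(r"callerContext=(\S+)")

def namenode_load_by_caller(lines):
    """Count NameNode operations per callerContext (i.e., per task attempt)."""
    load = Counter()
    for line in lines:
        match = CALLER.search(line)
        if match:
            load[match.group(1)] += 1
    return load
```

Sorting the resulting counter then surfaces the callers hammering the NameNode hardest.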
Mapping YARN activity back to a specific application type and instance is now much easier.
It could be made even easier if “mr_attempt_1464484887407_0007_m_000000_0” pointed to an Oozie workflow instead of just the MR job.
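The string manipulation needed to recover the YARN application instance from such a caller context is simple. The helper below is a sketch, assuming the `<prefix>_<clusterTimestamp>_<appSequence>_...` shape of the attempt id shown above:

```python
import re

# The cluster timestamp is a long run of digits (epoch millis) followed by
# the zero-padded application sequence number, e.g.
#   mr_attempt_1464484887407_0007_m_000000_0 -> application_1464484887407_0007
APP_ID = re.compile(r"(\d{10,})_(\d{4,})")

def caller_to_application_id(caller_context):
    """Derive the YARN application id embedded in an MR/Tez caller context."""
    match = APP_ID.search(caller_context)
    if not match:
        return None
    return f"application_{match.group(1)}_{match.group(2)}"
```

The same helper works for Tez attempt ids, since they embed the same cluster timestamp and application sequence.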
Who killed my application, and how (command line, web service)?
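One place to look for the answer is the application's diagnostics string, visible via `yarn application -status <appId>` or the ResourceManager REST API. The classifier below is a sketch against illustrative diagnostics phrases, not the exact strings every YARN version emits; check your cluster's actual output before relying on these patterns:

```python
def kill_source(diagnostics):
    """Guess who or what killed an application from its diagnostics text.

    The matched phrases are illustrative approximations of common YARN
    diagnostics; real messages vary by version and scheduler.
    """
    text = diagnostics.lower()
    if "killed by user" in text:
        return "user"                      # e.g. a manual `yarn application -kill`
    if "preempt" in text:
        return "preemption"                # scheduler reclaimed the resources
    if "exceed" in text and "memory" in text:
        return "container-memory-limit"    # container killed for overuse
    return "unknown"
```

Combined with the caller-context mapping above, this lets you go from a slow or missing job straight to a plausible cause.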