Hadoop is used to run large-scale jobs that are subdivided into many tasks executed across multiple machines. These tasks have complex interdependencies, and at scale there can be tens to thousands of tasks running on hundreds to thousands of machines, which makes it hard to make sense of their performance. Pipelines of such jobs, which together implement a business workflow, add yet another layer of complexity. This does not even account for other users competing for resources on the same Hadoop cluster, which can have an even more drastic impact on your job performance and your SLAs. No wonder the question of why Hadoop jobs run slower than expected remains a perennial source of grief for developers. In this talk, we will draw on our experience debugging and analyzing Hadoop jobs to describe methodical approaches to this problem, and we will present current and new tracing and tooling ideas that can help semi-automate parts of it.