Unraveling Hadoop Meltdown Mysteries



As powerful and flexible as Hadoop is, jobs still sometimes fail or thrash unpredictably. Pepperdata co-founder and CEO Sean Suchter, one of the first commercial users of Hadoop in the early days at Yahoo, will give real-world examples of Hadoop meltdowns complete with metrics and what we can learn from them. He'll also show how to automatically increase Hadoop cluster throughput through fine-grained job hardware usage visibility.

  • Hi, this is Sean Suchter from Pepperdata. Today I’m going to show you a couple of meltdowns we saw on large-scale customer clusters (hundreds to thousands of nodes) and how we found the root causes.
  • The first one we saw was a case where the cluster’s disks started thrashing and slowed everything to a crawl.
  • We could clearly see frequent spikes when the disks became very busy. The question at hand was “what was causing those spikes?”
  • To diagnose this, we looked at some job-level metrics about how they were using the cluster’s IO subsystem.
  • We rapidly found that the spikes (see the bottom graph) coincided with periods in which many individual files were being opened in rapid succession. The biggest spike, the one on the right, showed that this cluster was opening over 200,000 individual files per second.
  • In order to find the culprit, we broke the data down by user and job.
  • It became immediately clear that this was not widespread usage of the cluster, but rather one repeated series of ETL jobs that was causing this behavior. The middle graph shows that all the spikes were from one user, and the right-hand graph shows that they came from several jobs belonging to that user. There are other users on the middle graph (for example, you can see the tiny little green job), but they are all small compared to this ETL user.
  • When we zoomed in on one of these jobs and looked at its tasks, we could see that each individual task was opening between 400 and 500 new files per second.
  • Once we knew this root cause, the solution was to let the author know. He was rather surprised, to say the least.
  • The second case was one where nodes were dying across a business-critical 1,200-node cluster.
  • Late one night, nodes abruptly started swapping and becoming non-responsive. People in the datacenter had to get paged to physically reset the hosts. The job submitters all reported that they didn’t change anything and no new software got deployed. So the question was “what the heck changed?”
  • Because of the alarm bells raised by the swapping, we started by looking at memory usage on some of the nodes and found these spikes beginning at around 10pm (22:00).
  • Zooming in on one of them…
  • We found that the tasks on this node suddenly went way above normal, consuming 6 gigs of this 8 gig host. When you add in the OS overhead, this caused the node to start swapping. This node recovered, but most were not so lucky.
  • Breaking this spike down by job, we found that it was one particular job alone that caused the entire problem.
  • Looking at that job, we identified which user had submitted it and what it was for.
  • Looking at the individual tasks, we could find that this one job’s tasks were much greedier than everything else. Most tasks were well under one gig, but this one very rapidly, within a few seconds, spiked up to 1.5 or even 2 gigs.
  • The root cause ended up being that while the job didn’t change, the input data did. That user’s jobs got stopped immediately, so the cluster would stop melting down. Then we changed two configuration settings. We made better use of the capacity scheduler’s virtual memory controls and also used the Pepperdata protection features to limit the physical memory of tasks.
  • The main takeaway from these two case studies is that while you can find problems at the node level, you find the root causes when you really look into the job and task details.
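
The per-user and per-job breakdown used in the disk-thrashing case can be sketched roughly as below. This is only an illustration of the idea, not the tooling used in the talk: the sample records, their field layout, and the helper names `total_by` and `top_contributor` are all hypothetical. The per-task rates are merely chosen to sit in the 400 to 500 files-per-second range mentioned in the notes.

```python
from collections import defaultdict

# Hypothetical per-task samples: (user, job_id, task_id, files_opened_per_sec).
# Names and values are illustrative only; they are not data from the talk.
samples = [
    ("etl_user", "job_001", "task_a", 450),
    ("etl_user", "job_001", "task_b", 480),
    ("etl_user", "job_002", "task_c", 410),
    ("analyst", "job_101", "task_d", 3),
    ("analyst", "job_102", "task_e", 5),
]


def total_by(samples, key_fn):
    """Sum files-opened-per-second over all samples that share a key."""
    totals = defaultdict(int)
    for sample in samples:
        totals[key_fn(sample)] += sample[3]
    return dict(totals)


def top_contributor(totals):
    """Return (key, share_of_total) for the largest contributor."""
    grand_total = sum(totals.values())
    key = max(totals, key=totals.get)
    return key, totals[key] / grand_total


# First break the cluster-wide rate down by user, then by (user, job).
by_user = total_by(samples, lambda s: s[0])
by_job = total_by(samples, lambda s: (s[0], s[1]))

top_user, top_share = top_contributor(by_user)
print(f"{top_user} accounts for {top_share:.1%} of file opens")
for (user, job), rate in sorted(by_job.items(), key=lambda kv: -kv[1]):
    print(f"{user:10s} {job:8s} {rate:5d} opens/sec")
```

Grouping by user first and by job second mirrors the order of the graphs described in the notes: a cluster-wide spike is narrowed to one user, then to that user's jobs, and finally to individual tasks.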

    1. Meltdown Mysteries (Sean Suchter)
    2. Disks are thrashing!
    3. Solution
         • Make job author aware of surprising behavior.
         • Modify job code & settings to be nicer to disks.
    4. Nodes are dying!
    5. Initial diagnosis…
         • Nodes abruptly started swapping and becoming non-responsive. (Required physical power cycling)
         • Job submitters report “I didn’t change anything”
         • Question: What’s doing this to the cluster?
    6. Cause & solution
         • While the job didn’t change, its input data did.
         • Stop that user’s jobs immediately.
         • Better use of capacity scheduler virtual memory controls.
         • Use Pepperdata protection to limit physical memory as well.
    7. Take-away
         • You see problems at the node level.
         • You see the root causes at the task level.
    8. Pepperdata meetup tomorrow!
         • War Stories from the Hadoop Trenches
         • Allen Wittenauer (Apache Hadoop committer and former LinkedIn)
         • Eric Baldeschwieler (former Hortonworks CEO / CTO)
         • Todd Nemet (Looker; former Altiscale, ClearStory Data, Cloudera)
         • 6pm Wed 6/25
         • Firehouse Brewery, 111 S Murphy, Sunnyvale
         • http://www.meetup.com/pepperdata/
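
For the second case study (nodes dying from memory pressure), the reasoning in the notes can be sketched as a simple check: add up task memory on a node, compare it against what the host can spare, and flag tasks above a per-task cap. The 8 GB host size comes from the notes. Everything else here is an assumption made for illustration: the OS reserve, the 1 GB cap, the task samples, and the function names. It is not how the capacity scheduler or Pepperdata enforce limits.

```python
HOST_MEMORY_GB = 8.0   # host size mentioned in the speaker notes
OS_RESERVE_GB = 2.5    # ASSUMED OS/daemon overhead; not a figure from the talk
TASK_LIMIT_GB = 1.0    # ASSUMED cap; the notes say most tasks stayed under 1 GB

# Hypothetical task memory samples on one node: (job_id, task_id, rss_gb).
tasks = [
    ("job_greedy", "t1", 2.0),
    ("job_greedy", "t2", 1.5),
    ("job_greedy", "t3", 1.5),
    ("job_other", "t4", 0.5),
    ("job_other", "t5", 0.5),
]


def node_is_overcommitted(tasks, host_gb, reserve_gb):
    """True if task memory exceeds what is left after the OS reserve."""
    used_gb = sum(rss for _, _, rss in tasks)
    return used_gb > host_gb - reserve_gb


def tasks_over_limit(tasks, limit_gb):
    """Return (job_id, task_id) pairs whose memory exceeds the per-task cap."""
    return [(job, task) for job, task, rss in tasks if rss > limit_gb]


def usage_by_job(tasks):
    """Total memory per job, to show which job drives the spike."""
    totals = {}
    for job, _, rss in tasks:
        totals[job] = totals.get(job, 0.0) + rss
    return totals


overcommitted = node_is_overcommitted(tasks, HOST_MEMORY_GB, OS_RESERVE_GB)
offenders = tasks_over_limit(tasks, TASK_LIMIT_GB)
per_job = usage_by_job(tasks)

print("node overcommitted:", overcommitted)
print("tasks over limit:", offenders)
print("memory by job (GB):", per_job)
```

In this made-up sample the tasks total 6 GB on the 8 GB host, matching the situation described in the notes, and a single job accounts for most of it. A per-task cap of this kind is the sort of limit the notes describe applying after the incident.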