The Past, Present, and Future of
Hadoop @ LinkedIn
Carl Steinbach
Senior Staff Software Engineer
Data Analytics Infrastructure Group
LinkedIn
The (Not So) Distant Past
PYMK (People You May Know)
First version implemented in 2006
6-8 Million members
Ran on Oracle (foreshadowing!)
Found various overlaps
School, work, etc.
Used common connections
Triangle closing (?)
Triangle Closing
[Diagram: Mary is connected to both Dave and Steve; the edge between Dave and Steve is marked “?” — the connection triangle closing guesses exists.]
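To make triangle closing concrete, here is a minimal, self-contained Scala sketch over a toy connection graph. It illustrates the idea only; it is not LinkedIn's PYMK implementation, and the `connections` map, names, and scoring are invented for the example.

```scala
object TriangleClosing {
  // Toy, undirected connection graph keyed by member name.
  val connections: Map[String, Set[String]] = Map(
    "Mary"  -> Set("Dave", "Steve"),
    "Dave"  -> Set("Mary"),
    "Steve" -> Set("Mary")
  )

  /** People `member` may know, scored by the number of common connections. */
  def suggestions(member: String): Seq[(String, Int)] = {
    val direct = connections.getOrElse(member, Set.empty[String])
    val friendsOfFriends = direct.flatMap(f => connections.getOrElse(f, Set.empty[String]))
    friendsOfFriends
      .filter(c => c != member && !direct.contains(c))   // not me, not already connected
      .toSeq
      .map(c => c -> direct.count(f => connections.getOrElse(f, Set.empty[String]).contains(c)))
      .sortBy { case (_, score) => -score }               // most common connections first
  }

  def main(args: Array[String]): Unit =
    println(suggestions("Dave"))   // Steve is suggested with score 1: they share Mary
}
```

Dave and Steve share Mary, so each is suggested to the other with a score of 1; every additional shared connection raises the score and, as the speaker notes below point out, our confidence in the guess.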
PYMK Problems
By 2008, 40-50 Million members
Still running on Oracle
Failed often
Infrequent data refresh
6 weeks – 6 months!
Humble Beginnings Back in ‘08
Success! (circa 2009)
Apache Hadoop 0.20
20 node cluster (repurposed hardware)
PYMK in 3 days!
The Present
Hadoop @ LinkedIn Circa 2016
> 10 Clusters
> 10,000 Nodes
> 1000 Users
Thousands of workflows, datasets, and ad-hoc queries
MR, Pig, Hive, Gobblin, Cubert, Scalding, Tez, Spark, Presto, …
Two Types of Scaling Challenges
Machines
People and Processes
Scaling Machines
Some Tough Talk About HDFS
Conventional wisdom holds that HDFS
Scales to > 4k nodes without federation*
Scales to > 8k nodes with federation*
What’s been our experience?
Many Apache releases won’t scale past a couple thousand nodes
Vendor distros usually aren’t much better
Why?
Scale testing happens after the release, not before
Most vendors have only a handful of customers with clusters larger than 1k nodes
* Heavily dependent on NN RPC workload, block size, average file size, average container size, etc.
March 2015 Was Not a Good Month
What Happened?
We rapidly added 500 nodes to a 2000 node cluster
(don’t do this!)
NameNode RPC queue length and wait time skyrocketed
Jobs crawled to a halt
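The queue length and wait time referred to here are standard NameNode RPC metrics exposed over the JMX HTTP servlet, so a spike like this can be watched for directly. Below is a minimal Scala sketch that polls two of them; the hostname, the HTTP port (50070 is the Hadoop 2 default), and the RPC port embedded in the bean name are assumptions to adjust for your cluster.

```scala
import scala.io.Source

object NameNodeRpcCheck {
  // Hypothetical host and ports: 50070 is the default Hadoop 2 NameNode HTTP
  // port, and the 8020 in the bean name must match your NameNode's RPC port.
  val jmxUrl =
    "http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020"

  def main(args: Array[String]): Unit = {
    val body = Source.fromURL(jmxUrl).mkString

    // Crude extraction to keep the sketch dependency-free; a real monitor
    // would use a proper JSON parser.
    def metric(name: String): String =
      ("\"" + name + "\"\\s*:\\s*([0-9.]+)").r
        .findFirstMatchIn(body)
        .map(_.group(1))
        .getOrElse("?")

    println(s"CallQueueLength     = ${metric("CallQueueLength")}")
    println(s"RpcQueueTimeAvgTime = ${metric("RpcQueueTimeAvgTime")} ms")
  }
}
```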
What Was the Cause?
A subtle performance/scale regression was introduced upstream
The bug was included in multiple releases
Increased time to allocate a new file
The more nodes you had, the worse it got
How We Used to do Scale Testing
1. Deploy the release to a small cluster (num_nodes = 100)
2. See if anything breaks
3. If no, then deploy to the next-largest cluster and go to step 2
4. If yes, figure out what went wrong and fix it
Problems with this approach
Expensive: developer time + hardware
Risky: Sometimes you can’t roll back!
Doesn’t always work: overlooks non-linear regressions
HDFS Dynamometer
• Scale testing and performance investigation tool for HDFS
• High fidelity in all the dimensions that matter
• Focused on the NameNode
• Completely black-box
• Accurately fakes thousands of DNs on a small fraction of the hardware
• More details in forthcoming blog post
Scaling People and Processes
Hadoop Performance Tuning
Why Are Most User Jobs Poorly Tuned?
Too many dials!
Lots of frameworks: each one is slightly different.
Performance can change over time.
Tuning requires constant monitoring and maintenance!
* Tuning decision tree from “Hadoop In Practice”
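To give a feel for just how many dials there are, here is a hedged Scala sketch that sets a handful of common knobs on a single MapReduce job. The property names are standard Hadoop 2 settings, but the values are placeholders; sensible numbers depend on input sizes and cluster hardware, which is exactly why hand-tuning is fragile and goes stale.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object TunedJob {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // A few of the many dials one MapReduce job exposes. The values here are
    // placeholders; good numbers depend on your data and your cluster.
    conf.set("mapreduce.map.memory.mb", "2048")      // YARN container size for map tasks
    conf.set("mapreduce.map.java.opts", "-Xmx1536m") // JVM heap must fit inside the container
    conf.set("mapreduce.reduce.memory.mb", "4096")
    conf.set("mapreduce.task.io.sort.mb", "512")     // map-side sort buffer
    conf.setInt("mapreduce.job.reduces", 200)        // reducer parallelism

    val job = Job.getInstance(conf, "tuned-job")
    // ... set mapper, reducer, input and output paths as usual, then submit ...
  }
}
```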
Dr Elephant: Running Light Without Overbyte
Automated Performance Troubleshooting for Hadoop Workflows
● Detects Common MR and Spark Pathologies (a rough skew-check sketch follows this list):
○ Mapper Data Skew
○ Reducer Data Skew
○ Mapper Input Size
○ Mapper Speed
○ Reducer Time
○ Shuffle & Sort
○ More!
● Explains Cause of Disease
● Guided Treatment Process
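As an illustration of what a data-skew check can look like, the following Scala sketch flags a stage when its largest task reads far more input than the median task. The heuristic and thresholds are assumptions for the example, not Dr Elephant's actual rules (those are in the repository linked on the next slide).

```scala
object SkewCheck {
  /** Flag a stage as skewed when its largest task reads far more data than the
    * typical (median) task. Thresholds are illustrative, not Dr Elephant's. */
  def isSkewed(taskInputBytes: Seq[Long],
               ratioThreshold: Double = 4.0,
               minBytes: Long = 256L * 1024 * 1024): Boolean =
    taskInputBytes.nonEmpty && {
      val sorted = taskInputBytes.sorted
      val median = sorted(sorted.size / 2)
      val max    = sorted.last
      max >= minBytes && median > 0 && max.toDouble / median >= ratioThreshold
    }

  def main(args: Array[String]): Unit = {
    val even   = Seq.fill(100)(300L * 1024 * 1024)       // ~300 MB per task
    val skewed = even :+ (3L * 1024 * 1024 * 1024)       // plus one 3 GB straggler
    println(isSkewed(even))   // false
    println(isSkewed(skewed)) // true
  }
}
```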
Dr Elephant is Now Open Source
Grab the source code: github.com/linkedin/dr-elephant
Read the blog post: engineering.linkedin.com/blog
Upgrades are Hard
A totally fictional story:
The Hadoop team pushes a new Pig upgrade
The next day thirty flows fail with ClassNotFoundExceptions
Angry users riot
Property damage exceeds $30mm
What happened?
The flows depended on a third-party UDF, which in turn depended on a transitive dependency provided by the old version of Pig but not by the new one
Bringing Shading Out of the Shadows
What most people think it is
Package the artifact and all of its dependencies in the same JAR, then rename some or all of the package names
What it really is
Static linking for Java
Unfairly maligned by many people
We built an improved Gradle plugin that makes shading easier for inexperienced users
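LinkedIn's plugin targets Gradle and isn't shown here; as a generic illustration of what a shade rule looks like in build-file form, below is a sketch using the open-source sbt-assembly plugin instead. The plugin version and the `myjob.shaded.guava` relocation prefix are assumptions for the example.

```scala
// project/plugins.sbt -- the open-source sbt-assembly plugin
// (version is an assumption; use whatever is current)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")

// build.sbt -- bundle the job and its dependencies into one fat JAR, and
// "shade" (relocate) Guava so the flow no longer cares which Guava version
// Pig or Hadoop happens to put on the classpath.
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("com.google.common.**" -> "myjob.shaded.guava.@1") // hypothetical prefix
    .inAll
)
```

After assembling, Guava's classes live under the relocated package inside the fat JAR, so the flow's third-party UDFs keep working regardless of which Guava version the new Pig or Hadoop release drags in.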
Byte-Ray: “X-Ray Goggles for JAR Files”
Audit Hadoop flows for incompatible and unnecessary dependencies.
Predict failures before they happen by scanning for dependencies that won’t be satisfied post-upgrade.
Proved extremely useful during the Hadoop 2 migration
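The slide describes Byte-Ray only at a high level, but the core of the "will this dependency still be satisfied post-upgrade?" question can be sketched with nothing more than `java.util.jar`: diff the classes available before and after the upgrade. Everything below (the object name and the comma-separated CLI arguments) is a hypothetical illustration, not Byte-Ray's API.

```scala
import java.util.jar.JarFile
import scala.jdk.CollectionConverters._

/** List every class present on the old classpath but missing from the new one.
  * (A real auditor would also inspect which classes a flow's JARs actually
  * reference; this sketch only does the availability diff.) */
object ClasspathDiff {
  def classesIn(jars: Seq[String]): Set[String] =
    jars.flatMap { path =>
      val jar = new JarFile(path)
      try jar.entries().asScala.map(_.getName).filter(_.endsWith(".class")).toList
      finally jar.close()
    }.toSet

  // Hypothetical usage: ClasspathDiff "old-pig.jar,udfs.jar" "new-pig.jar,udfs.jar"
  def main(args: Array[String]): Unit = {
    val Array(oldCp, newCp) = args
    val gone = classesIn(oldCp.split(",").toSeq) -- classesIn(newCp.split(",").toSeq)
    gone.toSeq.sorted.foreach(println)
  }
}
```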
Byte-Ray in Action
SoakCycle: Real-World Integration Testing
The Future?
Dali
2015 was the year of the table
We want to make 2016 the year of the view
Learn more at the Dali talk tomorrow

Editor's Notes

  • #5 -Since People You May Know is long, we call it PYMK at LinkedIn. -The original version ran on Oracle. -And the way it worked was to attempt to find overlaps between pairs of people. Did they share the same school? Did they work at the same company? -One big indicator was common connections, and we used something called triangle closing.
  • #6 -Triangle closing is an easy concept to follow <click> -Mary knows Dave and Steve <click> -We make a guess that Dave may also know Steve. This is essentially what this feature does. We closed that triangle. <click> -Additionally, if Dave and Steve share more than one connection, then we can become more confident in our guess.
  • #7 -Three years later, LinkedIn was growing… fast, to 40-50 million members. I joined about this time as a member of the data products group. -We still used Oracle to create PYMK data, and it may not surprise people to hear that we had scalability problems. -In fact it failed often and required a lot of manual intervention. When it succeeded, it would take about 6 weeks to produce new results, by which time the data was most likely stale. -At its worst, PYMK had so many problems that no new data appeared on our site for 6 months.
  • #9 -We tried other solutions. I won’t name them, even though some of them were well known… and none of them could solve our scale problem. -So we started a 20 node Hadoop cluster… pretty much on bad hardware that we ‘stole’ or ‘repurposed’ from our research and development servers without anyone really knowing. -We really didn’t know what we were doing, and our cluster was misconfigured… <click> -but it solved PYMK in 3 days. -So everything was good… well, we’ll see.