3. PYMK (People You May Know)
First version implemented in 2006
6-8 Million members
Ran on Oracle (foreshadowing!)
Found various overlaps
School, work, etc.
Used common connections
Triangle closing
12. Some Tough Talk About HDFS
Conventional wisdom holds that HDFS
Scales to > 4k nodes without federation*
Scales to > 8k nodes with federation*
What’s been our experience?
Many Apache releases won’t scale past a couple thousand nodes
Vendor distros usually aren’t much better
Why?
Scale testing happens after the release, not before
Most vendors have only a handful of customers with clusters larger than 1k nodes
* Heavily dependent on NN RPC workload, block size, average file size, average container size, etc.
14. What Happened?
We rapidly added 500 nodes to a 2000 node cluster
(don’t do this!)
NameNode RPC queue length and wait time skyrocketed
Jobs crawled to a halt
15. What Was the Cause?
A subtle performance/scale regression was introduced upstream
The bug was included in multiple releases
Increased time to allocate a new file
The more nodes you had, the worse it got
16. How We Used to do Scale Testing
1. Deploy the release to a small cluster (num_nodes = 100)
2. See if anything breaks
3. If no, then deploy to next largest cluster and goto step 2
4. If yes, figure out what went wrong and fix it
Problems with this approach
Expensive: developer time + hardware
Risky: Sometimes you can’t roll back!
Doesn’t always work: overlooks non-linear regressions
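That last point deserves a concrete illustration. A regression whose cost grows with cluster size can be invisible on a 100-node test cluster and crippling at production scale. The model below is a toy with invented numbers, not measurements from LinkedIn's clusters:

```python
def allocate_time_ms(num_nodes, base_ms=1.0, regression_per_node_us=2.0):
    """Toy model of a scale-dependent regression: a per-node scan
    accidentally added to the file-allocate path adds ~2us per node.
    (All numbers are invented for illustration.)"""
    return base_ms + num_nodes * regression_per_node_us / 1000.0

small = allocate_time_ms(100)   # 1.2 ms -- looks healthy on a test cluster
large = allocate_time_ms(2500)  # 6.0 ms -- 5x slower per file allocation
print(small, large)
```

A small-cluster run sees a negligible slowdown and passes; only a cluster thousands of nodes wide reveals the problem, which is why incremental deploy-and-observe testing misses this class of bug.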
17. HDFS Dynamometer
• Scale testing and performance investigation tool for HDFS
• High fidelity in all the dimensions that matter
• Focused on the NameNode
• Completely black-box
• Accurately fakes thousands of DNs on a small fraction of the hardware
• More details in a forthcoming blog post
21. Why Are Most User Jobs Poorly Tuned?
Too many dials!
Lots of frameworks: each one is slightly different.
Performance can change over time.
Tuning requires constant monitoring and maintenance!
* Tuning decision tree from “Hadoop In Practice”
22. Dr Elephant: Running Light Without Overbyte
Automated Performance Troubleshooting for Hadoop Workflows
● Detects Common MR and Spark Pathologies:
○ Mapper Data Skew
○ Reducer Data Skew
○ Mapper Input Size
○ Mapper Speed
○ Reducer Time
○ Shuffle & Sort
○ More!
● Explains Cause of Disease
● Guided Treatment Process
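To make the pathology detection less abstract, here is one way a data-skew heuristic can work: compare the worst task's input size to the median across tasks. This is a simplified sketch with invented thresholds, not Dr Elephant's actual rule:

```python
from statistics import median

def mapper_data_skew_severity(input_bytes, ratio_threshold=2.0):
    """Flag a job when some mapper reads far more input than the
    median mapper. A stand-in for a data-skew heuristic; the
    thresholds and severity labels here are invented."""
    if not input_bytes:
        return "NONE"
    med = median(input_bytes)
    worst = max(input_bytes)
    if med > 0 and worst / med >= ratio_threshold:
        # Twice the base threshold is treated as severe in this sketch.
        return "SEVERE" if worst / med >= 2 * ratio_threshold else "MODERATE"
    return "NONE"

# One straggler mapper reading ~10x the median input:
print(mapper_data_skew_severity([100, 110, 95, 105, 1000]))  # SEVERE
```

A real analyzer would pull these per-task counters from the job history server rather than taking a list of byte counts directly.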
23. Dr Elephant is Now Open Source
Grab the source code: github.com/linkedin/dr-elephant
Read the blog post: engineering.linkedin.com/blog
24. Upgrades are Hard
A totally fictional story:
The Hadoop team pushes a new Pig upgrade
The next day thirty flows fail with ClassNotFoundExceptions
Angry users riot
Property damage exceeds $30mm
What happened?
The flows depended on a third-party UDF that depended on a transitive dependency provided by the old version of Pig, but not the new version
25. Bringing Shading Out of the Shadows
What most people think it is:
Package an artifact and all its dependencies in the same JAR + rename some or all of the package names
What it really is:
Static linking for Java
Unfairly maligned by many people
We built an improved Gradle plugin that makes shading easier for inexperienced users
26. Byte-Ray: “X-Ray Goggles for JAR Files”
Audit Hadoop flows for incompatible and unnecessary dependencies.
Predict failures before they happen by scanning for dependencies that won’t be satisfied post-upgrade.
Proved extremely useful during the Hadoop 2 migration
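The core check such a tool performs can be approximated as set arithmetic over class names: which classes does a flow reference that the old classpath provided but the upgraded one will not? This is a deliberately simplified sketch (the real tool would inspect JAR contents and bytecode references; the example class names are invented):

```python
def unsatisfied_after_upgrade(flow_refs, old_classpath, new_classpath):
    """Return the classes a flow references that were satisfied by the
    old cluster classpath but will be missing after the upgrade --
    i.e., the ClassNotFoundExceptions waiting to happen."""
    satisfied_now = flow_refs & old_classpath
    return satisfied_now - new_classpath

# Hypothetical example: new Pig no longer ships a transitive dependency.
flow = {"com.example.MyUdf", "org.json.JSONObject"}
old_pig = {"org.json.JSONObject", "org.apache.pig.EvalFunc"}
new_pig = {"org.apache.pig.EvalFunc"}
print(unsatisfied_after_upgrade(flow, old_pig, new_pig))
# {'org.json.JSONObject'}
```

Running this check across all flows before an upgrade turns the fictional Pig story above into a report instead of thirty failed workflows.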
-Since People You May Know is long, we call it PYMK at LinkedIn.
-The original version ran on Oracle
-And the way it worked was to attempt to find overlaps between any pairs of people. Did they share the same school? Did they work at the same company?
-One big indicator was common connections, and we used something called triangle closing.
-Triangle closing is an easy concept to follow
<click>
-Mary knows Dave and Steve
<click>
-We make a guess that Dave may also know Steve. This is essentially what this feature does: we close that triangle.
<click>
-Additionally, if Dave and Steve share more than one connection, then we can be more confident in our guess.
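The triangle-closing idea described above can be sketched as a common-connection count: for every member, each pair of their not-yet-connected connections is a candidate suggestion, and more shared connections mean more confidence. This is a minimal illustration, not LinkedIn's actual PYMK implementation; the names and scoring are invented:

```python
from itertools import combinations
from collections import defaultdict

def triangle_closing_candidates(connections):
    """Given a dict mapping each member to the set of their connections,
    suggest unconnected pairs that share mutual connections. The score
    is the number of shared connections (higher = more confident)."""
    scores = defaultdict(int)
    for member, friends in connections.items():
        # Every pair of this member's connections forms an "open
        # triangle" that a suggestion could close.
        for a, b in combinations(sorted(friends), 2):
            if b not in connections.get(a, set()):
                scores[(a, b)] += 1  # one more shared connection
    return dict(scores)

graph = {
    "Mary": {"Dave", "Steve"},
    "Dave": {"Mary"},
    "Steve": {"Mary"},
}
print(triangle_closing_candidates(graph))  # {('Dave', 'Steve'): 1}
```

At LinkedIn's scale this pairwise expansion is exactly the kind of computation that outgrew a single Oracle instance, which is where the rest of the story picks up.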
-Three years later, LinkedIn was growing… fast, to 40-50 million members. I joined around this time as a member of the data products group
-We still used Oracle to create PYMK data and it may not surprise people to hear that we had scalability problems.
-In fact it failed often, and required a lot of manual intervention. When it succeeded, it would take about 6 weeks to produce new results, by which time the data was most likely stale.
-At its worst, PYMK had so many problems that no new data appeared on our site for 6 months.
-We tried other solutions. I won’t name them, even though some of them were well known… and none of them could solve our scale problem
-So we started a 20-node Hadoop cluster… pretty much on bad hardware that we ‘stole’ or ‘repurposed’ from our research and development servers without anyone really knowing.
-We really didn’t know what we were doing, and our cluster was misconfigured…
<click>
-but it solved PYMK in 3 days.
-So everything was good… well, we’ll see