Hacking Data Science
Overview of ML pipeline
Gather data → Feature engineering → Model fitting → Evaluation
©2013 LinkedIn Corporation. All Rights Reserved. 2
Understanding Seniority
Companies are not standard
Titles are not enough
Things change
Learning to target better
Classifying names to genders
Let’s look at Monica again
Not so fast …
Not so fast …
Even slower …
Sometimes the answer is just under your nose
Comment Spam on Influencer content
Challenge 1: Binary tasks are too guessable
Challenge 2: Context matters
Spam Comment Annotation Task
Quality: Gold distributions and skewed datasets
Using results to evaluate new features
Model         ΔP    ΔR      ΔPRC
Baseline      -     -       -
Variation 1   +     -       +
Variation 2   -     +       +
Variation 3   -     ++      --
Variation 4   -     +++     ++
Variation 5   -     +++     ++
Variation 6   -     +++     ++
Variation 7   -     ++++    +++
Variation 8   -     ++++    +++
Variation 9   -     ++++    +++
Variation 10  -     ++++    +++
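The Δ columns read as change in precision (P), recall (R), and area under the precision-recall curve (PRC) relative to the baseline. A toy sketch of how such a comparison is computed; the labels and predictions below are made up and not LinkedIn's actual spam classifier:

```python
# Hypothetical comparison of a baseline and a variant classifier on the
# same labeled comments by precision and recall (the P and R in the table).

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: 1 = spam comment, 0 = legitimate.
y_true    = [1, 1, 1, 1, 0, 0, 0, 0]
baseline  = [1, 1, 0, 0, 0, 0, 0, 0]   # conservative: misses spam
variation = [1, 1, 1, 1, 1, 0, 0, 0]   # catches more spam, one false positive

p0, r0 = precision_recall(y_true, baseline)
p1, r1 = precision_recall(y_true, variation)
print(f"dP = {p1 - p0:+.2f}, dR = {r1 - r0:+.2f}")
```

This reproduces the pattern in the table: the variations trade a small precision drop for a large recall gain.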
“As simple as possible, but not simpler”
LinkedIn Channels
Labels aren’t free
Suggest likely candidates for topics then expand
Evaluate suggested article-topic pairs
• Using results to evaluate new implementations of the spam classifier
  – Improve precision without a drop in recall
• 18k comments labeled in 54 hrs for $180
Quality: Not by Gold alone
Using results to evaluate existing classification framework
“Help your helpers”
Search is a major portal to information
LI Search is personalized
Evaluation is still possible
Search Evaluation – WTF@1
Quality: Behavioral metrics are good too!
“Pick a solvable problem”
Standardizing titles
Which question is easier?
1. What is a better name for the title “account executive”?
2. How similar are “account executive” and “sales executive”?
Notable Experts
First attempt
Second attempt
Third attempt
What makes the best data mining expert?
• Education?
• Industry experience?
• Number of publications?
• Communication skills?
• Hacking skills?
• Knowledge of statistics?
• Number of endorsements?
“More bad data != better data”
Summary
1. Use the data you already have
2. Keep it simple, but not too simple
3. Pick a solvable problem
4. Help your helpers
5. Sample intelligently
6. More (bad) data != better data
Questions?

Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.

Editor's Notes

  • #4 Supervised (gold, agreement) & unsupervised (behavioral)
  • #15 Context: why it matters. Off-topic comments lower the perceived value of Influencer content, the LI network, etc. Legit members may leave low-quality comments -> no hell-banning.
  • #16 Especially if you only guess on the hard ones. + Gold and wawa don’t work as well with binary tasks.
  • #17 + references to article, other comments, etc.
  • #18 Sampling: took clusters where at least one item scored poorly with the existing classifier. Still a biased dataset -> skew gold to catch positive cases (80% of Golds have at least one comment flagged). Treat any comment that got at least 1 vote as “suspect”. NEXT TIME: set minimum agreement thresholds and collect more labels dynamically.
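  A minimal sketch of the sampling recipe in that note, with illustrative data structures (nothing here is the actual LinkedIn pipeline): skew the gold set so ~80% of it contains flagged comments, and treat any comment with at least one spam vote as suspect.

```python
import random

def sample_gold(clusters, n_gold, pos_frac=0.8, seed=0):
    """clusters: list of dicts with a 'has_flagged' bool. Return a gold sample
    skewed so ~pos_frac of items contain at least one flagged comment."""
    rng = random.Random(seed)
    pos = [c for c in clusters if c["has_flagged"]]
    neg = [c for c in clusters if not c["has_flagged"]]
    n_pos = min(len(pos), round(n_gold * pos_frac))
    return rng.sample(pos, n_pos) + rng.sample(neg, n_gold - n_pos)

def suspects(votes_per_comment):
    """Any comment that got at least 1 spam vote is treated as suspect."""
    return [cid for cid, votes in votes_per_comment.items() if votes >= 1]
```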
  • #20 + Using results to evaluate new implementations of the spam classifier: improve Prec without a drop in Rec. + 18k comments labeled in 54 hrs for $180.
  • #21 + simple as possible, but not any simpler
  • #22 need to find timely, relevant content for many subjects
  • #23 Free-text tagging = standardization pain, plus hard to manage quality. + double-pass -> annoying. Standardized taxonomy: 1,200 topics selected as representative of LinkedIn members’ interests. + random guessing: 1,200 topics is still a lot.
  • #24 Pick “likely” labels for evaluation: + weak classifier to identify skills in an article -> expand to related skills. + weak classifier to identify industry of article -> expand to related skills. + pick labels based on source of article (e.g., Forbes -> economy, marketing, etc.). + 100 candidate labels for each article.
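  The candidate-generation recipe in that note can be sketched as a union of weak-classifier topics and source-implied topics, each expanded through a related-topics map. The maps and names below are toy assumptions, not LinkedIn data:

```python
# Hypothetical related-topics and source-to-topics maps for illustration only.
RELATED = {"hadoop": ["big data", "java"], "economy": ["finance"]}
SOURCE_TOPICS = {"forbes": ["economy", "marketing"]}

def expand(topics):
    """Expand a topic list with its related topics."""
    out = list(topics)
    for t in topics:
        out.extend(RELATED.get(t, []))
    return out

def candidate_labels(weak_topics, source, budget=100):
    """Combine weak-classifier topics (expanded) with source-based topics,
    deduplicated in order and truncated to `budget` candidates."""
    pool = expand(weak_topics) + expand(SOURCE_TOPICS.get(source, []))
    seen, out = set(), []
    for t in pool:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out[:budget]
```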
  • #25 + 400k article-topic pairs. + e.g., 60k pairs in ~1 week @ 7c each. + 4 labels for each item; take the average value (rather than looking for consensus). + bootstrap additional gold from items completed with high agreement. Lessons: + difference between very & somewhat relevant: “is this the primary topic”. + some non-English articles, some garbled articles.
  • #27 Working towards a “less” supervised way to create new channels
  • #28 Preprocessing the data to select likely matches greatly reduced the number of labels needed
  • #29 Search: + helps members find and be found. + People, Jobs, Groups and more.
  • #30 LI search is personalized: + tuple of (user, query, document). Too much to ask a random person to label for training. + “imagine that you’re X and see Y” has its limits. + train from logs.
  • #31 Indirect measures: + CTR@1, CTR@P1, session abandonment, etc. Explicit measures: + what about non-personalized search (such as for recruiters)? + what about identifying items that are off-topic for all members?
  • #32 1,000 query-result pairs. + retrieve all queries where result@1 didn’t get a click. + remove any queries tagged as {firstname, lastname} where the name in the query matched the name in the profile (we know these perform well). Binary tasks bad – added a second set of questions. + allows us to audit the query tagger at the same time. Using results to triage queries for additional manual review. + also adds an explicit relevance metric to track over time (WTF@1).
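  The WTF@1 idea in that note can be sketched in a few lines: from the logs, keep sessions where the top result got no click, drop trusted navigational name queries, send the rest for human judgment, then report the judged fraction of irrelevant top results. Field names below are assumptions, not the real log schema:

```python
def wtf_at_1_candidates(sessions):
    """sessions: list of dicts with 'query', 'is_name_query',
    'name_matches_profile', and 'clicked_rank_1'. Return queries worth judging."""
    out = []
    for s in sessions:
        if s["clicked_rank_1"]:
            continue                      # top result got a click: likely fine
        if s["is_name_query"] and s["name_matches_profile"]:
            continue                      # known-good navigational query
        out.append(s["query"])
    return out

def wtf_at_1(judgments):
    """judgments: list of bools, True = top result judged irrelevant."""
    return sum(judgments) / len(judgments) if judgments else 0.0
```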
  • #33 Other behavioral signals: + individual judgment duration, scrolls, clicks, mouse movement. + jQuery is your friend.
  • #34 Picking the right problem gets you a long way there. SkillRank example. ----- Meeting Notes (8/15/13 16:55) ----- + name queries really aren’t that useful, so we excluded those. + ran it internally first, then with turkers. + nearly identical; arguably it was better.
  • #45 Other fun lessons: 5. Not by gold alone.