Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.

Better data beats better algorithms, but better data can be hard to come by. In this talk, Vitaly Gordon, Senior Data Scientist at LinkedIn, and Patrick Philips, Crowdsourcing Expert at LinkedIn, will show how the LinkedIn data science team hacks data science using sophisticated data mining and crowdsourcing techniques to leverage the data they already have and create the data that's missing.

Speaker Notes
  • Supervised (gold, agreement) & unsupervised (behavioral)
  • Context: why it matters. Off-topic comments lower the perceived value of Influencer content, the LinkedIn network, etc. Legitimate members may leave low-quality comments, so no hell-banning.
  • Especially if you only guess on the hard ones. Gold and wawa (worker agreement with aggregate) don’t work as well with binary tasks.
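The guessability point can be made concrete: a worker answering at random passes binary gold checks surprisingly often, which is why multi-option questions audit better. A quick sketch; the gold counts and passing bar here are illustrative assumptions, not LinkedIn's settings:

```python
# Probability that a purely random guesser passes quality control,
# as a function of gold-question count and options per question.
# Illustrative numbers only; thresholds are assumptions.
from math import comb

def pass_probability(n_gold, n_options, min_correct):
    """P(random guesser gets >= min_correct of n_gold gold questions right)."""
    p = 1.0 / n_options
    return sum(
        comb(n_gold, k) * p**k * (1 - p)**(n_gold - k)
        for k in range(min_correct, n_gold + 1)
    )

# Binary task: guessing clears a 4-of-5 gold bar ~19% of the time.
binary = pass_probability(5, 2, 4)
# Four-option task: the same bar is far harder to clear by chance.
four_way = pass_probability(5, 4, 4)
print(f"binary: {binary:.4f}, four-way: {four_way:.4f}")
```

With five golds and a 4-correct bar, a random guesser passes a binary task with probability 0.1875 but a four-option task with only 0.0156, which is the core of the "binary tasks are too guessable" complaint.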
  • Provide context: references to the article, other comments, etc.
  • Sampling: took clusters where at least one item scored poorly with the existing classifier. Still a biased dataset, so skew the gold to catch positive cases (80% of golds have at least one comment flagged). Treat any comment that got at least one vote as “suspect.” Next time: set minimum agreement thresholds and collect more labels dynamically.
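The "set minimum agreement thresholds and collect more labels dynamically" idea from the note above can be sketched as a simple stopping rule. The threshold and budget values are assumptions for illustration:

```python
# Sketch of dynamic label collection: keep requesting judgments until
# agreement is high enough or a per-item label budget is exhausted.
# min_agreement and max_labels are illustrative, not LinkedIn's values.
from collections import Counter

def aggregate(labels, min_agreement=0.7, max_labels=7):
    """Return (label, done) for the judgments collected so far."""
    if not labels:
        return None, False
    top, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    if agreement >= min_agreement and len(labels) >= 3:
        return top, True          # confident: stop collecting
    if len(labels) >= max_labels:
        return top, True          # budget exhausted: take the plurality
    return top, False             # request another judgment

print(aggregate(["spam", "spam", "spam"]))   # ('spam', True)
print(aggregate(["spam", "ok", "spam"]))     # agreement 2/3 < 0.7: ('spam', False)
```

This avoids paying for extra labels on easy items while hard items automatically accumulate more judgments.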
  • Using results to evaluate new implementations of the spam classifier: improve precision without a drop in recall. 18k comments labeled in 54 hours for $180.
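Comparing classifier variants against the crowd-labeled set reduces to computing precision and recall per variant; at 18k labels for $180, each label costs about one cent. A minimal sketch with made-up data (the labels and variant predictions below are not from the talk):

```python
# Sketch: evaluate spam-classifier variants against crowd labels.
# True = spam. All data here is invented for illustration.

def precision_recall(y_true, y_pred):
    """Precision and recall for binary predictions against gold labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

y_true   = [True, True, True, False, False, False]   # crowd labels
baseline = [True, False, True, True, False, False]   # hypothetical variants
variant  = [True, True, True, False, True, False]
print("baseline:", precision_recall(y_true, baseline))
print("variant: ", precision_recall(y_true, variant))
```

Running both variants over the same labeled set makes the "improve precision without a drop in recall" criterion a mechanical check rather than a judgment call.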
  • As simple as possible, but not any simpler.
  • Need to find timely, relevant content for many subjects.
  • Free-text tagging means standardization pain, plus it is hard to manage quality, and double-pass review is annoying. Standardized taxonomy instead: 1,200 topics selected as representative of LinkedIn members’ interests. As for random guessing: 1,200 topics is still a lot.
  • Pick “likely” labels for evaluation: use a weak classifier to identify skills in an article, then expand to related skills; use a weak classifier to identify the industry of an article, then expand to related skills; pick labels based on the source of the article (e.g., Forbes -> economy, marketing, etc.). Result: 100 candidate labels for each article.
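The candidate-narrowing step can be sketched as a union of weak-classifier hits, their taxonomy neighbors, and source-based priors. The topic graph and source mapping below are toy stand-ins, not LinkedIn's taxonomy:

```python
# Sketch of narrowing ~1,200 taxonomy topics to a short candidate list
# per article. RELATED and SOURCE_TOPICS are invented for illustration.

RELATED = {                      # hypothetical topic-similarity edges
    "machine learning": ["data mining", "statistics"],
    "economy": ["finance", "markets"],
}
SOURCE_TOPICS = {"forbes.com": ["economy", "marketing"]}

def candidate_labels(weak_topics, source):
    """Union of weak-classifier hits, their neighbors, and source priors."""
    candidates = set(weak_topics)
    for topic in weak_topics:
        candidates.update(RELATED.get(topic, []))
    candidates.update(SOURCE_TOPICS.get(source, []))
    return candidates

print(candidate_labels(["machine learning"], "forbes.com"))
```

Workers then judge only the candidates instead of the full 1,200-topic taxonomy, which is what makes the labeling cost tractable.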
  • 400k article-topic pairs; e.g., 60k pairs in about one week at 7¢ each. Four labels for each item, taking the average value rather than looking for consensus. Bootstrap additional gold from items completed with high agreement. Lessons: the difference between “very” and “somewhat” relevant comes down to “is this the primary topic”; some non-English articles, some garbled articles.
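Averaging ordinal judgments instead of forcing consensus, and promoting high-agreement items to new gold, might look like the following. The 0-2 relevance scale and the gold threshold are assumptions:

```python
# Sketch: score each article-topic pair by averaging 4 ordinal judgments
# (0 = not, 1 = somewhat, 2 = very relevant) rather than requiring
# consensus, and promote unanimous items to new gold. Scale and
# threshold are assumptions for illustration.

def score_pair(judgments):
    """Average ordinal relevance across workers."""
    return sum(judgments) / len(judgments)

def bootstrap_gold(item_judgments, threshold=2.0):
    """Items where every worker said 'very relevant' become new gold."""
    return {item: score_pair(j)
            for item, j in item_judgments.items()
            if score_pair(j) >= threshold}

pairs = {
    ("article-1", "economy"): [2, 2, 2, 2],    # unanimous: becomes gold
    ("article-2", "marketing"): [2, 1, 2, 0],  # mixed: keep the average
}
print(bootstrap_gold(pairs))
```

Averaging keeps disagreement as a graded signal (1.25 vs 2.0) instead of discarding items that fail a consensus rule, and the bootstrapped gold grows the quality-control pool for free.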
  • Working towards a “less” supervised way to create new channels
  • Preprocessing the data to select likely matches greatly reduced the number of labels needed
  • Search helps members find and be found: People, Jobs, Groups, and more.
  • LinkedIn search is personalized: relevance is a tuple of (user, query, document). That is too much to ask a random person to label for training, and “imagine that you’re X and see Y” has its limits, so train from logs.
  • Indirect measures: CTR@1, CTR@P1, session abandonment, etc. Explicit measures: what about non-personalized search (such as for recruiters)? What about identifying items that are off-topic for all members?
  • 1,000 query-result pairs: retrieve all queries where result@1 didn’t get a click, then remove any queries tagged as {firstname, lastname} where the name in the query matched the name in the profile (we know these perform well). Binary tasks are bad, so we added a second set of questions, which also allows us to audit the query tagger at the same time. Using the results to triage queries for additional manual review also adds an explicit relevance metric to track over time (WTF@1).
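The triage described above (filter the logs to no-click-at-1 queries, exclude name queries that matched the profile, send the rest for human judgment) can be sketched as follows; the log schema and rating values are hypothetical:

```python
# Sketch of the WTF@1 pipeline. Field names are invented; real search
# logs would carry much richer structure.

def triage(logs):
    """Select no-click-at-1 queries, excluding matching name queries."""
    return [q for q in logs
            if not q["clicked_at_1"]
            and not (q["is_name_query"] and q["name_matches_profile"])]

def wtf_at_1(judged):
    """Fraction of judged (query, result@1) pairs rated irrelevant."""
    if not judged:
        return 0.0
    return sum(1 for j in judged if j["rating"] == "irrelevant") / len(judged)

logs = [
    {"clicked_at_1": True,  "is_name_query": False, "name_matches_profile": False},
    {"clicked_at_1": False, "is_name_query": True,  "name_matches_profile": True},
    {"clicked_at_1": False, "is_name_query": False, "name_matches_profile": False},
]
print(len(triage(logs)))   # only the last query needs human review
```

Tracking `wtf_at_1` over the judged sample gives the explicit relevance metric the note mentions, complementing the purely behavioral CTR-style measures.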
  • Other behavioral signals: individual judgment duration, scrolls, clicks, mouse movement. jQuery is your friend.
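One cheap unsupervised signal from that list is judgment duration: workers who answer implausibly fast are probably clicking through without reading. A sketch with an assumed cutoff (in practice it would be tuned per task):

```python
# Sketch of an unsupervised (behavioral) quality check: flag workers
# whose median time per judgment is implausibly fast. The 2-second
# floor is an assumption, not a recommended value.
from statistics import median

def too_fast(durations_sec, floor=2.0):
    """True if a worker's median time per judgment is below the floor."""
    return median(durations_sec) < floor

print(too_fast([0.8, 1.1, 0.9, 1.5]))   # likely clicking through: True
print(too_fast([6.0, 9.5, 4.2, 7.1]))   # plausible reading time: False
```

Using the median rather than the mean keeps one long coffee-break judgment from masking a pattern of sub-second answers.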
  • Picking the right problem gets you a long way there: the SkillRank example. From the meeting notes (8/15/13): name queries really aren’t that useful, so we excluded those; we ran it internally first, then with Turkers, and the results were nearly identical, arguably better.
  • Other fun lessons. Lesson 5: not by gold alone.

Presentation Transcript

  • Hacking Data Science
  • Overview of the ML pipeline: Gather data -> Feature engineering -> Model fitting -> Evaluation. ©2013 LinkedIn Corporation. All Rights Reserved.
  • Understanding Seniority
  • Companies are not standard
  • Titles are not enough
  • Things change
  • Learning to target better
  • Classifying names to genders
  • Let’s look at Monica again
  • Not so fast …
  • Not so fast …
  • Even slower …
  • Sometimes the answer is just under your nose
  • Comment Spam on Influencer content
  • Challenge 1: Binary tasks are too guessable
  • Challenge 2: Context matters
  • Spam Comment Annotation Task
  • Quality: Gold distributions and skewed datasets
  • Using results to evaluate new features:

        Model         ΔP     ΔR      ΔPRC
        Baseline      -      -       -
        Variation 1   +      -       +
        Variation 2   -      +       +
        Variation 3   -      ++      --
        Variation 4   -      +++     ++
        Variation 5   -      +++     ++
        Variation 6   -      +++     ++
        Variation 7   -      ++++    +++
        Variation 8   -      ++++    +++
        Variation 9   -      ++++    +++
        Variation 10  -      ++++    +++
  • “As simple as possible, but not simpler”
  • LinkedIn Channels
  • Labels aren’t free
  • Suggest likely candidates for topics, then expand
  • Evaluate suggested article-topic pairs: using results to evaluate new implementations of the spam classifier, improving precision without a drop in recall; 18k comments labeled in 54 hrs for $180
  • Quality: Not by Gold alone
  • Using results to evaluate existing classification framework
  • “Help your helpers”
  • Search is a major portal to information
  • LI Search is personalized
  • Evaluation is still possible
  • Search Evaluation – WTF@1
  • Quality: Behavioral metrics are good too!
  • “Pick a solvable problem”
  • Standardizing titles
  • Which question is easier? 1. Find a better name for the title “account executive”. 2. How similar are “account executive” and “sales executive”?
  • Notable Experts
  • First attempt
  • Second attempt
  • Third attempt
  • What makes the best data mining expert? Education? Industry experience? Number of publications? Communication skills? Hacking skills? Knowledge of statistics? Number of endorsements?
  • “More bad data != better data”
  • Summary:
    1. Use the data you already have
    2. Keep it simple, but not too simple
    3. Pick a solvable problem
    4. Help your helpers
    5. Sample intelligently
    6. More (bad) data != better data
  • Questions?