@mrogati
Overcoming presentation bias
   Explicitly model the presentation bias
    ◦ This includes the rank & snippet
    ◦ [Smith & Elkan KDD’07], [Yue et al. WWW’10], etc.
   Learning from positive examples only
    ◦ [Elkan & Noto KDD’08] (good overview), [Lee & Liu ICML’03], etc.
   Mix in random instances as negative examples to avoid learning a classifier that stays too close to the margin
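
The last bullet can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the speaker's implementation: the function name `mix_in_random_negatives`, the `(user, item)` pair representation, and the `negatives_per_positive` ratio are all hypothetical. Observed positives get label 1, and randomly sampled unobserved pairs are assumed to be negatives (label 0), which is only approximately true.

```python
import random


def mix_in_random_negatives(positives, users, items,
                            negatives_per_positive=1, seed=0):
    """Return (user, item, label) rows: observed positives plus random negatives.

    Hypothetical helper for illustration. Random pairs are *assumed* to be
    negatives; a few of them may really be unobserved positives.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    rows = [(user, item, 1) for (user, item) in positives]

    # Cap the request so the sampling loop always terminates on small catalogs.
    grid_size = len(set(users)) * len(set(items))
    wanted = min(negatives_per_positive * len(positives),
                 grid_size - len(positive_set))

    negatives = set()
    while len(negatives) < wanted:
        pair = (rng.choice(users), rng.choice(items))
        if pair not in positive_set:
            negatives.add(pair)

    rows.extend((user, item, 0) for (user, item) in sorted(negatives))
    return rows


positives = [("u1", "j1"), ("u2", "j2")]
rows = mix_in_random_negatives(
    positives, ["u1", "u2", "u3"], ["j1", "j2", "j3"], negatives_per_positive=2
)
```

The ratio of negatives to positives is a tuning choice; the slide does not prescribe one.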
The Cold Start Problem
Crowdsourcing?
Caveat #1: Personalization
Caveat #2: Task Description




Example: Rate profile similarity from 1-4
Caveat #3: Quality Control and Community Management




… and scale, cost, input data selection
Alternatives?
(Your) Other Data
Task 1: Real or Fake?




Training Data: Closed Accounts
Task 2: Recommended Groups




Training Data: Current Group Members
Task 3: Job Matching




Training Data: Job Posters’ Activity
Task 4: Profile Similarity




 Training Data: Saved Folders
Training data matters!
[Chart: observed lift in production recommender systems; values shown: 10%, 20%, 50%, 100%]

The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at Strata 2012

Editor's Notes

  1. Title: The Model and the Train Wreck – a Training Data How-To, by Monica Rogati, data scientist at LinkedIn. Talking about the Cinderella of Machine Learning, training data: in particular, how do you get good training data, and why should you care? Why does training data matter?
  2. This is the deep data session, so we’re all familiar w/ recommender systems: highly personalized recommendations for products, articles, links or even for your next job. Let’s take a quick peek under the hood – what’s the basic algorithm behind these recommendations?
  3. Well, you usually have a user profile, and you’re trying to match it to the items you’re trying to recommend (in this case, jobs) using a set of criteria – for example, their skills, their geography and their industry. Then, in order to show the top 3, you combine how well the profile matches the item across all these dimensions. But how do you do that? How do you decide what’s more important? You could code up some heuristics – which works if you have 3 criteria, not so well if you have 300.
  4. That’s when machine learning comes in – and that’s where we have hundreds and hundreds of algorithms and their variants, with thousands of papers written every year trying to squeeze out a little more performance for a particular task. This is awesome, and there are some amazingly cool algorithms out there, and both academia and industry are pushing the boundaries of science. However, they’re often all using the same dataset, which means they’re using the same training data. This is great for science and reproducibility, not so great from an ROI perspective if you’re in industry trying to improve your results as much as possible. Using *more* training data usually beats better algorithms. This is widely known (see the famous talk by Peter Norvig from Google on the unreasonable effectiveness of data), and we see it again and again in recommender systems, machine translation, text mining & classification, ad targeting… More data beats clever algorithms, and better data beats more data. In this talk, I’ll discuss a few techniques for getting better training data.
  5. So what does training data look like? Well, to continue with the job matching example, a training data point consists of a (user, job) tuple, a score on how well they match across all these different dimensions, and a flag on whether they’re a good match or not, the response variable you’re trying to predict. So where do we get this training data from?
  6. Well, one popular source of training data is using what the user clicked on to customize their recommendations & adapt the algorithm. That’s great, so let’s take a look at this. We show the user 3 jobs, they click on some, x out others and don’t click on the other one. So those are my training data points. But if you actually used that to train a model, you’re in trouble. Remember, you’re trying to teach the algorithm what makes a good match and what doesn’t. Using this data actually doesn’t work because of what we call presentation bias. So now you’re thinking, “I know what that is, it’s that items higher up the list are more likely to be clicked just because they’re higher up the list, not because they’re better matches”. That’s true, so you can do some clever things to get around that – but that’s only part of the story. The other part of the story is that you had some initial algorithm or heuristics that you used to show these items to the user – and let’s say your algorithm made geography a very important criterion, so all your top 3 results match geography exactly, but they’re going to be a mix of positive and negative datapoints. What this means is that when you train, your new model will learn that geography makes absolutely no difference in whether the user is going to click or not, so it’s simply going to ignore that criterion since it’s not a good predictor at all! What you’ve done is learned a new model that’s actually a refinement on top of the old one. In this case, it was obvious, but many times it’s more subtle, so you might just decide your algorithm doesn’t work, when the problem was in the training data all along. There’s good news though. Once you know about this trap, there are several ways to get around it.
  7. This is a parody slide – I *could* have made all of them look like this and I actually have during my days in academia – so enjoy the rest of my image-driven slides. The info is real though.
  8. This works. Your recommender system is a well-oiled machine and you continue accumulating more and more data & building better and better models. Except… there’s one “detail” we conveniently overlooked. What do you do when your product doesn’t exist yet, when there’s nothing for the user to click on?
  9. That’s the cold start problem, and you have several options. You could start by showing random items (this is nice because you’re getting rid of the presentation bias too). That works if it’s ads or news articles, it doesn’t work so well for job recommendations that people take very personally and are offended by bad matches, or if you’re launching a recommendation startup. Reid Hoffman said that if you’re not embarrassed w/ your product when it launches, you launched too late. Well, launching w/ random recommendations in this case is not an option. You could also manually set some weights in your model – again, this works w/ 3 features, not so well w/ 300. Ideally, you’d like to have some training data. So where do you get it – where do you get those coveted 0s and 1s?
  10. Crowdsourcing is awesome and these days, it would be silly not to get labeled data by using Mechanical Turk, CrowdFlower or Samasource. I’m a big fan. However, crowdsourcing for RecSys is hard. If you want to make it work, here are a few issues you’d have to consider.
  11. You’re asking a worker to put themselves in the shoes of a user and decide whether a job recommendation is good – whether they want to stay in the enterprise space, take over the family business, or love Data as much as we do and want to become an Android developer. Or maybe they want to switch universes entirely and have the Force be with them. The point is, this is hard to do for somebody other than the user.
  12. If you’ve ever written task descriptions for crowdsourcing, you know it’s pretty much like writing good code. It needs to be concise, precise, explain edge cases and give very specific instructions and catch exceptions. Sometimes though, that’s not even possible. Take the case where you’d like to launch a product showing profiles similar to the one you’re looking at. How do you define similarity? How do you give them “rules” about what’s more important and how to trade off going to the same school vs. being in the same industry? If you had those rules, you should have just coded them up & not bothered w/ crowdsourcing.
  13. That’s a talk unto itself – how do you do quality control, how do you maintain your requester reputation, how do you decide how much data to label, how do you select the input data you’d like to have labeled – all things you’d have to consider to be successful when doing crowdsourcing for labeling recommender systems training data.
  14. But maybe, just maybe, if you’re a bit creative, in some cases, you don’t need to do crowdsourcing at all. There is an alternative – data recycling.
  15. The universe of data you have access to (be it within your own company or via external sources of data) is complex and beautiful – and it might contain the signals you can transform into training data for your application. Let’s go through a few examples.
  16. Better algorithms are great. However, it takes time and effort to get them right. The Netflix Prize took 3 years and thousands of people for a 10% reduction in error. Let’s compare that with some example lifts I’ve gotten by changing the *training data* on production rec systems. These are all systems that had been in production for more than a year at the time of the change. Using fresher & larger training data: 20% lift. Using some of the data recycling techniques I mentioned: 50% lift. Using training data at all vs. a hand-built model: 100% lift. YMMV. This is why training data matters. It’s your best bang for the buck.
  17. You know, I’ve started by calling training data the Cinderella of Machine Learning – but now I’m thinking it’s more like Harry Potter – first neglected in a cupboard under the stairs, then growing up and doing some real magic.
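
Notes 5 and 6 can be made concrete with a small Python sketch. It is an illustration under assumptions, not the system described in the talk: the `TrainingPoint` fields, the toy scores, and the `feature_variances` helper are all hypothetical. Each row is a (user, job) pair with per-dimension match scores and a 0/1 label. When the old ranker only ever shows jobs that match geography exactly, that feature is constant in the click log, so a model trained on the log has no way to learn that it matters.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class TrainingPoint:
    user_id: str
    job_id: str
    skills_match: float
    geography_match: float
    industry_match: float
    clicked: int  # response variable: 1 = good match, 0 = not


# Toy click log with made-up values: every shown job matches geography exactly,
# yet the labels are a mix of positives and negatives.
logged = [
    TrainingPoint("u1", "j1", 0.9, 1.0, 0.8, 1),
    TrainingPoint("u1", "j2", 0.4, 1.0, 0.7, 0),
    TrainingPoint("u1", "j3", 0.7, 1.0, 0.2, 1),
    TrainingPoint("u2", "j4", 0.3, 1.0, 0.5, 0),
]

FEATURES = ("skills_match", "geography_match", "industry_match")


def feature_variances(points):
    """Population variance of each match score across the logged points."""
    return {name: pvariance([getattr(p, name) for p in points])
            for name in FEATURES}


variances = feature_variances(logged)
# A feature with zero variance cannot separate clicks from non-clicks.
constant_features = [name for name, var in variances.items() if var == 0]
print(constant_features)
```

Checking for constant or near-constant features in logged data is one cheap way to spot this kind of selection effect before training.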