The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at Strata 2012


Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative. … Or is it? You’ve probably learned an algorithm to run on top of your existing algorithm, now and every time you re-train. And what do you do when the data product you’re building doesn’t have any users yet? Do you really launch with random results, hand label 50K examples, or ask a Turker to pretend they’re User #1337? Unlike having a better algorithm, having better training data can improve your results by orders of magnitude. Yet training data generation is often an afterthought—a footnote in a formula-filled publication. In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.


Overcoming presentation bias
• Explicitly model the presentation bias
  ◦ This includes the rank & snippet
  ◦ [Smith & Elkan KDD’07], [Yue et al WWW’10] etc.
• Learning from positive examples only
  ◦ [Elkan & Noto KDD’08] (good overview), [Lee & Liu ICML’03] etc.
• Mix in random instances as negative examples to avoid learning a classifier that stays too close to the margin

Caveat #3: Quality Control and Community Management

… and scale, cost, input data selection

Task 1: Real or Fake?

Training Data: Closed Accounts

Task 2: Recommended Groups

Training Data: Current Group Members

Task 3: Job Matching

Training Data: Job Posters’ Activity

training data matters!

Observed lift in production recommender systems: 100%, 50%, 20%, 10%


The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at Strata 2012

1. @mrogati

7. Overcoming presentation bias • Explicitly model the presentation bias ◦ This includes the rank & snippet ◦ [Smith & Elkan KDD’07], [Yue et al WWW’10] etc. • Learning from positive examples only ◦ [Elkan & Noto KDD’08] (good overview), [Lee & Liu ICML’03] etc. • Mix in random instances as negative examples to avoid learning a classifier that stays too close to the margin

9. The Cold Start Problem

10. Crowdsourcing?

11. Caveat #1: Personalization

12. Caveat #2: Task Description Example: Rate profile similarity from 1-4

13. Caveat #3: Quality Control and Community Management … and scale, cost, input data selection

14. Alternatives?

15. (Your) Other Data

16. Task 1: Real or Fake? Training Data: Closed Accounts

17. Task 2: Recommended Groups Training Data: Current Group Members

18. Task 3: Job Matching Training Data: Job Posters’ Activity

19. Task 4: Profile Similarity Training Data: Saved Folders

20. training data matters! Observed lift in production recommender systems: 100%, 50%, 20%, 10%

21. @mrogati

Editor's Notes

Title: The Model and the Train Wreck – a Training Data How-To, by Monica Rogati, data scientist at LinkedIn. Talking about the Cinderella of Machine Learning - Training Data - in particular, how do you get good training data and why should you care – why does training data matter?
This is the deep data session, so we’re all familiar w/ recommender systems: highly personalized recommendations for products, articles, links or even for your next job. Let’s take a quick peek under the hood – what’s the basic algorithm behind these recommendations?
Well, you usually have a user profile, and you’re trying to match it to the items you’re trying to recommend (in this case, jobs) using a set of criteria – for example, their skills, their geography and their industry. Then, in order to show the top 3, you combine how well the profile matches the item across all these dimensions. But how do you do that? How do you decide what’s more important? You could code up some heuristics – which works if you have 3 criteria, not so well if you have 300.
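To make the hand-coded heuristic concrete, here is a minimal sketch of a weighted combination across the three criteria mentioned above. The weights, the 0-to-1 score scale, the sample jobs, and the names `WEIGHTS`, `match_score` and `top_k` are all illustrative assumptions, not values from the talk.

```python
# Hypothetical hand-set weights; the talk does not give actual values.
WEIGHTS = {"skills": 0.5, "geography": 0.3, "industry": 0.2}


def match_score(criteria_scores: dict[str, float]) -> float:
    """Weighted sum of per-criterion match scores (each assumed to be in [0, 1])."""
    return sum(w * criteria_scores.get(name, 0.0) for name, w in WEIGHTS.items())


def top_k(candidates: dict[str, dict[str, float]], k: int = 3) -> list[str]:
    """Return the k job ids with the highest combined score."""
    ranked = sorted(candidates, key=lambda job: match_score(candidates[job]), reverse=True)
    return ranked[:k]


# Made-up per-criterion scores for one user against four jobs.
jobs = {
    "data_scientist": {"skills": 0.9, "geography": 1.0, "industry": 0.8},
    "android_developer": {"skills": 0.4, "geography": 1.0, "industry": 0.5},
    "sales_manager": {"skills": 0.1, "geography": 0.0, "industry": 0.2},
    "ml_engineer": {"skills": 0.8, "geography": 0.0, "industry": 0.9},
}

print(top_k(jobs))  # ['data_scientist', 'android_developer', 'ml_engineer']
```

With three criteria the weights are easy to set by eye; with 300 they are not, which is the motivation for learning them from data.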
That’s when machine learning comes in – and that’s where we have hundreds and hundreds of algorithms and their variants, with thousands of papers written every year trying to squeeze out a little more performance for a particular task. This is awesome, and there are some amazingly cool algorithms out there, and both academia and industry are pushing the boundaries of science. However, they’re often all using the same dataset, which means they’re using the same training data. This is great for science and reproducibility, not so great from an ROI perspective if you’re in industry trying to improve your results as much as possible. Using *more* training data usually beats better algorithms – this is widely known (see the famous talk by Peter Norvig from Google on the unreasonable effectiveness of data), and we see it again and again in recommender systems, machine translation, text mining & classification, ad targeting… More data beats clever algorithms, and better data beats more data. In this talk, I’ll discuss a few techniques for getting better training data.
So what does training data look like? Well, to continue with the job matching example, a training data point consists of a (user, job) tuple, a score on how well they match across all these different dimensions, and a flag on whether they’re a good match or not, the response variable you’re trying to predict. So where do we get this training data from?
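As a rough sketch of the data point described above, the structure below holds the (user, job) pair, one match score per criterion, and the good-match flag. The class name, field names, feature list and sample values are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical feature names, mirroring the criteria mentioned earlier in the talk.
FEATURE_NAMES = ["skills", "geography", "industry"]


@dataclass
class TrainingExample:
    user_id: str
    job_id: str
    match_scores: dict[str, float]  # one score per criterion
    is_good_match: int              # response variable: 1 = good match, 0 = not


def to_xy(examples):
    """Split examples into a feature matrix X and a label vector y."""
    X = [[ex.match_scores.get(name, 0.0) for name in FEATURE_NAMES] for ex in examples]
    y = [ex.is_good_match for ex in examples]
    return X, y


examples = [
    TrainingExample("user_1", "job_a", {"skills": 0.9, "geography": 1.0, "industry": 0.8}, 1),
    TrainingExample("user_1", "job_b", {"skills": 0.2, "geography": 1.0, "industry": 0.1}, 0),
]
X, y = to_xy(examples)
print(X, y)
```

The `X` and `y` lists are in the shape most learning libraries expect, so the question becomes where the labels in `y` come from.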
Well, one popular source of training data is using what the user clicked on to customize their recommendations & adapt the algorithm. That’s great, so let’s take a look at this. We show the user 3 jobs; they click on some, x out others and ignore the rest. So those are my training data points. But if you actually use that to train a model, you’re in trouble. Remember, you’re trying to teach the algorithm what makes a good match and what doesn’t. Using this data actually doesn’t work because of what we call presentation bias. So now you’re thinking, “I know what that is, it’s that items higher up the list are more likely to be clicked just because they’re higher up the list, not because they’re better matches”. That’s true, so you can do some clever things to get around that – but that’s only part of the story. The other part of the story is that you had some initial algorithm or heuristics that you used to show these items to the user – and let’s say your algorithm made geography a very important criterion, so all your top 3 results match geography exactly, but they’re going to be a mix of positive and negative datapoints. What this means is that when you train, your new model will learn that geography makes absolutely no difference in whether the user is going to click or not, so it’s simply going to ignore that criterion since it’s not a good predictor at all! What you’ve done is learned a new model that’s actually a refinement on top of the old one. In this case, it was obvious, but many times it’s more subtle so you might just decide your algorithm doesn’t work, when the problem was in the training data all along. There’s good news though. Once you know about this pitfall, there are several ways to get around it.
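The geography effect can be reproduced on a toy dataset. In this sketch the rows are synthetic and the "old ranker" is assumed to surface only exact geography matches; none of the numbers come from the talk.

```python
# Synthetic rows: (geography_match, skills_match, clicked). Made up for illustration.
candidates = [
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (1, 0, 1),
    (0, 1, 0), (0, 1, 1), (0, 0, 0), (0, 0, 0),
]


def click_rate(rows):
    """Fraction of rows that were clicked."""
    return sum(clicked for _, _, clicked in rows) / len(rows)


# If every candidate had been shown, geography would clearly matter.
rate_geo_match = click_rate([r for r in candidates if r[0] == 1])
rate_geo_mismatch = click_rate([r for r in candidates if r[0] == 0])

# The old ranker only surfaces exact geography matches, so only those get logged.
logged = [r for r in candidates if r[0] == 1]
geo_values_in_log = {geo for geo, _, _ in logged}

print(rate_geo_match, rate_geo_mismatch)  # 0.75 0.25
print(geo_values_in_log)                  # {1}
```

In the logged data geography takes a single value across both clicked and unclicked rows, so a model trained on the log alone has no way to see that it matters.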
This is a parody slide – I *could* have made all of them look like this and I actually have during my days in academia – so enjoy the rest of my image-driven slides. The info is real though.
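One possible reading of the last bullet on that slide (mixing in random instances as negative examples) is sketched below. The function name, the one-negative-per-click ratio, and the choice to treat every unclicked catalog item as a candidate negative are assumptions; some sampled "negatives" may in fact be good matches the user never saw.

```python
import random


def add_random_negatives(clicked_pairs, all_items, per_positive=1, seed=0):
    """Return (user, item, label) triples: clicks as 1, random unclicked items as 0."""
    rng = random.Random(seed)
    clicked = set(clicked_pairs)
    examples = [(user, item, 1) for user, item in clicked_pairs]
    for user, _ in clicked_pairs:
        # Candidate negatives: any catalog item this user did not click.
        pool = [item for item in all_items if (user, item) not in clicked]
        for item in rng.sample(pool, min(per_positive, len(pool))):
            examples.append((user, item, 0))
    return examples


clicks = [("user_1", "job_a"), ("user_2", "job_b")]
catalog = ["job_a", "job_b", "job_c", "job_d"]
training_set = add_random_negatives(clicks, catalog)
print(training_set)
```

Because the negatives are drawn from the whole catalog rather than from what the old ranker chose to show, they are not filtered by the old ranker's preferences.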
This works. Your recommender system is a well oiled machine and you continue accumulating more and more data & building better and better models. Except…there’s one “detail” we conveniently overlooked. What do you do when your product doesn’t exist, when there’s nothing for the user to click on?
That’s the cold start problem, and you have several options. You could start by showing random items (this is nice because you’re getting rid of the presentation bias too). That works if it’s ads or news articles; it doesn’t work so well for job recommendations, which people take very personally (they’re offended by bad matches), or if you’re launching a recommendation startup. Reid Hoffman said that if you’re not embarrassed w/ your product when it launches, you launched too late. Well, launching w/ random recommendations in this case is not an option. You could also manually set some weights in your model – again, this works w/ 3 features, not so well w/ 300. Ideally, you’d like to have some training data. So where do you get it – where do you get those coveted 0s and 1s?
Crowdsourcing is awesome and these days, it would be silly not to get labeled data by using Mechanical Turk, CrowdFlower or Samasource. I’m a big fan. However, crowdsourcing for recSys is hard. If you want to make it work, here are a few issues you’d have to consider.
You’re asking a worker to put themselves in the shoes of a user and decide whether a job recommendation is good – whether they want to stay in the enterprise space, take over the family business, or if he loves Data as much as we do to become an Android developer. Or maybe he wants to switch universes entirely and have the Force be with him. The point is, this is hard to do for somebody other than the user.
If you’ve ever written task descriptions for crowdsourcing, you know it’s pretty much like writing good code. It needs to be concise, precise, explain edge cases and give very specific instructions and catch exceptions. Sometimes though, that’s not even possible. Take the case where you’d like to launch a product showing profiles similar to the one you’re looking at. How do you define similarity? How do you give them “rules” about what’s more important and how to trade off going to the same school vs. being in the same industry? If you had those rules, you could have just coded them up & not bothered w/ crowdsourcing.
That’s a talk unto itself – how do you do quality control, how do you maintain your requester reputation, how do you decide how much data to label, how do you select the input data you’d like to have labeled – all things you’d have to consider to be successful when doing crowdsourcing for labeling recommender systems training data.
But maybe, just maybe, if you’re a bit creative, in some cases, you don’t need to do crowdsourcing at all. There is an alternative – data recycling.
The universe of data you have access to (be it within your own company or via external sources of data) is complex and beautiful – and it might contain the signals you can transform into training data for your application. Let’s go through a few examples.
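As a hedged illustration of data recycling, take the "Real or Fake?" task, whose training data the slides give as closed accounts. The sketch below assumes that a closed account can be treated as a positive ("fake") label; the record layout, feature names and ids are hypothetical, and accounts closed for unrelated reasons would need to be filtered out first.

```python
def recycle_closed_accounts(accounts, closed_ids):
    """Label each account 1 (treated as fake) if it was closed, else 0."""
    return [(acct["features"], 1 if acct["id"] in closed_ids else 0) for acct in accounts]


# Hypothetical account records and closure log.
accounts = [
    {"id": "acct_1", "features": {"connections": 250, "profile_fields_filled": 12}},
    {"id": "acct_2", "features": {"connections": 0, "profile_fields_filled": 1}},
    {"id": "acct_3", "features": {"connections": 40, "profile_fields_filled": 8}},
]
closed_ids = {"acct_2"}

labeled = recycle_closed_accounts(accounts, closed_ids)
print([label for _, label in labeled])  # [0, 1, 0]
```

No one had to hand-label anything here: the labels fall out of a signal the business already recorded for another purpose.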
Better algorithms are great. However, it takes time and effort to get them right. The Netflix prize took 3 years and thousands of people for a 10% reduction in error. Let’s compare that with some example lifts I’ve gotten by changing the *training data* on production rec systems. These are all systems that have been in production for more than a year at the time of the change. Using fresher & larger training data: 20% lift. Using some of the data recycling techniques I mentioned – 50% lift. Using training data at all vs. a hand built model – 100% lift. YMMV. This is why training data matters. It’s your best bang for the buck.
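The talk does not say which metric the lifts were measured on. Assuming "lift" means relative improvement over a baseline in some engagement metric (click counts, say), the arithmetic is sketched below; the function name and the numbers are made up for illustration.

```python
def lift(baseline: float, treatment: float) -> float:
    """Relative improvement of treatment over baseline, e.g. 1.0 means +100%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (treatment - baseline) / baseline


# Hypothetical click counts before and after a change in training data.
print(f"{lift(200, 240):.0%}")  # 20%
print(f"{lift(200, 300):.0%}")  # 50%
print(f"{lift(200, 400):.0%}")  # 100%
```

Under this reading, a 100% lift means the metric doubled relative to the hand-built model.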
You know, I started by calling training data the Cinderella of Machine Learning – but now I’m thinking it’s more like Harry Potter – first neglected in a cupboard under the stairs, then growing up and doing some real magic.
