Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative.
… Or is it? If you train on clicks alone, you’ve probably learned an algorithm that runs on top of your existing algorithm, now and every time you re-train. And what do you do when the data product you’re building doesn’t have any users yet? Do you really launch with random results, hand-label 50K examples, or ask a Turker to pretend they’re User #1337?
Unlike having a better algorithm, having better training data can improve your results by orders of magnitude. Yet training data generation is often an afterthought—a footnote in a formula-filled publication.
In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.
7. Overcoming presentation bias
Explicitly model the presentation bias
◦ This includes the rank & snippet
◦ [Smith & Elkan KDD’07], [Yue et al WWW’10] etc.
Learning from positive examples only
◦ [Elkan & Noto KDD’08] (good overview), [Lee & Liu ICML’03] etc.
Mix in random instances as negative examples to avoid learning a classifier that stays too close to the margin
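The last bullet, mixing in random instances as negatives, can be sketched roughly as follows. This is a minimal illustration and not the speaker's implementation: the function name `add_random_negatives`, the `(user_id, item_id)` pair representation, and the fixed seed are all assumptions. It also treats every sampled unobserved pair as a negative, even though a few of them may be good matches the user simply never saw.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (user_id, item_id); hypothetical representation


def add_random_negatives(
    positives: Sequence[Pair],
    users: Sequence[str],
    items: Sequence[str],
    n_negatives: int,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Label observed positives 1 and randomly sampled unobserved pairs 0."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    seen = set(positives)
    # Every (user, item) pair that was never observed as a positive.
    unobserved = [(u, i) for u in users for i in items if (u, i) not in seen]
    sampled = rng.sample(unobserved, min(n_negatives, len(unobserved)))
    return [(u, i, 1) for u, i in positives] + [(u, i, 0) for u, i in sampled]
```

Enumerating every pair is only reasonable for toy data; with a large catalog you would more likely draw random pairs one at a time and skip the ones already seen.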
Title: The Model and the Train Wreck – a Training Data How-To
by Monica Rogati, data scientist at LinkedIn
Talking about the Cinderella of Machine Learning - Training Data - in particular, how do you get good training data and why should you care – why does training data matter?
This is the deep data session, so we’re all familiar w/ recommender systems: highly personalized recommendations for products, articles, links or even for your next job. Let’s take a quick peek under the hood – what’s the basic algorithm behind these recommendations?
Well, you usually have a user profile, and you’re trying to match it to the items you’re trying to recommend (in this case, jobs) using a set of criteria – for example, their skills, their geography and their industry. Then, in order to show the top 3, you combine how well the profile matches the item across all these dimensions. But how do you do that? How do you decide what’s more important? You could code up some heuristics – which works if you have 3 criteria, not so well if you have 300.
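The hand-coded version of this matching step might look like the sketch below. The weights and the helper names `match_score` and `top_k` are made up for illustration; the talk names the criteria (skills, geography, industry) but gives no values. With 3 criteria you can tune the weights by hand, which is exactly what stops scaling at 300.

```python
from typing import Dict, List

# Hypothetical hand-picked weights, one per matching criterion.
WEIGHTS: Dict[str, float] = {"skills": 0.5, "geography": 0.3, "industry": 0.2}


def match_score(features: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion match scores (each in [0, 1]) into one number."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def top_k(candidates: Dict[str, Dict[str, float]], k: int = 3) -> List[str]:
    """Return the ids of the k best-scoring jobs for one user profile."""
    ranked = sorted(candidates, key=lambda job: match_score(candidates[job]), reverse=True)
    return ranked[:k]


# Toy per-criterion match scores between one profile and four jobs.
jobs = {
    "job_a": {"skills": 1.0, "geography": 1.0, "industry": 1.0},
    "job_b": {"skills": 1.0, "geography": 0.0, "industry": 0.0},
    "job_c": {"skills": 0.0, "geography": 1.0, "industry": 0.0},
    "job_d": {"skills": 0.0, "geography": 0.0, "industry": 1.0},
}
print(top_k(jobs))
```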
That’s when machine learning comes in – and that’s where we have hundreds and hundreds of algorithms and their variants, with thousands of papers written every year trying to squeeze out a little more performance for a particular task. This is awesome, and there are some amazingly cool algorithms out there, and both academia and industry are pushing the boundaries of science. However, they’re often all using the same dataset, which means they’re using the same training data. This is great for science and reproducibility, not so great from an ROI perspective if you’re in industry trying to improve your results as much as possible.
Using *more* training data usually beats better algorithms – this is widely known, and it’s the point of the famous talk by Peter Norvig from Google on the unreasonable effectiveness of data. We see it again and again in recommender systems, machine translation, text mining & classification, ad targeting… More data beats clever algorithms, and better data beats more data. In this talk, I’ll discuss a few techniques for getting better training data.
So what does training data look like? Well, to continue with the job matching example, a training data point consists of a (user, job) tuple, a score for how well they match on each of these dimensions, and a flag for whether they’re a good match or not: the response variable you’re trying to predict.
So where do we get this training data from?
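As a concrete picture of one such data point, here is a small sketch. The class name `TrainingExample`, its field names, and the `to_row` helper are assumptions chosen for illustration, not a format described in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrainingExample:
    user_id: str
    job_id: str
    features: Dict[str, float]  # per-dimension match scores, e.g. skills, geography
    label: int                  # response variable: 1 = good match, 0 = not


def to_row(example: TrainingExample, feature_order: List[str]) -> Tuple[List[float], int]:
    """Flatten one example into (feature vector, label) in a fixed column order."""
    vector = [example.features.get(name, 0.0) for name in feature_order]
    return vector, example.label


example = TrainingExample("user_1", "job_42", {"skills": 0.8, "geography": 1.0}, label=1)
print(to_row(example, ["skills", "geography", "industry"]))
```

Missing dimensions default to 0.0 here purely to keep the sketch short; a real pipeline would need a deliberate choice about how to handle missing scores.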
Well, one popular source of training data is using what the user clicked on to customize their recommendations & adapt the algorithm. That’s great, so let’s take a look at this. We show the user 3 jobs; they click on one, x out another and don’t click on the third. So those are my training data points.
But if you actually use that to train a model, you’re in trouble. Remember, you’re trying to teach the algorithm what makes a good match and what doesn’t. Using this data actually doesn’t work because of what we call presentation bias. So now you’re thinking, “I know what that is, it’s that items higher up the list are more likely to be clicked just because they’re higher up the list, not because they’re better matches”. That’s true, and you can do some clever things to get around that – but that’s only part of the story. The other part of the story is that you had some initial algorithm or heuristics that you used to show these items to the user – and let’s say your algorithm made geography a very important criterion, so all your top 3 results match geography exactly, but they’re going to be a mix of positive and negative datapoints. What this means is that when you train, every example the new model sees has the same geography match, so it will learn that geography makes absolutely no difference in whether the user is going to click or not, and it’s simply going to ignore that criterion since it’s not a good predictor at all! What you’ve done is learn a new model that’s actually a refinement on top of the old one. In this case, it was obvious, but many times it’s more subtle, so you might just decide your algorithm doesn’t work when the problem was in the training data all along.
There’s good news though. Once you know about this problem, there are several ways to get around it.
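One quick way to see the geography problem in your own click logs is to check which features never vary across the impressions you actually showed. The sketch below is a simple diagnostic added for illustration, not a technique from the talk; the function name `constant_features` and the toy log are assumptions.

```python
from statistics import pvariance
from typing import Dict, List


def constant_features(logged: List[Dict[str, float]], tol: float = 1e-12) -> List[str]:
    """Return the features whose value never varies across logged impressions."""
    names = sorted(logged[0])
    return [n for n in names if pvariance([row[n] for row in logged]) <= tol]


# Toy impression log: the old ranker only ever showed exact geography matches.
logged_impressions = [
    {"skills": 0.9, "geography": 1.0, "industry": 1.0},
    {"skills": 0.4, "geography": 1.0, "industry": 0.0},
    {"skills": 0.7, "geography": 1.0, "industry": 1.0},
]

print(constant_features(logged_impressions))  # ['geography']
```

A feature flagged here cannot separate clicks from non-clicks in this log, whatever its real importance, so a model trained on the log alone has no way to learn a weight for it. In practice the bias is usually subtler than a perfectly constant column, so this only catches the most extreme case.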
This is a parody slide – I *could* have made all of them look like this and I actually have during my days in academia – so enjoy the rest of my image-driven slides.
The info is real though.
This works. Your recommender system is a well-oiled machine and you continue accumulating more and more data & building better and better models. Except… there’s one “detail” we conveniently overlooked. What do you do when your product doesn’t exist, when there’s nothing for the user to click on?
That’s the cold start problem, and you have several options. You could start by showing random items (this is nice because you’re getting rid of the presentation bias too). That works if it’s ads or news articles; it doesn’t work so well for job recommendations, which people take very personally (they’re offended by bad matches), or if you’re launching a recommendation startup. Reid Hoffman said that if you’re not embarrassed by your product when it launches, you launched too late. Even so, launching w/ random recommendations in this case is not an option. You could also manually set some weights in your model – again, this works w/ 3 features, not so well w/ 300. Ideally, you’d like to have some training data. So where do you get it – where do you get those coveted 0s and 1s?
Crowdsourcing is awesome and these days, it would be silly not to get labeled data by using Mechanical Turk, CrowdFlower or Samasource. I’m a big fan. However, crowdsourcing for recommender systems is hard. If you want to make it work, here are a few issues you’d have to consider.
You’re asking a worker to put themselves in the shoes of a user and decide whether a job recommendation is good – whether he wants to stay in the enterprise space, take over the family business, or whether he loves Data as much as we do and wants to become an Android developer. Or maybe he wants to switch universes entirely and have the Force be with him. The point is, this is hard to do for somebody other than the user.
If you’ve ever written task descriptions for crowdsourcing, you know it’s pretty much like writing good code. They need to be concise and precise, explain edge cases, give very specific instructions and catch exceptions. Sometimes though, that’s not even possible. Take the case where you’d like to launch a product showing profiles similar to the one you’re looking at. How do you define similarity? How do you give workers “rules” about what’s more important and how to trade off going to the same school vs. being in the same industry? If you had those rules, you could have just coded them up & not bothered w/ crowdsourcing.
That’s a talk unto itself – how do you do quality control, how do you maintain your requester reputation, how do you decide how much data to label, how do you select the input data you’d like to have labeled – all things you’d have to consider to be successful when crowdsourcing labels for recommender systems training data.
But maybe, just maybe, if you’re a bit creative, in some cases, you don’t need to do crowdsourcing at all. There is an alternative – data recycling.
The universe of data you have access to (be it within your own company or via external sources of data) is complex and beautiful – and it might contain the signals you can transform into training data for your application.
Let’s go through a few examples.
Better algorithms are great. However, it takes time and effort to get them right.
The Netflix Prize took 3 years and thousands of people for a 10% reduction in error. Let’s compare that with some example lifts I’ve gotten by changing the *training data* on production rec systems. These are all systems that had been in production for more than a year at the time of the change.
Using fresher & larger training data: 20% lift. Using some of the data recycling techniques I mentioned: 50% lift. Using training data at all vs. a hand-built model: 100% lift. YMMV.
This is why training data matters. It’s your best bang for the buck.
You know, I started by calling training data the Cinderella of Machine Learning – but now I’m thinking it’s more like Harry Potter – first neglected in a cupboard under the stairs, then growing up and doing some real magic.