Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative.
… Or is it? Chances are you’ve had to learn a second algorithm just to clean up the labels, run on top of your existing one, now and every time you re-train. And what do you do when the data product you’re building doesn’t have any users yet? Do you really launch with random results, hand-label 50K examples, or ask a Turker to pretend they’re User #1337?
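The naive labeling rule above can be sketched in a few lines. This is an illustrative example only; the impression-log schema and field names are hypothetical:

```python
# Hypothetical impression log: one row per item the system showed a user.
impressions = [
    {"user": 1337, "item": "A", "clicked": True},
    {"user": 1337, "item": "B", "clicked": False},
    {"user": 1337, "item": "C", "clicked": False},
]

# Naive implicit-feedback labeling: clicked -> positive, shown-but-ignored -> negative.
labels = {(row["user"], row["item"]): int(row["clicked"]) for row in impressions}

print(labels)  # {(1337, 'A'): 1, (1337, 'B'): 0, (1337, 'C'): 0}
```

Note what is missing: items the system never showed (say, an item "D") generate no rows at all, so they can never become positives. Trained on such labels, the model largely re-learns its own past ranking, which is the presentation bias this talk addresses.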
A better algorithm can only take you so far; better training data can improve your results by orders of magnitude. Yet training data generation is often an afterthought—a footnote in a formula-filled publication.
In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias, to the art of crowdsourcing subjective judgments, to the creative exploitation of data exhaust for feature creation.