RecSys 2012 Slides
Slides of my talk at RecSys 2012 on multiple objective optimization in ranking of recommendations.

  • Hi, my name is Mario Rodriguez and I’m here to talk about work that my colleagues and I are doing at LinkedIn to improve our recommender systems, specifically in the area of multiple objective optimization.
  • Here is the motivation for our work: the performance of a recommender system is given by its utility function, and this utility function is often multi-faceted. This talk is about improving multi-faceted utility functions efficiently, by focusing on the most promising facets. By efficiency, we mean something that relates to ROI, or the best bang for your buck. Since the utility function is generally hard to tackle directly, it is usually broken down into subcategories, possibly recursively, until we arrive at something concrete that we can work with.
  • Here is a quick outline of the talk. We will do a deep dive on one specific case study, a system called TalentMatch, which is a revenue-generating recommender system at LinkedIn that recommends talent to job posters, and which will help illustrate the details of our approach. Though we are discussing a specific use case, our approach is broadly applicable, and is being used in a variety of products at LinkedIn. So, we are going to give a brief overview of Talent Match. We will discuss its utility function, which is multi-faceted. And then we will show how we improved its utility, with evidence from real A/B test results.
  • So, here is a high-level overview of TalentMatch. Someone comes to the site and posts a job. We then scour the entire member database looking for the members who best match that job, and we recommend a ranked list of those members to the job poster.
  • Just to add some context, here’s what a job posting looks like. A job posting is very rich in content, and though it is not completely structured, parsing it is not too hard. Some obvious fields include the title, the description, the skills, and the region.
  • And here’s what a member profile looks like. We also know very detailed data about our members, including title and description of their current job, how long they’ve been there, their skills, their current region… and we can match those attributes to the respective attributes of the job posting.
  • This is how we do this matching. We combine the job and the candidate into a single feature vector, where each feature denotes a similarity measure between attributes of the job and attributes of the candidate, and then we find the relative importance of these features using a supervised learning method like logistic regression trained on a click signal such as job applications. This gives us a model that knows how to differentiate good job-member pairs from bad job-member pairs.
  • So, once someone posts a job, and we run our TalentMatch model on our member database, this is the snippet of results we produce for the job poster. At this point there is no identifiable information, so if the job poster likes what he or she sees, he or she has the option to purchase the result set in order to see the members’ full profiles and contact them.
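A minimal sketch of the matching step described above, assuming bag-of-words cosine similarity as the similarity measure and hand-picked weights. The field names, the `pair_features` and `match_score` helpers, and the weight values are all hypothetical; in the talk the weights come from a trained logistic regression.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pair_features(job, member):
    """One similarity feature per (job attribute, member attribute) pairing."""
    return [
        cosine_similarity(job["title"], member["title"]),
        cosine_similarity(job["skills"], member["skills"]),
        1.0 if job["region"] == member["region"] else 0.0,
    ]

def match_score(features, weights, bias):
    """Logistic-regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical job posting and member profile.
job = {"title": "senior data scientist", "skills": "python statistics", "region": "sf"}
member = {"title": "data scientist", "skills": "python sql statistics", "region": "sf"}

features = pair_features(job, member)
score = match_score(features, weights=[2.0, 1.5, 0.5], bias=-2.0)  # hypothetical weights
```

Ranking members for a job then amounts to sorting them by `score`, highest first.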
  • Let’s go over the facets of the utility function of the TalentMatch system. First, the snippet needs to be good enough to convince the job poster to purchase the recommendations. That’s the booking rate. Then, once purchased, the job poster gets to look at the full profile of each recommended candidate and decides whether or not they are indeed a good match for the job. If the candidate is a good match, the job poster may then decide to email the candidate regarding the job opportunity. That’s the email rate. Finally, if the candidate is interested, the candidate will reply positively to the job poster, giving us the reply rate. Now that the link is established, they can take it from there. But from our perspective, these 3 steps are required for there to be relevant engagement within this system.
    Out of the 3 facets of the utility function, the reply rate was identified as needing improvement. Job posters were complaining that they were emailing candidates, but the candidates were not replying enough. This was the problem we needed to solve. We figured the booking rate and the email rate were well accounted for by the existing TalentMatch model, but even if someone is a great match for the job, that does not mean they are going to reply. So, we thought that maybe people were not replying because they were probably not looking for a job. What if we could determine if someone was a job seeker, and then include more of those people in the recommendations?
  • So, we had already developed a model that computes the job seeking propensity for each member. It turns out that many people who are open to new opportunities do not self-identify as job seekers, so this model helps us identify those people. You can think of the job seeking propensity as the probability that the member will switch positions in the next month. We also output a segmentation of this probability into actives, passives, and non-job-seekers, and we consider actives and passives to have a high job seeking intent.
    This job seeking intent model is completely different from the TalentMatch model. It is a survival model where the entity whose survival we’re analyzing is a job, or more specifically, a position. Based on data derived from the lifetime of millions of positions, we model the duration of a position as a function of various features in what is known as an accelerated failure time model, and this allows us to compute the probability that a given position will end within the next time period.
  • There are many signals that we can use to compute the job seeking intent. We may have the user’s job seeking activity on the site: are they searching or applying for jobs? Those are obvious signals. But we have others. For example, we know that different industries have different attrition rates. This plot includes a few representative industries and their survival curves. The survival curve gives the probability that someone will still be at their position X months down the road if they start that position today.
    These are survival curves for a few of the most extreme industries, some of the most hazardous including “political organization” and “animation” and some of the least hazardous including “alternative medicine” and “ranching”. In the “political organization” industry, which is the red line at the bottom, more than 50% of people don’t last 2 years in a given position.
  • So, intuitively, it makes sense to suggest users who are job seekers in TalentMatch. To confirm our intuition, we ran the numbers and saw that users with a high job seeking intent (actives and passives) have a much higher rate of reply to career-related emails than non-job-seekers (16 times the reply rate). And this is exactly the facet of the utility function of TalentMatch that we are interested in improving. So, what we want to do is incorporate the job seeker intent into the TalentMatch model, and we want to do so without negatively affecting the booking rate and the email rate.
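As a worked form of the survival model described above: the first line is the accelerated failure time equation from the talk's slide, and the second is the standard conditional probability that a position ends within the next period. The symbols $t$ (current tenure in the position), $\Delta$ (length of the period, for example one month) and $S_i$ (survival function) are notation assumed here for illustration.

```latex
% AFT model: T_i is the duration of position i, x_{ik} its features,
% \beta_k the coefficients, \sigma a scale parameter, \varepsilon_i the error term.
\log T_i = \sum_k \beta_k x_{ik} + \sigma \varepsilon_i

% With S_i(t) = P(T_i > t), the probability that a position that has
% already lasted t ends within the next \Delta is
P\left(T_i \le t + \Delta \mid T_i > t\right) = 1 - \frac{S_i(t + \Delta)}{S_i(t)}
```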
  • So, could we just add the job seeker propensity score as a feature into the talent match model and retrain? Actually, doing that is not a good idea. Talent Match learns the concept of whether or not the member is a good match for the job, and a member-job pair does not become a better or worse match as a function of the job seeking propensity of the member. So, we need something else…
  • So, how do we do it? Well, we can look at the TalentMatch score distribution of the top-K recommendations, and treat that as a kind of ground truth from which we cannot deviate too much. Basically, having optimized for matching in the TalentMatch model, we want to perturb the ranking slightly to gain as much as possible in other metrics, but without sacrificing the quality of the match. This is the TalentMatch score distribution of the 12th recommendation. We care about the 12th recommendation because that’s how many results we show on a single page. The x axis goes from some threshold T (below which items do not get recommended) to 1.0, and we see that there is a peak around 1.0, suggesting that even at the 12th position, the bulk of the recommendations are of very high quality. This high quality is the essence of the system, and whatever enhancements we make to it, we don’t want to lose this quality.
  • So, what we want is a controlled perturbation of the ranking output by the TalentMatch model, and this is how we are going to do it: given the TalentMatch ranking, we run a perturbation function on it that generates another ranking, the perturbed ranking, which optimizes for a metric we’re interested in (in the case of TalentMatch, it’s the number of users with high job seeking intent in the top-12 recommendations). Given the 2 rankings and their distributions of match scores, we can compute the distance between them using a variety of metrics, for example KL divergence or Euclidean distance. This divergence score is what will help us make sure we are not negatively affecting the quality of the recommendations. Notice how, in the perturbed ranking, item Z was bumped from its original third position, below the cutoff line, to the second position, and so whereas before we had 2 non-seekers above the cutoff, meaning they would be recommended, now we have a non-seeker and an active. Also notice that the perturbation is minimal. We should feel comfortable bumping item Z to the second position, but not to the first position.
    There are then 3 functions that we need to define: the perturbation function, the divergence function, and the objective function. The parameters of the perturbation function are what we will be estimating, based on the performance established by the divergence and objective measures: we want high scores on the objective and low scores on the divergence.
  • Here is the instantiation of those functions for the TalentMatch case. The perturbation function simply applies a small boost to the match score, denoted by the letter “y”, and we allow that boost to be different for active and passive job seekers (as denoted by the alpha and the beta parameters). The divergence function is simply the Euclidean distance between the distribution of scores in the TalentMatch ranking and the distribution of scores in the perturbed ranking. This is simply a measure of how match quality was affected (a divergence score of 0 means that the quality of the matches remained unaffected). The objective is the average number of actives and passives in the top-12.
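The three functions above can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the boost is taken to be multiplicative (suggested by the control being a "1.0 boost"), the score distribution is represented as a fixed-bin histogram whose range and bin count are made up, and the objective is computed for a single job rather than averaged over jobs.

```python
import math

def rank(items, alpha=1.0, beta=1.0):
    """Sort (name, match_score, segment) tuples by boosted score, best first."""
    boost = {"active": alpha, "passive": beta}  # other segments are not boosted
    return sorted(items, key=lambda it: it[1] * boost.get(it[2], 1.0), reverse=True)

def histogram(scores, lo, hi, bins):
    """Count scores per equal-width bin on [lo, hi]."""
    counts = [0] * bins
    for s in scores:
        idx = int((s - lo) / (hi - lo) * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    return counts

def divergence(original_top, perturbed_top, lo=0.5, hi=1.0, bins=10):
    """Euclidean distance between the match-score histograms of two top-K lists."""
    h1 = histogram([it[1] for it in original_top], lo, hi, bins)
    h2 = histogram([it[1] for it in perturbed_top], lo, hi, bins)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def objective(top):
    """Number of actives and passives in a top-K list."""
    return sum(1 for it in top if it[2] in ("active", "passive"))

# The three items from the talk's perturbation slide, with a cutoff of K = 2.
items = [("Item X", 0.98, "non-seeker"), ("Item Y", 0.91, "non-seeker"), ("Item Z", 0.89, "active")]
K = 2
original_top = rank(items)[:K]
perturbed_top = rank(items, alpha=1.05, beta=1.05)[:K]  # 1.05 is an illustrative boost
```

Note that the divergence compares the original match scores of the items that make the cut, not their boosted scores, since it is meant to measure how much match quality was given up.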
  • To find a good perturbation function, we can construct a typical loss function, where the effect of the divergence is governed by a regularization parameter lambda, and then optimize this loss function to find the parameters of the perturbation function, alpha and beta, which correspond respectively to the boost of active and passive job seekers. However, there is a complicating factor: both the divergence and objective functions depend on a ranking, which depends on a sorting operation, and therefore traditional gradient-based approaches are not readily applicable. Also, what should we set lambda to? We don’t just want the lambda that generates the lowest loss; we are more interested in seeing what tradeoffs are available between the objective and the divergence.
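One plausible way to write the loss described above, using the slide names $f()$, $\Delta()$ and $g()$ for the perturbation, divergence and objective functions. The sign convention, the multiplicative form of the boost, and the symbol $R$ for the original ranking are assumptions made for illustration.

```latex
% Perturbation of a match score y_i (assumed multiplicative boost):
f_{\alpha,\beta}(y_i) =
  \begin{cases}
    \alpha\, y_i & \text{if member } i \text{ is an active job seeker} \\
    \beta\, y_i  & \text{if member } i \text{ is a passive job seeker} \\
    y_i          & \text{otherwise}
  \end{cases}

% Loss: reward the objective g, penalise the divergence \Delta, weighted by \lambda.
\mathcal{L}(\alpha, \beta; \lambda) =
  -\, g\big(f_{\alpha,\beta}(R)\big) + \lambda\, \Delta\big(R,\, f_{\alpha,\beta}(R)\big)
```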
  • We will discuss computational strategies for optimizing the perturbation function in a moment, but before that, we need to discuss the kind of optimization we are actually interested in. What we really want is Pareto optimization, where there is not one optimal solution; instead, there are some solutions which are better in one objective, while other solutions are better in others. In this plot, we have the objective, the average number of actives and passives in the top-12 results, on the y-axis, and the divergence on the x-axis. The original ranking has on average 4 actives and passives in the top-12, as shown in the table in the top left corner. Also, by definition, the divergence of this original ranking is 0. Each point (or bubble) in the plot represents a specific assignment to the parameters of the perturbation function: alpha and beta. We see on the plot that the only way to increase the objective on the y-axis is to also allow an increase in the divergence on the x-axis. We also see that for a given divergence, say 50, there are many assignments of alpha and beta with that divergence, with varying scores on the objective. We want the maximum objective for each divergence, and those are the points on the Pareto frontier, which are the red points in the plot. So, no matter what divergence you allow, you should pick a point on the Pareto frontier. Back to the table of sample plans, we see that if we set alpha and beta to 1.15, we can double the number of actives and passives in the top-12 (from 4 to 8) while paying the cost of having a divergence of 64, and that this is a point on the Pareto frontier.
  • Here we can get a better idea of what the divergence scores actually mean. The top left has the distribution of the original, unperturbed model, and as we move across the quadrants, we see how the divergence increases (0, 27, 54, and 100). In the top left histogram, we see the bump around the 0.9’s, and with each histogram, the bump is gradually attenuated, until there is no more bump in the bottom right. So, we would probably be willing to accept a divergence in the 50-60’s range (as shown in the bottom left), but not in the 100’s, which is what’s shown in the bottom right.
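Picking out the Pareto frontier from a set of evaluated (alpha, beta) assignments can be sketched as below. Only the points (0, 4) and (64, 8) come from the talk; the other points and all names are made up for illustration.

```python
def pareto_frontier(points):
    """Keep the non-dominated (divergence, objective, label) points.

    Lower divergence is better and higher objective is better, so a point
    survives only if no point with equal or lower divergence has an
    objective at least as high.
    """
    frontier = []
    best_objective = float("-inf")
    # Sort by divergence ascending; on ties, visit the higher objective first.
    for div, obj, label in sorted(points, key=lambda p: (p[0], -p[1])):
        if obj > best_objective:
            frontier.append((div, obj, label))
            best_objective = obj
    return frontier

points = [
    (0, 4.0, "a=1.00 b=1.00"),   # original ranking (from the talk)
    (30, 5.5, "illustrative"),
    (30, 5.0, "illustrative"),   # dominated: same divergence, lower objective
    (64, 8.0, "a=1.15 b=1.15"),  # from the talk's table of sample plans
    (64, 7.2, "illustrative"),   # dominated
    (80, 7.9, "illustrative"),   # dominated: more divergence, lower objective
]
frontier = pareto_frontier(points)
```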
  • So, how will we actually learn the weights in the perturbation function, in other words, the right values for alpha and beta? There are several different ways of doing so, and the most appropriate varies with the specific use case. Grid search, for one, is very easy to implement: simply generate all possible solutions (up to some discretization amount) and evaluate them. This is feasible for small search spaces, but quickly becomes unwieldy due to the combinatorial explosion that results from a large number of parameters.
    Gradient-based techniques would be another approach, and this would be useful in a scenario where the perturbation function has a high number of parameters and grid search is infeasible. We mentioned earlier that there is an issue with our objective and divergence functions which makes gradient optimization hard, and we will see how to overcome that.
  • Assuming the objective and divergence functions are amenable to gradient-based optimization techniques, they are typically scalarized into a single loss function, and then optimized for a given value of the lambda tradeoff parameter. The Pareto frontier can be approximated in many cases with an approach like this by optimizing the loss function for several values of lambda, with a couple of them being represented by the lines with different slopes in the diagram. The point where a loss line is tangent to the Pareto frontier represents an optimal solution for the corresponding value of lambda.
  • We mentioned earlier that our objective and constraint functions are not smooth, since they depend on a sort, and so they’re not readily amenable to gradient-based methods. However, this is a problem that has been looked at in the field of information retrieval, where they’ve come up with methods for “learning to rank”. There’s been research on approximations to popular rank-based metrics such as the normalized discounted cumulative gain (NDCG) and the average precision (AP) which are amenable to gradient descent. We can leverage this work and frame our optimization problem using those metrics, where our objective function takes the shape of the approximate AP, and our divergence function takes the shape of the approximate NDCG. The approximate NDCG is highest when we’ve ranked higher the candidates with the highest match score, which is a property we can exploit to constrain the perturbed model. There are more ways to do this, which I discuss in the paper, but once we frame our TalentMatch problem this way, we can then apply gradient descent to optimize alpha and beta for several values of lambda.
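A grid search over alpha and beta can be sketched in a few lines. The candidate list, the cutoff `K`, and the toy divergence (total match score given up in the top-K) are stand-ins invented for this example; they are not the talk's data or its divergence metric.

```python
import itertools

def grid(start, stop, step):
    """Evenly spaced values from start to stop inclusive."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]

def grid_search(evaluate, alphas, betas):
    """Evaluate every (alpha, beta) pair; evaluate returns (divergence, objective)."""
    return [(a, b) + tuple(evaluate(a, b)) for a, b in itertools.product(alphas, betas)]

# Toy candidates: (match_score, segment). Purely illustrative.
CANDIDATES = [(0.98, "none"), (0.95, "none"), (0.91, "none"),
              (0.90, "passive"), (0.89, "active"), (0.86, "active")]
K = 3

def top_k(alpha, beta):
    boost = {"active": alpha, "passive": beta}
    ranked = sorted(CANDIDATES, key=lambda c: c[0] * boost.get(c[1], 1.0), reverse=True)
    return ranked[:K]

def evaluate(alpha, beta):
    base, new = top_k(1.0, 1.0), top_k(alpha, beta)
    # Toy divergence (an assumption): match score lost in the top-K after perturbing.
    div = sum(c[0] for c in base) - sum(c[0] for c in new)
    obj = sum(1 for c in new if c[1] != "none")
    return div, obj

results = grid_search(evaluate, grid(1.0, 1.2, 0.05), grid(1.0, 1.2, 0.05))
```

With 2 parameters and 5 values each this is 25 evaluations; with p parameters it grows as 5^p, which is the combinatorial explosion mentioned above.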
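The scalarization idea can be illustrated on a handful of already evaluated points, using exhaustive selection in place of a gradient step. The points and lambda values are made up, and the loss form (negative objective plus lambda times divergence) is an assumption.

```python
def scalarized_best(points, lam):
    """Return the (divergence, objective) point minimising -objective + lam * divergence."""
    return min(points, key=lambda p: -p[1] + lam * p[0])

# Illustrative (divergence, objective) points, all on a Pareto frontier.
points = [(0, 4.0), (20, 5.0), (30, 5.2), (40, 6.5), (64, 8.0), (100, 8.5)]

# A large lambda penalises divergence heavily; a small one favours the objective.
chosen = {lam: scalarized_best(points, lam) for lam in (0.2, 0.05, 0.001)}
```

Sweeping lambda recovers different frontier points, but a weighted sum like this can only reach points on the convex part of the frontier. In this toy data (20, 5.0), (30, 5.2) and (40, 6.5) are never selected for any lambda, which is one reason the frontier is only approximated.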
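To show why a smooth approximation helps, here is a sketch that replaces the hard sort with sigmoid-based "soft ranks" and computes a DCG-style quantity from them. This is a generic smoothing trick chosen for illustration, not the SmoothRank method named on the slide, and the temperature value is an assumption.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def soft_ranks(scores, temperature=0.01):
    """Smooth stand-in for 1-based ranks: rank_i ~ 1 + sum_j sigmoid((s_j - s_i) / T)."""
    return [
        1.0 + sum(sigmoid((s_j - s_i) / temperature)
                  for j, s_j in enumerate(scores) if j != i)
        for i, s_i in enumerate(scores)
    ]

def approx_dcg(scores, gains, temperature=0.01):
    """DCG computed from soft ranks, so it varies smoothly with the scores."""
    return sum(g / math.log2(1.0 + r)
               for g, r in zip(gains, soft_ranks(scores, temperature)))

# Perturbed scores for three items; the soft ranks come out close to 1, 3, 2.
ranks = soft_ranks([0.98, 0.91, 0.93])
```

Using the original match scores as `gains` and the perturbed scores as `scores` gives a quantity that is largest when the perturbed order agrees with the match order, which is the property the notes describe exploiting as a constraint.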
  • Given that we only had 2 parameters in our perturbation function, grid search was a satisfactory approach and so that’s what we used. When you have a set of Pareto optimal values, typically what is done is that you look for the proverbial knee of the curve, a point after which you have to pay too much in one objective to get increases in another, and our curve actually displays this characteristic: the Pareto tradeoff is constant up to a divergence of about 60, which, as we saw earlier in the histogram slide, was not too bad. Still, we did not know exactly what a given divergence would do to the booking and email rates, so we picked a couple of values to A/B test. We picked the maximum value on that line, the one at the knee, and a point in the middle, which corresponded to boosts of 1.15 and 1.07 respectively.
    So, what did we expect from the tests? Since we knew the rate of reply to career-related emails of users with high job seeking intent, as well as the expected proportion of those users in the top-12 recommendations, it was easy to get a ballpark figure of how much of an increase in reply rate we would obtain: we expected a 50% increase over control for the 1.07 treatment and a 100% increase over control for the 1.15 treatment. Regarding the other 2 facets of the utility function, the booking rate and the email rate of job posters to candidates, what we hoped was that they would remain unchanged or only be minimally affected.
  • So, how did we do? Let’s see how each facet of the utility was affected. The booking rate remained mostly unchanged, with possibly a very slight dip of 0.4% on the 1.15 treatment. The email rate, to our surprise, actually increased in both treatments. This tells us that somehow, the profiles of users with high job seeking intent were more appealing to job posters than those of other users. What specifically about their profiles was more appealing is something we have yet to look into. This also tells us that maybe the snippets that we show job posters were not a great representation of the value for them, and that perhaps better snippets would lead to higher booking rates. Finally, we see that we were able to increase the reply rate, which is what we had originally set out to do, and that the increase for the 1.15 treatment was roughly double that of the 1.07 treatment: 42% and 22%, which was in line with our expectations. Now, these numbers are pretty good, but why weren’t they as high as we had expected? Well, we had thought that job posters contacted all the recommendations, since it did not cost them more to contact all than to contact one, but as we observed in the email rate, which we were able to improve, job posters do not, in fact, contact all of the recommendations.
  • So, in conclusion, I’d like to present you with 2 main takeaways.
    First, recommender systems often have a multifaceted utility function, of which matching is not just a big part: it is the crucial component, the secret sauce of the system. As you optimize for additional objectives, know exactly how the quality of the matches is affected, and justify any sacrifices to it. You don’t want to kill the goose for its golden eggs. On the other hand, if you are just focusing on matching, then you are running a suboptimal system.
    Second, we have presented a way to handle competing objectives, in case they surface as part of improving the utility. Listen to user feedback, which is how we found out that the reply rate was indeed the area to focus on in TalentMatch. The users told us that the quality of the matches was great, but they wanted the candidates to engage with them. A/B test furiously. We have tons of examples of theory not meeting practice, offline not meeting online. Make sure you get a reality check.
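The ballpark expectation above can be reproduced with simple arithmetic, assuming every recommended member on the page gets emailed. The function and parameter names are hypothetical; the 4, 6 and 8 out of 12 counts and the 16x multiplier come from the talk.

```python
def expected_reply_lift(seekers_control, seekers_treatment, k=12, seeker_multiplier=None):
    """Relative increase in expected replies over control for a page of k results.

    If seeker_multiplier is None, only high-intent members are assumed to reply.
    Otherwise high-intent members reply seeker_multiplier times as often as others.
    """
    def replies(seekers):
        non_seekers = k - seekers
        if seeker_multiplier is None:
            return float(seekers)
        return seekers * seeker_multiplier + non_seekers

    return replies(seekers_treatment) / replies(seekers_control) - 1.0

lift_107 = expected_reply_lift(4, 6)        # 1.07 boost: 6 of 12 high-intent
lift_115 = expected_reply_lift(4, 8)        # 1.15 boost: 8 of 12 high-intent
lift_107_16x = expected_reply_lift(4, 6, seeker_multiplier=16)
lift_115_16x = expected_reply_lift(4, 8, seeker_multiplier=16)
```

Under the first assumption the lifts are 50% and 100%, matching the stated expectations; giving non-seekers a reply rate one sixteenth as high lowers them to roughly 42% and 83%.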

RecSys 2012 Slides Presentation Transcript

  • MARIO Multiple Objective Optimization in Recommender Systems. Mario Rodriguez, Christian Posse, Ethan Zhang
  • Motivation
    1. Value of a recommender system given by its multi-faceted utility function: utility = fn(relevance, engagement, …)
    2. We want to efficiently improve the utility of the system
  • Outline
    • TalentMatch case study
      o Overview
      o Utility function – Multiple Objectives!
      o Approach details
        • Problem formulation & Optimization
      o A/B test results
  • TalentMatch (diagram: Job Posting and Member Profiles feed Talent Match, which outputs Ranked Talent)
  • Job Posting
  • Member Profile
  • TalentMatch Model (diagram)
    Job Posting: title, geo, company, industry, functional area, description, …
    Matching: transition probability, cosine similarity, …
    Candidate, General: expertise, specialties, education, headline, geo, experience
    Candidate, Current Position: title, summary, tenure length, industry, functional area, …
  • TalentMatch Teaser Snippet
  • TalentMatch Utility = fn(booking rate, email rate, reply rate). Booking Rate, Email Rate, Reply Rate. Job seeker? Problem!
  • Job Seeker Intent Model
    • Propensity Score
      o Indicates receptiveness to new opportunities
      o p(switch jobs in next time period)
      o Segments: ACTIVE, PASSIVE, NON-JOB-SEEKER
    • Model
      o Survival Analysis of Positions
      o Accelerated failure time (AFT) model: $\log T_i = \sum_k \beta_k x_{ik} + \sigma \varepsilon_i$
  • Job-Seeker Feature Example: Industry Attrition (plot axes: Probability vs. Time)
  • What: Increase TalentMatch Utility fn(booking rate, email rate, reply rate). Job-Seeking Intent: actives & passives, 16x reply rate on career-related mail. Reply Rate
  • Job-Seeker (JS) Propensity as Another Feature? (diagram)
    Job Posting: title, geo, company, industry, functional area, description, …
    Matching: transition probability, cosine similarity, JS Propensity, …
    Candidate, General: expertise, specialties, education, headline, geo, experience
    Candidate, Current Position: title, summary, tenure length, industry, functional area, …
  • Match Score Histogram of 12th Rank (x-axis starts at threshold t)
  • How: Controlled Perturbation
    Talent Match ranking (rank, item, match score, intent):
      1, Item X, 0.98, Non-Seeker
      2, Item Y, 0.91, Non-Seeker
      --------------------------------------- (cutoff)
      3, Item Z, 0.89, Active
    Perturbation Function f() gives the perturbed ranking (rank, item, match score, perturbed score, intent):
      1, Item X, 0.98, 0.98, Non-Seeker
      2, Item Z, 0.89, 0.93, Active
      ------------------------------------------------ (cutoff)
      3, Item Y, 0.91, 0.91, Non-Seeker
    Divergence Function Δ() compares the two match score distributions and gives the divergence score
    Objective Function g() gives the objective score
  • Problem Formulation
    • Perturbation Function
    • Divergence Function
    • Objective Function
  • Finding a Good Perturbation Function
    • Loss Function
    • Objective and divergence depend on a sort/rank, so gradient-based optimization not directly applicable
    • Lambda value?
  • Pareto Optimization
  • Match Score Histogram (four panels, divergence 0, 27, 54, 100)
  • Computational Approaches
    • Grid Search
    • Gradient-based techniques
  • Gradient-based Techniques (figure labels: 0.076 > λ > 0, …, λ = 0.076)
  • Gradient-based Techniques, cont.
    • Smooth approximations to popular ranking metrics amenable to gradient-descent (SmoothRank)
      o Normalized Discounted Cumulative Gain (NDCG)
      o Average Precision (AP)
    • Frame the Multi-Objective Optimization problem using those approximations
  • Experiments
    • A/B Test
      o Treatment 1: 1.15 boost (8/12)
      o Treatment 2: 1.07 boost (6/12)
      o Control: 1.0 boost (4/12)
    • Expectations
      o 50% increase in reply rate for 1.07 boost
      o 100% increase in reply rate for 1.15 boost
      o Expected booking rate and email rate to remain unchanged or minimally affected
  • A/B Test Results (% increase over control)
                    α = β = 1.07    α = β = 1.15
    Booking rate    0%              -0.4%
    Email rate      31%             25%
    Reply rate      22%             42%
  • Conclusion
    • Consider the multiple facets of your system’s utility function to improve utility efficiently
      o Handle competing objectives carefully
        • Know your tradeoff!
      o Listen to user feedback
      o A/B test furiously
  • Thank You! mrodriguez@linkedin.com. We’re Hiring! 175M+; 62% non U.S.; 2/sec; 25th most visited website worldwide (Comscore 6-12); >2M Company pages; 85% of Fortune 500 Companies use LinkedIn to hire. (Chart: LinkedIn Members (Millions), 2004 to 2011)