Recommender Systems: The Art and Science of Matching Items to Users - A LinkedIn open data talk by Deepak Agarwal from Yahoo! Research

Algorithmically matching items to users in a given context is essential for the success and profitability of large-scale recommender systems such as content optimization, computational advertising, search, shopping, movie recommendation, and many more. The objective is to maximize some utility of interest (e.g., total revenue, total engagement) over a long time horizon. This is a bandit problem, since there is positive utility in displaying items that may have a low mean but high variance. A key challenge in such bandit problems is the curse of dimensionality. Bandit problems are also difficult when responses are observed with considerable delay (e.g., return visits, confirmation of a purchase). One approach is to optimize multiple competing objectives in the short term to achieve the best long-term performance. For instance, in serving content to users on a website, one may want to optimize some combination of clicks and downstream advertising revenue in the short term to maximize revenue in the long run. In this talk, I will discuss some of the technical challenges by focusing on a concrete application: content optimization on the Yahoo! front page. I will also briefly discuss response prediction techniques for serving ads on the RightMedia Ad Exchange.

Bio: Deepak Agarwal is a statistician at Yahoo! who is interested in developing statistical and machine learning methods to enhance the performance of large-scale recommender systems. Deepak and his collaborators significantly improved article recommendation on several Yahoo! websites, most notably on the Yahoo! front page (a 200+% improvement in click-rates). He also works closely with teams in computational advertising to deploy elaborate statistical models on the RightMedia Ad Exchange, yet another large-scale recommender system. He currently serves as associate editor for the Journal of the American Statistical Association (JASA) and the IEEE Transactions on Knowledge and Data Engineering (TKDE).

  • Speaker note: Focus on the Today module. It publishes trendy, eclectic articles on a broad range of topics including sports, finance, entertainment, etc. For each visit, select 4 to display from an inventory of K. Hundreds of millions of visits/day, ~600M visitors per month.


  • 1. Recommender Systems: The Art and Science of Matching Items to Users
    Deepak Agarwal
    LinkedIn, 7th July, 2011
  • 2. Recommender Systems
    Serve the “right” item to users in an automated fashion to optimize long-term business objectives
  • 3. Content Optimization: Match articles to users
  • 4. Advertising: Recommend Ads on Pages
    Display/Graphical Ad
    Contextual Advertising
  • 5. Shopping: Recommend Related Items to buy
  • 6. Recommend Movies
  • 7. Recommend People
  • 8. Problem Definition
    Item Inventory
    Articles, web pages, ads, …
    Example applications: Content, Movie, …
    Construct an automated algorithm to select item(s) to show
    Get feedback
    (click, time-spent, rating, buy, …)
    Refine parameters of the algorithm
    Repeat (large number of times)
    Optimize metric(s) of interest
    (Total clicks, Total revenue,…)
    Low marginal cost per serve; efficient and intelligent systems can provide significant improvements
    Context: previous item viewed, …

  • 9. Data Mining -> Clever Algorithms
    So much data: can we process it all, and process it fast?
    Ideally, we want to learn every user-item interaction
    Number of things to learn increases with data size
    Dynamic nature exacerbates the problem
    We want to learn things quickly in order to react fast
  • 10. Simple Approach: Segment Users/Items
    Estimate CTR of items in each user segment:
    CTRij = clicksij / viewsij (user segment i, item/item segment j)
    Serve the most popular item in each segment
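A minimal sketch of the segment-level counting above; the event log, segment labels, and item ids are all hypothetical:

```python
from collections import defaultdict

# Hypothetical event log: (user_segment, item, clicked)
events = [
    ("young", "a1", 1), ("young", "a1", 0), ("young", "a2", 0),
    ("old", "a1", 0), ("old", "a2", 1), ("old", "a2", 1),
]

clicks = defaultdict(int)
views = defaultdict(int)
for seg, item, clicked in events:
    views[(seg, item)] += 1
    clicks[(seg, item)] += clicked

def ctr(seg, item):
    """CTR_ij = clicks_ij / views_ij for user segment i, item j."""
    return clicks[(seg, item)] / views[(seg, item)]

def most_popular(seg, items):
    """Serve the highest-CTR item within the segment."""
    return max(items, key=lambda j: ctr(seg, j))
```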
  • 11. Example Application: Yahoo! front page
    Recommend most popular article on slot F1 (out of 30-40, editorially programmed)
    Can collect data every 5 minutes
    Should be simple, just count clicks and views, right?
    Not quite!
    Today module
  • 12. Simple algorithm we began with
    Initialize CTR of every new article to some high number
    This ensures a new article has a chance of being shown
    Show the most popular CTR article (randomly breaking ties) for each user visit in the next 5 minutes
    Re-compute the global article CTRs after 5 minutes
    Show the new most popular for next 5 minutes
    Keep updating article popularity over time
    Quite intuitive. Did not work! Performance was bad. Why?
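The naive loop on this slide can be sketched as follows; `serve_batch`, the article names, and the optimistic initial CTR are hypothetical stand-ins:

```python
import random

def run_epochs(inventory, serve_batch, n_epochs, init_ctr=0.9):
    """Naive 'most popular' loop: optimistic init, recompute CTR every epoch.

    serve_batch(article) -> (clicks, views) simulates 5 minutes of serving.
    """
    clicks = {a: init_ctr for a in inventory}   # optimistic pseudo-clicks
    views = {a: 1.0 for a in inventory}         # one pseudo-view each
    history = []
    for _ in range(n_epochs):
        best_ctr = max(clicks[a] / views[a] for a in inventory)
        best = random.choice(                   # break ties at random
            [a for a in inventory if clicks[a] / views[a] == best_ctr])
        c, v = serve_batch(best)                # show it for the next 5 minutes
        clicks[best] += c
        views[best] += v
        history.append(best)
    return history
```

Note the key flaw the next slides explain: the cumulative CTR estimate ignores decay and exposure bias, so this loop reacts slowly and over-serves bad articles.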
  • 13. Bias in the data: Article CTR decays over time
    This is what an article CTR curve looked like
    We were computing CTR by cumulating clicks and views.
    Missing decay dynamics? Dynamic growth model using a Kalman filter.
    New model tracked decay very well, performance still bad
    And the plot thickens, my dear Watson!
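To make the "track a dynamic CTR" idea concrete, here is a minimal scalar Kalman filter for a random-walk CTR state; this is an illustrative local-level model, not the authors' actual dynamic growth model, and all noise variances are made up:

```python
def kalman_track(observations, q=1e-4, r=1e-3, x0=0.05, p0=1.0):
    """Scalar Kalman filter for a random-walk CTR state:
       x_t = x_{t-1} + w_t (w ~ N(0, q)),  y_t = x_t + v_t (v ~ N(0, r)).
    Returns filtered CTR estimates, one per 5-minute interval.
    """
    x, p = x0, p0
    estimates = []
    for y in observations:
        p = p + q                 # predict: state variance grows over time
        k = p / (p + r)           # Kalman gain
        x = x + k * (y - x)       # update toward the new observation
        p = (1 - k) * p
        estimates.append(x)
    return estimates
```

With a decaying sequence of observed CTRs, the filtered estimate tracks the decline instead of averaging over all history the way cumulative counting does.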
  • 14. Explanation of decay: Repeat exposure
    User fatigue -> CTR decay
  • 15. Clues to solve the mystery
    Users seeing an article for the first time have a higher CTR; those already exposed have a lower one,
    but we use the same CTR estimate for all?
    Other sources of bias? How to adjust for them?
    A simple idea to remove bias
    Display articles at random to a small randomly chosen population
    Call this the Random bucket
    Randomization removes bias in data
    (Charles Peirce, 1877; R.A. Fisher, 1935)
    Some other observations
    Sticking with an article for complete 5 minutes was degrading performance, many bad articles got displayed too many times
    Reaction time to display good articles was slower
  • 16. CTR of same article with/without randomization
    Serving bucket
    Random bucket
  • 17. CTR of articles in Random bucket
    Unbiased CTR, but it is dynamic. Simply counting clicks and views still won’t work well.
  • 18. New algorithm
    Create a small random bucket which selects one out of K existing articles at random for each user visit
    Learn unbiased article popularity using random bucket data by tracking (through a non-linear Kalman filter)
    Serve the most popular article in the serving bucket
    Override rules: Don’t show an article to a user after few previous exposures, other rules (diversity, voice),….
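The random-bucket scheme above can be sketched like this (the click model, article names, and 5% bucket size are hypothetical; the talk tracks popularity with a Kalman filter, while this sketch just counts):

```python
import random

def serve(user_visits, inventory, random_frac=0.05, rng=random.Random(7)):
    """Random-bucket scheme: a small fraction of traffic sees a uniformly
    random article (yielding unbiased CTR data); the rest sees the current
    most popular article as estimated from random-bucket data only."""
    stats = {a: [0, 0] for a in inventory}   # [clicks, views] from random bucket

    def best():
        # most popular by unbiased (random-bucket) CTR
        return max(inventory, key=lambda a: stats[a][0] / max(stats[a][1], 1))

    served = []
    for click_model in user_visits:          # click_model(article) -> 0 or 1
        if rng.random() < random_frac:
            a = rng.choice(inventory)        # random bucket: collect unbiased data
            stats[a][0] += click_model(a)
            stats[a][1] += 1
        else:
            a = best()                       # serving bucket: exploit
        served.append(a)
    return served, stats
```

Because every article keeps receiving a trickle of random traffic, bad articles are detected and dropped quickly instead of being locked in for a full 5-minute epoch.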
  • 19. Other advantages
    The random bucket ensures continuous flow of data for all articles, we quickly discard bad articles and converge to the best one
    This saved the day, the project was a success!
    Initial click-lift 40% (Agarwal et al. NIPS 08)
    after 3 years it is 200+% (fully deployed on Yahoo! front page and elsewhere on Yahoo!), we are still improving the system
  • 20. More Details
    Agarwal, Chen, Elango, Ramakrishnan, Motgi, Roy, Zachariah. Online models for Content Optimization, NIPS 2008
    Agarwal, Chen, Elango. Spatio-Temporal Models for Estimating Click-through Rate, WWW 2009
  • 21. Lessons learnt
    It is ok to start with simple models that learn a few things, but beware of the biases inherent in your data
    E.g. of things gone wrong
    Learning article popularity
    Data used from 5am–8am PST, served from 10am–1pm PST
    Bad idea if article popular on the east, not on the west
    Randomization is a friend, use it when you can. Update the models fast, this may reduce the bias
    User visit patterns close in time are similar
    What if we can’t afford complete randomization?
    Learn how to gamble
  • 22. Why learn how to gamble?
    Consider a slot machine with two arms with unknown payoff probabilities p1 and p2
    The gambler has 1000 plays; what is the best way to experiment (to maximize total expected reward)?
    This is called the “bandit” problem; it has been studied for a long time.
    Optimal solution: Play the arm that has maximum potential of being good
  • 23. Recommender Problems: Bandits?
    Two Items: Item 1 CTR= 2/100 ; Item 2 CTR= 250/10000
    Greedy: Show Item 2 to all; not a good idea
    Item 1 CTR estimate noisy; item could be potentially better
    Invest in Item 1 for better overall performance on average
    This is also referred to as Explore/exploit problem
    Exploit what is known to be good, explore what is potentially good
    [Figure: probability densities of the CTR estimates for Articles 1 and 2]
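One standard way to act on this intuition is Thompson sampling: sample each item's CTR from its posterior and serve the argmax. This is a common bandit heuristic, not necessarily the Bayes-optimal scheme the talk derives; the item names and counts come from the slide:

```python
import random

def thompson_choose(stats, rng=random.Random(0)):
    """Pick the item whose sampled CTR (from a Beta posterior) is largest.

    stats: {item: (clicks, views)}. Under a uniform prior, the posterior
    is Beta(1 + clicks, 1 + views - clicks).
    """
    draws = {
        item: rng.betavariate(1 + c, 1 + v - c)
        for item, (c, v) in stats.items()
    }
    return max(draws, key=draws.get)

# Items from the slide: noisy 2/100 vs. well-estimated 250/10000
stats = {"item1": (2, 100), "item2": (250, 10000)}
picks = [thompson_choose(stats, random.Random(i)) for i in range(1000)]
share_item1 = picks.count("item1") / len(picks)
```

Item 1's posterior is wide, so it still wins a substantial share of draws despite its lower raw CTR; that is exactly the "invest in Item 1" behavior the slide calls for.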
  • 24. Bayes optimal solution in the next 5 minutes: 2 articles, 1 uncertain
    Optimal allocation to uncertain article
    Uncertainty in CTR: pseudo #views
  • 25. More Details on the Bayes Optimal Solution
    Agarwal, Chen, Elango. Explore-Exploit Schemes for Web Content Optimization, ICDM 2009
    (Best Research Paper Award)
  • 26. Recommender Problems: bandits in a casino
    Items are arms of bandits, ratings/CTRs are unknown payoffs
    Goal is to converge to the best CTR item quickly
    But this assumes one size fits all (no personalization)
    Each user is a separate bandit
    Hundreds of millions of bandits (huge casino)
    Rich literature (several tutorials on the topic)
    Broadly : Clever/adaptive randomization
    Our random bucket is a solution, often a good one in practice.
  • 27. Back to the number of things to learn (curse of dimensionality)
    Pros of learning things at granular resolutions
    Better estimates of affinities at event level
    (ad 77 has high CTR on publisher 88, instead of ad 77 has good CTR on sports publisher)
    Bias becomes less problematic
    The more we chop, the less prone we are to aggregating dissimilar things, and the less biased our estimates are.
    Too much sparsity to learn everything at granular resolutions
    We don’t have that much traffic
    E.g. many ads are not even shown on many publishers
    Explore/exploit helps but cannot do so much experimentation
    In advertising, response rates (conversion, click) are too low, further exacerbates the problem
  • 28. Solution: Go granular but with back-off
    Too little data at granular level, need to borrow from coarse resolutions with abundant data (smoothing, shrinkage)
    CTR(1) = w1·(0/5) + w11·(2/200) + w12·(40/1000) + w111·(400/10000) + w121·(200/5000)
    1. Pub-id=88, ad-id=77, zip=Palo Alto
    11. Palo Alto
    12. Pub-id=88, adv-id=9
    111. Bay Area
    121. Adv-id=9
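The back-off estimate on this slide is a weighted average of empirical CTRs from the granular cell and its coarser ancestors. A minimal sketch using the slide's counts; the weights here are purely illustrative (learning them is the subject of the following slides):

```python
def backoff_ctr(cells, weights):
    """Shrinkage estimate: weighted average of empirical CTRs from the
    granular cell and its coarser ancestors (weights sum to 1)."""
    assert abs(sum(weights[k] for k in cells) - 1.0) < 1e-9
    return sum(weights[k] * (c / v) for k, (c, v) in cells.items())

# Cells from the slide: node id -> (clicks, views)
cells = {
    "1":   (0, 5),        # pub-id=88, ad-id=77, zip=Palo Alto (granular)
    "11":  (2, 200),      # Palo Alto
    "12":  (40, 1000),    # pub-id=88, adv-id=9
    "111": (400, 10000),  # Bay Area
    "121": (200, 5000),   # adv-id=9
}
# Illustrative weights: little data at the leaf, so most weight on ancestors
weights = {"1": 0.05, "11": 0.25, "12": 0.25, "111": 0.25, "121": 0.20}
est = backoff_ctr(cells, weights)
```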
  • 29. Sometimes too much data at granular level
    No need to back-off
    CTR(1) = 100/50000
    1. Pub-id=88, ad-id=80, zip=Arizona
    11. Arizona
    12. Pub-id=88, adv-id=8
  • 30. How much to borrow from ancestors?
    Learning the weights when there is little data
    Depends on heterogeneity in CTRs of small cells
    Ancestors with similar CTR child nodes are more credible
    E.g. if all zip-codes in Bay Area have similar CTRs, more weights given to Bay Area node
    Pool similar cells, separate dissimilar ones
    [Tree: Bay Area -> Palo Alto, Mtn View, Los Gatos]
  • 31. Crucial issue
    Obtain grouping structures to perform effective back-off
    How do we detect such groupings when dealing with high dimensional data?
    Billions/trillions of possible attribute combinations
    Statistical modeling to the rescue
    Art and science, requires experience.
    Important to understand the business, the problem, the data.
  • 32. How do we estimate heterogeneity for a group?
    Simple example: CTR of an ad in different zip-codes
    (si, ti): i = 1, …, K; emCTRi = si / ti
    Is Var(emCTRi) a good measure of heterogeneity?
    Not quite; empirical estimates are not good for small ti and/or si
    Use a model
    Variance among true CTRs can be estimated in a better way using MLE/MOM
    (Agarwal & Chen, Latent OLAP, SIGMOD 2011)
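A simple method-of-moments version of this idea: the variance of the empirical CTRs mixes true between-cell heterogeneity with binomial sampling noise, so subtract an estimate of the noise. This is a textbook MOM sketch, not the Latent OLAP estimator from the cited paper:

```python
def mom_heterogeneity(stats):
    """Method-of-moments estimate of Var(true CTR) across cells.

    stats: list of (clicks s_i, views t_i). Empirical CTRs vary both
    because true CTRs differ and because of binomial noise; subtract the
    average noise term c_i*(1 - c_i)/t_i from the raw variance of the CTRs.
    """
    k = len(stats)
    ctrs = [s / t for s, t in stats]
    mean = sum(ctrs) / k
    raw_var = sum((c - mean) ** 2 for c in ctrs) / k
    noise = sum(c * (1 - c) / t for (s, t), c in zip(stats, ctrs)) / k
    return max(raw_var - noise, 0.0)   # clip: a variance cannot be negative
```

Homogeneous cells (all the same true CTR) should yield an estimate near zero, while genuinely heterogeneous cells yield a positive one.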
  • 33. Two examples of learning granular models with back-off
  • 34. Online Advertising: Matching ads to opportunities
    Pick best ads
    Ad Network
    Examples: Yahoo!, Google, MSN,
    ad exchanges (network of “networks”), …
  • 35. How to Select “Best” ads
    Pick best ads
    Ad Network
    Response rates (click, conversion, …)
    Select argmax f(bid, rate)
  • 36. The Ad Exchange: Unified Marketplace
    Bids $0.75 via Network…
    Bids $0.50
    Bids $0.60
    Bids $0.65—WINS!
    Has ad impression to sell --
    … which becomes $0.45 bid
    Transparency and value
  • 37. Advertising example
    f(bid, rate) ---- rate is unknown, needs to be estimated
    Goal: maximize revenue, advertiser ROI
    High dimensional rate estimation
    Response obtained through interactions among a few heavy-tailed categorical variables (pub, user, and ad)
    #levels: could be millions and changes over time
  • 38. Data
    Features available for both opportunity and ad
    Publisher: Publisher content type
    User: demographics, geo,…
    Ad: Industry, text/video, text (if any)
    Hierarchically organized
    Publisher hierarchy: URL -> Domain -> Publisher type
    Geo hierarchy for users
    Ad hierarchy: Ad -> Campaign -> Advertiser
    Past empirical analysis (Agarwal et al, KDD 2007)
    Hierarchical grouping provides homogeneity in rates
    Here, groupings available through domain knowledge
  • 39. Model Setup
    Eij = Σu B(xi, xu, xj) (expected successes)
    Sij ~ Poisson(Eij λij)
    MLE (Sij / Eij) does not work well
  • 40. Hierarchical Smoothing of residuals
    Assuming two hierarchies (Publisher and advertiser)
    cell (i,j)
    (Sij, Eij, λij)
  • 41. Back-off Model
    7 neighbors
    3 blues, 4 greens
    (Sij, Eij, λij)
    Back-off is through parameter sharing
    Blues and greens are neighbors of several reds
  • 42. Ad exchange (RightMedia)
    Advertisers participate in different ways
    CPM (pay by ad-view)
    CPC (pay per click)
    CPA (pay per conversion)
    To conduct an auction, normalize across pricing types
    Compute eCPM (expected CPM)
    Click-based eCPM = click-rate * CPC
    Conversion-based eCPM = conv-rate * CPA
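The eCPM normalization above can be sketched directly; the bids and rates below are invented, and the ×1000 scaling (CPM is priced per thousand views) is an assumption about units:

```python
def ecpm(pricing_type, price, click_rate=None, conv_rate=None):
    """Normalize bids across pricing types to expected CPM (per 1000 views)."""
    if pricing_type == "CPM":
        return price
    if pricing_type == "CPC":
        return 1000.0 * click_rate * price      # click-based eCPM
    if pricing_type == "CPA":
        return 1000.0 * conv_rate * price       # conversion-based eCPM
    raise ValueError(pricing_type)

# Hypothetical auction with one bidder per pricing type
bids = [
    ecpm("CPM", 0.60),
    ecpm("CPC", 0.50, click_rate=0.002),        # 1000 * 0.002 * 0.50 = 1.00
    ecpm("CPA", 20.0, conv_rate=0.00002),       # 1000 * 0.00002 * 20 = 0.40
]
winner = max(range(len(bids)), key=lambda i: bids[i])
```

Once every bid is on the eCPM scale, the auction is a simple argmax; the hard part, as the talk stresses, is estimating the unknown click and conversion rates.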
  • 43. Data
    Two kinds of conversion rates
    Post-Click conv-rate = click-rate*conv/click
    Post-View conv-rate = conv/ad-view
    Three response rate models
    Click-rate (CLICK), conv/click (PCC),
    post-view conv/view (PVC)
  • 44. Datasets : Right-Media
    CLICK [~90B training events, ~100M parameters]
    Post-Click Conversion (PCC) (~0.5B training events, ~81M parameters)
    PVC – Post-View conversions (~7B events, ~6M parameters)
    The cookie gets augmented with a pixel; a conversion is triggered when the user visits the landing page
    Age, gender, ad-size, pub-class, user fatigue
    2 hierarchies (publisher and advertiser)
    Two baselines
    Pubid x adid [FINE] (no hierarchical information)
    Pubid x advertiser [COARSE] (collapse cells)
  • 45. Accuracy: Average test log-likelihood
  • 46. More Details
    Agarwal, Kota, Agrawal, Khanna: Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models, KDD 2010
  • 47. Back to Yahoo! front page
    Recommend articles:
    Title, summary
    Links to other pages
    For each user visit,
    Pick 4 out of a pool of K
    Routes traffic to other pages
  • 48. Data
    User i with user features xi (browse history, search history, …)
    Article j with item features xj (keywords, content categories, ...)
    Algorithm selects (i, j): response yij (rating or click/no-click)
  • 49. Bipartite Graph completion problem
    Observed Graph
    CTR Graph
  • 50. Factor Model to estimate CTR at granular levels
    Item popularity
    User popularity
  • 51. Estimating granular latent factors via back-off
    If user/item have high degree, good estimates of factors available else we need back-off
    Back-off: We use user/item features through regressions
    Age=old Geo=Mtn-View Int=Ski
    Uik = G1k 1(Agei=old) + G2k 1(Geoi=Mtn-View) + G3k 1(Inti=Ski)
    Weights of 8 different fallbacks using 3 parameters
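The regression back-off on this slide sets a user's latent factors from feature-level weights, so 3 binary features cover all 2^3 user profiles with 3 parameters per latent dimension. A minimal sketch; the weight values in G are invented for illustration:

```python
def user_factor(features, G, k_dims=2):
    """Feature-based prior for user factors: u_ik = sum over active
    features f of G[f][k] (an indicator-weighted regression)."""
    return [sum(G[f][k] for f in features) for k in range(k_dims)]

# Hypothetical regression weights per feature, per latent dimension
G = {
    "age=old":      [0.3, -0.1],
    "geo=mtn-view": [0.1,  0.4],
    "int=ski":      [-0.2, 0.5],
}
u = user_factor(["age=old", "geo=mtn-view", "int=ski"], G)
```

For a brand-new user this regression prior is the whole estimate; for a heavy user it is blended with the user's own ratings, as the next slide describes.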
  • 52. Estimates with back-off
    For new user/article, factor estimates based on features
    For old user/article, factor estimates
    Linear combination of regression and user “ratings”
  • 53. Estimating the back-off Regression function
    The integral cannot be computed in closed form; it is approximated by Monte Carlo using Gibbs sampling
  • 54. Data Example
    2M binary observations by 30K heavy users on 4K articles
    Heavy user: at least 30 visits to the portal in the last 5 months
    Article features
    Editorially labeled category information (~50 binary features)
    User features
    Demographics, browse behavior (~1K features)
    Training/test split by timestamp of events (75/25)
    Factor model with regression, no online updates
    Factor model with regression + online updates
    Online model based on user-user similarity (Online-UU)
    Online probabilistic latent semantic index (Online-PLSI)
  • 55. ROC curve
    Factor model: regression + online updates
    Factor model: regression only
  • 56. More Details
    Agarwal and Chen: Regression Based Latent Factor Models, KDD 2009
  • 57. Computation
    Both models run on Hadoop, scalable to large datasets
    For the factor models, also working on online EM
    Collaboration with Andrew Cron, Duke University
  • 58. Multi-Objectives: Beyond Clicks
  • 59. Post-click utilities
    Clicks on FP links influence downstream supply distribution
    Downstream engagement (time spent)
  • 60. Serving Content on Front Page: Click Shaping
    What do we want to optimize?
    Usual: Maximize clicks (maximize downstream supply from FP)
    But consider the following
    Article 1: CTR=5%, utility per click = 5
    Article 2: CTR=4.9%, utility per click=10
    By promoting Article 2, we lose 0.1 clicks per 100 visits (5 vs. 4.9 clicks) but gain 24 utils (49 vs. 25)
    If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility?
    E.g. lose 5% relative CTR, gain 20% in utility (revenue, engagement, etc)
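A quick check of the trade-off arithmetic above, using the slide's numbers per 100 visits:

```python
# Article 1: CTR 5%, utility 5 per click; Article 2: CTR 4.9%, utility 10
ctr1, util1 = 0.05, 5
ctr2, util2 = 0.049, 10

visits = 100
clicks_lost = (ctr1 - ctr2) * visits                     # 0.1 click per 100 visits
utility_gain = (ctr2 * util2 - ctr1 * util1) * visits    # 49 - 25 = 24 utils
```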
  • 61. How are Clicks being Shaped?
    Supply distribution
    SHAPING can happen with respect to multiple downstream metrics (like engagement, revenue,…)
  • 62. Multi-Objective Optimization
    n articles
    K properties
    m user segments
    xij: decision variables
    Known pij, dij
    CTR of user segment i on article j: pij
    Time duration of i on j: dij
  • 63–65. Multi-Objective Program
    Scalarization
    Goal Programming
    Pareto-optimal solution (more in KDD 2011)
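Scalarization, the first technique named above, collapses the competing objectives (clicks pij and time pij·dij) into one weighted score. A toy sketch with invented pij/dij values; real click shaping solves a constrained program over traffic allocations, while this sketch drops the supply constraints to show the scalarization step only:

```python
def scalarized_allocation(p, d, alpha):
    """Per-segment argmax of alpha * p_ij + (1 - alpha) * p_ij * d_ij.

    p[i][j]: CTR of user segment i on article j
    d[i][j]: time duration per click of segment i on article j
    alpha = 1 is clicks-only; alpha = 0 is time-spent-only.
    """
    plan = []
    for pi, di in zip(p, d):
        scores = [alpha * pij + (1 - alpha) * pij * dij
                  for pij, dij in zip(pi, di)]
        plan.append(max(range(len(pi)), key=lambda j: scores[j]))
    return plan

p = [[0.05, 0.049], [0.02, 0.03]]   # hypothetical CTRs
d = [[5.0, 10.0], [8.0, 2.0]]       # hypothetical time-per-click
clicks_only = scalarized_allocation(p, d, alpha=1.0)
time_only   = scalarized_allocation(p, d, alpha=0.0)
```

Varying alpha traces out different points on the clicks-vs-engagement trade-off; the KDD 2011 paper compares this against goal programming and Pareto-optimal formulations.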
  • 66. More Details
    Agarwal, Chen, Elango, Wang: Click Shaping to Optimize Multiple Objectives, KDD 2011 (forthcoming)
  • 67. Can we do it with Advertising Revenue?
    Yes, but need to be careful.
    Interventions can cause undesirable long-term impact
    Communication between two complex distributed systems
    Display advertising at Y! also sold as long-term guaranteed contracts
    We intervene to change supply when contract is at risk of under-delivering
    Research to be shared in the future
  • 68. Summary
    Simple models that learn a few parameters are fine to begin with BUT beware of bias in data
    Small amounts of randomization + fast model updates
    Clever Randomization using Explore/Exploit techniques
    Granular models are more effective but we need good statistical algorithms to provide back-off estimates
    Considering multi-objective optimization is often important
  • 69. A modeling strategy
    Feature Engineering
    Content: IR, clustering, taxonomy, entity,..
    User profiles: clicks, views, social, community,..
    (Fine resolution
    (item, user level)
    (Quick updates)
    Offline(Logistic, GBDT,..)
    Coarse and slow changing
    (Adaptive sampling)
  • 70. Indexing for fast retrieval at runtime
    Retrieving the top-k in a few milliseconds when the item inventory is large can be challenging for complex models
    Current work (joint with Maxim Gurevich)
    Approximate the model by an index friendly synthetic model
    The index-friendly model retrieves the top-K very fast; a second-stage evaluation on the top-K retrieves the top-k (K > k)
    Research to be shared in a forthcoming paper
  • 71. Collaborators
    Bee-Chung Chen (Yahoo! Research, CA)
    Pradheep Elango (Yahoo! Labs, CA)
    Liang Zhang (Yahoo! Labs, CA)
    Nagaraj Kota (Yahoo! Labs, India)
    Xuanhui Wang (Yahoo! Labs, CA)
    Rajiv Khanna (Yahoo! Labs, India)
    Andrew Cron (Duke University)
    Engineering & Product Teams (CA, India)
    Special thanks to Yahoo! Labs senior leadership for the support
    Andrei Broder, Preston McAfee, Prabhakar Raghavan, Raghu Ramakrishnan
  • 72. E-mail: dagarwal@yahoo-inc.com
    Thank you !