ENAR Short Course

  1. 1. Statistical Computing For Big Data Deepak Agarwal LinkedIn Applied Relevance Science dagarwal@linkedin.com ENAR 2014, Baltimore, USA
  2. 2. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  4. 4. Big Data becoming Ubiquitous • Bioinformatics • Astronomy • Internet • Telecommunications • Climatology • …
  5. 5. Big Data: Some size estimates • 1000 human genomes: > 100TB of data (1000 Genomes Project) • Sloan Digital Sky Survey: 200GB of data per night (>140TB aggregated) • Facebook: a billion monthly active users • LinkedIn: 225M members worldwide • Twitter: 500 million tweets a day • Over 6 billion mobile phones in the world generating data every day
  6. 6. Big Data: Paradigm shift • Classical statistics – Generalize using small data • Paradigm shift with big data – We now have an almost infinite supply of data – Easy statistics? Just appeal to asymptotic theory? • So the issue is mostly computational? – Not quite • More data comes with more heterogeneity • Need to change our statistical thinking to adapt – Classical statistics still invaluable for thinking about big data analytics
  7. 7. Some Statistical Challenges • Exploratory Analysis (EDA), Visualization – Retrospective (on Terabytes) – More Real Time (streaming computations every few minutes/hours) • Statistical Modeling – Scale (computational challenge) – Curse of dimensionality • Millions of predictors, heterogeneity – Temporal and Spatial correlations
  8. 8. Statistical Challenges continued • Experiments – To test new methods, test hypotheses from randomized experiments – Adaptive experiments • Forecasting – Planning, advertising • Many more that I am not fully well versed in
  9. 9. Defining Big Data • How do you know you have a big data problem? – Is it only the number of terabytes? – What about dimensionality, structured/unstructured data, computations required, … • No clear definition; different points of view – One: when the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
  10. 10. Distributed Computing for Big Data • Distributed computing invaluable tool to scale computations for big data • Some distributed computing models – Multi-threading – Graphics Processing Units (GPU) – Message Passing Interface (MPI) – Map-Reduce
  11. 11. Evaluating a method for a problem • Scalability – Process X GB in Y hours • Ease of use for a statistician • Reliability (fault tolerance) – Especially in an industrial environment • Cost – Hardware and cost of maintenance • Good for the computations required? – E.g., iterative versus one-pass • Resource sharing
  12. 12. Multithreading • Multiple threads take advantage of multiple CPUs • Shared memory • Threads can execute independently and concurrently • Can only handle Gigabytes of data • Reliable
  13. 13. Graphics Processing Units (GPU) • Number of cores: – CPU: order of 10 – GPU: order of 1000 (smaller cores) • Can be >100x faster than CPU – Parallel, computationally intensive tasks off-loaded to the GPU • Good for certain computationally intensive tasks • Can only handle gigabytes of data • Not trivial to use; requires a good understanding of the low-level architecture for efficient use – But things are changing; it is getting more user friendly
  14. 14. Message Passing Interface (MPI) • Language independent communication protocol among processes (e.g. computers) • Most suitable for master/slave model • Can handle Terabytes of data • Good for iterative processing • Fault tolerance is low
  15. 15. Map-Reduce (Dean & Ghemawat, 2004) [Diagram: data flows through mappers, then reducers, to the output] • Computation is split into Map (scatter) and Reduce (gather) stages • Easy to use: – The user needs to implement only two functions: Mapper and Reducer • Easily handles terabytes of data • Very good fault tolerance (failed tasks automatically get restarted)
  16. 16. Comparison of Distributed Computing Methods
      Criterion | Multithreading | GPU | MPI | Map-Reduce
      Scalability (data size) | Gigabytes | Gigabytes | Terabytes | Terabytes
      Fault tolerance | High | High | Low | High
      Maintenance cost | Low | Medium | Medium | Medium-High
      Iterative process complexity | Cheap | Cheap | Cheap | Usually expensive
      Resource sharing | Hard | Hard | Easy | Easy
      Easy to implement? | Easy | Needs understanding of low-level GPU architecture | Easy | Easy
  17. 17. Example Problem • Tabulating word counts in a corpus of documents • Similar to the table() function in R
  18. 18. Word Count Through Map-Reduce [Diagram] • Input split across two mappers: Mapper 1 gets "Hello World Bye World"; Mapper 2 gets "Hello Hadoop Goodbye Hadoop" • Mapper 1 emits <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1>; Mapper 2 emits <Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1> • Reducer 1 handles words A-G and outputs <Bye, 1>, <Goodbye, 1>; Reducer 2 handles words H-Z and outputs <Hadoop, 2>, <Hello, 2>, <World, 2>
  19. 19. Key Ideas about Map-Reduce [Diagram: Big Data is split into partitions 1…N; mappers 1…N process the partitions and emit <Key, Value> pairs; reducers 1…M consume the pairs and produce outputs 1…M]
  20. 20. Key Ideas about Map-Reduce • Data are split into partitions and stored on many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs • Data with the same key are sent to the same reducer; one reducer can receive multiple keys • Every reducer sorts its data by key • For each key, the reducer processes the corresponding values according to the custom reducer function and outputs the result
  21. 21. Compute Mean for Each Group
      ID | Group No. | Score
      1 | 1 | 0.5
      2 | 3 | 1.0
      3 | 1 | 0.8
      4 | 2 | 0.7
      5 | 2 | 1.5
      6 | 3 | 1.2
      7 | 1 | 0.8
      8 | 2 | 0.9
      9 | 4 | 1.3
      … | … | …
  23. 23. Key Ideas about Map-Reduce • Data are split into partitions and stored on many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs – For each row: • Key = Group No. • Value = Score • Data with the same key are sent to the same reducer; one reducer can receive multiple keys – E.g. 2 reducers – Reducer 1 receives data with key = 1, 2 – Reducer 2 receives data with key = 3, 4 • Every reducer sorts its data by key – E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]> • For each key, the reducer processes the corresponding values according to the custom reducer function and outputs the result – E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)> (the mapper and reducer functions are what you need to implement)
  24. 24. Pseudo Code (in R)
      Mapper:
        Input: Data
        for (row in Data) {
          groupNo = row$groupNo;
          score = row$score;
          Output(c(groupNo, score));   # emit <groupNo, score>
        }
      Reducer:
        Input: Key (groupNo), Value (a list of scores that belong to the Key)
        count = 0;
        sum = 0;
        for (v in Value) {
          sum = sum + v;
          count = count + 1;
        }
        Output(c(Key, sum / count));   # emit <groupNo, mean score>
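As a sanity check, the same computation on a single machine in R, using the toy table from slide 21, is a one-liner:

      # Single-machine equivalent of the map-reduce job above
      d <- data.frame(groupNo = c(1, 3, 1, 2, 2, 3, 1, 2, 4),
                      score   = c(0.5, 1.0, 0.8, 0.7, 1.5, 1.2, 0.8, 0.9, 1.3))
      tapply(d$score, d$groupNo, mean)   # group means: 0.70, 1.03, 1.10, 1.30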
  25. 25. Exercise 1 • Problem: Average height per {Grade, Gender}? • What should be the mapper output key? • What should be the mapper output value? • What is the reducer input? • What is the reducer output? • Write a mapper and reducer for this.
      Student ID | Grade | Gender | Height (cm)
      1 | 3 | M | 120
      2 | 2 | F | 115
      3 | 2 | M | 116
      … | … | … | …
  26. 26. Exercise 1 (Solution) • Problem: Average height per {Grade, Gender}? • What should be the mapper output key? – {Grade, Gender} • What should be the mapper output value? – Height • What is the reducer input? – Key: {Grade, Gender}, Value: list of heights • What is the reducer output? – {Grade, Gender, mean(Heights)} (see the sketch below)
      Student ID | Grade | Gender | Height (cm)
      1 | 3 | M | 120
      2 | 2 | F | 115
      3 | 2 | M | 116
      … | … | … | …
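A runnable single-machine analogue of this job in R, using the slide's toy rows; grouping by {grade, gender} plays the role of the composite mapper key, and the mean over each group plays the role of the reducer:

      students <- data.frame(grade  = c(3, 2, 2),
                             gender = c("M", "F", "M"),
                             height = c(120, 115, 116))
      aggregate(height ~ grade + gender, data = students, FUN = mean)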
  27. 27. Exercise 2 • Problem: Number of students per {Grade, Gender}? • What should be the mapper output key? • What should be the mapper output value? • What is the reducer input? • What is the reducer output? • Write a mapper and reducer for this.
      Student ID | Grade | Gender | Height (cm)
      1 | 3 | M | 120
      2 | 2 | F | 115
      3 | 2 | M | 116
      … | … | … | …
  28. 28. Exercise 2 (Solution) • Problem: Number of students per {Grade, Gender}? • What should be the mapper output key? – {Grade, Gender} • What should be the mapper output value? – 1 • What is the reducer input? – Key: {Grade, Gender}, Value: list of 1's • What is the reducer output? – {Grade, Gender, sum(value list)} – OR: {Grade, Gender, length(value list)}
      Student ID | Grade | Gender | Height (cm)
      1 | 3 | M | 120
      2 | 2 | F | 115
      3 | 2 | M | 116
      … | … | … | …
  29. 29. More on Map-Reduce • Depends on distributed file systems • Typically, mappers run on the data storage nodes • Map/Reduce tasks automatically get restarted when they fail (good fault tolerance) • Map and Reduce I/O is all on disk – Data transmission from mappers to reducers is through disk copy • Iterative processes through Map-Reduce – Each iteration becomes a map-reduce job – Can be expensive, since map-reduce overhead is high
  30. 30. The Apache Hadoop System • Open-source software for reliable, scalable, distributed computing • The most popular distributed computing system in the world • Key modules: – Hadoop Distributed File System (HDFS) – Hadoop YARN (job scheduling and cluster resource management) – Hadoop MapReduce
  31. 31. Major Tools on Hadoop • Pig – A high-level language for Map-Reduce computation • Hive – A SQL-like query language for data querying via Map-Reduce • HBase – A distributed & scalable database on Hadoop – Allows random, real-time read/write access to big data – Voldemort is similar to HBase • Mahout – A scalable machine learning library • …
  32. 32. Hadoop Installation • Setting up Hadoop on your desktop/laptop: – http://hadoop.apache.org/docs/stable/single_node_setup.html • Setting up Hadoop on a cluster of machines: – http://hadoop.apache.org/docs/stable/cluster_setup.html
  33. 33. Hadoop Distributed File System (HDFS) • Master/slave architecture • NameNode: a single master node that controls which data block is stored where • DataNodes: slave nodes that store data and do R/W operations • Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs • Big data is split into equal-sized blocks; each block can be stored on different DataNodes • Disk failure tolerance: data is replicated multiple times
  34. 34. Load the Data into Pig • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float); – The path of the data on HDFS follows LOAD • USING PigStorage() means the data are delimited by tabs (can be omitted) • If the data are delimited by other characters, e.g. space, use USING PigStorage(' ') • The data schema is defined after AS • Variable types: int, long, float, double, chararray, … (a group-mean sketch follows below)
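To close the loop on the group-mean example from slide 21, a minimal, untested Pig sketch continuing from the relation A loaded above; the output path 'meanByGroup' is hypothetical:

      B = GROUP A BY groupNo;
      C = FOREACH B GENERATE group AS groupNo, AVG(A.score) AS meanScore;
      STORE C INTO 'meanByGroup';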
  35. 35. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  36. 36. Bag of Little Bootstraps (Kleiner et al., 2012)
  37. 37. Bootstrap (Efron, 1979) • A re-sampling based method to obtain the statistical distribution of sample estimators • Why are we interested? – Re-sampling is embarrassingly parallelizable • For example: standard deviation of the mean of N samples (μ) – For i = 1 to r do • Randomly sample with replacement N times from the original sample -> bootstrap data i • Compute the mean of the i-th bootstrap data -> μi – Estimate of Sd(μ) = Sd([μ1, …, μr]) – r is usually a large number, e.g. 200 (see the sketch below)
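A minimal R sketch of this procedure, assuming x holds the original sample:

      # Bootstrap estimate of Sd(mu), the standard deviation of the sample mean
      bootstrap_sd_mean <- function(x, r = 200) {
        N <- length(x)
        mu <- replicate(r, mean(sample(x, N, replace = TRUE)))
        sd(mu)
      }
      bootstrap_sd_mean(rnorm(1000))   # close to 1 / sqrt(1000) for N(0, 1) data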
  38. 38. Bootstrap for Big Data • Can have r nodes running in parallel, each sampling one bootstrap data • However… – N can be very large – Data may not fit into memory – Collecting N samples with replacement on each node can be computationally expensive
  39. 39. M out of N Bootstrap (Bickel et al., 1997) • Obtain SdM(μ) by drawing M samples with replacement for each bootstrap, where M < N • Apply an analytical correction to SdM(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of the sample estimates • However… – Prior knowledge is required – The choice of M is critical to performance – Finding the optimal value of M needs more computation
  40. 40. Bag of Little Bootstraps (BLB) • Example: standard deviation of the mean • Generate S sampled data sets, each obtained by randomly sampling a subset of size b without replacement (or partition the original data into S partitions, each of size b) • For each data set p = 1 to S do – For i = 1 to r do • Draw N samples with replacement from the size-b data • Compute the mean of the resampled data -> μpi – Compute Sdp(μ) = Sd([μp1, …, μpr]) • Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
  41. 41. Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from size-N data – ξ is some function of the distribution of θ, such as the standard deviation, … • Generate S sampled data sets, each obtained by randomly sampling a subset of size b without replacement (or partition the original data into S partitions, each of size b) • For each data set p = 1 to S do – For i = 1 to r do • Draw N samples with replacement from the size-b data • Compute the estimate θpi on the resampled data – Compute ξp(θ) = ξ([θp1, …, θpr]) • Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
  42. 42. Bag of Little Bootstraps (BLB) • The same algorithm maps directly onto Hadoop, roughly: mappers handle the per-subset sampling, reducers run the r resampling iterations for their subset and compute ξp(θ), and the gateway averages the S results into the final estimate of ξ(θ)
  43. 43. Why is BLB Efficient • Before: – N samples with replacement from size-N data is expensive when N is large • Now: – N samples with replacement from size-b data – b can be several orders of magnitude smaller than N (e.g. b = Nγ, γ in [0.5, 1)) – Equivalent to a multinomial sampler with dim = b – Storage = O(b), computational complexity = O(b) (see the sketch below)
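A minimal single-machine R sketch of BLB for Sd(mean), using the multinomial trick above; in the Hadoop version each p-iteration would run on a separate node:

      blb_sd_mean <- function(x, gamma = 0.7, S = 5, r = 50) {
        N <- length(x)
        b <- ceiling(N^gamma)
        xi <- numeric(S)
        for (p in 1:S) {
          subset <- sample(x, b, replace = FALSE)   # one size-b subsample
          mu <- numeric(r)
          for (i in 1:r) {
            # N draws from the b points, represented as multinomial counts: O(b) storage
            w <- drop(rmultinom(1, N, rep(1 / b, b)))
            mu[i] <- sum(w * subset) / N            # weighted mean of the resample
          }
          xi[p] <- sd(mu)
        }
        mean(xi)   # average of the S per-subset estimates
      }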
  44. 44. Simulation Experiment • 95% CI of Logistic Regression Coefficients • N = 20000, 10 explanatory variables • Relative Error = |Estimated CI width – True CI width | / True CI width • BLB-γ: BLB with b = Nγ • BOFN-γ: b out of N sampling with b = Nγ • BOOT: Naïve bootstrap
  45. 45. Simulation Experiment [Figure: relative error of the estimated CI width versus processing time for BLB-γ, BOFN-γ, and BOOT]
  46. 46. Real Data • 95% CI of logistic regression coefficients • N = 6M, 3000 explanatory variables • Data size = 150GB, r = 50, S = 5, γ = 0.7
  47. 47. Summary of BLB • A new algorithm for bootstrapping on big data • Advantages – Fast and efficient – Easy to parallelize – Easy to understand and implement – Friendly to Hadoop, makes it routine to perform statistical calculations on Big data
  48. 48. Large Scale Logistic Regression
  49. 49. Logistic Regression • Binary response: Y • Covariates: X • Yi ~ Bernoulli(pi) • log(pi / (1 − pi)) = Xiᵀβ; β ~ MVN(0, (1/λ) I) • Widely used (research and applications)
  50. 50. Large Scale Logistic Regression • Binary response: Y – E.g., Click / Non-click on an ad on a webpage • Covariates: X – User covariates: • Age, gender, industry, education, job, job title, … – Item covariates: • Categories, keywords, topics, … – Context covariates: • Time, page type, position, … – 2-way interaction: • User covariates X item covariates • Context covariates X item covariates • …
  51. 51. Computational Challenge • Hundreds of millions/billions of observations • Hundreds of thousands/millions of covariates • Fitting such a logistic regression model on a single machine is not feasible • Model fitting is iterative, using methods like gradient descent, Newton's method, etc. – Multiple passes over the data
  52. 52. Recap on Optimization Methods • Problem: find x to minimize F(x) • Iteration n: xn = xn−1 − bn−1 F′(xn−1) • bn−1 is the step size, which can change every iteration • Iterate until convergence • Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind (see the sketch below)
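A minimal R sketch of this iteration for a generic gradient; the fixed step size b is a simplifying assumption:

      grad_descent <- function(grad, x0, b = 0.1, tol = 1e-8, max_iter = 1000) {
        x <- x0
        for (n in 1:max_iter) {
          step <- b * grad(x)
          x <- x - step
          if (sum(step^2) < tol^2) break   # stop when the update is tiny
        }
        x
      }
      # Example: F(x) = (x - 3)^2, F'(x) = 2 (x - 3); converges to 3
      grad_descent(function(x) 2 * (x - 3), x0 = 0)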
  53. 53. Iterative Process with Hadoop [Diagram: each iteration reads from disk into mappers, writes to disk, reads into reducers, and writes back to disk, over and over]
  54. 54. Limitations of Hadoop for fitting a big logistic regression • The iterative process is expensive and slow • Every iteration = a Map-Reduce job • Mapper and reducer I/O both go through disk • Plus: time spent waiting in the job queue • Q: Can we find a fitting method that scales with Hadoop?
  55. 55. Large Scale Logistic Regression • Naïve: – Partition the data and run logistic regression for each partition – Take the mean of the learned coefficients – Problem: not guaranteed to converge to the model from a single machine! • Alternating Direction Method of Multipliers (ADMM) – Boyd et al. 2011 – Set up constraints: each partition's coefficients = global consensus – Solve the optimization problem using Lagrange multipliers – Advantage: guaranteed to converge to the single-machine logistic regression on the entire data within a reasonable number of iterations
  56. 56. Large Scale Logistic Regression via ADMM [Diagram, shown across three slides for iterations 1 and 2: BIG DATA is split into partitions 1…K; in each iteration a logistic regression is fit on every partition in parallel, followed by a consensus computation across partitions; the cycle repeats]
  59. 59. Details of ADMM
  60. 60. Dual Ascent Method • Consider a convex optimization problem: min f(x) subject to Ax = b • Lagrangian for the problem: L(x, y) = f(x) + yᵀ(Ax − b) • Dual ascent: x(k+1) = argminx L(x, y(k)); y(k+1) = y(k) + αk (A x(k+1) − b)
  61. 61. Augmented Lagrangians • Bring robustness to the dual ascent method • Yield convergence without assumptions like strict convexity or finiteness of f • Lρ(x, y) = f(x) + yᵀ(Ax − b) + (ρ/2) ||Ax − b||² • The value of ρ influences the convergence rate
  62. 62. Alternating Direction Method of Multipliers (ADMM) • Problem: min f(x) + g(z) subject to Ax + Bz = c • Augmented Lagrangian: Lρ(x, z, y) = f(x) + g(z) + yᵀ(Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||² • ADMM: x(k+1) = argminx Lρ(x, z(k), y(k)); z(k+1) = argminz Lρ(x(k+1), z, y(k)); y(k+1) = y(k) + ρ (A x(k+1) + B z(k+1) − c)
  63. 63. Large Scale Logistic Regression via ADMM • Notation – (Xi, yi): data in the ith partition – βi: coefficient vector for partition i – β: consensus coefficient vector – r(β): penalty component, such as ||β||₂² • Optimization problem: min Σi ℓ(Xi, yi; βi) + λ r(β) subject to βi = β for i = 1, …, K, where ℓ is the negative log-likelihood on partition i
  64. 64. ADMM updates (standard consensus form) • Local regressions: βi(k+1) = argminβi [ ℓ(Xi, yi; βi) + (ρ/2) ||βi − β(k) + ui(k)||² ], shrinkage towards the current best global estimate • Updated consensus: β(k+1) = argminβ [ λ r(β) + (Kρ/2) ||β − mean(βi(k+1)) − mean(ui(k))||² ] • Dual update: ui(k+1) = ui(k) + βi(k+1) − β(k+1)
  65. 65. An example implementation • ADMM for logistic regression model fitting with an L2/L1 penalty • Each iteration of ADMM is a Map-Reduce job – Mapper: partition the data into K partitions – Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression – Gateway: computes the consensus from the results of all reducers and sends it back to each reducer node (a single-machine simulation follows below)
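For intuition, a single-machine R simulation of the consensus updates above (not the Hadoop deployment); it substitutes optim() for liblinear/glmnet and assumes an L2 penalty r(β) = (λ/2)||β||², for which the consensus step has a closed form. The inner loop over k corresponds to the reducers and the consensus line to the gateway step:

      admm_logistic <- function(X, y, K = 4, rho = 1, lambda = 1, iters = 20) {
        n <- nrow(X); d <- ncol(X)
        parts <- split(seq_len(n), rep(1:K, length.out = n))   # K data partitions
        beta <- rep(0, d)                 # consensus coefficients
        B <- matrix(0, K, d)              # local coefficients
        U <- matrix(0, K, d)              # scaled dual variables
        nll <- function(b, Xi, yi, center, rho) {              # local objective
          eta <- drop(Xi %*% b)
          sum(log1p(exp(eta))) - sum(yi * eta) + (rho / 2) * sum((b - center)^2)
        }
        for (it in 1:iters) {
          for (k in 1:K) {                # "reducers": local regressions
            idx <- parts[[k]]
            B[k, ] <- optim(B[k, ], nll, Xi = X[idx, , drop = FALSE], yi = y[idx],
                            center = beta - U[k, ], rho = rho, method = "BFGS")$par
          }
          # "gateway": consensus update, closed form for the ridge penalty
          beta <- K * rho * colMeans(B + U) / (lambda + K * rho)
          U <- U + B - matrix(beta, K, d, byrow = TRUE)
        }
        beta
      }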
  66. 66. KDD CUP 2010 Data • Bridge to Algebra 2008-2009 data at https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp • Binary response, 20M covariates • Only keeping covariates with >= 10 occurrences => 2.2M covariates • Training data: 8,407,752 samples • Test data: 510,302 samples
  67. 67. Avg Training Loglikelihood vs Number of Iterations
  68. 68. Test AUC vs Number of Iterations
  69. 69. Better Convergence Can Be Achieved By • Better initialization – Use results from the naïve method to initialize the parameters • Adaptively changing the step size (ρ) each iteration based on the convergence status of the consensus
  70. 70. Recommender Problems for Web Applications
  71. 71. Agenda • Topic of interest – Recommender problems for dynamic, time-sensitive applications • Content optimization, online advertising, movie recommendation, shopping, … • Introduction • Offline components – Regression, collaborative filtering (CF), … • Online components + initialization – Time-series, online/incremental methods, explore/exploit (bandit) • Evaluation methods + multi-objective • Challenges
  72. 72. Three components we will focus on • Defining the problem – Formulate objectives whose optimization achieves some long-term goals for the recommender system • E.g. how to serve content to optimize audience reach and engagement, or optimize some combination of engagement and revenue? • Modeling (to estimate some critical inputs) – Predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions • E.g. click rates, average time spent on page, etc. • Could be explicit feedback like ratings • Experimentation – Create experiments to collect data proactively to improve models; helps in converging to the best choice(s) cheaply and rapidly • Explore and exploit (continuous experimentation) • DOE (testing hypotheses by avoiding bias inherent in data)
  73. 73. Modern Recommendation Systems • Goal – Serve the right item to a user in a given context to optimize long-term business objectives • A scientific discipline that involves – Large-scale machine learning & statistics • Offline models (capture global & stable characteristics) • Online models (incorporate dynamic components) • Explore/exploit (active and adaptive experimentation) – Multi-objective optimization • Click rates (CTR), engagement, advertising revenue, diversity, etc. – Inferring user interest • Constructing user profiles – Natural language processing to understand content • Topics, "aboutness", entities, follow-up of something, breaking news, …
  74. 74. Some examples from content optimization • Simple version – I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my webpage. How do I perform a simultaneous optimization?
  75. 75. [Screenshot: a portal front page with modules that recommend applications, search queries, news articles, and content packages (image, title, summary, links to other pages); the package module picks 4 out of a pool of K = 20–40 items, is dynamic, and routes traffic to other pages]
  76. 76. Problems in this example • Optimize CTR on multiple modules – Today Module, Trending Now, Personal Assistant, News – Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations. • For any single module – Optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.
  77. 77. Online Advertising [Diagram: advertisers place ads and bids with an ad network; a user visits a publisher page; the network recommends the best ad(s) by selecting argmax f(bid, response rates), where the response rates (click, conversion, ad-view) come from an ML/statistical model and the winner is chosen by auction; clicks and conversions feed back into the model] • Examples: Yahoo!, Google, MSN, …; ad exchanges (RightMedia, DoubleClick, …)
  78. 78. LinkedIn Today: Content Module Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
  79. 79. LinkedIn Ads: Match ads to users visiting LinkedIn
  80. 80. Right Media Ad Exchange: Unified Marketplace • Match ads to page views on publisher sites [Diagram: a publisher has an ad impression to sell and auctions it; bidders include AdSense and Ad.com; bids of $0.50, $0.60, and $0.75 arrive, the $0.75 bid placed via a network becomes a $0.45 bid after revenue share, and a $0.65 bid wins]
  81. 81. Recommender problems in general [Diagram: a user arrives in some context (query, page, …); an automated algorithm selects item(s) from the item inventory (articles, web pages, ads, …) to show; feedback is collected (click, time spent, …); the models are refined; repeat a large number of times, optimizing the metric(s) of interest (total clicks, total revenue, …)] • Example applications: search (web, vertical), online advertising, content, …
  82. 82. Important Factors • Items: articles, ads, modules, movies, users, updates, etc. • Context: query keywords, pages, mobile, social media, etc. • Metric to optimize (e.g., relevance score, CTR, revenue, engagement) – Currently, most applications are single-objective – Could be multi-objective optimization (maximize X subject to Y, Z, …) • Properties of the item pool – Size (e.g., all web pages vs. 40 stories) – Quality of the pool (e.g., anything vs. editorially selected) – Lifetime (e.g., mostly old items vs. mostly new items)
  83. 83. Factors affecting Solution (continued) • Properties of the context – Pull: specified by an explicit, user-driven query (e.g., keywords, a form) – Push: specified by implicit context (e.g., a page, a user, a session) • Most applications are somewhere on the continuum of pull and push • Properties of the feedback on the matches made – Types and semantics of feedback (e.g., click, vote) – Latency (e.g., available in 5 minutes vs. 1 day) – Volume (e.g., 100K per day vs. 300M per day) • Constraints specifying legitimate matches – E.g., business rules, diversity rules, editorial voice – Multiple objectives • Available metadata (e.g., link graph, various user/item attributes)
  84. 84. Predicting User-Item Interactions (e.g. CTR) • Myth: We have so much data on the web, if we can only process it the problem is solved – Number of things to learn increases with sample size • Rate of increase is not slow – Dynamic nature of systems make things worse – We want to learn things quickly and react fast • Data is sparse in web recommender problems – We lack enough data to learn all we want to learn and as quickly as we would like to learn – Several Power laws interacting with each other • E.g. User visits power law, items served power law – Bivariate Zipf: Owen & Dyer, 2011
  85. 85. Can Machine Learning help? • Fortunately, there are group behaviors that generalize to individuals & they are relatively stable – E.g. users in San Francisco tend to read more baseball news • Key issue: estimating such groups – Coarse group: more stable but does not generalize that well – Granular group: less stable, with few individuals – Getting a good grouping structure means hitting the "sweet spot" • Another big advantage on the web – Intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choice(s) • We don't need to learn all user-item interactions, only those that are good
  86. 86. Predicting user-item interaction rates [Diagram: offline models (logistic, boosting, …) capture stable characteristics at coarse resolutions, built on constructed features (content: IR, clustering, taxonomy, entities; user profiles: clicks, views, social, community); they initialize near-online models (finer-resolution corrections at the item/user level, with quick updates), which feed explore/exploit (adaptive sampling that helps rapid convergence to the best choices)]
  87. 87. Post-click: An example in Content Optimization [Diagram: a recommender serves editorial content; clicks on front-page links influence the downstream supply distribution; an ad server places display advertising, yielding revenue and downstream engagement (time spent)]
  88. 88. Serving Content on Front Page: Click Shaping • What do we want to optimize? • Current: maximize clicks (maximize downstream supply from the front page) • But consider the following – Article 1: CTR = 5%, utility per click = 5 – Article 2: CTR = 4.9%, utility per click = 10 • By promoting article 2, we lose 1 click per 100 visits but gain 5 utils • If we do this for a large number of visits, we lose some clicks but obtain significant gains in utility – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc.)
  89. 89. High level picture [Diagram: an http request hits the server; the item recommendation system performs thousands of computations in sub-seconds; the user interacts (e.g. clicks, or does nothing); statistical models are updated in batch mode, e.g. once every 30 mins]
  90. 90. High level overview: Item Recommendation System [Diagram: user info (id, activity, profile; updated in batch) and an item index (id, meta-data; pre-filtered for spam, editorial rules, …; features extracted via NLP, clustering, …) feed ML/statistical models that score items: P(click), P(share), semantic-relevance score, …; items are then ranked by sorting on a score (CTR, bid*CTR, …), combining scores via multi-objective optimization, or thresholding on some scores; user-item interaction data is batch-processed to update the models]
  91. 91. ML/Statistical models for scoring [Chart: deployments arranged by number of items scored by ML (100 to 100M) against traffic volume and item lifetime (few hours to several days): LinkedIn Today, Yahoo! Front Page, LinkedIn Ads, Right Media Ad Exchange]
  92. 92. Summary of deployments • Yahoo! Front page Today Module (2008-2011): 300% improvement in click-through rates – Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! Properties (2011): Significant improvement in engagement across Yahoo! Network • Fully deployed on LinkedIn Today Module (2012): Significant improvement in click-through rates (numbers not revealed due to reasons of confidentiality) • Yahoo! RightMedia exchange (2012): Fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed due to reasons of confidentiality) • LinkedIn self-serve ads (2012-2013):Fully deployed • LinkedIn News Feed (2013-2014): Fully deployed • Several others in progress….
  93. 93. Broad Themes • Curse of dimensionality – Large number of observations (rows), large number of potential features (columns) – Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom) • I will give examples as we move along • We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation – This can fundamentally change solutions • Think of computation and models together for big data • Optimization: what we are trying to optimize is often complex; models must work in harmony with the optimization – Pareto optimality with competing objectives
  94. 94. Statistical Problem • Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest • Examples of utility functions – Click rates (CTR) – Share rates (CTR * P(Share | Click)) – Revenue per page view = CTR * bid (more complex due to the second-price auction) • CTR is a fundamental measure that opens the door to a more principled approach to ranking items • Converge rapidly to maximum-utility items – Sequential decision making process (explore/exploit)
  95. 95. LinkedIn Today, Yahoo! Today Module: Choose Items to Maximize CTR • User i, with user features (e.g., industry, behavioral features, demographic features, …), visits • The algorithm selects item j from a set of candidates • Response yij is observed (click or not) • Which item should we select? – The item with the highest predicted CTR (exploit), or – An item for which we need data to predict its CTR (explore) • This is an "explore/exploit" problem
  96. 96. The Explore/Exploit Problem (to maximize CTR) • Problem definition: pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items • Easy!? Pick the items having the highest click-through rates (CTRs) • But … – The system is highly dynamic: • Items come and go with short lifetimes • The CTR of each item may change over time – How much traffic should be allocated to exploring new items to achieve optimal performance? • Too little -> unreliable CTR estimates due to "starvation" • Too much -> little traffic left to exploit the high-CTR items
  97. 97. Y! Front Page Application • Simplify: maximize CTR on the first slot (F1) • Item pool – Editorially selected for high quality and brand image – Few articles in the pool, but the pool is dynamic
  98. 98. CTR Curves of Items on LinkedIn Today [Figure: CTR over time for several items]
  99. 99. Impact of repeat item views on a given user • Same user is shown an item multiple times (despite not clicking)
  100. 100. Simple algorithm to estimate the most popular item with a small but dynamic item pool • Simple explore/exploit scheme – ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool – (100 − ε)% exploit: with large probability (e.g. 95%), choose the highest-scoring CTR item (see the sketch below) • Temporal smoothing – Item CTRs change over time, so give more weight to recent data in estimating item CTRs • Kalman filter, moving average • Discount item score with repeat views – CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data) • Segmented most popular – Perform separate most-popular recommendation for each user segment
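A minimal R sketch of the ε-greedy selection step, assuming ctr_hat holds the current (smoothed, discounted) CTR estimates for the item pool:

      pick_item <- function(ctr_hat, eps = 0.05) {
        if (runif(1) < eps) {
          sample(length(ctr_hat), 1)   # explore: uniform over the pool
        } else {
          which.max(ctr_hat)           # exploit: current best CTR estimate
        }
      }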
  101. 101. Time series Model: Kalman filter • Dynamic Gamma-Poisson: the click rate evolves over time in a multiplicative fashion, pt+1 = pt εt+1, with E[εt+1] = 1, Var[εt+1] = η • Estimated click-rate distribution at time t+1 – Prior mean: unchanged from the posterior mean at time t – Prior variance: inflated by a term proportional to η times the squared mean, so high-CTR items are more adaptive
  102. 102. More economical exploration? Better bandit solutions • Consider a two-armed problem with unknown payoff probabilities p1 > p2 • The gambler has 1000 plays; what is the best way to experiment (to maximize the total expected reward)? • This is called the "multi-armed bandit" problem and has been studied for a long time • Optimal solution: play the arm that has the maximum potential of being good – Optimism in the face of uncertainty
  103. 103. Item Recommendation: Bandits? • Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000 – Greedy: show Item 2 to all; not a good idea – Item 1's CTR estimate is noisy; the item could potentially be better • Invest in Item 1 for better overall performance on average – Exploit what is known to be good, explore what is potentially good [Figure: posterior CTR densities; Item 2's is sharply peaked, Item 1's is wide]
  104. 104. Next few hours
      Aspect | Most Popular Recommendation | Personalized Recommendation
      Offline Models | | Collaborative filtering (cold-start problem)
      Online Models | Time-series models | Incremental CF, online regression
      Intelligent Initialization | Prior estimation | Prior estimation, dimension reduction
      Explore/Exploit | Multi-armed bandits | Bandits with covariates
  105. 105. Offline Components: Collaborative Filtering in Cold-start Situations
  106. 106. Problem • User i, with user features xi (demographics, browse history, search history, …), visits • The algorithm selects item j, with item features xj (keywords, content categories, …) • Response yij is observed (explicit rating, or implicit click/no-click) • Goal: predict the unobserved entries based on the features and the observed entries
  107. 107. Model Choices • Feature-based (or content-based) approach – Use features to predict the response • (regression, Bayes nets, mixture models, …) – Limitation: needs predictive features • Bias is often high; does not capture signals at granular levels • Collaborative filtering (CF, aka memory-based) – Make recommendations based on past user-item interaction • User-user, item-item, matrix factorization, … • See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD'08 Tutorial], etc. – Better performance for old users and old items – Does not naturally handle new users and new items (cold-start)
  108. 108. Collaborative Filtering (Memory-based methods) • User-user similarity; item-item similarities; incorporating both • Estimating similarities: Pearson's correlation, or optimization based (Koren et al.) [Equation figures: similarity-weighted rating predictions]
  109. 109. How to Deal with the Cold-Start Problem • Heuristic-based approaches – Linear combination of regression and CF models – Filterbot • Add user features as pseudo users and do collaborative filtering – Hybrid approaches • Use content-based methods to fill up entries, then use CF • Matrix factorization – Good performance on Netflix (Koren, 2009) • Model-based approaches – Bilinear random-effects model (probabilistic matrix factorization) • Good on Netflix data [Ruslan et al ICML, 2009] – Add feature-based regression to matrix factorization • (Agarwal and Chen, 2009) – Add topic discovery (from textual items) to matrix factorization • (Agarwal and Chen, 2009; Chun and Blei, 2011)
  110. 110. Per-item regression models • When tracking users by cookies, the distribution of visit patterns can get extremely skewed – The majority of cookies have 1-2 visits • Per-item models (regressions) based on user covariates are attractive in such cases
  111. 111. Several per-item regressions: Multi-task learning [Equation figure: per-item regression coefficients are tied together through a low-dimensional (5-10) matrix B, estimated from retrospective data; captures affinity to old items] • Agarwal, Chen and Elango, KDD, 2010
  112. 112. Per-user, per-item models via bilinear random-effects model
  113. 113. Motivation • Data measuring k-way interactions pervasive – Consider k = 2 for all our discussions • E.g. User-Movie, User-content, User-Publisher-Ads,…. – Power law on both user and item degrees • Classical Techniques – Approximate matrix through a singular value decomposition (SVD) • After adjusting for marginal effects (user pop, movie pop,..) – Does not work • Matrix highly incomplete, severe over-fitting – Key issue • Regularization of eigenvectors (factors) to avoid overfitting
  114. 114. Early work on complete matrices • Tukey’s 1-df model (1956) – Rank 1 approximation of small nearly complete matrix • Criss-cross regression (Gabriel, 1978) • Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s) • Modern day recommender problems – Highly incomplete, large, noisy.
  115. 115. Latent Factor Models [Figure: users and items embedded as latent factor vectors along dimensions such as "newsy" and "sporty"; affinity = u′v for Euclidean factors u, v; affinity = s′z for simplex factors s, z]
  116. 116. Factorization – Brief Overview • Latent user factors: (αi, ui = (ui1, …, uin)) • Latent movie factors: (βj, vj = (vj1, …, vjn)) • Interaction: E(yij) = αi + βj + ui′Bvj • (Nn + Mm) parameters • Key technical issue: will overfit for moderate values of n, m – Solution: regularization
  117. 117. Latent Factor Models: Different Aspects • Matrix Factorization – Factors in Euclidean space – Factors on the simplex • Incorporating features and ratings simultaneously • Online updates
  118. 118. Maximum Margin Matrix Factorization (MMMF) • Complete the matrix by minimizing loss (hinge, squared error) on observed entries, subject to constraints on the trace norm – Srebro, Rennie, Jaakkola (NIPS 2004) – Convex semi-definite programming (expensive, not scalable) • Fast MMMF (Rennie & Srebro, ICML, 2005) – Constrain the Frobenius norm of the left and right eigenvector matrices; not convex, but becomes scalable • Other variation: Ensemble MMMF (DeCoste, ICML 2005) – Ensembles of partially trained MMMF (some improvements)
  119. 119. Matrix Factorization for Netflix prize data • Minimize the objective function: Σ(ij in obs) (rij − uiᵀvj)² + λ (Σi ||ui||² + Σj ||vj||²) • Simon Funk: stochastic gradient descent (see the sketch below) • Koren et al (KDD 2007): alternating least squares – They moved to SGD later in the competition
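A minimal R sketch of the SGD updates for this objective, assuming ratings is a data frame with 1-indexed columns i, j, r, and that there are N users and M items:

      sgd_mf <- function(ratings, N, M, k = 10, lambda = 0.05, lr = 0.01, epochs = 20) {
        U <- matrix(rnorm(N * k, sd = 0.1), N, k)   # user factors
        V <- matrix(rnorm(M * k, sd = 0.1), M, k)   # item factors
        for (e in 1:epochs) {
          for (row in sample(nrow(ratings))) {      # shuffle each epoch
            i <- ratings$i[row]; j <- ratings$j[row]
            err <- ratings$r[row] - sum(U[i, ] * V[j, ])
            ui <- U[i, ]                            # save before updating
            U[i, ] <- U[i, ] + lr * (err * V[j, ] - lambda * U[i, ])
            V[j, ] <- V[j, ] + lr * (err * ui - lambda * V[j, ])
          }
        }
        list(U = U, V = V)
      }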
  120. 120. Probabilistic Matrix Factorization (Salakhutdinov & Mnih, NIPS 2008) • Model: rij ~ N(uiᵀvj, σ²); ui ~ MVN(0, au I); vj ~ MVN(0, av I) • Optimization is through iterated conditional modes • Other variations: constraining the mean through a sigmoid, using "who-rated-whom" • Combining with Boltzmann machines also improved performance
  121. 121. Bayesian Probabilistic Matrix Factorization (Salakhutdinov and Mnih, ICML 2008) • Fully Bayesian treatment using an MCMC approach – Significant improvement • Interpretation as a fully Bayesian hierarchical model shows why that is the case – Failing to incorporate uncertainty leads to bias in the estimates – Multi-modal posterior; MCMC helps in converging to a better one • Variance components (e.g. au): MCEM is also more resistant to over-fitting
  122. 122. Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010) • Specify the rank probabilistically (automatic rank selection): yij ~ N(Σ(k=1..r) zk uik vjk, σ²); zk ~ Ber(πk); πk ~ Beta(a/r, b(r−1)/r) • E(#Factors) = ra / (a + b(r−1))
  123. 123. How to incorporate features: Deal with both warm start and cold-start • Models to predict ratings for new pairs – Warm-start: (user, movie) present in the training data with large sample size – Cold-start: At least one of (user, movie) new or has small sample size • Rough definition, warm-start/cold-start is a continuum. • Challenges – Highly incomplete (user, movie) matrix – Heavy tailed degree distributions for users/movies • Large fraction of ratings from small fraction of users/movies – Handling both warm-start and cold-start effectively in the presence of predictive features
  124. 124. Possible approaches • Large scale regression based on covariates – Does not provide good estimates for heavy users/movies – Large number of predictors to estimate interactions • Collaborative filtering – Neighborhood based – Factorization • Good for warm-start; cold-start dealt with separately • Single model that handles cold-start and warm-start – Heavy users/movies → User/movie specific model – Light users/movies → fallback on regression model – Smooth fallback mechanism for good performance
  125. 125. Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model
  126. 126. Regression-based Factorization Model (RLFM) • Main idea: flexible prior; predict factors through regressions • Seamlessly handles cold-start and warm-start • Modified state equation to incorporate covariates
  127. 127. RLFM: Model • Rating yij that user i gives item j: – Gaussian model: yij ~ N(μij, σ²) – Logistic model (for binary rating): yij ~ Bernoulli(μij) – Poisson model (for counts): yij ~ Poisson(μij Nij) – with t(μij) = xijᵀb + αi + βj + uiᵀvj • Bias of user i: αi = g0ᵀxi + εiα, εiα ~ N(0, σα²) • Popularity of item j: βj = d0ᵀxj + εjβ, εjβ ~ N(0, σβ²) • Factors of user i: ui = G xi + εiu, εiu ~ N(0, σu² I) • Factors of item j: vj = D xj + εjv, εjv ~ N(0, σv² I) • Could use other classes of regression models (a simulation sketch follows below)
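A small R simulation from the Gaussian version of this model with identity link t(μ) = μ, dropping the dyadic term xijᵀb for brevity; all dimensions and noise scales are arbitrary illustration choices:

      set.seed(1)
      N <- 200; M <- 100; k <- 2; p <- 3              # users, items, factors, feature dims
      xu <- matrix(rnorm(N * p), N, p)                # user features
      xv <- matrix(rnorm(M * p), M, p)                # item features
      g0 <- rnorm(p); d0 <- rnorm(p)
      G <- matrix(rnorm(k * p), k, p); D <- matrix(rnorm(k * p), k, p)
      alpha <- drop(xu %*% g0) + rnorm(N, sd = 0.1)   # user bias
      beta  <- drop(xv %*% d0) + rnorm(M, sd = 0.1)   # item popularity
      U <- xu %*% t(G) + matrix(rnorm(N * k, sd = 0.1), N, k)   # user factors
      V <- xv %*% t(D) + matrix(rnorm(M * k, sd = 0.1), M, k)   # item factors
      mu <- outer(alpha, beta, "+") + U %*% t(V)      # mu_ij = alpha_i + beta_j + u_i' v_j
      y  <- mu + matrix(rnorm(N * M), N, M)           # Gaussian ratings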
  128. 128. Graphical representation of the model
  129. 129. Advantages of RLFM • Better regularization of factors – Covariates “shrink” towards a better centroid • Cold-start: Fallback regression model (FeatureOnly)
  130. 130. RLFM: Illustration of Shrinkage Plot the first factor value for each user (fitted using Yahoo! FP data)
  131. 131. Model fitting: EM for our class of models
  132. 132. The parameters for RLFM • Latent parameters: Δ = ({αi}, {βj}, {ui}, {vj}) • Hyper-parameters: Θ = (b, G, D, Au = au I, Av = av I)
  133. 133. Computing the mode [Equation figure: the negative log-posterior of the latent parameters Δ given the current hyper-parameters is minimized]
  134. 134. The EM algorithm
  135. 135. Computing the E-step • Often hard to compute in closed form • Stochastic EM (Markov Chain EM; MCEM) – Compute the expectation by drawing samples from the posterior of the latent parameters – Effective for multi-modal posteriors, but more expensive • Iterated Conditional Modes algorithm (ICM) – Faster, but biased hyper-parameter estimates
  136. 136. Monte Carlo E-step • Through a vanilla Gibbs sampler (conditionals closed form) • Other conditionals also Gaussian and closed form • Conditionals of users (movies) sampled simultaneously • Small number of samples in early iterations, large numbers in later iterations
  137. 137. M-step (Why MCEM is better than ICM) • Update G by optimizing a regression objective • Update Au = au I – The posterior uncertainty in the factors is ignored by ICM, which underestimates factor variability – Factors get over-shrunk; the posterior is not explored well
  138. 138. Experiment 1: Better regularization • MovieLens-100K, avg RMSE using pre-specified splits • ZeroMean, RLFM and FeatureOnly (no cold-start issues) • Covariates: – Users : age, gender, zipcode (1st digit only) – Movies: genres
  139. 139. Experiment 2: Better handling of Cold-start • MovieLens-1M; EachMovie • Training-test split based on timestamp • Same covariates as in Experiment 1.
  140. 140. Experiment 4: Predicting click-rate on articles • Goal: predict the click rate on articles for a user in the F1 position • Article lifetimes are short; dynamic updates are important • User covariates: – Age, gender, geo, browse behavior • Article covariates: – Content category, keywords • 2M ratings, 30K users, 4.5K articles
  141. 141. Results on Y! FP data
  142. 142. Some other related approaches • Stern, Herbrich and Graepel, WWW, 2009 – Similar to RLFM; different parametrization, and expectation propagation used to fit the models • Porteous, Asuncion and Welling, AAAI, 2011 – Non-parametric approach using a Dirichlet process • Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 – Regression + per-user random effects regularized through a graphical lasso
  143. 143. Add Topic Discovery into Matrix Factorization fLDA: Matrix Factorization through Latent Dirichlet Allocation
  144. 144. fLDA: Introduction • Model the rating yij that user i gives to item j as the user's affinity to the topics that the item has: yij ≈ Σk sik z̄jk, where sik is user i's affinity to topic k and z̄jk = Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j – Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data – fLDA can be thought of as a "multi-task learning" version of the supervised LDA model [Blei'07] for cold-start recommendation • Old items: the z̄jk's are item latent factors learnt from data with the LDA prior • New items: the z̄jk's are predicted based on the bag of words in the items
  145. 145. LDA Topic Modeling (1) • LDA is effective for unsupervised topic discovery [Blei'03] – It models the generating process of a corpus of items (articles) – For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η) – For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ) – For each word, say the nth word, in item j: • Draw a topic zjn for that word from θj • Draw a word wjn from Φk with topic k = zjn • Only the words wj1, …, wjn, … of each item are observed; the per-word topics zj1, …, zjn, … and the topic distributions are latent
  146. 146. LDA Topic Modeling (2) • Model training: – Estimate the prior parameters and the posterior topic-word distribution Φ based on a training corpus of items – EM + Gibbs sampling is a popular method • Inference for new items – Compute the item topic distribution based on the prior parameters and Φ estimated in the training phase • Supervised LDA [Blei'07] – Predict a target value for each item based on supervised LDA topics: yj = Σk sk z̄jk, where sk is the regression weight for topic k and z̄jk = Pr(item j has topic k), estimated by averaging the topic of each word in item j – vs. fLDA: yij ≈ Σk sik z̄jk, one regression per user, with the same set of topics across the different regressions
  147. 147. fLDA: Model • Rating yij that user i gives item j: – Gaussian model: yij ~ N(μij, σ²) – Logistic model (for binary rating): yij ~ Bernoulli(μij) – Poisson model (for counts): yij ~ Poisson(μij Nij) – with t(μij) = xijᵀb + αi + βj + Σk sik z̄jk • Bias of user i: αi = g0ᵀxi + εiα, εiα ~ N(0, σα²) • Popularity of item j: βj = d0ᵀxj + εjβ, εjβ ~ N(0, σβ²) • Topic affinity of user i: si = H xi + εis, εis ~ N(0, σs² I) • Pr(item j has topic k): z̄jk = Σn 1(zjn = k) / (#words in item j), where zjn is the LDA topic of the nth word in item j • Observed words: wjn ~ LDA(λ, η, zjn), the nth word in item j
  148. 148. Model Fitting • Given: – Features X = {xi, xj, xij} – Observed ratings y = {yij} and words w = {wjn} • Estimate: – Parameters: Θ = [b, g0, d0, H, σ², aα, aβ, As, λ, η] • Regression weights and prior parameters – Latent factors: Δ = {αi, βj, si} and z = {zjn} • User factors, item factors and per-word topic assignments • Empirical Bayes approach: – Maximum likelihood estimate of the parameters: Θ̂ = argmaxΘ Pr[y, w | Θ] = argmaxΘ ∫ Pr[y, w, Δ, z | Θ] dΔ dz – The posterior distribution of the factors: Pr[Δ, z | y, w, Θ̂]
  149. 149. The EM Algorithm • Iterate through the E and M steps until convergence – Let Θ̂(n) be the current estimate – E-step: compute f(Θ) = E(Δ,z | y, w, Θ̂(n)) [ log Pr(y, w, Δ, z | Θ) ] • The expectation is not in closed form • We draw Gibbs samples and compute the Monte Carlo mean – M-step: find Θ̂(n+1) = argmaxΘ f(Θ) • It consists of solving a number of regression and optimization problems
  150. 150. Supervised Topic Assignment • The topic of the nth word in item j is sampled from Pr(zjn = k | Rest) ∝ (Zjk^(−jn) + λ) · (Z(k, wjn)^(−jn) + η) / (Zk^(−jn) + Wη) · Π(i rated j) f(yij | zjn = k), where the Z's are topic/word counts excluding word jn – The first two factors are the same as in unsupervised LDA – The last factor is the likelihood of the observed ratings by users who rated item j when zjn is set to topic k; f(yij | zjn = k) is the probability of observing yij given the model
  151. 151. fLDA: Experimental Results (Movie) • Task: predict the rating that a user would give a movie • Training/test split: – Sort observations by time – First 75% -> training data; last 25% -> test data • Item warm-start scenario – Only 2% new items in test data
      Model | Test RMSE
      RLFM | 0.9363
      fLDA | 0.9381
      Factor-Only | 0.9422
      FilterBot | 0.9517
      unsup-LDA | 0.9520
      MostPopular | 0.9726
      Feature-Only | 1.0906
      Constant | 1.1190
      • fLDA is as strong as the best method; it does not reduce performance in warm-start scenarios
  152. 152. fLDA: Experimental Results (Yahoo! Buzz) • Task: Predict whether a user would buzz-up an article • Severe item cold-start – All items are new in test data Data Statistics 1.2M observations 4K users 10K articles fLDA significantly outperforms other models
  153. 153. Experimental Results: Buzzing Topics
      Topic | Top Terms (after stemming)
      CIA interrogation | bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
      Swine flu | mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
      NFL games | NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
      Gay marriage | court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
      Sarah Palin | palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
      American idol | idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
      Recession | economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
      North Korea issues | north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia
      • 3/4 of the topics are interpretable; 1/2 are similar to unsupervised topics
  154. 154. fLDA Summary • fLDA is a useful model for cold-start item recommendation • It also provides interpretable recommendations for users – User’s preference to interpretable LDA topics • Future directions: – Investigate Gibbs sampling chains and the convergence properties of the EM algorithm – Apply fLDA to other multi-task prediction problems • fLDA can be used as a tool to generate supervised features (topics) from text data
  155. 155. Summary • Regularizing factors through covariates effective • Regression based factor model that regularizes better and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive • Fitting method scalable; Gibbs sampling for users and movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine • Distributed computing on Hadoop: Multiple models and average across partitions (more later)
  156. 156. Online Components: Online Models, Intelligent Initialization, Explore / Exploit
  157. 157. Why Online Components? • Cold start – New items or new users come to the system – How to obtain data for new items/users (explore/exploit) – Once data becomes available, how to quickly update the model • Periodic rebuild (e.g., daily): Expensive • Continuous online update (e.g., every minute): Cheap • Concept drift – Item popularity, user interest, mood, and user-to-item affinity may change over time – How to track the most recent behavior • Down-weight old data – How to model temporal patterns for better prediction • … may not need to be online if the patterns are stationary
  158. 158. Big Picture
      Aspect | Most Popular Recommendation | Personalized Recommendation
      Offline Models | | Collaborative filtering (cold-start problem)
      Online Models (real systems are dynamic) | Time-series models | Incremental CF, online regression
      Intelligent Initialization (do not start cold) | Prior estimation | Prior estimation, dimension reduction
      Explore/Exploit (actively acquire data) | Multi-armed bandits | Bandits with covariates
      Extension: Segmented Most Popular Recommendation
  159. 159. Online Components for Most Popular Recommendation Online models, intelligent initialization & explore/exploit
  160. 160. Most popular recommendation: Outline • Most popular recommendation (no personalization, all users see the same thing) – Time-series models (online models) – Prior estimation (initialization) – Multi-armed bandits (explore/exploit) – Sometimes hard to beat!! • Segmented most popular recommendation – Create user segments/clusters based on user features – Do most popular recommendation for each segment
  161. 161. Most Popular Recommendation • Problem definition: pick k items (articles) from a pool of N to maximize the total number of clicks on the picked items • Easy!? Pick the items having the highest click-through rates (CTRs) • But … – The system is highly dynamic: • Items come and go with short lifetimes • The CTR of each item changes over time – How much traffic should be allocated to exploring new items to achieve optimal performance? • Too little -> unreliable CTR estimates • Too much -> little traffic left to exploit the high-CTR items
  162. 162. CTR Curves for Two Days on Yahoo! Front Page Traffic obtained from a controlled randomized experiment (no confounding) Things to note: (a) Short lifetimes, (b) temporal effects, (c) often breaking news stories Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
  163. 163. For Simplicity, Assume … • Pick only one item for each user visit – Multi-slot optimization later • No user segmentation, no personalization (discussion later) • The pool of candidate items is predetermined and is relatively small (~1000) – E.g., selected by human editors or by a first-phase filtering method – Ideally, there should be a feedback loop – Large item pool problem later • Effects like user fatigue, diversity in recommendations, and multi-objective optimization are not considered (discussion later)
  164. 164. Online Models • How to track the changing CTR of an item • Data: for each item, at time t, we observe – The number of times nt the item was displayed (i.e., #views) – The number of clicks ct on the item • Problem definition: given c1, n1, …, ct, nt, predict the CTR (click-through rate) pt+1 at time t+1 • Potential solutions: – Observed CTR at t: ct / nt -> highly unstable (nt is usually small) – Cumulative CTR: (Σ(all i) ci) / (Σ(all i) ni) -> reacts to changes very slowly – Moving-window CTR: (Σ(i in last K) ci) / (Σ(i in last K) ni) -> reasonable • But provides no estimate of Var[pt+1] (useful for explore/exploit)
  165. 165. Online Models: Dynamic Gamma-Poisson • Model-based approach – (ct | nt, pt) ~ Poisson(nt pt) – pt = pt−1 εt, where εt ~ Gamma(mean = 1, var = η) – Model parameters: • p1 ~ Gamma(mean = μ0, var = σ0²) is the offline CTR estimate • η specifies how dynamic/smooth the CTR is over time – Posterior distribution: (pt+1 | c1, n1, …, ct, nt) ~ Gamma(?, ?) • Solve this recursively (online update rule) [Diagram: at each time t the item is shown nt times and receives ct clicks; pt is the CTR at time t, starting from the prior Gamma(μ0, σ0²)]
  166. 166. Online Models: Derivation • Let γt|t = μt|t / σ²t|t (the effective sample size); the estimated CTR distribution at time t is (pt | c1, n1, …, ct, nt) ~ Gamma(mean = μt|t, var = μt|t / γt|t) • Estimated CTR distribution at time t+1, before seeing new data: (pt+1 | c1, n1, …, ct, nt) ~ Gamma(mean = μt+1|t, var = σ²t+1|t) with – μt+1|t = μt|t – σ²t+1|t = σ²t|t (1 + η) + η μ²t|t (high-CTR items are more adaptive) • After observing (ct+1, nt+1), with γt+1|t = μt+1|t / σ²t+1|t: – γt+1|t+1 = γt+1|t + nt+1 – μt+1|t+1 = (γt+1|t μt+1|t + ct+1) / (γt+1|t + nt+1) (an R sketch follows below)
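A minimal R sketch of this update, assuming the state is kept as the mean mu and the effective sample size gam (= mean / variance):

      gp_update <- function(mu, gam, c_new, n_new, eta = 0.01) {
        var_t  <- mu / gam
        var_pr <- var_t * (1 + eta) + eta * mu^2   # prior variance at t+1
        gam_pr <- mu / var_pr                      # prior effective sample size
        list(mu  = (gam_pr * mu + c_new) / (gam_pr + n_new),
             gam = gam_pr + n_new)
      }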
  167. 167. Tracking behavior of the Gamma-Poisson model • Low click-rate articles – More temporal smoothing
  168. 168. Intelligent Initialization: Prior Estimation • Prior CTR distribution: Gamma(mean = μ0, var = σ0²) – N historical items: • ni = #views of item i in its first time interval • ci = #clicks on item i in its first time interval – Model • ci ~ Poisson(ni pi) and pi ~ Gamma(μ0, σ0²) => ci ~ NegBinomial(μ0, σ0², ni) – Maximum likelihood estimate (MLE): (μ0, σ0²) = argmax Σi log NegBinomial(ci; μ0, σ0², ni) (a sketch follows below) • Better prior: cluster items and find the MLE for each cluster – Agarwal & Chen, 2011 (SIGMOD)
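A minimal R sketch of this MLE via the negative-binomial marginal, assuming c and n are vectors of first-interval clicks and views; the parametrization (Gamma shape = μ0²/σ0², rate = μ0/σ0²) is standard Gamma algebra, and the starting values are arbitrary:

      prior_mle <- function(c, n) {
        nll <- function(par) {                       # par = (mu0, sig0sq)
          a <- par[1]^2 / par[2]; b <- par[1] / par[2]   # Gamma shape and rate
          -sum(dnbinom(c, size = a, prob = b / (b + n), log = TRUE))
        }
        optim(c(0.01, 1e-4), nll, method = "L-BFGS-B", lower = c(1e-6, 1e-9))$par
      }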
169. 169. Explore/Exploit: Problem Definition • Items 1, 2, …, K receive x1%, x2%, …, xK% of page views • Determine (x1, x2, …, xK) based on the clicks and views observed before time t (now) in order to maximize the expected total number of clicks in the future
170. 170. Modeling the Uncertainty, NOT Just the Mean • Simplified setting: two items – We know the CTR of item A exactly, q (say, shown 1 million times) – We are uncertain about the CTR of item B, p (shown only 100 times); its probability density function is f(p) • If we only make a single decision, give 100% of page views to item A • If we make multiple decisions in the future, explore item B since its CTR can potentially be higher – Potential gain of exploring item B: ∫p>q (p − q) f(p) dp
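A quick numeric illustration of this integral (our example; the Beta posterior for item B is made up): even when E[p] equals q, the gain from exploring B is strictly positive, because only the upside enters the integral.

```python
# Potential gain of exploring item B: integral over p > q of (p - q) f(p) dp.
from scipy import stats
from scipy.integrate import quad

q = 0.05                               # known CTR of item A
f = stats.beta(a=5, b=95)              # item B: ~100 observations, mean 0.05

gain, _ = quad(lambda p: (p - q) * f.pdf(p), q, 1.0)
print(gain)  # positive even though E[p] == q, because of B's uncertainty
```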
  171. 171. Multi-Armed Bandits: Introduction (1) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) “Pulling” arm i yields a reward: reward = 1 with probability pi (success) reward = 0 otherwise (failure) For now, we are attacking the problem of choosing best article/arm for all users
  172. 172. Multi-Armed Bandits: Introduction (2) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) Goal: Pull arms sequentially to maximize the total reward Bandit scheme/policy: Sequential algorithm to play arms (items) Regret of a scheme = Expected loss relative to the “oracle” optimal scheme that always plays the best arm – “best” means highest success probability – But, the best arm is not known … unless you have an oracle – Regret is the price of exploration – Low regret implies quick convergence to the best
173. 173. Multi-Armed Bandits: Introduction (3) • Bayesian approach – Seeks the Bayes optimal solution to a Markov decision process (MDP) with assumptions about probability distributions – Representative work: Gittins' index, Whittle's index – Very computationally intensive • Minimax approach – Seeks a scheme that incurs bounded regret (with no or mild assumptions about probability distributions) – Representative work: UCB by Lai, Auer – Usually computationally easy – But they tend to explore too much in practice (probably because the bounds are based on worst-case analysis)
174. 174. Multi-Armed Bandits: Markov Decision Process (1) • Select an arm now at time t=0, to maximize the expected total number of clicks in t=0,…,T • State at time t: Θt = (θ1t, …, θKt) – θit = state of arm i at time t (captures all we know about arm i at t) • Reward function Ri(Θt, Θt+1) – Reward of pulling arm i that brings the state from Θt to Θt+1 • Transition probability Pr[Θt+1 | Θt, pulling arm i] • Policy π: a function that maps a state to an arm (action) – π(Θt) returns an arm (to pull) • Value of policy π starting from the current state Θ0 with horizon T: VT(π, Θ0) = E[ Rπ(Θ0)(Θ0, Θ1) ] + ∫ Pr[Θ1 | Θ0, π(Θ0)] VT−1(π, Θ1) dΘ1 – First term: immediate reward; second term: value of the remaining T−1 time slots, starting from state Θ1
175. 175. Multi-Armed Bandits: MDP (2) • Optimal policy: argmaxπ VT(π, Θ0), where VT(π, Θ0) = E[ Rπ(Θ0)(Θ0, Θ1) ] + ∫ Pr[Θ1 | Θ0, π(Θ0)] VT−1(π, Θ1) dΘ1 • Things to notice: – Value is defined recursively (actually T high-dimensional integrals) – Dynamic programming can be used to find the optimal policy – But just evaluating the value of a fixed policy can be very expensive • Bandit problem: the pull of one arm does not change the state of other arms, and the set of arms does not change over time
176. 176. Multi-Armed Bandits: MDP (3) • Which arm should be pulled next? – Not necessarily the one that looks best right now, since it might have had a few lucky successes – It looks like it will be a function of the successes and failures of all arms • Consider a slightly different problem setting – Infinite time horizon, but – Future rewards are geometrically discounted: Rtotal = R(0) + γ·R(1) + γ²·R(2) + … (0 < γ < 1) • Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently – Policy π(Θt) is a function of (θ1t, …, θKt) ⇒ π(Θt) = argmaxi { g(θit) } (Gittins' index) – One K-dimensional problem → K one-dimensional problems – Still computationally expensive!!
  177. 177. Multi-Armed Bandits: MDP (4) Bandit Policy 1. Compute the priority (Gittins’ index) of each arm based on its state 2. Pull arm with max priority, and observe reward 3. Update the state of the pulled arm Priority 1 Priority 2 Priority 3
178. 178. Multi-Armed Bandits: MDP (5) • Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently – Many proofs and different interpretations of Gittins' index exist • The index of an arm is the fixed charge per pull for a game with two options, whether to pull the arm or not, such that the charge makes the optimal play of the game have zero net reward – Significantly reduces the dimension of the problem space – But, Gittins' index g(θit) is still hard to compute • For the Gamma-Poisson or Beta-Binomial models, θit = (#successes, #pulls) for arm i up to time t • g maps each possible (#successes, #pulls) pair to a number – Approximate methods are used in practice – Lai et al. have derived these for exponential family distributions
179. 179. Multi-Armed Bandits: Minimax Approach (1) • Compute the priority of each arm i in a way that the regret is bounded – Lowest regret in the worst case • One common policy is UCB1 [Auer 2002]: Priorityi = ci/ni + √(2 log n / ni) – ci = number of successes of arm i (ci/ni = observed success rate) – ni = number of pulls of arm i – n = total number of pulls of all arms – The square-root term is a factor representing uncertainty
180. 180. Multi-Armed Bandits: Minimax Approach (2) • Priorityi = ci/ni + √(2 log n / ni) (observed payoff + uncertainty factor) • As the total number of observations n becomes large: – Observed payoff tends asymptotically towards the true payoff probability – The system never completely “converges” to one best arm; only the rate of exploration tends to zero
181. 181. Multi-Armed Bandits: Minimax Approach (3) • Priorityi = ci/ni + √(2 log n / ni) (observed payoff + uncertainty factor) • Sub-optimal arms are pulled O(log n) times • Hence, UCB1 has O(log n) regret • This is the lowest possible regret (but the constants matter) • E.g., the regret after n plays is bounded by [ 8 Σi: μi<μbest (ln n / Δi) ] + (1 + π²/3) Σj Δj, where Δi = μbest − μi
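A minimal UCB1 sketch matching the priority formula above; the Bernoulli payoff probabilities are invented. After pulling each arm once, the policy plays the arm with the highest observed rate plus uncertainty bonus.

```python
# UCB1 [Auer 2002] on toy Bernoulli arms with made-up payoff probabilities.
import math, random

p_true = [0.04, 0.05, 0.06]             # unknown payoff probabilities
clicks = [0] * 3                        # c_i: successes of arm i
pulls = [0] * 3                         # n_i: pulls of arm i

for t in range(10000):
    n = sum(pulls)
    if 0 in pulls:                      # play each arm once first
        i = pulls.index(0)
    else:
        i = max(range(3),
                key=lambda j: clicks[j]/pulls[j] + math.sqrt(2*math.log(n)/pulls[j]))
    clicks[i] += random.random() < p_true[i]
    pulls[i] += 1

print(pulls)   # the best arm gets the bulk of pulls; the others get O(log n)
```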
182. 182. Classical Multi-Armed Bandits: Summary • Classical multi-armed bandits – A fixed set of arms with fixed rewards – Observe the reward before the next pull • Bayesian approach (Markov decision process) – Gittins' index [Gittins 1979]: Bayes optimal for classical bandits • Pull the arm currently having the highest index value – Whittle's index [Whittle 1988]: extension to a changing reward function – Computationally intensive • Minimax approach (providing guaranteed regret bounds) – UCB1 [Auer 2002]: upper bound of a model-agnostic confidence interval • Index of arm i = ci/ni + √(2 log n / ni) • Heuristics (μ̂i = predicted CTR of item i, τ = temperature) – ε-Greedy: random exploration using an ε fraction of traffic – Softmax: pick arm i with probability exp(μ̂i/τ) / Σj exp(μ̂j/τ) – Posterior draw: index = a draw from the posterior CTR distribution of the arm
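For comparison with UCB1, here is the posterior-draw heuristic from the summary above, sketched for Beta-Bernoulli arms (Thompson sampling); again the payoff probabilities are invented.

```python
# Posterior draw (Thompson sampling) for Beta-Bernoulli arms.
import random

p_true = [0.04, 0.05, 0.06]
alpha = [1.0] * 3                       # Beta(1,1) uniform priors
beta = [1.0] * 3

for t in range(10000):
    # index of each arm = one draw from its posterior CTR distribution
    draws = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    i = draws.index(max(draws))
    if random.random() < p_true[i]:
        alpha[i] += 1                   # success
    else:
        beta[i] += 1                    # failure

print([round(a / (a + b), 4) for a, b in zip(alpha, beta)])
```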
  183. 183. Do Classical Bandits Apply to Web Recommenders? Traffic obtained from a controlled randomized experiment (no confounding) Things to note: (a) Short lifetimes, (b) temporal effects, (c) often breaking news stories Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
184. 184. Characteristics of Real Recommender Systems • Dynamic set of items (arms) – Items come and go with short lifetimes (e.g., a day) – Asymptotically optimal policies may fail to achieve good performance when item lifetimes are short • Non-stationary CTR – CTR of an item can change dramatically over time • Different user populations at different times • The same user behaves differently at different times (e.g., morning, lunch time, at work, in the evening, etc.) • Attention to breaking news stories decays over time • Batch serving for scalability – Making a decision and updating the model for each user visit in real time is expensive – Batch serving is more feasible: create time slots (e.g., 5 min); for each slot, decide the fraction xi of the visits in the slot to give to item i [Agarwal et al., ICDM, 2009]
185. 185. Explore/Exploit in Recommender Systems • Items 1, 2, …, K receive x1%, x2%, …, xK% of page views • Determine (x1, x2, …, xK) based on the clicks and views observed before time t (now) in order to maximize the expected total number of clicks in the future • Let's solve this from first principles
186. 186. Bayesian Solution: Two Items, Two Time Slots (1) • Two time slots: t = 0 and t = 1 – Item P: we are uncertain about its CTR, p0 at t = 0 and p1 at t = 1 – Item Q: we know its CTR exactly, q0 at t = 0 and q1 at t = 1 • Question: what fraction x of the N0 views at t = 0 to give to item P (and 1−x to item Q)? • To determine x, we need to estimate what would happen in the future – Serving x yields c clicks on item P (not yet observed; a random variable) – Once we observe c, we can update the CTR density of p1; let p̂1(x, c) = E[p1 | x, c] – If x and c are given, the optimal t = 1 decision: give all N1 views to item P iff p̂1(x, c) > q1
187. 187. Bayesian Solution: Two Items, Two Time Slots (2) • Expected total number of clicks in the two time slots: N0 x p̂0 + N0 (1−x) q0 + N1 Ec[ max{p̂1(x, c), q1} ] – First two terms: E[#clicks] at t = 0 from items P and Q – Last term: E[#clicks] at t = 1, showing the item with the higher E[CTR] • Rewriting relative to the scheme that always shows item Q (which gets N0 q0 + N1 q1 clicks): N0 q0 + N1 q1 + N0 x (p̂0 − q0) + N1 Ec[ max{p̂1(x, c) − q1, 0} ] • Gain(x, q0, q1) = N0 x (p̂0 − q0) + N1 Ec[ max{p̂1(x, c) − q1, 0} ] = expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots • Solution: argmaxx Gain(x, q0, q1)
188. 188. Bayesian Solution: Two Items, Two Time Slots (3) • Let the prior of p1 be Beta(a, b) and approximate the distribution of p̂1(x, c) by a normal distribution – Reasonable approximation because of the central limit theorem – p̂1 = Ec[p̂1(x, c)] = a/(a+b) – σ²(x) = Varc[p̂1(x, c)] = (x N0 / (a + b + x N0)) · (a b / ((a+b)² (a+b+1))) • Under this approximation, Gain(x, q0, q1) = N0 x (p̂0 − q0) + N1 [ σ(x) φ((q1 − p̂1)/σ(x)) + (p̂1 − q1) (1 − Φ((q1 − p̂1)/σ(x))) ], where φ and Φ are the standard normal density and CDF • Proposition: using the approximation, the Bayes optimal solution x can be found in time O(log N0)
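A hedged sketch of this solution: we code the normal-approximation Gain above with an assumed Beta(a, b) prior and locate argmaxx by ternary search, which needs only a logarithmic number of gain evaluations (consistent with the O(log N0) proposition when the resolution is one view out of N0). All constants are illustrative.

```python
# Two-item Bayes solution under the normal approximation (illustrative values).
import math
from scipy.stats import norm

N0, N1 = 10000, 10000
q0, q1 = 0.05, 0.05                    # known CTRs of the certain item Q
a, b = 5.0, 95.0                       # Beta prior of item P's CTR
p0_hat = p1_hat = a / (a + b)          # E[p0] = E[p1] under the prior

def gain(x):
    if x <= 0:
        return 0.0
    var_x = (x * N0 / (a + b + x * N0)) * (a * b / ((a + b)**2 * (a + b + 1)))
    s = math.sqrt(var_x)
    z = (q1 - p1_hat) / s
    # E_c[max(p1_hat(x,c) - q1, 0)] under the normal approximation
    future = s * norm.pdf(z) + (p1_hat - q1) * (1 - norm.cdf(z))
    return N0 * x * (p0_hat - q0) + N1 * future

lo, hi = 0.0, 1.0                      # ternary search over x in [0, 1]
while hi - lo > 1e-6:
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if gain(m1) < gain(m2):
        lo = m1
    else:
        hi = m2
print(round((lo + hi) / 2, 4), round(gain((lo + hi) / 2), 3))
```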
189. 189. Bayesian Solution: Two Items, Two Time Slots (4) • Quiz: Is it correct that the more we are uncertain about the CTR of an item, the more we should explore the item? • [Figure: fraction of views to give to the item, plotted for different prior mean settings (curves), under high vs. low uncertainty (two panels)]
190. 190. Bayesian Solution: General Case (1) • From two items to K items – Very difficult problem: maxx { N0 Σi xi p̂i0 + N1 Ec[ maxi p̂i1(xi, ci) ] } subject to Σi xi = 1 • Note: c = [c1, …, cK]; ci is a random variable representing the #clicks on item i we may get – Rewrite the inner max with indicators zi(c): maxz Ec[ Σi zi(c) p̂i1(xi, ci) ] subject to Σi zi(c) = 1, for all possible c – Apply Whittle's Lagrange relaxation (1988) to this problem setting • Relax Σi zi(c) = 1, for all c, to Ec[ Σi zi(c) ] = 1 • Apply Lagrange multipliers (q0 and q1) to enforce the constraints: minq0,q1 { N0 q0 + N1 q1 + Σi maxxi Gain(xi, q0, q1) } – We essentially reduce the K-item case to K independent two-item sub-problems (which we have solved)
  191. 191. Bayesian Solution: General Case (2) • From two intervals to multiple time slots – Approximate multiple time slots by two stages • Non-stationary CTR – Use the Dynamic Gamma-Poisson model to estimate the CTR distribution for each item
192. 192. Simulation Experiment: Different Traffic Volume Simulation with ground truth estimated based on Yahoo! Front Page data Setting: 16 live items per interval Scenarios: Web sites with different traffic volume (x-axis)
  193. 193. Simulation Experiment: Different Sizes of the Item Pool Simulation with ground truth estimated based on Yahoo! Front Page data Setting: 1000 views per interval; average item lifetime = 20 intervals Scenarios: Different sizes of the item pool (x-axis)
194. 194. Characteristics of Different Explore/Exploit Schemes (1) • Why the Bayesian solution has better performance • Characterize each scheme by three dimensions: – Exploitation regret: the regret of a scheme when it is showing the item which it thinks is the best (may not actually be the best) • 0 means the scheme always picks the actual best • It quantifies the scheme's ability to find good items – Exploration regret: the regret of a scheme when it is exploring the items which it feels uncertain about • It quantifies the price of exploration (lower is better) – Fraction of exploitation (higher is better) • Fraction of exploration = 1 − fraction of exploitation • (All traffic to a web site = exploitation traffic + exploration traffic)
195. 195. Characteristics of Different Explore/Exploit Schemes (2) • Exploitation regret: ability to find good items (lower is better) • Exploration regret: price of exploration (lower is better) • Fraction of exploitation (higher is better) • [Figure: schemes plotted as exploitation regret vs. exploration regret, and exploitation regret vs. exploitation fraction, with the “good” corner marked]
  196. 196. Discussion: Large Content Pool • The Bayesian solution looks promising – ~10% from true optimal for a content pool of 1000 live items • 1000 views per interval; item lifetime ~20 intervals • Intelligent initialization (offline modeling) – Use item features to reduce the prior variance of an item • E.g., Var[ item CTR | Sport ] < Var[ item CTR ] – Require a CTR model that outputs both mean and variance • Linear regression model • Segmented model: Estimate the CTR distribution of a random article in an item category – Existing taxonomies, decision tree, LDA topics • Feature-based explore/exploit – Estimate model parameters, instead of per-item CTR – More later
197. 197. Discussion: Multiple Positions, Ranking • Feature-based approach – reward(page) = model(item 1 at position 1, …, item k at position k) – Apply feature-based explore/exploit • Online optimization for ranked lists – Ranked bandits [Radlinski et al., 2008]: run an independent bandit algorithm for each position – Dueling bandit [Yue & Joachims, 2009]: actions are pairwise comparisons • Online optimization of submodular functions – For all S1, S2 and items a: ∂fa(S1 ∪ S2) ≤ ∂fa(S1), where ∂fa(S) = f(S ∪ {a}) − f(S) – Streeter & Golovin (2008)
198. 198. Discussion: Segmented Most Popular • Partition users into segments, and then, for each segment, provide most popular recommendation • How to segment users – Hand-created segments: AgeGroup × Gender – Clustering or decision tree based on user features • Users in the same cluster like similar items • Segments can be organized by taxonomies/hierarchies – Better CTR models can be built by hierarchical smoothing • Shrink the CTR of a segment toward its parent • Introduce bias to reduce uncertainty/variance – Bandits for taxonomies (Pandey et al., 2008) • First explore/exploit categories/segments • Then switch to individual items
199. 199. Most Popular Recommendation: Summary • Online model: – Estimate the mean and variance of the CTR of each item over time – Dynamic Gamma-Poisson model • Intelligent initialization: – Estimate the prior mean and variance of the CTR of each item cluster using historical data • Cluster items → maximum likelihood estimates of the priors • Explore/exploit: – Bayesian: solve a Markov decision process problem • Gittins' index, Whittle's index, approximations • Better performance, computation intensive • Thompson sampling: sample from the posterior (simple) – Minimax: bound the regret • UCB1: easy to compute • Explores more than necessary in practice – ε-Greedy: empirically competitive when ε is tuned
  200. 200. Online Components for Personalized Recommendation Online models, intelligent initialization & explore/exploit
201. 201. Intelligent Initialization for Linear Model (1) • Linear/factorization model: yij ~ N(ui'θj, σ²), with θj ~ N(μj, Σ) – yij = rating that user i gives item j; ui = feature/factor vector of user i; θj = factor vector of item j • How to estimate the prior parameters μj and Σ – Important for cold start: predictions are made using the prior – Leverage available features • How to learn the weights/factors quickly – High-dimensional θj → slow convergence – Reduce the dimensionality
202. 202. Feature-Based Model Initialization • FOBFM: Fast Online Bilinear Factor Model – Per-item online model: yij ~ N(ui'θj, σ²), with prior θj ~ N(Axj, Σ) • Data: yij = rating that user i gives item j; ui = offline factor vector of user i; xj = feature vector of item j – Equivalently, θj = Axj + vj with vj ~ N(0, Σ): yij ~ ui'Axj + ui'vj, where ui'Axj is predicted by features • Dimensionality reduction for fast model convergence – vj = B δj, where δj ~ N(0, σ² I) and B is an n×k linear projection matrix (k << n) • Projects high-dim vj to low-dim δj; low-rank approximation of Var[θj]: Σ = σ² B B', so θj ~ N(Axj, σ² B B') • Offline training: determine A, B, σ² through the EM algorithm (once per day or hour)
203. 203. Feature-Based Model Initialization (continued) • Dimensionality reduction for fast model convergence – yij ~ ui'Axj + (B'ui)'δj, where ui'Axj is an offset and B'ui is the new (low-dimensional) feature vector; δj is updated in an online manner • Fast, parallel online learning • Online selection of the dimensionality (k = dim(δj)) – Maintain an ensemble of models, one for each candidate dimensionality
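A sketch of the resulting per-item online learner, under the Gaussian model above: with A and B fixed offline, each observation contributes the offset ui'Axj and a k-dimensional feature vector s = B'ui, and we update the Gaussian posterior of δj with a standard Bayesian linear-regression step. All inputs here are synthetic.

```python
# Per-item online update in the reduced space (our illustration of the idea).
import numpy as np

k = 5
sigma2 = 1.0                            # observation noise variance
P = np.eye(k)                           # prior precision of delta_j (N(0, I))
m = np.zeros(k)                         # posterior mean of delta_j

def online_update(s, y, offset):
    """One rating y on item j by a user with reduced features s = B'u_i."""
    global P, m
    r = y - offset                      # subtract the feature-predicted part u'Ax_j
    b = P @ m + s * r / sigma2          # uses the pre-update precision
    P += np.outer(s, s) / sigma2        # precision update
    m = np.linalg.solve(P, b)           # posterior mean update

rng = np.random.default_rng(0)
for _ in range(100):
    s = rng.normal(size=k)
    online_update(s, y=rng.normal(), offset=0.1)
print(m)
```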
  204. 204. Experimental Results: My Yahoo! Dataset (1) • My Yahoo! is a personalized news reading site – Users manually select news/RSS feeds • ~12M “ratings” from ~3M users on ~13K articles – Click = positive – View without click = negative
  205. 205. Experimental Results: My Yahoo! Dataset (2) Item-based data split: Every item is new in the test data – First 8K articles are in the training data (offline training) – Remaining articles are in the test data (online prediction & learning) Supervised dimensionality reduction (reduced rank regression) significantly outperforms other methods Methods: No-init: Standard online regression with ~1000 parameters for each item Offline: Feature-based model without online update PCR, PCR+: Two principal component methods to estimate B FOBFM: Our fast online method
206. 206. Experimental Results: My Yahoo! Dataset (3) • A small number of factors (low dimensionality) is better when the amount of data for online learning is small • A large number of factors is better when the data for learning becomes large • The online selection method usually selects the best dimensionality • # factors = number of parameters per item updated online
  207. 207. Intelligent Initialization: Summary • For online learning, whenever historical data is available, do not start cold • For linear/factorization models – Use available features to setup the starting point – Reduce dimensionality to facilitate fast learning • Next – Explore/exploit for personalization – Users are represented by covariates • Features, factors, clusters, etc – Covariate bandits
  208. 208. Explore/Exploit for Personalized Recommendation • One extreme problem formulation – One bandit problem per user with one arm per item – Bandit problems are correlated: “Similar” users like similar items – Arms are correlated: “Similar” items have similar CTRs • Model this correlation through covariates/features – Input: User feature/factor vector, item feature/factor vector – Output: Mean and variance of the CTR of this (user, item) pair based on the data collected so far • Covariate bandits – Also known as contextual bandits, bandits with side observations – Provide a solution to • Large content pool (correlated arms) • Personalized recommendation (hint before pulling an arm)
209. 209. Methods for Covariate Bandits • Priority-based methods – Rank items according to the user-specific “score” of each item; then update the model based on the user's response – UCB (upper confidence bound) • Score of an item = E[posterior CTR] + k · StDev[posterior CTR] – Posterior draw (Thompson sampling) • Score of an item = a number drawn from the posterior CTR distribution – Softmax • Score of an item = a number drawn according to exp(μ̂i/τ) / Σj exp(μ̂j/τ) • ε-Greedy – Allocate an ε fraction of traffic for random exploration (ε may be adaptive) – Robust when the exploration pool is small • Bayesian scheme – Close to optimal if it can be solved efficiently
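The two priority rules used most above, sketched for a Gaussian posterior over per-item coefficient vectors; the dimensions, the k multiplier, and the synthetic posteriors are assumptions for illustration.

```python
# UCB and posterior-draw (Thompson) scoring for covariate bandits.
import numpy as np

rng = np.random.default_rng(1)
d = 8
x_user = rng.normal(size=d)             # user feature/factor vector

# per-item posterior: mean and covariance of its coefficient vector
items = [{"mean": rng.normal(size=d) * 0.1, "cov": np.eye(d) * 0.01}
         for _ in range(20)]

def ucb_score(item, x, kappa=2.0):
    mu = x @ item["mean"]                       # E[score]
    sd = np.sqrt(x @ item["cov"] @ x)           # StDev[score]
    return mu + kappa * sd

def thompson_score(item, x):
    theta = rng.multivariate_normal(item["mean"], item["cov"])  # posterior draw
    return x @ theta

best_ucb = max(range(20), key=lambda j: ucb_score(items[j], x_user))
best_ts = max(range(20), key=lambda j: thompson_score(items[j], x_user))
print(best_ucb, best_ts)
```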
210. 210. Covariate Bandits: Some References • Just a small sample of papers – Hierarchical explore/exploit (Pandey et al., 2008) • Explore/exploit categories/segments first; then switch to individuals – Variants of ε-greedy • Epoch-greedy (Langford & Zhang, 2007): ε is determined based on the generalization bound of the current model • Banditron (Kakade et al., 2008): linear model with binary response • Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time; example models: histogram, nearest neighbor – Variants of UCB methods • Linearly parameterized bandits (Rusmevichientong et al., 2008): minimax, based on an uncertainty ellipsoid • LinUCB (Li et al., 2010): Gaussian linear regression model • Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009): – Similar arms have similar rewards: |reward(i) − reward(j)| ≤ distance(i, j)
211. 211. Online Components: Summary • Real systems are dynamic • Cold-start problem – Incremental online update (online linear regression) – Intelligent initialization (use features to predict initial factor values) – Explore/exploit (UCB, posterior draw, softmax, ε-greedy) • Concept-drift problem – Tracking the most recent behavior (state-space models, Kalman filter) – Modeling temporal patterns (tensor factorization, splines)
  212. 212. Evaluation Methods and Challenges
213. 213. Evaluation Methods • Ideal method – Experimental design: run side-by-side experiments on a small fraction of randomly selected traffic with the new method (treatment) and the status quo (control) – Limitation • Often expensive and difficult to test a large number of methods • Problem: how do we evaluate methods offline on logged data? – Goal: to maximize clicks/revenue on the entire system, not prediction accuracy; the cost of predictive inaccuracy varies across instances • E.g., 100% error on a low-CTR article may not matter much because it always co-occurs with a high-CTR article that is predicted accurately
214. 214. Usual Metrics • Predictive accuracy – Root Mean Squared Error (RMSE) – Mean Absolute Error (MAE) – Area under the ROC curve (AUC) • Other rank-based measures based on retrieval accuracy for top-k – Recall in test data • What fraction of the items that the user actually liked in the test data were among the top-k recommended by the algorithm (fraction of hits, e.g., Karypis, CIKM 2001) • One flaw in several papers – Training and test splits are not based on time • Information leakage • Even in Netflix, this is the case to some extent – time split per user, not per event; for instance, information may leak if models are based on user-user similarity
215. 215. Metrics (continued) • Recall per event based on the Replay-Match method – Fraction of clicked events where the top recommended item matches the clicked one • This works well if the logged data were collected from a randomized serving scheme; with biased data this could be a problem – We would be rewarding algorithms that provide recommendations similar to the current ones • No reward for novel recommendations
216. 216. Details on the Replay-Match Method (Li, Langford, et al.) • x: feature vector for a visit • r = [r1, r2, …, rK]: reward vector for the K items in inventory • h(x): recommendation algorithm to be evaluated • Goal: estimate the expected reward for h(x) • s(x): recommendation scheme that generated the logged data • x1, …, xT: visits in the logged data • rti: reward for visit t, where i = s(xt)
217. 217. Replay-Match (continued) • Estimator: a weighted average of the rewards rt,s(xt) over the visits where the evaluated policy matches the logged action, i.e., h(xt) = s(xt), with importance weights correcting for the logging scheme • If the importance weights are correct, it can be shown that the estimator is unbiased – E.g., if s(x) is a random serving scheme, the importance weights are uniform over the item set – If s(x) is not random, the importance weights have to be estimated through a model
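A sketch of replay under the easy case called out above, a uniformly random logging scheme: the importance weights are uniform, so we simply average rewards over the matched events. The logging simulation and the policy h are made up.

```python
# Replay estimation on synthetic logs from a uniformly random serving scheme.
import random

K = 5
random.seed(0)
log = []                                # (visit feature, served item, reward)
for t in range(100000):
    x = random.randrange(10)            # a toy visit "feature"
    served = random.randrange(K)        # uniformly random logging policy s(x)
    reward = random.random() < 0.01 * (1 + (served == x % K))
    log.append((x, served, reward))

def h(x):                               # policy to evaluate (hypothetical)
    return x % K

matched = [r for x, s, r in log if h(x) == s]   # keep only matched events
print(sum(matched) / len(matched))      # unbiased estimate of h's CTR
```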
218. 218. Back to Multi-Objective Optimization • [Diagram: editorial content feeds the front-page recommender; clicks on FP links influence the downstream supply distribution; the ad server fills premium display (guaranteed) vs. the spot market (cheaper); downstream engagement is measured as time spent]
219. 219. Serving Content on Front Page: Click Shaping • What do we want to optimize? • Current: maximize clicks (maximize downstream supply from FP) • But consider the following – Article 1: CTR = 5%, utility per click = 5 – Article 2: CTR = 4.9%, utility per click = 10 • By promoting article 2, we lose 1 click per 100 visits but gain 5 utils • If we do this for a large number of visits, we lose some clicks but obtain significant gains in utility – E.g., lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc.)
220. 220. Why Call It Click Shaping? • [Figure: supply distribution across properties (autos, finance, health, hotjobs, movies, new.music, news, omg, realestate, rivals, shine, shopping, sports, tech, travel, tv, video, buzz, gmy.news, videogames, other) before and after click shaping, with relative changes between −10% and +10%] • Shaping can happen with respect to any downstream metric (like engagement)
221. 221. Multi-Objective Optimization • n articles A1, A2, …, An with K properties (news, finance, omg, …) • m user segments S1, S2, …, Sm • CTR of user segment i on article j: pij • Time duration of i on j: dij
222. 222. Multi-Objective Program • Formulations: scalarization and goal programming • Simplex constraints on xij are always applied • The constraints are linear • Every 10 mins, solve for x; use this x as the serving scheme in the next 10 mins
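A hedged sketch of one such linear program (our formulation, not the exact one from the slides): maximize expected clicks subject to per-segment simplex constraints and a goal-programming-style floor on total time spent. The CTRs, durations, and traffic volumes are invented.

```python
# Click shaping as a linear program, solved each serving epoch.
import numpy as np
from scipy.optimize import linprog

m, n = 3, 4                                # user segments x articles
rng = np.random.default_rng(2)
p = rng.uniform(0.01, 0.06, size=(m, n))   # CTR p_ij
d = rng.uniform(10, 60, size=(m, n))       # time duration d_ij
visits = np.array([1000.0, 2000.0, 500.0]) # visits per segment

# decision vars x_ij (flattened); objective: maximize clicks = sum_i visits_i p_ij x_ij
c = -(visits[:, None] * p).ravel()

# goal constraint: total expected time spent >= 90% of uniform serving
T_min = 0.9 * (visits[:, None] * d * (1.0 / n)).sum()
A_ub = [-(visits[:, None] * d).ravel()]
b_ub = [-T_min]

# simplex constraints: sum_j x_ij = 1 for each segment i
A_eq = np.zeros((m, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0
b_eq = np.ones(m)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.reshape(m, n).round(3))
```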
223. 223. Pareto-optimal solution (more in KDD 2011)
224. 224. Summary • Modern recommendation systems on the web crucially depend on extracting intelligence from massive amounts of data collected on a routine basis • Lots of data and processing power are not enough; the number of things we need to learn grows with data size • Extracting grouping structures at coarser resolutions based on similarity (correlations) is important – ML has a big role to play here • Continuous and adaptive experimentation in a judicious manner is crucial to maximize performance – Again, ML has a big role to play • Multi-objective optimization is often required; the objectives are application dependent – ML has to work in close collaboration with engineering, product & business execs
  225. 225. Challenges
  226. 226. Recall: Some examples • Simple version – I have an important module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my website. How do I take a holistic approach and perform a simultaneous optimization?
  227. 227. For the simple version • Multi-position optimization – Explore/exploit, optimal subset selection • Explore/Exploit strategies for large content pool and high dimensional problems – Some work on hierarchical bandits but more needs to be done • Constructing user profiles from multiple sources with less than full coverage – Couple of papers at KDD 2011 • Content understanding • Metrics to measure user engagement (other than CTR)
228. 228. Other Problems • Whole-page optimization – Incorporating correlations • Incentivizing user-generated content • Incorporating social information for better recommendation (News Feed Recommendation) • Multi-context learning
  229. 229. Case Studies
  230. 230. Recommendations and Advertising on LinkedIn HP
231. 231. EXAMPLE: DISPLAY AD PLACEMENTS ON LINKEDIN
  232. 232. Recommendations and Advertising on LinkedIn HP
233. 233. LinkedIn Advertising: Flow • Ad request (profile: region = US, age = 20; context = profile page, 300×250 ad slot) • Filter campaigns (targeting criteria, frequency cap, budget pacing) • Response prediction engine → campaigns eligible for auction, sorted by Bid × CTR • Automatic format selection • Click cost = Bid3 × CTR3 / CTR2 • Serving constraint: < 100 millisec
  234. 234. CTR Prediction Model for Ads • Feature vectors – Member feature vector: xi (identity, behavioral, network) – Campaign feature vector: cj (text, adv-id,…) – Context feature vector: zk (page type, device, …) • Model:
235. 235. CTR Prediction Model for Ads • Feature vectors – Member feature vector: xi – Campaign feature vector: cj – Context feature vector: zk • Model: cold-start component + warm-start per-campaign component
236. 236. CTR Prediction Model for Ads • Feature vectors – Member feature vector: xi – Campaign feature vector: cj – Context feature vector: zk • Model: cold-start component + warm-start per-campaign component – Both components can have L2 penalties
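A sketch of how the two components might combine at scoring time. This decomposition (shared cold-start weights over member/campaign/context features plus a per-campaign warm-start correction over member features) is our illustration of the slides, not LinkedIn's exact equation; all names and dimensions are invented.

```python
# Two-component logistic CTR model: cold-start + per-campaign warm-start.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
dx, dc, dz = 6, 4, 3
theta_cold = rng.normal(scale=0.1, size=dx + dc + dz)   # retrained slowly (e.g., weekly)
theta_warm = {}                                         # per-campaign, retrained often

def predict_ctr(x, c, z, campaign_id):
    phi = np.concatenate([x, c, z])
    score = phi @ theta_cold                            # cold-start component
    w = theta_warm.get(campaign_id)
    if w is not None:                                   # warm-start correction
        score += x @ w
    return sigmoid(score)

x, c, z = rng.normal(size=dx), rng.normal(size=dc), rng.normal(size=dz)
theta_warm["campaign_42"] = rng.normal(scale=0.05, size=dx)
print(predict_ctr(x, c, z, "campaign_7"), predict_ctr(x, c, z, "campaign_42"))
```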
237. 237. Model Fitting • Single machine (well understood) – Conjugate gradient – L-BFGS – Trust region – … • Model training with large-scale data – Cold-start component Θw is more stable • Weekly/bi-weekly training is good enough • However: difficulty from the need for large-scale logistic regression – Warm-start per-campaign model Θc is more dynamic • New items can get generated at any time • Big loss if opportunities are missed • Need to update the warm-start component as frequently as possible
238. 238. Model Fitting (continued) • Single machine (well understood): conjugate gradient, L-BFGS, trust region, … • Model training with large-scale data – Cold-start component Θw: large-scale logistic regression, retrained weekly/bi-weekly (stable) – Warm-start per-campaign model Θc: per-item logistic regression given the cold-start component, updated as frequently as possible (dynamic; new items can appear any time, and opportunities are lost if updates lag)
239. 239. Explore/Exploit with Logistic Regression • [Figure: separating lines for the cold-start model vs. cold + warm start for an ad-id, with the posterior of the warm-start coefficients] • E/E: sample a line from the posterior (Thompson sampling)
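A minimal sketch of that sampling step: draw one warm-start coefficient vector from an approximate Gaussian posterior (e.g., from a Laplace approximation) and score with the sampled line, so uncertain campaigns get explored. The posterior values here are invented.

```python
# Thompson sampling over warm-start logistic coefficients (illustrative).
import numpy as np

rng = np.random.default_rng(4)
d = 6
post_mean = rng.normal(scale=0.1, size=d)   # posterior mean of warm-start coeffs
post_cov = np.eye(d) * 0.02                 # approximate posterior covariance

def thompson_ctr(x, offset):
    w = rng.multivariate_normal(post_mean, post_cov)    # one posterior draw
    return 1.0 / (1.0 + np.exp(-(offset + x @ w)))      # offset = cold-start score

x = rng.normal(size=d)
print([round(thompson_ctr(x, offset=-3.0), 4) for _ in range(3)])  # varies by draw
```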
  240. 240. Models Considered • CONTROL: per-campaign CTR counting model • COLD-ONLY: only cold-start component • LASER: our model (cold-start + warm-start) • LASER-EE: our model with Explore-Exploit using Thompson sampling
  241. 241. Metrics • Model metrics (offline) – Test Log-likelihood – AUC/ROC – Observed/Expected ratio • Online metrics (Online A/B Test) – CTR – CPM (Revenue per impression) – Unique ads per user (diversity)
242. 242. Observed / Expected Ratio • Offline replay is difficult with large item pools (randomization is costly) • Observed: #clicks in the data; Expected: sum of the predicted CTR over all impressions • Not a “standard” classifier metric, but useful for this application • What we usually see: Observed / Expected < 1 – Quantifies the “winner's curse”, aka selection bias in auctions • When choosing from among thousands of candidates, an item with a mistakenly over-estimated CTR may end up winning the auction • Particularly helpful in spotting inefficiencies by segment – E.g., by bid, number of impressions in training (warmness), geo, etc. – Allows us to see where the model might be giving too much weight to the wrong campaigns • High correlation between the O/E ratio and model performance online
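A tiny sketch of the O/E diagnostic computed by segment, on synthetic data; a segment with O/E well below 1 is where over-estimated CTRs are winning auctions.

```python
# Observed/Expected ratio by segment on synthetic impressions.
import numpy as np

rng = np.random.default_rng(5)
segment = rng.integers(0, 3, size=100000)        # e.g., campaign warmness bucket
predicted = rng.uniform(0.005, 0.03, size=100000)
# simulate over-prediction in segment 0: true CTR is only 70% of predicted
clicked = rng.random(100000) < predicted * np.where(segment == 0, 0.7, 1.0)

for s in range(3):
    mask = segment == s
    oe = clicked[mask].sum() / predicted[mask].sum()
    print(s, round(oe, 3))                        # segment 0 shows O/E well below 1
```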
  243. 243. Online A/B Test • Three models – CONTROL (10%) – LASER (85%) – LASER-EE (5%) • Segmented Analysis – 8 segments by campaign warmness • Degree of warmness: the number of training samples available in the training data for the campaign • Segment #1: Campaigns with almost no data in training • Segment #8: Campaigns that are served most heavily in the previous batches so that their CTR estimate can be quite accurate
244. 244. Daily CTR Lift Over Control • [Figure: percentage CTR lift of LASER and LASER-EE over control for Day 1 through Day 7; absolute values masked]
245. 245. Daily CPM Lift Over Control • [Figure: percentage eCPM lift of LASER and LASER-EE over control for Day 1 through Day 7; absolute values masked]
246. 246. CPM Lift by Campaign Warmness Segments • [Figure: percentage CPM lift of LASER and LASER-EE by campaign warmness segments 1–8]
247. 247. O/E Ratio by Campaign Warmness Segments • [Figure: observed clicks / expected clicks (roughly 0.5–1.0) for CONTROL, LASER, and LASER-EE by campaign warmness segments 1–8]
248. 248. Number of Campaigns Served • [Figure: improvement from E/E in the number of campaigns served]
  249. 249. Insights • Overall performance: – LASER and LASER-EE are both much better than control – LASER and LASER-EE performance are very similar • Great news! We get exploration without much additional cost • Exploration has other benefits – LASER-EE serves significantly more campaigns than LASER – Provides healthier market place, more ad diversity per user (better experience)
  250. 250. Solutions to Practical Problems • Rapid model development cycle – Quick reaction to changes in data, product – Write once for training, testing, inference • Can adapt to changing data – Integrated Thompson sampling explore/exploit – Automatic training – Multiple training frequencies for different parts of model • Good tools yield good models – Reusable components for feature extraction and transformation – Very high-performance inference engine for deployment – Modelers can concentrate on building models, not re-writing common functions or worrying about production issues
251. 251. Summary • Reducing dimension through logistic regression, coupled with explore/exploit schemes like Thompson sampling, is an effective mechanism for solving response prediction problems in advertising • Partitioning model components into cold-start (stable) and warm-start (non-stationary) parts with different training frequencies is an effective way to scale the computations • ADMM, with a few modifications, is an effective model-training strategy for large data with high dimensionality • These methods work well for LinkedIn advertising, with significant improvements
252. 252. Theory vs. Practice • Textbook: – Data is stationary – Training data is clean – Training is hard; testing and inference are easy – Models don't change – Complex algorithms work best • Reality: – Features and items change constantly – Fraud, bugs, tracking delays, online/offline inconsistencies, etc. – All aspects have challenges at web scale – Never-ending processes of improvement – Simple models with good features and lots of data win
253. 253. Current Work: Feed Recommendation • Network updates: job changes, job anniversaries, connections, endorsements, photo uploads, … • Content: articles by influencers, shares by friends, content in followed channels, content by followed companies, job recommendations, … • Sponsored updates: company updates, jobs
254. 254. Tiered Approach to Ranking • First-pass rankers: network updates, content, ads, jobs • A second-pass ranker (blender) blends the disparate results returned by the first-pass rankers and outputs the top k
  255. 255. Challenges • Personalization – Viewer-actor affinity by type (depends on strength of connections in multiple contexts) – Blending identity and behavioral data • Frequency discounting, freshness, diversification • Multi-objectives (Revenue, Engagement) • A/B tests with interference • Engagement metrics – Function of various actions that optimize long-term engagement metric like return visits • Summarization and adding new content types
