Sequential Selection of Correlated Ads by POMDPs

Slides presented by Shuai Yuan at CIKM '12.


  1. Sequential Selection of Correlated Ads by POMDPs. Shuai Yuan, Jun Wang, University College London. October 29, 2012.
  2. Motivations and contributions
     Motivations:
     • help publishers gain more profit by displaying ads;
     • go further than offline, content-based matching of webpages and ads.
     Contributions:
     • a framework of ad selection for revenue optimisation;
     • formulating the sequential selection problem as a partially observable Markov decision process (POMDP) and providing exact and approximate solutions;
     • a public keyword-bid-ad-webpage dataset for reproducible research (http://www.computational-advertising.org).
  3. Related works
     Contextual advertising:
     • A semantic approach to contextual advertising [Broder 2007]
     • Impedance coupling in content-targeted advertising [Ribeiro 2005]
     • Contextual advertising by combining relevance with click feedback [Chakrabarti 2008]
     Inventory management (contracts):
     • Targeted advertising on the Web with inventory management [Chickering 2003]
     • Revenue management for online advertising: Impatient advertisers [Fridgeirsdottir 2007]
     • Dynamic revenue management for online display advertising [Roels 2009]
     Optimal pricing models:
     • Pricing of online advertising: Cost-per-click-through vs. cost-per-action [Hu 2010]
     • Online advertising: Pay-per-view versus pay-per-click [Mangani 2004]
     • Online advertising: Pay-per-view versus pay-per-click, a comment [Fjell 2009]
     • Single-period balancing of pay-per-click and pay-per-view online display advertisements [Kwon 2011]
  4. Related works (cont.)
     Ad scheduling:
     • Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
     • Scheduling of dynamic in-game advertising [Turner 2011]
     Multi-armed bandits:
     • Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
     • Multi-armed bandit problems with dependent arms [Pandey 2007]
     POMDPs:
     • A survey of POMDP applications [Cassandra 1998]
     • Monte Carlo POMDPs [Thrun 2000]
     • Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
  5. Problem statement - setup
     Figure: 1 webpage, 1 ad slot, M impressions at each time step.
     The payoff of the ads follows $X \sim \mathcal{N}(\mu, I \cdot \sigma_0^2)$, where $\mu$ is itself generated by $\mu \sim \mathcal{N}(\theta, \Sigma)$.
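For concreteness, here is a minimal Python sketch of this generative payoff model; the function name `sample_payoffs`, the example numbers, and the fixed seed are our own illustration, not from the slides.

```python
import numpy as np

def sample_payoffs(theta, Sigma, sigma0_sq, T, seed=0):
    """Draw latent per-ad mean payoffs mu ~ N(theta, Sigma) once, then
    T rounds of noisy payoffs X(t) ~ N(mu, I * sigma0_sq)."""
    rng = np.random.default_rng(seed)
    mu = rng.multivariate_normal(theta, Sigma)                     # hidden state
    X = mu + rng.normal(0.0, np.sqrt(sigma0_sq), (T, len(theta)))  # observed payoffs
    return mu, X

# Example: 3 ads, the first two positively correlated, over a 5-step horizon.
theta = np.array([100.0, 120.0, 90.0])
Sigma = np.array([[400.0, 300.0,   0.0],
                  [300.0, 400.0,   0.0],
                  [  0.0,   0.0, 400.0]])
mu, X = sample_payoffs(theta, Sigma, sigma0_sq=25.0, T=5)
```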
  6. Problem statement - graphical model
     Figure: the payoff model illustrated by an influence-diagram representation of the generative process of a finite-horizon POMDP. $s(t)$ is the selection action; $(\theta(t), \Sigma(t))$ is the belief at each stage; $\mu(t)$ are the hidden mean payoffs and $x(t)$ the observations, with noise variance $\sigma_0^2$; $(\theta, \Sigma)$ parameterise the prior.
  7. Problem statement - objective function
     To maximise the expected cumulative payoff over time,
     $$\pi^* = \arg\max_\pi \mathbb{E}[R_\pi(T)] = \arg\max_\pi \mathbb{E}\Big[\sum_{t=1}^T X_{s(t)}(t)\Big] = \arg\max_\pi \sum_{t=1}^T \mathbb{E}\big[X_{s(t)}(t)\big]$$
     $$= \arg\max_\pi \sum_{t=1}^T \int_x x_{s(t)}(t)\, p\big(x_{s(t)}(t) \mid \Psi(t)\big)\, dx = \arg\max_\pi \sum_{t=1}^T \theta_{s(t)}(t) \qquad (1)$$
     where,
     • $s(t)$ is the selection decision;
     • $\Psi(t)$ is the available information;
     • $\pi$ is a selection policy and $\pi^*$ is the optimal one;
     • the constant factor of M impressions is dropped from the objective function.
  8. Belief update
     Figure: updating the belief on the ads' performance over time ($t = 1, 2, \dots$).
  9. Belief update - the selected ad
     We update the belief using Bayes' theorem,
     $$p\big(x_1 \mid x_1(t), \Psi(t)\big) = \int p\big(x_1 \mid x_1(t), \Psi(t), \mu_1\big)\, p\big(\mu_1 \mid x_1(t), \Psi(t)\big)\, d\mu_1 \qquad (2)$$
     By "completing the squares",
     $$p\big(\mu_1 \mid x_1(t), \Psi(t)\big) \propto p\big(x_1(t) \mid \mu_1, \Psi(t)\big)\, p\big(\mu_1 \mid \Psi(t)\big) \propto \exp\Big(-\frac{(x_1(t) - \mu_1)^2}{2\sigma_0^2} - \frac{(\mu_1 - \theta_1(t))^2}{2\sigma_1^2(t)}\Big) \qquad (3)$$
     we obtain the new belief,
     $$\mu_1 \mid x_1(t) \sim \mathcal{N}\big(\theta_1(t+1), \sigma_1^2(t+1)\big) \qquad (4)$$
     $$\theta_1(t+1) = \frac{\sigma_1^2(t)\, x_1(t) + \sigma_0^2\, \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad \sigma_1^2(t+1) = \frac{\sigma_1^2(t)\, \sigma_0^2}{\sigma_1^2(t) + \sigma_0^2} \qquad (5)$$
     We write $\theta_i(t)$ and $\sigma_i^2(t)$ as shorthand for $\theta_i \mid \Psi(t)$ and $\sigma_i^2 \mid \Psi(t)$.
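A minimal sketch of the conjugate update in equations (4) and (5); `update_selected` and its argument names are our own labels.

```python
def update_selected(theta1, var1, x1, sigma0_sq):
    """Posterior mean and variance of the selected ad's latent mean payoff
    after observing payoff x1 (equations 4-5)."""
    denom = var1 + sigma0_sq
    theta1_new = (var1 * x1 + sigma0_sq * theta1) / denom  # precision-weighted mean
    var1_new = var1 * sigma0_sq / denom                    # posterior variance shrinks
    return theta1_new, var1_new
```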
  10. Belief update - the correlated ad
      We also update the belief of non-selected ads,
      $$p\big(x_2 \mid x_1(t), \Psi(t)\big) = \int p\big(x_2 \mid \mu_2, x_1(t), \Psi(t)\big)\, p\big(\mu_2 \mid x_1(t), \Psi(t)\big)\, d\mu_2 \qquad (6)$$
      With the linear Gaussian property,
      $$\mu_1 \mid \mu_2 \sim \mathcal{N}\big(\theta_1 \mid \mu_2,\; \sigma_1^2 \mid \mu_2\big) \qquad (7)$$
      $$\theta_1 \mid \mu_2 = \theta_1 + \frac{\sigma_{1,2}}{\sigma_2^2}(\mu_2 - \theta_2), \qquad \sigma_1^2 \mid \mu_2 = \sigma_1^2 - \frac{\sigma_{1,2}^2}{\sigma_2^2} \qquad (8)$$
      we obtain the new belief on a correlated ad,
      $$\mu_2 \mid x_1(t) \sim \mathcal{N}\big(\theta_2(t+1), \sigma_2^2(t+1)\big) \qquad (9)$$
      $$\theta_2(t+1) = \theta_2(t) + \sigma_{1,2}\, \frac{x_1(t) - \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad \sigma_2^2(t+1) = \sigma_2^2(t) - \frac{\sigma_{1,2}^2}{\sigma_1^2(t) + \sigma_0^2} \qquad (10)$$
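The matching sketch for equations (9) and (10); again the names are our own, and `cov12` stands for the prior covariance $\sigma_{1,2}$.

```python
def update_correlated(theta2, var2, cov12, theta1, var1, x1, sigma0_sq):
    """Belief update for a non-selected ad that is correlated with the
    selected one (equations 9-10)."""
    denom = var1 + sigma0_sq
    theta2_new = theta2 + cov12 * (x1 - theta1) / denom  # shift along the correlation
    var2_new = var2 - cov12**2 / denom                   # information gained from x1
    return theta2_new, var2_new
```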
  11. Belief update - expected payoff
      We also obtain the expected payoff of the selected ad,
      $$X_1 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_1(t+1),\; \sigma_0^2 + \sigma_1^2(t+1)\big) \qquad (11)$$
      and the expected payoff of the correlated ad,
      $$X_2 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_2(t+1),\; \sigma_0^2 + \sigma_2^2(t+1)\big) \qquad (12)$$
      The final objective function is,
      $$\pi^* = \arg\max_\pi \sum_{t=1}^T \theta_{s(t)}(t) \quad \text{subject to} \qquad (13)$$
      $$\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2} \qquad (14)$$
      $$\sigma_{s(t+1)}^2(t+1) = \sigma_{s(t+1)}^2(t) - \frac{\sigma_{s(t),s(t+1)}^2}{\sigma_{s(t)}^2(t) + \sigma_0^2} \qquad (15)$$
  12. POMDP formulation and solution
      Figure: the POMDP model for the revenue optimisation problem. $(\theta(t), \Sigma(t))$ is the belief state at each stage; $x(t)$ is the observation and reward; $s(t)$ is the action; $(\theta, \Sigma)$ is the hidden state. There is no state transition.
  13. Value iteration and MAB approximation
      The value function can be expressed as,
      $$s(t) = \arg\max_{i \in N} V_i\big(\Psi(t)\big) = \arg\max_{i \in N} \big[\, \bar{x}_i + \xi(\Psi(t), i) \,\big] \qquad (16)$$
      where $\bar{x}_i$ is the expected immediate reward and $\xi(\Psi(t), i)$ the expected future reward.
      The exact solution uses value iteration [Bellman 1957]:
      $$V^*(\theta, \Sigma, T) = \max_{s(1) \in N} \mathbb{E}\Big[ X_{s(1)}(1) + V^*\big(\theta \mid X_{s(1)}(1),\; \Sigma \mid X_{s(1)}(1),\; T-1\big) \Big] \qquad (17)$$
      The approximation is based on the multi-armed bandit policy UCB1-NORMAL [Auer et al. 2002]:
      $$\xi_{\text{UCB1-NORMAL}} = \sqrt{16 \cdot \frac{q_i - t_i\, \theta_i^2(t)}{t_i - 1} \cdot \frac{\ln(t-1)}{t_i}} \qquad (18)$$
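A sketch of the exploration bonus of equation (18); the square root and logarithm are restored from the UCB1-NORMAL policy of the cited Auer et al. (2002) paper, and the function name is our own.

```python
import math

def ucb1_normal_bonus(q_i, t_i, theta_i, t):
    """UCB1-NORMAL exploration bonus (equation 18): q_i is the sum of
    squared payoffs of ad i, t_i how often i was selected (must be >= 2),
    theta_i its estimated mean payoff, t the total number of steps."""
    sample_var = (q_i - t_i * theta_i**2) / (t_i - 1)  # empirical variance term
    return math.sqrt(16.0 * sample_var * math.log(t - 1) / t_i)
```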
  14. Value iteration with Monte Carlo sampling [Thrun 2000]
      We use sampling to reduce the computational complexity:
      function VALUEFUNC(θ, Σ, t)
          V ← array of zeros                          ▷ expected reward vector
          for i ← 1 to N do
              V[i] ← θ_i(t)                           ▷ expected immediate reward
              if t < T then
                  for all s in SAMPLE(θ, Σ) do
                      (θ′, Σ′) ← UPDATEBELIEF(θ, Σ, s, i)   ▷ new belief after selecting i and observing s; equations 13-15
                      V[i] ← V[i] + (1/M₀) · VALUEFUNC(θ′, Σ′, t + 1)   ▷ average over the M₀ samples
                  end for
              end if
          end for
          return (MAX(V), MAXINDEX(V))
      end function
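A runnable Python sketch of VALUEFUNC under the payoff model above. The joint rank-one update in `update_belief` reproduces equations (14) and (15) on its mean and diagonal; the names, the sample count `n_samples` (standing in for M₀), and the fixed seed are our own.

```python
import numpy as np

def update_belief(theta, Sigma, x, i, sigma0_sq):
    """Joint Gaussian belief update after observing payoff x for ad i."""
    denom = Sigma[i, i] + sigma0_sq
    theta_new = theta + Sigma[:, i] * (x - theta[i]) / denom        # equation (14)
    Sigma_new = Sigma - np.outer(Sigma[:, i], Sigma[i, :]) / denom  # diagonal: equation (15)
    return theta_new, Sigma_new

def value_func(theta, Sigma, t, T, sigma0_sq, n_samples=4, seed=1):
    """Monte Carlo value iteration sketch; returns (value, best ad index)."""
    rng = np.random.default_rng(seed)
    V = np.array(theta, dtype=float)  # expected immediate reward per ad
    if t < T:
        for i in range(len(theta)):
            for _ in range(n_samples):
                mu = rng.multivariate_normal(theta, Sigma)  # sampled hidden means
                x = rng.normal(mu[i], np.sqrt(sigma0_sq))   # simulated observation
                th2, Sg2 = update_belief(theta, Sigma, x, i, sigma0_sq)
                V[i] += value_func(th2, Sg2, t + 1, T, sigma0_sq,
                                   n_samples)[0] / n_samples
    best = int(np.argmax(V))
    return V[best], best
```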
  15. Multi-armed bandit based approximation (cont.)
      The UCB1-NORMAL-COR algorithm:
      function PLAN(θ, Σ, Ψ(t))
          V ← array of zeros
          for i ← 1 to N do
              if t_i < 8 log t then                   ▷ t_i is the number of times ad i has been selected
                  return i
              end if
          end for
          (θ′, Σ′) ← UPDATEBELIEF(θ, Σ, Ψ(t))         ▷ new belief of all ads with all available information; equations 13-15
          for i ← 1 to N do
              V[i] ← θ′_i + √(16 · (q_i − t_i θ′_i²)/(t_i − 1) · ln(t − 1)/t_i)   ▷ expected reward plus exploration bonus
          end for
          return (MAX(V), MAXINDEX(V))
      end function
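A self-contained Python sketch of PLAN with the bonus of equation (18) inlined; the bookkeeping arrays `counts` and `sq_sums` and the assumption t ≥ 2 are our own framing.

```python
import math
import numpy as np

def plan_ucb1_normal_cor(theta, counts, sq_sums, t):
    """UCB1-NORMAL-COR selection sketch. theta: belief means after the
    correlated update; counts[i]: times ad i was selected; sq_sums[i]:
    sum of ad i's squared payoffs; t: current step (assumed >= 2)."""
    N = len(theta)
    # Force exploration of any ad that is still under-sampled.
    for i in range(N):
        if counts[i] < math.ceil(8 * math.log(t)):
            return i
    # Otherwise pick the ad maximising expected reward plus bonus (eq. 18).
    V = [theta[i] + math.sqrt(16.0 * (sq_sums[i] - counts[i] * theta[i]**2)
                              / (counts[i] - 1) * math.log(t - 1) / counts[i])
         for i in range(N)]
    return int(np.argmax(V))
```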
  16. Experiment datasets
      Figure: data flow between advertisers, the ad network/exchange, publishers, and the Google AdWords Traffic Estimator service.
      • publishers gain 68% of advertisers' spending (2003);
      • data was collected from 12/2011 to 05/2012;
      • 512 different keywords, 310 with non-zero mean payoff, 8 categories;
      • 20% for training and 80% for testing;
      • we consider each keyword to be an ad.
  17. Competing algorithms
      We compare the following algorithms:
      • RANDOM policy, which selects candidates uniformly at random;
      • MYOPIC policy, based on the expected immediate reward only;
      • UCB1 policy, which assumes independence between arms and is model-free with respect to the reward distribution;
      • UCB1-NORMAL policy, which assumes independence between arms and Gaussian-distributed rewards;
      • VI-COR policy, which solves value iteration using Monte Carlo sampling; and
      • UCB1-NORMAL-COR policy, which considers the dependencies between candidates.
  18. Results
      Dataset       MYOPIC  RANDOM  UCB1   UCB1-N  VI-COR  UCB1-N-COR
      Education     21.9    23.0    30.9   30.9    41.2*   27.6
      Finance-1     38.5    27.8    40.9   26.4    44.5    27.4
      Finance-2     22.1    16.5    30.6   22.8    38.0*   22.9
      Information   14.1    12.9    27.8   15.9    29.4    15.9
      P&O           41.6    30.4    50.5   31.4    72.9*   63.3
      Shopping-1    17.4    10.6    42.3   16.1    40.2    16.4
      Shopping-2    29.9    14.5    34.3   75.3    52.9    79.2*
      Shopping-3     9.7     4.3    21.9   18.3    27.3    19.4
      P&S           24.7    26.0    47.2   57.1    67.9*   59.9
      Medical       30.5    19.6    52.7   32.2    58.0*   33.5
      Table: the cumulative payoffs are averaged over 8 chunks, then normalised w.r.t. the GOLDEN policy for a better representation. The highest cumulative payoff in each row carries * if its difference from the second best is significant by the Wilcoxon signed-rank test. P&O is "People & organisations"; P&S is "Products & services".
  19. Results (cont.)
      Figure: cumulative payoff on the "People & organisations" category, 5 candidates, comparing VI-COR, UCB1-NORMAL-COR, UCB1-NORMAL, UCB1, GOLDEN, MYOPIC, and RANDOM.
  20. Results (cont.)
      Figure: comparison of the normalised cumulative payoffs of MYOPIC, VI-COR, UCB1-NORMAL, and UCB1-NORMAL-COR on the 10 datasets.
      VI-COR always performed better than MYOPIC, and UCB1-NORMAL-COR always performed better than UCB1-NORMAL, across all datasets.
  21. Results (cont.)
      Figure: a special case showing the daily payoff of two candidates ("best phones" and "term insurance") with a sudden change.
  22. Results (cont.)
      Figure: the impact of the noise factor $\sigma_0^2$ for the situation in the previous figure, comparing GOLDEN, MYOPIC, VI-COR, and UCB1-NORMAL-COR.
      The noise factor enters the correlated belief update of equation (14):
      $$\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2}$$
  23. Future works
      • correlated update: if ad a1 on webpage w1 was shown to user u1 and we observed its performance, what is the belief on the performance of ad a2 on webpage w2 when shown to user u2, given known correlations?
      • multiple ads with diversification (another exploration-exploitation dilemma);
      • a better solution for our continuous POMDP problem.
