# Sequential Selection of Correlated Ads by POMDPs

Slides presented by Shuai Yuan at CIKM '12.

Published in: Education

### Sequential Selection of Correlated Ads by POMDPs

1. Sequential Selection of Correlated Ads by POMDPs. Shuai Yuan, Jun Wang, University College London. October 29, 2012.
2. Motivations and contributions

   Motivations:
   • help publishers gain more profit by displaying ads;
   • go further than offline, content-based matching of webpages and ads.

   Contributions:
   • a framework of ad selection for revenue optimisation;
   • a formulation of the sequential selection problem as a partially observable Markov decision process (POMDP), with exact and approximate solutions;
   • a public keyword-bid-ad-webpage dataset for reproducible research¹.

   ¹ http://www.computational-advertising.org
4. Related works (cont.)

   Ad scheduling:
   • Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
   • Scheduling of dynamic in-game advertising [Turner 2011]

   Multi-armed bandits:
   • Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
   • Multi-armed bandit problems with dependent arms [Pandey 2007]

   POMDPs:
   • A survey of POMDP applications [Cassandra 1998]
   • Monte Carlo POMDPs [Thrun 2000]
   • Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
5. Problem statement - setup

   Figure: 1 webpage, 1 ad slot, M impressions at each time step.

   The payoff of ads follows $X \sim \mathcal{N}(\mu, \sigma_0^2 I)$, where $\mu$ is generated by $\mu \sim \mathcal{N}(\theta, \Sigma)$.
6. Problem statement - graphical model

   Figure: The payoff model illustrated by an influence diagram representation with generative processes of a finite horizon POMDP. $s(t)$ is the selection action. $\theta(t), \Sigma(t)$ is the belief at some stage. The diagram chains the beliefs $(\theta(1), \Sigma(1)), \dots, (\theta(T), \Sigma(T))$ with remaining horizons $T-1, \dots, 0$; each stage has an action $s(t)$, a mean payoff $\mu(t)$ drawn from $(\theta, \Sigma)$, and an observation $x(t)$ with noise $\sigma_0^2$.
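The generative model on this slide can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the function name `sample_payoffs` is hypothetical, and it assumes a lower-triangular Cholesky factor `chol` of Σ is already available.

```python
import random


def sample_payoffs(theta, chol, sigma0, rng):
    """Draw mu ~ N(theta, Sigma), then X ~ N(mu, sigma0^2 * I).

    chol is assumed to be a lower-triangular matrix L with L @ L.T == Sigma.
    """
    n = len(theta)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # mu = theta + L z gives a draw with covariance Sigma
    mu = [theta[i] + sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(n)]
    # independent observation noise with standard deviation sigma0 per ad
    x = [rng.gauss(mu[i], sigma0) for i in range(n)]
    return mu, x


if __name__ == "__main__":
    rng = random.Random(0)
    mu, x = sample_payoffs([100.0, 50.0], [[20.0, 0.0], [6.0, 13.7]], 10.0, rng)
    print(mu, x)
```

Passing a seeded `random.Random` keeps the simulation reproducible.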
7. Problem statement - objective function

   To maximise the expected cumulative payoff over time,

   $$
   \begin{aligned}
   \pi^* &= \arg\max_{\pi} \mathbb{E}\left[R^{\pi}(T)\right] = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=1}^{T} X_{s(t)}(t)\right] = \arg\max_{\pi} \sum_{t=1}^{T} \mathbb{E}\left[X_{s(t)}(t)\right] \\
   &= \arg\max_{\pi} \sum_{t=1}^{T} \int_{x} x_{s(t)}(t)\, p\left(x_{s(t)}(t) \mid \Psi(t)\right) dx = \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t)
   \end{aligned}
   \tag{1}
   $$

   where
   • $s(t)$ is the selection decision;
   • $\Psi(t)$ is the available information;
   • $\pi$ is a selection policy and $\pi^*$ is the optimal one;
   • "M impressions" is dropped from the objective function.
8. Belief update

   Figure: Updating belief on ads' performance over time (panels for t = 1, t = 2, ...).
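9. Belief update - the selected ad

   We update the belief using Bayes' theorem.

   $$p\left(x_1 \mid x_1(t), \Psi(t)\right) = \int p\left(x_1 \mid x_1(t), \Psi(t), \mu_1\right) p\left(\mu_1 \mid x_1(t), \Psi(t)\right) d\mu_1 \tag{2}$$

   By "completing squares",

   $$
   \begin{aligned}
   p\left(\mu_1 \mid x_1(t), \Psi(t)\right) &\propto p\left(x_1(t) \mid \mu_1, \Psi(t)\right) p\left(\mu_1 \mid \Psi(t)\right) \\
   &\propto \exp\left(-\frac{\left(x_1(t) - \mu_1\right)^2}{2\sigma_0^2} - \frac{\left(\mu_1 - \theta_1(t)\right)^2}{2\sigma_1^2(t)}\right)
   \end{aligned}
   \tag{3}
   $$

   we obtain the new belief,

   $$\mu_1 \mid x_1(t) \sim \mathcal{N}\left(\theta_1(t+1), \sigma_1^2(t+1)\right) \tag{4}$$

   $$\theta_1(t+1) = \frac{\sigma_1^2(t)\, x_1(t) + \sigma_0^2\, \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad \sigma_1^2(t+1) = \frac{\sigma_1^2(t)\, \sigma_0^2}{\sigma_1^2(t) + \sigma_0^2} \tag{5}$$

   We write $\theta_i(t)$ and $\sigma_i^2(t)$ as shorthand for $\theta_i \mid \Psi(t)$ and $\sigma_i^2 \mid \Psi(t)$.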
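The update in Equation 5 is a precision-weighted average of the prior mean and the observed payoff. A minimal sketch follows; the function name `update_selected` and the numbers in the example are illustrative assumptions, not values from the slides.

```python
def update_selected(theta1, var1, x1, var0):
    """Posterior mean and variance of mu_1 after observing payoff x1 (Equation 5).

    theta1, var1: prior mean and variance of mu_1
    x1: observed payoff of the selected ad
    var0: observation noise variance sigma_0^2
    """
    total = var1 + var0
    new_theta = (var1 * x1 + var0 * theta1) / total
    new_var = var1 * var0 / total
    return new_theta, new_var


if __name__ == "__main__":
    # Prior N(100, 400), noise variance 100, observed payoff 130
    print(update_selected(100.0, 400.0, 130.0, 100.0))
```

With a prior variance four times the noise variance, the new mean moves four fifths of the way from the prior mean towards the observation, and the variance shrinks from 400 to 80.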
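10. Belief update - the correlated ad

    We also update the belief of non-selected ads,

    $$p\left(x_2 \mid x_1(t), \Psi(t)\right) = \int p\left(x_2 \mid \mu_2, x_1(t), \Psi(t)\right) p\left(\mu_2 \mid x_1(t), \Psi(t)\right) d\mu_2 \tag{6}$$

    With the linear Gaussian property,

    $$\mu_1 \mid \mu_2 \sim \mathcal{N}\left(\theta_1 \mid \mu_2,\ \sigma_1^2 \mid \mu_2\right) \tag{7}$$

    $$\theta_1 \mid \mu_2 = \theta_1 + \frac{\sigma_{1,2}}{\sigma_2^2}\left(\mu_2 - \theta_2\right), \qquad \sigma_1^2 \mid \mu_2 = \sigma_1^2 - \frac{\sigma_{1,2}^2}{\sigma_2^2} \tag{8}$$

    we obtain the new belief on a correlated ad,

    $$\mu_2 \mid x_1(t) \sim \mathcal{N}\left(\theta_2(t+1), \sigma_2^2(t+1)\right) \tag{9}$$

    $$\theta_2(t+1) = \theta_2(t) + \sigma_{1,2}\, \frac{x_1(t) - \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad \sigma_2^2(t+1) = \sigma_2^2(t) - \frac{\sigma_{1,2}^2}{\sigma_1^2(t) + \sigma_0^2} \tag{10}$$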
11. Belief update - expected payoff

    We also obtain the distribution of the payoff of the selected ad,

    $$X_1 \mid x_1(t), \Psi(t) \sim \mathcal{N}\left(\theta_1(t+1),\ \sigma_0^2 + \sigma_1^2(t+1)\right) \tag{11}$$

    and of the correlated ad,

    $$X_2 \mid x_1(t), \Psi(t) \sim \mathcal{N}\left(\theta_2(t+1),\ \sigma_0^2 + \sigma_2^2(t+1)\right) \tag{12}$$

    The final objective function is

    $$\pi^* = \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t) \quad \text{subject to} \tag{13}$$

    $$\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2} \tag{14}$$

    $$\sigma_{s(t+1)}^2(t+1) = \sigma_{s(t+1)}^2(t) - \frac{\sigma_{s(t),s(t+1)}^2}{\sigma_{s(t)}^2(t) + \sigma_0^2} \tag{15}$$
12. POMDP formulation and solution

    Figure: The POMDP model for the revenue optimisation problem. $(\theta(t), \Sigma(t))$ is the belief at some stage; $x(t)$ is the observation and reward; $s(t)$ is the action; $(\theta, \Sigma)$ is the hidden state. There is no state transition.
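Equations 14 and 15 apply the same correction to every ad that is correlated with the selected one. The sketch below applies them to all ads at once; the name `update_belief` is hypothetical, and the update of the off-diagonal covariances is an assumption (the standard Gaussian conditioning rule), since the slides only give the means and variances.

```python
def update_belief(theta, cov, s, x, var0):
    """Condition the Gaussian belief (theta, cov) on payoff x of the selected ad s.

    theta: list of belief means, one per ad
    cov: belief covariance matrix as a list of lists
    var0: observation noise variance sigma_0^2
    """
    n = len(theta)
    denom = cov[s][s] + var0
    # gain[j] = sigma_{s,j} / (sigma_s^2 + sigma_0^2)
    gain = [cov[j][s] / denom for j in range(n)]
    # Equation 14 for every ad j (j == s reduces to Equation 5)
    new_theta = [theta[j] + gain[j] * (x - theta[s]) for j in range(n)]
    # Diagonal entries follow Equation 15; off-diagonal entries are an assumption
    new_cov = [[cov[j][k] - gain[j] * cov[s][k] for k in range(n)] for j in range(n)]
    return new_theta, new_cov


if __name__ == "__main__":
    theta, cov = update_belief([100.0, 50.0], [[400.0, 120.0], [120.0, 225.0]], 0, 130.0, 100.0)
    print(theta)
    print(cov)
```

Observing a payoff of 130 for ad 0 raises its mean and, through the positive covariance, also raises the mean of ad 1 without ever displaying it.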
13. Value iteration and MAB approximation

    The value function can be expressed as

    $$s(t) = \arg\max_{s(t) \in N} V_{s(t)}(\Psi(t)) = \arg\max_{i \in N} \Big( \underbrace{\mathbb{E}(x_i)}_{\text{the expected immediate reward}} + \underbrace{\xi(\Psi(t), i)}_{\text{the expected future reward}} \Big) \tag{16}$$

    The exact solution using value iteration²:

    $$V^*(\theta, \Sigma, T) = \max_{s(1) \in N} \mathbb{E}\left[ X_{s(1)}(1) + V^*\left(\theta \mid X_{s(1)}(1),\ \Sigma \mid X_{s(1)}(1),\ T-1\right) \right] \tag{17}$$

    The approximation based on the multi-armed bandit³:

    $$\xi_{\text{UCB1-NORMAL}} = \sqrt{16 \cdot \frac{q_i - t_i\, \theta_i^2(t)}{t_i - 1} \cdot \frac{\ln(t-1)}{t_i}} \tag{18}$$

    ² R. E. Bellman (1957), "Dynamic Programming".
    ³ Auer, P. et al. (2002), "Finite-time analysis of the multi-armed bandit problem".
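The exploration term of UCB1-NORMAL, as defined by Auer et al. (2002), can be computed directly from three running statistics per ad. This is a minimal sketch: the function name `ucb1_normal_bonus` is hypothetical, `q_i` is taken to be the sum of squared payoffs of ad `i`, and the clamp at zero is an added guard against floating-point round-off.

```python
import math


def ucb1_normal_bonus(q_i, t_i, theta_i, t):
    """Exploration bonus of ad i under UCB1-NORMAL.

    q_i: sum of squared payoffs observed for ad i
    t_i: number of times ad i was selected (assumed >= 2)
    theta_i: current mean payoff estimate of ad i
    t: current time step (assumed >= 2 so that log(t - 1) is defined)
    """
    # (q_i - t_i * theta_i^2) / (t_i - 1) is the sample variance of the payoffs
    variance_term = max(q_i - t_i * theta_i ** 2, 0.0) / (t_i - 1)
    return math.sqrt(16.0 * variance_term * math.log(t - 1) / t_i)


if __name__ == "__main__":
    # Ad observed twice with payoffs 1 and 3: q_i = 10, mean = 2
    print(ucb1_normal_bonus(10.0, 2, 2.0, 11))
```

The bonus grows with the sample variance of the ad and shrinks as the ad is selected more often.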
14. Value iteration with Monte Carlo sampling⁴

    We use sampling to reduce the computational complexity.

        function VALUEFUNC(θ, Σ, t)
            array V ← 0                                   ▷ Expected reward vector.
            loop i ← 1 to N
                V[i] ← θ_i(t)                             ▷ Expected immediate reward.
                if t < T then
                    for all s in SAMPLE(θ, Σ) do
                        [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, s, i)   ▷ New belief after selecting i and observing s. Equations 13.
                        V[i] ← V[i] + (1/M) · VALUEFUNC(θ′, Σ′, t + 1)₀
                    end for
                end if
            end loop
            return [MAX(V), MAXINDEX(V)]
        end function

    ⁴ Thrun, S. (2000), "Monte Carlo POMDPs".
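The recursion above can be sketched in runnable form as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `value_func` and `update_belief` are hypothetical, observations are sampled from the predictive distribution N(θ_i, σ_i² + σ_0²), and the off-diagonal covariance update is the standard Gaussian conditioning rule.

```python
import math
import random


def update_belief(theta, cov, s, x, var0):
    """Gaussian belief after observing payoff x for the selected ad s."""
    n = len(theta)
    denom = cov[s][s] + var0
    gain = [cov[j][s] / denom for j in range(n)]
    new_theta = [theta[j] + gain[j] * (x - theta[s]) for j in range(n)]
    new_cov = [[cov[j][k] - gain[j] * cov[s][k] for k in range(n)] for j in range(n)]
    return new_theta, new_cov


def value_func(theta, cov, t, horizon, var0, n_samples, rng):
    """Return (best expected cumulative payoff, index of the best ad) from stage t."""
    values = []
    for i in range(len(theta)):
        v = theta[i]  # expected immediate reward
        if t < horizon:
            # Assumed sampling scheme: predictive distribution of ad i's payoff
            std = math.sqrt(max(cov[i][i], 0.0) + var0)
            future = 0.0
            for _ in range(n_samples):
                x = rng.gauss(theta[i], std)
                th2, cov2 = update_belief(theta, cov, i, x, var0)
                future += value_func(th2, cov2, t + 1, horizon, var0, n_samples, rng)[0]
            v += future / n_samples  # average over the sampled observations
        values.append(v)
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best


if __name__ == "__main__":
    rng = random.Random(0)
    cov = [[400.0, 120.0], [120.0, 225.0]]
    print(value_func([100.0, 50.0], cov, 1, 3, 100.0, 4, rng))
```

The cost grows as (N · n_samples)^(T − t), so the sketch is only practical for short horizons and few samples.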
15. Multi-armed bandit based approximation (cont.)

    The UCB1-NORMAL-COR algorithm:

        function PLAN(θ, Σ, Ψ(t))
            array V ← 0
            loop i ← 1 to N
                if t_i < 8 log t then                     ▷ t_i is the number of times ad i gets selected.
                    return i
                end if
            end loop
            [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, Ψ(t))          ▷ New belief of all ads with all available information. Equations 13.
            loop i ← 1 to N
                V[i] ← θ_i + sqrt(16 · (q_i − t_i θ_i²) / (t_i − 1) · ln(t − 1) / t_i)   ▷ Expected reward.
            end loop
            return [MAX(V), MAXINDEX(V)]
        end function
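A compact sketch of the selection step is given below. The function name `plan` and its argument layout are hypothetical; it assumes that `theta` already holds the belief means after the correlated update, that `sq_sums[i]` is the sum of squared payoffs of ad `i`, and that `t` is the current time step (at least 1 and larger than the total number of past selections).

```python
import math


def plan(theta, counts, sq_sums, t):
    """Return the index of the ad to display at time step t.

    theta: belief means, assumed already updated with all available information
    counts: counts[i] is the number of times ad i was selected
    sq_sums: sq_sums[i] is the sum of squared payoffs observed for ad i
    """
    # Forced exploration: any ad selected fewer than 8 log t times goes first.
    # The extra "t_i < 2" check keeps the variance estimate below well defined.
    for i, t_i in enumerate(counts):
        if t_i < 2 or t_i < 8 * math.log(t):
            return i
    scores = []
    for i, t_i in enumerate(counts):
        var_term = max(sq_sums[i] - t_i * theta[i] ** 2, 0.0) / (t_i - 1)
        bonus = math.sqrt(16.0 * var_term * math.log(t - 1) / t_i)
        scores.append(theta[i] + bonus)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Two ads, 50 selections each; ad 0 has the noisier payoff history
    print(plan([1.0, 1.1], [50, 50], [99.0, 60.5], 101))
```

In the example the ad with the lower mean is chosen because its larger payoff variance yields a larger exploration bonus.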
16. Experiment datasets

    Figure: advertisers, the ad network/exchange, and publishers; data obtained from the Google AdWords Traffic Estimator service.

    • publishers gain 68% of advertisers' spending (2003);
    • data was collected from 12/2011 to 05/2012;
    • 512 different keywords, 310 with non-zero mean payoff, 8 categories;
    • 20% for training and 80% for testing;
    • we consider each keyword to be an ad.
17. Competing algorithms

    We compare the following algorithms:
    • RANDOM policy, which selects candidates uniformly at random;
    • MYOPIC policy, based on the expected immediate reward;
    • UCB1 policy, which assumes independence between arms and makes no assumption about the reward distribution;
    • UCB1-NORMAL policy, which assumes independence between arms and Gaussian-distributed rewards;
    • VI-COR policy, which solves value iteration using Monte Carlo sampling; and
    • UCB1-NORMAL-COR policy, which considers the dependencies between candidates.
18. Results

    | Datasets    | MYOPIC | RANDOM | UCB1     | UCB1-N | VI-COR    | UCB1-N-COR |
    |-------------|--------|--------|----------|--------|-----------|------------|
    | Education   | 21.9   | 23.0   | 30.9     | 30.9   | **41.2*** | 27.6       |
    | Finance-1   | 38.5   | 27.8   | 40.9     | 26.4   | **44.5**  | 27.4       |
    | Finance-2   | 22.1   | 16.5   | 30.6     | 22.8   | **38.0*** | 22.9       |
    | Information | 14.1   | 12.9   | 27.8     | 15.9   | **29.4**  | 15.9       |
    | P&O         | 41.6   | 30.4   | 50.5     | 31.4   | **72.9*** | 63.3       |
    | Shopping-1  | 17.4   | 10.6   | **42.3** | 16.1   | 40.2      | 16.4       |
    | Shopping-2  | 29.9   | 14.5   | 34.3     | 75.3   | 52.9      | **79.2***  |
    | Shopping-3  | 9.7    | 4.3    | 21.9     | 18.3   | **27.3**  | 19.4       |
    | P&S         | 24.7   | 26.0   | 47.2     | 57.1   | **67.9*** | 59.9       |
    | Medical     | 30.5   | 19.6   | 52.7     | 32.2   | **58.0*** | 33.5       |

    Table: The cumulative payoffs are averaged over 8 chunks and then normalized w.r.t. the GOLDEN policy for a better representation. The highest cumulative payoff in each row is in bold, and marked with * if the difference from the second best is significant by the Wilcoxon signed-rank test. P&O is "People & organisations" and P&S is "Products & services".
19. Results (cont.)

    Figure: Cumulative payoff on the "People & organization" category, 5 candidates. Policies plotted: VI-COR, UCB1-Normal-COR, UCB1-Normal, UCB1, Golden, Myopic, Random.
20. Results (cont.)

    Figure: Comparison of accumulated payoffs on the 10 datasets (normalized cumulative payoff for Myopic, VI-Cor, UCB1-Normal and UCB1-Normal-Cor on Edu, F-1, F-2, Info, P&O, S-1, S-2, S-3, P&S, Med). VI-COR always performed better than MYOPIC and UCB1-NORMAL-COR always performed better than UCB1-NORMAL across all datasets.
21. Results (cont.)

    Figure: Special case: the daily payoff of two candidates ("best phones" and "term insurance") with a sudden change. Axes: daily payoff against day.
22. Results (cont.)

    Figure: The impact of the noise factor $\sigma_0^2$ for the situation in the previous figure. Axes: cumulative payoff (×10⁴) against the noise factor $\sigma_0^2$ (ticks from 10⁻² to 10⁴), for the Golden, Myopic, VI-COR and UCB1-Normal-COR policies.

    $$\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2}$$
23. Future works

    • correlated update: if ad a1 on webpage w1 was shown to user u1 and we observed its performance, what is the belief on the performance of ad a2 on webpage w2 when shown to user u2, with correlations known?
    • multiple ads with diversification (another exploration and exploitation dilemma);
    • better solution for our continuous POMDP problem.