Balancing User and Content Provider Goals in Recommender Systems

2021.5.20
Takuma Oda (織田拓磨)
DeNA Co., Ltd. + Mobility Technologies Co., Ltd.
The Web Conference 2021
Conference Report

2
Contents
01｜Conference Overview
02｜Paper Review: Impact Evaluation
03｜Paper Review: Recommender Systems
04｜Paper Review: Mobility

4
The Web Conference
● International conference in the fields of the Web and data mining
● Renamed from the International World Wide Web Conference (commonly known as WWW)
● Held fully remotely

5
Program Overview
● 355 research papers
● 9 (+5) keynotes
● 2 panels
● Web of Health: 6 talks
● Future of the Web: 5 talks
● 25 workshops
● 23 tutorials

6
Trends in Submitted Papers
1. Social Network Analysis and Graph Algorithms
2. Web Mining and Content Analysis
3. User Modeling and Personalization

7
Acceptance Rate
355 accepted papers / 1736 submissions (about 20.4%)

8
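(Pre-conference) Keynote: Jure Leskovec (Stanford University)
Mobility network models of COVID-19 explain inequities and inform reopening
https://www.nature.com/articles/s41586-020-2923-3
● Talk by Prof. Jure Leskovec, well known in the graph data mining community (co-author of Mining of Massive Datasets)
● Using SafeGraph data (mobile-app location data, anonymized and aggregated), estimated people's hourly mobility patterns => proposed a model that fits real-world infection rates with high accuracy
● Analyzed how the degree of economic (mobility) reopening affects infection rates
● Identified that a small number of POIs account for a large share of infections
● Correctly predicted that infection rates differ across socioeconomic groups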

9
Keynote: Krysta Svore (Microsoft Research)
Enabling the quantum revolution: pioneering advances to achieve quantum computing and impact at scale
https://www.youtube.com/watch?v=FkH3T7guZ6Y (from 03:03:00)
An accessible, application-oriented introduction to quantum computing
● Practical quantum advantage
○ The crossover time needs to be no more than a few weeks
○ Slow read: 10,000 Gbit/s vs. 1 Gbit/s => small data, big compute
○ Slow operation: peta vs. mega => need superquadratic speedup
● Application
○ Ground-state energy of certain molecules => carbon fixation

10
Tutorial & Workshop
● Mobility-related
○ 7th WebAndTheCity - Web Intelligence and Resilience in Smart Cities
○ Workshop LocWeb 2021 at The Web Conference 2021
○ Hands-on tutorial Flatland: Multi-Agent Reinforcement Learning on Trains
● Graph mining
● Causal inference
● Recommender systems
● Natural language processing, etc.

11
Mobility-related Papers

12
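Best Paper Award
Towards Facilitating Empathic Conversations in Online Mental Health Support: A Reinforcement Learning Approach
● Goal: improve "empathy" on online mental health support platforms
● Empathic rewriting: the task of editing a given response, by adding and deleting text, so that it becomes more empathic
● Requires a deep understanding of the conversational context and emotions while maintaining conversational quality
● Proposes a deep reinforcement learning approach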

14
Best Paper Award
● Beyond the usual automatic evaluation, it also outperforms the baselines by a wide margin on most metrics in human evaluation

15
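What Was Disappointing
● No information about the online format until about two weeks before the conference
● Poor communication tools for attendees
● The online platform (MiTeam) was hard to use
● Slow or no replies in chat
● Poor video and audio quality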

16
02 Paper Review: Impact Evaluation

17
URL: https://arxiv.org/abs/1901.10550

18
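Background
● In A/B tests, we often assign all users the single global treatment (a UI change, a recommender parameter change, etc.) that is best on average (global allocation)
● In general, individual users' treatment effects E[Y(1) − Y(0) | X = x] vary widely (heterogeneity of treatment effect)
⇒ Personalized approach
● Ride-hailing app example:
○ Treatment: the upper and lower bounds of the waiting time shown when the pickup pin is placed
○ Primary metric (objective): CVR
○ Secondary metrics (guardrails): ETA accuracy, cancellation rate, etc.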

19
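Overview
● Proposes a general framework that identifies cohorts with different HTE (heterogeneity of treatment effect) and optimizes treatment assignment
● By choosing a suitable treatment for each user, the goal is to raise the overall treatment effect and improve the experience of minority groups (more inclusive)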

20
Problem Setup
The HTE is assumed to be a normally distributed random variable

21
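Problem Setup
The problem can be split into two steps:
1. Heterogeneous Effect Estimation
Using data from a randomized experiment (a standard A/B test), identify cohorts with different treatment effects and estimate each cohort's treatment effect (HTE)
2. Optimization Solution
Solve an optimization problem to determine the optimal treatment for each member/cohort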

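22
Heterogeneous Effect Estimation
● Prof. Susan Athey's Causal Tree (Recursive partitioning for heterogeneous causal effects)
○ Estimates the HTE with a decision tree using randomized-experiment data
○ Data: (observed metric, treatment, features) for each unit
○ Objective:
The treatment effect is estimated by the average treatment effect within each leaf
Π: the learned tree
S: the sample used for treatment-effect estimation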
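The within-leaf estimator can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the learned tree Π is stood in for by a hypothetical function `days_leaf` that maps features to a leaf id, and each unit is assumed to be a `(y, w, x)` tuple with `w = 1` for treatment and `w = 0` for control.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

# One unit: (observed metric y, treatment indicator w in {0, 1}, features x)
Sample = Tuple[float, int, object]


def leaf_treatment_effects(
    samples: Iterable[Sample], leaf_of: Callable[[object], Hashable]
) -> Dict[Hashable, float]:
    """tau_hat per leaf = mean(y | treated, leaf) - mean(y | control, leaf)."""
    arms = defaultdict(lambda: {0: [], 1: []})
    for y, w, x in samples:
        arms[leaf_of(x)][w].append(y)
    # A leaf yields an estimate only if it contains both treated and control units.
    return {
        leaf: mean(groups[1]) - mean(groups[0])
        for leaf, groups in arms.items()
        if groups[0] and groups[1]
    }


def predict_effect(x, leaf_of, effects) -> Optional[float]:
    """tau_hat(x; S, Pi): the effect estimated for the leaf that x falls into."""
    return effects.get(leaf_of(x))


# Hypothetical one-split tree on a single numeric feature (days since sign-up).
def days_leaf(days):
    return "new" if days < 30 else "old"


samples = [(10, 1, 20), (6, 0, 25), (8, 1, 40), (8, 0, 50)]
effects = leaf_treatment_effects(samples, days_leaf)
print(effects)  # "new" leaf: 10 - 6 = 4, "old" leaf: 8 - 8 = 0
```

Passing S separately from the data used to grow the tree is what makes the estimate "honest": the same function can be called on a held-out estimation sample.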

23
Heterogeneous Effect Estimation: the Causal Tree objective, annotated
Reference: Asami-san's slides: https://speakerdeck.com/masa_asa/mian-qiang-hui-zhun-bei-zi-liao-bei-wang-causal-forest-and-r-learner?slide=14
Annotated terms: within-leaf variance of the training data; an unobservable term
By using a sample that is independent of the training sample for treatment-effect estimation, the splitting criterion can be rewritten in terms of observable data only
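As we understand the honest criterion in Athey and Imbens (2016), the unobservable term drops out and the score to maximize is (1/N_tr) Σ τ̂(X_i)² − (1/N_tr + 1/N_est) Σ_ℓ (S²_treat(ℓ)/p + S²_control(ℓ)/(1 − p)), where the S² are the within-leaf variances of the training data. A minimal Python sketch, assuming a constant treatment probability `p`, the `(y, w, x)` unit format, and a hypothetical `leaf_of` function describing a candidate partition:

```python
from collections import defaultdict
from statistics import mean, variance


def honest_split_score(train, leaf_of, n_est, p):
    """-EMSE_hat(S_tr, N_est, Pi) for a candidate partition; higher is better."""
    arms = defaultdict(lambda: {0: [], 1: []})
    for y, w, x in train:
        arms[leaf_of(x)][w].append(y)

    n_tr = len(train)
    fit_term = 0.0  # (1/N_tr) * sum over training units of tau_hat(X_i)^2
    var_term = 0.0  # sum over leaves of S2_treat / p + S2_control / (1 - p)
    for groups in arms.values():
        treated, control = groups[1], groups[0]
        if len(treated) < 2 or len(control) < 2:
            return float("-inf")  # sample variance needs >= 2 units per arm
        tau = mean(treated) - mean(control)
        fit_term += (len(treated) + len(control)) * tau ** 2 / n_tr
        var_term += variance(treated) / p + variance(control) / (1 - p)
    return fit_term - (1 / n_tr + 1 / n_est) * var_term


# Toy data: the effect is 0 when x = 0 and 10 when x = 1.
train = (
    [(y, 1, 0) for y in (1, 2, 1, 2)] + [(y, 0, 0) for y in (1, 2, 1, 2)]
    + [(y, 1, 1) for y in (11, 12, 11, 12)] + [(y, 0, 1) for y in (1, 2, 1, 2)]
)
split = honest_split_score(train, lambda x: x, n_est=16, p=0.5)
no_split = honest_split_score(train, lambda x: "root", n_est=16, p=0.5)
print(split, no_split)  # splitting on x scores higher than a single leaf
```

The variance penalty is what discourages leaves whose effect estimates would be noisy on the independent estimation sample.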

24
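The Causal Tree is extended as follows:
● Since there are J treatments and K + 1 metrics, separate Causal Trees are trained to estimate the J(K + 1) HTEs
● Merging Trees: takes the J(K + 1) cohort sets as input and outputs a single merged cohort set
[Figure: tree U0 with leaf effects −2, 3, 1 and tree U1 with leaf effects 0, 2 are merged into cohorts labeled (U0, U1) = (3, 0), (1, 2), (−2, 2), (−2, 0), (1, 0)]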
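One simple way to merge cohort sets is to intersect the leaves: each unit's merged cohort is the tuple of leaves it falls into across all trees, and only non-empty intersections survive (which is why the figure has five cohorts rather than 3 × 2 = 6). The trees below are hypothetical, chosen to reproduce the figure's labels; they label leaves by their effect values, which only works because the values within each tree are distinct.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def merge_cohorts(
    units: Sequence[object], trees: Sequence[Callable[[object], Hashable]]
) -> Dict[Tuple[Hashable, ...], List[object]]:
    """Merged cohort key = tuple of the leaf each unit falls into, one entry per tree."""
    merged: Dict[Tuple[Hashable, ...], List[object]] = defaultdict(list)
    for unit in units:
        merged[tuple(tree(unit) for tree in trees)].append(unit)
    return dict(merged)  # only non-empty leaf intersections appear


# Hypothetical trees on features (x1, x2); each leaf is labelled by its effect estimate.
def tree_u0(u):
    x1, x2 = u
    if x1 < 0:
        return -2
    return 3 if x2 < 0 else 1


def tree_u1(u):
    _, x2 = u
    return 0 if x2 < 1 else 2


units = [(1, -1), (1, 0.5), (1, 2), (-1, 0), (-1, 3)]
cohorts = merge_cohorts(units, [tree_u0, tree_u1])
print(sorted(cohorts))  # [(-2, 0), (-2, 2), (1, 0), (1, 2), (3, 0)]
```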

25
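Optimization Solution
Deterministic Optimization
SAA: sample average approximation (replaces the objective and the constraints with sample averages)
○ Can be solved with a linear programming solver
○ However, since it ignores the variance entirely, it is often not robust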
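To make the SAA step concrete, here is a toy instance with hypothetical numbers: plug the point estimates into the objective and the guardrail constraints and pick one treatment per cohort. The slide solves this with an LP solver; for a tiny instance with whole-cohort assignment, exhaustive search over assignments is swapped in so the sketch stays dependency-free.

```python
from itertools import product


def solve_saa(sizes, mean_effect, thresholds):
    """Pick one treatment per cohort to maximise the summed objective (metric 0)
    subject to each summed guardrail k (metrics 1..K) being >= thresholds[k - 1].

    sizes[i]: cohort size; mean_effect[i][j][k]: estimated effect of treatment j
    on metric k in cohort i (point estimates only, as in SAA).
    """
    n_treatments = len(mean_effect[0])
    best, best_value = None, float("-inf")
    for assign in product(range(n_treatments), repeat=len(sizes)):
        totals = [
            sum(sizes[i] * mean_effect[i][j][k] for i, j in enumerate(assign))
            for k in range(len(thresholds) + 1)
        ]
        feasible = all(totals[k + 1] >= c for k, c in enumerate(thresholds))
        if feasible and totals[0] > best_value:
            best, best_value = assign, totals[0]
    return best, best_value


sizes = [100, 50]
# mean_effect[cohort][treatment] = [objective, guardrail]; treatment 0 is control.
mean_effect = [
    [[0, 0], [2, -1]],
    [[0, 0], [1, 0.5]],
]
print(solve_saa(sizes, mean_effect, thresholds=[-30]))  # ((0, 1), 50)
```

Treating cohort 0 would raise the objective most but breaks the guardrail, so only cohort 1 is treated.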

26
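Optimization Solution
Stochastic Optimization
○ U is treated as a random variable following a normal distribution whose mean and variance are estimated by the causal tree
Annotation on the formula: estimated at each iteration by averaging J samples over all the data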
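One way to use the variance is a chance constraint: sample each effect U from N(mean, sd²) and require every guardrail to hold with probability at least 1 − α. The sketch below is an illustrative Monte Carlo variant we chose, not the paper's procedure (the slide's per-iteration averaging over J samples is not reproduced); the numbers, `alpha`, and function names are hypothetical.

```python
import random
from itertools import product


def guardrail_prob(assign, sizes, mean_effect, sd_effect, k, threshold,
                   n_samples=2000, seed=0):
    """Monte Carlo estimate of P(sum_i sizes[i] * U[i][j_i][k] >= threshold),
    with each U drawn independently from N(mean, sd^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        total = sum(
            sizes[i] * rng.gauss(mean_effect[i][j][k], sd_effect[i][j][k])
            for i, j in enumerate(assign)
        )
        hits += total >= threshold
    return hits / n_samples


def solve_chance_constrained(sizes, mean_effect, sd_effect, thresholds, alpha=0.05):
    """Maximise the expected objective s.t. each guardrail holds w.p. >= 1 - alpha."""
    n_treatments = len(mean_effect[0])
    best, best_value = None, float("-inf")
    for assign in product(range(n_treatments), repeat=len(sizes)):
        value = sum(sizes[i] * mean_effect[i][j][0] for i, j in enumerate(assign))
        if value <= best_value:
            continue
        if all(
            guardrail_prob(assign, sizes, mean_effect, sd_effect, k + 1, c) >= 1 - alpha
            for k, c in enumerate(thresholds)
        ):
            best, best_value = assign, value
    return best, best_value


sizes = [100, 50]
mean_effect = [[[0, 0], [2, -1]], [[0, 0], [1, 0.5]]]
# Hypothetical: cohort 1's guardrail effect under treatment 1 is very uncertain.
sd_effect = [[[0, 0], [0, 0]], [[0, 0], [0, 2]]]
print(solve_chance_constrained(sizes, mean_effect, sd_effect, thresholds=[-30]))
# ((0, 0), 0): the point-estimate choice (0, 1) meets the guardrail only ~70% of the time
```

With the same means, the point-estimate (SAA) solution would treat cohort 1; accounting for the variance makes the safer all-control assignment the only feasible one.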

28
Simulation Analysis
When personalization is effective: CT.ST is robust and well balanced overall
When a globally optimal policy exists: cohort-level policies perform about as well as the global policy

29
Notification System at LinkedIn

30
Personalized Capping Problem

31
03 Paper Review: Recommender Systems

32
URL: https://arxiv.org/abs/2105.02377

33
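Background
● Most recommender systems focus on maximizing user utility
● Content providers (e.g., streamers on a live-streaming service) are also strongly affected by recommendations
● With user-centric recommendation, a few popular providers monopolize exposure, while providers that are not yet established struggle to get attention and end up leaving the platform
⇒ Content Provider Aware (Multi-stakeholder) Recommender Systems
● e.g., food delivery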

34
Overview
● Formulates a recommendation problem that maximizes user and provider utility simultaneously as reinforcement learning
● Validates the proposed method through simulation
● Builds on Top-K Off-Policy Correction for a REINFORCE Recommender System (Chen et al. 2019) (see Okumura-san's blog post: https://medium.com/eureka-engineering/youtube-recommender-algorithm-survey-341a3aa1fbd6)

35
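Problem Definition
Annotations on the objective: summed over all providers and over users; averaged over the trajectories of different users (states s)
Assumptions (inferred from the experimental setup):
○ Over a fixed period T, all users are present on the platform and interact with providers synchronously
○ Users can consume only recommended content (consumption outside recommendations is not modeled)
○ Users joining or leaving the platform is not modeled
Goal: learn a policy that maximizes a weighted sum of user and provider utilities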

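36
RL Formulation
● user-state: e.g., the user's topic preferences
● provider-state: e.g., the provider's preferences for future content creation, its satisfaction with the platform
● action space: the available content
● reward / utility:
Whether a convincing reward can be designed likely depends heavily on the application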

37
No Provider Externality Assumption
The utility of a provider that was not recommended does not depend on which provider was recommended
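Under this assumption, recommending content c changes only c's provider utility, so the total provider utility equals a constant (everyone's not-recommended value) plus c's uplift, and the constant does not affect which c wins. A minimal sketch of a λ-weighted greedy choice built on that decomposition; `user_q`, `v_rec`, and `v_not` are hypothetical precomputed value estimates, and the (1 − λ)·user + λ·provider weighting follows the λ = 0 (user-only) to 1 (provider-only) convention used in the Experiments slides.

```python
def total_provider_value(chosen, v_rec, v_not):
    """Sum of all providers' values when `chosen` is recommended. Under no provider
    externality, every other provider keeps its not-recommended value."""
    return sum(v_rec[c] if c == chosen else v_not[c] for c in v_rec)


def pick_content(user_q, v_rec, v_not, lam):
    """Greedy choice maximising (1 - lam) * user value + lam * provider uplift."""
    uplift = {c: v_rec[c] - v_not[c] for c in v_rec}
    return max(user_q, key=lambda c: (1 - lam) * user_q[c] + lam * uplift[c])


user_q = {"a": 1.0, "b": 0.8}  # hypothetical user-side values
v_rec = {"a": 5.0, "b": 3.0}   # provider value if its content is recommended
v_not = {"a": 4.9, "b": 1.0}   # provider value if it is not recommended

for lam in (0.0, 0.5, 1.0):
    full = max(user_q, key=lambda c: (1 - lam) * user_q[c]
               + lam * total_provider_value(c, v_rec, v_not))
    assert full == pick_content(user_q, v_rec, v_not, lam)  # same argmax
    print(lam, pick_content(user_q, v_rec, v_not, lam))
```

Provider "a" is popular whether or not it is recommended (uplift 0.1), while "b" gains a lot from exposure (uplift 2.0), so any substantial λ shifts the choice to "b".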

38
User States via Utility Imputation
Train an RNN that estimates utility from user trajectory data (context, action, reward)
Convert the trajectory data into history-action-return tuples
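The conversion step can be sketched as follows: walk the trajectory backwards to get the discounted return from each step, then pair it with the history before that step and the action taken. The discount `gamma` and the exact history definition (here, all previous steps) are assumptions, since the slide does not specify them.

```python
from typing import List, Sequence, Tuple

Step = Tuple[object, object, float]  # (context, action, reward)


def to_history_action_return(
    trajectory: Sequence[Step], gamma: float = 0.9
) -> List[Tuple[List[Step], object, float]]:
    """For each step t, emit (steps before t, action at t, discounted return from t)."""
    returns = [0.0] * len(trajectory)
    g = 0.0
    for t in range(len(trajectory) - 1, -1, -1):  # accumulate returns backwards
        g = trajectory[t][2] + gamma * g
        returns[t] = g
    return [
        (list(trajectory[:t]), trajectory[t][1], returns[t])
        for t in range(len(trajectory))
    ]


traj = [("ctx0", "item_a", 1.0), ("ctx1", "item_b", 0.0), ("ctx2", "item_c", 2.0)]
for history, action, ret in to_history_action_return(traj, gamma=0.5):
    print(len(history), action, ret)  # returns: 1.5, 1.0, 2.0
```

The resulting tuples are the (input, target) pairs an RNN utility model would be trained on.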

39
Provider States via Utility Imputation
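Train an RNN that estimates utility from provider trajectory data (context, action, reward)
Providers change more slowly than users, so one provider time step is longer than a user time step
=> Action A is a summary of the interactions over multiple time steps:
○ Number of recommendations: m
○ Sum of user rewards: sum(r)
○ Weighted bag-of-words (a vector indicating which topics were well received)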
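The provider action summary above can be sketched as one aggregation pass over the interactions in a provider time step. Using the user reward as the weight for each topic is our assumption about how the weighted bag-of-words is formed.

```python
from typing import Iterable, List, Tuple


def summarize_provider_step(
    interactions: Iterable[Tuple[int, float]], n_topics: int
) -> Tuple[int, float, List[float]]:
    """Collapse one provider time step into (m, sum(r), reward-weighted topic vector).

    interactions: (topic id of the recommended content, user reward) for every time
    this provider's content was recommended during the (longer) provider time step.
    """
    m, total_reward = 0, 0.0
    topic_vec = [0.0] * n_topics
    for topic, reward in interactions:
        m += 1
        total_reward += reward
        topic_vec[topic] += reward  # well-received topics accumulate more weight
    return m, total_reward, topic_vec


print(summarize_provider_step([(0, 1.0), (2, 0.5), (0, 0.0)], n_topics=3))
# (3, 1.5, [1.0, 0.0, 0.5])
```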

41
Simulated Environment
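● Content:
○ topic ~ categorical
○ quality ~ truncated normal
● User Updates:
○ User u's topic preference at time t is represented as a vector
○ The higher the user's sensitivity and the content's quality, the more the preference shifts toward the recommended content's topic
● Content Provider Updates:
○ Provider c's topic preference (for content creation) at time t is represented as a vector
○ The provider leaves when its Satisfaction function falls below a threshold (m: number of recommendations, r: user feedback)
○ The higher the provider's sensitivity and the user feedback on its recommended content, the more its preference shifts toward that content's topic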
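The update rules on this slide can be mimicked with a convex-combination step whose size grows with sensitivity times the signal (quality for users, user feedback for providers). This rule is a hypothetical stand-in: the slide describes only the direction of the effect, not the paper's equations.

```python
from typing import List


def shift_preference(pref: List[float], topic: int, sensitivity: float,
                     signal: float) -> List[float]:
    """Move a topic-preference distribution toward `topic`.

    Step size = sensitivity * signal, clipped to [0, 1]. A convex combination with
    the one-hot vector of `topic` keeps the preferences summing to 1.
    """
    step = min(max(sensitivity * signal, 0.0), 1.0)
    return [(1 - step) * p + (step if i == topic else 0.0) for i, p in enumerate(pref)]


def provider_stays(satisfaction: float, threshold: float) -> bool:
    """A provider leaves the platform once its satisfaction falls below the threshold."""
    return satisfaction >= threshold


# User with sensitivity 0.5 sees content of quality 0.8 on topic 0.
print(shift_preference([0.5, 0.5], topic=0, sensitivity=0.5, signal=0.8))  # ~[0.7, 0.3]
```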

42
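Content Provider Satisfaction Design
How the Satisfaction function for the simulation is chosen
○ Compares a linear function and a concave function (log)
○ log better reflects reality, because provider satisfaction saturates above a certain level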
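The saturation argument can be checked numerically: with a linear function every extra unit of feedback adds the same satisfaction, while with log each extra unit adds less. Taking cumulative user feedback as the single input (and `log1p` to keep the value at 0 defined) is a hypothetical simplification of the slide's Satisfaction(m, r).

```python
import math


def satisfaction_linear(total_feedback: float, coef: float = 1.0) -> float:
    # Constant marginal gain: satisfaction never saturates.
    return coef * total_feedback


def satisfaction_log(total_feedback: float) -> float:
    # Concave: each extra unit of feedback adds less than the previous one.
    return math.log1p(total_feedback)


for x in range(4):
    gain_lin = satisfaction_linear(x + 1) - satisfaction_linear(x)
    gain_log = satisfaction_log(x + 1) - satisfaction_log(x)
    print(x, gain_lin, round(gain_log, 3))
```

The per-provider `coef` in the linear version corresponds to the "different proportionality coefficient for each provider" setting examined in the Experiments.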

43
Experiments
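● When is EcoAgent beneficial?
○ Simulated with Satisfaction: log and λ ranging from 0 (user-only) to 1 (provider-only)
○ As expected, a higher λ yields higher provider utility and a higher provider survival rate
○ With an appropriate λ, user utility is also higher than with λ = 0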

44
Experiments
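● When does EcoAgent lead to an unhealthy outcome?
○ Simulated with a linear Satisfaction (with a different proportionality coefficient for each provider)
○ Providers with large coefficients get recommended => utility increases
○ The total number of surviving providers decreases (because providers with high coefficients keep being recommended even after they become popular) => to maintain the number of surviving providers, another metric (e.g., provider variability uplift) must be considered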

45
04 Paper Review: Mobility

47
Ride-hailing driver modeling
○ Effective modeling of road network graph structures
○ Interactions of a large number of agents (hundreds or more)
○ Robustness to changes in environmental dynamics and data noise
Overview
Our Goal
Imitating passenger-seeking behaviors of multiple taxis in a road network with unknown dynamics

48
MDP Formulation
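[Figure: MDP transition diagram]
○ State s_t = s, action a_t = a: the empty vehicle moves to the next road s'
○ With pick-up probability ρ_{s'}, it picks up a passenger at s' (o_{t+1} = 1) and next becomes empty at a ride destination s'' drawn from the ride destination distribution d_{s',s''}
○ Otherwise (probability 1 − ρ_{s'}), it keeps cruising empty from s_{t+1} = s'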

49
Pickup Probability Modeling
● Model each road as an independent queue
● Estimate λ_s and σ_s by maximum likelihood estimation
μ_s: service rate (= traffic flow)
λ_s: customer arrival rate
σ_s: dropout rate
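The slide does not give the queueing formula, so the following is only an illustrative model of our own: waiting customers on road s form a birth-death chain with arrivals at rate λ_s and, with k customers waiting, departures at rate μ_s + σ_s·k (a passing empty taxi serves one customer, and each waiting customer drops out at rate σ_s). Because taxis pass as a Poisson stream, they see the stationary distribution, so the pickup probability is ρ_s = 1 − π_0.

```python
def pickup_probability(lam: float, mu: float, sigma: float, n_max: int = 200) -> float:
    """P(an empty taxi passing the road finds at least one waiting customer).

    Detailed balance for the birth-death chain gives
    pi_k = pi_{k-1} * lam / (mu + sigma * k); the chain is truncated at n_max.
    """
    if mu + sigma <= 0:
        raise ValueError("need mu + sigma > 0")
    weights = [1.0]  # unnormalised pi_0, pi_1, ...
    for k in range(1, n_max + 1):
        weights.append(weights[-1] * lam / (mu + sigma * k))
    return 1.0 - weights[0] / sum(weights)


print(pickup_probability(lam=1.0, mu=2.0, sigma=0.0))  # ~0.5, the M/M/1 value lam / mu
print(pickup_probability(lam=1.0, mu=1.0, sigma=0.5))
```

More empty-taxi flow (larger μ_s) lowers each taxi's chance of a pickup, which is the coupling between agents that the equilibrium formulation has to account for.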

50
Multi-agent RL Objective
● An agent's policy depends on other agents only through the expected visitation count, i.e., the traffic flow
● The multi-agent policy learning problem can be formulated as:
Terms in the formula: reward function (which we aim to learn); entropy regularization; flow

52
Flow Computation
● Shared driver policy:
● State/action visitation count (traffic flow):
Initial state distribution (= drop-off distribution)

53
Equilibrium Value Iteration

54
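[Figure: the equilibrium value iteration loop. Value Iteration produces the policy π; Policy Propagation, starting from the initial state distribution µ*, produces the visitation count µ; Model Update turns µ into the pickup probability ρ, which feeds back into Value Iteration]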
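The loop can be sketched on a three-road toy network. Every modelling choice below is our assumption, made only to show the loop's structure: the action is the next road, a pickup yields a fixed bonus `r_pick` and ends the empty trip, the policy comes from entropy-regularised (soft) value iteration, the flow is used directly as the service rate μ in the illustrative birth-death pickup model, and ρ is damped between rounds. It is not the paper's algorithm.

```python
import math
from collections import defaultdict


def pickup_probability(lam, mu, sigma, n_max=200):
    # Illustrative birth-death queue: rho = 1 - pi_0 (sigma > 0 avoids division by 0).
    weights = [1.0]
    for k in range(1, n_max + 1):
        weights.append(weights[-1] * lam / (mu + sigma * k))
    return 1.0 - weights[0] / sum(weights)


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(neighbors, reward, rho, r_pick, gamma=0.9, n_iter=100):
    """Value Iteration step: soft Q-values over next roads, softmax policy."""
    v = {s: 0.0 for s in neighbors}
    for _ in range(n_iter):
        q = {s: {s2: reward[s2] + rho[s2] * r_pick + (1 - rho[s2]) * gamma * v[s2]
                 for s2 in nbrs} for s, nbrs in neighbors.items()}
        v = {s: logsumexp(list(q[s].values())) for s in neighbors}
    # pi(s2 | s) = exp(Q(s, s2) - V(s)); each row sums to 1.
    return {s: {s2: math.exp(q[s][s2] - v[s]) for s2 in q[s]} for s in neighbors}


def propagate_flow(init, policy, rho, horizon):
    """Policy Propagation step: expected visits of empty vehicles from mu*."""
    visits = defaultdict(float, init)
    current = dict(init)
    for _ in range(horizon):
        nxt = defaultdict(float)
        for s, mass in current.items():
            for s2, p in policy[s].items():
                nxt[s2] += mass * p
        for s2, mass in nxt.items():
            visits[s2] += mass
        # Vehicles that pick up a passenger leave the empty-vehicle flow.
        current = {s2: mass * (1 - rho[s2]) for s2, mass in nxt.items()}
    return dict(visits)


def equilibrium(neighbors, reward, lam, sigma, init, r_pick=1.0, horizon=20,
                n_outer=50, damping=0.5):
    """Alternate value iteration, policy propagation and model update (rho from flow)."""
    rho = {s: 0.5 for s in neighbors}
    for _ in range(n_outer):
        policy = soft_value_iteration(neighbors, reward, rho, r_pick)
        flow = propagate_flow(init, policy, rho, horizon)
        new_rho = {s: pickup_probability(lam[s], flow.get(s, 0.0), sigma[s])
                   for s in neighbors}
        rho = {s: damping * rho[s] + (1 - damping) * new_rho[s] for s in neighbors}
    return policy, flow, rho


neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
reward = {"A": 0.0, "B": -0.1, "C": 0.0}   # hypothetical per-road rewards
lam = {"A": 1.0, "B": 0.5, "C": 2.0}       # hypothetical customer arrival rates
sigma = {"A": 0.5, "B": 0.5, "C": 0.5}     # hypothetical dropout rates
policy, flow, rho = equilibrium(neighbors, reward, lam, sigma, init={"A": 1.0, "C": 1.0})
print(rho)
```

In the IRL setting, `reward` is what gets learned; this loop would then sit inside the learning procedure to produce the equilibrium flow for the current reward estimate.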

56
Equilibrium Inverse Reinforcement Learning

57
Equilibrium Inverse Reinforcement Learning

58
Experimental Data
● Area:
○ The most densely populated area in Yokohama, Japan
● Trajectory:
○ (driver id, trip id, latitude, longitude, timestamp) of empty vehicles
○ Linked to the road network by map-matching
● Road network:
○ 10,765 nodes and 18,649 edges

59
Experimental Data
● Divided into 3 groups:
○ Train: 2019-07-01 ~ 2019-09-23 (12 weeks)
○ 19Dec: 2019-12-12 ~ 2020-02-06 (8 weeks, including the winter holiday season)
○ 20Apr: 2020-04-01 ~ 2020-04-29 (4 weeks, during the most severe decline in taxi demand due to COVID-19)
● Treated each 30-minute period as a different context (weekdays, 7am-10pm)
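The context definition maps each timestamp to one of 30 weekday slots (07:00-07:30 is slot 0, 21:30-22:00 is slot 29). A minimal sketch; the slot numbering is our own convention, and public holidays are not excluded here since the slide does not say how they were handled.

```python
from datetime import datetime
from typing import Optional

START_MIN, END_MIN, SLOT_MIN = 7 * 60, 22 * 60, 30


def context_id(ts: datetime) -> Optional[int]:
    """Return the 30-minute context index for a weekday timestamp, else None."""
    if ts.weekday() >= 5:  # Saturday and Sunday are outside the study window
        return None
    minutes = ts.hour * 60 + ts.minute
    if not START_MIN <= minutes < END_MIN:
        return None
    return (minutes - START_MIN) // SLOT_MIN


print(context_id(datetime(2019, 7, 1, 7, 0)))    # 0 (Monday 07:00)
print(context_id(datetime(2019, 7, 1, 21, 59)))  # 29
```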

60
Evaluation
● Baselines
○ Opt: shortest-time policy to pick-up (flow-independent)
○ SE-Opt: shortest-time policy to pick-up (equilibrium)
○ Tr-Expert: expert policy estimated from simple statistics of the training dataset
● Procedure
1. (SEIRL only) Learn the cost function from the training dataset
2. Estimate the equilibrium policy for each context (every 30 minutes between 7:00 and 22:00) in each dataset (19Dec, 20Apr)
3. Compute the equilibrium visitation count (flow) by repeating the policy propagation
4. Compare the estimated flows with the expert flow using the Mismatch Distance Ratio
