Reward constrained interactive recommendation with natural language feedback noani

Reward-Constrained Interactive Recommendation
with Natural Language Feedback
2020. 02. 24.
Jeong-Gwan Lee
1
"Text-Based Interactive Recommendation via Constraint-Augmented Reinforcement Learning." NeurIPS 2019
(Duke University, Samsung Research America, University at Buffalo)

2
Table of contents
● Visual Item Interactive Recommendation
● Non-Natural Language Feedback
● Natural Language Feedback
● Dataset and Setup
● MDP & Constrained MDP
● Recommendation as MDP
● Reward Constrained Recommender Model
● Model Details (Feature Extractor, Discriminator, Recommender)
● Reward function
● Recommendation as Constrained MDP
● Model Training
● Evaluation
● Conclusion

3
Visual Item Interactive Recommendation
Recommender systems seek to interact with users
in order to adapt to their preferences over time.
• Non-Natural Language Feedback
• Clicking Data
• Updated Rating
These signals carry little information about complex user attitudes.
[Figure: two examples of non-natural-language feedback over Round 1 and Round 2; values shown: 0.2, 0.2, 0.6, 0.8]

4
Visual Item Interactive Recommendation
Text-based recommendation provides richer user feedback.
• Natural Language Feedback (Not dialogue-based)
This paper targets this setting.
[Figure: exchange between Recommender and Seeker]

5
Visual Item Recommendation
with Natural Language Feedback Setting
UT-Zappos50K
• A shoe dataset consisting of 50,025 shoe images.
• Samples
• Labels

6
Visual Item Recommendation
with Natural Language Feedback Setting
UT-Zappos50K
• Rich attribute data
1. shoes category(4) = {Shoes, Boots, Sandals, Slippers}
2. shoes subcategory(21) = {Oxfords, MidCalf, Heel, Ankle,…}
3. heel height(7) = {flat, Under 1inch, 1~2inch, 2~3inch,…}
4. closure(18) = {leather, padded, removable,…}
5. gender(8) = {men, women, boys, girls,…}
6. toe style(17) = {Capped, Round, Square,…}
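As a rough illustration of the attribute data above, the six attribute groups can be treated as a multi-discrete label space and encoded as concatenated one-hot vectors. Only the group sizes come from the slide; the names `ATTRIBUTE_SIZES` and `one_hot_item` and the value indices are assumptions made for this sketch.

```python
# Group sizes are taken from the slide; names and index conventions are assumptions.
ATTRIBUTE_SIZES = {
    "category": 4,
    "subcategory": 21,
    "heel_height": 7,
    "closure": 18,
    "gender": 8,
    "toe_style": 17,
}


def one_hot_item(value_indices):
    """Concatenate one one-hot vector per attribute group.

    value_indices maps each attribute name to the index of its value.
    """
    vector = []
    for name, size in ATTRIBUTE_SIZES.items():
        index = value_indices[name]
        if not 0 <= index < size:
            raise ValueError(f"{name} index {index} outside [0, {size})")
        vector.extend(1.0 if i == index else 0.0 for i in range(size))
    return vector


example = one_hot_item({
    "category": 0, "subcategory": 3, "heel_height": 2,
    "closure": 5, "gender": 1, "toe_style": 4,
})
print(len(example))  # 75 = 4 + 21 + 7 + 18 + 8 + 17
```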

7
Dataset and Setup
User simulator
• UT-Zappos50K does not include user comments on attributes, so there is
no ground-truth feedback.
1. For 10,000 pairs of a recommended item and a desired item,
real-world sentences are collected from annotators.
2. From these sentences, the authors derive several sentence templates and
synthesize 20,000 labeled sentences by filling the templates
with attribute labels.
3. They train a Seq2seq-based user simulator.
(input: the difference in one attribute value between two items,
output: a sentence describing the visual attribute difference)
Template examples
• Show me more shoes with round toe.
• Recommended: Gender = Men; desired: Gender = Women → I prefer shoes for women.
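The template-filling step can be pictured with a small sketch. The two templates are the example sentences shown on this slide; the function names, the template keys, and the rule of commenting on the first differing attribute are assumptions for illustration, not the authors' implementation (which additionally trains a Seq2seq model).

```python
# Templates are the two example sentences from the slide; keys are hypothetical.
TEMPLATES = {
    "toe_style": "Show me more shoes with {value} toe.",
    "gender": "I prefer shoes for {value}.",
}


def attribute_difference(recommended, desired):
    """Return (attribute, desired_value) for the first attribute that differs."""
    for name, wanted in desired.items():
        if recommended.get(name) != wanted:
            return name, wanted
    return None  # the recommended item already matches


def synthesize_feedback(recommended, desired):
    """Fill a template with the desired value of one differing attribute."""
    difference = attribute_difference(recommended, desired)
    if difference is None:
        return None
    name, wanted = difference
    return TEMPLATES[name].format(value=wanted.lower())


recommended = {"gender": "Men", "toe_style": "Round"}
desired = {"gender": "Women", "toe_style": "Round"}
print(synthesize_feedback(recommended, desired))  # I prefer shoes for women.
```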

8
Reward Constrained Recommendation
The authors propose Reward Constrained Recommendation (RCR),
which sequentially incorporates constraints from previous
feedback.
• A constraint-augmented RL problem setting
• A learnable discriminator to detect violations of user
preferences in an adversarial manner

9
MDP & Constrained MDP
MDP(Markov Decision Process)
Constrained MDP
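For reference, the standard textbook formulations of the two problems are sketched below. The notation (reward R, constraint cost C, discount γ, threshold α) is an assumption and may differ from the paper's.

```latex
% MDP: tuple and objective (standard formulation; notation assumed)
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
\max_{\pi} \; J_R(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\Big]

% Constrained MDP: same objective, subject to a bound on the expected constraint cost
\max_{\pi} \; J_R(\pi) \quad \text{s.t.} \quad
J_C(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} C(s_t, a_t)\Big] \le \alpha
```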

10
Recommendation as MDP
The recommendation-feedback loop can be modeled abstractly as an MDP.
[Figure: Recommender and Seeker interacting over four rounds; round t shows state s_t, action a_t, feedback x_t, and reward r_t (marked "r_t?")]

11
Reminder: dataset
UT-Zappos50K
• Rich attribute data (shoes category(4), shoes subcategory(21), heel
height(7), closure(18), gender(8) and toe style(17))
• Samples
• Labels

12
Reward Constrained Recommender Model
Feature Extractor (extract features of feedback, recommended items)
Recommender (predict attributes, match, and recommend)
Discriminator (prevent constraint violation)

13
Feature Extractor
Visual Encoder = ResNet50[1] + AttrNet (pretrained)
Textual Encoder = Embedding + LSTM + FC
[1] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016.

14
Feature Extractor
Attributes (at training time)
• Cat: Shoes
• SubCat: Dress shoes
• HeelHei.: X
• Closure: …

15
Feature Extractor
[Figure: Visual Encoder = ResNet50 and AttrNet, concatenated; example attributes Cat: Shoes, HeelHei.: X, Closure: …]

16
Feature Extractor
[Figure: Visual Encoder (ResNet50, AttrNet, Concat) with example attributes Cat: Shoes, HeelHei.: X, Closure: …; Attribute Net detail: ResNet features → AttrNet → Category(4), SubCategory(21), Heel Height(7), …]

17
Recommender
Policy π_θ samples attribute values and recommends the item closest to them
under Euclidean distance in the visual attribute space.
Feature Representation

18
Recommender
Policy π_θ with multi-discrete action space
[Figure: per-attribute FC layers and softmax outputs for Category(4), SubCategory(21), Heel Height(7), …, followed by categorical sampling]

19
Recommender
[Figure: Visual Encoder (ResNet50, AttrNet) feeding per-attribute FC layers and softmax outputs for Category(4), SubCategory(21), Heel Height(7), …, followed by categorical sampling]

20
Recommender
[Figure: Visual Encoder (ResNet50, AttrNet) feeding per-attribute FC layers and softmax outputs, followed by categorical sampling]
Categorical Sampling Results
• Category = shoes: [1,0,0,0]
• SubCat = heel: [0,0,0,1,….]
• Heel.H = 3 inch.: [0,0,1,0,….]
• …
Distance-based Matching (Euclidean distance)
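The sampling-then-matching step can be sketched in plain Python. This is a simplified illustration under stated assumptions: the logits are random stand-ins for the FC-layer outputs, the two-item catalogue is hypothetical, and items are represented by exact one-hot attribute vectors rather than by learned AttrNet outputs.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_attributes(logits_per_attribute, rng):
    """Sample one value index per attribute head (multi-discrete action)."""
    sampled = []
    for logits in logits_per_attribute:
        probs = softmax(logits)
        sampled.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return sampled


def to_one_hot(indices, sizes):
    """Concatenate one one-hot vector per sampled attribute value."""
    vector = []
    for index, size in zip(indices, sizes):
        vector.extend(1.0 if i == index else 0.0 for i in range(size))
    return vector


def nearest_item(query, item_vectors):
    """Return the id of the item closest to the query in Euclidean distance."""
    return min(item_vectors, key=lambda item_id: math.dist(query, item_vectors[item_id]))


sizes = [4, 21, 7]  # Category, SubCategory, Heel Height (from the slide)
rng = random.Random(0)
# Hypothetical head outputs; in the model they come from FC layers.
logits = [[rng.gauss(0.0, 1.0) for _ in range(size)] for size in sizes]
sampled = sample_attributes(logits, rng)
query = to_one_hot(sampled, sizes)
# Hypothetical two-item catalogue in the same attribute space.
catalogue = {"item_a": to_one_hot([0, 3, 2], sizes), "item_b": to_one_hot([1, 0, 0], sizes)}
print(sampled, nearest_item(query, catalogue))
```

Sampling from the softmax outputs, rather than taking the argmax, is what lets the policy explore different attribute combinations across rounds.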

21
Reward function
Reward: the visual and attribute similarity between the
recommended and desired items.
• The recommended item should become more similar to the
desired item as the interaction proceeds.
• The goal is to minimize the visual and attribute differences.
• A weighting coefficient keeps the scales of the two distances similar.
• If the system fails to find the desired item within 50 iterations,
it receives an extra reward of -3 as a penalty.
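A minimal sketch consistent with the bullets above, assuming the reward is the negative of a weighted sum of a Euclidean visual distance and a count of mismatched attributes. The weight `attr_weight`, the function names, and the exact way the iteration limit is checked are assumptions, not the paper's formula.

```python
import math

MAX_ITERATIONS = 50     # from the slide
FAILURE_PENALTY = -3.0  # from the slide


def step_reward(rec_visual, des_visual, rec_attrs, des_attrs, attr_weight=1.0):
    """Negative visual distance plus weighted attribute mismatch count.

    attr_weight is a hypothetical scale factor; its value is an assumption.
    """
    visual_distance = math.dist(rec_visual, des_visual)
    attribute_mismatches = sum(1 for r, d in zip(rec_attrs, des_attrs) if r != d)
    return -(visual_distance + attr_weight * attribute_mismatches)


def session_reward(base_reward, iteration, found):
    """Add the extra penalty when the item is still not found at the iteration limit."""
    if not found and iteration >= MAX_ITERATIONS:
        return base_reward + FAILURE_PENALTY
    return base_reward


r = step_reward([0.0, 0.0], [3.0, 4.0], ["Boots", "Men"], ["Boots", "Women"], attr_weight=2.0)
print(r)  # -(5.0 + 2.0 * 1) = -7.0
print(session_reward(r, iteration=50, found=False))  # -10.0
```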

22
Why are explicit constraints needed?
RL algorithms that do not consider constraints easily violate
preferences from past feedback, because they must explore new items
to improve further.
• Success case
• Failure case
[Figure: Recommender and Seeker exchanges for each case]

23
Discriminator
Discriminator C_φ outputs whether the recommended item
violates the user's comments.
Feedback History
• …
• x_{t-1}: I prefer leather.
• x_t: I prefer high heel.

24
Collecting (non-)violation distribution
One user session
User session finished!

25
One user session
Non-violation pair

26
One user session
Violation pair

27
One user session
Non-violation pair

28
Discriminator
The discriminator serves as the constraint function.
• Discriminator training
• C_φ(s, a) is pushed toward 1 for a violation pair.
• C_φ(s, a) is pushed toward 0 for a non-violation pair.
[Figure: violation pair vs. non-violation pair]
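The 1/0 targets above correspond to an ordinary binary classification objective. The sketch below uses mean binary cross-entropy as an assumed loss; the function name and the example scores are hypothetical, and the paper's exact objective may differ.

```python
import math


def discriminator_loss(predictions, labels, eps=1e-12):
    """Mean binary cross-entropy; label 1 = violation pair, 0 = non-violation pair."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical discriminator outputs C(s, a) for two violation and two non-violation pairs.
good = discriminator_loss([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
bad = discriminator_loss([0.1, 0.2, 0.9, 0.8], [1, 1, 0, 0])
print(good < bad)  # True
```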

29
The discriminator is updated after each user session.
It cannot be pretrained.
• Judging whether a violation occurred requires sequential feedback.
• The dataset contains no sequential feedback
(it comes only from the user simulator).
One user session
User session finished!

30
Reminder: Reward Constrained Recommender Model
Feature Extractor (extract features of feedback, rec. items)
Discriminator C_φ(s, a) (prevent constraint violation)
Recommender π_θ(a|s) (predict attributes, match, and recommend)

31
Recommendation as Constrained MDP
Directly solving the constrained optimization problem is difficult,
so Lagrange relaxation is used to transform it into a dual problem.
• Primal problem
• Dual problem(refer to Appendix: Lagrange Relaxation)
• Lagrangian function
• Relaxed objective
Lagrange multiplier
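The items listed above follow the usual Lagrange-relaxation pattern, sketched here in generic form. The symbols J_R (expected reward), J_C (expected constraint cost), and the threshold α are assumed notation and may not match the paper's exactly.

```latex
% Primal problem (constrained objective)
\max_{\theta} \; J_R(\pi_\theta) \quad \text{s.t.} \quad J_C(\pi_\theta) \le \alpha

% Lagrangian function with multiplier \lambda \ge 0
L(\lambda, \theta) = J_R(\pi_\theta) - \lambda \, \big( J_C(\pi_\theta) - \alpha \big)

% Relaxed (dual) objective: a saddle-point problem
\min_{\lambda \ge 0} \; \max_{\theta} \; L(\lambda, \theta)
```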

32
Recommendation as Constrained MDP
The goal is to find a saddle point,
which can be approximated by alternating gradient descent and ascent.
The reward function with constraints penalizes the policy for violations.
λ is also optimized so that the constraints are satisfied.
1) If violations happen, λ increases to penalize the policy.
2) If there is no violation, λ decreases to give the policy more reward.
Reward function with Constraints
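The interplay between the penalized reward and λ can be sketched as follows. This is a generic dual-ascent illustration, not the authors' code: the function names, the step size, and the clipping bound `lam_max` are assumptions.

```python
def penalized_reward(reward, violation_score, lam):
    """Reward seen by the policy: task reward minus lam times the constraint signal."""
    return reward - lam * violation_score


def update_multiplier(lam, mean_violation, alpha, step_size, lam_max):
    """One dual step: lam grows when violations exceed alpha and shrinks otherwise.

    The result is clipped to [0, lam_max].
    """
    lam = lam + step_size * (mean_violation - alpha)
    return min(max(lam, 0.0), lam_max)


lam = 0.0
# Hypothetical average violation scores over successive user sessions.
for mean_violation in [0.8, 0.6, 0.1, 0.0]:
    lam = update_multiplier(lam, mean_violation, alpha=0.2, step_size=0.1, lam_max=1.0)
    print(round(lam, 3))
```

In this toy run λ rises while the violation score stays above α and falls once it drops below, which mirrors cases 1) and 2) above.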

33
Model Training
Reward Constrained Recommendation Process
• Alternately train the discriminator C_φ and the recommender π_θ.
• Policy update: a projection operator keeps training stable
by updating the parameters within a trust region [1].
• Multiplier update: a second operator projects λ into the range [0, λ_max].
[1] Schulman, John, et al. "Trust region policy optimization." International conference on machine
learning. 2015.
One user session

34
Evaluation
SR@K : Success Rate after K interactions
NI : Number of user Interactions before success
NV : Number of Violated attributes compared with the desired
attributes of users
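The three metrics can be computed from session logs roughly as below. The log format (one success step per session, `None` on failure) and the choice to count failed sessions as `max_interactions` in NI are assumptions made for this sketch.

```python
def success_rate_at_k(success_steps, k):
    """SR@K: fraction of sessions that succeeded within k interactions.

    success_steps holds, per session, the interaction count at success or None.
    """
    hits = sum(1 for step in success_steps if step is not None and step <= k)
    return hits / len(success_steps)


def mean_interactions(success_steps, max_interactions):
    """NI: average interactions before success; failures count as max_interactions."""
    counts = [step if step is not None else max_interactions for step in success_steps]
    return sum(counts) / len(counts)


def violated_attributes(recommended, desired):
    """NV for one recommendation: attributes that differ from the desired item."""
    return sum(1 for name in desired if recommended.get(name) != desired[name])


steps = [3, None, 12, 7]  # hypothetical sessions; None = desired item never found
print(success_rate_at_k(steps, k=10))  # 0.5
print(mean_interactions(steps, max_interactions=50))  # 18.0
print(violated_attributes({"gender": "Men", "toe": "Round"}, {"gender": "Women", "toe": "Round"}))  # 1
```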
λ increases in the early stage (as violations increase)
and then becomes more stable.
A discriminator weight of λ ≈ 0.04 is learned automatically.

35
Evaluation
RL baseline: ignores the constraints.
RL + naive constraints: uses a fixed Lagrange multiplier λ.
• All models are trained for 100,000 iterations (user sessions)
• Seen : training data
• Unseen : test data
• Averaged over 100 sessions with standard error
The learned constraint (discriminator) has better generalization.

36
Conclusion
The authors propose Reward Constrained Recommendation (RCR), which
sequentially incorporates constraints from previous feedback.
• A constraint-augmented RL problem setting
• A learnable discriminator to detect violations of user preferences in an
adversarial manner
The proposed method can be extended to other applications,
such as:
1. Vision-and-dialogue navigation
2. Interactive recommendation with the user's prior information
3. Dialogue-based recommendation

37
Appendix: Lagrange Relaxation

38
Appendix: Generated feedback
The simulator generates only simple comments on the visual
attribute differences between the candidate image and the
desired image.

39
Appendix: Hyperparameter setting
For reinforcement learning, the authors use Adam as the optimizer.
They also set:
• α: threshold of the constraints (refer to page 15)
• λ_max: projection boundary of λ
