Personal Project: Using reinforcement learning in an NLP task whose goal is to predict the missing pronouns in sentences, in order to improve dialogue systems.
1. RL in Zero-Pronoun Resolution
Olivia Zhi
September 28, 2018
Insight AI Fellow
2. Motivation
Human: Hey Siri, I want to buy some shoes.
Siri: OK, here’s what I found: …
Human: In white please.
3. Motivation
Human: Hey Siri, I want to buy some shoes.
Siri: OK, here’s what I found: …
Human: Those in white please.
4. Concepts
• Zero: In linguistics, a zero (denoted by "∅") is a segment which is not pronounced or written.
• Zero-Pronoun: A pronoun that is not written out in a sentence.
Zero pronouns occur frequently in languages such as Chinese and Japanese for coherence, and sometimes in English as well.
5. Data
• Resource: Chinese portion of the OntoNotes 5.0 dataset
[Chart: # of annotated ZPs]
• E.g. zero pronoun (∅)
- Chinese: 我 觉得 ∅ 还可以。
(- English: I think ∅ OK.)
• E.g. antecedents
- Chinese: 餐厅, 食物, 环境, 它, 那里, 那个
(- English: restaurant, food, environment, it, there, that)
6. Goal
• What: Predicting which antecedent (a previously occurring noun phrase) a zero pronoun refers to.
• Why: Helping machines understand text better in translation and dialogue tasks.
• Use Case: Intelligent Personal Assistant / Personal Digital Assistant (PDA)
• Techniques: Reinforcement Learning
7. Why RL?
• Traditional deep learning models make coreference decisions locally:
they consider only the relationship between the zero pronoun and a single
candidate antecedent at a time, overlooking the impact of each decision on
future ones. RL addresses this by including the zero pronoun's previously
predicted antecedents in the current state.
• Reinforcement Learning is flexible: the model can be adapted to different aims through the reward.
• Reinforcement Learning is state of the art and has already shown impact on
such tasks.
8. Models/Algorithms
Existing model (NP: candidate antecedent)
• State: the ZP, handcrafted features, candidate antecedents, and the antecedents already predicted by the model
• Action: whether the candidate antecedent NPt is the antecedent the ZP refers to (1 if NP is the antecedent, 0 otherwise)
• Reward: F1 score of the selected antecedents
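The state/action/reward formulation above can be sketched as an episodic selection loop. This is a minimal illustration, not the existing model itself: the state representation is simplified (handcrafted ZP features are omitted), and the function names are hypothetical.

```python
# Hypothetical sketch of the episodic antecedent-selection loop: the agent
# visits each candidate NP in turn, takes a binary action (select / reject),
# and the reward is the F1 score of the selected antecedent set.

def f1_reward(selected, gold):
    """F1 score of the selected antecedent set against the gold set."""
    if not selected or not gold:
        return 0.0
    tp = len(set(selected) & set(gold))  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def run_episode(candidates, gold, policy):
    """Visit each candidate NP; action 1 selects it as an antecedent."""
    selected = []
    for np_t in candidates:
        # The state carries the current candidate and the antecedents
        # selected so far (ZP features omitted for brevity).
        state = (np_t, tuple(selected))
        if policy(state) == 1:
            selected.append(np_t)
    return selected, f1_reward(selected, gold)

# Toy policy that selects every candidate:
selected, reward = run_episode(["餐厅", "它"], ["它"], policy=lambda s: 1)
```

Note how the state includes previously selected antecedents, which is exactly what lets the agent account for the impact of earlier decisions.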
9. Models/Algorithms
My current model
• Actor-Critic (AC) Network:
- The actor and critic branches share the weights of the first layer.
- Actor: learns a policy π(a|s) (to pick the anaphoric or non-anaphoric action) using feedback from the critic.
- Critic: learns a value function V(s), which serves as a baseline for how advantageous it is to be in a state.
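A minimal forward pass for such a two-headed network might look like the following. All layer sizes are illustrative assumptions; only the structure (one shared first layer, a policy head and a value head) follows the slide.

```python
import numpy as np

# Sketch of the actor-critic network: a shared first layer feeding a
# policy head (actor) and a value head (critic). Sizes are assumptions.
rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 8, 16, 2  # actions: anaphoric / non-anaphoric

W_shared = rng.normal(size=(STATE_DIM, HIDDEN))  # shared first layer
W_actor = rng.normal(size=(HIDDEN, N_ACTIONS))   # policy head, for pi(a|s)
W_critic = rng.normal(size=(HIDDEN, 1))          # value head, for V(s)

def forward(state):
    h = np.tanh(state @ W_shared)        # shared representation
    logits = h @ W_actor
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the two actions
    value = float(h @ W_critic)          # scalar state value V(s)
    return probs, value

probs, value = forward(rng.normal(size=STATE_DIM))
```

During training, the critic's V(s) serves as the baseline: the actor is updated in the direction of the advantage, roughly reward + V(s') − V(s).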
10. Reward function
My current model
● For each zero-pronoun / candidate-antecedent pair:
- Large positive reward for correctly identifying the antecedent (TP)
- Large negative reward for misidentifying a non-antecedent as the antecedent (FP)
- Small negative reward for failing to identify the antecedent (FN)
- Small positive reward for correctly identifying that the candidate is not the antecedent (TN)
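The per-pair reward scheme above can be written as a small lookup. The magnitudes here are illustrative assumptions; only their signs and relative sizes (large TP/FP, small FN/TN) follow the slide.

```python
# Sketch of the per-pair reward scheme; magnitudes are assumptions.
REWARDS = {
    "TP": +2.0,  # large positive: correctly identified the antecedent
    "FP": -2.0,  # large negative: selected a non-antecedent
    "FN": -0.5,  # small negative: missed the true antecedent
    "TN": +0.5,  # small positive: correctly rejected a non-antecedent
}

def pair_reward(action, is_antecedent):
    """action: 1 = select candidate, 0 = reject; is_antecedent: gold label."""
    if action == 1:
        return REWARDS["TP"] if is_antecedent else REWARDS["FP"]
    return REWARDS["FN"] if is_antecedent else REWARDS["TN"]
```

Making the FP penalty large relative to FN discourages the agent from greedily selecting every candidate, while the small TN reward still encourages confident rejections.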
12. Problem & Solution
• Problem: Preventing overfitting
- The AC model reaches its best result at the 15th episode vs. the 30th episode for the previous model.
- However, performance (precision, recall, and F1 score) decreases after the 15th episode.
• Solution: Further split the training set into training and validation sets, select the best model based on the validation set ("save the best model here!"), and report performance on the test set.
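The model-selection procedure above can be sketched as a simple loop that checkpoints the best validation score. The function names and signatures are hypothetical; `train_episode` stands in for one RL training episode and `validate` for an F1 evaluation on the held-out validation set.

```python
# Sketch of validation-based model selection: train for a fixed number of
# episodes, keep the weights from the episode with the best validation F1,
# then evaluate only that checkpoint on the test set.

def select_best_model(train_episode, validate, n_episodes=30):
    best_f1, best_episode, best_weights = -1.0, -1, None
    for episode in range(n_episodes):
        weights = train_episode(episode)  # one RL training episode
        f1 = validate(weights)            # F1 on the validation set
        if f1 > best_f1:                  # save the best model here!
            best_f1, best_episode, best_weights = f1, episode, weights
    return best_episode, best_weights
```

Because the checkpoint is chosen on the validation split rather than the test set, the final test score remains an unbiased estimate of generalization.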
13. About Me
Congcong (Olivia) Zhi
Data Science Program
"At Waterloo, we’re all nerds!"
Skills
● Python
● R
● PySpark
● SQL
● Git
● Reinforcement Learning
● Deep Learning
Interests
● NLP (dialogue systems, text mining, information retrieval)
● Reinforcement Learning
● Other Deep Learning fields
https://www.linkedin.com/in/congcongoliviazhi/