Abductive Commonsense
Reasoning
San Kim
2020.07.03
Chandra Bhagavatula(1), Ronan Le Bras(1), Chaitanya Malaviya(1), Keisuke Sakaguchi(1),
Ari Holtzman(1), Hannah Rashkin(1), Doug Downey(1), Scott Wen-tau Yih(2), Yejin Choi(1,3)
1. Allen Institute for AI
2. Facebook AI
3. Paul G. Allen School of Computer Science & Engineering
Contributions
1. Abductive Commonsense Reasoning Challenges
   1. Abductive NLI (αNLI)
   2. Abductive NLG (αNLG)
2. ART: a Dataset for Abductive Reasoning in Narrative Text
   1. 20K narrative contexts
   2. 200K hypotheses
3. Experiments & Key Insights
Abductive Commonsense Reasoning
Abductive Natural Language Inference
αNLI is formulated as a multiple-choice problem consisting of a pair of observations as context and a pair of hypothesis choices.
O_1: the observation at time t_1.
O_2: the observation at time t_2 > t_1.
h^+: a plausible hypothesis that explains the two observations O_1 and O_2.
h^-: an implausible (or less plausible) hypothesis for the observations O_1 and O_2.
Abductive Natural Language Generation
αNLG is the task of generating a valid hypothesis h^+ given the two observations O_1 and O_2. Formally, the task is to maximize P(h^+ \mid O_1, O_2).
Abductive Reasoning
(Figure: illustrative example of abductive reasoning, adopted from [1])
A Probabilistic Framework for 𝛼𝑁𝐿𝐼
The αNLI task is to select the hypothesis h^* that is most probable given the observations:

h^* = \arg\max_{h_i} P(H = h_i \mid O_1, O_2)

Rewriting the objective using Bayes' Rule conditioned on O_1, we have:

P(h_i \mid O_1, O_2) \propto P(O_2 \mid h_i, O_1)\, P(h_i \mid O_1)
Illustration of the graphical models described in the probabilistic framework.
Adopted from [1]
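To make the selection objective concrete, here is a minimal Python sketch (my illustration, not the paper's implementation): each hypothesis is scored with a pretrained language model, where log_prob is a hypothetical helper returning log P(continuation | context) under the LM.

def log_prob(continuation: str, context: str) -> float:
    """Hypothetical: log P(continuation | context) under a pretrained LM,
    e.g. the sum of per-token log-probabilities from GPT-2."""
    raise NotImplementedError

def score(o1: str, o2: str, h: str) -> float:
    # log P(h | O1) + log P(O2 | h, O1), matching the Bayes-rule rewrite above
    return log_prob(h, o1) + log_prob(o2, o1 + " " + h)

def best_hypothesis(o1: str, o2: str, hypotheses: list) -> str:
    return max(hypotheses, key=lambda h: score(o1, o2, h))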
A Probabilistic Framework for 𝛼𝑁𝐿𝐼
(a) Hypothesis Only
• Strong assumption: the hypothesis is entirely independent of both observations, i.e. (H ⊥ O_1, O_2).
• Maximize the marginal P(H).
(b, c) First (or Second) Observation Only
• Weaker assumption: the hypothesis depends on only one observation, either the first O_1 or the second O_2.
• Maximize the conditional probability P(H \mid O_1) or P(O_2 \mid H).
Adopted from [1]
A Probabilistic Framework for 𝛼𝑁𝐿𝐼
(d) Linear Chain
• Uses both observations, but considers each observation's influence on the hypothesis independently.
• The model assumes that the three variables ⟨O_1, H, O_2⟩ form a linear Markov chain.
• The second observation is conditionally independent of the first given the hypothesis (i.e. O_1 ⊥ O_2 \mid H).
• h^* = \arg\max_{h_i} P(O_2 \mid h_i)\, P(h_i \mid O_1), where O_1 ⊥ O_2 \mid H
(e) Fully Connected
• Jointly models all three random variables.
• Combines information across both observations to choose the correct hypothesis.
• h^* = \arg\max_{h_i} P(O_2 \mid h_i, O_1)\, P(h_i \mid O_1)
Adopted from [1]
Difference between the Linear Chain and Fully Connected models
𝑂1: Carl went to the store desperately searching for flour tortillas for a recipe.
𝑂2: Carl left the store very frustrated.
ℎ1: The cashier was rude. ℎ2: The store had corn tortillas, but not flour ones.
(Timeline: O_1 at t_0, the hypothesis at t_1, O_2 at t_2.)
Linear Chain: P(h_1 \mid O_1) and P(O_2 \mid h_1) are each plausible in isolation (a rude cashier plausibly follows a store visit, and frustration plausibly follows a rude cashier), while P(O_2 \mid h_2) is less convincing once O_1's detail about the recipe is dropped, so the linear chain scores h_1 > h_2.
Fully Connected: scoring P(O_2 \mid h_1, O_1) keeps the full context; given that Carl was desperately searching for flour tortillas, the store having only corn tortillas (h_2) explains his frustration far better than a rude cashier, so the fully connected model correctly scores h_2 > h_1. The sketch below shows the two scoring rules side by side.
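A small sketch, reusing the hypothetical log_prob helper from above, shows that the two models differ only in whether O_1 stays in the context when scoring O_2:

def score_linear_chain(o1: str, o2: str, h: str) -> float:
    # log P(h | O1) + log P(O2 | h): O1 is dropped when scoring O2
    return log_prob(h, o1) + log_prob(o2, h)

def score_fully_connected(o1: str, o2: str, h: str) -> float:
    # log P(h | O1) + log P(O2 | h, O1): O2 is scored against the full story
    return log_prob(h, o1) + log_prob(o2, o1 + " " + h)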
𝛼𝑁𝐿𝐺 model
Given h^+ = (w_1^h, …, w_l^h), O_1 = (w_1^{o_1}, …, w_m^{o_1}), and O_2 = (w_1^{o_2}, …, w_n^{o_2}) as sequences of tokens, the αNLG task can be modeled as

P(h^+ \mid O_1, O_2) = \prod_i P(w_i^h \mid w_{<i}^h, w_1^{o_1} \cdots w_m^{o_1}, w_1^{o_2} \cdots w_n^{o_2})

Optionally, the model can also be conditioned on background knowledge 𝒦, giving the training loss:

\mathcal{L} = -\sum_{i=1}^{N} \log P(w_i^h \mid w_{<i}^h, w_1^{o_1} \cdots w_m^{o_1}, w_1^{o_2} \cdots w_n^{o_2}, \mathcal{K})
Adopted from [1]
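As a rough sketch of how such a conditional generator can be set up with an off-the-shelf GPT-2 (the <o1>/<o2>/<hyp> delimiters are my assumption, not the paper's exact template, and the knowledge input 𝒦 is omitted; the real model is fine-tuned on ART with the loss above):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

o1 = "Carl went to the store desperately searching for flour tortillas."
o2 = "Carl left the store very frustrated."
prompt = f"<o1> {o1} <o2> {o2} <hyp>"  # illustrative delimiters

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
# print only the newly generated hypothesis tokens
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))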
ConceptNet [5]
• ConceptNet is a multilingual knowledge base
• It is constructed of:
  • Nodes (a word or a short phrase)
  • Edges (relation and weight)
• Example edges: bird —NotCapableOf→ {bark, chew their food, breathe water}
• Related work:
  • WordNet
  • Microsoft Concept Graph
  • Google Knowledge Graph
Adopted from [6]
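ConceptNet exposes a public REST API; here is a small query sketch (the response fields follow its documented JSON format, which may change):

import requests

# Fetch edges about the English concept "bird" from ConceptNet's API.
resp = requests.get("https://api.conceptnet.io/c/en/bird").json()
for edge in resp["edges"][:5]:  # first few edges
    print(edge["rel"]["label"], "|",
          edge["start"]["label"], "->", edge["end"]["label"],
          "| weight:", edge["weight"])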
ATOMIC [2]
Inference dimensions:
1. xIntent: Why does X cause an event?
2. xNeed: What does X need to do before the
event?
3. xAttr: How would X be described?
4. xEffect: What effects does the event have on X?
5. xWant: What would X likely want to do after
the event?
6. xReaction: How does X feel after the event?
7. oReact: How do others feel after the event?
8. oWant: What would others likely want to do
after the event?
9. oEffect: What effects does the event have on
others?
If-Then Relation Types
• If-Event-Then-Mental-State
• If-Event-Then-Event
• If-Event-Then-Persona
Adopted from [2]
ATOMIC
• If-Event-Then-Mental-State: three relations relating to the mental pre- and post-conditions of an event.
Example: "X compliments Y"
• xIntent (likely intent of the event's subject): X wants to be nice
• xReaction (likely emotional reaction of the event's subject): X feels good
• oReaction (likely emotional reaction of others): Y feels flattered
Adopted from [2]
ATOMIC
• If-Event-Then-Event: five relations relating to events that constitute probable pre- and post-conditions of
a given event.
Example: "X calls the police"
• xNeed: X needs to dial 911
• xEffect: X starts to panic
• xWant: X wants to explain everything to the police
• oWant: others want to dispatch some officers
Example: "X pays Y a compliment"
• oEffect: Y will smile
• If-Event-Then-Persona: a stative relation that describes how the subject of an event is described or perceived.
Example: "X pays Y a compliment"
• xAttr: X is flattering
• xAttr: X is caring
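To make the tuple structure concrete, here is a small sketch in my own representation (not the ATOMIC release format):

from typing import NamedTuple

class AtomicTuple(NamedTuple):
    event: str      # base event with X/Y placeholders
    relation: str   # one of the nine inference dimensions
    inference: str  # crowd-annotated inference

examples = [
    AtomicTuple("X compliments Y", "xIntent", "X wants to be nice"),
    AtomicTuple("X compliments Y", "oReaction", "Y feels flattered"),
    AtomicTuple("X calls the police", "xNeed", "X needs to dial 911"),
    AtomicTuple("X pays Y a compliment", "xAttr", "X is flattering"),
]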
ATOMIC
Adopted from [2]
COMET: COMmonsEnse Transformers for Automatic Knowledge Graph Construction [3]
(Overview figures adopted from [3])
COMET: COMmonsEnse Transformers for Automatic Knowledge Graph Construction
Loss function

\mathcal{L} = -\sum_{t=|s|+|r|}^{|s|+|r|+|o|} \log P(x_t \mid x_{<t})

Input Template & Notation
• Natural language tuples in the format {s: subject, r: relation, o: object}
• Subject tokens: X^s = {x_0^s, …, x_{|s|}^s}
• Relation tokens: X^r = {x_0^r, …, x_{|r|}^r}
• Object tokens: X^o = {x_0^o, …, x_{|o|}^o}
Adopted from [3]
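A minimal sketch of the masked loss over the object tokens (the tensor layout is illustrative; the released COMET code differs in details):

import torch.nn.functional as F

def comet_loss(logits, input_ids, s_len, r_len):
    # logits: [T, vocab]; input_ids: [T] holding s, then r, then o tokens.
    # The loss is computed only over the object tokens, i.e. positions
    # |s|+|r| .. T, matching the sum bounds in the loss above.
    start = s_len + r_len
    pred = logits[start - 1:-1]   # predictions for positions start .. T-1
    target = input_ids[start:]    # the object tokens to be generated
    return F.cross_entropy(pred, target)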
Adversarial Filtering Algorithm
• A significant challenge in creating datasets is avoiding annotation artifacts.
• Annotation artifacts: unintentional patterns in the data that leak information about the
target label.
In each iteration i:
• M_i: the adversarial model
• 𝒯_i: a random subset used for training
• 𝒱_i: the validation set
• h_k^+, h_k^-: the plausible and implausible hypotheses for an instance k
• δ = Δ_{M_i}(h_k^+, h_k^-): the difference in the model's evaluation of h_k^+ and h_k^-
With probability t_i, each instance k that M_i gets correct is updated with a pair (h^+, h^-) ∈ ℋ_k^+ × ℋ_k^- of hypotheses that reduces the value of δ, where ℋ_k^+ (resp. ℋ_k^-) is the pool of plausible (resp. implausible) hypotheses for instance k (see the sketch below).
Adopted from [1]
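A compact sketch of this loop (train and delta stand in for the paper's BERT-based adversary and its score margin):

import random

def adversarial_filtering(instances, pools_pos, pools_neg, train, delta,
                          n_iters=10, t=0.5):
    """Sketch of AF: instances[k] = (h_plus, h_minus) for instance k."""
    for _ in range(n_iters):
        model = train(random.sample(instances, len(instances) // 2))  # M_i
        for k, (h_pos, h_neg) in enumerate(instances):
            # delta > 0 means M_i ranks h_pos above h_neg (gets k correct)
            if delta(model, h_pos, h_neg) > 0 and random.random() < t:
                # swap in the pair from the pools that minimizes the margin
                instances[k] = min(
                    ((p, n) for p in pools_pos[k] for n in pools_neg[k]),
                    key=lambda pair: delta(model, *pair))
    return instances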
Performance of baselines (αNLI)
Adopted from [1]
Performance of baselines (αNLG)
Adopted from [1]
BERT-Score
An automatic metric for evaluating text generation with BERT (analogous to the Inception Score, Fréchet Inception Distance, and Kernel Inception Distance used in image generation).
Adopted from [4]
BERT-Score
Given a reference sentence x and a candidate x̂, with BERT contextual embeddings 𝐱_i (reference tokens) and 𝐱̂_j (candidate tokens):

R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j

P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^\top \hat{\mathbf{x}}_j

F_{BERT} = \frac{2\, P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}

Importance weighting using inverse document frequency (idf):

idf(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}[w \in x^{(i)}], where \mathbb{I}[\cdot] is an indicator function.

R_{BERT} = \frac{\sum_{x_i \in x} idf(x_i) \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j}{\sum_{x_i \in x} idf(x_i)}

Baseline rescaling: compute a baseline b using Common Crawl monolingual datasets:

\hat{R}_{BERT} = \frac{R_{BERT} - b}{1 - b}
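The greedy-matching core reduces to a few lines; here is a sketch over precomputed, L2-normalized token embeddings (the released bert_score package adds tokenization, idf weighting, and baseline rescaling):

import torch

def bert_score(ref_emb: torch.Tensor, cand_emb: torch.Tensor):
    # ref_emb: [m, d], cand_emb: [n, d]; rows are L2-normalized,
    # so the dot product is cosine similarity.
    sim = ref_emb @ cand_emb.T                 # [m, n] similarity matrix
    recall = sim.max(dim=1).values.mean()      # best match per reference token
    precision = sim.max(dim=0).values.mean()   # best match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()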
BERT-Score
Adopted from [4]
Abductive Commonsense Reasoning
1. O_1 ↛ h^-: h^- is unlikely to follow after the first observation O_1.
2. h^- ↛ O_2: h^- is plausible after O_1 but unlikely to precede the second observation O_2.
3. Plausible: ⟨O_1, h^-, O_2⟩ is a coherent narrative and forms a plausible alternative, but it is less plausible than ⟨O_1, h^+, O_2⟩.
Adopted from [1]
BERT performance
Abductive Commonsense Reasoning
(Figures adopted from [1])
References
[1] Chandra Bhagavatula et al., "Abductive Commonsense Reasoning," ICLR 2020.
[2] Maarten Sap et al., "ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning," arXiv:1811.00146.
[3] Antoine Bosselut et al., "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction," ACL 2019.
[4] Tianyi Zhang et al., "BERTScore: Evaluating Text Generation with BERT," ICLR 2020.
[5] Robyn Speer et al., "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge," arXiv:1612.03975.
[6] MIT Media Lab, ConceptNet, https://conceptnet.io/
