Slides of the presentation given at the ML Meetup in Berlin on 09/07/2020.
The full presentation is available to watch: https://www.youtube.com/watch?v=AdA6aJpxFfM&feature=youtu.be
This talk covers the basics of commonsense reasoning in AI. First, background on the concept is provided; then, approaches developed at SAP AI Research and published at ACL are presented in detail.
3. Source: https://github.com/minimaxir/gpt-3-experiments/, https://twitter.com/yoavgo, https://lacker.io/
Q: Who were the founders of Microsoft?
A: Bill Gates and Paul Allen were the
founders of Microsoft.
Q: What are cubs and cowboys?
A: Cubs are baby bears. Cowboys are people
who work on ranches.
Q: Which is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.
Q: Are there any animals with three legs?
A: No, there are no animals with three legs.
4. Source: https://github.com/minimaxir/gpt-3-experiments/, https://twitter.com/yoavgo, https://lacker.io/
Q: What do dolphins and eagles have in common?
A: Both are birds.
Q: What do the words ‘dolphins’ and ‘eagles’
have in common?
A: Both words start with the letter ‘e’.
6. What is common sense?
• The concept is ambiguous
• Depends on many factors, e.g., culture, age, education, etc.
• Generic: ranges from natural laws and social conventions to unwritten rules
• Examples: “Knives cut things”, “Your parents are older than you”,
“Objects don’t disappear when you cover them with a blanket.”,
“Animals don’t drive cars”.
7. Definitions of common sense
• Merriam-Webster – “sound and prudent judgment based
on a simple perception of the situation or facts”
• Cambridge Dictionary – “the basic level of practical
knowledge and judgment that we all need to help us live in
a reasonable and safe way.”
• […..]
• “Sound judgment derived from experience rather than
study”
8. Common Sense in AI
“We shall therefore say that a program has common sense if it automatically
deduces for itself a sufficiently wide class of immediate consequences of anything
it is told and what it already knows.
[…]
Our ultimate objective is to make programs that learn from their experience as
effectively as humans do.”
John McCarthy, “Programs with Common Sense”, 1958
“The great irony of common sense—and indeed AI itself—is that it is stuff that
everybody knows, yet nobody seems to know what exactly it is or how to build
machines that have it.”
Gary Marcus, “Rebooting AI: Building Artificial Intelligence We Can Trust”, 2019
9. AI’s struggle with common sense
• Common sense completeness issue - the lack of a precise definition
• Supervised learning intractable
• Common training data for LMs (e.g., Wikipedia) does not contain
commonsense knowledge (it is assumed to be trivial)
• Deep learning
• Great at pattern recognition, poor at adaptation
• Hard to incorporate abstract knowledge
• Common AI training is done in a goal-oriented fashion, e.g., backpropagation
• Issue: pure goal-orientation leads to shortcuts, poor generalization, and no human-like reasoning
• Ideal goal: a fundamental rethinking of learning, leveraging existing knowledge
10. Human commonsense reasoning
• Human-like reasoning
• Extremely complex
• Intrinsics are far from being fully understood
• Captures time, space, causality, basic knowledge of physical objects
and their interactions
• Mechanisms such as conceptualization and compositionality
• Conceptualization is an abstract, simplified view of the world that we wish to
represent for some purpose, Gruber, 1995
• ”Concepts are the glue that holds our mental world together”, Murphy, 2002
• Compositionality is the capacity to understand and produce novel
combinations from known components, Montague, 1970
“The Big Book of Concepts”, Gregory Murphy, 2002; “Universal grammar”, Richard Montague, 1970;
“Toward principles for the design of ontologies used for knowledge sharing”, Gruber, 1995
11. Applications: Human-Centered AI & Robust AI
• Human-centered AI
• Advanced chatbots
• Assistants
• Interpretable AI
• Robust AI
• Generalization: distribution of events is long-tailed
• Infrequent & significant, e.g., “black swans”
“The Black Swan: The Impact of the Highly Improbable”, Nassim Nicholas Taleb, 2007
12. Testing Common Sense Reasoning – Winograd Schemas
• Alternative to the Turing Test for commonsense reasoning
• Winograd schemas: named after Terry Winograd, based on his 1972 example
• Schema structure (encoded concretely in the sketch below):
• A sentence with two parties
• An ambiguous pronoun referring to one of them
• Trigger word(s) that flip the answer
• Objective: What does the pronoun refer to?
• Winograd Schema Challenge (WSC): 273 multiple-choice questions
• “Google-proof” Winograd schemas, manually curated by AI experts
• Easy for humans, hard for machines
Turing, “Computing machinery and intelligence”, 1950, Winograd, “Understanding Natural Language”, 1972,
Levesque et al., “The Winograd Schema Challenge”, 2012
Example: “The trophy does not fit into the suitcase, because it is too big.”
“The trophy does not fit into the suitcase, because it is too small.”
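To make the schema structure above concrete, here is one way a single Winograd schema could be encoded; this is an illustrative sketch, and the field names are my own, not from any specific dataset release:
```python
# One possible encoding of a Winograd schema (field names are illustrative).
schema = {
    "sentence": "The trophy does not fit into the suitcase, because it is too {}.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    # The trigger word flips which candidate the pronoun refers to.
    "triggers": {"big": "the trophy", "small": "the suitcase"},
}
```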
13. Approaches
• Feature-based approaches: explicit rules from knowledge bases, internet search
queries, logic-based systems
• Neural-network-based approaches: semantic similarities on word embeddings,
RNNs/LSTMs to encode the local context, pre-trained on unstructured data
• Recently: Leveraging language models (LMs) pre-trained on large amounts of
unlabeled text, e.g., BERT
• Unsupervised: BERT LM likelihood scoring
• Supervised: BERT Masked LM Model
The trophy does not fit into the suitcase, because [MASK] is too big.
Answer: The trophy
Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding”, 2018, Kocijan et al., “A Surprisingly Robust Trick for Winograd Schema Challenge”, ACL, 2019,
Trinh and Le, “A Simple Method for Commonsense Reasoning”, 2018, Kocijan et al. “A Review of Winograd Schema Challenge Datasets and Approaches”, 2020
ScoreLM(“The trophy does not fit into the suitcase, because the trophy is too big.”)
vs.
ScoreLM(“The trophy does not fit into the suitcase, because the suitcase is too big.”)
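A minimal sketch of this unsupervised candidate scoring, assuming the HuggingFace transformers library and a pseudo-log-likelihood approximation (masking one token at a time); it illustrates the idea rather than reproducing the cited papers' exact scoring:
```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def score(sentence):
    """Pseudo-log-likelihood: mask each token in turn and sum its log-probability."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for pos in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[ids[pos]].item()
    return total

# Substitute each candidate for the pronoun and compare LM scores.
template = "The trophy does not fit into the suitcase, because {} is too big."
for candidate in ["the trophy", "the suitcase"]:
    print(candidate, score(template.format(candidate)))
```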
14. BERT - Transformer Encoder Stack
Sources: http://jalammar.github.io/illustrated-transformer/;
“Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention”, https://towardsdatascience.com/
• Idea: Leverage the attention tensor $A \in \mathbb{R}^{L \times H \times |C|}$
• L: number of layers
• H: number of heads
• |C|: sequence length
[Figure: cuboid visualization of A with axes L, H, and |C|]
Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding”, 2018
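A minimal sketch of extracting such a tensor with the HuggingFace transformers library (an assumption; the original work may use a different implementation). BERT's attention maps are token-to-token, so fixing the pronoun's query position yields the $L \times H \times |C|$ tensor from the slide:
```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The trophy does not fit into the suitcase, because it is too big."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of L tensors, each of shape (batch, H, |C|, |C|).
full = torch.stack(outputs.attentions).squeeze(1)  # (L, H, |C|, |C|)

# Fix the pronoun's query position to obtain A in R^{L x H x |C|}.
pronoun_pos = inputs["input_ids"][0].tolist().index(
    tokenizer.convert_tokens_to_ids("it"))
A = full[:, :, pronoun_pos, :]  # (L, H, |C|)
print(A.shape)                  # e.g. (12, 12, seq_len) for bert-base
```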
15. Maximum Attention Score (MAS)
• Re-implementation of BERT for commonsense reasoning
• Exploits the associative power of self-attention
• Idea: max-pooling at the attention level
• Retain a candidate's attention only where it is most dominant
• Frequency of occurrence weights the importance
• Implementation (see the sketch below):
• Slice the attention tensor A into per-candidate attention matrices Ac
• Isolate dominant links with a binary mask matrix Mc
• Score = ratio of the sums of masked attention values
ACL’19
“Attention Is (not) All You Need for Commonsense Reasoning”, Klein and Nabi, 2019
[Figure: binary mask matrix Mc, e.g. rows (0 0 0), (1 0 0), (1 0 1), applied elementwise to attention matrix Ac]
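A minimal sketch of MAS over two candidates, reusing the attention tensor A from the previous sketch; the mask-and-ratio computation follows the slide's description, while the variable names and candidate positions are illustrative:
```python
import torch

def mas(A, cand_positions):
    """A: (L, H, seq_len) attention from the pronoun to every token.
    Returns one normalized score per candidate."""
    # Slice per-candidate attention matrices A_c, each of shape (L, H).
    A_c = torch.stack([A[:, :, p] for p in cand_positions])  # (n_cand, L, H)
    # Binary mask M_c: keep a (layer, head) cell only where candidate c
    # receives the maximum attention among all candidates.
    M_c = (A_c == A_c.max(dim=0, keepdim=True).values).float()
    masked_sums = (A_c * M_c).sum(dim=(1, 2))                # (n_cand,)
    return masked_sums / masked_sums.sum()                   # ratio of sums

# Token indices of "trophy" and "suitcase" in the tokenized sentence (illustrative).
scores = mas(A, cand_positions=[2, 8])
print(scores)  # the candidate with the higher score is the predicted referent
```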
16. MAS – Schematic Illustration
[Figure: schematic MAS illustration for “The trophy doesn’t fit in the suitcase because IT is too small”: attention matrices A1 (candidate “trophy”) and A2 (candidate “suitcase”) are masked so each entry is kept only where that candidate's attention is maximal; the retained values are summed and the resulting scores compared]
“Attention Is (not) All You Need for Commonsense Reasoning”, Klein and Nabi, 2019
ACL’19
17. Quantitative Results
Davis et al., “Human tests of materials for the winograd schema challenge”, 2016, Levesque et al., “The Winograd Schema Challenge”, 2012
Kocijan et al., “A Surprisingly Robust Trick for Winograd Schema Challenge”, 2019
18. Qualitative Results
[Figure: bar chart of predicted probabilities (0.0 to 1.0) for each sentence variant:]
The drain is clogged with hair. It has to be cleaned.
The drain is clogged with hair. It has to be removed.
Steve follows Fred's example in everything. He admires him hugely.
Steve follows Fred's example in everything. He influences him hugely.
The fish ate the worm. It was hungry.
The fish ate the worm. It was tasty.
The trophy doesn’t fit into the suitcase, because it is too big.
The trophy doesn’t fit into the suitcase, because it is too small.
The foxes are attacking the chickens at night. I have to kill them.
The foxes are attacking the chickens at night. I have to guard them.
19. Can we do better?
• Task: Devising a difficult task that captures a deeper notion of
common sense and generalizes, without labels
• Idea:
• Exploit a structural prior (no labels needed) → mutual exclusivity
• Find consistency in answers
The trophy does not fit into the suitcase, because the trophy is too big.
The trophy does not fit into the suitcase, because the trophy is too small.
The trophy does not fit into the suitcase, because the suitcase is too big.
The trophy does not fit into the suitcase, because the suitcase is too small.
or
The trophy doesn’t fit into the suitcase, because it is too big/too small.
20. Contrastive Self-Supervised (CSS) - Method I
• BERT Masked LM Model
• Pair-level: “soft” mutual-exclusiveness using LM likelihoods (MEx)
The trophy does not fit into the suitcase, because [MASK] is too big.
Candidate 1: The trophy Candidate 2: The suitcase
Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding”, 2018
Sajjadi et al., “Regularization with stochastic transformations and perturbations for deep semi-supervised learning”, 2016
The trophy does not fit into the suitcase, because [MASK] is too small.
Probabilistic relaxation ($\oplus$: XOR operator; a code sketch follows below):
• $c_{i,j} \in \{0,1\}$: candidate $j$ in sentence $i$; $p_{i,j} \in [0,1]$: LM likelihood of $c_{i,j}$
• $c_i \Rightarrow p_i$, $\neg c_i \Rightarrow (1 - p_i)$, $\bigwedge_{i=1}^{k} c_i \Rightarrow \prod_{i=1}^{k} p_i$
• Constraint: $(c_{i,1} \oplus c_{i+1,1}) \wedge (c_{i,2} \oplus c_{i+1,2}) \wedge (c_{i,1} \oplus c_{i,2})$
• Relaxed: $p_{i,1}\,p_{i+1,2}\,(1 - p_{i,2}\,p_{i+1,1}) + p_{i,2}\,p_{i+1,1}\,(1 - p_{i,1}\,p_{i+1,2})$
ACL’20
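A minimal sketch of the relaxed MEx term for one twin pair, directly transcribing the formula above; the likelihood values and the use of a negative log as the loss are illustrative assumptions, not necessarily the paper's exact objective:
```python
import torch

def mex_satisfaction(p):
    """p: (2, 2) tensor of LM likelihoods, twin sentences x candidates.
    Soft truth value of (c11 xor c21) and (c12 xor c22) and (c11 xor c12)."""
    return (p[0, 0] * p[1, 1] * (1 - p[0, 1] * p[1, 0]) +
            p[0, 1] * p[1, 0] * (1 - p[0, 0] * p[1, 1]))

p = torch.tensor([[0.8, 0.3],   # "... because [MASK] is too big."
                  [0.2, 0.7]])  # "... because [MASK] is too small."
loss_mex = -torch.log(mex_satisfaction(p))  # maximizing satisfaction minimizes the loss
print(loss_mex)
```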
21. Contrastive Self-Supervised (CSS) - Method II
• Sentence Level: Contrastive margin (CM)
• Training
• Joint loss:
• Self-supervised fine-tuning on DPR (no labels needed)
• Definite Pronoun Resolution (DPR) is similar to WSC-273
• Relaxed Winograd schema constraints, i.e., the dataset is not Google-proof
• 1322 training samples
Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding”, 2018, Rahman and Ng, “Resolving complex cases of definite pronouns: The Winograd schema challenge”, 2012
$\mathfrak{L}(f_\theta) = \mathfrak{L}_{MEx}(f_\theta) + \mathfrak{L}_{CM}(f_\theta)$, where $f$ is the language model parameterized by $\theta$
The trophy does not fit into the suitcase, because [MASK] is too big.
Candidate 1: The trophy Candidate 2: The suitcase
$\max\left(p_{i,j} - p_{i,j+1}\right)$
ACL’20
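A minimal sketch combining the two terms, reusing `p` and `loss_mex` from the previous sketch; the exact margin formulation and any weighting between the losses are assumptions for illustration:
```python
# Contrastive margin (CM): push the two candidates' likelihoods apart per sentence.
# Written as a loss, maximizing the margin means minimizing its negative.
loss_cm = -(p[:, 0] - p[:, 1]).abs().sum()

# Joint objective from the slide: L(f_theta) = L_MEx(f_theta) + L_CM(f_theta).
loss = loss_mex + loss_cm
print(loss.item())
```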
22. CSS - Schematic Loss Illustration
[Figure: schematic loss illustration for the twin sentences
S1: The trophy does not fit into the suitcase, because it is too big.
S2: The trophy does not fit into the suitcase, because it is too small.
The LM likelihoods (0.0 to 1.0) of Candidate 1 vs. Candidate 2 in each sentence feed the LM loss, the MEx loss, and the contrastive margin]
23. Results
Davis et al., “Human tests of materials for the winograd schema challenge”, 2016, Levesque et al., “The Winograd Schema Challenge”, 2012, Rahman and Ng, “Resolving complex cases of definite
pronouns: The Winograd schema challenge”, 2012, Emami et al., “The knowref coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution”, 2019
24. Conclusion
• BERT implicitly establishes complex relationships between entities
• Self-supervision is possible for commonsense reasoning
• Leveraging structural prior (mutual-exclusivity) instead of direct
supervision
• Outperforming all unsupervised approaches
• Comparable performance to supervised approaches
• Future work
• Relaxing the structural prior of twin pairs
• Transferring the inductive bias to other commonsense-demanding downstream
tasks, e.g., Q&A
25. Q: What are the names of the papers presented?
A: “Attention Is (not) All You Need for
Commonsense Reasoning”, “Contrastive Self-
Supervised Learning for Commonsense Reasoning”
Q: What’s the Github repo for the papers?
A: https://github.com/SAP-samples/acl2019-commonsense,
https://github.com/SAP-samples/acl2020-commonsense
Q: Does SAP AI Research offer internships?
A: Yes, check out: https://jobs.sap.com/
Q: How to contact the presenter?
A: tassilo.klein@sap.com, tjklein.github.io
Thanks for your attention
SAP AI Research, Berlin
Q: Is there anything else I should know?
A: Yes, check out our research blog:
https://medium.com/sap-machine-learning-research