From Deep Learning to Deep Reasoning
14/08/2021 1
Tutorial at KDD, August 14th 2021
Truyen Tran, Vuong Le, Hung Le and Thao Le
{truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au
https://bit.ly/37DYQn7
Part A: Learning to reason
Logistics
14/08/2021 2
Truyen Tran Vuong Le Hung Le Thao Le
https://bit.ly/37DYQn7
Agenda
• Introduction
• Part A: Learning-to-reason framework
• Part B: Reasoning over unstructured and structured data
• Part C: Memory | Data efficiency | Recursive reasoning
14/08/2021 3
2012
2016
AusDM 2016
Turing Awards 2018
GPT-3 2020
DL: 8 years snapshot
DL has been fantastic, but …
• It is great at interpolating
•  data hungry to cover all variations and smooth local manifolds
•  little systematic generalization (novel combinations)
• Lack of human-perceived reasoning capability
• Lack natural mechanism to incorporate prior knowledge, e.g., common sense
• No built-in causal mechanisms
•  Have trust issues!
• To be fair, many of these problems are common in statistical learning!
14/08/2021 5
Why still DL in 2021?
Theoretical
Expressiveness: Neural
nets can approximate any
function.
Learnability: Neural nets
are trained easily.
Generalisability: Neural
nets generalize surprisingly
well to unseen data.
Practical
Generality: Applicable to
many domains.
Competitive: DL is hard to
beat as long as there are
data to train.
Scalability: DL is better with
more data, and it is very
scalable.
The next AI/ML challenge
2020s-2030s
 Learning + reasoning, general
purpose, human-like
 Has contextual and common-
sense reasoning
 Requires less data
 Adapt to change
 Explainable
Photo credit: DARPA
Toward deeper reasoning
System 1:
Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2:
Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
Image credit: VectorStock | Wikimedia
Perception
Theory of mind
Recursive reasoning
Facts
Semantics
Events and relations
Working space
Memory
System 2
• Holds hypothetical thought
• Decoupling from representation
• Working memory size is not essential.
Its attentional control is.
14/08/2021 9
Figure credit: Jonathan Hui
Reasoning in Probabilistic Graphical Models (PGM)
• Assuming models are fully specified
(e.g., by hand or learnt)
• Estimate MAP as energy
minimization
• Compute marginal probability
• Compute expectation &
normalisation constant
• Key algorithm: Pearl’s Belief
Propagation, a.k.a Sum-Product
algorithm in factor graphs.
• Known result in 2001-2003: BP
minimises the Bethe free energy.
14/08/2021 10
Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free
energy." Advances in neural information processing systems. 2003.
Can we learn to infer directly from data
without full specification of models?
14/08/2021 11
Agenda
• Introduction
• Part A: Learning-to-reason framework
• Part B: Reasoning over unstructured and structured data
• Part C: Memory | Data efficiency | Recursive reasoning
14/08/2021 12
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution
• Compositional attention networks.
• Neural module networks.
• Combinatorics reasoning
14/08/2021 13
Learning to reason
• Learning is to self-improve by experiencing ~
acquiring knowledge & skills
• Reasoning is to deduce knowledge from
previously acquired knowledge in response to a
query (or cues)
• Learning to reason is to improve the ability to
decide if a knowledge base entails a predicate.
• E.g., given a video f, determine if the person with the
hat turns before singing.
• Hypotheses:
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
14/08/2021 14
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM
(JACM) 44.5 (1997): 697-725.
(Dan Roth; ACM Fellow; IJCAI
John McCarthy Award)
Learning to reason, a definition
14/08/2021 15
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM
(JACM) 44.5 (1997): 697-725.
E.g., given a video f, determine if the person with the
hat turns before singing.
Practical setting: (query,database,answer) triplets
• This is very general:
• Classification: Query = what is this? Database = data.
• Regression: Query = how much? Database = data.
• QA: Query = NLP question. Database = context/image/text.
• Multi-task learning: Query = task ID. Database = data.
• Zero-shot learning: Query = task description. Database = data.
• Drug-protein binding: Query = drug. Database = protein.
• Recommender system: Query = User (or item). Database =
inventories (or user base);
14/08/2021 16
Can neural networks reason?
Reasoning is not necessarily
achieved by making logical
inferences
There is a continuity between
[algebraically rich inference] and
[connecting together trainable
learning systems]
Central to reasoning is composition
rules to guide the combinations of
modules to address new tasks
14/08/2021 17
“When we observe a visual scene, when we
hear a complex sentence, we are able to
explain in formal terms the relation of the
objects in the scene, or the precise meaning
of the sentence components. However, there
is no evidence that such a formal analysis
necessarily takes place: we see a scene, we
hear a sentence, and we just know what they
mean. This suggests the existence of a
middle layer, already a form of reasoning, but
not yet formal or logical.”
Bottou, Léon. "From machine learning to machine
reasoning." Machine learning 94.2 (2014): 133-149.
Hypotheses
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
• Reasoning is recursive, e.g., mental travel.
14/08/2021 18
Two approaches to neural reasoning
• Implicit chaining of predicates through recurrence:
• Step-wise query-specific attention to relevant concepts & relations.
• Iterative concept refinement & combination, e.g., through a working
memory.
• Answer is computed from the last memory state & question embedding.
• Explicit program synthesis:
• There is a set of modules, each performing a pre-defined operation.
• The question is parsed into a symbolic program.
• The program is implemented as a computational graph constructed by
chaining separate modules.
• The program is executed to compute an answer.
14/08/2021 19
In search for basic neural operators for reasoning
• Basics:
• Neuron as feature detector  Sensor, filter
• Computational graph  Circuit
• Skip-connection  Short circuit
• Essentials
• Multiplicative gates  AND gate, Transistor,
Resistor
• Attention mechanism  SWITCH gate
• Memory + forgetting  Capacitor + leakage
• Compositionality  Modular design
• ..
14/08/2021 20
Photo credit: Nicola Asuni
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
• Combinatorics reasoning
14/08/2021 21
Concept-object binding
• Perceived data (e.g., visual objects) may not share the same semantic space
with high-level concepts.
• Binding between concepts and objects enables reasoning at the concept level
14/08/2021 22
Example of concept-object binding in LOGNet (Le et al, IJCAI’2020)
More reading: Greff, Klaus, Sjoerd van Steenkiste, and Jürgen Schmidhuber. "On the
binding problem in artificial neural networks." arXiv preprint arXiv:2012.05208 (2020).
Attentions: Picking up only what is needed at a step
• Need attention model to select or ignore
certain computations or inputs
• Can be “soft” (differentiable) or “hard”
(requires RL)
• Needed for selecting predicates in
reasoning.
• Attention provides a short-cut  long-
term dependencies
• Needed for long chain of reasoning.
• Also encourages sparsity if done right!
http://distill.pub/2016/augmented-rnns/
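To make the mechanism concrete, a minimal sketch of soft (differentiable) dot-product attention over a set of candidate inputs; dimensions and variable names are illustrative only.

```python
# Minimal soft dot-product attention over a set of inputs (illustrative sketch).
import torch

def soft_attention(query, items):
    """query: [d], items: [n, d] -> attended summary [d] and weights [n]."""
    scores = items @ query                  # relevance of each item to the query
    weights = torch.softmax(scores, dim=0)  # differentiable "soft" selection
    summary = weights @ items               # weighted combination of the items
    return summary, weights

items = torch.randn(5, 16)   # e.g. candidate predicates / objects
query = torch.randn(16)      # e.g. the current reasoning step
summary, weights = soft_attention(query, items)
print(weights)               # peaked weights pick up only what is needed
```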
Fast weights | HyperNet – the multiplicative interaction
• Early ideas in early 1990s by Juergen Schmidhuber and
collaborators.
• Data-dependent weights | Using a controller to generate weights of
the main net.
14/08/2021 24
Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
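A minimal sketch of the fast-weight / hypernetwork idea: a small controller generates the weights of the main layer from a conditioning input (data-dependent weights). This is an illustration of the multiplicative interaction, not the exact architecture of Ha et al. (2016).

```python
# Sketch of a hypernetwork: a controller generates the weights of the main layer
# from a conditioning input ("fast weights"). Illustrative only.
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    def __init__(self, d_in, d_out, d_ctx):
        super().__init__()
        # The controller maps the context to a full weight matrix and bias.
        self.weight_gen = nn.Linear(d_ctx, d_in * d_out)
        self.bias_gen = nn.Linear(d_ctx, d_out)
        self.d_in, self.d_out = d_in, d_out

    def forward(self, x, ctx):
        W = self.weight_gen(ctx).view(self.d_out, self.d_in)  # generated weights
        b = self.bias_gen(ctx)
        return x @ W.t() + b   # multiplicative interaction between x and ctx

layer = HyperLinear(d_in=8, d_out=4, d_ctx=6)
x, ctx = torch.randn(8), torch.randn(6)   # ctx could be a question embedding
print(layer(x, ctx).shape)                # torch.Size([4])
```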
Memory networks: Holding the data ready for inference
• Input is a set  Load into
memory, which is NOT updated.
• State is an RNN with attention
reading from inputs
• Concepts: Query, key and
content + Content addressing.
• Deep models, but constant path
length from input to output.
• Equivalent to an RNN with a shared
input set.
14/08/2021 25
Sukhbaatar, Sainbayar, Jason Weston, and Rob
Fergus. "End-to-end memory networks." Advances in
neural information processing systems. 2015.
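A simplified sketch of the read loop described above, in the spirit of end-to-end memory networks but not the exact MemN2N parameterization: the input set is loaded into a fixed memory and only a single controller state is updated over a few hops.

```python
# Simplified multi-hop memory-network read: the memory is a fixed set (never
# written to), only the controller state is updated. Illustrative sketch.
import torch
import torch.nn as nn

class MemoryHops(nn.Module):
    def __init__(self, d, hops=3):
        super().__init__()
        self.hops = hops
        self.update = nn.Linear(2 * d, d)

    def forward(self, question, memory):
        """question: [d], memory: [n, d]."""
        state = question
        for _ in range(self.hops):
            attn = torch.softmax(memory @ state, dim=0)   # content addressing
            read = attn @ memory                          # attended read vector
            state = torch.tanh(self.update(torch.cat([state, read])))
        return state                                      # feeds an answer head

facts = torch.randn(10, 32)        # encoded sentences / objects
q = torch.randn(32)
answer_repr = MemoryHops(32)(q, facts)
```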
Transformers: Analogical reasoning through self-
attention
14/08/2021 26
Tay, Yi, et al. "Efficient transformers: A survey." arXiv
preprint arXiv:2009.06732 (2020).
(Figure: self-attention viewed as memory access, with State, Key, Query and Memory.)
Transformer as implicit reasoning
• Recall: Reasoning as (free-) energy minimisation
• The classic Belief Propagation algorithm is minimization algorithm
of the Bethe free-energy!
• The Transformer's relational, iterative state refinement makes it
a great candidate for implicit relational reasoning.
14/08/2021 27
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint
arXiv:2008.02217 (2020).
Transformer vs. memory networks
• Memory network:
• Attention to input set
• One hidden state update at a time.
• Final state integrates information of the set, conditioned on the query.
• Transformer:
• Loading all inputs into working memory
• Assigns one hidden state per input element.
• All hidden states (including those from the query) are used to compute the answer.
14/08/2021 28
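To contrast with the memory-network read above, a minimal single-head self-attention step: every element (context tokens and query tokens alike) keeps its own hidden state, and all states are refined jointly. Multi-head projections, layer norm and feed-forward sublayers are omitted in this sketch.

```python
# Minimal single-head self-attention step: every element keeps its own hidden
# state and all states attend to all others. Illustrative simplification.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** 0.5

    def forward(self, states):                            # states: [n, d]
        Q, K, V = self.q(states), self.k(states), self.v(states)
        attn = torch.softmax(Q @ K.t() / self.scale, dim=-1)  # [n, n] relations
        return states + attn @ V                          # all n states refined at once

tokens = torch.randn(12, 32)     # question tokens and context elements together
refined = SelfAttention(32)(tokens)
```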
Universal transformers
14/08/2021 29
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Dehghani, Mostafa, et al. "Universal
Transformers." International Conference on
Learning Representations. 2018.
Dynamic neural networks
• Memory-Augmented Neural Networks
• Modular program layout
• Program synthesis
14/08/2021 30
Neural Turing machine (NTM)
A memory-augmented neural network (MANN)
• A controller that takes
input/output and talks to an
external memory module.
• Memory has read/write
operations.
• The main issue is where to
write, and how to update the
memory state.
• All operations are
differentiable.
Source: rylanschaeffer.github.io
MANN for reasoning
• Three steps:
• Store data into memory
• Read query, process sequentially, consult memory
• Output answer
• Behind the scene:
• Memory contains data & results of intermediate steps
• LOGNet does the same, memory consists of object
representations
• Drawbacks of current MANNs:
• No memory of controllers  Less modularity and
compositionality when query is complex
• No memory of relations  Much harder to chain predicates.
14/08/2021 32
Source: rylanschaeffer.github.io
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven
program synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
• Combinatorics reasoning
14/08/2021 33
MAC Net: Recurrent,
iterative representation
refinement
14/08/2021 34
Hudson, Drew A., and Christopher D. Manning. "Compositional attention
networks for machine reasoning." ICLR 2018.
Module networks
(reasoning by constructing and executing neural programs)
• Reasoning as laying
out modules to reach
an answer
• Composable neural
architecture 
question parsed as
program (layout of
modules)
• A module is a function
(x  y), which could be a
sub-reasoning process
((x, q)  y).
14/08/2021 35
https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
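An illustrative sketch of the module-network idea: the question is parsed into a layout of module names, and the corresponding neural modules are chained into a computation graph that is executed to produce the answer. The module names and the toy layout here are hypothetical, not those of any specific neural module network.

```python
# Illustrative sketch of a neural module network: a parsed layout of module
# names is executed by chaining small neural modules. Names are hypothetical.
import torch
import torch.nn as nn

class NeuralModule(nn.Module):
    """A reusable step: (features, question) -> features."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Linear(2 * d, d)
    def forward(self, x, q):
        return torch.relu(self.net(torch.cat([x, q], dim=-1)))

modules = nn.ModuleDict({name: NeuralModule(32) for name in ["find", "relate", "answer"]})

def execute(layout, image_feat, question_feat):
    x = image_feat
    for name in layout:                    # chain modules per the parsed program
        x = modules[name](x, question_feat)
    return x

layout = ["find", "relate", "answer"]      # e.g. produced by parsing the question
out = execute(layout, torch.randn(32), torch.randn(32))
```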
Putting things together:
A framework for visual
reasoning
14/08/2021 36
@Truyen Tran & Vuong Le, Deakin Uni
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
• Combinatorics reasoning
14/08/2021 37
Implement combinatorial algorithms
with neural networks
38
(Figure: classical algorithms are generalizable but inflexible; neural networks handle noisy, high-dimensional inputs.)
Train neural processor P to imitate algorithm A
Processor P:
(a) aligned with the
computations of the target
algorithm;
(b) operates by matrix
multiplications, hence
natively admits useful
gradients;
(c) operates over high-
dimensional latent spaces
Veličković, Petar, and Charles Blundell. "Neural Algorithmic Reasoning." arXiv preprint arXiv:2105.02761 (2021).
Processor as RNN
• Does not assume knowledge of the
input structure; the input is treated as a
sequence
 not really reasonable, harder to
generalize
• RNN is Turing-complete
 can simulate any algorithm
• But it is not easy to learn the
simulation from (input, output)
data
Pointer network
39
Assumes O(N) memory
and O(N²) computation,
where N is the size of the input
Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks."
In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pp. 2692-2700. 2015.
Processor as MANN
• MANN simulates a neural
computer or Turing
machine  ideal for
implementing algorithms
• Sequential input, no
assumption on input
structure
• Assumes O(1) memory
and O(N) computation
40
Graves, A., Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
Sequential encoding of graphs
41
• Each node is associated with random one-hot
or binary features
• Output is the features of the solution
(Figure: a geometry instance is encoded as a sequence of [x, y, feature] tuples and a graph instance as a sequence of [node_feature, node_feature, edge] tuples; the output is the sequence of features of the solution. Example tasks: Convex Hull and TSP for geometry; Shortest Path and Minimum Spanning Tree for graphs.)
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive associative memory." In International Conference on Machine Learning, pp. 5682-5691. PMLR, 2020.
DNC: graph
reasoning
42
Graves, A., Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
NUTM: learning multiple algorithms at once
43
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory."
In International Conference on Learning Representations. 2019.
Processor as graph neural network (GNN)
44
https://petar-v.com/talks/Algo-WWW.pdf
Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell.
"Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019.
Motivation:
• Many algorithms operate on graphs
• Supervise graph neural networks with the algorithm's operations/steps/final output
• Encoder-Process-Decode framework, with the processor realised by attention or message passing (a minimal sketch follows below)
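A minimal sketch of the Encoder-Process-Decode pattern with message passing as the processor; the single message function, sum aggregation and toy graph are simplifications rather than the architecture of the cited papers.

```python
# Sketch of encode-process-decode: an encoder lifts raw features into a latent
# space, a message-passing processor is iterated like algorithm steps, and a
# decoder reads out the answer. Illustrative simplification.
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.msg = nn.Linear(2 * d, d)
        self.upd = nn.Linear(2 * d, d)

    def forward(self, h, edges):
        """h: [n, d] node states, edges: list of (src, dst) pairs."""
        agg = torch.zeros_like(h)
        for s, t in edges:                              # messages along each edge
            agg[t] = agg[t] + self.msg(torch.cat([h[s], h[t]]))
        return torch.relu(self.upd(torch.cat([h, agg], dim=-1)))

encoder, decoder = nn.Linear(4, 32), nn.Linear(32, 1)
processor = MessagePassingStep(32)

x = torch.randn(5, 4)                      # raw node features
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # toy graph
h = encoder(x)
for _ in range(3):                         # iterate like algorithm steps
    h = processor(h, edges)
y = decoder(h)                             # per-node prediction (e.g. a distance)
```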
Example: GNN for a specific problem (DNF counting)
• Count the #assignments that satisfy a disjunctive normal
form (DNF) formula
• Exact counting is #P-hard; the classical approximation algorithm runs in O(mn)
• m: #clauses, n: #variables
• Supervised training on output-level
45
Best: O(m+n)
Abboud, Ralph, Ismail Ceylan, and Thomas Lukasiewicz. "Learning to reason: Leveraging neural networks for approximate DNF counting.“
In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3097-3104. 2020.
Neural networks and algorithms alignment
46
Xu, Keyulu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. "What Can Neural Networks Reason About?" ICLR 2020.
https://petar-v.com/talks/Algo-WWW.pdf
Neural exhaustive
search
GNN is aligned with Dynamic
Programming (DP)
47
Neural exhaustive
search
If alignment exists  step-by-step supervision
48
Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. "Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019.
• Merely simulates the
classical graph algorithm;
generalizable
• No algorithm discovery
Joint training is
encouraged
Processor as Transformer
• Back to input sequence
(set), but stronger
generalization
• Transformer with encoder
mask ~ graph attention
• Use Transformer with:
• Binary representation of
numbers
• Dynamic conditional masking
49
Yan, Yujun, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi.
"Neural Execution Engines: Learning to Execute Subroutines." Advances in Neural Information Processing Systems 33 (2020).
Next step
Masked
encoding
Decoding
Mask
prediction
Training with execution trace
50
End of part A
14/08/2021 51
https://bit.ly/37DYQn7
From Deep Learning to Deep Reasoning
14/08/2021 1
Tutorial at KDD, August 14th 2021
Truyen Tran, Vuong Le, Hung Le and Thao Le
{truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au
https://bit.ly/37DYQn7
Part B: Reasoning over unstructured and structured data
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
2
14/08/2021
Learning to Reason formulation
• Input:
• A knowledge context C
• A query q
• Output: an answer a satisfying the query q given C
• C can be
• structured: knowledge graphs
• unstructured: text, image, sound, video
Q: Is it simply an optimization problem like recognition, detection or even translation?
 No, because the mapping from C and q to a is more complex than in other solved optimization problems
 We can solve (some parts of) it with good structures and inference strategies
Q: “What affects her mobility?”
14/08/2021 3
A case study: Image Question Answering
• Realization
• C: visual content of an image
• q: a linguistic question
• a: a linguistic phrase as
the answer to q regarding C
• Challenges
• Reasoning through facts and logics
• Cross-modality integration
14/08/2021 4
Image QA: Question types
14/08/2021 Slide credit: Thao Minh Le 5
Image QA datasets
14/08/2021 Slide credit: Thao Minh Le 6
The two main themes in Image QA
• Neuro-symbolic reasoning
• Parse the question into a “program” of small steps
• Learn the generic steps as neural modules
• Use and reuse the modules for different programs
• Compositional reasoning
• Extract visual and linguistic individual- and joint- representation
• Reasoning happens on the structure of the representation
• Sets/graphs/sequences
• The representation gets refined through multi-step compositional
reasoning
14/08/2021 7
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
8
14/08/2021
A simple approach
 Issue: This is very
susceptible to the nuances of
images and questions
14/08/2021 Agrawal et al., 2015, Slide credit: Thao Minh Le 9
Reasoning as set-set interaction
• O: a set of context objects
• Faster-RCNN regions
• CNN tubes
• L: a set of linguistic objects
- biLSTM embeddings of the question q
 Reasoning is formulated as the interaction between the two sets O and L
for the answer a
14/08/2021 10
Set operations
• Reducing operations (e.g., sum/average/max)
• Attention-based combination (Bahdanau et al. 2015)
• Attention weights as query-key dot product (Vaswani et al., 2017)
 Attention-based set ops seem very suitable for visual reasoning
14/08/2021 11
Attention-based reasoning
• Unidirectional attention
• Find relation scores between parts of the context C and the question
q:
Options for f:
• Hermann et al. (2015)
• Chen et al. (2016)
• Normalized by softmax into attention weights
• Attended context vector:
 We can now extract information from the context that is “relevant” to the query
14/08/2021 12
Bottom-up-top-down attention (Anderson et al 2017)
• Bottom-up set construction: Choosing Faster-RCNN regions with
high class scores
• Top-down attention: Attending on visual features by question
 Q: How about attention from vision objects to linguistic objects?
14/08/2021 13
Bi-directional attention
• Question-context similarity measure
• Question-guided context attention
• Softmax across columns
• Context-guided question attention
• Softmax across rows
 Q: Probably not working for image QA, where single words
do not have co-references with regions?
14/08/2021
Dynamic coattention networks for question answering (Seo et al., ICLR
2017) 14
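A minimal sketch of the bi-directional attention above: one region-word similarity matrix, softmaxed in the two directions to obtain question-guided context attention and context-guided question attention. The shapes and the direction convention are illustrative assumptions.

```python
# Minimal co-attention sketch: one similarity matrix, softmaxed in two
# directions to attend over context given the question and vice versa.
import torch

def co_attention(context, question):
    """context: [n, d] (e.g. regions), question: [m, d] (e.g. word embeddings)."""
    S = context @ question.t()                 # [n, m] similarity matrix
    ctx_attn = torch.softmax(S, dim=0)         # over context elements, per word
    q_attn = torch.softmax(S, dim=1)           # over question words, per region
    attended_ctx = ctx_attn.t() @ context      # [m, d] context summary per word
    attended_q = q_attn @ question             # [n, d] question summary per region
    return attended_ctx, attended_q

regions, words = torch.randn(36, 64), torch.randn(12, 64)
attended_ctx, attended_q = co_attention(regions, words)
```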
Hierarchical co-attention for ImageQA
• The co-attention is found on a word-phrase-sentence hierarchy
 better cross-domain co-references
 Q: Can this be done on text qa as well?
 Q: How about questions with many reasoning hops?
14/08/2021 15
Multi-step compositional reasoning
• Complex questions need multiple hops
of reasoning
• Relations inside the context are multi-
step themselves
• Single shot of attention won’t be
enough
• Single shot of information gathering is
definitely not enough
16
 Q: How to do multi-hop attentional reasoning?
14/08/2021 Figure: Hudson and Manning – ICLR 2018
Multi-step reasoning - Memory, Attention, and Composition (MAC
Nets)
• Attention reasoning is done through multiple sequential steps.
• Each step is done with a recurrent neural cell
• What are the key differences from a normal RNN (LSTM/GRU) cell?
• Not sequential input; it is sequential processing over a static input set.
• Guided by the question through a controller.
14/08/2021 MAC network, Hudson and Manning – ICLR 2018 17
Multi-step attentional reasoning
• At each step, the controller decides what to
look at next
• After each step, a piece of information is
gathered, represented through the
attention map on question words and
visual objects
• A common memory keeps all the
information extracted toward an answer
14/08/2021
MAC network, Hudson and Manning – ICLR 2018
18
Multi-step attentional reasoning
• Step 1: attend to the “tiny blue
block”, updating m1
• Step 2: look for “the sphere in
front”, updating m2
• Step 3: traverse from the cyan ball
to the final objective, the purple
cylinder
19
14/08/2021
Reasoning as set-set interaction – a look back
• O: a set of context objects
• L: a set of linguistic objects (from the question q)
• Reasoning is formulated as the
interaction between the two
sets O and L for the answer a
Q: What is the brown
animal sitting inside of?
 Q: Set-set interaction falls short for questions about relations between objects
14/08/2021 20
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
21
14/08/2021
Reasoning on Graphs
• Relational questions: requiring explicit reasoning about the
relations between multiple objects
14/08/2021 Figure credit: Santoro et al 2017 22
• Relation networks: RN(O) = f_φ( Σ_ij g_θ(o_i, o_j) )
• f_φ and g_θ are neural functions
• g_θ generates the “relation” between the two objects
• f_φ is the aggregation function
Relation networks (Santoro et al 2017)
 The relations here are implicit, complete, pair-wise – inefficient, and lack expressiveness
14/08/2021 23
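A minimal sketch of a Relation Network in the spirit of Santoro et al. (2017): g_θ produces a "relation" for every ordered pair of objects, conditioned on the question, and f_φ aggregates their sum. The hidden sizes and the answer head are illustrative.

```python
# Minimal Relation Network sketch: g produces a relation per object pair
# (conditioned on the question), f aggregates the summed pairwise relations.
import torch
import torch.nn as nn
from itertools import permutations

class RelationNetwork(nn.Module):
    def __init__(self, d_obj, d_q, d_hid=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * d_obj + d_q, d_hid), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU(), nn.Linear(d_hid, 1))

    def forward(self, objects, question):
        """objects: [n, d_obj], question: [d_q]."""
        pair_sum = sum(
            self.g(torch.cat([objects[i], objects[j], question]))
            for i, j in permutations(range(len(objects)), 2)
        )
        return self.f(pair_sum)        # RN(O) = f( sum_ij g(o_i, o_j, q) )

objs, q = torch.randn(6, 32), torch.randn(16)
score = RelationNetwork(32, 16)(objs, q)
```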
Reasoning with Graph convolution networks
• Input graph is built from image entities and question
• GCN is used to gather facts and produce answer
 The relations are now explicit
and pruned
 But the graph building is very
stiff:
- Unrecoverable if it makes a
mistake?
- Information gathered during reasoning is
not used to build the graphs
14/08/2021 Narasimhan et.al NIPS2018 24
Reasoning with Graph attention networks
• The graph is determined during reasoning process with
attention mechanism
The relations are now
adaptive and integrated
with reasoning
 Are the relations
singular and static?
14/08/2021 ReGAT model, Li et.al. ICCV19 25
Dynamic reasoning graphs
• On complex questions,
multiple sets of relations
are needed
• We need not only multi-
step but also multi-form
structures
• Let’s do multiple
dynamically–built graphs!
14/08/2021 LCGN, Hu et.al. ICCV19 26
Dynamic reasoning graphs
The questions so far act as an unstructured command in the process
Aren’t their structures and relations important too?
14/08/2021 LCGN, Hu et.al. ICCV19 27
Reasoning on cross-modality graphs
• Two types of nodes: Linguistic entities and visual objects
• Two types of edges:
• Visual
• Linguistic-visual binding (as a fuzzy grounding)
• Adaptively updated during reasoning
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 28
Language-binding Object Graph (LOG) Unit
• Graph constructor: build the dynamic vision graph
• Language binding constructor: find the dynamic L-V relations
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 29
LOGNet: multi-step visual-linguistic binding
• Object-centric representation 
• Multi-step/multi-structure compositional reasoning 
• Linguistic-vision detail interaction 
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 30
Dynamic language-vision graphs in
actions
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 31
We got sets and graphs, how about sequences?
• Videos pose another challenge for visual reasoning: the
dynamics through time.
• Sets and graphs now become sequences of such.
• Temporal relations are the key factors
• The size of context is a core issue
14/08/2021 32
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
33
14/08/2021
Overview
• Goals of this part of the tutorial
• Understanding Video QA as a complete testbed of
visual reasoning.
• Representative state-of-the-art approaches for
spatio-temporal reasoning.
34
14/08/2021
Video Question Answering
Short-form Video Question Answering
Movie Question Answering
35
14/08/2021
36
(Figure: Visual QA sits at the intersection of Computer Vision, Natural Language Processing, Machine Learning and Reasoning, touching topics such as object recognition, scene graphs, event detection, object discovery, parsing, symbol binding, learning to classify entailment, systematic generalization, unsupervised and reinforcement learning, program synthesis, action graphs, qualitative spatial reasoning, relational and temporal inference, and commonsense.)
14/08/2021 36
Challenges
37
• Difficulties in data annotation.
• Content for performing reasoning spreads over space-
time and multiple modalities (videos, subtitles, speech
etc.)
14/08/2021
Video QA Datasets
38
Movie QA
(Tapaswi, M., et al.,
2016)
MSRVTT-QA and
MSVD-QA
(Xu, D., et al., 2017)
TGIF-QA
(Jang, Y., et al.,
2017)
MarioQA
(Mun, J., et al.,
2017)
CLEVRER
(Yi, K., et al., 2019)
KnowIT VQA
(Garcia, N., et al.,
2020)
14/08/2021
Video QA datasets
39
(TGIF-QA, Jang et al., 2018) (CLEVRER, Yi, Kexin, et al., 2020)
14/08/2021
Video QA as a spatio-temporal
extension of Image QA
40
(a) Extended end-to-end
memory network
(b) Extended simple
VQA model
(c) Extended temporal
attention model
(d) Extended sequence-
to-sequence model
14/08/2021
Zeng, Kuo-Hao, et al. "Leveraging video descriptions to learn video question answering." AAAI’17.
Spatio-temporal cross-modality
alignment
41
Key ideas:
• Explore the correlation
between vision and
language via attention
mechanisms.
• Joint representations
are query-driven
spatio-temporal
features of a given
video.
14/08/2021 Zhao, Zhou, et al. "Video question answering via hierarchical dual-level attention network learning." ACL’17.
Memory-based Video QA
42
General Dynamic Memory Network (DMN)
Co-memory attention networks for Video QA
Key ideas:
• DMN refines attention over a set of
facts to extract reasoning clues.
• Motion and appearance features are
complementary clues for question
answering.
14/08/2021 Gao, Jiyang, et al. "Motion-appearance co-memory networks for video question answering." CVPR’18.
Memory-based Video QA
43
Heterogeneous video memory for Video QA
Key differences:
• Learning a joint representation of
multimodal inputs at each memory
read/write step.
• Utilizing external question memory
to model context-dependent
question words.
14/08/2021
Fan, Chenyou, et al. "Heterogeneous memory enhanced multimodal attention model for video question answering." CVPR’19.
Multimodal reasoning units for Video QA
44
• CRN: Conditional Relation
Networks.
• Inputs:
• Frame-based
appearance features
• Motion features
• Query features
• Outputs:
• Joint representations
encoding temporal
relations, motion, query.
14/08/2021 Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering.“ CVPR’20
Object-oriented spatio-temporal reasoning for
Video QA
45
• OSTR: Object-oriented
Spatio-Temporal Reasoning.
• Inputs:
• Object lives tracked
through time.
• Context (motion).
• Query features.
• Outputs:
• Joint representations
encoding temporal
relations, motion, query.
14/08/2021 Dang, Long Hoang, et al. "Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering." IJCAI’21
Video QA as a down-stream task of
video language pre-training
46
VideoBERT
Apr., 2019
HowTo100M
Jun., 2019
MIL-NCE
Dec., 2019
UniViLM
Feb., 2020
HERO
May, 2020
ClipBERT
Feb., 2021
14/08/2021
VideoBERT: a joint model for video
and language representation learning
47
• Data for training: Sample videos and texts from YouCook II.
Instructions in text given by ASR toolkit
Subsampled video segments
Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19.
14/08/2021
VideoBERT: a joint model for video
and language representation learning
48
Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19.
• Linguistic representations:
• Tokenized texts into
WordPieces, similar as
BERT.
• Visual representations:
• S3D features for each segmented
video clip.
• Tokenized into clusters using
hierarchical k-means.
Pre-training
14/08/2021
VideoBERT: a joint model for video
and language representation learning
49
Pre-training
Down-stream
tasks
Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19.
Video
captioning
Video question
answering
Zero-shot action
classification
14/08/2021
CLIPBERT: video language pre-training
with sparse sampling
50
Lei, Jie, et al. "Less is more: Clipbert for video-and-language learning via sparse sampling." CVPR’21.
ClipBERT
Prev. methods
ClipBERT overview
Procedure:
• Pretraining on large-scale image-text datasets.
• Finetuning on video-text tasks.
14/08/2021
From short-form Video QA to Movie QA
51
Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18.
Long-term temporal relationships
Multimodal inputs
14/08/2021
Conventional methods for Movie QA
52
Question-driven multi-stream
models:
• Short-term temporal relationships are
less important.
• Long-term temporal relationships and
multimodal interactions are key.
• Language is dominant over visual
counterpart.
Le, Thao Minh, et al. "Hierarchical conditional
relation networks for video question answering.“
IJCV’21.
Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18.
14/08/2021
HERO: large-scale pre-training for Movie QA
53
Li, Linjie, et al. "Hero: Hierarchical encoder for video+ language omni-representation pre-training." EMNLP’20.
• Pre-trained on 7.6M
videos and
associated subtitles.
• Achieved state-of-
the-art results on all
datasets.
14/08/2021
End of part B
14/08/2021 54
https://bit.ly/37DYQn7
From Deep Learning to Deep Reasoning
14/08/2021 1
Tutorial at KDD, August 14th 2021
Truyen Tran, Vuong Le, Hung Le and Thao Le
{truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au
https://bit.ly/37DYQn7
Part C: Memory | Data efficiency | Recursive reasoning
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
2
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels:
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
3
Introduction
4
Memory is part of intelligence
• Memory is the ability to
store, retain and recall
information
• Brain memory stores
items, events and high-
level structures
• Computer memory
stores data and
temporary variables
5
Memory-reasoning analogy
6
• 2 processes: fast-slow
o Memory: familiarity-
recollection
• Cognitive test:
o Corresponding reasoning and
memorization performance
o As the number of premises increases,
inductive/deductive
reasoning is affected
Heit, Evan, and Brett K. Hayes. "Predicting reasoning from memory." Journal of Experimental Psychology: General 140, no. 1 (2011): 76.
Common memory activities
• Encode: write information to
the memory, often requiring
compression capability
• Retain: keep the information
over time. This is often taken for granted
in machine memory
• Retrieve: read information from
the memory to solve the task at
hand
Encode
Retain
Retrieve
7
Memory taxonomy based on memory content
8
Item
Memory
• Objects, events, items,
variables, entities
Relational
Memory
• Relationships, structures,
graphs
Program
Memory
• Programs, functions,
procedures, how-to knowledge
Item memory
Associative memory
RAM-like memory
Independent memory
9
Distributed item memory as
associative memory
10
(Figure: everyday associative recall spans semantic, episodic, working and motor memory, covering language, time, objects and behaviour, e.g. "green" means "go," but what does "red" mean?; birthday party on 30th Jan; where is my pen?; what is the password?)
Associative memory can be implemented as a
Hopfield network
11
(Figure: correlation matrix memory vs. Hopfield network: both encode associations into an outer-product "fast-weight" matrix M; the former retrieves with a single feed-forward pass, the latter with recurrent retrieval.)
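A tiny numpy sketch of the correlation-matrix associative memory above: key-value associations are encoded as a sum of outer products (the fast-weight matrix M) and retrieved in a single feed-forward pass. The bipolar random patterns and sign cleanup are simplifications for illustration.

```python
# Correlation-matrix associative memory sketch: store key-value pairs as a sum
# of outer products ("fast weights" M), retrieve with one feed-forward pass.
import numpy as np

rng = np.random.default_rng(0)
d = 64
keys = rng.choice([-1.0, 1.0], size=(3, d))     # e.g. "green", "red", "pen"
values = rng.choice([-1.0, 1.0], size=(3, d))   # e.g. "go", "stop", "drawer"

M = sum(np.outer(v, k) for k, v in zip(keys, values))   # encode: M += v k^T

noisy_key = keys[1] * rng.choice([1, 1, 1, -1], size=d) # partially corrupted cue
retrieved = np.sign(M @ noisy_key)                      # feed-forward retrieval
print(np.mean(retrieved == values[1]))                  # ~1.0: value recovered
```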
Rule-based reasoning with associative
memory
• Encode a set of rules:
“pre-conditions 
post-conditions”
• Support variable
binding, rule-conflict
handling and partial
rule input
• Example of encoding the
rule “A:1, B:3, C:4  X”
12
Outer product
for binding
Austin, Jim. "Distributed associative memories for high-speed symbolic reasoning." Fuzzy Sets and Systems 82, no. 2 (1996): 223-233.
Memory-augmented neural networks:
computation-storage separation
13
RNN Symposium 2016: Alex Graves - Differentiable Neural Computer
RAM
Neural Turing Machine (NTM)
• Memory is a 2d matrix
• Controller is a neural
network
• The controller
read/writes to memory
at certain addresses.
• Trained end-to-end,
differentiable
• Simulate Turing Machine
support symbolic
reasoning, algorithm
solving
14
Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).
Addressing mechanism in NTM
(Figure: the controller emits an erase vector e_t and an add vector a_t for memory writing, and attention weights for memory reading.)
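A minimal sketch of NTM-style content-based addressing with an erase/add write, following the mechanism of Graves et al. (2014) but omitting location-based addressing and the controller; all shapes are illustrative.

```python
# Sketch of NTM-style content-based addressing with an erase/add write.
import torch
import torch.nn.functional as F

def address(memory, key, beta):
    """memory: [n, d], key: [d], beta: sharpness -> attention weights [n]."""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)
    return torch.softmax(beta * sim, dim=0)

def write(memory, w, erase, add):
    """Erase then add, weighted by the addressing weights w (all differentiable)."""
    memory = memory * (1 - w.unsqueeze(1) * erase.unsqueeze(0))
    return memory + w.unsqueeze(1) * add.unsqueeze(0)

M = torch.randn(8, 16)                       # external memory
k, e, a = torch.randn(16), torch.sigmoid(torch.randn(16)), torch.randn(16)
w = address(M, k, beta=5.0)                  # where to write/read
M = write(M, w, e, a)                        # memory writing
read = w @ M                                 # memory reading
```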
Algorithmic reasoning
16
Copy
Associative
recall
Priority sort
Optimal memory writing for
memorization
• Simple finding: writing too often
deteriorates memory content (not
retainable)
• Given input sequence of length T
and only D writes, when should we
write to the memory?
17
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Learning to Remember More with Less Memorization." In International Conference on Learning Representations. 2018.
Uniform writing is optimal for
memorization
Better memorization means better algorithmic reasoning
18
(Figure: example with T=50 and D=5, comparing regular vs. uniform (cached) writing schedules.)
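A tiny sketch of a uniform writing schedule: given an input of length T and a budget of D writes, write at (roughly) evenly spaced steps rather than at every step. The exact spacing rule used here is an illustrative choice, not the paper's exact formulation.

```python
# Tiny sketch of a uniform writing schedule: write D times, evenly spaced over T.
def uniform_write_steps(T, D):
    interval = T / D
    return [min(T - 1, round((i + 1) * interval) - 1) for i in range(D)]

print(uniform_write_steps(T=50, D=5))   # [9, 19, 29, 39, 49]
```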
Memory of independent entities
• Each slot store one or some entities
• Memory writing is done separately for
each memory slot
each slot maintains the life of one or
more entities
• The memory is a set of N parallel RNNs
19
(Figure: toy example of tracking the entities "John" and "Apple" and their locations (Office, Kitchen) over time, one slot per entity.)
Weston, Jason, Bordes, Antoine, Chopra, Sumit, and Mikolov, Tomas.
Towards ai-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015.
Recurrent entity network
20
Henaff, Mikael, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun.
"Tracking the world state with recurrent entity networks."
In 5th International Conference on Learning Representations, ICLR 2017. 2017.
Recurrent Independent Mechanisms
21
Goyal, Anirudh, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. "Recurrent independent mechanisms.“ ICLR21.
Reasoning with independent
dynamics
22
Copy
Ball
dynamics
Relational memory
Graph memory
Tensor memory
23
Why relational memory? Item memory
is weak at recognizing relationships
Item
Memory
• Store and retrieve individual items
• Relate pairs of items within the same time step
• Fail to relate temporally distant items
24
Dual process in memory
25
• Store items
• Simple, low-order
• System 1
Relational
Memory
• Store relationships between items
• Complicated, high-order
• System 2
Item
Memory
Howard Eichenbaum, Memory, amnesia, and the hippocampal system (MIT press, 1993).
Alex Konkel and Neal J Cohen, "Relational memory and the hippocampus: representations and methods", Frontiers in neuroscience 3 (2009).
Memory as graph
• Memory is a static graph with
fixed nodes and edges
• Relationship is somehow
known
• Each memory node stores
the state of the graph’s node
• Write to node via message
passing
• Read from node via MLP
26
Palm, Rasmus Berg, Ulrich Paquet, and Ole Winther. "Recurrent Relational Networks." In NeurIPS. 2018.
27
(Figure: recurrent relational network examples. bAbI: facts and the question become graph nodes connected by edges, and the answer is read from the graph. CLEVR: nodes encode colour, shape and position; edges encode distance.)
Memory of graphs access conditioned on query
• Encode multiple graphs, each
graph is stored in a set of
memory row
• For each graph, the controller
read/write to the memory:
• Read uses content-based
attention
• Write use message passing
• Aggregate read vectors from
all graphs to create output
28
Pham, Trang, Truyen Tran, and Svetha Venkatesh. "Relational dynamic memory networks." arXiv preprint arXiv:1808.04247 (2018).
Capturing relationships can be done via
memory slot interactions using attention
• Graph memory needs customization to an explicit design of nodes and
edges
• Can we automatically learn structure with a 2D tensor memory?
• Capture relationships: each slot interacts with all other slots (self-
attention)
29
Santoro, Adam, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap.
"Relational recurrent neural networks." In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7310-7321. 2018.
Relational Memory Core (RMC) operation
30
RNN-like
Interface
31
Allowing pair-wise interactions can answer
questions on temporal relationship
Dot product attention works for
simple relationship, but …
32
(Figure: "What is most similar to me?" can be answered with scalar dot-product scores such as 0.7, 0.9, -0.1, 0.4; "What is most similar to me but different from tiger?" cannot.)
For hard relationships, scalar representations are limited
Complicated relationships need high-
order relational memory
33
(Figure: extract items into an item memory, then associate every pair of them into a 3D relational tensor, the relational memory.)
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-
attentive associative memory." In International Conference
on Machine Learning, pp. 5682-5691. PMLR, 2020.
Program memory
Module memory
Stored-program memory
34
Predefining program for subtask
• A program designed for a
task becomes a module
• Parse a question to module
layout (order of program
execution)
• Learn the weight of each
module to master the task
35
Andreas, Jacob, Marcus Rohrbach, Trevor Darrell, and Dan Klein. "Neural module networks." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 39-48. 2016.
Program selection is based on
parser, others are end2end trained
36
(Figure: 5 module templates, selected and laid out by parsing the question.)
The most powerful memory is one that stores
both program and data
• Computer architectures:
Universal Turing
Machine / Harvard / von Neumann
• Stored-program principle
• Break a big task into subtasks,
each handled by a
TM / single-purpose program
stored in a program memory
37
https://en.wikipedia.org/
NUTM: Learn to select program (neural weight)
via program attention
• Neural stored-program memory
(NSM) stores key (the address)
and values (the weight)
• The weight is selected and
loaded to the controller of NTM
• The stored NTM weights and
the weights of the NUTM are
learnt end-to-end by
backpropagation
38
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory."
In International Conference on Learning Representations. 2019.
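A simplified sketch of the stored-program idea: programs are stored as (key, weights) pairs, and program attention blends the stored weights into the working controller's parameters. This is a toy illustration of the principle, not the full NUTM.

```python
# Sketch of a neural stored-program memory: programs stored as (key, weights)
# pairs, selected and mixed by program attention. Illustrative toy only.
import torch
import torch.nn as nn

class ProgramMemory(nn.Module):
    def __init__(self, n_prog, d_key, d_in, d_out):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_prog, d_key))
        self.programs = nn.Parameter(torch.randn(n_prog, d_out, d_in))  # stored weights

    def forward(self, query):
        attn = torch.softmax(self.keys @ query, dim=0)        # program attention
        return torch.einsum('p,poi->oi', attn, self.programs) # blended weights

pm = ProgramMemory(n_prog=4, d_key=16, d_in=8, d_out=8)
state, x = torch.randn(16), torch.randn(8)
W = pm(state)          # pick/mix a program conditioned on the controller state
y = x @ W.t()          # run the selected program on the data
```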
Scaling with memory of mini-programs
• Previously, 1 program = 1 neural
network (millions of
parameters)
• Parameter inefficiency since
the programs do not share
common parameters
• Solution: store sharable
mini-programs to compose
an infinite number of programs
39
it is analogous to building Lego structures
corresponding to inputs from basic Lego bricks.
Recurrent program attention to retrieve
singular components of a program
40
Le, Hung, and Svetha Venkatesh. "Neurocoder: Learning General-Purpose Computation Using Stored Neural Programs." arXiv preprint arXiv:2009.11443 (2020).
41
Program attention is equivalent to
binary decision tree reasoning
Recurrent program attention automatically
detects task boundaries
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels:
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
42
Data Augmentation with Analogical and
Counterfactual Examples
43
• Poor generalization when training under the independent
and identically distributed (i.i.d.) assumption.
• Intuition: augmenting counterfactual samples to allow
machines to understand the critical changes in the
input that lead to changes in the answer space.
• Perceptually similar, yet
• Semantically dissimilar realistic samples
Visual counterfactual example
Language counterfactual examples
Gokhale, Tejas, et al. "Mutant: A training paradigm for out-of-distribution
generalization in visual question answering." EMNLP’20.
Question Generations
44
Li, Yikang, et al. "Visual question generation as dual task of visual question answering." CVPR’18.
Krishna, Ranjay, Michael Bernstein, and Li Fei-Fei. "Information maximizing visual question
generation." CVPR’19.
• Question answering is a zero-shot
learning problem. Question
generation helps cover a wider
range of concepts.
• Question generation can be done
with either supervised or
unsupervised learning.
BERT: Transformer That Predicts Its Own
Masked Parts
46
BERT is like parallel
approximate pseudo-
likelihood
• ~ Maximizing the
conditional likelihood of
some variables given the
rest.
• When the number of
variables is large, this
converges to MLE
(maximum likelihood
estimate).
[Slide credit: Truyen Tran]
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
Visual QA as a Down-stream Task of Visual-
Language BERT Pre-trained Models
47
Numerous pre-trained visual language models during 2019-2021.
VisualBERT (Li, Liunian Harold, et al., 2019)
VL-BERT (Su, Weijie, et al., 2019)
UNITER (Chen, Yen-Chun, et al., 2019)
12-in-1 (Lu, Jiasen, et al., 2020)
Pixel-BERT (Huang, Zhicheng, et al., 2019)
OSCAR (Li, Xiujun, et al., 2020)
Single-stream model Two-stream model
ViLBERT (Lu, Jiasen, et al. , 2019)
LXMERT (Tan, Hao, and Mohit Bansal, 2019)
[Slide credit: Licheng Yu et al.]
Learning with External Knowledge
48
Why external knowledge
for reasoning?
• Questions can be beyond
visual recognition (e.g.
firetrucks usually use a fire
hydrant).
• Human’s prior knowledge for
cognition-level reasoning (e.g.
human’s goals, intents etc.)
Q: What sort of vehicle uses this item?
A: firetruck
Q: What is the sports position of the
man in the orange shirt?
A: goalie/goalkeeper
Marino, Kenneth, et al. "Ok-vqa: A visual question
answering benchmark requiring external
knowledge." CVPR’19.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." CVPR’19.
Learning with External Knowledge
49
Retrieved by Wikipedia search API
Marino, Kenneth, et al. "Ok-vqa: A visual question
answering benchmark requiring external
knowledge." CVPR’19.
Shah, Sanket, et al. "Kvqa: Knowledge-aware visual question
answering." AAAI’19.
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels:
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
50
Source: religious studies project
Core AI faculty:
Theory of mind
Where would ToM fit in?
System 1:
Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2:
Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
Image credit: VectorStock | Wikimedia
Perception
Theory of mind
Recursive reasoning
Facts
Semantics
Events and relations
Working space
Memory
Contextualized recursive reasoning
• Thus far, QA tasks are straightforward and objective:
• Questioner: I will ask about what I don’t know.
• Answerer: I will answer what I know.
• Real life can be tricky, more subjective:
• Questioner: I will ask only questions I think they can
answer.
• Answerer 1: This is what I think they want from an answer.
• Answerer 2: I will answer only what I think they think I can.
14/08/2021 53
 We need Theory of Mind to function socially.
Social dilemma: Stag Hunt games
• Difficult decision: individual outcomes (selfish)
or group outcomes (cooperative).
• Together hunt Stag (both are cooperative): Both have more
meat.
• Solely hunt Hare (both are selfish): Both have less meat.
• One hunts Stag (cooperative), the other hunts Hare (selfish): only
the one hunting Hare has meat.
• Human evidence: Self-interested but
considerate of others (cultures vary).
• Idea: Belief-based guilt-aversion
• One experiences loss if it lets other down.
• Necessitates Theory of Mind: reasoning about other’s mind.
Theory of Mind Agent with Guilt Aversion (ToMAGA)
Update Theory of Mind
• Predict whether the other's behaviour is
cooperative or uncooperative
• Update the zero-order belief (what the
other will do)
• Update the first-order belief (what the other
thinks about me)
Guilt Aversion
• Compute the expected material reward
of the other based on Theory of Mind
• Compute the psychological rewards, i.e.
“feeling guilty”
• Reward shaping: subtract the expected
loss of the other.
Nguyen, Dung, et al. "Theory of Mind with Guilt Aversion Facilitates
Cooperative Reinforcement Learning." Asian Conference on Machine
Learning. PMLR, 2020.
[Slide credit: Dung Nguyen]
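A tiny sketch of the guilt-averse reward shaping described above: the agent's shaped reward is its material payoff minus a penalty for letting the other agent down, relative to what (it believes) the other expected. The payoff numbers and the guilt weight are hypothetical.

```python
# Tiny sketch of belief-based guilt-averse reward shaping: material payoff minus
# a penalty proportional to how much the other agent is let down.
def shaped_reward(material_reward, other_actual_payoff,
                  other_expected_payoff, guilt_weight=0.5):
    letdown = max(0.0, other_expected_payoff - other_actual_payoff)
    return material_reward - guilt_weight * letdown   # "feeling guilty" penalty

# Stag Hunt flavour: I defected (hunted hare) while the other expected cooperation.
print(shaped_reward(material_reward=3.0,
                    other_actual_payoff=0.0,
                    other_expected_payoff=4.0))       # 3.0 - 0.5*4.0 = 1.0
```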
Machine Theory of Mind Architecture (inside the Observer)
Successor
representations
next-step action
probability
goal
Rabinowitz, Neil, et al. "Machine theory of mind." International conference on machine learning. PMLR, 2018.
[Slide credit: Dung Nguyen]
A ToM
architecture
• Observer maintains memory of
previous episodes of the agent.
• It theorizes the “traits” of the
agent.
• Implemented as Hyper Networks.
• Given the current episode, the
observer tries to infer goal,
intention, action, etc of the
agent.
• Implemented as memory retrieval
through attention mechanisms.
14/08/2021 57
Wrapping up
58
Wrapping up
• Reasoning as the next challenge for deep neural networks
• Part A: Learning-to-reason framework
• Reasoning as a prediction skill that can be learnt from data
• Dynamic neural networks are capable
• Combinatorics reasoning
• Part B: Reasoning over unstructured and structured data
• Reasoning over unstructured sets
• Relational reasoning over structured data
• Part C: Memory | Data efficiency | Recursive reasoning
• Memories of items, relations and programs
• Learning with less labels
• Theory of mind
14/08/2021 59
A possible framework for learning and reasoning
with deep neural networks
System 1:
Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2:
Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
Image credit: VectorStock | Wikimedia
Perception
Theory of mind
Recursive reasoning
Facts
Semantics
Events and relations
Working space
Memory
QA
14/08/2021 61
https://bit.ly/37DYQn7

From deep learning to deep reasoning

  • 1.
    From Deep Learningto Deep Reasoning 14/08/2021 1 Tutorial at KDD, August 14th 2021 Truyen Tran, Vuong Le, Hung Le and Thao Le {truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au https://bit.ly/37DYQn7 Part A: Learning to reason
  • 2.
    Logistics 14/08/2021 2 Truyen TranVuong Le Hung Le Thao Le https://bit.ly/37DYQn7
  • 3.
    Agenda • Introduction • PartA: Learning-to-reason framework • Part B: Reasoning over unstructured and structured data • Part C: Memory | Data efficiency | Recursive reasoning 14/08/2021 3
  • 4.
    2012 2016 AusDM 2016 Turing Awards2018 GPT-3 2020 DL: 8 years snapshot
  • 5.
    DL has beenfantastic, but … • It is great at interpolating •  data hungry to cover all variations and smooth local manifolds •  little systematic generalization (novel combinations) • Lack of human-perceived reasoning capability • Lack natural mechanism to incorporate prior knowledge, e.g., common sense • No built-in causal mechanisms •  Have trust issues! • To be fair, may of these problems are common in statistical learning! 14/08/2021 5
  • 6.
    Why still DLin 2021? Theoretical Expressiveness: Neural nets can approximate any function. Learnability: Neural nets are trained easily. Generalisability: Neural nets generalize surprisingly well to unseen data. Practical Generality: Applicable to many domains. Competitive: DL is hard to beat as long as there are data to train. Scalability: DL is better with more data, and it is very scalable.
  • 7.
    The next AI/MLchallenge 2020s-2030s  Learning + reasoning, general purpose, human-like  Has contextual and common- sense reasoning  Requires less data  Adapt to change  Explainable Photo credit: DARPA
  • 8.
    Toward deeper reasoning System1: Intuitive System 1: Intuitive System 1: Intuitive • Fast • Implicit/automatic • Pattern recognition • Multiple System 2: Analytical • Slow • Deliberate/rational • Careful analysis • Single, sequential Single Image credit: VectorStock | Wikimedia Perception Theory of mind Recursive reasoning Facts Semantics Events and relations Working space Memory
  • 9.
    System 2 • Holdshypothetical thought • Decoupling from representation • Working memory size is not essential. Its attentional control is. 14/08/2021 9
  • 10.
    Figure credit: JonathanHui Reasoning in Probabilistic Graphical Models (PGM) • Assuming models are fully specified (e.g., by hand or learnt) • Estimate MAP as energy minimization • Compute marginal probability • Compute expectation & normalisation constant • Key algorithm: Pearl’s Belief Propagation, a.k.a Sum-Product algorithm in factor graphs. • Known result in 2001-2003: BP minimises Bethe free-energy minimization. 14/08/2021 10 Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free energy." Advances in neural information processing systems. 2003.
  • 11.
    Can we learnto infer directly from data without full specification of models? 14/08/2021 11
  • 12.
    Agenda • Introduction • PartA: Learning-to-reason framework • Part B: Reasoning over unstructured and structured data • Part C: Memory | Data efficiency | Recursive reasoning 14/08/2021 12
  • 13.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution • Compositional attention networks. • Neural module networks. • Combinatorics reasoning 14/08/2021 13
  • 14.
    Learning to reason •Learning is to self-improve by experiencing ~ acquiring knowledge & skills • Reasoning is to deduce knowledge from previously acquired knowledge in response to a query (or a cues) • Learning to reason is to improve the ability to decide if a knowledge base entails a predicate. • E.g., given a video f, determines if the person with the hat turns before singing. • Hypotheses: • Reasoning as just-in-time program synthesis. • It employs conditional computation. 14/08/2021 14 Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725. (Dan Roth; ACM Fellow; IJCAI John McCarthy Award)
  • 15.
    Learning to reason,a definition 14/08/2021 15 Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725. E.g., given a video f, determines if the person with the hat turns before singing.
  • 16.
    Practical setting: (query,database,answer)triplets • This is very general: • Classification: Query = what is this? Database = data. • Regression: Query = how much? Database = data. • QA: Query = NLP question. Database = context/image/text. • Multi-task learning: Query = task ID. Database = data. • Zero-shot learning: Query = task description. Database = data. • Drug-protein binding: Query = drug. Database = protein. • Recommender system: Query = User (or item). Database = inventories (or user base); 14/08/2021 16
  • 17.
    Can neural networksreason? Reasoning is not necessarily achieved by making logical inferences There is a continuity between [algebraically rich inference] and [connecting together trainable learning systems] Central to reasoning is composition rules to guide the combinations of modules to address new tasks 14/08/2021 17 “When we observe a visual scene, when we hear a complex sentence, we are able to explain in formal terms the relation of the objects in the scene, or the precise meaning of the sentence components. However, there is no evidence that such a formal analysis necessarily takes place: we see a scene, we hear a sentence, and we just know what they mean. This suggests the existence of a middle layer, already a form of reasoning, but not yet formal or logical.” Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149.
  • 18.
    Hypotheses • Reasoning asjust-in-time program synthesis. • It employs conditional computation. • Reasoning is recursive, e.g., mental travel. 14/08/2021 18
  • 19.
    Two approaches toneural reasoning • Implicit chaining of predicates through recurrence: • Step-wise query-specific attention to relevant concepts & relations. • Iterative concept refinement & combination, e.g., through a working memory. • Answer is computed from the last memory state & question embedding. • Explicit program synthesis: • There is a set of modules, each performs an pre-defined operation. • Question is parse into a symbolic program. • The program is implemented as a computational graph constructed by chaining separate modules. • The program is executed to compute an answer. 14/08/2021 19
  • 20.
    In search forbasic neural operators for reasoning • Basics: • Neuron as feature detector  Sensor, filter • Computational graph  Circuit • Skip-connection  Short circuit • Essentials • Multiplicative gates  AND gate, Transistor, Resistor • Attention mechanism  SWITCH gate • Memory + forgetting  Capacitor + leakage • Compositionality  Modular design • .. 14/08/2021 20 Photo credit: Nicola Asuni
  • 21.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. • Combinatorics reasoning 14/08/2021 21
  • 22.
Concept-object binding • Perceived data (e.g., visual objects) may not share the same semantic space with high-level concepts. • Binding between concepts and objects enables reasoning at the concept level 14/08/2021 22 Example of concept-object binding in LOGNet (Le et al, IJCAI'2020) More reading: Greff, Klaus, Sjoerd van Steenkiste, and Jürgen Schmidhuber. "On the binding problem in artificial neural networks." arXiv preprint arXiv:2012.05208 (2020).
  • 23.
Attention: Picking up only what is needed at a step • Need an attention model to select or ignore certain computations or inputs • Can be "soft" (differentiable) or "hard" (requires RL) • Needed for selecting predicates in reasoning. • Attention provides a short-cut for long-term dependencies • Needed for long chains of reasoning. • Also encourages sparsity if done right! http://distill.pub/2016/augmented-rnns/
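A minimal numpy sketch of soft (differentiable) attention as a query-key-value read; the dimensions and the scaled dot-product scoring are illustrative choices, not the only option. Hard attention would instead sample a single index and typically needs RL to train.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Differentiable ("soft") attention: a weighted read over all inputs.

    query:  (d,)    what the current reasoning step is looking for
    keys:   (n, d)  one key per candidate input / predicate
    values: (n, dv) the content that actually gets read out
    """
    scores = keys @ query / np.sqrt(keys.shape[-1])     # relevance of each input
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                   # softmax -> selection distribution
    return weights @ values, weights                    # attended summary + where we looked

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))
values = rng.normal(size=(5, 4))
query = keys[2] + 0.1 * rng.normal(size=8)              # query close to the third input
read, w = soft_attention(query, keys, values)
print(np.round(w, 2))                                   # weight mass concentrates on input 2
```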
  • 24.
Fast weights | HyperNet – the multiplicative interaction • Early ideas in the early 1990s by Juergen Schmidhuber and collaborators. • Data-dependent weights | Using a controller to generate the weights of the main net. 14/08/2021 24 Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
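A toy sketch of the fast-weight / hypernetwork idea, assuming a single dense layer whose weights are produced by a small linear controller from a context vector; the names and sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def hyper_layer(x, context, W_hyper, b_hyper):
    """Fast weights: a small controller generates the weights of the main layer.

    x:       (d_in,)  input to the main network
    context: (d_c,)   conditioning signal (e.g. a query embedding)
    The controller maps the context to a full (d_out, d_in) weight matrix, so the
    main computation x -> W(context) @ x is multiplicative in the context.
    """
    d_in = x.shape[0]
    d_out = b_hyper.shape[0] // d_in
    W_main = (W_hyper @ context + b_hyper).reshape(d_out, d_in)   # data-dependent weights
    return np.tanh(W_main @ x)

d_in, d_out, d_c = 6, 3, 4
W_hyper = rng.normal(scale=0.1, size=(d_out * d_in, d_c))
b_hyper = rng.normal(scale=0.1, size=(d_out * d_in,))
x = rng.normal(size=d_in)
print(hyper_layer(x, rng.normal(size=d_c), W_hyper, b_hyper))
```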
  • 25.
Memory networks: Holding the data ready for inference • Input is a set → load into memory, which is NOT updated. • State is an RNN with attention reading from the inputs • Concepts: query, key and content + content addressing. • Deep models, but constant path length from input to output. • Equivalent to an RNN with a shared input set. 14/08/2021 25 Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
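A minimal sketch of one memory "hop" in the spirit of end-to-end memory networks: the input set is written once into a static memory, and the controller state is refined by repeated content-based reads. For brevity the same embedding serves as both key and value, unlike the separate input/output embeddings used in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(state, memory):
    """One hop: address the static memory by content, read, and update the state."""
    p = softmax(memory @ state)       # attention over the stored input set
    return state + p @ memory         # residual update of the controller state

rng = np.random.default_rng(1)
facts = rng.normal(size=(10, 16))     # embedded input set, written once and never updated
state = rng.normal(size=16)           # question embedding
for _ in range(3):                    # more hops = deeper reasoning, constant path length
    state = memory_hop(state, facts)
print(state.shape)                    # the answer is decoded from the final state
```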
  • 26.
    Transformers: Analogical reasoningthrough self- attention 14/08/2021 26 Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020). State Key Query Memory
  • 27.
Transformer as implicit reasoning • Recall: reasoning as (free-) energy minimisation • The classic Belief Propagation algorithm is a minimization algorithm for the Bethe free energy! • The Transformer's relational, iterative state refinement makes it a great candidate for implicit relational reasoning. 14/08/2021 27 Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
  • 28.
Transformer vs. memory networks • Memory network: • Attention over the input set • One hidden state update at a time. • The final state integrates information from the set, conditioned on the query. • Transformer: • Loads all inputs into working memory • Assigns one hidden state per input element. • All hidden states (including those from the query) are used to compute the answer. 14/08/2021 28
  • 29.
    Universal transformers 14/08/2021 29 https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html Dehghani,Mostafa, et al. "Universal Transformers." International Conference on Learning Representations. 2018.
  • 30.
    Dynamic neural networks •Memory-Augmented Neural Networks • Modular program layout • Program synthesis 14/08/2021 30
  • 31.
Neural Turing machine (NTM) A memory-augmented neural network (MANN) • A controller that takes input/output and talks to an external memory module. • Memory has read/write operations. • The main issue is where to write, and how to update the memory state. • All operations are differentiable. Source: rylanschaeffer.github.io
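A simplified, illustrative version of one differentiable memory access: content-based addressing by sharpened cosine similarity, then a blended erase-and-add write followed by a read. The full NTM addressing also interpolates with the previous weights, shifts and sharpens them; those parts are omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ntm_step(M, key, erase, add, beta=5.0):
    """One differentiable memory access in the spirit of the NTM.

    M:     (n_slots, width) memory matrix
    key:   (width,) content key emitted by the controller
    erase: (width,) erase vector in [0, 1]
    add:   (width,) add vector
    """
    # Content-based addressing: cosine similarity sharpened by beta.
    sim = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sim)                               # soft address over slots
    M = M * (1 - np.outer(w, erase)) + np.outer(w, add)   # blended erase-then-add write
    r = w @ M                                             # read vector
    return M, r, w

rng = np.random.default_rng(0)
M = rng.normal(scale=0.1, size=(8, 6))
M, r, w = ntm_step(M, key=rng.normal(size=6), erase=np.full(6, 0.5), add=rng.normal(size=6))
print(np.round(w, 2), r.shape)
```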
  • 32.
    MANN for reasoning •Three steps: • Store data into memory • Read query, process sequentially, consult memory • Output answer • Behind the scene: • Memory contains data & results of intermediate steps • LOGNet does the same, memory consists of object representations • Drawbacks of current MANNs: • No memory of controllers  Less modularity and compositionality when query is complex • No memory of relations  Much harder to chain predicates. 14/08/2021 32 Source: rylanschaeffer.github.io
  • 33.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. • Combinatorics reasoning 14/08/2021 33
  • 34.
    MAC Net: Recurrent, iterativerepresentation refinement 14/08/2021 34 Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning." ICLR 2018.
  • 35.
Module networks (reasoning by constructing and executing neural programs) • Reasoning as laying out modules to reach an answer • Composable neural architecture → question parsed as a program (layout of modules) • A module is a function (x → y); it could be a sub-reasoning process ((x, q) → y). 14/08/2021 35 https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
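A hand-coded sketch of module composition: "find", "relocate" and "exist" modules chained by a hypothetical, hand-written layout for one question. In a real neural module network the modules are learned neural functions and the layout comes from parsing the question.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBJ = 5
obj_colors = np.array([0, 1, 1, 2, 0])                       # toy attributes: 0=red, 1=blue, 2=green
left_of = (rng.random((N_OBJ, N_OBJ)) < 0.4).astype(float)   # left_of[i, j] = 1: object i is left of j

def find(color):                      # attend to objects with the given attribute
    return (obj_colors == color).astype(float)

def relocate(att):                    # shift attention to objects left of the attended ones
    return np.clip(left_of @ att, 0.0, 1.0)

def exist(att):                       # answer module: does anything remain attended?
    return bool(att.max() > 0.5)

# "Is there something to the left of a blue object?" parsed into a module layout:
layout = [("find", 1), ("relocate", None), ("exist", None)]
att, answer = None, None
for name, arg in layout:              # execute the composed program
    if name == "find":
        att = find(arg)
    elif name == "relocate":
        att = relocate(att)
    elif name == "exist":
        answer = exist(att)
print(answer)
```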
  • 36.
    Putting things together: Aframework for visual reasoning 14/08/2021 36 @Truyen Tran & Vuong Le, Deakin Uni
  • 37.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. • Combinatorics reasoning 14/08/2021 37
  • 38.
Implement combinatorial algorithms with neural networks 38 (Figure: classical algorithms are generalizable but inflexible; neural networks cope with noisy, high-dimensional inputs.) Train a neural processor P to imitate algorithm A. Processor P: (a) is aligned with the computations of the target algorithm; (b) operates by matrix multiplications, hence natively admits useful gradients; (c) operates over high-dimensional latent spaces. Veličković, Petar, and Charles Blundell. "Neural Algorithmic Reasoning." arXiv preprint arXiv:2105.02761 (2021).
  • 39.
Processor as RNN • Does not assume knowledge of the input structure; treating the input as a sequence is not really reasonable and is harder to generalize • An RNN is Turing-complete → it can simulate any algorithm • But it is not easy to learn the simulation from data (input–output pairs). Pointer network 39 Assumes O(N) memory and O(N^2) computation, where N is the size of the input. Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pp. 2692-2700. 2015.
  • 40.
Processor as MANN • A MANN simulates a neural computer or Turing machine → ideal for implementing algorithms • Sequential input, no assumption on input structure • Assumes O(1) memory and O(N) computation 40 Graves, A., Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
  • 41.
    Sequential encoding ofgraphs 41 • Each node is associated with random one-hot or binary features • Output is the features of the solution [x1,y1, feature1], [x2,y2, feature2], … [feature4], [feature2], … Geometry [node_feature1, node_feature2, edge12], [node_feature1, node_feature2, edge13], … [node_feature4], [node_feature2], … Graph Convex Hull TSP Shortest Path Minimum Spanning Tree Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive associative memory." In International Conference on Machine Learning, pp. 5682-5691. PMLR, 2020.
  • 42.
    DNC: graph reasoning 42 Graves, A.,Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
  • 43.
    NUTM: learning multiplealgorithms at once 43 Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory." In International Conference on Learning Representations. 2019.
  • 44.
Processor as graph neural network (GNN) 44 https://petar-v.com/talks/Algo-WWW.pdf Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. "Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019. Motivation: • Many algorithms operate on graphs • Supervise graph neural networks with the algorithm's operations/steps/final output • Encoder-Process-Decode framework: attention + message passing
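A bare-bones sketch of the Encoder-Process-Decode recipe with a message-passing processor; max aggregation is used because it mirrors dynamic-programming style relaxations, but the architectural details here are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mp_step(H, A, W_msg, W_upd):
    """One message-passing step of a (very) simplified GNN processor.

    H: (n, d) node states, A: (n, n) adjacency (0/1); weights are dense layers.
    Messages from neighbours are aggregated by max, mimicking DP-style updates
    such as Bellman-Ford relaxations.
    """
    msgs = relu(H @ W_msg)                                      # per-node outgoing message
    agg = np.where(A[:, :, None] > 0, msgs[None, :, :], -np.inf).max(axis=1)
    agg = np.where(np.isfinite(agg), agg, 0.0)                  # isolated nodes receive zero
    return relu(np.concatenate([H, agg], axis=-1) @ W_upd)

rng = np.random.default_rng(0)
n, d = 6, 8
A = (rng.random((n, n)) < 0.3).astype(float)
H = rng.normal(size=(n, d))                                     # "encode": embedded node inputs
W_msg = rng.normal(scale=0.3, size=(d, d))
W_upd = rng.normal(scale=0.3, size=(2 * d, d))
for _ in range(3):                                              # "process" for a few steps
    H = mp_step(H, A, W_msg, W_upd)
print(H.shape)                                                  # "decode" would map H to the algorithm's output
```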
  • 45.
Example: GNN for a specific problem (DNF counting) • Count #assignments that satisfy a disjunctive normal form (DNF) formula • Exact counting is #P-hard; the classical approximate counting algorithm runs in O(mn) • m: #clauses, n: #variables • Supervised training at the output level 45 Best: O(m+n) Abboud, Ralph, Ismail Ceylan, and Thomas Lukasiewicz. "Learning to reason: Leveraging neural networks for approximate DNF counting." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3097-3104. 2020.
  • 46.
Neural networks and algorithm alignment 46 Xu, Keyulu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. "What Can Neural Networks Reason About?." ICLR 2020 (2020). https://petar-v.com/talks/Algo-WWW.pdf Neural exhaustive search
  • 47.
    GNN is alignedwith Dynamic Programming (DP) 47 Neural exhaustive search
  • 48.
If alignment exists → step-by-step supervision 48 Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. "Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019. • Merely simulates the classical graph algorithm, but generalizable • No algorithm discovery • Joint training is encouraged
  • 49.
Processor as Transformer • Back to an input sequence (set), but stronger generalization • Transformer with an encoder mask ~ graph attention • Use a Transformer with: • Binary representation of numbers • Dynamic conditional masking 49 Yan, Yujun, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi. "Neural Execution Engines: Learning to Execute Subroutines." Advances in Neural Information Processing Systems 33 (2020). (Figure: next step, masked encoding, decoding, mask prediction.)
  • 50.
  • 51.
    End of partA 14/08/2021 51 https://bit.ly/37DYQn7
  • 52.
    From Deep Learningto Deep Reasoning 14/08/2021 1 Tutorial at KDD, August 14th 2021 Truyen Tran, Vuong Le, Hung Le and Thao Le {truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au https://bit.ly/37DYQn7 Part B: Reasoning over unstructured and structured data
  • 53.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 2 14/08/2021
  • 54.
Learning to Reason formulation • Input: • A knowledge context C • A query q • Output: an answer a satisfying the query • C can be • structured: knowledge graphs • unstructured: text, image, sound, video Q: Is it simply an optimization problem like recognition, detection or even translation? → No, because the logic mapping (C, q) to a is more complex than in other solved optimization problems → We can solve (some parts of) it with good structures and inference strategies Q: "What affects her mobility?" 14/08/2021 3
  • 55.
A case study: Image Question Answering • Realization • C: visual content of an image • q: a linguistic question • a: a linguistic phrase as the answer to q regarding C • Challenges • Reasoning through facts and logic • Cross-modality integration 14/08/2021 4
  • 56.
    Image QA: Questiontypes 14/08/2021 Slide credit: Thao Minh Le 5
  • 57.
    Image QA datasets 14/08/2021Slide credit: Thao Minh Le 6
  • 58.
The two main themes in Image QA • Neuro-symbolic reasoning • Parse the question into a "program" of small steps • Learn the generic steps as neural modules • Use and reuse the modules for different programs • Compositional reasoning • Extract visual and linguistic individual and joint representations • Reasoning happens on the structure of the representation • Sets/graphs/sequences • The representation gets refined through multi-step compositional reasoning 14/08/2021 7
  • 59.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 8 14/08/2021
  • 60.
    A simple approach Issue: This is very susceptible to the nuances of images and questions 14/08/2021 Agrawal et al., 2015, Slide credit: Thao Minh Le 9
  • 61.
Reasoning as set-set interaction • O: a set of context objects • Faster-RCNN regions • CNN tubes • L: a set of linguistic objects – biLSTM embeddings of the question q → Reasoning is formulated as the interaction between the two sets O and L for the answer a 14/08/2021 10
  • 62.
Set operations • Reducing operations (e.g., sum/average/max) • Attention-based combination (Bahdanau et al. 2015) • Attention weights as query-key dot product (Vaswani et al., 2017) → Attention-based set ops seem very suitable for visual reasoning 14/08/2021 11
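A small numpy comparison of the two kinds of set operation: query-agnostic reductions versus an attention-based, query-driven combination. The feature sizes and the dot-product scoring are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
O = rng.normal(size=(7, 16))      # set of visual objects (e.g. Faster-RCNN region features)
q = rng.normal(size=16)           # question summary vector

# Reducing operations: order-invariant but query-agnostic.
o_sum, o_mean, o_max = O.sum(0), O.mean(0), O.max(0)

# Attention-based combination: the query decides how much each object contributes.
w = softmax(O @ q / np.sqrt(O.shape[1]))   # query-key dot-product scores
o_att = w @ O                               # attended set summary
print(np.round(w, 2))
```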
  • 63.
Attention-based reasoning • Unidirectional attention • Find relation scores between parts of the context C and the question q: Options for f: • Hermann et al. (2015) • Chen et al. (2016) • Normalized by softmax into attention weights • Attended context vector: → We can now extract information from the context that is "relevant" to the query 14/08/2021 12
  • 64.
    Bottom-up-top-down attention (Andersonet al 2017) • Bottom-up set construction: Choosing Faster-RCNN regions with high class scores • Top-down attention: Attending on visual features by question  Q: How about attention from vision objects to linguistic objects? 14/08/2021 13
  • 65.
Bi-directional attention • Question-context similarity measure • Question-guided context attention • Softmax across columns • Context-guided question attention • Softmax across rows → Q: Probably not working for image QA, where single words do not have a co-reference with a region? 14/08/2021 Dynamic coattention networks for question answering (Seo et al., ICLR 2017) 14
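A compact sketch of co-attention: one similarity matrix, softmax along each axis, yielding question-guided context vectors and context-guided question vectors. This follows the general recipe rather than any single paper's exact formulation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 16))     # context objects (regions or context words)
Q = rng.normal(size=(4, 16))     # question words

S = C @ Q.T                       # (6, 4) question-context similarity matrix
A_c = softmax(S, axis=1)          # per context element: attention over question words
A_q = softmax(S, axis=0)          # per question word: attention over context elements

C_aware = A_c @ Q                 # question-guided context representation
Q_aware = A_q.T @ C               # context-guided question representation
print(C_aware.shape, Q_aware.shape)
```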
  • 66.
    Hierarchical co-attention forImageQA • The co-attention is found on a word-phrase-sentence hierarchy  better cross-domain co-references  Q: Can this be done on text qa as well?  Q: How about questions with many reasoning hops? 14/08/2021 15
  • 67.
Multi-step compositional reasoning • Complex questions need multiple hops of reasoning • Relations inside the context are multi-step themselves • A single shot of attention won't be enough • A single shot of information gathering is definitely not enough 16 → Q: How to do multi-hop attentional reasoning? 14/08/2021 Figure: Hudson and Manning – ICLR 2018
  • 68.
Multi-step reasoning - Memory, Attention, and Composition (MAC Nets) • Attention reasoning is done through multiple sequential steps. • Each step is done with a recurrent neural cell • What are the key differences from a normal RNN (LSTM/GRU) cell? • The input is not sequential; it is sequential processing on a static input set. • Guided by the question through a controller. 14/08/2021 MAC network, Hudson and Manning – ICLR 2018 17
  • 69.
Multi-step attentional reasoning • At each step, the controller decides what to look at next • After each step, a piece of information is gathered, represented through the attention map over question words and visual objects • A common memory keeps all the information extracted toward an answer 14/08/2021 MAC network, Hudson and Manning – ICLR 2018 18
  • 70.
    Multi-step attentional reasoning •Step 1: attends to the “tiny blue block”, updating m1 • Step 2: look for “the sphere in front” m2. • Step3: traverse from the cyan ball to the final objective – the purple cylinder, 19 14/08/2021
  • 71.
Reasoning as set-set interaction – a look back • O: a set of context objects • L: a set of linguistic objects • Reasoning is formulated as the interaction between the two sets O and L for the answer a Q: What is the brown animal sitting inside of? → Q: Set-set interaction falls short for questions about relations between objects 14/08/2021 20
  • 72.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 21 14/08/2021
  • 73.
    Reasoning on Graphs •Relational questions: requiring explicit reasoning about the relations between multiple objects 14/08/2021 Figure credit: Santoro et al 2017 22
  • 74.
Relation networks (Santoro et al 2017): RN(O) = f_φ( Σ_{i,j} g_θ(o_i, o_j) ) • g_θ and f_φ are neural functions • g_θ generates the "relation" between the two objects • f_φ (applied to the sum over pairs) is the aggregation function → The relations here are implicit, complete, pair-wise – inefficient, and lack expressiveness 14/08/2021 23
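The formula above, written out as a slow, illustrative numpy loop; each pair is also conditioned on the question, as in the VQA variant of Relation Networks, and single linear layers with ReLU stand in for the paper's MLPs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relation_network(objects, q, Wg, Wf):
    """RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j, q) ): reason over all object pairs."""
    n = objects.shape[0]
    pair_sum = 0.0
    for i in range(n):
        for j in range(n):
            pair = np.concatenate([objects[i], objects[j], q])
            pair_sum = pair_sum + relu(Wg @ pair)     # g_theta: per-pair "relation"
    return Wf @ pair_sum                              # f_phi: aggregate to answer logits

rng = np.random.default_rng(0)
objs = rng.normal(size=(5, 8))        # e.g. CNN grid cells or detected objects
q = rng.normal(size=6)                # question embedding
Wg = rng.normal(scale=0.1, size=(16, 8 + 8 + 6))
Wf = rng.normal(scale=0.1, size=(3, 16))
print(relation_network(objs, q, Wg, Wf))   # O(n^2) pairs -- the inefficiency noted above
```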
  • 75.
Reasoning with graph convolution networks • The input graph is built from image entities and the question • A GCN is used to gather facts and produce the answer → The relations are now explicit and pruned → But the graph building is very stiff: - Unrecoverable if it makes a mistake - Information gathered during reasoning is not used to build the graphs 14/08/2021 Narasimhan et al., NIPS 2018 24
  • 76.
Reasoning with graph attention networks • The graph is determined during the reasoning process with an attention mechanism → The relations are now adaptive and integrated with reasoning → But are the relations singular and static? 14/08/2021 ReGAT model, Li et al., ICCV'19 25
  • 77.
Dynamic reasoning graphs • For complex questions, multiple sets of relations are needed • We need not only multi-step but also multi-form structures • Let's build multiple graphs dynamically! 14/08/2021 LCGN, Hu et al., ICCV'19 26
  • 78.
    Dynamic reasoning graphs Thequestions so far act as an unstructured command in the process Aren’t their structures and relations important too? 14/08/2021 LCGN, Hu et.al. ICCV19 27
  • 79.
    Reasoning on cross-modalitygraphs • Two types of nodes: Linguistic entities and visual objects • Two types of edges: • Visual • Linguistic-visual binding (as a fuzzy grounding) • Adaptively updated during reasoning 14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 28
  • 80.
    Language-binding Object Graph(LOG) Unit • Graph constructor: build the dynamic vision graph • Language binding constructor: find the dynamic L-V relations 14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 29
  • 81.
    LOGNet: multi-step visual-linguisticbinding • Object-centric representation  • Multi-step/multi-structure compositional reasoning  • Linguistic-vision detail interaction  14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 30
  • 82.
    Dynamic language-vision graphsin actions 14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 31
  • 83.
We have sets and graphs; how about sequences? • Videos pose another challenge for visual reasoning: the dynamics through time. • Sets and graphs now become sequences of such. • Temporal relations are the key factors • The size of the context is a core issue 14/08/2021 32
  • 84.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 33 14/08/2021
  • 85.
    Overview • Goals ofthis part of the tutorial • Understanding Video QA as a complete testbed of visual reasoning. • Representative state-of-the-art approaches for spatio-temporal reasoning. 34 14/08/2021
  • 86.
    Video Question Answering Short-formVideo Question Answering Movie Question Answering 35 14/08/2021
  • 87.
36 (Figure: Visual QA sits at the intersection of Computer Vision, Natural Language Processing and Machine Learning, touching qualitative spatial reasoning, relational and temporal inference, commonsense, object recognition, scene graphs, parsing, symbol binding, systematic generalization, learning to classify entailment, unsupervised and reinforcement learning, program synthesis, action graphs, event detection and object discovery.) 14/08/2021 36
  • 88.
Challenges 37 • Difficulties in data annotation. • Content for performing reasoning spreads over space-time and multiple modalities (videos, subtitles, speech etc.) 14/08/2021
  • 89.
    Video QA Datasets 38 38 MovieQA (Tapaswi, M., et al., 2016) MSRVTT-QA and MSVD-QA (Xu, D., et al., 2017) TGIF-QA (Jang, Y., et al., 2017) MarioQA (Mun, J., et al., 2017) CLEVRER (Yi, K., et al., 2019) KnowIT VQA (Garcia, N., et al., 2020) 14/08/2021
  • 90.
    Video QA datasets 39 39 (TGIF-QA,Jang et al., 2018) (CLEVRER, Yi, Kexin, et al., 2020) 14/08/2021
  • 91.
    Video QA asa spatio-temporal extension of Image QA 40 (a) Extended end-to-end memory network (b) Extended simple VQA model (c) Extended temporal attention model (d) Extended sequence- to-sequence model 14/08/2021 Zeng, Kuo-Hao, et al. "Leveraging video descriptions to learn video question answering." AAAI’17.
  • 92.
    Spatio-temporal cross-modality alignment 41 Key ideas: •Explore the correlation between vision and language via attention mechanisms. • Joint representations are query-driven spatio-temporal features of a given videos. 14/08/2021 Zhao, Zhou, et al. "Video question answering via hierarchical dual-level attention network learning." ACL’17.
  • 93.
    Memory-based Video QA 42 GeneralDynamic Memory Network (DMN) Co-memory attention networks for Video QA Key ideas: • DMN refines attention over a set of facts to extract reasoning clues. • Motion and appearance features are complementary clues for question answering. 14/08/2021 Gao, Jiyang, et al. "Motion-appearance co-memory networks for video question answering." CVPR’18.
  • 94.
    Memory-based Video QA 43 Heterogeneousvideo memory for Video QA Key differences: • Learning a joint representation of multimodal inputs at each memory read/write step. • Utilizing external question memory to model context-dependent question words. 14/08/2021 Fan, Chenyou, et al. "Heterogeneous memory enhanced multimodal attention model for video question answering." CVPR’19.
  • 95.
    Multimodal reasoning unitsfor Video QA 44 • CRN: Conditional Relation Networks. • Inputs: • Frame-based appearance features • Motion features • Query features • Outputs: • Joint representations encoding temporal relations, motion, query. . 14/08/2021 Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering.“ CVPR’20
  • 96.
    Object-oriented spatio-temporal reasoningfor Video QA 45 • OSTR: Object-oriented Spatio-Temporal Reasoning. • Inputs: • Object lives tracked through time. • Context (motion). • Query features. • Outputs: • Joint representations encoding temporal relations, motion, query. . 14/08/2021 Dang, Long Hoang, et al. "Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering." IJCAI’21
  • 97.
Video QA as a down-stream task of video-language pre-training 46 VideoBERT Apr., 2019 HowTo100M Jun., 2019 MIL-NCE Dec., 2019 UniViLM Feb., 2020 HERO May, 2020 ClipBERT Feb., 2021 14/08/2021
  • 98.
    VideoBERT: a jointmodel for video and language representation learning 47 • Data for training: Sample videos and texts from YouCook II. Instructions in text given by ASR toolkit Subsampled video segments Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19. 14/08/2021
  • 99.
    VideoBERT: a jointmodel for video and language representation learning 48 Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19. • Linguistic representations: • Tokenized texts into WordPieces, similar as BERT. • Visual representations: • S3D features for each segmented video clips. • Tokenized into clusters using hierarchical k-means. Pre-training 14/08/2021
  • 100.
    VideoBERT: a jointmodel for video and language representation learning 49 Pre-training Down-stream tasks Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19. Video captioning Video question answering Zero-shot action classification 14/08/2021
  • 101.
    CLIPBERT: video languagepre-training with sparse sampling 50 Lei, Jie, et al. "Less is more: Clipbert for video-and-language learning via sparse sampling." CVPR’21. ClipBERT Prev. methods ClipBERT overview Procedure: • Pretraining on large-scale image-text datasets. • Finetuning on video-text tasks. 14/08/2021
  • 102.
    From short-form VideoQA to Movie QA 51 Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18. Long-term temporal relationships Multimodal inputs 14/08/2021
  • 103.
    Conventional methods forMovie QA 52 Question-driven multi-stream models: • Short-term temporal relationships are less important. • Long-term temporal relationships and multimodal interactions are key. • Language is dominant over visual counterpart. Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering.“ IJCV’21. Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18. 14/08/2021
  • 104.
    HERO: large-scale pre-trainingfor Movie QA 53 Li, Linjie, et al. "Hero: Hierarchical encoder for video+ language omni-representation pre-training." EMNLP’20. • Pre-trained on 7.6M videos and associated subtitles. • Achieved state-of- the-art results on all datasets. 14/08/2021
  • 105.
    End of partB 14/08/2021 54 https://bit.ly/37DYQn7
  • 106.
    From Deep Learningto Deep Reasoning 14/08/2021 1 Tutorial at KDD, August 14th 2021 Truyen Tran, Vuong Le, Hung Le and Thao Le {truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au https://bit.ly/37DYQn7 Part C: Memory | Data efficiency | Recursive reasoning
  • 107.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 2
  • 108.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels: • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 3
  • 109.
  • 110.
    Memory is partof intelligence • Memory is the ability to store, retain and recall information • Brain memory stores items, events and high- level structures • Computer memory stores data and temporary variables 5
  • 111.
Memory-reasoning analogy 6 • 2 processes: fast-slow o Memory: familiarity-recollection • Cognitive test: o Corresponding reasoning and memorization performance o Increasing the number of premises affects inductive/deductive reasoning Heit, Evan, and Brett K. Hayes. "Predicting reasoning from memory." Journal of Experimental Psychology: General 140, no. 1 (2011): 76.
  • 112.
    Common memory activities •Encode: write information to the memory, often requiring compression capability • Retain: keep the information overtime. This is often assumed in machinery memory • Retrieve: read information from the memory to solve the task at hand Encode Retain Retrieve 7
  • 113.
    Memory taxonomy basedon memory content 8 Item Memory • Objects, events, items, variables, entities Relational Memory • Relationships, structures, graphs Program Memory • Programs, functions, procedures, how-to knowledge
  • 114.
    Item memory Associative memory RAM-likememory Independent memory 9
  • 115.
Distributed item memory as associative memory 10 (Figure: everyday recall examples – language: ""Green" means "go," but what does "red" mean?"; time: "birthday party on 30th Jan"; object: "Where is my pen?"; behaviour: "What is the password?" – spanning semantic, episodic, working and motor memory.) 10
  • 116.
Associative memory can be implemented as a Hopfield network 11 (Figure: a correlation matrix memory encodes items and retrieves them in a single feed-forward pass; a Hopfield network retrieves recurrently. The memory matrix M acts as a "fast weight".)
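An illustrative numpy example of the two retrieval schemes: a correlation-matrix (outer-product) memory read in one feed-forward pass versus recurrent, Hopfield-style clean-up of a corrupted cue. The pattern dimension and corruption level are arbitrary, and the construction is generic rather than the slide's exact one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 64, 3
items = np.sign(rng.normal(size=(n_items, d)))     # bipolar patterns to store

# Encode: sum of outer products -- the "fast weight" matrix M.
M = sum(np.outer(p, p) for p in items) / d

# Retrieve from a corrupted cue.
cue = items[1].copy()
cue[: d // 4] *= -1                                 # flip 25% of the bits

x = cue
for _ in range(5):                                  # recurrent (Hopfield-style) retrieval
    x = np.sign(M @ x)

print("feed-forward overlap:", float(items[1] @ np.sign(M @ cue)) / d)
print("recurrent overlap:   ", float(items[1] @ x) / d)
```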
  • 117.
Rule-based reasoning with associative memory • Encode a set of rules: "pre-conditions → post-conditions" • Supports variable binding, rule-conflict handling and partial rule input • Example of encoding the rule "A:1, B:3, C:4 → X" 12 Outer product for binding Austin, Jim. "Distributed associative memories for high-speed symbolic reasoning." Fuzzy Sets and Systems 82, no. 2 (1996): 223-233.
  • 118.
    Memory-augmented neural networks: computation-storageseparation 13 RNN Symposium 2016: Alex Graves - Differentiable Neural Computer RAM
  • 119.
Neural Turing Machine (NTM) • Memory is a 2D matrix • Controller is a neural network • The controller reads/writes to memory at certain addresses. • Trained end-to-end, differentiable • Simulates a Turing machine → supports symbolic reasoning and algorithm solving 14 Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).
  • 120.
Addressing mechanism in NTM (Figure: given an erase vector e_t and an add vector a_t from the controller, the attended addresses are used for memory writing and memory reading.)
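For reference, the standard write-then-read update from Graves et al. (2014), with attention weights $w_t(i)$ over memory rows $M_t(i)$, erase vector $e_t$ and add vector $a_t$:

$$\tilde{M}_t(i) = M_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\,e_t\big], \qquad M_t(i) = \tilde{M}_t(i) + w_t(i)\,a_t, \qquad r_t = \sum_i w_t(i)\,M_t(i).$$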
  • 121.
  • 122.
    Optimal memory writingfor memorization • Simple finding: writing too often deteriorates memory content (not retainable) • Given input sequence of length T and only D writes, when should we write to the memory? 17 Le, Hung, Truyen Tran, and Svetha Venkatesh. "Learning to Remember More with Less Memorization." In International Conference on Learning Representations. 2018. Uniform writing is optimal for memorization
  • 123.
    Better memorization meansbetter algorithmic reasoning 18 T=50, D=5 Regular Uniform (cached)
  • 124.
Memory of independent entities • Each slot stores one or several entities • Memory writing is done separately for each memory slot → each slot maintains the life of one or more entities • The memory is a set of N parallel RNNs 19 (Figure: example bAbI-style trace in which slots track the changing states of entities such as John and Apple – Office, Kitchen, etc. – over time; RNN 1, RNN 2, …) Weston, Jason, Bordes, Antoine, Chopra, Sumit, and Mikolov, Tomas. Towards ai-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015.
  • 125.
    Recurrent entity network 20 Garden Henaff,Mikael, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. "Tracking the world state with recurrent entity networks." In 5th International Conference on Learning Representations, ICLR 2017. 2017.
  • 126.
    Recurrent Independent Mechanisms 21 Goyal,Anirudh, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. "Recurrent independent mechanisms.“ ICLR21.
  • 127.
  • 128.
  • 129.
Why relational memory? Item memory is weak at recognizing relationships Item memory: • Stores and retrieves individual items • Relates pairs of items within the same time step • Fails to relate temporally distant items 24
  • 130.
Dual process in memory 25 • Item memory: stores items; simple, low-order; System 1. • Relational memory: stores relationships between items; complicated, high-order; System 2. Howard Eichenbaum, Memory, amnesia, and the hippocampal system (MIT press, 1993). Alex Konkel and Neal J Cohen, "Relational memory and the hippocampus: representations and methods", Frontiers in neuroscience 3 (2009).
  • 131.
    Memory as graph •Memory is a static graph with fixed nodes and edges • Relationship is somehow known • Each memory node stores the state of the graph’s node • Write to node via message passing • Read from node via MLP 26 Palm, Rasmus Berg, Ulrich Paquet, and Ole Winther. "Recurrent Relational Networks." In NeurIPS. 2018.
  • 132.
bAbI 27 (Figure: on bAbI, facts 1–3 and the question form graph nodes with edges between them and the answer is read off the graph; on CLEVR, nodes carry colour, shape and position while edges carry distance.)
  • 133.
Memory of graphs: access conditioned on the query • Encode multiple graphs; each graph is stored in a set of memory rows • For each graph, the controller reads/writes to the memory: • Reads use content-based attention • Writes use message passing • Aggregate read vectors from all graphs to create the output 28 Pham, Trang, Truyen Tran, and Svetha Venkatesh. "Relational dynamic memory networks." arXiv preprint arXiv:1808.04247 (2018).
  • 134.
Capturing relationships can be done via memory-slot interactions using attention • Graph memory needs customization to an explicit design of nodes and edges • Can we automatically learn structure with a 2D tensor memory? • Capture relationships: each slot interacts with all other slots (self-attention) 29 Santoro, Adam, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. "Relational recurrent neural networks." In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7310-7321. 2018.
  • 135.
    Relational Memory Core(RMC) operation 30 RNN-like Interface
  • 136.
31 Allowing pair-wise interactions can answer questions about temporal relationships
  • 137.
Dot-product attention works for simple relationships, but … 32 (Figure: scalar attention scores, e.g. 0.7, 0.9, −0.1, 0.4, can answer "What is most similar to me?" but not "What is most similar to me but different from the tiger?") For hard relationships, a scalar representation is limited
  • 138.
Complicated relationships need high-order relational memory 33 Extract items into an item memory, then associate every pair of them into a 3D relational tensor – the relational memory. Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive associative memory." In International Conference on Machine Learning, pp. 5682-5691. PMLR, 2020.
  • 139.
  • 140.
Predefining programs for subtasks • A program designed for a task becomes a module • Parse a question into a module layout (order of program execution) • Learn the weights of each module to master the task 35 Andreas, Jacob, Marcus Rohrbach, Trevor Darrell, and Dan Klein. "Neural module networks." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 39-48. 2016.
  • 141.
Program selection is based on the parser; the rest is trained end-to-end 36 (Figure: 5 module templates, selected by parsing the question.)
  • 142.
The most powerful memory is one that stores both program and data • Computer architectures: Universal Turing Machine / Harvard / von Neumann • Stored-program principle • Break a big task into subtasks, each handled by a TM / single-purpose program stored in a program memory 37 https://en.wikipedia.org/
  • 143.
NUTM: Learn to select a program (neural weights) via program attention • A neural stored-program memory (NSM) stores keys (the addresses) and values (the weights) • The weight is selected and loaded into the controller of the NTM • The stored NTM weights and the weights of the NUTM are learnt end-to-end by backpropagation 38 Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory." In International Conference on Learning Representations. 2019.
  • 144.
Scaling with a memory of mini-programs • Previously, 1 program = 1 neural network (millions of parameters) • Parameter inefficiency, since the programs do not share common parameters • Solution: store sharable mini-programs to compose an infinite number of programs 39 This is analogous to building Lego structures corresponding to inputs from basic Lego bricks.
  • 145.
    Recurrent program attentionto retrieve singular components of a program 40 Le, Hung, and Svetha Venkatesh. "Neurocoder: Learning General-Purpose Computation Using Stored Neural Programs." arXiv preprint arXiv:2009.11443 (2020).
  • 146.
41 Program attention is equivalent to binary decision-tree reasoning. Recurrent program attention automatically detects task boundaries.
  • 147.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels: • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 42
  • 148.
Data Augmentation with Analogical and Counterfactual Examples 43 • Poor generalization when training under the independent and identically distributed (i.i.d.) assumption. • Intuition: augmenting counterfactual samples allows machines to understand the critical changes in the input that lead to changes in the answer space. • Perceptually similar, yet semantically dissimilar, realistic samples Visual counterfactual example Language counterfactual examples Gokhale, Tejas, et al. "Mutant: A training paradigm for out-of-distribution generalization in visual question answering." EMNLP'20.
  • 149.
Question Generation 44 Li, Yikang, et al. "Visual question generation as dual task of visual question answering." CVPR'18. Krishna, Ranjay, Michael Bernstein, and Li Fei-Fei. "Information maximizing visual question generation." CVPR'19. • Question answering is a zero-shot learning problem. Question generation helps cover a wider range of concepts. • Question generation can be done with either supervised or unsupervised learning.
  • 150.
BERT: Transformer That Predicts Its Own Masked Parts 46 BERT is like parallel approximate pseudo-likelihood • ~ Maximizing the conditional likelihood of some variables given the rest. • When the number of variables is large, this converges to the MLE (maximum likelihood estimate). [Slide credit: Truyen Tran] https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
  • 151.
    Visual QA asa Down-stream Task of Visual- Language BERT Pre-trained Models 47 Numerous pre-trained visual language models during 2019-2021. VisualBERT (Li, Liunian Harold, et al., 2019) VL-BERT (Su, Weijie, et al., 2019) UNITER (Chen, Yen-Chun, et al., 2019) 12-in-1 (Lu, Jiasen, et al., 2020) Pixel-BERT (Huang, Zhicheng, et al., 2019) OSCAR (Li, Xiujun, et al., 2020) Single-stream model Two-stream model ViLBERT (Lu, Jiasen, et al. , 2019) LXMERT (Tan, Hao, and Mohit Bansal, 2019) [Slide credit: Licheng Yu et al.]
  • 152.
    Learning with ExternalKnowledge 48 Why external knowledge for reasoning? • Questions can be beyond visual recognition (e.g. firetrucks usually use a fire hydrant). • Human’s prior knowledge for cognition-level reasoning (e.g. human’s goals, intents etc.) Q: What sort of vehicle uses this item? A: firetruck Q: What is the sports position of the man in the orange shirt? A: goalie/goalkeeper Marino, Kenneth, et al. "Ok-vqa: A visual question answering benchmark requiring external knowledge." CVPR’19. Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." CVPR’19.
  • 153.
    Learning with ExternalKnowledge 49 Retrieved by Wikipedia search API Marino, Kenneth, et al. "Ok-vqa: A visual question answering benchmark requiring external knowledge." CVPR’19. Shah, Sanket, et al. "Kvqa: Knowledge-aware visual question answering." AAAI’19.
  • 154.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels: • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 50
  • 155.
Source: religious studies project Core AI faculty: Theory of mind
  • 156.
    Where would ToMfit in? System 1: Intuitive System 1: Intuitive System 1: Intuitive • Fast • Implicit/automatic • Pattern recognition • Multiple System 2: Analytical • Slow • Deliberate/rational • Careful analysis • Single, sequential Single Image credit: VectorStock | Wikimedia Perception Theory of mind Recursive reasoning Facts Semantics Events and relations Working space Memory
  • 157.
    Contextualized recursive reasoning •Thus far, QA tasks are straightforward and objective: • Questioner: I will ask about what I don’t know. • Answerer: I will answer what I know. • Real life can be tricky, more subjective: • Questioner: I will ask only questions I think they can answer. • Answerer 1: This is what I think they want from an answer. • Answerer 2: I will answer only what I think they think I can. 14/08/2021 53  We need Theory of Mind to function socially.
  • 158.
Social dilemma: Stag Hunt games • Difficult decision: individual outcomes (selfish) or group outcomes (cooperative). • Together hunt Stag (both are cooperative): both have more meat. • Solely hunt Hare (both are selfish): both have less meat. • One hunts Stag (cooperative), the other hunts Hare (selfish): only the one hunting hare has meat. • Human evidence: self-interested but considerate of others (cultures vary). • Idea: belief-based guilt-aversion • One experiences loss if it lets the other down. • Necessitates Theory of Mind: reasoning about the other's mind.
  • 159.
Theory of Mind Agent with Guilt Aversion (ToMAGA) Update Theory of Mind • Predict whether the other's behaviour is cooperative or uncooperative • Update the zero-order belief (what the other will do) • Update the first-order belief (what the other thinks about me) Guilt Aversion • Compute the expected material reward of the other based on Theory of Mind • Compute the psychological rewards, i.e. "feeling guilty" • Reward shaping: subtract the expected loss of the other. Nguyen, Dung, et al. "Theory of Mind with Guilt Aversion Facilitates Cooperative Reinforcement Learning." Asian Conference on Machine Learning. PMLR, 2020. [Slide credit: Dung Nguyen]
  • 160.
    Machine Theory ofMind Architecture (inside the Observer) Successor representations next-step action probability goal Rabinowitz, Neil, et al. "Machine theory of mind." International conference on machine learning. PMLR, 2018. [Slide credit: Dung Nguyen]
  • 161.
    A ToM architecture • Observermaintains memory of previous episodes of the agent. • It theorizes the “traits” of the agent. • Implemented as Hyper Networks. • Given the current episode, the observer tries to infer goal, intention, action, etc of the agent. • Implemented as memory retrieval through attention mechanisms. 14/08/2021 57
  • 162.
  • 163.
    Wrapping up • Reasoningas the next challenge for deep neural networks • Part A: Learning-to-reason framework • Reasoning as a prediction skill that can be learnt from data • Dynamic neural networks are capable • Combinatorics reasoning • Part B: Reasoning over unstructured and structured data • Reasoning over unstructured sets • Relational reasoning over structured data • Part C: Memory | Data efficiency | Recursive reasoning • Memories of items, relations and programs • Learning with less labels • Theory of mind 14/08/2021 59
  • 164.
    A possible frameworkfor learning and reasoning with deep neural networks System 1: Intuitive System 1: Intuitive System 1: Intuitive • Fast • Implicit/automatic • Pattern recognition • Multiple System 2: Analytical • Slow • Deliberate/rational • Careful analysis • Single, sequential Single Image credit: VectorStock | Wikimedia Perception Theory of mind Recursive reasoning Facts Semantics Events and relations Working space Memory
  • 165.