From Deep Learning to Deep Reasoning
14/08/2021 1
Tutorial at KDD, August 14th 2021
Truyen Tran, Vuong Le, Hung Le and Thao Le
{truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au
https://bit.ly/37DYQn7
Part A: Learning to reason
Logistics
14/08/2021 2
Truyen Tran Vuong Le Hung Le Thao Le
https://bit.ly/37DYQn7
Agenda
• Introduction
• Part A: Learning-to-reason framework
• Part B: Reasoning over unstructured and structured data
• Part C: Memory | Data efficiency | Recursive reasoning
14/08/2021 3
2012
2016
AusDM 2016
Turing Awards 2018
GPT-3 2020
DL: 8 years snapshot
DL has been fantastic, but …
• It is great at interpolating
•  data hungry to cover all variations and smooth local manifolds
•  little systematic generalization (novel combinations)
• Lack of human-perceived reasoning capability
• Lack natural mechanism to incorporate prior knowledge, e.g., common sense
• No built-in causal mechanisms
•  Have trust issues!
• To be fair, many of these problems are common in statistical learning!
14/08/2021 5
Why still DL in 2021?
Theoretical
Expressiveness: Neural
nets can approximate any
function.
Learnability: Neural nets
are trained easily.
Generalisability: Neural
nets generalize surprisingly
well to unseen data.
Practical
Generality: Applicable to
many domains.
Competitive: DL is hard to
beat as long as there are
data to train.
Scalability: DL is better with
more data, and it is very
scalable.
The next AI/ML challenge
2020s-2030s
 Learning + reasoning, general
purpose, human-like
 Has contextual and common-
sense reasoning
 Requires less data
 Adapt to change
 Explainable
Photo credit: DARPA
Toward deeper reasoning
System 1:
Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2:
Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
Image credit: VectorStock | Wikimedia
Perception
Theory of mind
Recursive reasoning
Facts
Semantics
Events and relations
Working space
Memory
System 2
• Holds hypothetical thought
• Decoupling from representation
• Working memory size is not essential.
Its attentional control is.
14/08/2021 9
Figure credit: Jonathan Hui
Reasoning in Probabilistic Graphical Models (PGM)
• Assuming models are fully specified
(e.g., by hand or learnt)
• Estimate MAP as energy
minimization
• Compute marginal probability
• Compute expectation &
normalisation constant
• Key algorithm: Pearl’s Belief
Propagation, a.k.a Sum-Product
algorithm in factor graphs.
• Known result in 2001-2003: BP
minimises the Bethe free energy.
14/08/2021 10
Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free
energy." Advances in neural information processing systems. 2003.
Can we learn to infer directly from data
without full specification of models?
14/08/2021 11
Agenda
• Introduction
• Part A: Learning-to-reason framework
• Part B: Reasoning over unstructured and structured data
• Part C: Memory | Data efficiency | Recursive reasoning
14/08/2021 12
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution
• Compositional attention networks.
• Neural module networks.
• Combinatorics reasoning
14/08/2021 13
Learning to reason
• Learning is to self-improve by experiencing ~
acquiring knowledge & skills
• Reasoning is to deduce knowledge from
previously acquired knowledge in response to a
query (or cues)
• Learning to reason is to improve the ability to
decide if a knowledge base entails a predicate.
• E.g., given a video f, determine if the person with the
hat turns before singing.
• Hypotheses:
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
14/08/2021 14
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM
(JACM) 44.5 (1997): 697-725.
(Dan Roth; ACM Fellow; IJCAI
John McCarthy Award)
Learning to reason, a definition
14/08/2021 15
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM
(JACM) 44.5 (1997): 697-725.
E.g., given a video f, determine if the person with the
hat turns before singing.
Practical setting: (query,database,answer) triplets
• This is very general:
• Classification: Query = what is this? Database = data.
• Regression: Query = how much? Database = data.
• QA: Query = NLP question. Database = context/image/text.
• Multi-task learning: Query = task ID. Database = data.
• Zero-shot learning: Query = task description. Database = data.
• Drug-protein binding: Query = drug. Database = protein.
• Recommender system: Query = User (or item). Database =
inventories (or user base);
14/08/2021 16
Can neural networks reason?
Reasoning is not necessarily
achieved by making logical
inferences
There is a continuity between
[algebraically rich inference] and
[connecting together trainable
learning systems]
Central to reasoning is composition
rules to guide the combinations of
modules to address new tasks
14/08/2021 17
“When we observe a visual scene, when we
hear a complex sentence, we are able to
explain in formal terms the relation of the
objects in the scene, or the precise meaning
of the sentence components. However, there
is no evidence that such a formal analysis
necessarily takes place: we see a scene, we
hear a sentence, and we just know what they
mean. This suggests the existence of a
middle layer, already a form of reasoning, but
not yet formal or logical.”
Bottou, Léon. "From machine learning to machine
reasoning." Machine learning 94.2 (2014): 133-149.
Hypotheses
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
• Reasoning is recursive, e.g., mental travel.
14/08/2021 18
Two approaches to neural reasoning
• Implicit chaining of predicates through recurrence:
• Step-wise query-specific attention to relevant concepts & relations.
• Iterative concept refinement & combination, e.g., through a working
memory.
• Answer is computed from the last memory state & question embedding.
• Explicit program synthesis:
• There is a set of modules, each performing a pre-defined operation.
• The question is parsed into a symbolic program.
• The program is implemented as a computational graph constructed by
chaining separate modules.
• The program is executed to compute an answer.
14/08/2021 19
In search for basic neural operators for reasoning
• Basics:
• Neuron as feature detector  Sensor, filter
• Computational graph  Circuit
• Skip-connection  Short circuit
• Essentials
• Multiplicative gates  AND gate, Transistor,
Resistor
• Attention mechanism  SWITCH gate
• Memory + forgetting  Capacitor + leakage
• Compositionality  Modular design
• ..
14/08/2021 20
Photo credit: Nicola Asuni
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
• Combinatorics reasoning
14/08/2021 21
Concept-object binding
• Perceived data (e.g., visual objects) may not share the same semantic space
with high-level concepts.
• Binding between concepts and objects enables reasoning at the concept level
14/08/2021 22
Example of concept-object binding in LOGNet (Le et al, IJCAI’2020)
More reading: Greff, Klaus, Sjoerd van Steenkiste, and Jürgen Schmidhuber. "On the
binding problem in artificial neural networks." arXiv preprint arXiv:2012.05208 (2020).
Attentions: Picking up only what is needed at a step
• Need attention model to select or ignore
certain computations or inputs
• Can be “soft” (differentiable) or “hard”
(requires RL)
• Needed for selecting predicates in
reasoning.
• Attention provides a short-cut  long-
term dependencies
• Needed for long chain of reasoning.
• Also encourages sparsity if done right!
http://distill.pub/2016/augmented-rnns/
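To make the mechanism concrete, a minimal sketch of soft (differentiable) dot-product attention over a set of candidate inputs; dimensions and variable names are illustrative only.

```python
# Minimal soft dot-product attention over a set of inputs (illustrative sketch).
import torch

def soft_attention(query, items):
    """query: [d], items: [n, d] -> attended summary [d] and weights [n]."""
    scores = items @ query                  # relevance of each item to the query
    weights = torch.softmax(scores, dim=0)  # differentiable "soft" selection
    summary = weights @ items               # weighted combination of the items
    return summary, weights

items = torch.randn(5, 16)   # e.g. candidate predicates / objects
query = torch.randn(16)      # e.g. the current reasoning step
summary, weights = soft_attention(query, items)
print(weights)               # peaked weights pick up only what is needed
```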
Fast weights | HyperNet – the multiplicative interaction
• Early ideas in early 1990s by Juergen Schmidhuber and
collaborators.
• Data-dependent weights | Using a controller to generate weights of
the main net.
14/08/2021 24
Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
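A minimal sketch of the fast-weight / hypernetwork idea: a small controller generates the weights of the main layer from a conditioning input (data-dependent weights). This is an illustration of the multiplicative interaction, not the exact architecture of Ha et al. (2016).

```python
# Sketch of a hypernetwork: a controller generates the weights of the main layer
# from a conditioning input ("fast weights"). Illustrative only.
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    def __init__(self, d_in, d_out, d_ctx):
        super().__init__()
        # The controller maps the context to a full weight matrix and bias.
        self.weight_gen = nn.Linear(d_ctx, d_in * d_out)
        self.bias_gen = nn.Linear(d_ctx, d_out)
        self.d_in, self.d_out = d_in, d_out

    def forward(self, x, ctx):
        W = self.weight_gen(ctx).view(self.d_out, self.d_in)  # generated weights
        b = self.bias_gen(ctx)
        return x @ W.t() + b   # multiplicative interaction between x and ctx

layer = HyperLinear(d_in=8, d_out=4, d_ctx=6)
x, ctx = torch.randn(8), torch.randn(6)   # ctx could be a question embedding
print(layer(x, ctx).shape)                # torch.Size([4])
```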
Memory networks: Holding the data ready for inference
• Input is a set  Load into
memory, which is NOT updated.
• State is an RNN with attention
reading from inputs
• Concepts: Query, key and
content + Content addressing.
• Deep models, but constant path
length from input to output.
• Equivalent to an RNN with a shared
input set.
14/08/2021 25
Sukhbaatar, Sainbayar, Jason Weston, and Rob
Fergus. "End-to-end memory networks." Advances in
neural information processing systems. 2015.
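A simplified sketch of the read loop described above, in the spirit of end-to-end memory networks but not the exact MemN2N parameterization: the input set is loaded into a fixed memory and only a single controller state is updated over a few hops.

```python
# Simplified multi-hop memory-network read: the memory is a fixed set (never
# written to), only the controller state is updated. Illustrative sketch.
import torch
import torch.nn as nn

class MemoryHops(nn.Module):
    def __init__(self, d, hops=3):
        super().__init__()
        self.hops = hops
        self.update = nn.Linear(2 * d, d)

    def forward(self, question, memory):
        """question: [d], memory: [n, d]."""
        state = question
        for _ in range(self.hops):
            attn = torch.softmax(memory @ state, dim=0)   # content addressing
            read = attn @ memory                          # attended read vector
            state = torch.tanh(self.update(torch.cat([state, read])))
        return state                                      # feeds an answer head

facts = torch.randn(10, 32)        # encoded sentences / objects
q = torch.randn(32)
answer_repr = MemoryHops(32)(q, facts)
```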
Transformers: Analogical reasoning through self-
attention
14/08/2021 26
Tay, Yi, et al. "Efficient transformers: A survey." arXiv
preprint arXiv:2009.06732 (2020).
(Figure: self-attention viewed as memory access, with State, Key, Query and Memory.)
Transformer as implicit reasoning
• Recall: Reasoning as (free-) energy minimisation
• The classic Belief Propagation algorithm is minimization algorithm
of the Bethe free-energy!
• The Transformer's relational, iterative state refinement makes it
a great candidate for implicit relational reasoning.
14/08/2021 27
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint
arXiv:2008.02217 (2020).
Transformer vs. memory networks
• Memory network:
• Attention to input set
• One hidden state update at a time.
• Final state integrates information of the set, conditioned on the query.
• Transformer:
• Loading all inputs into working memory
• Assigns one hidden state per input element.
• All hidden states (including those from the query) are used to compute the answer.
14/08/2021 28
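To contrast with the memory-network read above, a minimal single-head self-attention step: every element (context tokens and query tokens alike) keeps its own hidden state, and all states are refined jointly. Multi-head projections, layer norm and feed-forward sublayers are omitted in this sketch.

```python
# Minimal single-head self-attention step: every element keeps its own hidden
# state and all states attend to all others. Illustrative simplification.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** 0.5

    def forward(self, states):                            # states: [n, d]
        Q, K, V = self.q(states), self.k(states), self.v(states)
        attn = torch.softmax(Q @ K.t() / self.scale, dim=-1)  # [n, n] relations
        return states + attn @ V                          # all n states refined at once

tokens = torch.randn(12, 32)     # question tokens and context elements together
refined = SelfAttention(32)(tokens)
```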
Universal transformers
14/08/2021 29
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Dehghani, Mostafa, et al. "Universal
Transformers." International Conference on
Learning Representations. 2018.
Dynamic neural networks
• Memory-Augmented Neural Networks
• Modular program layout
• Program synthesis
14/08/2021 30
Neural Turing machine (NTM)
A memory-augmented neural network (MANN)
• A controller that takes
input/output and talks to an
external memory module.
• Memory has read/write
operations.
• The main issue is where to
write, and how to update the
memory state.
• All operations are
differentiable.
Source: rylanschaeffer.github.io
MANN for reasoning
• Three steps:
• Store data into memory
• Read query, process sequentially, consult memory
• Output answer
• Behind the scene:
• Memory contains data & results of intermediate steps
• LOGNet does the same, memory consists of object
representations
• Drawbacks of current MANNs:
• No memory of controllers  Less modularity and
compositionality when query is complex
• No memory of relations  Much harder to chain predicates.
14/08/2021 32
Source: rylanschaeffer.github.io
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven
program synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
• Combinatorics reasoning
14/08/2021 33
MAC Net: Recurrent,
iterative representation
refinement
14/08/2021 34
Hudson, Drew A., and Christopher D. Manning. "Compositional attention
networks for machine reasoning." ICLR 2018.
Module networks
(reasoning by constructing and executing neural programs)
• Reasoning as laying
out modules to reach
an answer
• Composable neural
architecture 
question parsed as
program (layout of
modules)
• A module is a function
(x  y), which could be a
sub-reasoning process
((x, q)  y).
14/08/2021 35
https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
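An illustrative sketch of the module-network idea: the question is parsed into a layout of module names, and the corresponding neural modules are chained into a computation graph that is executed to produce the answer. The module names and the toy layout here are hypothetical, not those of any specific neural module network.

```python
# Illustrative sketch of a neural module network: a parsed layout of module
# names is executed by chaining small neural modules. Names are hypothetical.
import torch
import torch.nn as nn

class NeuralModule(nn.Module):
    """A reusable step: (features, question) -> features."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Linear(2 * d, d)
    def forward(self, x, q):
        return torch.relu(self.net(torch.cat([x, q], dim=-1)))

modules = nn.ModuleDict({name: NeuralModule(32) for name in ["find", "relate", "answer"]})

def execute(layout, image_feat, question_feat):
    x = image_feat
    for name in layout:                    # chain modules per the parsed program
        x = modules[name](x, question_feat)
    return x

layout = ["find", "relate", "answer"]      # e.g. produced by parsing the question
out = execute(layout, torch.randn(32), torch.randn(32))
```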
Putting things together:
A framework for visual
reasoning
14/08/2021 36
@Truyen Tran & Vuong Le, Deakin Uni
Part A: Sub-topics
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Concept-object binding.
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
• Combinatorics reasoning
14/08/2021 37
Implement combinatorial algorithms
with neural networks
38
(Figure: classical algorithms are generalizable but inflexible; neural networks handle noisy, high-dimensional inputs.)
Train neural processor P to imitate algorithm A
Processor P:
(a) aligned with the
computations of the target
algorithm;
(b) operates by matrix
multiplications, hence
natively admits useful
gradients;
(c) operates over high-
dimensional latent spaces
Veličković, Petar, and Charles Blundell. "Neural Algorithmic Reasoning." arXiv preprint arXiv:2105.02761 (2021).
Processor as RNN
• Does not assume knowledge of the
input structure; the input is treated as a
sequence
 not really reasonable, harder to
generalize
• RNN is Turing-complete
 can simulate any algorithm
• But it is not easy to learn the
simulation from (input, output)
data
Pointer network
39
Assumes O(N) memory
and O(N²) computation,
where N is the size of the input
Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks."
In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pp. 2692-2700. 2015.
Processor as MANN
• MANN simulates a neural
computer or Turing
machine  ideal for
implementing algorithms
• Sequential input, no
assumption on input
structure
• Assumes O(1) memory
and O(N) computation
40
Graves, A., Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
Sequential encoding of graphs
41
• Each node is associated with random one-hot
or binary features
• Output is the features of the solution
(Figure: a geometry instance is encoded as a sequence of [x, y, feature] tuples and a graph instance as a sequence of [node_feature, node_feature, edge] tuples; the output is the sequence of features of the solution. Example tasks: Convex Hull and TSP for geometry; Shortest Path and Minimum Spanning Tree for graphs.)
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive associative memory." In International Conference on Machine Learning, pp. 5682-5691. PMLR, 2020.
DNC: graph
reasoning
42
Graves, A., Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
NUTM: learning multiple algorithms at once
43
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory."
In International Conference on Learning Representations. 2019.
Processor as graph neural network (GNN)
44
https://petar-v.com/talks/Algo-WWW.pdf
Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell.
"Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019.
Motivation:
• Many algorithms operate on graphs
• Supervise graph neural networks with the algorithm's operations/steps/final output
• Encoder-Process-Decode framework, with the processor realised by attention or message passing (a minimal sketch follows below)
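A minimal sketch of the Encoder-Process-Decode pattern with message passing as the processor; the single message function, sum aggregation and toy graph are simplifications rather than the architecture of the cited papers.

```python
# Sketch of encode-process-decode: an encoder lifts raw features into a latent
# space, a message-passing processor is iterated like algorithm steps, and a
# decoder reads out the answer. Illustrative simplification.
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.msg = nn.Linear(2 * d, d)
        self.upd = nn.Linear(2 * d, d)

    def forward(self, h, edges):
        """h: [n, d] node states, edges: list of (src, dst) pairs."""
        agg = torch.zeros_like(h)
        for s, t in edges:                              # messages along each edge
            agg[t] = agg[t] + self.msg(torch.cat([h[s], h[t]]))
        return torch.relu(self.upd(torch.cat([h, agg], dim=-1)))

encoder, decoder = nn.Linear(4, 32), nn.Linear(32, 1)
processor = MessagePassingStep(32)

x = torch.randn(5, 4)                      # raw node features
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # toy graph
h = encoder(x)
for _ in range(3):                         # iterate like algorithm steps
    h = processor(h, edges)
y = decoder(h)                             # per-node prediction (e.g. a distance)
```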
Example: GNN for a specific problem (DNF counting)
• Count the #assignments that satisfy a disjunctive normal
form (DNF) formula
• Exact counting is #P-hard; the classical approximation algorithm runs in O(mn)
• m: #clauses, n: #variables
• Supervised training on output-level
45
Best: O(m+n)
Abboud, Ralph, Ismail Ceylan, and Thomas Lukasiewicz. "Learning to reason: Leveraging neural networks for approximate DNF counting.“
In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3097-3104. 2020.
Neural networks and algorithms alignment
46
Xu, Keyulu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. "What Can Neural Networks Reason About?" ICLR 2020.
https://petar-v.com/talks/Algo-WWW.pdf
Neural exhaustive
search
GNN is aligned with Dynamic
Programming (DP)
47
Neural exhaustive
search
If alignment exists  step-by-step supervision
48
Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. "Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019.
• Merely simulates the
classical graph algorithm;
generalizable
• No algorithm discovery
Joint training is
encouraged
Processor as Transformer
• Back to input sequence
(set), but stronger
generalization
• Transformer with encoder
mask ~ graph attention
• Use Transformer with:
• Binary representation of
numbers
• Dynamic conditional masking
49
Yan, Yujun, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi.
"Neural Execution Engines: Learning to Execute Subroutines." Advances in Neural Information Processing Systems 33 (2020).
Next step
Masked
encoding
Decoding
Mask
prediction
Training with execution trace
50
End of part A
14/08/2021 51
https://bit.ly/37DYQn7
From Deep Learning to Deep Reasoning
14/08/2021 1
Tutorial at KDD, August 14th 2021
Truyen Tran, Vuong Le, Hung Le and Thao Le
{truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au
https://bit.ly/37DYQn7
Part B: Reasoning over unstructured and structured data
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
2
14/08/2021
Learning to Reason formulation
• Input:
• A knowledge context C
• A query q
• Output: an answer a satisfying the query q given C
• C can be
• structured: knowledge graphs
• unstructured: text, image, sound, video
Q: Is it simply an optimization problem like recognition, detection or even translation?
 No, because the mapping from C and q to a is more complex than in other solved optimization problems
 We can solve (some parts of) it with good structures and inference strategies
Q: “What affects her mobility?”
14/08/2021 3
A case study: Image Question Answering
• Realization
• C: visual content of an image
• q: a linguistic question
• a: a linguistic phrase as
the answer to q regarding C
• Challenges
• Reasoning through facts and logics
• Cross-modality integration
14/08/2021 4
Image QA: Question types
14/08/2021 Slide credit: Thao Minh Le 5
Image QA datasets
14/08/2021 Slide credit: Thao Minh Le 6
The two main themes in Image QA
• Neuro-symbolic reasoning
• Parse the question into a “program” of small steps
• Learn the generic steps as neural modules
• Use and reuse the modules for different programs
• Compositional reasoning
• Extract visual and linguistic individual- and joint- representation
• Reasoning happens on the structure of the representation
• Sets/graphs/sequences
• The representation gets refined through multi-step compositional
reasoning
14/08/2021 7
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
8
14/08/2021
A simple approach
 Issue: This is very
susceptible to the nuances of
images and questions
14/08/2021 Agrawal et al., 2015, Slide credit: Thao Minh Le 9
Reasoning as set-set interaction
• O: a set of context objects
• Faster-RCNN regions
• CNN tubes
• L: a set of linguistic objects
- biLSTM embeddings of the question q
 Reasoning is formulated as the interaction between the two sets O and L
for the answer a
14/08/2021 10
Set operations
• Reducing operations (e.g., sum/average/max)
• Attention-based combination (Bahdanau et al. 2015)
• Attention weights as query-key dot product (Vaswani et al., 2017)
 Attention-based set ops seem very suitable for visual reasoning
14/08/2021 11
Attention-based reasoning
• Unidirectional attention
• Find relation scores between parts of the context C and the question
q:
Options for f:
• Hermann et al. (2015)
• Chen et al. (2016)
• Normalized by softmax into attention weights
• Attended context vector:
 We can now extract information from the context that is “relevant” to the query
14/08/2021 12
Bottom-up-top-down attention (Anderson et al 2017)
• Bottom-up set construction: Choosing Faster-RCNN regions with
high class scores
• Top-down attention: Attending on visual features by question
 Q: How about attention from vision objects to linguistic objects?
14/08/2021 13
Bi-directional attention
• Question-context similarity measure
• Question-guided context attention
• Softmax across columns
• Context-guided question attention
• Softmax across rows
 Q: Probably not working for image QA, where single words
do not have co-references with regions?
14/08/2021
Dynamic coattention networks for question answering (Seo et al., ICLR
2017) 14
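A minimal sketch of the bi-directional attention above: one region-word similarity matrix, softmaxed in the two directions to obtain question-guided context attention and context-guided question attention. The shapes and the direction convention are illustrative assumptions.

```python
# Minimal co-attention sketch: one similarity matrix, softmaxed in two
# directions to attend over context given the question and vice versa.
import torch

def co_attention(context, question):
    """context: [n, d] (e.g. regions), question: [m, d] (e.g. word embeddings)."""
    S = context @ question.t()                 # [n, m] similarity matrix
    ctx_attn = torch.softmax(S, dim=0)         # over context elements, per word
    q_attn = torch.softmax(S, dim=1)           # over question words, per region
    attended_ctx = ctx_attn.t() @ context      # [m, d] context summary per word
    attended_q = q_attn @ question             # [n, d] question summary per region
    return attended_ctx, attended_q

regions, words = torch.randn(36, 64), torch.randn(12, 64)
attended_ctx, attended_q = co_attention(regions, words)
```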
Hierarchical co-attention for ImageQA
• The co-attention is found on a word-phrase-sentence hierarchy
 better cross-domain co-references
 Q: Can this be done on text qa as well?
 Q: How about questions with many reasoning hops?
14/08/2021 15
Multi-step compositional reasoning
• Complex questions need multiple hops
of reasoning
• Relations inside the context are multi-
step themselves
• Single shot of attention won’t be
enough
• Single shot of information gathering is
definitely not enough
16
 Q: How to do multi-hop attentional reasoning?
14/08/2021 Figure: Hudson and Manning – ICLR 2018
Multi-step reasoning - Memory, Attention, and Composition (MAC
Nets)
• Attention reasoning is done through multiple sequential steps.
• Each step is done with a recurrent neural cell
• What are the key differences from a normal RNN (LSTM/GRU) cell?
• Not sequential input; it is sequential processing over a static input set.
• Guided by the question through a controller.
14/08/2021 MAC network, Hudson and Manning – ICLR 2018 17
Multi-step attentional reasoning
• At each step, the controller decides what to
look at next
• After each step, a piece of information is
gathered, represented through the
attention map on question words and
visual objects
• A common memory keeps all the
information extracted toward an answer
14/08/2021
MAC network, Hudson and Manning – ICLR 2018
18
Multi-step attentional reasoning
• Step 1: attend to the “tiny blue
block”, updating m1
• Step 2: look for “the sphere in
front”, updating m2
• Step 3: traverse from the cyan ball
to the final objective, the purple
cylinder
19
14/08/2021
Reasoning as set-set interaction – a look back
• O: a set of context objects
• L: a set of linguistic objects (from the question q)
• Reasoning is formulated as the
interaction between the two
sets O and L for the answer a
Q: What is the brown
animal sitting inside of?
 Q: Set-set interaction falls short for questions about relations between objects
14/08/2021 20
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
21
14/08/2021
Reasoning on Graphs
• Relational questions: requiring explicit reasoning about the
relations between multiple objects
14/08/2021 Figure credit: Santoro et al 2017 22
• Relation networks: RN(O) = f_φ( Σ_ij g_θ(o_i, o_j) )
• f_φ and g_θ are neural functions
• g_θ generates the “relation” between the two objects
• f_φ is the aggregation function
Relation networks (Santoro et al 2017)
 The relations here are implicit, complete, pair-wise – inefficient, and lack expressiveness
14/08/2021 23
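A minimal sketch of a Relation Network in the spirit of Santoro et al. (2017): g_θ produces a "relation" for every ordered pair of objects, conditioned on the question, and f_φ aggregates their sum. The hidden sizes and the answer head are illustrative.

```python
# Minimal Relation Network sketch: g produces a relation per object pair
# (conditioned on the question), f aggregates the summed pairwise relations.
import torch
import torch.nn as nn
from itertools import permutations

class RelationNetwork(nn.Module):
    def __init__(self, d_obj, d_q, d_hid=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * d_obj + d_q, d_hid), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU(), nn.Linear(d_hid, 1))

    def forward(self, objects, question):
        """objects: [n, d_obj], question: [d_q]."""
        pair_sum = sum(
            self.g(torch.cat([objects[i], objects[j], question]))
            for i, j in permutations(range(len(objects)), 2)
        )
        return self.f(pair_sum)        # RN(O) = f( sum_ij g(o_i, o_j, q) )

objs, q = torch.randn(6, 32), torch.randn(16)
score = RelationNetwork(32, 16)(objs, q)
```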
Reasoning with Graph convolution networks
• Input graph is built from image entities and question
• GCN is used to gather facts and produce answer
 The relations are now explicit
and pruned
 But the graph building is very
stiff:
- Unrecoverable if it makes a
mistake?
- Information gathered during reasoning is
not used to build the graphs
14/08/2021 Narasimhan et.al NIPS2018 24
Reasoning with Graph attention networks
• The graph is determined during reasoning process with
attention mechanism
The relations are now
adaptive and integrated
with reasoning
 Are the relations
singular and static?
14/08/2021 ReGAT model, Li et.al. ICCV19 25
Dynamic reasoning graphs
• On complex questions,
multiple sets of relations
are needed
• We need not only multi-
step but also multi-form
structures
• Let’s do multiple
dynamically–built graphs!
14/08/2021 LCGN, Hu et.al. ICCV19 26
Dynamic reasoning graphs
The questions so far act as an unstructured command in the process
Aren’t their structures and relations important too?
14/08/2021 LCGN, Hu et.al. ICCV19 27
Reasoning on cross-modality graphs
• Two types of nodes: Linguistic entities and visual objects
• Two types of edges:
• Visual
• Linguistic-visual binding (as a fuzzy grounding)
• Adaptively updated during reasoning
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 28
Language-binding Object Graph (LOG) Unit
• Graph constructor: build the dynamic vision graph
• Language binding constructor: find the dynamic L-V relations
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 29
LOGNet: multi-step visual-linguistic binding
• Object-centric representation 
• Multi-step/multi-structure compositional reasoning 
• Linguistic-vision detail interaction 
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 30
Dynamic language-vision graphs in
actions
14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 31
We got sets and graphs, how about sequences?
• Videos pose another challenge for visual reasoning: the
dynamics through time.
• Sets and graphs now become sequences of such.
• Temporal relations are the key factors
• The size of context is a core issue
14/08/2021 32
Agenda
• Cross-modality reasoning, the case of vision-language
integration.
• Reasoning as set-set interaction.
• Relational reasoning
• Temporal reasoning
• Video question answering.
33
14/08/2021
Overview
• Goals of this part of the tutorial
• Understanding Video QA as a complete testbed of
visual reasoning.
• Representative state-of-the-art approaches for
spatio-temporal reasoning.
34
14/08/2021
Video Question Answering
Short-form Video Question Answering
Movie Question Answering
35
14/08/2021
36
(Figure: Visual QA sits at the intersection of Computer Vision, Natural Language Processing, Machine Learning and Reasoning, touching topics such as object recognition, scene graphs, event detection, object discovery, parsing, symbol binding, learning to classify entailment, systematic generalization, unsupervised and reinforcement learning, program synthesis, action graphs, qualitative spatial reasoning, relational and temporal inference, and commonsense.)
14/08/2021 36
Challenges
37
• Difficulties in data annotation.
• Content for performing reasoning spreads over space-
time and multiple modalities (videos, subtitles, speech
etc.)
14/08/2021
Video QA Datasets
38
Movie QA
(Tapaswi, M., et al.,
2016)
MSRVTT-QA and
MSVD-QA
(Xu, D., et al., 2017)
TGIF-QA
(Jang, Y., et al.,
2017)
MarioQA
(Mun, J., et al.,
2017)
CLEVRER
(Yi, K., et al., 2019)
KnowIT VQA
(Garcia, N., et al.,
2020)
14/08/2021
Video QA datasets
39
(TGIF-QA, Jang et al., 2018) (CLEVRER, Yi, Kexin, et al., 2020)
14/08/2021
Video QA as a spatio-temporal
extension of Image QA
40
(a) Extended end-to-end
memory network
(b) Extended simple
VQA model
(c) Extended temporal
attention model
(d) Extended sequence-
to-sequence model
14/08/2021
Zeng, Kuo-Hao, et al. "Leveraging video descriptions to learn video question answering." AAAI’17.
Spatio-temporal cross-modality
alignment
41
Key ideas:
• Explore the correlation
between vision and
language via attention
mechanisms.
• Joint representations
are query-driven
spatio-temporal
features of a given
video.
14/08/2021 Zhao, Zhou, et al. "Video question answering via hierarchical dual-level attention network learning." ACL’17.
Memory-based Video QA
42
General Dynamic Memory Network (DMN)
Co-memory attention networks for Video QA
Key ideas:
• DMN refines attention over a set of
facts to extract reasoning clues.
• Motion and appearance features are
complementary clues for question
answering.
14/08/2021 Gao, Jiyang, et al. "Motion-appearance co-memory networks for video question answering." CVPR’18.
Memory-based Video QA
43
Heterogeneous video memory for Video QA
Key differences:
• Learning a joint representation of
multimodal inputs at each memory
read/write step.
• Utilizing external question memory
to model context-dependent
question words.
14/08/2021
Fan, Chenyou, et al. "Heterogeneous memory enhanced multimodal attention model for video question answering." CVPR’19.
Multimodal reasoning units for Video QA
44
• CRN: Conditional Relation
Networks.
• Inputs:
• Frame-based
appearance features
• Motion features
• Query features
• Outputs:
• Joint representations
encoding temporal
relations, motion, query.
14/08/2021 Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering.“ CVPR’20
Object-oriented spatio-temporal reasoning for
Video QA
45
• OSTR: Object-oriented
Spatio-Temporal Reasoning.
• Inputs:
• Object lives tracked
through time.
• Context (motion).
• Query features.
• Outputs:
• Joint representations
encoding temporal
relations, motion, query.
14/08/2021 Dang, Long Hoang, et al. "Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering." IJCAI’21
Video QA as a down-stream task of
video language pre-training
46
VideoBERT
Apr., 2019
HowTo100M
Jun., 2019
MIL-NCE
Dec., 2019
UniViLM
Feb., 2020
HERO
May, 2020
ClipBERT
Feb., 2021
14/08/2021
VideoBERT: a joint model for video
and language representation learning
47
• Data for training: Sample videos and texts from YouCook II.
Instructions in text given by ASR toolkit
Subsampled video segments
Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19.
14/08/2021
VideoBERT: a joint model for video
and language representation learning
48
Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19.
• Linguistic representations:
• Tokenized texts into
WordPieces, similar as
BERT.
• Visual representations:
• S3D features for each segmented
video clip.
• Tokenized into clusters using
hierarchical k-means.
Pre-training
14/08/2021
VideoBERT: a joint model for video
and language representation learning
49
Pre-training
Down-stream
tasks
Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19.
Video
captioning
Video question
answering
Zero-shot action
classification
14/08/2021
CLIPBERT: video language pre-training
with sparse sampling
50
Lei, Jie, et al. "Less is more: Clipbert for video-and-language learning via sparse sampling." CVPR’21.
ClipBERT
Prev. methods
ClipBERT overview
Procedure:
• Pretraining on large-scale image-text datasets.
• Finetuning on video-text tasks.
14/08/2021
From short-form Video QA to Movie QA
51
Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18.
Long-term temporal relationships
Multimodal inputs
14/08/2021
Conventional methods for Movie QA
52
Question-driven multi-stream
models:
• Short-term temporal relationships are
less important.
• Long-term temporal relationships and
multimodal interactions are key.
• Language is dominant over visual
counterpart.
Le, Thao Minh, et al. "Hierarchical conditional
relation networks for video question answering.“
IJCV’21.
Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18.
14/08/2021
HERO: large-scale pre-training for Movie QA
53
Li, Linjie, et al. "Hero: Hierarchical encoder for video+ language omni-representation pre-training." EMNLP’20.
• Pre-trained on 7.6M
videos and
associated subtitles.
• Achieved state-of-
the-art results on all
datasets.
14/08/2021
End of part B
14/08/2021 54
https://bit.ly/37DYQn7
From Deep Learning to Deep Reasoning
14/08/2021 1
Tutorial at KDD, August 14th 2021
Truyen Tran, Vuong Le, Hung Le and Thao Le
{truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au
https://bit.ly/37DYQn7
Part C: Memory | Data efficiency | Recursive reasoning
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
2
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels:
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
3
Introduction
4
Memory is part of intelligence
• Memory is the ability to
store, retain and recall
information
• Brain memory stores
items, events and high-
level structures
• Computer memory
stores data and
temporary variables
5
Memory-reasoning analogy
6
• 2 processes: fast-slow
o Memory: familiarity-
recollection
• Cognitive test:
o Corresponding reasoning and
memorization performance
o As the number of premises increases,
inductive/deductive
reasoning is affected
Heit, Evan, and Brett K. Hayes. "Predicting reasoning from memory." Journal of Experimental Psychology: General 140, no. 1 (2011): 76.
Common memory activities
• Encode: write information to
the memory, often requiring
compression capability
• Retain: keep the information
over time. This is often taken for granted
in machine memory
• Retrieve: read information from
the memory to solve the task at
hand
Encode
Retain
Retrieve
7
Memory taxonomy based on memory content
8
Item
Memory
• Objects, events, items,
variables, entities
Relational
Memory
• Relationships, structures,
graphs
Program
Memory
• Programs, functions,
procedures, how-to knowledge
Item memory
Associative memory
RAM-like memory
Independent memory
9
Distributed item memory as
associative memory
10
(Figure: everyday associative recall spans semantic, episodic, working and motor memory, covering language, time, objects and behaviour, e.g. "green" means "go," but what does "red" mean?; birthday party on 30th Jan; where is my pen?; what is the password?)
Associative memory can be implemented as a
Hopfield network
11
(Figure: correlation matrix memory vs. Hopfield network: both encode associations into an outer-product "fast-weight" matrix M; the former retrieves with a single feed-forward pass, the latter with recurrent retrieval.)
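A tiny numpy sketch of the correlation-matrix associative memory above: key-value associations are encoded as a sum of outer products (the fast-weight matrix M) and retrieved in a single feed-forward pass. The bipolar random patterns and sign cleanup are simplifications for illustration.

```python
# Correlation-matrix associative memory sketch: store key-value pairs as a sum
# of outer products ("fast weights" M), retrieve with one feed-forward pass.
import numpy as np

rng = np.random.default_rng(0)
d = 64
keys = rng.choice([-1.0, 1.0], size=(3, d))     # e.g. "green", "red", "pen"
values = rng.choice([-1.0, 1.0], size=(3, d))   # e.g. "go", "stop", "drawer"

M = sum(np.outer(v, k) for k, v in zip(keys, values))   # encode: M += v k^T

noisy_key = keys[1] * rng.choice([1, 1, 1, -1], size=d) # partially corrupted cue
retrieved = np.sign(M @ noisy_key)                      # feed-forward retrieval
print(np.mean(retrieved == values[1]))                  # ~1.0: value recovered
```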
Rule-based reasoning with associative
memory
• Encode a set of rules:
“pre-conditions 
post-conditions”
• Support variable
binding, rule-conflict
handling and partial
rule input
• Example of encoding the
rule “A:1, B:3, C:4  X”
12
Outer product
for binding
Austin, Jim. "Distributed associative memories for high-speed symbolic reasoning." Fuzzy Sets and Systems 82, no. 2 (1996): 223-233.
Memory-augmented neural networks:
computation-storage separation
13
RNN Symposium 2016: Alex Graves - Differentiable Neural Computer
RAM
Neural Turing Machine (NTM)
• Memory is a 2d matrix
• Controller is a neural
network
• The controller
read/writes to memory
at certain addresses.
• Trained end-to-end,
differentiable
• Simulate Turing Machine
support symbolic
reasoning, algorithm
solving
14
Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).
Addressing mechanism in NTM
(Figure: the controller emits an erase vector e_t and an add vector a_t for memory writing, and attention weights for memory reading.)
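A minimal sketch of NTM-style content-based addressing with an erase/add write, following the mechanism of Graves et al. (2014) but omitting location-based addressing and the controller; all shapes are illustrative.

```python
# Sketch of NTM-style content-based addressing with an erase/add write.
import torch
import torch.nn.functional as F

def address(memory, key, beta):
    """memory: [n, d], key: [d], beta: sharpness -> attention weights [n]."""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)
    return torch.softmax(beta * sim, dim=0)

def write(memory, w, erase, add):
    """Erase then add, weighted by the addressing weights w (all differentiable)."""
    memory = memory * (1 - w.unsqueeze(1) * erase.unsqueeze(0))
    return memory + w.unsqueeze(1) * add.unsqueeze(0)

M = torch.randn(8, 16)                       # external memory
k, e, a = torch.randn(16), torch.sigmoid(torch.randn(16)), torch.randn(16)
w = address(M, k, beta=5.0)                  # where to write/read
M = write(M, w, e, a)                        # memory writing
read = w @ M                                 # memory reading
```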
Algorithmic reasoning
16
Copy
Associative
recall
Priority sort
Optimal memory writing for
memorization
• Simple finding: writing too often
deteriorates memory content (not
retainable)
• Given input sequence of length T
and only D writes, when should we
write to the memory?
17
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Learning to Remember More with Less Memorization." In International Conference on Learning Representations. 2018.
Uniform writing is optimal for
memorization
Better memorization means better algorithmic reasoning
18
(Figure: example with T=50 and D=5, comparing regular vs. uniform (cached) writing schedules.)
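A tiny sketch of a uniform writing schedule: given an input of length T and a budget of D writes, write at (roughly) evenly spaced steps rather than at every step. The exact spacing rule used here is an illustrative choice, not the paper's exact formulation.

```python
# Tiny sketch of a uniform writing schedule: write D times, evenly spaced over T.
def uniform_write_steps(T, D):
    interval = T / D
    return [min(T - 1, round((i + 1) * interval) - 1) for i in range(D)]

print(uniform_write_steps(T=50, D=5))   # [9, 19, 29, 39, 49]
```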
Memory of independent entities
• Each slot store one or some entities
• Memory writing is done separately for
each memory slot
each slot maintains the life of one or
more entities
• The memory is a set of N parallel RNNs
19
(Figure: toy example of tracking the entities "John" and "Apple" and their locations (Office, Kitchen) over time, one slot per entity.)
Weston, Jason, Bordes, Antoine, Chopra, Sumit, and Mikolov, Tomas.
Towards ai-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015.
Recurrent entity network
20
Henaff, Mikael, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun.
"Tracking the world state with recurrent entity networks."
In 5th International Conference on Learning Representations, ICLR 2017. 2017.
Recurrent Independent Mechanisms
21
Goyal, Anirudh, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. "Recurrent independent mechanisms.“ ICLR21.
Reasoning with independent
dynamics
22
Copy
Ball
dynamics
Relational memory
Graph memory
Tensor memory
23
Why relational memory? Item memory
is weak at recognizing relationships
Item
Memory
• Store and retrieve individual items
• Relate pairs of items within the same time step
• Fail to relate temporally distant items
24
Dual process in memory
25
• Store items
• Simple, low-order
• System 1
Relational
Memory
• Store relationships between items
• Complicated, high-order
• System 2
Item
Memory
Howard Eichenbaum, Memory, amnesia, and the hippocampal system (MIT press, 1993).
Alex Konkel and Neal J Cohen, "Relational memory and the hippocampus: representations and methods", Frontiers in neuroscience 3 (2009).
Memory as graph
• Memory is a static graph with
fixed nodes and edges
• Relationship is somehow
known
• Each memory node stores
the state of the graph’s node
• Write to node via message
passing
• Read from node via MLP
26
Palm, Rasmus Berg, Ulrich Paquet, and Ole Winther. "Recurrent Relational Networks." In NeurIPS. 2018.
27
(Figure: recurrent relational network examples. bAbI: facts and the question become graph nodes connected by edges, and the answer is read from the graph. CLEVR: nodes encode colour, shape and position; edges encode distance.)
Memory of graphs access conditioned on query
• Encode multiple graphs, each
graph is stored in a set of
memory row
• For each graph, the controller
read/write to the memory:
• Read uses content-based
attention
• Write use message passing
• Aggregate read vectors from
all graphs to create output
28
Pham, Trang, Truyen Tran, and Svetha Venkatesh. "Relational dynamic memory networks." arXiv preprint arXiv:1808.04247 (2018).
Capturing relationships can be done via
memory slot interactions using attention
• Graph memory needs customization to an explicit design of nodes and
edges
• Can we automatically learn structure with a 2D tensor memory?
• Capture relationships: each slot interacts with all other slots (self-
attention)
29
Santoro, Adam, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap.
"Relational recurrent neural networks." In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7310-7321. 2018.
Relational Memory Core (RMC) operation
30
RNN-like
Interface
31
Allowing pair-wise interactions can answer
questions on temporal relationship
Dot product attention works for
simple relationship, but …
32
(Figure: "What is most similar to me?" can be answered with scalar dot-product scores such as 0.7, 0.9, -0.1, 0.4; "What is most similar to me but different from tiger?" cannot.)
For hard relationships, scalar representations are limited
Complicated relationships need high-
order relational memory
33
(Figure: extract items into an item memory, then associate every pair of them into a 3D relational tensor, the relational memory.)
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-
attentive associative memory." In International Conference
on Machine Learning, pp. 5682-5691. PMLR, 2020.
Program memory
Module memory
Stored-program memory
34
Predefining program for subtask
• A program designed for a
task becomes a module
• Parse a question to module
layout (order of program
execution)
• Learn the weight of each
module to master the task
35
Andreas, Jacob, Marcus Rohrbach, Trevor Darrell, and Dan Klein. "Neural module networks." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 39-48. 2016.
Program selection is based on
parser, others are end2end trained
36
(Figure: 5 module templates, selected and laid out by parsing the question.)
The most powerful memory is one that stores
both program and data
• Computer architectures:
Universal Turing
Machine / Harvard / von Neumann
• Stored-program principle
• Break a big task into subtasks,
each handled by a
TM / single-purpose program
stored in a program memory
37
https://en.wikipedia.org/
NUTM: Learn to select program (neural weight)
via program attention
• Neural stored-program memory
(NSM) stores key (the address)
and values (the weight)
• The weight is selected and
loaded to the controller of NTM
• The stored NTM weights and
the weights of the NUTM are
learnt end-to-end by
backpropagation
38
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory."
In International Conference on Learning Representations. 2019.
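A simplified sketch of the stored-program idea: programs are stored as (key, weights) pairs, and program attention blends the stored weights into the working controller's parameters. This is a toy illustration of the principle, not the full NUTM.

```python
# Sketch of a neural stored-program memory: programs stored as (key, weights)
# pairs, selected and mixed by program attention. Illustrative toy only.
import torch
import torch.nn as nn

class ProgramMemory(nn.Module):
    def __init__(self, n_prog, d_key, d_in, d_out):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_prog, d_key))
        self.programs = nn.Parameter(torch.randn(n_prog, d_out, d_in))  # stored weights

    def forward(self, query):
        attn = torch.softmax(self.keys @ query, dim=0)        # program attention
        return torch.einsum('p,poi->oi', attn, self.programs) # blended weights

pm = ProgramMemory(n_prog=4, d_key=16, d_in=8, d_out=8)
state, x = torch.randn(16), torch.randn(8)
W = pm(state)          # pick/mix a program conditioned on the controller state
y = x @ W.t()          # run the selected program on the data
```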
Scaling with memory of mini-programs
• Previously, 1 program = 1 neural
network (millions of
parameters)
• Parameter inefficiency since
the programs do not share
common parameters
• Solution: store sharable
mini-programs to compose
an infinite number of programs
39
it is analogous to building Lego structures
corresponding to inputs from basic Lego bricks.
Recurrent program attention to retrieve
singular components of a program
40
Le, Hung, and Svetha Venkatesh. "Neurocoder: Learning General-Purpose Computation Using Stored Neural Programs." arXiv preprint arXiv:2009.11443 (2020).
41
Program attention is equivalent to
binary decision tree reasoning
Recurrent program attention automatically
detects task boundaries
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels:
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
42
Data Augmentation with Analogical and
Counterfactual Examples
43
• Poor generalization when training under the independent
and identically distributed (i.i.d.) assumption.
• Intuition: augmenting counterfactual samples to allow
machines to understand the critical changes in the
input that lead to changes in the answer space.
• Perceptually similar, yet
• Semantically dissimilar realistic samples
Visual counterfactual example
Language counterfactual examples
Gokhale, Tejas, et al. "Mutant: A training paradigm for out-of-distribution
generalization in visual question answering." EMNLP’20.
Question Generations
44
Li, Yikang, et al. "Visual question generation as dual task of visual question answering." CVPR’18.
Krishna, Ranjay, Michael Bernstein, and Li Fei-Fei. "Information maximizing visual question
generation." CVPR’19.
• Question answering is a zero-shot
learning problem. Question
generation helps cover a wider
range of concepts.
• Question generation can be done
with either supervised or
unsupervised learning.
BERT: Transformer That Predicts Its Own
Masked Parts
46
BERT is like parallel
approximate pseudo-
likelihood
• ~ Maximizing the
conditional likelihood of
some variables given the
rest.
• When the number of
variables is large, this
converges to MLE
(maximum likelihood
estimate).
[Slide credit: Truyen Tran]
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
Visual QA as a Down-stream Task of Visual-
Language BERT Pre-trained Models
47
Numerous pre-trained visual language models during 2019-2021.
VisualBERT (Li, Liunian Harold, et al., 2019)
VL-BERT (Su, Weijie, et al., 2019)
UNITER (Chen, Yen-Chun, et al., 2019)
12-in-1 (Lu, Jiasen, et al., 2020)
Pixel-BERT (Huang, Zhicheng, et al., 2019)
OSCAR (Li, Xiujun, et al., 2020)
Single-stream model Two-stream model
ViLBERT (Lu, Jiasen, et al. , 2019)
LXMERT (Tan, Hao, and Mohit Bansal, 2019)
[Slide credit: Licheng Yu et al.]
Learning with External Knowledge
48
Why external knowledge
for reasoning?
• Questions can be beyond
visual recognition (e.g.
firetrucks usually use a fire
hydrant).
• Human’s prior knowledge for
cognition-level reasoning (e.g.
human’s goals, intents etc.)
Q: What sort of vehicle uses this item?
A: firetruck
Q: What is the sports position of the
man in the orange shirt?
A: goalie/goalkeeper
Marino, Kenneth, et al. "Ok-vqa: A visual question
answering benchmark requiring external
knowledge." CVPR’19.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." CVPR’19.
Learning with External Knowledge
49
Retrieved by Wikipedia search API
Marino, Kenneth, et al. "Ok-vqa: A visual question
answering benchmark requiring external
knowledge." CVPR’19.
Shah, Sanket, et al. "Kvqa: Knowledge-aware visual question
answering." AAAI’19.
Agenda
• Reasoning with external memories
• Memory of entities – memory-augmented neural networks
• Memory of relations with tensors and graphs
• Memory of programs & neural program construction.
• Learning to reason with less labels:
• Data augmentation with analogical and counterfactual examples
• Question generation
• Self-supervised learning for question answering
• Learning with external knowledge graphs
• Recursive reasoning with neural theory of mind.
50
Source: religious studies project
Core AI faculty:
Theory of mind
Where would ToM fit in?
System 1:
Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2:
Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
Image credit: VectorStock | Wikimedia
Perception
Theory of mind
Recursive reasoning
Facts
Semantics
Events and relations
Working space
Memory
Contextualized recursive reasoning
• Thus far, QA tasks are straightforward and objective:
• Questioner: I will ask about what I don’t know.
• Answerer: I will answer what I know.
• Real life can be tricky, more subjective:
• Questioner: I will ask only questions I think they can
answer.
• Answerer 1: This is what I think they want from an answer.
• Answerer 2: I will answer only what I think they think I can.
14/08/2021 53
 We need Theory of Mind to function socially.
Social dilemma: Stag Hunt games
• Difficult decision: individual outcomes (selfish)
or group outcomes (cooperative).
• Together hunt Stag (both are cooperative): Both have more
meat.
• Solely hunt Hare (both are selfish): Both have less meat.
• One hunts Stag (cooperative), the other hunts Hare (selfish): only
the one hunting Hare has meat.
• Human evidence: Self-interested but
considerate of others (cultures vary).
• Idea: Belief-based guilt-aversion
• One experiences loss if it lets other down.
• Necessitates Theory of Mind: reasoning about other’s mind.
Theory of Mind Agent with Guilt Aversion (ToMAGA)
Update Theory of Mind
• Predict whether the other's behaviour is
cooperative or uncooperative
• Update the zero-order belief (what the
other will do)
• Update the first-order belief (what the other
thinks about me)
Guilt Aversion
• Compute the expected material reward
of the other based on Theory of Mind
• Compute the psychological rewards, i.e.
“feeling guilty”
• Reward shaping: subtract the expected
loss of the other.
Nguyen, Dung, et al. "Theory of Mind with Guilt Aversion Facilitates
Cooperative Reinforcement Learning." Asian Conference on Machine
Learning. PMLR, 2020.
[Slide credit: Dung Nguyen]
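A tiny sketch of the guilt-averse reward shaping described above: the agent's shaped reward is its material payoff minus a penalty for letting the other agent down, relative to what (it believes) the other expected. The payoff numbers and the guilt weight are hypothetical.

```python
# Tiny sketch of belief-based guilt-averse reward shaping: material payoff minus
# a penalty proportional to how much the other agent is let down.
def shaped_reward(material_reward, other_actual_payoff,
                  other_expected_payoff, guilt_weight=0.5):
    letdown = max(0.0, other_expected_payoff - other_actual_payoff)
    return material_reward - guilt_weight * letdown   # "feeling guilty" penalty

# Stag Hunt flavour: I defected (hunted hare) while the other expected cooperation.
print(shaped_reward(material_reward=3.0,
                    other_actual_payoff=0.0,
                    other_expected_payoff=4.0))       # 3.0 - 0.5*4.0 = 1.0
```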
Machine Theory of Mind Architecture (inside the Observer)
Successor
representations
next-step action
probability
goal
Rabinowitz, Neil, et al. "Machine theory of mind." International conference on machine learning. PMLR, 2018.
[Slide credit: Dung Nguyen]
A ToM
architecture
• Observer maintains memory of
previous episodes of the agent.
• It theorizes the “traits” of the
agent.
• Implemented as Hyper Networks.
• Given the current episode, the
observer tries to infer goal,
intention, action, etc of the
agent.
• Implemented as memory retrieval
through attention mechanisms.
14/08/2021 57
Wrapping up
58
Wrapping up
• Reasoning as the next challenge for deep neural networks
• Part A: Learning-to-reason framework
• Reasoning as a prediction skill that can be learnt from data
• Dynamic neural networks are capable
• Combinatorics reasoning
• Part B: Reasoning over unstructured and structured data
• Reasoning over unstructured sets
• Relational reasoning over structured data
• Part C: Memory | Data efficiency | Recursive reasoning
• Memories of items, relations and programs
• Learning with less labels
• Theory of mind
14/08/2021 59
A possible framework for learning and reasoning
with deep neural networks
System 1:
Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2:
Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
Image credit: VectorStock | Wikimedia
Perception
Theory of mind
Recursive reasoning
Facts
Semantics
Events and relations
Working space
Memory
QA
14/08/2021 61
https://bit.ly/37DYQn7

From deep learning to deep reasoning

  • 1.
    From Deep Learningto Deep Reasoning 14/08/2021 1 Tutorial at KDD, August 14th 2021 Truyen Tran, Vuong Le, Hung Le and Thao Le {truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au https://bit.ly/37DYQn7 Part A: Learning to reason
  • 2.
    Logistics 14/08/2021 2 Truyen TranVuong Le Hung Le Thao Le https://bit.ly/37DYQn7
  • 3.
    Agenda • Introduction • PartA: Learning-to-reason framework • Part B: Reasoning over unstructured and structured data • Part C: Memory | Data efficiency | Recursive reasoning 14/08/2021 3
  • 4.
    2012 2016 AusDM 2016 Turing Awards2018 GPT-3 2020 DL: 8 years snapshot
  • 5.
    DL has beenfantastic, but … • It is great at interpolating •  data hungry to cover all variations and smooth local manifolds •  little systematic generalization (novel combinations) • Lack of human-perceived reasoning capability • Lack natural mechanism to incorporate prior knowledge, e.g., common sense • No built-in causal mechanisms •  Have trust issues! • To be fair, may of these problems are common in statistical learning! 14/08/2021 5
  • 6.
    Why still DLin 2021? Theoretical Expressiveness: Neural nets can approximate any function. Learnability: Neural nets are trained easily. Generalisability: Neural nets generalize surprisingly well to unseen data. Practical Generality: Applicable to many domains. Competitive: DL is hard to beat as long as there are data to train. Scalability: DL is better with more data, and it is very scalable.
  • 7.
    The next AI/MLchallenge 2020s-2030s  Learning + reasoning, general purpose, human-like  Has contextual and common- sense reasoning  Requires less data  Adapt to change  Explainable Photo credit: DARPA
  • 8.
    Toward deeper reasoning System1: Intuitive System 1: Intuitive System 1: Intuitive • Fast • Implicit/automatic • Pattern recognition • Multiple System 2: Analytical • Slow • Deliberate/rational • Careful analysis • Single, sequential Single Image credit: VectorStock | Wikimedia Perception Theory of mind Recursive reasoning Facts Semantics Events and relations Working space Memory
  • 9.
    System 2 • Holdshypothetical thought • Decoupling from representation • Working memory size is not essential. Its attentional control is. 14/08/2021 9
  • 10.
    Figure credit: JonathanHui Reasoning in Probabilistic Graphical Models (PGM) • Assuming models are fully specified (e.g., by hand or learnt) • Estimate MAP as energy minimization • Compute marginal probability • Compute expectation & normalisation constant • Key algorithm: Pearl’s Belief Propagation, a.k.a Sum-Product algorithm in factor graphs. • Known result in 2001-2003: BP minimises Bethe free-energy minimization. 14/08/2021 10 Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free energy." Advances in neural information processing systems. 2003.
  • 11.
    Can we learnto infer directly from data without full specification of models? 14/08/2021 11
  • 12.
    Agenda • Introduction • PartA: Learning-to-reason framework • Part B: Reasoning over unstructured and structured data • Part C: Memory | Data efficiency | Recursive reasoning 14/08/2021 12
  • 13.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution • Compositional attention networks. • Neural module networks. • Combinatorics reasoning 14/08/2021 13
  • 14.
    Learning to reason •Learning is to self-improve by experiencing ~ acquiring knowledge & skills • Reasoning is to deduce knowledge from previously acquired knowledge in response to a query (or a cues) • Learning to reason is to improve the ability to decide if a knowledge base entails a predicate. • E.g., given a video f, determines if the person with the hat turns before singing. • Hypotheses: • Reasoning as just-in-time program synthesis. • It employs conditional computation. 14/08/2021 14 Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725. (Dan Roth; ACM Fellow; IJCAI John McCarthy Award)
  • 15.
    Learning to reason,a definition 14/08/2021 15 Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725. E.g., given a video f, determines if the person with the hat turns before singing.
  • 16.
    Practical setting: (query,database,answer)triplets • This is very general: • Classification: Query = what is this? Database = data. • Regression: Query = how much? Database = data. • QA: Query = NLP question. Database = context/image/text. • Multi-task learning: Query = task ID. Database = data. • Zero-shot learning: Query = task description. Database = data. • Drug-protein binding: Query = drug. Database = protein. • Recommender system: Query = User (or item). Database = inventories (or user base); 14/08/2021 16
  • 17.
    Can neural networksreason? Reasoning is not necessarily achieved by making logical inferences There is a continuity between [algebraically rich inference] and [connecting together trainable learning systems] Central to reasoning is composition rules to guide the combinations of modules to address new tasks 14/08/2021 17 “When we observe a visual scene, when we hear a complex sentence, we are able to explain in formal terms the relation of the objects in the scene, or the precise meaning of the sentence components. However, there is no evidence that such a formal analysis necessarily takes place: we see a scene, we hear a sentence, and we just know what they mean. This suggests the existence of a middle layer, already a form of reasoning, but not yet formal or logical.” Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149.
  • 18.
    Hypotheses • Reasoning asjust-in-time program synthesis. • It employs conditional computation. • Reasoning is recursive, e.g., mental travel. 14/08/2021 18
  • 19.
    Two approaches toneural reasoning • Implicit chaining of predicates through recurrence: • Step-wise query-specific attention to relevant concepts & relations. • Iterative concept refinement & combination, e.g., through a working memory. • Answer is computed from the last memory state & question embedding. • Explicit program synthesis: • There is a set of modules, each performs an pre-defined operation. • Question is parse into a symbolic program. • The program is implemented as a computational graph constructed by chaining separate modules. • The program is executed to compute an answer. 14/08/2021 19
  • 20.
    In search forbasic neural operators for reasoning • Basics: • Neuron as feature detector  Sensor, filter • Computational graph  Circuit • Skip-connection  Short circuit • Essentials • Multiplicative gates  AND gate, Transistor, Resistor • Attention mechanism  SWITCH gate • Memory + forgetting  Capacitor + leakage • Compositionality  Modular design • .. 14/08/2021 20 Photo credit: Nicola Asuni
  • 21.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. • Combinatorics reasoning 14/08/2021 21
  • 22.
Concept-object binding • Perceived data (e.g., visual objects) may not share the same semantic space with high-level concepts. • Binding between concepts and objects enables reasoning at the concept level 14/08/2021 22 Example of concept-object binding in LOGNet (Le et al, IJCAI'2020) More reading: Greff, Klaus, Sjoerd van Steenkiste, and Jürgen Schmidhuber. "On the binding problem in artificial neural networks." arXiv preprint arXiv:2012.05208 (2020).
  • 23.
Attention: Picking up only what is needed at a step • Need an attention model to select or ignore certain computations or inputs • Can be "soft" (differentiable) or "hard" (requires RL) • Needed for selecting predicates in reasoning. • Attention provides a short-cut for long-term dependencies • Needed for long chains of reasoning. • Also encourages sparsity if done right! http://distill.pub/2016/augmented-rnns/
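A minimal numpy sketch of soft (differentiable) attention as a query-key-value read; the dimensions and the scaled dot-product scoring are illustrative choices, not the only option. Hard attention would instead sample a single index and typically needs RL to train.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Differentiable ("soft") attention: a weighted read over all inputs.

    query:  (d,)    what the current reasoning step is looking for
    keys:   (n, d)  one key per candidate input / predicate
    values: (n, dv) the content that actually gets read out
    """
    scores = keys @ query / np.sqrt(keys.shape[-1])     # relevance of each input
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                   # softmax -> selection distribution
    return weights @ values, weights                    # attended summary + where we looked

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))
values = rng.normal(size=(5, 4))
query = keys[2] + 0.1 * rng.normal(size=8)              # query close to the third input
read, w = soft_attention(query, keys, values)
print(np.round(w, 2))                                   # weight mass concentrates on input 2
```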
  • 24.
Fast weights | HyperNet – the multiplicative interaction • Early ideas in the early 1990s by Juergen Schmidhuber and collaborators. • Data-dependent weights | Using a controller to generate the weights of the main net. 14/08/2021 24 Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
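A toy sketch of the fast-weight / hypernetwork idea, assuming a single dense layer whose weights are produced by a small linear controller from a context vector; the names and sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def hyper_layer(x, context, W_hyper, b_hyper):
    """Fast weights: a small controller generates the weights of the main layer.

    x:       (d_in,)  input to the main network
    context: (d_c,)   conditioning signal (e.g. a query embedding)
    The controller maps the context to a full (d_out, d_in) weight matrix, so the
    main computation x -> W(context) @ x is multiplicative in the context.
    """
    d_in = x.shape[0]
    d_out = b_hyper.shape[0] // d_in
    W_main = (W_hyper @ context + b_hyper).reshape(d_out, d_in)   # data-dependent weights
    return np.tanh(W_main @ x)

d_in, d_out, d_c = 6, 3, 4
W_hyper = rng.normal(scale=0.1, size=(d_out * d_in, d_c))
b_hyper = rng.normal(scale=0.1, size=(d_out * d_in,))
x = rng.normal(size=d_in)
print(hyper_layer(x, rng.normal(size=d_c), W_hyper, b_hyper))
```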
  • 25.
Memory networks: Holding the data ready for inference • Input is a set → load into memory, which is NOT updated. • State is an RNN with attention reading from the inputs • Concepts: query, key and content + content addressing. • Deep models, but constant path length from input to output. • Equivalent to an RNN with a shared input set. 14/08/2021 25 Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
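A minimal sketch of one memory "hop" in the spirit of end-to-end memory networks: the input set is written once into a static memory, and the controller state is refined by repeated content-based reads. For brevity the same embedding serves as both key and value, unlike the separate input/output embeddings used in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(state, memory):
    """One hop: address the static memory by content, read, and update the state."""
    p = softmax(memory @ state)       # attention over the stored input set
    return state + p @ memory         # residual update of the controller state

rng = np.random.default_rng(1)
facts = rng.normal(size=(10, 16))     # embedded input set, written once and never updated
state = rng.normal(size=16)           # question embedding
for _ in range(3):                    # more hops = deeper reasoning, constant path length
    state = memory_hop(state, facts)
print(state.shape)                    # the answer is decoded from the final state
```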
  • 26.
    Transformers: Analogical reasoningthrough self- attention 14/08/2021 26 Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020). State Key Query Memory
  • 27.
Transformer as implicit reasoning • Recall: reasoning as (free-) energy minimisation • The classic Belief Propagation algorithm is a minimization algorithm for the Bethe free energy! • The Transformer's relational, iterative state refinement makes it a great candidate for implicit relational reasoning. 14/08/2021 27 Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
  • 28.
Transformer vs. memory networks • Memory network: • Attention over the input set • One hidden state update at a time. • The final state integrates information from the set, conditioned on the query. • Transformer: • Loads all inputs into working memory • Assigns one hidden state per input element. • All hidden states (including those from the query) are used to compute the answer. 14/08/2021 28
  • 29.
    Universal transformers 14/08/2021 29 https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html Dehghani,Mostafa, et al. "Universal Transformers." International Conference on Learning Representations. 2018.
  • 30.
    Dynamic neural networks •Memory-Augmented Neural Networks • Modular program layout • Program synthesis 14/08/2021 30
  • 31.
Neural Turing machine (NTM) A memory-augmented neural network (MANN) • A controller that takes input/output and talks to an external memory module. • Memory has read/write operations. • The main issue is where to write, and how to update the memory state. • All operations are differentiable. Source: rylanschaeffer.github.io
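A simplified, illustrative version of one differentiable memory access: content-based addressing by sharpened cosine similarity, then a blended erase-and-add write followed by a read. The full NTM addressing also interpolates with the previous weights, shifts and sharpens them; those parts are omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ntm_step(M, key, erase, add, beta=5.0):
    """One differentiable memory access in the spirit of the NTM.

    M:     (n_slots, width) memory matrix
    key:   (width,) content key emitted by the controller
    erase: (width,) erase vector in [0, 1]
    add:   (width,) add vector
    """
    # Content-based addressing: cosine similarity sharpened by beta.
    sim = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sim)                               # soft address over slots
    M = M * (1 - np.outer(w, erase)) + np.outer(w, add)   # blended erase-then-add write
    r = w @ M                                             # read vector
    return M, r, w

rng = np.random.default_rng(0)
M = rng.normal(scale=0.1, size=(8, 6))
M, r, w = ntm_step(M, key=rng.normal(size=6), erase=np.full(6, 0.5), add=rng.normal(size=6))
print(np.round(w, 2), r.shape)
```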
  • 32.
    MANN for reasoning •Three steps: • Store data into memory • Read query, process sequentially, consult memory • Output answer • Behind the scene: • Memory contains data & results of intermediate steps • LOGNet does the same, memory consists of object representations • Drawbacks of current MANNs: • No memory of controllers  Less modularity and compositionality when query is complex • No memory of relations  Much harder to chain predicates. 14/08/2021 32 Source: rylanschaeffer.github.io
  • 33.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. • Combinatorics reasoning 14/08/2021 33
  • 34.
    MAC Net: Recurrent, iterativerepresentation refinement 14/08/2021 34 Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning." ICLR 2018.
  • 35.
Module networks (reasoning by constructing and executing neural programs) • Reasoning as laying out modules to reach an answer • Composable neural architecture → question parsed as a program (layout of modules) • A module is a function (x → y); it could be a sub-reasoning process ((x, q) → y). 14/08/2021 35 https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
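A hand-coded sketch of module composition: "find", "relocate" and "exist" modules chained by a hypothetical, hand-written layout for one question. In a real neural module network the modules are learned neural functions and the layout comes from parsing the question.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBJ = 5
obj_colors = np.array([0, 1, 1, 2, 0])                       # toy attributes: 0=red, 1=blue, 2=green
left_of = (rng.random((N_OBJ, N_OBJ)) < 0.4).astype(float)   # left_of[i, j] = 1: object i is left of j

def find(color):                      # attend to objects with the given attribute
    return (obj_colors == color).astype(float)

def relocate(att):                    # shift attention to objects left of the attended ones
    return np.clip(left_of @ att, 0.0, 1.0)

def exist(att):                       # answer module: does anything remain attended?
    return bool(att.max() > 0.5)

# "Is there something to the left of a blue object?" parsed into a module layout:
layout = [("find", 1), ("relocate", None), ("exist", None)]
att, answer = None, None
for name, arg in layout:              # execute the composed program
    if name == "find":
        att = find(arg)
    elif name == "relocate":
        att = relocate(att)
    elif name == "exist":
        answer = exist(att)
print(answer)
```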
  • 36.
    Putting things together: Aframework for visual reasoning 14/08/2021 36 @Truyen Tran & Vuong Le, Deakin Uni
  • 37.
    Part A: Sub-topics •Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Concept-object binding. • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. • Combinatorics reasoning 14/08/2021 37
  • 38.
Implement combinatorial algorithms with neural networks 38 (Figure: classical algorithms are generalizable but inflexible; neural networks cope with noisy, high-dimensional inputs.) Train a neural processor P to imitate algorithm A. Processor P: (a) is aligned with the computations of the target algorithm; (b) operates by matrix multiplications, hence natively admits useful gradients; (c) operates over high-dimensional latent spaces. Veličković, Petar, and Charles Blundell. "Neural Algorithmic Reasoning." arXiv preprint arXiv:2105.02761 (2021).
  • 39.
Processor as RNN • Does not assume knowledge of the input structure; treating the input as a sequence is not really reasonable and is harder to generalize • An RNN is Turing-complete → it can simulate any algorithm • But it is not easy to learn the simulation from data (input–output pairs). Pointer network 39 Assumes O(N) memory and O(N^2) computation, where N is the size of the input. Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pp. 2692-2700. 2015.
  • 40.
Processor as MANN • A MANN simulates a neural computer or Turing machine → ideal for implementing algorithms • Sequential input, no assumption on input structure • Assumes O(1) memory and O(N) computation 40 Graves, A., Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
  • 41.
    Sequential encoding ofgraphs 41 • Each node is associated with random one-hot or binary features • Output is the features of the solution [x1,y1, feature1], [x2,y2, feature2], … [feature4], [feature2], … Geometry [node_feature1, node_feature2, edge12], [node_feature1, node_feature2, edge13], … [node_feature4], [node_feature2], … Graph Convex Hull TSP Shortest Path Minimum Spanning Tree Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive associative memory." In International Conference on Machine Learning, pp. 5682-5691. PMLR, 2020.
  • 42.
    DNC: graph reasoning 42 Graves, A.,Wayne, G., Reynolds, M. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)
  • 43.
    NUTM: learning multiplealgorithms at once 43 Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory." In International Conference on Learning Representations. 2019.
  • 44.
Processor as graph neural network (GNN) 44 https://petar-v.com/talks/Algo-WWW.pdf Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. "Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019. Motivation: • Many algorithms operate on graphs • Supervise graph neural networks with the algorithm's operations/steps/final output • Encoder-Process-Decode framework: attention + message passing
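A bare-bones sketch of the Encoder-Process-Decode recipe with a message-passing processor; max aggregation is used because it mirrors dynamic-programming style relaxations, but the architectural details here are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mp_step(H, A, W_msg, W_upd):
    """One message-passing step of a (very) simplified GNN processor.

    H: (n, d) node states, A: (n, n) adjacency (0/1); weights are dense layers.
    Messages from neighbours are aggregated by max, mimicking DP-style updates
    such as Bellman-Ford relaxations.
    """
    msgs = relu(H @ W_msg)                                      # per-node outgoing message
    agg = np.where(A[:, :, None] > 0, msgs[None, :, :], -np.inf).max(axis=1)
    agg = np.where(np.isfinite(agg), agg, 0.0)                  # isolated nodes receive zero
    return relu(np.concatenate([H, agg], axis=-1) @ W_upd)

rng = np.random.default_rng(0)
n, d = 6, 8
A = (rng.random((n, n)) < 0.3).astype(float)
H = rng.normal(size=(n, d))                                     # "encode": embedded node inputs
W_msg = rng.normal(scale=0.3, size=(d, d))
W_upd = rng.normal(scale=0.3, size=(2 * d, d))
for _ in range(3):                                              # "process" for a few steps
    H = mp_step(H, A, W_msg, W_upd)
print(H.shape)                                                  # "decode" would map H to the algorithm's output
```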
  • 45.
Example: GNN for a specific problem (DNF counting) • Count #assignments that satisfy a disjunctive normal form (DNF) formula • Exact counting is #P-hard; the classical approximate counting algorithm runs in O(mn) • m: #clauses, n: #variables • Supervised training at the output level 45 Best: O(m+n) Abboud, Ralph, Ismail Ceylan, and Thomas Lukasiewicz. "Learning to reason: Leveraging neural networks for approximate DNF counting." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3097-3104. 2020.
  • 46.
Neural networks and algorithm alignment 46 Xu, Keyulu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. "What Can Neural Networks Reason About?." ICLR 2020 (2020). https://petar-v.com/talks/Algo-WWW.pdf Neural exhaustive search
  • 47.
    GNN is alignedwith Dynamic Programming (DP) 47 Neural exhaustive search
  • 48.
If alignment exists → step-by-step supervision 48 Veličković, Petar, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. "Neural Execution of Graph Algorithms." In International Conference on Learning Representations. 2019. • Merely simulates the classical graph algorithm, but generalizable • No algorithm discovery • Joint training is encouraged
  • 49.
Processor as Transformer • Back to an input sequence (set), but stronger generalization • Transformer with an encoder mask ~ graph attention • Use a Transformer with: • Binary representation of numbers • Dynamic conditional masking 49 Yan, Yujun, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi. "Neural Execution Engines: Learning to Execute Subroutines." Advances in Neural Information Processing Systems 33 (2020). (Figure: next step, masked encoding, decoding, mask prediction.)
  • 50.
  • 51.
    End of partA 14/08/2021 51 https://bit.ly/37DYQn7
  • 52.
    From Deep Learningto Deep Reasoning 14/08/2021 1 Tutorial at KDD, August 14th 2021 Truyen Tran, Vuong Le, Hung Le and Thao Le {truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au https://bit.ly/37DYQn7 Part B: Reasoning over unstructured and structured data
  • 53.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 2 14/08/2021
  • 54.
Learning to Reason formulation • Input: • A knowledge context C • A query q • Output: an answer a satisfying the query • C can be • structured: knowledge graphs • unstructured: text, image, sound, video Q: Is it simply an optimization problem like recognition, detection or even translation? → No, because the logic mapping (C, q) to a is more complex than in other solved optimization problems → We can solve (some parts of) it with good structures and inference strategies Q: "What affects her mobility?" 14/08/2021 3
  • 55.
A case study: Image Question Answering • Realization • C: visual content of an image • q: a linguistic question • a: a linguistic phrase as the answer to q regarding C • Challenges • Reasoning through facts and logic • Cross-modality integration 14/08/2021 4
  • 56.
    Image QA: Questiontypes 14/08/2021 Slide credit: Thao Minh Le 5
  • 57.
    Image QA datasets 14/08/2021Slide credit: Thao Minh Le 6
  • 58.
The two main themes in Image QA • Neuro-symbolic reasoning • Parse the question into a "program" of small steps • Learn the generic steps as neural modules • Use and reuse the modules for different programs • Compositional reasoning • Extract visual and linguistic individual and joint representations • Reasoning happens on the structure of the representation • Sets/graphs/sequences • The representation gets refined through multi-step compositional reasoning 14/08/2021 7
  • 59.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 8 14/08/2021
  • 60.
    A simple approach Issue: This is very susceptible to the nuances of images and questions 14/08/2021 Agrawal et al., 2015, Slide credit: Thao Minh Le 9
  • 61.
Reasoning as set-set interaction • O: a set of context objects • Faster-RCNN regions • CNN tubes • L: a set of linguistic objects – biLSTM embeddings of the question q → Reasoning is formulated as the interaction between the two sets O and L for the answer a 14/08/2021 10
  • 62.
Set operations • Reducing operations (e.g., sum/average/max) • Attention-based combination (Bahdanau et al. 2015) • Attention weights as query-key dot product (Vaswani et al., 2017) → Attention-based set ops seem very suitable for visual reasoning 14/08/2021 11
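A small numpy comparison of the two kinds of set operation: query-agnostic reductions versus an attention-based, query-driven combination. The feature sizes and the dot-product scoring are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
O = rng.normal(size=(7, 16))      # set of visual objects (e.g. Faster-RCNN region features)
q = rng.normal(size=16)           # question summary vector

# Reducing operations: order-invariant but query-agnostic.
o_sum, o_mean, o_max = O.sum(0), O.mean(0), O.max(0)

# Attention-based combination: the query decides how much each object contributes.
w = softmax(O @ q / np.sqrt(O.shape[1]))   # query-key dot-product scores
o_att = w @ O                               # attended set summary
print(np.round(w, 2))
```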
  • 63.
Attention-based reasoning • Unidirectional attention • Find relation scores between parts of the context C and the question q: Options for f: • Hermann et al. (2015) • Chen et al. (2016) • Normalized by softmax into attention weights • Attended context vector: → We can now extract information from the context that is "relevant" to the query 14/08/2021 12
  • 64.
    Bottom-up-top-down attention (Andersonet al 2017) • Bottom-up set construction: Choosing Faster-RCNN regions with high class scores • Top-down attention: Attending on visual features by question  Q: How about attention from vision objects to linguistic objects? 14/08/2021 13
  • 65.
Bi-directional attention • Question-context similarity measure • Question-guided context attention • Softmax across columns • Context-guided question attention • Softmax across rows → Q: Probably not working for image QA, where single words do not have a co-reference with a region? 14/08/2021 Dynamic coattention networks for question answering (Seo et al., ICLR 2017) 14
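A compact sketch of co-attention: one similarity matrix, softmax along each axis, yielding question-guided context vectors and context-guided question vectors. This follows the general recipe rather than any single paper's exact formulation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 16))     # context objects (regions or context words)
Q = rng.normal(size=(4, 16))     # question words

S = C @ Q.T                       # (6, 4) question-context similarity matrix
A_c = softmax(S, axis=1)          # per context element: attention over question words
A_q = softmax(S, axis=0)          # per question word: attention over context elements

C_aware = A_c @ Q                 # question-guided context representation
Q_aware = A_q.T @ C               # context-guided question representation
print(C_aware.shape, Q_aware.shape)
```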
  • 66.
    Hierarchical co-attention forImageQA • The co-attention is found on a word-phrase-sentence hierarchy  better cross-domain co-references  Q: Can this be done on text qa as well?  Q: How about questions with many reasoning hops? 14/08/2021 15
  • 67.
Multi-step compositional reasoning • Complex questions need multiple hops of reasoning • Relations inside the context are multi-step themselves • A single shot of attention won't be enough • A single shot of information gathering is definitely not enough 16 → Q: How to do multi-hop attentional reasoning? 14/08/2021 Figure: Hudson and Manning – ICLR 2018
  • 68.
Multi-step reasoning - Memory, Attention, and Composition (MAC Nets) • Attention reasoning is done through multiple sequential steps. • Each step is done with a recurrent neural cell • What are the key differences from a normal RNN (LSTM/GRU) cell? • The input is not sequential; it is sequential processing on a static input set. • Guided by the question through a controller. 14/08/2021 MAC network, Hudson and Manning – ICLR 2018 17
  • 69.
Multi-step attentional reasoning • At each step, the controller decides what to look at next • After each step, a piece of information is gathered, represented through the attention map over question words and visual objects • A common memory keeps all the information extracted toward an answer 14/08/2021 MAC network, Hudson and Manning – ICLR 2018 18
  • 70.
    Multi-step attentional reasoning •Step 1: attends to the “tiny blue block”, updating m1 • Step 2: look for “the sphere in front” m2. • Step3: traverse from the cyan ball to the final objective – the purple cylinder, 19 14/08/2021
  • 71.
Reasoning as set-set interaction – a look back • O: a set of context objects • L: a set of linguistic objects • Reasoning is formulated as the interaction between the two sets O and L for the answer a Q: What is the brown animal sitting inside of? → Q: Set-set interaction falls short for questions about relations between objects 14/08/2021 20
  • 72.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 21 14/08/2021
  • 73.
    Reasoning on Graphs •Relational questions: requiring explicit reasoning about the relations between multiple objects 14/08/2021 Figure credit: Santoro et al 2017 22
  • 74.
Relation networks (Santoro et al 2017): RN(O) = f_φ( Σ_{i,j} g_θ(o_i, o_j) ) • g_θ and f_φ are neural functions • g_θ generates the "relation" between the two objects • f_φ (applied to the sum over pairs) is the aggregation function → The relations here are implicit, complete, pair-wise – inefficient, and lack expressiveness 14/08/2021 23
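The formula above, written out as a slow, illustrative numpy loop; each pair is also conditioned on the question, as in the VQA variant of Relation Networks, and single linear layers with ReLU stand in for the paper's MLPs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relation_network(objects, q, Wg, Wf):
    """RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j, q) ): reason over all object pairs."""
    n = objects.shape[0]
    pair_sum = 0.0
    for i in range(n):
        for j in range(n):
            pair = np.concatenate([objects[i], objects[j], q])
            pair_sum = pair_sum + relu(Wg @ pair)     # g_theta: per-pair "relation"
    return Wf @ pair_sum                              # f_phi: aggregate to answer logits

rng = np.random.default_rng(0)
objs = rng.normal(size=(5, 8))        # e.g. CNN grid cells or detected objects
q = rng.normal(size=6)                # question embedding
Wg = rng.normal(scale=0.1, size=(16, 8 + 8 + 6))
Wf = rng.normal(scale=0.1, size=(3, 16))
print(relation_network(objs, q, Wg, Wf))   # O(n^2) pairs -- the inefficiency noted above
```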
  • 75.
Reasoning with graph convolution networks • The input graph is built from image entities and the question • A GCN is used to gather facts and produce the answer → The relations are now explicit and pruned → But the graph building is very stiff: - Unrecoverable if it makes a mistake - Information gathered during reasoning is not used to build the graphs 14/08/2021 Narasimhan et al., NIPS 2018 24
  • 76.
Reasoning with graph attention networks • The graph is determined during the reasoning process with an attention mechanism → The relations are now adaptive and integrated with reasoning → But are the relations singular and static? 14/08/2021 ReGAT model, Li et al., ICCV'19 25
  • 77.
Dynamic reasoning graphs • For complex questions, multiple sets of relations are needed • We need not only multi-step but also multi-form structures • Let's build multiple graphs dynamically! 14/08/2021 LCGN, Hu et al., ICCV'19 26
  • 78.
    Dynamic reasoning graphs Thequestions so far act as an unstructured command in the process Aren’t their structures and relations important too? 14/08/2021 LCGN, Hu et.al. ICCV19 27
  • 79.
    Reasoning on cross-modalitygraphs • Two types of nodes: Linguistic entities and visual objects • Two types of edges: • Visual • Linguistic-visual binding (as a fuzzy grounding) • Adaptively updated during reasoning 14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 28
  • 80.
    Language-binding Object Graph(LOG) Unit • Graph constructor: build the dynamic vision graph • Language binding constructor: find the dynamic L-V relations 14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 29
  • 81.
    LOGNet: multi-step visual-linguisticbinding • Object-centric representation  • Multi-step/multi-structure compositional reasoning  • Linguistic-vision detail interaction  14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 30
  • 82.
    Dynamic language-vision graphsin actions 14/08/2021 LOGNet, T.M Le et.al. IJCAI2020 31
  • 83.
We have sets and graphs; how about sequences? • Videos pose another challenge for visual reasoning: the dynamics through time. • Sets and graphs now become sequences of such. • Temporal relations are the key factors • The size of the context is a core issue 14/08/2021 32
  • 84.
    Agenda • Cross-modality reasoning,the case of vision-language integration. • Reasoning as set-set interaction. • Relational reasoning • Temporal reasoning • Video question answering. 33 14/08/2021
  • 85.
    Overview • Goals ofthis part of the tutorial • Understanding Video QA as a complete testbed of visual reasoning. • Representative state-of-the-art approaches for spatio-temporal reasoning. 34 14/08/2021
  • 86.
    Video Question Answering Short-formVideo Question Answering Movie Question Answering 35 14/08/2021
  • 87.
36 (Figure: Visual QA sits at the intersection of Computer Vision, Natural Language Processing and Machine Learning, touching qualitative spatial reasoning, relational and temporal inference, commonsense, object recognition, scene graphs, parsing, symbol binding, systematic generalization, learning to classify entailment, unsupervised and reinforcement learning, program synthesis, action graphs, event detection and object discovery.) 14/08/2021 36
  • 88.
Challenges 37 • Difficulties in data annotation. • Content for performing reasoning spreads over space-time and multiple modalities (videos, subtitles, speech etc.) 14/08/2021
  • 89.
    Video QA Datasets 38 38 MovieQA (Tapaswi, M., et al., 2016) MSRVTT-QA and MSVD-QA (Xu, D., et al., 2017) TGIF-QA (Jang, Y., et al., 2017) MarioQA (Mun, J., et al., 2017) CLEVRER (Yi, K., et al., 2019) KnowIT VQA (Garcia, N., et al., 2020) 14/08/2021
  • 90.
    Video QA datasets 39 39 (TGIF-QA,Jang et al., 2018) (CLEVRER, Yi, Kexin, et al., 2020) 14/08/2021
  • 91.
    Video QA asa spatio-temporal extension of Image QA 40 (a) Extended end-to-end memory network (b) Extended simple VQA model (c) Extended temporal attention model (d) Extended sequence- to-sequence model 14/08/2021 Zeng, Kuo-Hao, et al. "Leveraging video descriptions to learn video question answering." AAAI’17.
  • 92.
    Spatio-temporal cross-modality alignment 41 Key ideas: •Explore the correlation between vision and language via attention mechanisms. • Joint representations are query-driven spatio-temporal features of a given videos. 14/08/2021 Zhao, Zhou, et al. "Video question answering via hierarchical dual-level attention network learning." ACL’17.
  • 93.
    Memory-based Video QA 42 GeneralDynamic Memory Network (DMN) Co-memory attention networks for Video QA Key ideas: • DMN refines attention over a set of facts to extract reasoning clues. • Motion and appearance features are complementary clues for question answering. 14/08/2021 Gao, Jiyang, et al. "Motion-appearance co-memory networks for video question answering." CVPR’18.
  • 94.
    Memory-based Video QA 43 Heterogeneousvideo memory for Video QA Key differences: • Learning a joint representation of multimodal inputs at each memory read/write step. • Utilizing external question memory to model context-dependent question words. 14/08/2021 Fan, Chenyou, et al. "Heterogeneous memory enhanced multimodal attention model for video question answering." CVPR’19.
  • 95.
    Multimodal reasoning unitsfor Video QA 44 • CRN: Conditional Relation Networks. • Inputs: • Frame-based appearance features • Motion features • Query features • Outputs: • Joint representations encoding temporal relations, motion, query. . 14/08/2021 Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering.“ CVPR’20
  • 96.
    Object-oriented spatio-temporal reasoningfor Video QA 45 • OSTR: Object-oriented Spatio-Temporal Reasoning. • Inputs: • Object lives tracked through time. • Context (motion). • Query features. • Outputs: • Joint representations encoding temporal relations, motion, query. . 14/08/2021 Dang, Long Hoang, et al. "Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering." IJCAI’21
  • 97.
Video QA as a down-stream task of video-language pre-training 46 VideoBERT Apr., 2019 HowTo100M Jun., 2019 MIL-NCE Dec., 2019 UniViLM Feb., 2020 HERO May, 2020 ClipBERT Feb., 2021 14/08/2021
  • 98.
    VideoBERT: a jointmodel for video and language representation learning 47 • Data for training: Sample videos and texts from YouCook II. Instructions in text given by ASR toolkit Subsampled video segments Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19. 14/08/2021
  • 99.
    VideoBERT: a jointmodel for video and language representation learning 48 Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19. • Linguistic representations: • Tokenized texts into WordPieces, similar as BERT. • Visual representations: • S3D features for each segmented video clips. • Tokenized into clusters using hierarchical k-means. Pre-training 14/08/2021
  • 100.
    VideoBERT: a jointmodel for video and language representation learning 49 Pre-training Down-stream tasks Sun, Chen, et al. "Videobert: A joint model for video and language representation learning.“ ICCV’19. Video captioning Video question answering Zero-shot action classification 14/08/2021
  • 101.
    CLIPBERT: video languagepre-training with sparse sampling 50 Lei, Jie, et al. "Less is more: Clipbert for video-and-language learning via sparse sampling." CVPR’21. ClipBERT Prev. methods ClipBERT overview Procedure: • Pretraining on large-scale image-text datasets. • Finetuning on video-text tasks. 14/08/2021
  • 102.
    From short-form VideoQA to Movie QA 51 Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18. Long-term temporal relationships Multimodal inputs 14/08/2021
  • 103.
    Conventional methods forMovie QA 52 Question-driven multi-stream models: • Short-term temporal relationships are less important. • Long-term temporal relationships and multimodal interactions are key. • Language is dominant over visual counterpart. Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering.“ IJCV’21. Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." EMNLP’18. 14/08/2021
  • 104.
    HERO: large-scale pre-trainingfor Movie QA 53 Li, Linjie, et al. "Hero: Hierarchical encoder for video+ language omni-representation pre-training." EMNLP’20. • Pre-trained on 7.6M videos and associated subtitles. • Achieved state-of- the-art results on all datasets. 14/08/2021
  • 105.
    End of partB 14/08/2021 54 https://bit.ly/37DYQn7
  • 106.
    From Deep Learningto Deep Reasoning 14/08/2021 1 Tutorial at KDD, August 14th 2021 Truyen Tran, Vuong Le, Hung Le and Thao Le {truyen.tran,vuong.le,thai.le,thao.le}@deakin.edu.au https://bit.ly/37DYQn7 Part C: Memory | Data efficiency | Recursive reasoning
  • 107.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 2
  • 108.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels: • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 3
  • 109.
  • 110.
    Memory is partof intelligence • Memory is the ability to store, retain and recall information • Brain memory stores items, events and high- level structures • Computer memory stores data and temporary variables 5
  • 111.
Memory-reasoning analogy 6 • 2 processes: fast-slow o Memory: familiarity-recollection • Cognitive test: o Corresponding reasoning and memorization performance o Increasing the number of premises affects inductive/deductive reasoning Heit, Evan, and Brett K. Hayes. "Predicting reasoning from memory." Journal of Experimental Psychology: General 140, no. 1 (2011): 76.
  • 112.
    Common memory activities •Encode: write information to the memory, often requiring compression capability • Retain: keep the information overtime. This is often assumed in machinery memory • Retrieve: read information from the memory to solve the task at hand Encode Retain Retrieve 7
  • 113.
    Memory taxonomy basedon memory content 8 Item Memory • Objects, events, items, variables, entities Relational Memory • Relationships, structures, graphs Program Memory • Programs, functions, procedures, how-to knowledge
  • 114.
    Item memory Associative memory RAM-likememory Independent memory 9
  • 115.
Distributed item memory as associative memory 10 (Figure: everyday recall examples – language: ""Green" means "go," but what does "red" mean?"; time: "birthday party on 30th Jan"; object: "Where is my pen?"; behaviour: "What is the password?" – spanning semantic, episodic, working and motor memory.) 10
  • 116.
Associative memory can be implemented as a Hopfield network 11 (Figure: a correlation matrix memory encodes items and retrieves them in a single feed-forward pass; a Hopfield network retrieves recurrently. The memory matrix M acts as a "fast weight".)
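An illustrative numpy example of the two retrieval schemes: a correlation-matrix (outer-product) memory read in one feed-forward pass versus recurrent, Hopfield-style clean-up of a corrupted cue. The pattern dimension and corruption level are arbitrary, and the construction is generic rather than the slide's exact one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 64, 3
items = np.sign(rng.normal(size=(n_items, d)))     # bipolar patterns to store

# Encode: sum of outer products -- the "fast weight" matrix M.
M = sum(np.outer(p, p) for p in items) / d

# Retrieve from a corrupted cue.
cue = items[1].copy()
cue[: d // 4] *= -1                                 # flip 25% of the bits

x = cue
for _ in range(5):                                  # recurrent (Hopfield-style) retrieval
    x = np.sign(M @ x)

print("feed-forward overlap:", float(items[1] @ np.sign(M @ cue)) / d)
print("recurrent overlap:   ", float(items[1] @ x) / d)
```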
  • 117.
Rule-based reasoning with associative memory • Encode a set of rules: "pre-conditions → post-conditions" • Supports variable binding, rule-conflict handling and partial rule input • Example of encoding the rule "A:1, B:3, C:4 → X" 12 Outer product for binding Austin, Jim. "Distributed associative memories for high-speed symbolic reasoning." Fuzzy Sets and Systems 82, no. 2 (1996): 223-233.
  • 118.
    Memory-augmented neural networks: computation-storageseparation 13 RNN Symposium 2016: Alex Graves - Differentiable Neural Computer RAM
  • 119.
Neural Turing Machine (NTM) • Memory is a 2D matrix • Controller is a neural network • The controller reads/writes to memory at certain addresses. • Trained end-to-end, differentiable • Simulates a Turing machine → supports symbolic reasoning and algorithm solving 14 Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).
  • 120.
Addressing mechanism in NTM (Figure: given an erase vector e_t and an add vector a_t from the controller, the attended addresses are used for memory writing and memory reading.)
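For reference, the standard write-then-read update from Graves et al. (2014), with attention weights $w_t(i)$ over memory rows $M_t(i)$, erase vector $e_t$ and add vector $a_t$:

$$\tilde{M}_t(i) = M_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\,e_t\big], \qquad M_t(i) = \tilde{M}_t(i) + w_t(i)\,a_t, \qquad r_t = \sum_i w_t(i)\,M_t(i).$$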
  • 121.
  • 122.
    Optimal memory writingfor memorization • Simple finding: writing too often deteriorates memory content (not retainable) • Given input sequence of length T and only D writes, when should we write to the memory? 17 Le, Hung, Truyen Tran, and Svetha Venkatesh. "Learning to Remember More with Less Memorization." In International Conference on Learning Representations. 2018. Uniform writing is optimal for memorization
  • 123.
    Better memorization meansbetter algorithmic reasoning 18 T=50, D=5 Regular Uniform (cached)
  • 124.
Memory of independent entities • Each slot stores one or several entities • Memory writing is done separately for each memory slot → each slot maintains the life of one or more entities • The memory is a set of N parallel RNNs 19 (Figure: example bAbI-style trace in which slots track the changing states of entities such as John and Apple – Office, Kitchen, etc. – over time; RNN 1, RNN 2, …) Weston, Jason, Bordes, Antoine, Chopra, Sumit, and Mikolov, Tomas. Towards ai-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015.
  • 125.
    Recurrent entity network 20 Garden Henaff,Mikael, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. "Tracking the world state with recurrent entity networks." In 5th International Conference on Learning Representations, ICLR 2017. 2017.
  • 126.
    Recurrent Independent Mechanisms 21 Goyal,Anirudh, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. "Recurrent independent mechanisms.“ ICLR21.
  • 127.
  • 128.
  • 129.
Why relational memory? Item memory is weak at recognizing relationships Item memory: • Stores and retrieves individual items • Relates pairs of items within the same time step • Fails to relate temporally distant items 24
  • 130.
Dual process in memory 25 • Item memory: stores items; simple, low-order; System 1. • Relational memory: stores relationships between items; complicated, high-order; System 2. Howard Eichenbaum, Memory, amnesia, and the hippocampal system (MIT press, 1993). Alex Konkel and Neal J Cohen, "Relational memory and the hippocampus: representations and methods", Frontiers in neuroscience 3 (2009).
  • 131.
    Memory as graph •Memory is a static graph with fixed nodes and edges • Relationship is somehow known • Each memory node stores the state of the graph’s node • Write to node via message passing • Read from node via MLP 26 Palm, Rasmus Berg, Ulrich Paquet, and Ole Winther. "Recurrent Relational Networks." In NeurIPS. 2018.
  • 132.
bAbI 27 (Figure: on bAbI, facts 1–3 and the question form graph nodes with edges between them and the answer is read off the graph; on CLEVR, nodes carry colour, shape and position while edges carry distance.)
  • 133.
Memory of graphs: access conditioned on the query • Encode multiple graphs; each graph is stored in a set of memory rows • For each graph, the controller reads/writes to the memory: • Reads use content-based attention • Writes use message passing • Aggregate read vectors from all graphs to create the output 28 Pham, Trang, Truyen Tran, and Svetha Venkatesh. "Relational dynamic memory networks." arXiv preprint arXiv:1808.04247 (2018).
  • 134.
Capturing relationships can be done via memory-slot interactions using attention • Graph memory needs customization to an explicit design of nodes and edges • Can we automatically learn structure with a 2D tensor memory? • Capture relationships: each slot interacts with all other slots (self-attention) 29 Santoro, Adam, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. "Relational recurrent neural networks." In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7310-7321. 2018.
  • 135.
    Relational Memory Core(RMC) operation 30 RNN-like Interface
  • 136.
31 Allowing pair-wise interactions can answer questions about temporal relationships
  • 137.
Dot-product attention works for simple relationships, but … 32 (Figure: scalar attention scores, e.g. 0.7, 0.9, −0.1, 0.4, can answer "What is most similar to me?" but not "What is most similar to me but different from the tiger?") For hard relationships, a scalar representation is limited
  • 138.
Complicated relationships need high-order relational memory 33 Extract items into an item memory, then associate every pair of them into a 3D relational tensor – the relational memory. Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive associative memory." In International Conference on Machine Learning, pp. 5682-5691. PMLR, 2020.
  • 139.
  • 140.
Predefining programs for subtasks • A program designed for a task becomes a module • Parse a question into a module layout (order of program execution) • Learn the weights of each module to master the task 35 Andreas, Jacob, Marcus Rohrbach, Trevor Darrell, and Dan Klein. "Neural module networks." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 39-48. 2016.
  • 141.
Program selection is based on the parser; the rest is trained end-to-end 36 (Figure: 5 module templates, selected by parsing the question.)
  • 142.
The most powerful memory is one that stores both program and data • Computer architectures: Universal Turing Machine / Harvard / von Neumann • Stored-program principle • Break a big task into subtasks, each handled by a TM / single-purpose program stored in a program memory 37 https://en.wikipedia.org/
  • 143.
NUTM: Learn to select a program (neural weights) via program attention • A neural stored-program memory (NSM) stores keys (the addresses) and values (the weights) • The weight is selected and loaded into the controller of the NTM • The stored NTM weights and the weights of the NUTM are learnt end-to-end by backpropagation 38 Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory." In International Conference on Learning Representations. 2019.
  • 144.
Scaling with a memory of mini-programs • Previously, 1 program = 1 neural network (millions of parameters) • Parameter inefficiency, since the programs do not share common parameters • Solution: store sharable mini-programs to compose an infinite number of programs 39 This is analogous to building Lego structures corresponding to inputs from basic Lego bricks.
  • 145.
    Recurrent program attentionto retrieve singular components of a program 40 Le, Hung, and Svetha Venkatesh. "Neurocoder: Learning General-Purpose Computation Using Stored Neural Programs." arXiv preprint arXiv:2009.11443 (2020).
  • 146.
41 Program attention is equivalent to binary decision-tree reasoning. Recurrent program attention automatically detects task boundaries.
  • 147.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels: • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 42
  • 148.
Data Augmentation with Analogical and Counterfactual Examples 43 • Poor generalization when training under the independent and identically distributed (i.i.d.) assumption. • Intuition: augmenting counterfactual samples allows machines to understand the critical changes in the input that lead to changes in the answer space. • Perceptually similar, yet semantically dissimilar, realistic samples Visual counterfactual example Language counterfactual examples Gokhale, Tejas, et al. "Mutant: A training paradigm for out-of-distribution generalization in visual question answering." EMNLP'20.
  • 149.
Question Generation 44 Li, Yikang, et al. "Visual question generation as dual task of visual question answering." CVPR'18. Krishna, Ranjay, Michael Bernstein, and Li Fei-Fei. "Information maximizing visual question generation." CVPR'19. • Question answering is a zero-shot learning problem. Question generation helps cover a wider range of concepts. • Question generation can be done with either supervised or unsupervised learning.
  • 150.
BERT: Transformer That Predicts Its Own Masked Parts 46 BERT is like parallel approximate pseudo-likelihood • ~ Maximizing the conditional likelihood of some variables given the rest. • When the number of variables is large, this converges to the MLE (maximum likelihood estimate). [Slide credit: Truyen Tran] https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
  • 151.
    Visual QA asa Down-stream Task of Visual- Language BERT Pre-trained Models 47 Numerous pre-trained visual language models during 2019-2021. VisualBERT (Li, Liunian Harold, et al., 2019) VL-BERT (Su, Weijie, et al., 2019) UNITER (Chen, Yen-Chun, et al., 2019) 12-in-1 (Lu, Jiasen, et al., 2020) Pixel-BERT (Huang, Zhicheng, et al., 2019) OSCAR (Li, Xiujun, et al., 2020) Single-stream model Two-stream model ViLBERT (Lu, Jiasen, et al. , 2019) LXMERT (Tan, Hao, and Mohit Bansal, 2019) [Slide credit: Licheng Yu et al.]
  • 152.
    Learning with ExternalKnowledge 48 Why external knowledge for reasoning? • Questions can be beyond visual recognition (e.g. firetrucks usually use a fire hydrant). • Human’s prior knowledge for cognition-level reasoning (e.g. human’s goals, intents etc.) Q: What sort of vehicle uses this item? A: firetruck Q: What is the sports position of the man in the orange shirt? A: goalie/goalkeeper Marino, Kenneth, et al. "Ok-vqa: A visual question answering benchmark requiring external knowledge." CVPR’19. Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." CVPR’19.
  • 153.
    Learning with ExternalKnowledge 49 Retrieved by Wikipedia search API Marino, Kenneth, et al. "Ok-vqa: A visual question answering benchmark requiring external knowledge." CVPR’19. Shah, Sanket, et al. "Kvqa: Knowledge-aware visual question answering." AAAI’19.
  • 154.
    Agenda • Reasoning withexternal memories • Memory of entities – memory-augmented neural networks • Memory of relations with tensors and graphs • Memory of programs & neural program construction. • Learning to reason with less labels: • Data augmentation with analogical and counterfactual examples • Question generation • Self-supervised learning for question answering • Learning with external knowledge graphs • Recursive reasoning with neural theory of mind. 50
  • 155.
Source: religious studies project Core AI faculty: Theory of mind
  • 156.
    Where would ToMfit in? System 1: Intuitive System 1: Intuitive System 1: Intuitive • Fast • Implicit/automatic • Pattern recognition • Multiple System 2: Analytical • Slow • Deliberate/rational • Careful analysis • Single, sequential Single Image credit: VectorStock | Wikimedia Perception Theory of mind Recursive reasoning Facts Semantics Events and relations Working space Memory
  • 157.
    Contextualized recursive reasoning •Thus far, QA tasks are straightforward and objective: • Questioner: I will ask about what I don’t know. • Answerer: I will answer what I know. • Real life can be tricky, more subjective: • Questioner: I will ask only questions I think they can answer. • Answerer 1: This is what I think they want from an answer. • Answerer 2: I will answer only what I think they think I can. 14/08/2021 53  We need Theory of Mind to function socially.
  • 158.
Social dilemma: Stag Hunt games • Difficult decision: individual outcomes (selfish) or group outcomes (cooperative). • Together hunt Stag (both are cooperative): both have more meat. • Solely hunt Hare (both are selfish): both have less meat. • One hunts Stag (cooperative), the other hunts Hare (selfish): only the one hunting hare has meat. • Human evidence: self-interested but considerate of others (cultures vary). • Idea: belief-based guilt-aversion • One experiences loss if it lets the other down. • Necessitates Theory of Mind: reasoning about the other's mind.
  • 159.
Theory of Mind Agent with Guilt Aversion (ToMAGA) Update Theory of Mind • Predict whether the other's behaviour is cooperative or uncooperative • Update the zero-order belief (what the other will do) • Update the first-order belief (what the other thinks about me) Guilt Aversion • Compute the expected material reward of the other based on Theory of Mind • Compute the psychological rewards, i.e. "feeling guilty" • Reward shaping: subtract the expected loss of the other. Nguyen, Dung, et al. "Theory of Mind with Guilt Aversion Facilitates Cooperative Reinforcement Learning." Asian Conference on Machine Learning. PMLR, 2020. [Slide credit: Dung Nguyen]
  • 160.
    Machine Theory ofMind Architecture (inside the Observer) Successor representations next-step action probability goal Rabinowitz, Neil, et al. "Machine theory of mind." International conference on machine learning. PMLR, 2018. [Slide credit: Dung Nguyen]
  • 161.
    A ToM architecture • Observermaintains memory of previous episodes of the agent. • It theorizes the “traits” of the agent. • Implemented as Hyper Networks. • Given the current episode, the observer tries to infer goal, intention, action, etc of the agent. • Implemented as memory retrieval through attention mechanisms. 14/08/2021 57
  • 162.
  • 163.
    Wrapping up • Reasoningas the next challenge for deep neural networks • Part A: Learning-to-reason framework • Reasoning as a prediction skill that can be learnt from data • Dynamic neural networks are capable • Combinatorics reasoning • Part B: Reasoning over unstructured and structured data • Reasoning over unstructured sets • Relational reasoning over structured data • Part C: Memory | Data efficiency | Recursive reasoning • Memories of items, relations and programs • Learning with less labels • Theory of mind 14/08/2021 59
  • 164.
    A possible frameworkfor learning and reasoning with deep neural networks System 1: Intuitive System 1: Intuitive System 1: Intuitive • Fast • Implicit/automatic • Pattern recognition • Multiple System 2: Analytical • Slow • Deliberate/rational • Careful analysis • Single, sequential Single Image credit: VectorStock | Wikimedia Perception Theory of mind Recursive reasoning Facts Semantics Events and relations Working space Memory
  • 165.