Deep learning and reasoning:
Recent advances
3/07/2023 1
A/Prof Truyen Tran
Deakin University
@truyenoz
truyentran.github.io
truyen.tran@deakin.edu.au
letdataspeak.blogspot.com
goo.gl/3jJ1O0
RADL Summer School 2023
3/07/2023 2
Cartoonist Zach Weinersmith, Science:
Abridged Beyond the Point of
Usefulness, 2017
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 3
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
2012
2016
Turing Awards 2018
11 years snapshot
Picture taken from Bommasani et al, 2021
Source: @walidsaba
2023
3/07/2023 5
“[By 2023] …
Emergence of the
generally agreed upon
"next big thing" in AI
beyond deep learning.”
Rodney Brooks
rodneybrooks.com
“[…] general-purpose computer
programs, built on top of far richer
primitives than our current
differentiable layers—[…] we will
get to reasoning and abstraction,
the fundamental weakness of
current models.”
Francois Chollet
blog.keras.io
“Software 2.0 is written in
neural network weights”
Andrej Karpathy
medium.com/@karpathy
Why (still) DL in 2023?
Practical
• Generality: Applicable to many
domains.
• Competitive: DL is hard to beat as
long as there are data to train.
• Scalability: DL is better with more
data, and it is very scalable.
Theoretical
Expressiveness: Neural nets
can approximate any function.
Learnability: Neural nets are
trained easily.
Generalisability: Neural nets
generalize surprisingly well to
unseen data.
3/07/2023 7
ICLR 2023
Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 8
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
y = f(x; W)
3/07/2023 9
Machine learning in a nutshell
• Most machine learning tasks reduce to
estimating a mapping f from x to y
• The estimation is more accurate with more
experiences, e.g., seeing more pair (x,y) in
training data.
• The mapping f is often parameterized by W.
• When y is a token/scalar/vector/tensor ->
prediction task.
• When y is a program ->
translation/synthesis task.
• When y is an intermediate form ->
representation learning.
❖ Much of ML is in specifying x,
a.k.a feature engineering.
❖ Much of DL is to specify
skeleton of W, a.k.a
architecture engineering.
❖ Much of LLMs is to specify x
again, but with fixed W, a.k.a
prompt engineering.
1980s: Parallel Distributed Processing
• Information is stored in many places
(distributed)
• Activations are sparse (enabling
selectivity and invariance)
• Factors of variation can be coded
efficiently
• Popular these days: Word & doc
embedding (word2vec, glove,
anything2vec)
Credit: Geoff Hinton
Symbolic vs. Distributed Representations
• Symbolic Representation
• Distributed Representation
(Figure: example entities (Megan_Rapinoe, Ian_McKellen, Play, Game) shown first as discrete symbols, then as distributed vector embeddings.)
Slide credit: Pacheco & Goldwasser, 2021
Deep models via layer stacking
Theoretically powerful, but limited in practice
Integrate-and-fire neuron
andreykurenkov.com
Feature detector
Block representation
3/07/2023 12
http://torch.ch/blog/2016/02/04/resnets.html
Practice
Shorten path length with skip-connections
Easier information and gradient flows
3/07/2023 13
http://qiita.com/supersaiakujin/items/935bbc9610d0f87607e8
Theory
Sequence model with recurrence
Assume the stationary world
Classification
Image captioning
Sentence classification
Neural machine translation
Sequence labelling
Source: http://karpathy.github.io/assets/rnn/diags.jpeg
3/07/2023 14
Spatial model with convolutions
Assume filters/motifs are translation
invariant
http://colah.github.io/posts/2015-09-NN-Types-FP/
Learnable kernels
andreykurenkov.com
Feature detector,
often many
Convolutional networks
Summarizing filter responses, destroying
locations
adeshpande3.github.io
3/07/2023 16
Operator on sets/bags: Attentions
Not everything is created equal for a goal
• Need attention model to select or
ignore certain computations or inputs
• Can be “soft” (differentiable) or “hard”
(requires RL)
• Attention provides a short-cut → long-
term dependencies
• Also encourages sparsity if done right!
http://distill.pub/2016/augmented-rnns/
Why attention?
• Visual attention in human: Focus on specific
parts of visual inputs to compute the
adequate responses.
• Examples:
• We focus on objects rather than the background
of an image.
• We skim text by looking at important words.
• In neural computation, we need to select
the most relevant pieces of information and
ignore the rest (a minimal attention sketch follows below)
Slide credit: Trang Pham
Photo: programmersought
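Below is a minimal numpy sketch of soft (differentiable) attention, i.e., scaled dot-product attention over a set of inputs; the shapes and the toy data are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """Soft attention: weight each value by how well its key matches the query."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)        # (n,) similarity of the query to each key
    weights = softmax(scores)                 # (n,) differentiable selection
    return weights @ values, weights          # weighted sum of values + the weights themselves

# Toy example: 4 input items with 8-dim keys/values, one 8-dim query.
rng = np.random.default_rng(0)
keys, values, query = rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), rng.normal(size=8)
context, w = attend(query, keys, values)
print(w)   # the weights sum to 1; large weights mark the "attended" items
```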
Transformer
Slide credit: Adham Beykikhoshk
• Tokenization
• Token encoding
• Position coding
• Sparsity
• Exploit spatio-
temporal structure
Transformer: Key ideas
• Use self-similarity to refine token’s representation (embedding).
• “June is happy” -> June is represented as a person’s name.
• Hidden contexts are borrowed from other sentences that share
tokens/motifs/patterns, e.g., “She is happy”, “Her name is June”, etc.
• Akin to retrieval: matching query to key.
• Context is simply other tokens co-occurring in the same text segment.
• Related to “co-location”.
• How big is context? → Small window, a sentence, a paragraph, the whole doc.
• What about relative position? → Positional encoding.
3/07/2023 20
Positional Encoding
• The Transformer relaxes the sequentiality of data
• Positional encoding to embed sequential order in model
Slide credit: Adham Beykikhoshk
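A minimal sketch of the sinusoidal positional encoding from the original Transformer paper (Vaswani et al., 2017), added to token embeddings so that order information survives the otherwise permutation-invariant attention; the sizes below are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(num_positions)[:, None]               # (P, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # (P, d/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(num_positions=128, d_model=64)
print(pe.shape)   # (128, 64); this is added element-wise to the token embeddings
```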
Theory: Transformers are (new) Hopfield networks
3/07/2023 22
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint
arXiv:2008.02217 (2020).
Speed up: Vanilla Transformers are not efficient
Slide credit: Hung Le
Speed up: Efficient Transformers
3/07/2023 24
Tay, Yi, et al. "Efficient transformers: A survey." arXiv
preprint arXiv:2009.06732 (2020).
Speed up: Kernelization and associative tricks
Same index,
reusable sum
Reduce
complexity
The idea links back to
Efficient Attention: Attention with Linear Complexities by Shen et al., 2018.
Slide credit: Hung Le
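A sketch of the kernelization trick behind linear attention (in the spirit of Shen et al., 2018 and later work): with a positive feature map φ, softmax(QKᵀ)V is approximated by φ(Q)(φ(K)ᵀV), so the key-value sum is computed once and reused by every query, reducing complexity from O(n²) to O(n). The elu-based feature map is one common choice, used here only for illustration.

```python
import numpy as np

def phi(x):
    """A positive feature map, elu(x) + 1: one common choice in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Approximate attention in O(n): the key/value sums are shared by all queries."""
    Qf, Kf = phi(Q), phi(K)                    # (n, d)
    kv = Kf.T @ V                              # (d, d_v): same index, reusable sum
    z = Kf.sum(axis=0)                         # (d,):     shared normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]       # (n, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
print(linear_attention(Q, K, V).shape)         # (6, 3)
```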
Computation verification
Slide credit: Hung Le
Fast weights | HyperNet
The model world is recursive
• Early ideas in early 1990s by Juergen Schmidhuber and collaborators.
• Data-dependent weights | Using a controller to generate weights of the
main net.
3/07/2023 27
Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
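A minimal sketch of the hypernetwork / fast-weight idea: a small controller maps a task or context embedding to the weights of the main net, so the weights become data-dependent. All sizes and names are illustrative, not from the Ha et al. paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_task = 8, 4, 3

# Controller ("hypernetwork"): maps a task/context embedding to the main net's parameters.
W_hyper = rng.normal(scale=0.1, size=(d_task, d_in * d_out + d_out))

def main_net(x, task_embedding):
    """Main net whose weights are generated on the fly (data-dependent weights)."""
    params = task_embedding @ W_hyper                   # flat parameter vector
    W = params[: d_in * d_out].reshape(d_in, d_out)     # generated weight matrix
    b = params[d_in * d_out:]                           # generated bias
    return np.tanh(x @ W + b)

x = rng.normal(size=d_in)
print(main_net(x, task_embedding=np.array([1.0, 0.0, 0.0])))
print(main_net(x, task_embedding=np.array([0.0, 1.0, 0.0])))   # same input, different weights
```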
Neural networks vs Electronic circuits
• Computational graph → Circuit
• Compositionality → Modular design
• Neuron as feature detector → SENSOR, FILTER
• Multiplicative gates → AND gate, Transistor,
Resistor
• Attention mechanism → SWITCH gate
• Memory + forgetting → Capacitor + leakage
• Skip-connection → Short circuit
3/07/2023 28
Module composition
The system is modular, composable
3/07/2023 29
Source: https://www.ruder.io/modular-deep-learning/
Neural architecture search
When design is cheap and non-creative
• The space is huge and discrete
• Can be done through meta-heuristics (e.g., genetic algorithms) or
Reinforcement learning (e.g., one discrete change in model structure
is an action).
3/07/2023 30
Bello, Irwan, et al. "Neural optimizer search with reinforcement learning." arXiv preprint arXiv:1709.07417 (2017).
Neural networks design goals
•Capture long-term
dependencies in time and
space
•Capture invariances
natively
•Capture equivariance
3/07/2023 31
• Expressivity
• Scalability
• Reusability/modularity
• Compositionality
• Universality
Neural networks design goals (2)
3/07/2023 32
• Easy to train / learnability
• Use (almost) no labels => Unsupervised learning
• Resource adaptive
• Ability to extrapolate => Must go beyond surface statistics
• Support fast and slow learning (Complementary learning)
• Support fast and slow inference (Dual system theory)
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 33
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Graph Structures in real world – Network Science
Internet
Social networks
World wide web
Communication Citations Biological networks
credit: Jure Leskovec
Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
#REF: Penmatsa, Aravind, Kevin H. Wang,
and Eric Gouaux. "X-ray structure of
dopamine transporter elucidates
antidepressant
mechanism." Nature 503.7474 (2013): 85-
90.
Biology, pharmacy &
chemistry, materials
• Molecule/crystal as graph:
atoms as nodes, chemical
bonds as edges
• Computing molecular
properties
• Chemical-chemical
interaction
• Chemical reaction
3/07/2023 35
Gilmer, Justin, et al. "Neural message passing for quantum
chemistry." arXiv preprint arXiv:1704.01212 (2017).
Scene graphs as intermediate representation for image
captioning
Yao et al. Exploring Visual Relationship for Image Captioning, ECCV 2018
Fei-Fei Li, Ranjay Krishna, Danfei Xu
GNN in videos: Space-time region graphs
(Abhinav Gupta et al, ECCV’18)
Transformer is a special type of GNN
3/07/2023 38
Image credit: Chaitanya Joshi
chain-like wiring
patterns
LeNet
AlexNet
VGGNet
The evolution of graph structures in modern
NN design (Unintentional!)
multiple wiring paths
Inception
ResNet
DenseNet
ResNeXt
Credit: Saining Xie
Natural evolution of representing the world
• Vector → Embedding, MLP
• Sequence → RNN (LSTM, GRU)
• Grid → CNN (AlexNet, VGG, ResNet, EfficientNet, etc)
• Set → Word2vec, Attention, Transformer
• Graph → GNN (node2vec, DeepWalk, GCN, Graph Attention Net,
Column Net, MPNN etc)
• ResNet is a special case of GNN on grid!
• Transformer is a special case of GNN on fully connected graph.
3/07/2023 40
• Graphs are pervasive
in many scientific
disciplines.
• The sub-area of graph
representation has
reached a certain
maturity, with
multiple reviews,
workshops and papers
at top AI/ML venues.
3/07/2023 41
GNN in research
Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
Deep Graph Learning: Foundations, Advances and
Applications
Graph Neural Network as a solution
Graph Neural Network
Graph/Node
Representation
Applications
Node
Classification
Link Prediction
Community
Detection
Graph
Generation
………
Neural network model that can deal with graph data.
Yu Rong, Wenbing Huang, Tingyang Xu, Hong Cheng, Junzhou
Huang 2020
Two Main Operations in GNN
43
Graph Filtering
Graph Filtering
Graph filtering refines the node features
Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
Two Main Operations in GNN
44
Graph Pooling
Graph Pooling
Graph pooling generates a smaller graph
Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
General GNN Framework
45
… …
…
𝐵1 𝐵𝑛
Filtering Layer Activation Pooling Layer (Optional)
Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
Generalizing 2D convolutions to Graph Convolutions
- Graph convolutions involve similar local
operations on nodes.
- Nodes are now object representations
and not activations
- The ordering of neighbors should not
matter.
- The number of neighbors should not
matter.
- N(i) are the neighbors of node i
- Attention can be employed for edge
selection
Kipf & Welling (ICLR 2017)
Fei-Fei Li, Ranjay Krishna, Danfei Xu
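A minimal numpy sketch of one graph convolution layer in the style of Kipf & Welling (2017): each node mixes its own and its neighbours' features through a symmetrically normalized adjacency, then applies a shared linear map and nonlinearity, so neither the order nor the number of neighbours matters.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt        # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)          # neighbourhood mixing + shared weights

# Toy graph: 4 nodes with 3-dim features, mapped to 2-dim features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
H, W = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
print(gcn_layer(A, H, W).shape)   # (4, 2)
```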
Generalizing GNNs through message passing
3/07/2023 47
#REF: Pham, Trang, et al. "Column Networks for Collective Classification." AAAI. 2017.
Relation graph
Generalized message passing
Message Passing Neural Net
48
(Figure: a graph with nodes v1…v8, each carrying hidden features h_i and labels l_i, exchanging messages along edges.)
Message Passing
Feature Updating
𝑀𝑘() and 𝑈𝑘() are functions to be designed
Neural Message Passing for Quantum Chemistry. ICML 2017.
Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
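A minimal sketch of one round of generalized message passing (in the spirit of Gilmer et al., 2017). The message function M and the update function U below are deliberately simple illustrative choices, not the specific ones from the paper.

```python
import numpy as np

def message_passing_step(edges, h, M, U):
    """One round: every node aggregates messages from its neighbours, then updates its feature."""
    n, d = h.shape
    messages = np.zeros((n, d))
    for (i, j) in edges:                     # undirected edge list
        messages[i] += M(h[i], h[j])         # message from j to i
        messages[j] += M(h[j], h[i])         # message from i to j
    return np.array([U(h[v], messages[v]) for v in range(n)])

# Illustrative choices: the message is the neighbour's feature, the update is a tanh of the sum.
M = lambda h_self, h_nb: h_nb
U = lambda h_self, m: np.tanh(h_self + m)

edges = [(0, 1), (1, 2), (2, 3)]
h = np.random.default_rng(0).normal(size=(4, 5))
for _ in range(3):                           # stacking rounds propagates information further
    h = message_passing_step(edges, h, M, U)
print(h.shape)   # (4, 5)
```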
Neural graph morphism
• Input: Graph
• Output: A new graph.
Same nodes, different
edges.
• Model: Graph
morphism
• Method: Graph
transformation policy
network (GTPN)
3/07/2023 49
Kien Do, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for
Chemical Reaction Prediction." KDD’19.
Neural graph recurrence
• Graphs that represent interaction between entities through
time
• Spatial edges are node interaction at a time step
• Temporal edges are consistency relationship through time
Challenges
• The addition of temporal edges makes the graphs
bigger and more complex
• Relying on context specific constraints to reduce the
complexity by approximations
• Through time, structures of the graph may change
• Hard to solve, most methods model short sequences to
avoid this
ASSIGN: Asynchronous, Sparse Interaction Graph
Network
(Morais et al, 2021 @ A2I2, Deakin – CVPR’21)
3/07/2023 52
GraphRNN to generate graphs
• A case of graph
dynamics: nodes
and edges are
added
sequentially.
• Solve tractability
using BFS
3/07/2023 53
You, Jiaxuan, et al.
"GraphRNN: Generating
realistic graphs with deep
auto-regressive
models." ICML (2018).
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 54
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Representation learning, a bit of history
•“Representation is the use of signs that stand in
for and take the place of something else”
It has been a goal of neural networks since the 1980s and the current
wave of deep learning (2005-present) → Replacing feature engineering
Between 2006-2012, many unsupervised learning models with varying
degree of success: RBM, DBN, DBM, DAE, DDAE, PSD
Between 2013-2018, most models were supervised, following AlexNet
Since 2018, unsupervised learning has become competitive (with
contrastive learning, self-supervised learning, BERT)!
3/07/2023 55
Criteria for a good representation
• Separates factors of variation (aka disentanglement), which are
linearly correlated with desired outputs of downstream tasks.
• Provides abstraction that is invariant against deformations and
small variations.
• Is distributed (one concept is represented by multiple units), which
is compact and good for interpolation.
• Optionally, offers dimensionality reduction.
• Optionally, is sparse, giving room for emerging symbols.
3/07/2023 56
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new
perspectives." IEEE transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
Why neural unsupervised learning?
• Neural nets have representational richness:
• FFNs are function approximators
• RNNs are program approximators: they can estimate a program's behaviour and generate strings
• CNNs capture translation invariance
• Transformers are powerful contextual encoders
• Compactness: Representations are (sparse and) distributed.
• Essential to perception, compact storage and reasoning
• Accounting for uncertainty: Neural nets can be stochastic to model
distributions
• Symbolic representation: realisation through sparse activations and gating
mechanisms
3/07/2023 57
Generative models:
Discover the underlying process that generates
data
3/07/2023 58
Many applications:
• Text to speech
• Simulate data that are hard to obtain/share in
real life (e.g., healthcare)
• Generate meaningful sentences conditioned on
some input (foreign language, image, video)
• Semi-supervised learning
• Planning
Deep (Denoising) AutoEncoder:
Self-reconstruction of data
3/07/2023 59
Auto-encoder
Feature detector
Representation
Raw data
(optionally
with added
noise)
Reconstruction
Deep Auto-encoder
Encoder
Decoder
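A minimal sketch of a denoising autoencoder's forward pass and loss: corrupt the input, encode it into a compact representation, decode, and reconstruct the clean data. The one-layer encoder/decoder and Gaussian noise are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8                                    # data dimension, bottleneck (representation) dimension
W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))

def denoising_autoencoder_loss(x, noise_std=0.1):
    x_noisy = x + rng.normal(scale=noise_std, size=x.shape)   # corrupt the raw data
    z = np.tanh(x_noisy @ W_enc)                              # representation (feature detector)
    x_hat = z @ W_dec                                         # reconstruction
    return np.mean((x_hat - x) ** 2)                          # reconstruct the CLEAN data

x = rng.normal(size=(32, d))                    # a mini-batch of raw data
print(denoising_autoencoder_loss(x))            # W_enc, W_dec are trained by minimizing this
```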
FSDL 2022
• "Latent Diffusion" model: diffuse in
lower-dimensional latent space, then
decode back into pixel space
• Frozen CLIP ViT-L/14, trained 860M
UNet, 123M text encoder
• Trained on LAION-5B on 256 A100s for
24 days ($600K)
• FULLY OPEN-SOURCE
StableDiffusion
60
Slide credit: Karayev, 2022
Credit: kvfrans.com
Gaussian
hidden
variables
Data
Generative
net
Recognising net
Variational Autoencoder
Approximating the posterior by a neural net
• Two separate processes: generative (hidden → visible) versus
recognition (visible → hidden)
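A minimal numpy sketch of the VAE objective: the recognition net produces the mean and log-variance of the approximate posterior, a latent sample is drawn with the reparameterization trick, and the loss combines reconstruction with a KL term against the Gaussian prior. The linear encoder/decoder are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4
W_mu = rng.normal(scale=0.1, size=(d, k))
W_logvar = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))

def vae_loss(x):
    # Recognition net (visible -> hidden): parameters of q(z|x).
    mu, logvar = x @ W_mu, x @ W_logvar
    # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    # Generative net (hidden -> visible): reconstruct x from z.
    x_hat = z @ W_dec
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=1))
    # KL(q(z|x) || N(0, I)), in closed form for Gaussians.
    kl = np.mean(-0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return recon + kl

print(vae_loss(rng.normal(size=(8, d))))
```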
GAN: Generative Adversarial nets
Matching data statistics
• Instead of modeling the entire distribution of data, learns to
map ANY random distribution into the region of data, so that
there is no discriminator that can distinguish sampled data
from real data.
Any random distribution
in any space
Binary discriminator,
usually a neural
classifier
Neural net that maps
z → x
Generative adversarial networks
(Adapted from Goodfellow’s, NIPS 2014)
3/07/2023 63
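A sketch of the two objectives of the adversarial game (loosely adapted from Goodfellow et al., 2014): the discriminator is trained to separate real from generated samples, while the generator is trained to fool it (the non-saturating form is shown). The toy numbers stand in for discriminator outputs.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Discriminator: push D(x_real) -> 1 and D(G(z)) -> 0."""
    eps = 1e-8
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake):
    """Generator (non-saturating form): push D(G(z)) -> 1, i.e., fool the discriminator."""
    eps = 1e-8
    return -np.mean(np.log(d_fake + eps))

# Placeholder discriminator outputs on a real batch and on a generated batch.
d_real = np.array([0.9, 0.8, 0.95])   # D is fairly sure these are real
d_fake = np.array([0.1, 0.2, 0.05])   # D is fairly sure these are generated
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
# In training, the two losses are minimized alternately w.r.t. D's and G's parameters.
```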
BERT
Transformer that predicts its own masked
parts
• BERT is like parallel
approximate pseudo-
likelihood
• ~ Maximizing the conditional
likelihood of some variables
given the rest.
• When the number of variables is large, this converges to the MLE (maximum likelihood estimate).
3/07/2023 64
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
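A minimal sketch of the masked-token training signal behind BERT: mask a fraction of tokens at random and maximize the conditional likelihood of the masked tokens given the rest. The `dummy_encoder` is a placeholder for any bidirectional encoder; only the masking and loss bookkeeping are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK_ID = 1000, 0

def dummy_encoder(tokens):
    """Placeholder for a bidirectional encoder (e.g., a Transformer): per-position logits."""
    return rng.normal(size=(len(tokens), VOCAB))

def masked_lm_loss(tokens, mask_prob=0.15):
    tokens = np.array(tokens)
    mask = rng.random(len(tokens)) < mask_prob        # pick roughly 15% of positions
    corrupted = np.where(mask, MASK_ID, tokens)        # replace them with [MASK]
    logits = dummy_encoder(corrupted)                  # every position can look BOTH ways
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the masked tokens, conditioned on the rest.
    return -np.mean(log_probs[mask, tokens[mask]]) if mask.any() else 0.0

print(masked_lm_loss(rng.integers(1, VOCAB, size=32)))
```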
Neural
autoregressive
models:
Predict the next
step given the
history
• The keys: (a) long-term dependencies, (b) ordering, & (c)
parameter sharing.
• Can be realized using:
• RNN
• CNN: One-sided CNN, dilated CNN (e.g., WaveNet), PixelCNN
• Transformers → GPT-X family
• Masked autoencoder → MADE
• Pros: General, good quality thus far
• Cons: Slow – needs better inductive biases for scalability
3/07/2023 65
lyusungwon.github.io/studies/2018/07/25/nade/
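A minimal sketch of the autoregressive factorization p(x) = Π_t p(x_t | x_<t): each step predicts the next token from the history only, with the same parameters shared across steps. The random `next_token_probs` is a deliberately trivial stand-in for an RNN, one-sided CNN or Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def next_token_probs(history):
    """Stand-in for an RNN/CNN/Transformer: returns a normalized p(x_t | x_<t)."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def autoregressive_nll(sequence):
    """Negative log-likelihood under the chain rule: sum_t -log p(x_t | x_<t)."""
    nll = 0.0
    for t in range(1, len(sequence)):
        p = next_token_probs(sequence[:t])   # condition only on the past: ordering matters
        nll -= np.log(p[sequence[t]])
    return nll

print(autoregressive_nll(rng.integers(0, VOCAB, size=20)))
```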
FSDL 2022
• Generative Pre-trained Transformer
• Decoder-only (uses masked self-attention)
• Trained on 8M web pages, largest model is 1.5B
GPT / GPT-2 (2019)
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
66
Slide credit: Karayev, 2022
Contrastive
learning:
Comparing
samples
3/07/2023 67
Le-Khac, Phuc H., Graham Healy, and
Alan F. Smeaton. "Contrastive
Representation Learning: A Framework
and Review." arXiv preprint
arXiv:2010.05113 (2020).
• 400M image-text pairs
crawled from the Internet
• Transformer to encode
text, ResNet or Visual
Transformer to encode
image
• Contrastive training:
maximize cosine similarity
of correct image-text pairs
(32K pairs per batch)
79
CLIP: Image-pair vs the rest
https://arxiv.org/pdf/2103.00020.pdf
Slide credit: Karayev, 2022
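A minimal sketch of the symmetric contrastive (InfoNCE-style) objective used by CLIP: within a batch, each image embedding should be most similar to its own caption embedding, and vice versa. The random embeddings are placeholders for the outputs of the image and text encoders.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; the diagonal pairs are the targets."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)   # unit-normalize
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                               # (B, B) similarity matrix
    targets = np.arange(len(img))                                    # image i matches text i
    loss_i2t = -np.mean(log_softmax(logits)[targets, targets])
    loss_t2i = -np.mean(log_softmax(logits.T)[targets, targets])
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
print(clip_contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```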
Unsupervised
learning: A few
more points
• No external labels, but rich training signals (thousands of bits per sample, as opposed to a
few bits in supervised learning). A few techniques:
• Compressing data as much as possible with little loss
• Energy-based, i.e., pull down the energy of observed data, pull up everywhere else
• Filling the missing slots (aka predictive learning, self-supervised learning)
• We have not covered unsupervised learning on graphs (e.g., DeepWalk, GPT-GNN), but
the general principles should hold.
• Question: Multiple objectives, or no objective at all?
• Question: Emergence from many simple interacting elements?
3/07/2023 69
Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." arXiv preprint arXiv:2006.08218 (2020).
Assran, Mahmoud, et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive
Architecture." arXiv preprint arXiv:2301.08243 (2023).
Picture taken from (Bommasani et al, 2021)
A Tipping Point: Foundation Models
70
• A foundation model is a model trained at broad scale that can be adapted to a wide range of downstream tasks
• Scale and the ability to
perform tasks beyond
training
Slide credit: Samuel Albanie, 2022
Slide credit: Chris Ré, Stanford, 2022
word2vec
2013
Two key ideas underpin foundation models
Emergence
•system behaviour is implicitly induced rather than explicitly constructed
•cause of scientific excitement and anxiety of unanticipated consequences
Homogenisation
•consolidation of methodology for building machine learning system across many applications
•provides strong leverage for many tasks, but also creates single points of failure
Slide credit: Samuel Albanie, 2022
Homogenisation
Learning instead of algorithm: Many applications can be powered by the
same learning algorithm.
• => Feature engineering
Deep architecture engineering: Instead of hand-crafting features, the same
architecture could be used widely.
• => Architecture engineering
Modern Transformer is universal: Same architecture, just different data!
• => Data & Prompt engineering
Slide credit: Samuel Albanie, 2022
3/07/2023 74
(Figure: the Deepr pipeline: (1) a medical record as visits/admissions separated by time gaps, converted into phrases with time-gap/transfer tokens; (2) word-vector embedding; (3) convolution for motif detection; (4) max-pooling into a record vector; (5) prediction at the prediction point.)
Homogenisation-Deepr
Nguyen, Phuoc, Truyen Tran,
Nilmini Wickramasinghe, and
Svetha Venkatesh. Deepr: a
convolutional net for medical
records." IEEE journal of
biomedical and health
informatics 21, no. 1 (2016): 22-30.
Concept: Stringify() – everything as a string
3/07/2023 75
Credit: AvePoint
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 76
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
1960s-1990s
▪ Hand-crafting rules, domain-
specific, logic-based
▪ High in reasoning
▪ Can’t scale.
▪ Fail on unseen cases.
3/07/2023
77
2020s-2030s
 Learning + reasoning, general
purpose, human-like
 Has contextual and common-
sense reasoning
 Requires less data
 Adapt to change
 Explainable
1990s-2020s
 Machine learning, general
purpose, statistics-based
 Low in reasoning
 Needs lots of data
 Less adaptive
 Little explanation
Photo credit: DARPA
From ML to Machine Reasoning
3/07/2023 78
(Figure: object detection of coloured shapes (cylinders, cubes, spheres in cyan, brown, orange, red) followed by a reasoning stage.)
Slide credit: Tin Pham
What is missing in deep
learning?
• Modern neural networks are good at
interpolating
→ Data hungry to cover all variations and smooth
local manifolds
→Little systematic generalization (novel
combinations)
• Lack of human-perceived reasoning capability
• Lack of logical inference
• Lack of natural mechanism to incorporate prior
knowledge, e.g., common sense
• No built-in causal mechanisms
3/07/2023 79
Machine reasoning
Reasoning is concerned with arriving at a deduction
about a new combination of circumstances.
Reasoning is to deduce new knowledge from
previously acquired knowledge in response to a
query.
3/07/2023 80
Leslie Valiant
Leon Bottou
Machine reasoning
• Two-part process
• manipulate previously acquired knowledge
• to draw novel inferences or answer new questions
• Example:
• Premise:
• A is to the left of B
• B is to the left of C
• D is in front of A
• E is in front of C
• Conclusion: what is the relation between D and E?
3/07/2023 81
Slide credit: Tin Pham
Geometry example
3/07/2023 82
Premise
• AM = MN (1)
• BM = MC (2)
• ∠AMB = ∠NMC (3)
Solution:
From (1), (2), (3)
➔△AMB = △NMC (4)
➔AB = CN
From (1), (2) ➔ ABNC is
a parallelogram (5)
→ AB // CN
Existing
knowledge
Conclusion
• AB = CN?
• AB // CN?
Slide credit: Tin Pham
Is reasoning always formal/logical?
3/07/2023 83
Bottou, Léon. "From machine learning to machine
reasoning." Machine learning 94.2 (2014): 133-149.
Leon Bottou
• “When we observe a visual scene, when we hear a complex
sentence, we are able to explain in formal terms the
relation of the objects in the scene, or the precise meaning
of the sentence components.
• However, there is no evidence that such a formal analysis
necessarily takes place: we see a scene, we hear a
sentence, and we just know what they mean.
• This suggests the existence of a middle layer, already a
form of reasoning, but not yet formal or logical.”
Why not just neural reasoning?
Central to reasoning is composition rules to guide the combinations of modules to
address new tasks
Bottou:
• Reasoning is not necessarily achieved by making logical inferences
• There is a continuity between [algebraically rich inference] and [connecting
together trainable learning systems]
→Neural networks are a plausible candidate!
→But still not natural to represent abstract discrete concepts and relations.
Hinton/Bengio/LeCun: Neural networks can do everything!
The rest: Not so fast! => Neurosymbolic systems!
3/07/2023 84
Bottou, Léon. "From machine learning to machine
reasoning." Machine learning 94.2 (2014): 133-149.
Learning to reason
• Learning is to improve itself by experiencing ~ acquiring
knowledge & skills
• Reasoning is to deduce knowledge from previously acquired
knowledge in response to a query (or cues)
• Learning to reason is to improve the ability to decide if a
knowledge base entails a predicate.
• E.g., given a video f, determine whether the person with the hat turns
before singing.
• Hypotheses:
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
• It minimises an energy function, or maximises the compatibility
between input (prompt) and output.
3/07/2023 85
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM
(JACM) 44.5 (1997): 697-725.
(Dan Roth; ACM
Fellow; IJCAI John
McCarthy Award)
Reasoning as a skill
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
3/07/2023 86
Practical setting:
(query,database,answer) triplets
• Classification: Query = what is this? Database = data.
• Regression: Query = how much? Database = data.
• QA: Query = NLP question. Database = context/image/text.
• Multi-task learning: Query = task ID. Database = data.
• Zero-shot learning: Query = task description. Database = data.
• Drug-protein binding: Query = drug. Database = protein.
• Recommender system: Query = User (or item). Database = inventories (or user
base);
3/07/2023 87
The two approaches to neural reasoning
• Implicit chaining of predicates through recurrence:
• Step-wise query-specific attention to relevant concepts & relations.
• Iterative concept refinement & combination, e.g., through a working memory.
• Answer is computed from the last memory state & question embedding.
• Explicit program synthesis:
• There is a set of modules, each performing a pre-defined operation.
• The question is parsed into a symbolic program.
• The program is implemented as a computational graph constructed by chaining
separate modules.
• The program is executed to compute an answer.
3/07/2023 88
MACNet: Composition-
Attention-Control
(reasoning by progressive
refinement of selected data)
3/07/2023 89
Hudson, Drew A., and Christopher D. Manning.
"Compositional attention networks for machine
reasoning." arXiv preprint arXiv:1803.03067 (2018).
LOGNet: Relational object reasoning with language
binding
90
• Key insight: Reasoning is chaining of relational predicates to arrive
at a final conclusion
→ Needs to uncover spatial relations, conditioned on query
→ Chaining is query-driven
→ Objects/language needs binding
→ Object semantics is query-dependent
→ Everything is end-to-end differentiable
Thao Minh Le, Vuong Le, Svetha Venkatesh, and
Truyen Tran, “Dynamic Language Binding in
Relational Visual Reasoning”, IJCAI’20.
91
LOGNet for VQA
Thao Minh Le, Vuong Le,
Svetha Venkatesh, and
Truyen Tran, “Dynamic
Language Binding in
Relational Visual
Reasoning”, IJCAI’20.
Visual QA in action
What is about Transformer?
• Reasoning as (free-) energy minimisation
• The classic Belief Propagation algorithm is minimization algorithm
of the Bethe free-energy!
• Transformer performs relational, iterative state refinement, which makes
it a great candidate for implicit relational reasoning.
3/07/2023 93
Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free
energy." Advances in neural information processing systems. 2003.
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint
arXiv:2008.02217 (2020).
3/07/2023 94
http://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/
On SQuAD, Answer = start/end positions
Module networks
(reasoning by constructing and executing neural programs)
• Reasoning as laying out
modules to reach an
answer
• Composable neural
architecture → question
parsed as program (layout
of modules)
• A module is a function (x
→ y), could be a sub-
reasoning process ((x, q)
→ y).
3/07/2023 95
https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
Program execution
• Work on object-based visual
representation
• An intermediate set of objects is represented by a vector, i.e., an attention
mask over all objects in the scene. For example, Filter(Green_cube) outputs a
mask (0,1,0,0).
• The output mask is fed into the next module (e.g., Relate).
96
Source: @rao2z
What is about reasoning in LLMs?
• LLMs have HUGE associative memory.
• With “Let’s think step-by-step”?
• With “Chain of Thought”?
• Or is it just pattern recognition over chains of
reasoning?
• Finding short-cuts to approximate provably
correct reasoning procedure.
• => Very poor OOD generalisation.
3/07/2023 97
A general framework
3/07/2023 98
Explicit Knowledge Graphs
+
Large Language Models
(implicit common sense knowledge,
associative database)
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 99
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
3/07/2023 100
Learning a Turing
machine
→ Can we learn a (neural)
program that learns to
program from data?
Memory networks • Input is a set → Load into memory,
which is NOT updated.
• State is a RNN with attention reading
from inputs
• Concepts: Query, key and content +
Content addressing.
• Deep models, but constant path length
from input to output.
• Equivalent to a RNN with shared input
set.
• => Seq2seq with attention is a Memory
Network (Memory = input seq).
• => Transformer is a kind of Memory
Network with Parallel Memory Update!
3/07/2023 101
Sukhbaatar, Sainbayar, Jason Weston, and Rob
Fergus. "End-to-end memory networks." Advances in
neural information processing systems. 2015.
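A minimal sketch of content-based addressing in an end-to-end memory network (in the spirit of Sukhbaatar et al., 2015): the memory is loaded once from the input set and never updated, while the controller state attends over it for several hops. Sizes and the state-update rule are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(query, mem_keys, mem_values):
    """Content addressing: match the query against the keys, read a blend of the contents."""
    weights = softmax(mem_keys @ query)          # (n_slots,)
    return weights @ mem_values                  # (d,)

rng = np.random.default_rng(0)
mem_keys = rng.normal(size=(10, 16))             # memory loaded from the input set ...
mem_values = rng.normal(size=(10, 16))           # ... and NOT updated afterwards
state = rng.normal(size=16)                      # controller / query state

for hop in range(3):                             # multiple hops = multiple reasoning steps
    read = memory_read(state, mem_keys, mem_values)
    state = np.tanh(state + read)                # illustrative state update
print(state.shape)
```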
MANN: Memory-Augmented Neural Networks
(a constant path length)
• Long-term dependency
• E.g., outcome depends on the far past
• Memory is needed (e.g., as in LSTM)
• => This is what makes Transformers powerful!
• Complex program requires multiple computational steps
• Each step can be selective (attentive) to certain memory cell
• Operations: Encoding | Decoding | Retrieval
MANN: Neural Turing machine (NTM)
(simulating a differentiable Turing machine)
• A controller that takes
input/output and talks to an
external memory module.
• Memory has read/write
operations.
• The main issue is where to write,
and how to update the memory
state.
• All operations are differentiable.
Source: rylanschaeffer.github.io
3/07/2023 104
NTM unrolled in time with LSTM as controller
#Ref: https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315
MANN for reasoning
• Three steps:
• Store data into memory
• Read query, process sequentially, consult memory
• Output answer
• Behind the scene:
• Memory contains data & results of intermediate steps
• Drawbacks of current MANNs:
• No memory of controllers → Less modularity and
compositionality when query is complex
• No memory of relations → Much harder to chain predicates.
3/07/2023 105
Source: rylanschaeffer.github.io
Failures of item-only MANNs for
reasoning
• Relational representation is NOT stored → Can’t reuse later in the
chain
• A single memory of items and relations → Can’t understand how
relational reasoning occurs
• The memory-memory relationship is coarse since it is represented as
either dot product, or weighted sum.
3/07/2023 106
Self-attentive associative memories (SAM)
Learning relations automatically over time
3/07/2023 107
Hung Le, Truyen Tran, Svetha Venkatesh, “Self-
attentive associative memory”, ICML'20.
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 108
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Neural nets are
powerful but we still
want:
• Learning with less data and zero-shot
learning;
• Generalization of the solutions to
unseen tasks and unforeseen data
distributions;
• Explainability by construction;
3/07/2023 109
https://ibm.github.io/neuro-symbolic-ai/events/ns-
workshop2023
Self-Aware Learning
• Deeper learning for challenging tasks
• Integrating continuous and symbolic
representations
• Diversified learning modalities
Credit: Yolanda Gil, Bart Selman
AI to Understand Human
Intelligence
• 5 years: AI systems could be designed to
study psychological models of complex
intelligent phenomena that are based on
combinations of symbolic processing and
artificial neural networks.
Symbolic forms
• Words in Wordnet
• Syntax in NLP & Code
• Logic, propositional and first-order
• Variables, equations
• Knowledge structure: Semantic nets, knowledge graphs
• Graphical models: Bayesian networks, Markov random fields, Markov
logic networks.
• Function (names), indirection, pointer in C/C++.
3/07/2023 110
Henry Kautz's taxonomy (1)
• Symbolic Neural symbolic—is the current approach of many neural models in
natural language processing, where words or subword tokens are both the
ultimate input and output of large language models. Examples include BERT,
RoBERTa, and GPT-3.
3/07/2023 111
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI
Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Representing Context and Structure
Known as contextualized language models
10
Devlin et-al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” 2019
Slide credit: Pacheco & Goldwasser, 2021
What does BERT learn?
Emergent linguistic structure in artificial neural networks trained by self-supervision. PNAS Manning et-al, 2020
Linguistic structure emerges without direct supervision
Slide credit: Pacheco & Goldwasser, 2021
Using BERT for Reasoning Tasks
• BERT-based near-human performance on Winograd Schema
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale. Sakaguchi et-al,
AAAI’20
Can “thinking-slow” tasks be accomplished with “thinking-fast” systems?
Not a panacea (McCoy et al ACL’19, others), often relies on simple heuristics when
learning complex decisions
12
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. McCoy et-al, ACL’19
World Knowledge and
Commonsense inferences
reflected in coref
decisions
Slide credit: Pacheco & Goldwasser, 2021
Henry Kautz's taxonomy (2)
• Symbolic[Neural]—is exemplified by
AlphaGo, where symbolic techniques are
used to call neural techniques. In this case,
the symbolic approach is Monte Carlo tree
search and the neural techniques learn
how to evaluate game positions.
3/07/2023 115
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Henry Kautz's taxonomy (3)
• Neural | Symbolic—uses a neural architecture to interpret perceptual data as
symbols and relationships that are reasoned about symbolically. The Neural-
Concept Learner is an example.
3/07/2023 116
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
End-to-End Module Networks
• Construct the program internally
• The two parts are jointly learnable
3/07/2023 End-to-End Module Networks, Hu et al., ICCV'17 117
Slide credit: Vuong Le
Henry Kautz's taxonomy (4)
• Neural: Symbolic → Neural—relies on symbolic reasoning to generate or label
training data that is subsequently learned by a deep learning model, e.g., to train
a neural model for symbolic computation by using a Macsyma-like symbolic
mathematics system to create or label examples.
3/07/2023 118
Kautz, H., 2022. The third AI summer: AAAI
Robert S. Engelmore memorial lecture. AI
Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Lample, Guillaume, and François Charton. 2020.
“Deep Learning For Symbolic Mathematics.”
In Proceedings of the International Conference on
Learning Representations.
Henry Kautz's taxonomy (5)
• Neural_{Symbolic}—uses a
neural net that is generated
from symbolic rules. An
example is the Neural
Theorem Prover, which
constructs a neural network
from an AND-OR proof tree
generated from knowledge
base rules and terms. Logic
Tensor Networks also fall
into this category.
3/07/2023 119
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Henry Kautz's taxonomy (6)
• Neural[Symbolic]—allows a
neural model to directly call a
symbolic reasoning engine, e.g.,
to perform an action or evaluate
a state. An example would be
ChatGPT using a plugin to query
Wolfram Alpha.
3/07/2023 120
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
LLMs for
calling tools
• Information retriever
• Symbolic/math module & code interpreters
• Virtual agents
• Robotic arms. See https://palm-e.github.io/
3/07/2023 121
Credit: Khattab et al
Symbols via Indirection
3/07/2023 122
Z = X + Y (values bound: Z=3, X=1, Y=2)
Bind symbols with values
Pointer in Computer Science
Information binding in the brain
https://www.linkedin.com/pulse/unsolved-problems-ai-part-2-binding-problem-eberhard-schoeneburg/
Indirection binds two objects together and uses one to refer to the other.
Slide credit: Kha Pham
Indirection is a key design principle in
software engineering
3/07/2023 123
Client
Indirectional
Layer
Target
https://medium.com/@nmckinnonblog/indirection-fba1857630e2
Indirection removes direct coupling
between units and promotes:
• Extensibility
• Control
• Evolvability
• Encapsulation of code and design
complexity
Every computer science
problem can be solved with a
higher level of indirection.
Andrew Koenig, Butler Lampson, David J. Wheeler
Slide credit: Kha Pham
Leveraging indirection to improve OOD
generalization
3/07/2023 124
Why
indirection?
Indirection binds concrete data to abstract symbols, and
reasoning on symbols is likely to improve generalization.
What
to bind?
Concrete information of data, e.g., representations,
functional relations between data, etc.
Functional
indirection
Structural
indirection
How
to bind?
During indirection, some concrete information of
data will be ignored, and thus we have to decide
what to maintain, i.e., invariances across data.
→ Indirection connects invariance and symbolic
approaches.
Slide credit: Kha Pham
Structural Indirection: InLay
3/07/2023 125
• InLay simultaneously leverages indirection and data internal relationships to
construct indirection representations, which respect the similarities between
internal relationships.
• InLay connects invariance and symbolic approaches:
• InLay constructs indirection representations from a fixed set of symbolic
vectors.
• InLay assumes two invariances:
• The data internal relationships are invariant through indirection.
• The set of symbolic vectors to compute indirection representations is
invariant across train and test samples.
Slide credit: Kha Pham Pham, K., Le, H., Ngo, M. and Tran, T., Improving Out-of-distribution
Generalization with Indirection Representations. In The Eleventh
International Conference on Learning Representations.
Structure-Mapping Theory (SMT)
3/07/2023 126
• Improves on previous theories of analogy, e.g.,
Tversky's contrast theory, which assumed that an
analogy is stronger the more attributes the base
and target share in common.
• SMT [1] argued that it is not object attributes
which are mapped in an analogy, but relationships
between objects.
(Figure: Rutherford's analogy: the Solar system is literally similar to the X12 star system, but analogous to the hydrogen atom.)
Literal similarity: many attributes mapped, many relations mapped.
Analogy: few attributes mapped, many relations mapped.
[1] Gentner, Dedre. "Structure-mapping: A theoretical framework for analogy." Cognitive science 7.2 (1983): 155-170.
Slide credit: Kha Pham
Structure-Mapping Theory (SMT) (cont.)
3/07/2023 127
Which will be chosen to be mapped in an analogy?
Systematicity Principle: A predicate that belongs to a mappable system of mutually
interconnecting relationships is more likely to be imported into the target than is an isolated
predicate.
Solar system
Distance
Attractive
force
Revolves
around
Color Temperature
Hydrogen atom
Distance
Attractive
force
Revolves
around
Color Temperature
Slide credit: Kha Pham
Model architecture
3/07/2023 128
• Concrete data representation is viewed as a complete graph
with weighted edges.
• The indirection operator maps this graph to a symbolic graph
with the same edge weights; however, the vertices are fixed and
trainable.
• This symbolic graph is propagated, and the updated node
features are the indirection representations.
Slide credit: Kha Pham
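A rough numpy sketch of the mechanism described above, under my reading of the slide (not the authors' code): pairwise relationships are computed from the concrete data, those edge weights are transferred onto a graph whose vertices are fixed, trainable symbolic vectors, and one round of propagation over that symbolic graph yields the indirection representations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 16
symbols = rng.normal(size=(n_tokens, d))      # fixed, trainable symbolic vectors (shared across samples)
W = rng.normal(scale=0.1, size=(d, d))        # propagation weights

def indirection_representation(x):
    """Edge weights come from the concrete data; node features come from the symbols."""
    sim = x @ x.T                                              # pairwise relationships of the data
    sim = sim - sim.max(axis=1, keepdims=True)
    A = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)   # normalized weighted complete graph
    return np.tanh(A @ symbols @ W)                            # propagate over the symbolic graph

x = rng.normal(size=(n_tokens, d))            # concrete token representations
print(indirection_representation(x).shape)    # (6, 16) indirection representations
```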
Experiments on IQ datasets – RAVEN dataset
3/07/2023 129
An IQ problem in RAVEN [1] dataset
Model Accuracy
LSTM 30.1/39.2
Transformers 15.1/42.5
RelationNet 12.5/46.4
PrediNet 13.8/15.6
Average test accuracies (%) without/with InLay in
different OOD testing scenarios on RAVEN
[1] Zhang, Chi, et al. "Raven: A dataset for relational and analogical visual reasoning."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
• The original paper of RAVEN dataset proposes
different OOD testing scenarios, in which models
are trained on one configuration and tested on
another (but related) configuration.
Slide credit: Kha Pham
Experiments on OOD image classification tasks
3/07/2023 130
Dog Dog?
OOD image classification,
in which test images are distorted.
• When test images are injected with different kinds
of distortions other than ones in training, deep
neural networks may fail drastically in image
classification tasks. [1]
[1] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and
Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in neural
information processing systems, 31, 2018.
Dataset ViT accuracy
SVHN 65.9/68.8
CIFAR10 38.2/43.1
CIFAR100 17.1/20.4
Average test accuracies (%) without/with InLay of Vision
Transformers (ViT) on different types of distortions
Slide credit: Kha Pham
Here “physics”
refers to
empirical or
theoretical laws
that exist in
nature.
(Chart: number of papers on physics-informed ML per year, 2015-2022, on a log scale from 1 to 100,000; trend fit R² = 0.989.)
Physics-informed NN
Integrate-and-fire neuron
andreykurenkov.com
Priors that work
• Neuron as trainable feature
detector
• Depth + Skip-connection
• Invariance/equivariance:
• Convolution (Translation)
• Recurrence (Time travel)
• Attention (Permutation)
• Analogy
• Kernel, case-based reasoning,
• Attention, memory
Feature detector
Source: http://karpathy.github.io/assets/rnn/diags.jpeg
Physics invariance
• Newton's laws
• Symmetry
• Conservation laws
• Noether’s Theorem linking symmetry and
conservation.
First page of Emmy Noether's
article "Invariante
Variationsprobleme" (1918).
Source: Wikipedia
ML, data & physics
• Data collection/annotation for ML is expensive
• ML solutions don’t respect symmetries and conservation laws
• Physics laws are universal (up to scale) | ML only generalizes in-distribution.
Karniadakis, George Em, et al. "Physics-informed machine learning." Nature Reviews Physics 3.6 (2021): 422-440.
Embedding physics into ML
https://medium.com/@zhaoshuai1989/why-do-we-need-physics-informed-machine-learning-piml-d11fe0c4436c
Physics guides neural architecture
• Physics-informed neural networks (PINN)
Figure from talk by Perdikaris & Wang, 2020.
Physics guides learning dynamics
• Physics-informed neural networks (PINN)
Figure from talk by Perdikaris & Wang, 2020.
Case study: Damped harmonic oscillation
Source: https://benmoseley.blog/my-research/so-what-is-a-physics-informed-neural-network/
Case study: COVID-19 in VN 2021
• Failed to contain the new exponential growth
due to Delta variant.
• The cost: 20 thousand lives within 3 months!!
• At the peak, the daily mortality ~ Vietnam War’s
rate.
• What worked in 2020 didn’t in 2021.
3/07/2023 139
SIR family for pandemics
• N = Population
• S = Susceptible
• I = Infectious
• R = Recovered
Source: Wikipedia
Basic reproduction number
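A minimal sketch of the standard SIR dynamics, dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI, integrated with simple Euler steps; the parameter values are illustrative (the basic reproduction number is R0 = β/γ).

```python
import numpy as np

def simulate_sir(N, I0, beta, gamma, days, dt=0.1):
    """Euler integration of dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    S, I, R = N - I0, I0, 0.0
    trajectory = []
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N * dt
        new_rec = gamma * I * dt
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        trajectory.append((S, I, R))
    return np.array(trajectory)

# Illustrative parameters: R0 = beta / gamma = 2.5.
traj = simulate_sir(N=1e7, I0=100, beta=0.25, gamma=0.1, days=200)
print("peak number of infectious people:", int(traj[:, 1].max()))
```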
Covid-19 infections
• SIR: Closed-form solutions are hard to calculate
• Parameters change over time due to intervention → Need more flexible
framework.
• Solution: Richards equation → Richards curve | Gompertz curve
• Task: 10-20 data points → Extrapolate 150 more.
Model design
• Remember often we have only 20-30 highly correlated data points to
learn from!
• Model is sum of 2-3 “waves” – each is a 3-param Gompertz curve
• Height of the peak
• Location of the peak
• Scale of the wave (the effective width)
• The number of waves accounts for the observed waves, plus some
hypothetical future waves (see the sketch after this slide).
• Model can be thought as a special neural network, each hidden unit is a
wave, but with Gompertz-based kernel.
3/07/2023 142
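A minimal sketch of the wave model described above: cumulative counts modelled as a sum of two or three Gompertz curves, each with three parameters. The parametrization below uses total wave size, peak location and scale (the peak height of the daily curve then equals size/(e·scale)); all numbers are made up for illustration, and in practice the parameters are fitted to the 20-30 observed points under priors.

```python
import numpy as np

def gompertz(t, size, peak_time, scale):
    """Cumulative Gompertz curve: size * exp(-exp(-(t - peak_time) / scale))."""
    return size * np.exp(-np.exp(-(t - peak_time) / scale))

def multi_wave(t, waves):
    """Sum of Gompertz 'waves': each hidden unit of the model is one wave."""
    return sum(gompertz(t, *w) for w in waves)

# Illustrative parameters for two waves: (total size, peak location in days, scale in days).
waves = [(50_000, 60, 12), (20_000, 120, 15)]
t = np.arange(0, 180)
cumulative = multi_wave(t, waves)
daily = np.diff(cumulative, prepend=0.0)        # daily counts = derivative of the cumulative curve
print(int(cumulative[-1]), int(daily.max()))    # projected total and peak daily count
```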
Estimating the model priors
• Impossible to know without assumptions!
• Need priors on wave size & possibly, the scale (e.g., min-max)
• One solution:
• Look for other countries, with adjustment in population size.
• Hopefully the culture, economic structure & actions are similar.
• It depends on:
• The virus variant (original != Delta != Omicron)
• Health/border capacity (closed border + lockdown in the beginning)
• Vaccination coverage (80% tended to be the threshold for opening)
• Total cases/population.
3/07/2023 143
Case of HCM City
(Chart: estimated vs. recorded Covid-19 deaths in Ho Chi Minh City, July-October 2021. Series: recorded deaths, estimated deaths, cumulative deaths (actual). Prediction made on 11/8; peak 20-21/8; total-cases curve annotated at 16/10.)
Case of Binh Duong province
(Chart: prediction made on 17/8; peak 28-30/8; total-cases curve annotated at 25/10.)
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 146
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
In 2022-2023, DL reached new heights: GPT-4,
PaLM-E, GATO, etc.
3/07/2023 147
Major remaining problems of DL
• Massive associative machine
→ Lacks a causality prior; prone to learning the wrong things, or working for the wrong
reasons.
→Overconfident for the wrong reasons (e.g., prone to adversarial attacks).
→Exploits short-cuts => poor on OOD generalisation
→Sample inefficient
→Approximate reasoning patterns, not from the first principles.
• Inference separated from learning
→No built-in adaptation other than retraining
→Catastrophic forgetting
• Limited theoretical understanding
3/07/2023 148
Are limitations inherent?
• YES, statistical systems tend to memorize data and find short-cuts.
• We need lots of data to cover all possible variations, hence lots of compute.
• But aren’t we great copiers?
• NO, neural nets were founded on the basis of distributed
representation and parallel processing. These are robust, fast and
energy efficient.
• We still need to find “binding” tricks that do all sorts of things without relying
on statistical training signals + backprop.
3/07/2023 149
Dimensions of progress
• Continuation of current works/paths
• Expansion/optimisation
• Industrialisation: Scale up & scale out
• Challenge fundamental assumptions
• DL as part of more holistic solution to Human-Level AI (HLAI)
• Dealing with the unexpected: Uncertainty, safety, security
3/07/2023 150
Continuation
• Enabling techs: Data, compute, network
• Work with noisy quantum computing (which will take time to mature)
• DL fundamentals: Representation, learning & inference
• Rep = data rep + computational graph + symmetry
• Learning as pre-training to extract as much knowledge from data as possible
• Learning as on-the-fly inference (Bayesian, hypernetwork/fast weight)
• Extreme inference = dynamic computational graph on-the-fly.
3/07/2023 151
Continuation (2)
• DL applications
• Data-rich & data-poor
• Cognitive domains (vision, NLP)
• Improve manufacturing
• Accelerate science
3/07/2023 152
Expansion/optimisation
• New inductive biases (for vision, NLP, living things, science, social AI,
ethical AI)
• Cutting the statistical/associative short-cuts
• Shifting from feature space to function space.
• Pushing for high-level analogy (rather than just feature-based
kernel/template matching)
• Binding, indirection, symbols
• Injection of knowledge into models.
3/07/2023 153
Expansion (2)
• Expanding to classical AI areas (planning, reasoning, knowledge
representation, symbol manipulation).
• Needs to solve symbol grounding for that to happen.
• Physics-informed neural networks (e.g., my work in Covid-19
forecasting)
• Social dimensions, human-in-the-loop
3/07/2023 154
Industrialisation: Scaling - success
formula thus far
Data + knowledge + compute + generic scalable algorithms
3/07/2023 155
Scaling - Rich Sutton’s Bitter Lesson (2019)
3/07/2023 156
“The biggest lesson that can be read from 70 years of AI research is
that general methods that leverage computation are ultimately the
most effective, and by a large margin. ”
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
“The two methods that seem to scale arbitrarily in this way
are search and learning.”
DeepMind: Scale (up) is enough
3/07/2023 157
But …
• Scaling is like building a taller ladder to get to the Moon.
• We need rockets and the science of escape velocity.
• The human brain is big (1e+14 synapses) but does exactly the opposite:
it maximizes entropy reduction using minimum energy (think of the
most efficient heat engine).
• Just 20W is enough for human-level intelligence!
• => We must use different principles rather than just (sample-inefficient) statistics!
• No need to take the computer's detour: analog -> digital/sequential -> parallel
analog simulation.
3/07/2023 158
DL is part of Broad AI
3/07/2023 159
Hochreiter, S., 2022.
Toward a broad AI.
Communications of
the ACM, 65(4), pp.56-
57.
DL is part of Integrated Intelligence
LeCun’s plan
3/07/2023 160
https://ai.facebook.com/blog/yann-lecun-advances-in-ai-research/
Knowledge?
Summary
3/07/2023 161
DL “accidental” history
3/07/2023 162
Source: rikochet_band
1950s: Rosenblatt wired the first trainable perceptron, hyping AI up.
1970-1980s: Minsky and Papert almost killed it until Rumelhart et al. worked out high-school
math to train multi-layer perceptron.
1980-1990s: LeCun managed to get CNN work for something real.
1990s: RNN was proved to be Turing-equivalent. Schmidhuber got excited and bombarded the
field with lots of cool ideas.
1990s-2000s: But the models were shallow and hard to train. Almost no one worked on it for 2
decades until the Canadian mafia fought back with new tricks to train deeper models.
2010s: Accidentally, DL took off like a rocket, thanks to gamers.
2020s: Now DL works on everything, except for:
small data, shifted data, noisy data, artificially twisted data, deep stuffs,
exact stuffs, abstract stuffs, causal stuffs, symbolic stuffs, thinking stuffs, and
stuffs that no one knows how they work like consciousness.
2020s: DL believers got rich, and a new bunch of students got over trained.
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 163
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Final words
• Deep neural networks are here to stay, may be as a part of the holistic solution to
human-level AI.
• Gradient-based learning is still without parallel.
• DL will be much more general/universal/versatile (e.g., dynamic architectures,
of which the Transformer is a relaxed approximation)
• Higher cognitive capabilities will be there, may be with symbol manipulation
capacity.
• Better generalization capability (e.g., extreme)
• We have to deal with the consequences of its own success.
• Negative effects; Jevons paradox
• DL is now an industry, and is still going strong. But students may be over-fitted to
particular DL ways of thinking.
• The industry will need to keep the highly trained (overfitted) DL workforce busy!
3/07/2023 164
Second
bitter lesson
Little priors (innateness?) + lots of
experiments > strong priors (theory of
intelligence) + trying to prove it.
=> Chomsky would disagree here.
3/07/2023 165
Source: QuestionPro
3/07/2023 166
Credit: AvePoint
AI/ML as an empirical science
Deakin University
 
Machine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin UniversityMachine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin University
Deakin University
 
AI in the Covid-19 pandemic
AI in the Covid-19 pandemicAI in the Covid-19 pandemic
AI in the Covid-19 pandemic
Deakin University
 
Visual reasoning
Visual reasoningVisual reasoning
Visual reasoning
Deakin University
 
AI for tackling climate change
AI for tackling climate changeAI for tackling climate change
AI for tackling climate change
Deakin University
 
AI for drug discovery
AI for drug discoveryAI for drug discovery
AI for drug discovery
Deakin University
 
Deep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains IDeep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains I
Deakin University
 
Deep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains IIDeep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains II
Deakin University
 
Deep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains IIIDeep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains III
Deakin University
 
Deep learning for episodic interventional data
Deep learning for episodic interventional dataDeep learning for episodic interventional data
Deep learning for episodic interventional data
Deakin University
 
Deep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilitiesDeep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilities
Deakin University
 
Deep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining IDeep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining I
Deakin University
 
Deep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining IIDeep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining II
Deakin University
 
AI that/for matters
AI that/for mattersAI that/for matters
AI that/for matters
Deakin University
 
Representation learning on graphs
Representation learning on graphsRepresentation learning on graphs
Representation learning on graphs
Deakin University
 
Empirical AI Research
Empirical AI Research Empirical AI Research
Empirical AI Research
Deakin University
 
Deep learning for genomics: Present and future
Deep learning for genomics: Present and futureDeep learning for genomics: Present and future
Deep learning for genomics: Present and future
Deakin University
 

More from Deakin University (20)

Machine Learning and Reasoning for Drug Discovery
Machine Learning and Reasoning for Drug DiscoveryMachine Learning and Reasoning for Drug Discovery
Machine Learning and Reasoning for Drug Discovery
 
Deep learning 1.0 and Beyond, Part 2
Deep learning 1.0 and Beyond, Part 2Deep learning 1.0 and Beyond, Part 2
Deep learning 1.0 and Beyond, Part 2
 
Machine reasoning
Machine reasoningMachine reasoning
Machine reasoning
 
AI/ML as an empirical science
AI/ML as an empirical scienceAI/ML as an empirical science
AI/ML as an empirical science
 
Machine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin UniversityMachine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin University
 
AI in the Covid-19 pandemic
AI in the Covid-19 pandemicAI in the Covid-19 pandemic
AI in the Covid-19 pandemic
 
Visual reasoning
Visual reasoningVisual reasoning
Visual reasoning
 
AI for tackling climate change
AI for tackling climate changeAI for tackling climate change
AI for tackling climate change
 
AI for drug discovery
AI for drug discoveryAI for drug discovery
AI for drug discovery
 
Deep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains IDeep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains I
 
Deep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains IIDeep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains II
 
Deep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains IIIDeep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains III
 
Deep learning for episodic interventional data
Deep learning for episodic interventional dataDeep learning for episodic interventional data
Deep learning for episodic interventional data
 
Deep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilitiesDeep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilities
 
Deep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining IDeep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining I
 
Deep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining IIDeep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining II
 
AI that/for matters
AI that/for mattersAI that/for matters
AI that/for matters
 
Representation learning on graphs
Representation learning on graphsRepresentation learning on graphs
Representation learning on graphs
 
Empirical AI Research
Empirical AI Research Empirical AI Research
Empirical AI Research
 
Deep learning for genomics: Present and future
Deep learning for genomics: Present and futureDeep learning for genomics: Present and future
Deep learning for genomics: Present and future
 

Recently uploaded

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 

Recently uploaded (20)

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 

Deep learning and reasoning: Recent advances

  • 1. Deep learning and reasoning: Recent advances 3/07/2023 1 A/Prof Truyen Tran Deakin University @truyenoz truyentran.github.io truyen.tran@deakin.edu.au letdataspeak.blogspot.com goo.gl/3jJ1O0 RADL Summer School 2023
  • 2. 3/07/2023 2 Cartoonist Zach Weinersmith, Science: Abridged Beyond the Point of Usefulness, 2017
  • 3. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 3 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 4. 2012 2016 Turing Awards 2018 11 years snapshot Picture taken from Bommasani et al, 2021 Source: @walidsaba 2023
  • 5. 3/07/2023 5 “[By 2023] … Emergence of the generally agreed upon "next big thing" in AI beyond deep learning.” Rodney Brooks rodneybrooks.com “[…] general-purpose computer programs, built on top of far richer primitives than our current differentiable layers—[…] we will get to reasoning and abstraction, the fundamental weakness of current models.” Francois Chollet blog.keras.io “Software 2.0 is written in neural network weights” Andrej Karpathy medium.com/@karpathy
  • 6. Why (still) DL in 2023? Practical • Generality: Applicable to many domains. • Competitive: DL is hard to beat as long as there are data to train. • Scalability: DL is better with more data, and it is very scalable. Theoretical Expressiveness: Neural nets can approximate any function. Learnability: Neural nets are trained easily. Generalisability: Neural nets generalize surprisingly well to unseen data.
  • 7. 3/07/2023 7 ICLR 2023 Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
  • 8. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 8 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 9. y = f(x; W) 3/07/2023 9 Machine learning in a nutshell • Most machine learning tasks reduce to estimating a mapping f from x to y • The estimation is more accurate with more experiences, e.g., seeing more pair (x,y) in training data. • The mapping f is often parameterized by W. • When y is a token/scalar/vector/tensor -> prediction task. • When y is a program -> translation/synthesis task. • When y is an intermediate form -> representation learning. ❖ Much of ML is in specifying x, a.k.a feature engineering. ❖ Much of DL is to specify skeleton of W, a.k.a architecture engineering. ❖ Much of LLMs is to specify x again, but with fixed W, a.k.a prompt engineering.
  • 10. 1980s: Parallel Distributed Processing • Information is stored in many places (distributed) • Activations are sparse (enabling selectivity and invariance) • Factors of variation can be coded efficiently • Popular these days: Word & doc embedding (word2vec, glove, anything2vec) Credit: Geoff Hinton
  • 11. Symbolic vs.Distributed Representations • Symbolic Representation • Distributed Representation 6 Megan_Rapinoe Ian_McKellen Play Game Game Play M egan_Rapinoe Ian_McKellen Slide credit: Pacheco & Goldwasser, 2021
  • 12. Deep models via layer stacking Theoretically powerful, but limited in practice Integrate-and-fire neuron andreykurenkov.com Feature detector Block representation 3/07/2023 12
  • 13. http://torch.ch/blog/2016/02/04/resnets.html Practice Shorten path length with skip-connections Easier information and gradient flows 3/07/2023 13 http://qiita.com/supersaiakujin/items/935bbc9610d0f87607e8 Theory
  • 14. Sequence model with recurrence Assume the stationary world Classification Image captioning Sentence classification Neural machine translation Sequence labelling Source: http://karpathy.github.io/assets/rnn/diags.jpeg 3/07/2023 14
  • 15. Spatial model with convolutions Assume filters/motifs are translation invariant http://colah.github.io/posts/2015-09-NN-Types-FP/ Learnable kernels andreykurenkov.com Feature detector, often many
  • 16. Convolutional networks Summarizing filter responses, destroying locations adeshpande3.github.io 3/07/2023 16
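To make the convolution-plus-pooling idea concrete, here is a minimal, illustrative sketch (not from the slides): a single learnable filter bank followed by max-pooling, showing how filter responses are summarised while exact locations are partly discarded. The layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)       # one RGB image (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)  # keep only the strongest response per 2x2 window

feature_maps = torch.relu(conv(x))  # (1, 16, 32, 32): local, translation-equivariant detectors
summary = pool(feature_maps)        # (1, 16, 16, 16): coarser map, precise location is lost
print(summary.shape)
```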
  • 17. Operator on sets/bags: Attentions Not everything is created equal for a goal • Need attention model to select or ignore certain computations or inputs • Can be “soft” (differentiable) or “hard” (requires RL) • Attention provides a short-cut → long- term dependencies • Also encourages sparsity if done right! http://distill.pub/2016/augmented-rnns/
  • 18. Why attention? • Visual attention in humans: focus on specific parts of the visual input to compute adequate responses. • Examples: • We focus on objects rather than the background of an image. • We skim text by looking at important words. • In neural computation, we need to select the most relevant pieces of information and ignore the rest. Slide credit: Trang Pham Photo: programmersought
  • 19. Transformer Slide credit: Adham Beykikhoshk • Tokenization • Token encoding • Position coding • Sparsity • Exploit spatio- temporal structure
  • 20. Transformer: Key ideas • Use self-similarity to refine token’s representation (embedding). • “June is happy” -> June is represented as a person’s name. • Hidden contexts are borrowed from other sentences that share tokens/motifs/patterns, e.g., “She is happy”, “Her name is June”, etc. • Akin to retrieval: matching query to key. • Context is simply other tokens co-occurring in the same text segment. • Related to “co-location”. • How big is context? → Small window, a sentence, a paragraph, the whole doc. • What is about relative position? → Position coding. 3/07/2023 20
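The query-key-value retrieval view of self-attention can be written in a few lines. The following is a hedged sketch of a single attention head over a toy sequence; the projection matrices Wq, Wk, Wv and the token count are illustrative assumptions, not values from the talk.

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values per token
    scores = Q @ K.T / K.shape[-1] ** 0.5      # match every query against every key
    weights = F.softmax(scores, dim=-1)        # how much context each token borrows
    return weights @ V                         # refined (contextual) token representations

d = 8
X = torch.randn(5, d)                          # 5 token embeddings, e.g. "June is happy ..."
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```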
  • 21. Positional Encoding • The Transformer relaxes the sequentiality of data • Positional encoding to embed sequential order in model Slide credit: Adham Beykikhoshk
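One common way to embed sequential order, used by the original Transformer, is the fixed sinusoidal encoding sketched below; the sequence length and model width are arbitrary assumptions.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]                  # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe                                              # added to the token embeddings

print(positional_encoding(50, 16).shape)                   # (50, 16)
```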
  • 22. Theory: Transformers are (new) Hopfield net 3/07/2023 22 Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
  • 23. Speed up: Vanilla Transformers are not efficient Slide credit: Hung Le
  • 24. Speed up: Efficient Transformers 3/07/2023 24 Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020).
  • 25. Speed up: Kernelization and associative tricks. Same index, reusable sum; reduce complexity. The idea links back to Efficient Attention: Attention with Linear Complexities by Shen et al., 2018. Slide credit: Hung Le
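A rough sketch of the associativity trick behind kernelized (linear) attention: with a feature map phi, softmax attention is replaced by phi(Q) @ (phi(K).T @ V), so the N x N score matrix is never materialised. The feature map below (a shifted ReLU) is only a stand-in for the kernels used in the literature.

```python
import numpy as np

def phi(x):
    return np.maximum(x, 0.0) + 1e-6          # a simple positive feature map (assumption)

def linear_attention(Q, K, V):
    Kp, Qp = phi(K), phi(Q)
    context = Kp.T @ V                         # (d, d_v) summary shared by all queries
    norm = Qp @ Kp.sum(axis=0)                 # per-query normaliser ("same index, reusable sum")
    return (Qp @ context) / norm[:, None]      # O(N d d_v) instead of O(N^2 d)

N, d = 1000, 16
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)         # (1000, 16), without a 1000 x 1000 matrix
```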
  • 27. Fast weights | HyperNet The model world is recursive • Early ideas in early 1990s by Juergen Schmidhuber and collaborators. • Data-dependent weights | Using a controller to generate weights of the main net. 3/07/2023 27 Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
  • 28. Neural networks vs Electronic circuits • Computational graph → Circuit • Compositionality → Modular design • Neuron as feature detector → SENSOR, FILTER • Multiplicative gates → AND gate, Transistor, Resistor • Attention mechanism → SWITCH gate • Memory + forgetting → Capacitor + leakage • Skip-connection → Short circuit 3/07/2023 28
  • 29. Module composition The system is modular, composable 3/07/2023 29 Source: https://www.ruder.io/modular-deep-learning/
  • 30. Neural architecture search When design is cheap and non-creative • The space is huge and discrete • Can be done through meta-heuristics (e.g., genetic algorithms) or Reinforcement learning (e.g., one discrete change in model structure is an action). 3/07/2023 30 Bello, Irwan, et al. "Neural optimizer search with reinforcement learning." arXiv preprint arXiv:1709.07417 (2017).
  • 31. Neural networks design goals •Capture long-term dependencies in time and space •Capture invariances natively •Capture equivariance 3/07/2023 31 • Expressivity • Scalability • Reusability/modularity • Compositionality • Universality
  • 32. Neural networks design goals (2) 3/07/2023 32 • Easy to train / learnability • Use (almost) no labels => Unsupervised learning • Resource adaptive • Ability to extrapolate => Must go beyond surface statistics • Support fast and slow learning (Complementary learning) • Support fast and slow inference (Dual system theory)
  • 33. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 33 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 34. Graph Structures in real world – Network Science Internet Social networks World wide web Communication Citations Biological networks credit: Jure Leskovec Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
  • 35. #REF: Penmatsa, Aravind, Kevin H. Wang, and Eric Gouaux. "X-ray structure of dopamine transporter elucidates antidepressant mechanism." Nature 503.7474 (2013): 85- 90. Biology, pharmacy & chemistry, materials • Molecule/crystal as graph: atoms as nodes, chemical bonds as edges • Computing molecular properties • Chemical-chemical interaction • Chemical reaction 3/07/2023 35 Gilmer, Justin, et al. "Neural message passing for quantum chemistry." arXiv preprint arXiv:1704.01212 (2017).
  • 36. Scene graphs as intermediate representation for image captioning Yao et al. Exploring Visual Relationship for Image Captioning, ECCV 2018 Fei-Fei Li, Ranjay Krishna, Danfei Xu
  • 37. GNN in videos: Space-time region graphs (Abhinav Gupta et al, ECCV’18)
  • 38. Transformer is a special type of GNN 3/07/2023 38 Image credit: Chaitanya Joshi
  • 39. chain-like wiring patterns LeNet AlexNet VGGNet The evolution of graph structures in modern NN design (Unintentional!) multiple wiring paths Inception ResNet DenseNet ResNeXt Credit: Saining Xie
  • 40. Natural evolution of representing the world • Vector → Embedding, MLP • Sequence → RNN (LSTM, GRU) • Grid → CNN (AlexNet, VGG, ResNet, EfficientNet, etc) • Set → Word2vec, Attention, Transformer • Graph → GNN (node2vec, DeepWalk, GCN, Graph Attention Net, Column Net, MPNN etc) • ResNet is a special case of GNN on grid! • Transformer is a special case of GNN on fully connected graph. 3/07/2023 40
  • 41. • Graphs are pervasive in many scientific disciplines. • The sub-area of graph representation has reached a certain maturity, with multiple reviews, workshops and papers at top AI/ML venues. 3/07/2023 41 GNN in research Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
  • 42. Graph Neural Network as a solution: a neural network model that can deal with graph data, mapping graphs/nodes to representations for applications such as node classification, link prediction, community detection and graph generation. From Deep Graph Learning: Foundations, Advances and Applications, Yu Rong, Wenbing Huang, Tingyang Xu, Hong Cheng, Junzhou Huang, 2020
  • 43. Two Main Operations in GNN: Graph Filtering. Graph filtering refines the node features. Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
  • 44. Two Main Operations in GNN: Graph Pooling. Graph pooling generates a smaller graph. Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
  • 45. General GNN Framework [figure: a stack of blocks B_1 ... B_n, each consisting of a filtering layer and an activation, with optional pooling layers in between] Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
  • 46. Generalizing 2D convolutions to Graph Convolutions - Graph convolutions involve similar local operations on nodes. - Nodes are now object representations and not activations. - The ordering of neighbors should not matter. - The number of neighbors should not matter. - N(i) denotes the neighbors of node i. - Attention can be employed for edge selection. Kipf & Welling (ICLR 2017) Fei-Fei Li, Ranjay Krishna, Danfei Xu
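A minimal sketch in the spirit of the Kipf & Welling graph convolution: add self-loops, symmetrically normalise the adjacency, average neighbour features, then apply a shared linear map and a nonlinearity. The toy graph and feature sizes are assumptions for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                   # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt         # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)           # aggregate neighbours, transform, ReLU

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)               # a 3-node path graph
H = np.random.randn(3, 4)                            # node features
W = np.random.randn(4, 8)                            # shared weights
print(gcn_layer(A, H, W).shape)                      # (3, 8): neighbour order and count do not matter
```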
  • 47. Generalizing GNNs through message passing 3/07/2023 47 #REF: Pham, Trang, et al. "Column Networks for Collective Classification." AAAI. 2017. Relation graph Generalized message passing
  • 48. Message Passing Neural Net [figure: a graph over nodes v_1 ... v_8, each carrying a hidden state h_i and label l_i] Two phases: message passing, then feature updating; M_k() and U_k() are functions to be designed. Neural Message Passing for Quantum Chemistry. ICML 2017. Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
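The generalized message-passing step can be sketched as below. The message function M_k and update function U_k here are simple stand-ins (a ReLU map and a tanh map) for the learnable functions the slide says are to be designed; the edge list and sizes are toy assumptions.

```python
import numpy as np

def mpnn_step(h, edges, W_msg, W_upd):
    messages = np.zeros_like(h)
    for i, j in edges:                                       # edge (i, j): node j sends to node i
        messages[i] += np.maximum(h[j] @ W_msg, 0.0)         # M_k: message from neighbour j
    return np.tanh(np.concatenate([h, messages], axis=1) @ W_upd)  # U_k: update node states

n, d = 4, 6
h = np.random.randn(n, d)                                    # node states h_i
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]     # a small undirected path graph
W_msg, W_upd = np.random.randn(d, d), np.random.randn(2 * d, d)
print(mpnn_step(h, edges, W_msg, W_upd).shape)               # (4, 6)
```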
  • 49. Neural graph morphism • Input: Graph • Output: A new graph. Same nodes, different edges. • Model: Graph morphism • Method: Graph transformation policy network (GTPN) 3/07/2023 49 Kien Do, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for Chemical Reaction Prediction." KDD’19.
  • 50. Neural graph recurrence • Graphs that represent interaction between entities through time • Spatial edges are node interaction at a time step • Temporal edges are consistency relationship through time
  • 51. Challenges • The addition of temporal edges make the graphs bigger, more complex • Relying on context specific constraints to reduce the complexity by approximations • Through time, structures of the graph may change • Hard to solve, most methods model short sequences to avoid this
  • 52. ASSIGN: Asynchronous, Sparse Interaction Graph Network (Morais et al, 2021 @ A2I2, Deakin – CVPR’21) 3/07/2023 52
  • 53. GraphRNN to generate graphs • A case of graph dynamics: nodes and edges are added sequentially. • Solve tractability using BFS 3/07/2023 53 You, Jiaxuan, et al. "GraphRNN: Generating realistic graphs with deep auto-regressive models." ICML (2018).
  • 54. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 54 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 55. Representation learning, a bit of history •“Representation is the use of signs that stand in for and take the place of something else” It has been a goal of neural networks since the 1980s and the current wave of deep learning (2005-present) → Replacing feature engineering Between 2006-2012, many unsupervised learning models with varying degree of success: RBM, DBN, DBM, DAE, DDAE, PSD Between 2013-2018, most models were supervised, following AlexNet Since 2018, unsupervised learning has become competitive (with contrastive learning, self-supervised learning, BERT)! 3/07/2023 55
  • 56. Criteria for a good representation • Separates factors of variation (aka disentanglement), which are linearly correlated with desired outputs of downstream tasks. • Provides abstraction that is invariant against deformations and small variations. • Is distributed (one concept is represented by multiple units), which is compact and good for interpolation. • Optionally, offers dimensionality reduction. • Optionally, is sparse, giving room for emerging symbols. 3/07/2023 56 Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
  • 57. Why neural unsupervised learning? • Neural nets have representational richness: • FFNs are function approximators • RNNs are program approximators: they can estimate a program's behaviour and generate strings • CNNs capture translation invariance • Transformers are powerful contextual encoders • Compactness: Representations are (sparse and) distributed. • Essential to perception, compact storage and reasoning • Accounting for uncertainty: Neural nets can be stochastic to model distributions • Symbolic representation: realisation through sparse activations and gating mechanisms 3/07/2023 57
  • 58. Generative models: Discover the underlying process that generates data 3/07/2023 58 Many applications: • Text to speech • Simulate data that are hard to obtain/share in real life (e.g., healthcare) • Generate meaningful sentences conditioned on some input (foreign language, image, video) • Semi-supervised learning • Planning
  • 59. Deep (Denoising) AutoEncoder: Self-reconstruction of data 3/07/2023 59 [figure: an auto-encoder whose encoder (feature detector) maps raw data, optionally with added noise, to a representation, and whose decoder produces a reconstruction; a deep auto-encoder stacks several encoder and decoder layers]
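A minimal sketch of one training step of a denoising auto-encoder: corrupt the input, encode it to a low-dimensional representation, and train the decoder to reconstruct the clean input. The 784-dimensional input and 64-dimensional code are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())   # feature detector -> representation
decoder = nn.Linear(64, 784)                             # reconstruction
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, 784)                        # a batch of clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)        # optionally added noise
recon = decoder(encoder(x_noisy))
loss = nn.functional.mse_loss(recon, x)        # reconstruct the *clean* data
loss.backward()
opt.step()
print(loss.item())
```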
  • 60. FSDL 2022 • "Latent Diffusion" model: diffuse in lower-dimensional latent space, then decode back into pixel space • Frozen CLIP ViT-L/14, trained 860M UNet, 123M text encoder • Trained on LAION-5B on 256 A100s for 24 days ($600K) • FULLY OPEN-SOURCE StableDiffusion 60 Slide credit: Karayev, 2022
  • 61. Variational Autoencoder: approximating the posterior by a neural net [figure: Gaussian hidden variables, data, a generative net and a recognising net; credit: kvfrans.com] • Two separate processes: generative (hidden → visible) versus recognition (visible → hidden)
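The recognition net of a VAE outputs a Gaussian over the hidden variables; the reparameterisation step below keeps sampling differentiable so both nets can be trained jointly. This is a generic sketch with toy shapes, not the exact model in the figure.

```python
import torch

def reparameterise(mu, log_var):
    eps = torch.randn_like(mu)                     # noise from a standard Gaussian
    return mu + torch.exp(0.5 * log_var) * eps     # differentiable sample z ~ N(mu, sigma^2)

mu, log_var = torch.zeros(8, 4), torch.zeros(8, 4) # outputs of the recognition (encoder) net
z = reparameterise(mu, log_var)                    # fed to the generative (decoder) net
# KL term of the ELBO against a standard-normal prior:
kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1).mean()
print(z.shape, kl.item())
```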
  • 62. GAN: Generative Adversarial nets Matching data statistics • Instead of modeling the entire distribution of data, learns to map ANY random distribution into the region of data, so that there is no discriminator that can distinguish sampled data from real data. Any random distribution in any space Binary discriminator, usually a neural classifier Neural net that maps z → x
  • 63. Generative adversarial networks (Adapted from Goodfellow’s, NIPS 2014) 3/07/2023 63
  • 64. BERT Transformer that predicts its own masked parts • BERT is like parallel approximate pseudo-likelihood • ~ Maximizing the conditional likelihood of some variables given the rest. • When the number of variables is large, this converges to MLE (maximum likelihood estimation). 3/07/2023 64 https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
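The masked-prediction objective can be sketched as below: hide a random subset of tokens, predict them from the remaining context, and score only the hidden positions. The tiny embedding-plus-linear "model" is a placeholder, not BERT; the vocabulary size, mask rate and the use of id 0 as the mask token are assumptions.

```python
import torch
import torch.nn as nn

vocab, d = 100, 32
embed = nn.Embedding(vocab, d)                         # placeholder for a deep Transformer encoder
head = nn.Linear(d, vocab)

tokens = torch.randint(1, vocab, (4, 10))              # a batch of token ids
mask = torch.rand(4, 10) < 0.15                        # hide ~15% of positions
inputs = tokens.masked_fill(mask, 0)                   # id 0 plays the role of [MASK]

logits = head(embed(inputs))                           # (4, 10, vocab)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # predict only the masked tokens
print(loss.item())
```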
  • 65. Neural autoregressive models: Predict the next step given the history • The keys: (a) long-term dependencies, (b) ordering, & (c) parameter sharing. • Can be realized using: • RNN • CNN: One-sided CNN, dilated CNN (e.g., WaveNet), PixelCNN • Transformers → GPT-X family • Masked autoencoder → MADE • Pros: General, good quality thus far • Cons: Slow – needs better inductive biases for scalability 3/07/2023 65 lyusungwon.github.io/studies/2018/07/25/nade/
  • 66. FSDL 2022 • Generative Pre-trained Transformer • Decoder-only (uses masked self-attention) • Trained on 8M web pages, largest model is 1.5B GPT / GPT-2 (2019) https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf 66 Slide credit: Karayev, 2022
  • 67. Contrastive learning: Comparing samples 3/07/2023 67 Le-Khac, Phuc H., Graham Healy, and Alan F. Smeaton. "Contrastive Representation Learning: A Framework and Review." arXiv preprint arXiv:2010.05113 (2020).
  • 68. CLIP: matching image-text pairs against the rest • 400M image-text pairs crawled from the Internet • Transformer to encode text, ResNet or Vision Transformer to encode image • Contrastive training: maximize cosine similarity of correct image-text pairs (32K pairs per batch) https://arxiv.org/pdf/2103.00020.pdf Slide credit: Karayev, 2022
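The contrastive objective behind CLIP can be sketched as a symmetric cross-entropy over the in-batch similarity matrix: matching image-text pairs sit on the diagonal and every other pairing is a negative. The embedding size, batch size and temperature below are illustrative assumptions, and the random embeddings stand in for the image and text encoders.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.T / temperature            # pairwise cosine similarities
    targets = torch.arange(img.shape[0])          # the i-th image matches the i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

image_emb = torch.randn(32, 512)                  # stand-in for ResNet / ViT image features
text_emb = torch.randn(32, 512)                   # stand-in for Transformer text features
print(clip_loss(image_emb, text_emb).item())
```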
  • 69. Unsupervised learning: A few more points • No external labels, but rich training signals (thousands of bits per sample, as opposed to a few bits in supervised learning). A few techniques: • Compressing data as much as possible with little loss • Energy-based, i.e., pull down the energy of observed data, pull up everything else • Filling in the missing slots (aka predictive learning, self-supervised learning) • We have not covered unsupervised learning on graphs (e.g., DeepWalk, GPT-GNN), but the general principles should hold. • Question: Multiple objectives, or no objective at all? • Question: Emergence from many simple interacting elements? 3/07/2023 69 Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." arXiv preprint arXiv:2006.08218 (2020). Assran, Mahmoud, et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." arXiv preprint arXiv:2301.08243 (2023).
  • 70. Picture taken from (Bommasani et al, 2021) A Tipping Point: Foundation Models 70 • A foundation model is a model trained at broad scale that can be adapted to a wide range of downstream tasks • Scale and the ability to perform tasks beyond training Slide credit: Samuel Albanie, 2022
  • 71. Slide credit: Chris Ré, Stanford, 2022 word2vec 2013
  • 72. Two key ideas underpin foundation models: Emergence • system behaviour is implicitly induced rather than explicitly constructed • a cause of scientific excitement and anxiety about unanticipated consequences Homogenisation • consolidation of methodology for building machine learning systems across many applications • provides strong leverage for many tasks, but also creates single points of failure Slide credit: Samuel Albanie, 2022
  • 73. Homogenisation Learning instead of algorithm: Many applications can be powered by the same learning algorithm. • => Feature engineering Deep architecture engineering: Instead of hand-crafting features, the same architecture could be used widely. • => Architecture engineering Modern Transformer is universal: Same architecture, just different data! • => Data & Prompt engineering Slide credit: Samuel Albanie, 2022
  • 74. Homogenisation: Deepr 3/07/2023 74 [figure: a medical record is treated as a sequence of visits/admissions with time gaps; steps: (1) sequencing with time-gap/transfer phrases per admission, (2) word-vector embedding, (3) convolution for motif detection, (4) max-pooling into a record vector, (5) prediction] Nguyen, Phuoc, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. "Deepr: a convolutional net for medical records." IEEE Journal of Biomedical and Health Informatics 21, no. 1 (2016): 22-30. Concept: Stringify() – everything as a string
  • 76. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 76 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 77. 1960s-1990s ▪ Hand-crafting rules, domain- specific, logic-based ▪ High in reasoning ▪ Can’t scale. ▪ Fail on unseen cases. 3/07/2023 77 2020s-2030s  Learning + reasoning, general purpose, human-like  Has contextual and common- sense reasoning  Requires less data  Adapt to change  Explainable 1990s-2020s  Machine learning, general purpose, statistics-based  Low in reasoning  Needs lots of data  Less adaptive  Little explanation Photo credit: DARPA
  • 78. From ML to Machine Reasoning 3/07/2023 78 [figure: object detection over a scene of cylinders, cubes and spheres in cyan, brown, orange and red, followed by a reasoning stage] Slide credit: Tin Pham
  • 79. What is missing in deep learning? • Modern neural networks are good at interpolating → Data hungry to cover all variations and smooth local manifolds →Little systematic generalization (novel combinations) • Lack of human-perceived reasoning capability • Lack of logical inference • Lack of natural mechanism to incorporate prior knowledge, e.g., common sense • No built-in causal mechanisms 3/07/2023 79
  • 80. Machine reasoning Reasoning is concerned with arriving at a deduction about a new combination of circumstances. Reasoning is to deduce new knowledge from previously acquired knowledge in response to a query. 3/07/2023 80 Leslie Valiant Leon Bottou
  • 81. Machine reasoning • Two-part process • manipulate previously acquired knowledge • to draw novel inferences or answer new questions • Example: • Premise: • A is to the left of B • B is to the left of C • D is in front of A • E is in front of C • Conclusion: what is the relation between D and E? 3/07/2023 81 Slide credit: Tin Pham
  • 82. Geometry example 3/07/2023 82 Premise: • AM = MN (1) • BM = MC (2) • ∠AMB = ∠NMC (3) Conclusion sought: • AB = CN? • AB // CN? Solution (applying existing knowledge): From (1), (2), (3) → △AMB = △NMC (4) → AB = CN. From (1), (2) → ABNC is a parallelogram (5) → AB // CN. Slide credit: Tin Pham
  • 83. Is reasoning always formal/logical? 3/07/2023 83 Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149. Leon Bottou • “When we observe a visual scene, when we hear a complex sentence, we are able to explain in formal terms the relation of the objects in the scene, or the precise meaning of the sentence components. • However, there is no evidence that such a formal analysis necessarily takes place: we see a scene, we hear a sentence, and we just know what they mean. • This suggests the existence of a middle layer, already a form of reasoning, but not yet formal or logical.”
  • 84. Why not just neural reasoning? Central to reasoning are composition rules that guide the combination of modules to address new tasks. Bottou: • Reasoning is not necessarily achieved by making logical inferences • There is a continuity between [algebraically rich inference] and [connecting together trainable learning systems] → Neural networks are a plausible candidate! → But they are still not natural for representing abstract discrete concepts and relations. Hinton/Bengio/LeCun: Neural networks can do everything! The rest: Not so fast! => Neurosymbolic systems! 3/07/2023 84 Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149.
  • 85. Learning to reason • Learning is to improve oneself by experiencing ~ acquiring knowledge & skills • Reasoning is to deduce knowledge from previously acquired knowledge in response to a query (or a cue) • Learning to reason is to improve the ability to decide if a knowledge base entails a predicate. • E.g., given a video f, determine if the person with the hat turns before singing. • Hypotheses: • Reasoning as just-in-time program synthesis. • It employs conditional computation. • It minimises an energy function, or maximises the compatibility between input (prompt) and output. 3/07/2023 85 Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725. (Dan Roth; ACM Fellow; IJCAI John McCarthy Award)
  • 86. Reasoning as a skill • Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. 3/07/2023 86
  • 87. Practical setting: (query,database,answer) triplets • Classification: Query = what is this? Database = data. • Regression: Query = how much? Database = data. • QA: Query = NLP question. Database = context/image/text. • Multi-task learning: Query = task ID. Database = data. • Zero-shot learning: Query = task description. Database = data. • Drug-protein binding: Query = drug. Database = protein. • Recommender system: Query = User (or item). Database = inventories (or user base); 3/07/2023 87
  • 88. The two approaches to neural reasoning • Implicit chaining of predicates through recurrence: • Step-wise query-specific attention to relevant concepts & relations. • Iterative concept refinement & combination, e.g., through a working memory. • Answer is computed from the last memory state & question embedding. • Explicit program synthesis: • There is a set of modules, each performing a pre-defined operation. • The question is parsed into a symbolic program. • The program is implemented as a computational graph constructed by chaining separate modules. • The program is executed to compute an answer. 3/07/2023 88
  • 89. MACNet: Composition- Attention-Control (reasoning by progressive refinement of selected data) 3/07/2023 89 Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning." arXiv preprint arXiv:1803.03067 (2018).
  • 90. LOGNet: Relational object reasoning with language binding 90 • Key insight: Reasoning is chaining of relational predicates to arrive at a final conclusion → Needs to uncover spatial relations, conditioned on the query → Chaining is query-driven → Objects/language need binding → Object semantics is query-dependent → Everything is end-to-end differentiable Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran, “Dynamic Language Binding in Relational Visual Reasoning”, IJCAI’20.
  • 91. 91 LOGNet for VQA Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran, “Dynamic Language Binding in Relational Visual Reasoning”, IJCAI’20.
  • 92. Visual QA in action
  • 93. What about the Transformer? • Reasoning as (free-)energy minimisation • The classic Belief Propagation algorithm is a minimization algorithm for the Bethe free energy! • The Transformer's relational, iterative state refinement makes it a great candidate for implicit relational reasoning. 3/07/2023 93 Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free energy." Advances in neural information processing systems. 2003. Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
  • 95. Module networks (reasoning by constructing and executing neural programs) • Reasoning as laying out modules to reach an answer • Composable neural architecture → question parsed as a program (layout of modules) • A module is a function (x → y), which could be a sub-reasoning process ((x, q) → y). 3/07/2023 95 https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
  • 96. Program execution • Works on an object-based visual representation • An intermediate set of objects is represented by a vector, as an attention mask over all objects in the scene. For example, Filter(Green_cube) outputs a mask (0,1,0,0). • The output mask is fed into the next module (e.g., Relate) 96
  • 97. Source: @rao2z What about reasoning in LLMs? • LLMs have a HUGE associative memory. • With “Let’s think step-by-step”? • With “Chain of Thought”? • Or is it just pattern recognition of chains of reasoning? • Finding short-cuts that approximate a provably correct reasoning procedure. • => Very poor OOD generalisation. 3/07/2023 97
  • 98. A general framework 3/07/2023 98 Explicit Knowledge Graphs + Large Language Models (implicit common sense knowledge, associative database)
  • 99. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 99 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 100. 3/07/2023 100 Learning a Turing machine → Can we learn a (neural) program that learns to program from data?
  • 101. Memory networks • Input is a set → Load into memory, which is NOT updated. • State is a RNN with attention reading from inputs • Concepts: Query, key and content + Content addressing. • Deep models, but constant path length from input to output. • Equivalent to a RNN with shared input set. • => Seq2seq with attention is a Memory Network (Memory = input seq). • => Transformer is a kind of Memory Network with Parallel Memory Update! 3/07/2023 101 Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
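One memory-network read step, in the spirit of end-to-end memory networks: the query is matched against the keys of the stored (fixed) memory by content, and the answer is a softmax-weighted sum of the contents. The slot count and dimensions are toy assumptions.

```python
import torch
import torch.nn.functional as F

def memory_read(query, keys, values):
    scores = keys @ query                  # content addressing: match the query to each key
    weights = F.softmax(scores, dim=0)     # soft attention over memory slots
    return weights @ values                # retrieved content

keys = torch.randn(10, 16)                 # 10 memory slots (addresses), loaded once, not updated
values = torch.randn(10, 16)               # their contents
query = torch.randn(16)
print(memory_read(query, keys, values).shape)   # (16,)
```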
  • 102. MANN: Memory-Augmented Neural Networks (a constant path length) • Long-term dependency • E.g., the outcome depends on the far past • Memory is needed (e.g., as in LSTM) • => This is what makes Transformers powerful! • Complex programs require multiple computational steps • Each step can be selective (attentive) to certain memory cells • Operations: Encoding | Decoding | Retrieval
  • 103. MANN: Neural Turing machine (NTM) (simulating a differentiable Turing machine) • A controller that takes input/output and talks to an external memory module. • Memory has read/write operations. • The main issue is where to write, and how to update the memory state. • All operations are differentiable. Source: rylanschaeffer.github.io
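A rough sketch of NTM-style content-based addressing and a soft write: cosine similarity between a key and every memory row decides where to focus, and erase/add vectors then update all rows in proportion to that focus, so the whole step remains differentiable. The slot count, key sharpness beta and the erase/add vectors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def address(memory, key, beta=5.0):
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # similarity to each slot
    return F.softmax(beta * sim, dim=0)                         # sharper focus for larger beta

def write(memory, w, erase, add):
    memory = memory * (1 - w[:, None] * erase[None, :])         # soft erase where we focus
    return memory + w[:, None] * add[None, :]                   # soft add where we focus

M = torch.randn(8, 16)                                           # 8 memory slots of width 16
w = address(M, key=torch.randn(16))                              # where to write
M = write(M, w, erase=torch.sigmoid(torch.randn(16)), add=torch.randn(16))
print(M.shape)                                                   # (8, 16), all operations differentiable
```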
  • 104. 3/07/2023 104 NTM unrolled in time with LSTM as controller #Ref: https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315
  • 105. MANN for reasoning • Three steps: • Store data into memory • Read query, process sequentially, consult memory • Output answer • Behind the scene: • Memory contains data & results of intermediate steps • Drawbacks of current MANNs: • No memory of controllers → Less modularity and compositionality when query is complex • No memory of relations → Much harder to chain predicates. 3/07/2023 105 Source: rylanschaeffer.github.io
  • 106. Failures of item-only MANNs for reasoning • Relational representation is NOT stored → Can’t reuse later in the chain • A single memory of items and relations → Can’t understand how relational reasoning occurs • The memory-memory relationship is coarse since it is represented as either dot product, or weighted sum. 3/07/2023 106
  • 107. Self-attentive associative memories (SAM) Learning relations automatically over time 3/07/2023 107 Hung Le, Truyen Tran, Svetha Venkatesh, “Self-attentive associative memory”, ICML'20.
  • 108. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 108 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 109. Neural nets are powerful but we still want: • Learning with less data, and zero-shot learning; • Generalization of solutions to unseen tasks and unforeseen data distributions; • Explainability by construction. 3/07/2023 109 https://ibm.github.io/neuro-symbolic-ai/events/ns-workshop2023 Self-Aware Learning • Deeper learning for challenging tasks • Integrating continuous and symbolic representations • Diversified learning modalities Credit: Yolanda Gil, Bart Selman AI to Understand Human Intelligence • 5 years: AI systems could be designed to study psychological models of complex intelligent phenomena that are based on combinations of symbolic processing and artificial neural networks.
  • 110. Symbolic forms • Words in Wordnet • Syntax in NLP & Code • Logic, prepositional and first-order • Variables, equations • Knowledge structure: Semantic nets, knowledge graphs • Graphical models: Bayesian networks, Markov random fields, Markov logic networks. • Function (names), indirection, pointer in C/C++. 3/07/2023 110
  • 111. Henry Kautz's taxonomy (1) • Symbolic Neural symbolic—is the current approach of many neural models in natural language processing, where words or subword tokens are both the ultimate input and output of large language models. Examples include BERT, RoBERTa, and GPT-3. 3/07/2023 111 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 112. Representing Context and Structure: known as contextualized language models. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” 2019 Slide credit: Pacheco & Goldwasser, 2021
  • 113. What does BERT learn? Emergent linguistic structure in artificial neural networks trained by self-supervision. PNAS, Manning et al., 2020. Linguistic structure emerges without direct supervision. Slide credit: Pacheco & Goldwasser, 2021
  • 114. Using BERT for Reasoning Tasks • BERT-based near-human performance on the Winograd Schema. WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale. Sakaguchi et al., AAAI’20. Can “thinking-slow” tasks be accomplished with “thinking-fast” systems? Not a panacea (McCoy et al., ACL’19, others): often relies on simple heuristics when learning complex decisions. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. McCoy et al., ACL’19. World knowledge and commonsense inferences are reflected in coreference decisions. Slide credit: Pacheco & Goldwasser, 2021
  • 115. Henry Kautz's taxonomy (2) • Symbolic[Neural]—is exemplified by AlphaGo, where symbolic techniques are used to call neural techniques. In this case, the symbolic approach is Monte Carlo tree search and the neural techniques learn how to evaluate game positions. 3/07/2023 115 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 116. Henry Kautz's taxonomy (3) • Neural | Symbolic—uses a neural architecture to interpret perceptual data as symbols and relationships that are reasoned about symbolically. The Neural- Concept Learner is an example. 3/07/2023 116 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 117. End-to-End Module Networks • Construct the program internally • The two parts are jointly learnable 3/07/2023 End-to-End Module Networks, Hu et al., ICCV 2017 117 Slide credit: Vuong Le
  • 118. Henry Kautz's taxonomy (4) • Neural: Symbolic → Neural—relies on symbolic reasoning to generate or label training data that is subsequently learned by a deep learning model, e.g., to train a neural model for symbolic computation by using a Macsyma-like symbolic mathematics system to create or label examples. 3/07/2023 118 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI Lample, Guillaume, and François Charton. 2020. “Deep Learning For Symbolic Mathematics.” In Proceedings of the International Conference on Learning Representations.
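A minimal sketch of the Symbolic → Neural pattern, assuming SymPy is available: a symbolic algebra engine generates exactly labelled (expression, derivative) pairs that a neural sequence model could then be trained on. The training itself is omitted, and the expression grammar is an illustrative choice.

```python
import random
import sympy as sp

x = sp.symbols('x')
ATOMS = [x, sp.sin(x), sp.cos(x), sp.exp(x)]

def random_expression(depth=2):
    """Build a small random symbolic expression."""
    if depth == 0:
        return random.choice(ATOMS)
    left = random_expression(depth - 1)
    right = random_expression(depth - 1)
    return random.choice([left + right, left * right])

# The symbolic engine provides exact labels (derivatives here) that a
# neural seq2seq model would be trained to reproduce token-by-token.
dataset = [(str(e), str(sp.diff(e, x)))
           for e in (random_expression() for _ in range(5))]
for src, tgt in dataset:
    print(src, '->', tgt)
```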
  • 119. Henry Kautz's taxonomy (5) • Neural_{Symbolic}—uses a neural net that is generated from symbolic rules. An example is the Neural Theorem Prover, which constructs a neural network from an AND-OR proof tree generated from knowledge base rules and terms. Logic Tensor Networks also fall into this category. 3/07/2023 119 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
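A rough sketch of the Neural_{Symbolic} idea in the style of Logic Tensor Networks (not their actual API): a symbolic rule is compiled into a differentiable loss over fuzzy truth values, so gradients flow back into the neural predicates. The predicate tensors, the product t-norm, and the Reichenbach implication are illustrative choices.

```python
import torch

# Fuzzy truth values that would normally be produced by neural predicates
# (random stand-ins here): smokes(x) for 5 people, friends(x, y) pairwise.
smokes = torch.rand(5, requires_grad=True)
friends = torch.rand(5, 5, requires_grad=True)

def implies(a, b):
    # Reichenbach fuzzy implication: 1 - a + a*b.
    return 1 - a + a * b

# Compile the rule  friends(x, y) AND smokes(x) -> smokes(y)
# using the product t-norm for AND, evaluated for every pair (x, y).
body = friends * smokes.unsqueeze(1)            # shape (5, 5): body truth per (x, y)
rule_truth = implies(body, smokes.unsqueeze(0))

loss = 1 - rule_truth.mean()  # encourage the rule to hold on average
loss.backward()               # gradients reach the (neural) predicate parameters
print(float(loss))
```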
  • 120. Henry Kautz's taxonomy (6) • Neural[Symbolic]—allows a neural model to directly call a symbolic reasoning engine, e.g., to perform an action or evaluate a state. An example would be ChatGPT using a plugin to query Wolfram Alpha. 3/07/2023 120 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 121. LLMs for calling tools • Information retriever • Symbolic/math module & code interpreters • Virtual agents • Robotic arms. See https://palm-e.github.io/ 3/07/2023 121 Credit: Khattab et al
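A minimal sketch of the tool-calling loop (Neural[Symbolic]): the neural model either answers directly or requests a symbolic tool, whose result is fed back into its context. Both `call_llm` and `call_calculator` are placeholders here, not a real LLM API or Wolfram Alpha.

```python
import re

def call_llm(prompt):
    # Placeholder for a large language model. A real model would decide,
    # from the prompt, whether to answer or to emit a tool request.
    if "Tool result:" in prompt:
        return "The answer is " + prompt.rsplit("Tool result: ", 1)[-1]
    return "CALL calculator: 17*23"

def call_calculator(expression):
    # Placeholder symbolic/math module (stands in for e.g. Wolfram Alpha).
    return str(eval(expression, {"__builtins__": {}}))

def answer(question, max_steps=3):
    prompt = question
    for _ in range(max_steps):
        reply = call_llm(prompt)
        request = re.match(r"CALL calculator: (.*)", reply)
        if request is None:
            return reply  # the model answered without a tool
        result = call_calculator(request.group(1))
        prompt = f"{prompt}\nTool result: {result}"  # symbolic result back to the LLM
    return reply

print(answer("What is 17 * 23?"))  # -> "The answer is 391"
```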
  • 122. Symbols via Indirection 3/07/2023 122 Example: Z = X + Y with bindings X = 1, Y = 2, Z = 3: bind symbols with values. Pointers in computer science; information binding in the brain. https://www.linkedin.com/pulse/unsolved-problems-ai-part-2-binding-problem-eberhard-schoeneburg/ Indirection binds two objects together and uses one to refer to the other. Slide credit: Kha Pham
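A minimal sketch of indirection as binding, mirroring the Z = X + Y example above: symbols refer to values through a binding table (a pointer-like mechanism) rather than holding the values themselves, so the same abstract rule can be re-bound to new data.

```python
def evaluate(expr, bindings):
    """Evaluate a symbolic expression by dereferencing each symbol."""
    left, op, right = expr                   # e.g. ("X", "+", "Y")
    a, b = bindings[left], bindings[right]   # indirection: symbol -> current value
    return a + b if op == "+" else a - b

bindings = {"X": 1, "Y": 2}                  # bind symbols with values
print(evaluate(("X", "+", "Y"), bindings))   # -> 3 (i.e. Z = 3)

# Rebinding reuses the same abstract rule on new data; reasoning at the
# symbol level is what is hoped to generalize.
bindings = {"X": 10, "Y": 32}
print(evaluate(("X", "+", "Y"), bindings))   # -> 42
```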
  • 123. Indirection is a key design principle in software engineering 3/07/2023 123 Client → Indirection Layer → Target https://medium.com/@nmckinnonblog/indirection-fba1857630e2 Indirection removes direct coupling between units and promotes: • Extensibility • Control • Evolvability • Encapsulation of code and design complexity Every computer science problem can be solved with a higher level of indirection. Andrew Koenig, Butler Lampson, David J. Wheeler Slide credit: Kha Pham
  • 124. Leveraging indirection to improve OOD generalization 3/07/2023 124 Why indirection? Indirection binds concrete data to abstract symbols, and reasoning on symbols is likely to improve generalization. What to bind? Concrete information of data, e.g., representations, functional relations between data, etc. Functional indirection Structural indirection How to bind? During indirection, some concrete information of data will be ignored, and thus we have to decide what to maintain, i.e., invariances across data. → Indirection connects invariance and symbolic approaches. Slide credit: Kha Pham
  • 125. Structural Indirection: InLay 3/07/2023 125 • InLay simultaneously leverages indirection and data internal relationships to construct indirection representations, which respect the similarities between internal relationships. • InLay connects invariance and symbolic approaches: • InLay constructs indirection representations from a fixed set of symbolic vectors. • InLay assumes two invariances: • The data internal relationships are invariant through indirection. • The set of symbolic vectors to compute indirection representations is invariant across train and test samples. Slide credit: Kha Pham Pham, K., Le, H., Ngo, M. and Tran, T., Improving Out-of-distribution Generalization with Indirection Representations. In The Eleventh International Conference on Learning Representations.
  • 126. Structure-Mapping Theory (SMT) 3/07/2023 126 • Improves on previous theories of analogy, e.g., Tversky's contrast theory, which assumed that an analogy is stronger the more attributes the base and target share in common. • SMT [1] argued that it is not object attributes that are mapped in an analogy, but relationships between objects. Example: the X12 star system vs. the Solar system is a literal similarity, while the hydrogen atom vs. the Solar system is Rutherford's analogy. Literal similarity: many attributes mapped, many relations mapped. Analogy: few attributes mapped, many relations mapped. [1] Gentner, Dedre. "Structure-mapping: A theoretical framework for analogy." Cognitive Science 7.2 (1983): 155-170. Slide credit: Kha Pham
  • 127. Structure-Mapping Theory (SMT) (cont.) 3/07/2023 127 Which predicates will be chosen to be mapped in an analogy? Systematicity Principle: A predicate that belongs to a mappable system of mutually interconnecting relationships is more likely to be imported into the target than is an isolated predicate. Example: from the Solar system to the hydrogen atom, interconnected relational predicates (distance, attractive force, revolves around) are mapped, while isolated attributes (color, temperature) are not. Slide credit: Kha Pham
  • 128. Model architecture 3/07/2023 128 • The concrete data representation is viewed as a complete graph with weighted edges. • The indirection operator maps this graph to a symbolic graph with the same edge weights, but whose vertices are fixed and trainable. • This symbolic graph is propagated, and the updated node features are the indirection representations. Slide credit: Kha Pham
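A rough PyTorch sketch of the architecture described above, written from the slide's description rather than the authors' code (the layer sizes, softmax-based soft binding, and single propagation step are assumptions): relations computed on the concrete input graph are carried over to a fixed set of trainable symbolic vertices, which are then propagated to give indirection representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IndirectionLayer(nn.Module):
    """Sketch of structural indirection: map the relational structure of the
    input onto a fixed set of trainable symbolic vectors (assumed design,
    not the official InLay implementation)."""
    def __init__(self, num_symbols, dim):
        super().__init__()
        # Fixed, trainable symbolic vertices shared across all samples.
        self.symbols = nn.Parameter(torch.randn(num_symbols, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, num_items, dim) -- concrete item representations.
        # Edge weights of the complete graph over concrete items.
        rel = torch.softmax(x @ x.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        # Soft binding of each concrete item to the symbolic vertices.
        assign = torch.softmax(x @ self.symbols.t(), dim=-1)   # (b, n, k)
        # Carry the relational structure over to the symbolic graph.
        sym_rel = assign.transpose(1, 2) @ rel @ assign         # (b, k, k)
        # One step of propagation over the symbolic graph.
        sym_nodes = self.symbols.unsqueeze(0).expand(x.shape[0], -1, -1)
        return F.relu(self.proj(sym_rel @ sym_nodes))           # (b, k, dim)

layer = IndirectionLayer(num_symbols=8, dim=16)
print(layer(torch.randn(2, 5, 16)).shape)  # torch.Size([2, 8, 16])
```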
  • 129. Experiments on IQ datasets – RAVEN dataset 3/07/2023 129 An IQ problem from the RAVEN [1] dataset. Average test accuracies (%) without/with InLay in different OOD testing scenarios on RAVEN: LSTM 30.1/39.2; Transformers 15.1/42.5; RelationNet 12.5/46.4; PrediNet 13.8/15.6. [1] Zhang, Chi, et al. "RAVEN: A dataset for relational and analogical visual reasoning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. • The original RAVEN paper proposes different OOD testing scenarios, in which models are trained on one configuration and tested on another (but related) configuration. Slide credit: Kha Pham
  • 130. Experiments on OOD image classification tasks 3/07/2023 130 OOD image classification, in which test images are distorted (e.g., a distorted dog image: "Dog" vs. "Dog?"). • When test images are injected with kinds of distortion other than those seen in training, deep neural networks may fail drastically at image classification. [1] [1] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 31, 2018. Average test accuracies (%) without/with InLay of Vision Transformers (ViT) on different types of distortions: SVHN 65.9/68.8; CIFAR10 38.2/43.1; CIFAR100 17.1/20.4. Slide credit: Kha Pham
  • 131. Here “physics” refers to empirical or theoretical laws that exist in nature. Chart: the number of papers on physics-informed ML (PIML) and physics-informed NNs grew roughly exponentially from 2015 to 2022 (log-scale counts from 1 to 100,000; exponential fit R² = 0.989).
  • 132. Integrate-and-fire neuron andreykurenkov.com Priors that work • Neuron as trainable feature detector • Depth + Skip-connection • Invariance/equivariance: • Convolution (Translation) • Recurrence (Time travel) • Attention (Permutation) • Analogy • Kernel, case-based reasoning, • Attention, memory Feature detector Source: http://karpathy.github.io/assets/rnn/diags.jpeg
  • 133. Physics invariance • Newton's laws • Symmetry • Conservation laws • Noether’s theorem linking symmetry and conservation. First page of Emmy Noether's article "Invariante Variationsprobleme" (1918). Source: Wikipedia
  • 134. ML, data & physics • Data collection/annotation for ML is expensive • ML solutions don’t respect symmetries and conservation laws • Physical laws are universal (up to scale) | ML only generalizes in-distribution. Karniadakis, George Em, et al. "Physics-informed machine learning." Nature Reviews Physics 3.6 (2021): 422-440.
  • 135. Embedding physics into ML https://medium.com/@zhaoshuai1989/why-do-we-need-physics-informed-machine-learning-piml-d11fe0c4436c
  • 136. Physics guides neural architecture • Physics-informed neural networks (PINN) Figure from talk by Perdikaris & Wang, 2020.
  • 137. Physics guides learning dynamics • Physics-informed neural networks (PINN) Figure from talk by Perdikaris & Wang, 2020.
  • 138. Case study: Damped harmonic oscillation Source: https://benmoseley.blog/my-research/so-what-is-a-physics-informed-neural-network/
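A compact PyTorch sketch of a physics-informed network for the damped oscillator u'' + μu' + ku = 0 (the coefficients, network size, and training schedule are illustrative; this is not the blog's code): a data loss fits a few early observations, while the ODE residual, obtained with autograd, constrains the network over the whole domain.

```python
import torch
import torch.nn as nn

mu, k = 0.4, 4.0  # illustrative damping and stiffness coefficients

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# A few observations near t = 0 (here from the known analytic solution),
# plus collocation points over the full domain for the physics residual.
omega = (k - mu ** 2 / 4) ** 0.5
t_obs = torch.linspace(0, 1, 10).reshape(-1, 1)
u_obs = torch.exp(-mu * t_obs / 2) * torch.cos(omega * t_obs)
t_col = torch.linspace(0, 10, 200).reshape(-1, 1).requires_grad_(True)

for step in range(2000):
    opt.zero_grad()
    loss_data = ((net(t_obs) - u_obs) ** 2).mean()          # fit the data
    u = net(t_col)
    du = torch.autograd.grad(u.sum(), t_col, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), t_col, create_graph=True)[0]
    loss_phys = ((d2u + mu * du + k * u) ** 2).mean()       # ODE residual
    (loss_data + loss_phys).backward()
    opt.step()
```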
  • 139. Case study: COVID-19 in Vietnam, 2021 • Failed to contain the new exponential growth due to the Delta variant. • The cost: ~20,000 lives within 3 months!! • At the peak, the daily mortality was comparable to the Vietnam War's rate. • What worked in 2020 didn't work in 2021. 3/07/2023 139
  • 140. SIR family for pandemics • N = Population • S = Susceptible • I = Infectious • R = Recovered Source: Wikipedia Basic reproduction number
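A minimal numerical sketch of the SIR dynamics above, dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI, integrated with SciPy; the population and rate values are illustrative only, and the basic reproduction number is R0 = β/γ.

```python
import numpy as np
from scipy.integrate import solve_ivp

N = 1_000_000            # population (illustrative)
beta, gamma = 0.3, 0.1   # transmission and recovery rates; R0 = beta/gamma = 3

def sir(t, y):
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

sol = solve_ivp(sir, (0, 160), y0=[N - 10, 10, 0],
                t_eval=np.linspace(0, 160, 161))
print("Peak infectious:", int(sol.y[1].max()))
```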
  • 141. Covid-19 infections • SIR: closed-form solutions are hard to calculate • Parameters change over time due to interventions → need a more flexible framework. • Solution: Richards equation → Richards curve | Gompertz curve • Task: from 10-20 data points, extrapolate 150 more.
  • 142. Model design • Remember, often we have only 20-30 highly correlated data points to learn from! • The model is a sum of 2-3 “waves”, each a 3-parameter Gompertz curve: • Height of the peak • Location of the peak • Scale of the wave (the effective width) • The number of waves covers the observed waves plus some hypothetical future waves. • The model can be thought of as a special neural network in which each hidden unit is a wave with a Gompertz-based kernel (a sketch follows below). 3/07/2023 142
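A hedged sketch of the wave-sum model described above, not the actual forecasting code: cumulative cases are modelled as a sum of Gompertz waves, each with a height, peak location, and scale, fitted with SciPy and then extrapolated. The synthetic data, the two-wave choice, and the bounds (standing in for the priors discussed on the next slide) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, height, peak, scale):
    """Cumulative Gompertz wave: total size 'height', daily peak at 'peak',
    effective width controlled by 'scale'."""
    return height * np.exp(-np.exp(-(t - peak) / scale))

def two_waves(t, h1, p1, s1, h2, p2, s2):
    # Sum of waves: each wave acts like a hidden unit with a Gompertz kernel.
    return gompertz(t, h1, p1, s1) + gompertz(t, h2, p2, s2)

# Synthetic "observed" cumulative counts; in practice only ~20-30 points
# near the start of a wave were available.
t_obs = np.arange(70)
y_obs = two_waves(t_obs, 8000, 15, 5, 20000, 45, 8)

# Bounds play the role of priors on wave height and scale.
lower = [1000, 5, 2, 1000, 20, 2]
upper = [15000, 40, 20, 50000, 120, 30]
params, _ = curve_fit(two_waves, t_obs, y_obs,
                      p0=[5000, 20, 5, 10000, 60, 10], bounds=(lower, upper))

t_future = np.arange(150)
forecast = two_waves(t_future, *params)
print("Projected total cases:", int(forecast[-1]))
```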
  • 143. Estimating the model priors • Impossible to know without assumptions! • Need priors on wave size and, possibly, the scale (e.g., min-max) • One solution: • Look at other countries, adjusting for population size. • Hopefully the culture, economic structure & actions are similar. • It depends on: • The virus variant (original != Delta != Omicron) • Health/border capacity (closed border + lockdown in the beginning) • Vaccination coverage (80% tended to be the threshold for opening) • Total cases/population. 3/07/2023 143
  • 144. Case of HCM City Chart: estimated Covid-19 deaths in Ho Chi Minh City, July-October 2021 (series: recorded deaths, estimated deaths, cumulative deaths (actual)). 11/8: predicting date; 20-21/8: peak; total cases by 16/10.
  • 145. Case of Binh Duong province 145 17/8: predicting date; 28-30/8: peak; total cases by 25/10.
  • 146. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 146 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 147. In 2022-2023, DL reached new heights: GPT-4, PaLM-E, GATO, etc. 3/07/2023 147
  • 148. Major remaining problems of DL • Massive associative machine → Lacks a causality prior, prone to learning the wrong things or working for the wrong reasons. → Overconfident for the wrong reasons (e.g., prone to adversarial attacks). → Exploits short-cuts => poor OOD generalisation. → Sample inefficient. → Approximate reasoning patterns, not derived from first principles. • Inference separated from learning → No built-in adaptation other than retraining → Catastrophic forgetting • Limited theoretical understanding 3/07/2023 148
  • 149. Are limitations inherent? • YES, statistical systems tend to memorize data and find short-cuts. • We need lots of data to cover all possible variations, hence lots of compute. • But aren’t we great copiers? • NO, neural nets were founded on the basis of distributed representation and parallel processing. These are robust, fast and energy efficient. • We still need to find “binding” tricks that do all sorts of things without relying on statistical training signals + backprop. 3/07/2023 149
  • 150. Dimensions of progress • Continuation of current works/paths • Expansion/optimisation • Industrialisation: Scale up & scale out • Challenge fundamental assumptions • DL as part of more holistic solution to Human-Level AI (HLAI) • Dealing with the unexpected: Uncertainty, safety, security 3/07/2023 150
  • 151. Continuation • Enabling techs: Data, compute, network • Work with noisy quantum computing (which will take time to mature) • DL fundamentals: Representation, learning & inference • Rep = data rep + computational graph + symmetry • Learning as pre-training to extract as much knowledge from data as possible • Learning as on-the-fly inference (Bayesian, hypernetwork/fast weight) • Extreme inference = dynamic computational graph on-the-fly. 3/07/2023 151
  • 152. Continuation (2) • DL applications • Data-rich & data-poor • Cognitive domains (vision, NLP) • Improve manufacturing • Accelerate science 3/07/2023 152
  • 153. Expansion/optimisation • New inductive biases (for vision, NLP, living things, science, social AI, ethical AI) • Cutting the statistical/associative short-cuts • Shifting from feature space to function space. • Pushing for high-level analogy (rather than just feature-based kernel/template matching) • Binding, indirection, symbols • Injection of knowledge into models. 3/07/2023 153
  • 154. Expansion (2) • Expanding to classical AI areas (planning, reasoning, knowledge representation, symbol manipulation). • Needs to solve symbol grounding for that to happen. • Physics-informed neural networks (e.g., my work in Covid-19 forecasting) • Social dimensions, human-in-the-loop 3/07/2023 154
  • 155. Industrialisation: Scaling - success formula thus far Data + knowledge + compute + generic scalable algorithms 3/07/2023 155
  • 156. Scaling - Rich Sutton’s Bitter Lesson (2019) 3/07/2023 156 “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. ” http://www.incompleteideas.net/IncIdeas/BitterLesson.html “The two methods that seem to scale arbitrarily in this way are search and learning.”
  • 157. DeepMind: Scale (up) is enough 3/07/2023 157
  • 158. But … • Scaling is like building a taller ladder to get to the Moon. • We need a rocket and the science of escape velocity. • The human brain is big (1e+14 synapses) but does exactly the opposite: it maximizes entropy reduction using minimum energy (think of the most efficient heat engine). • Just 20W is enough for human-level intelligence! • => Must use different principles rather than just (sample-inefficient) statistics! • No need to take the computer's detour: analog -> digital/sequential -> parallel analog simulation. 3/07/2023 158
  • 159. DL is part of Broad AI 3/07/2023 159 Hochreiter, S., 2022. Toward a broad AI. Communications of the ACM, 65(4), pp.56-57.
  • 160. DL is part of Integrated Intelligence LeCun’s plan 3/07/2023 160 https://ai.facebook.com/blog/yann-lecun-advances-in-ai-research/ Knowledge?
  • 162. DL “accidental” history 3/07/2023 162 Source: rikochet_band 1950s: Rosenblatt wired the first trainable perceptron, hyping AI up. 1970-1980s: Minsky and Papert almost killed it until Rumelhart et al. worked out high-school math to train multi-layer perceptrons. 1980-1990s: LeCun managed to get CNNs to work for something real. 1990s: RNNs were proved to be Turing-equivalent. Schmidhuber got excited and bombarded the field with lots of cool ideas. 1990s-2000s: But the models were shallow and hard to train. Almost no one worked on it for 2 decades until the Canadian mafia fought back with new tricks to train deeper models. 2010s: Accidentally, DL took off like a rocket, thanks to gamers. 2020s: Now DL works on everything, except for: small data, shifted data, noisy data, artificially twisted data, deep stuffs, exact stuffs, abstract stuffs, causal stuffs, symbolic stuffs, thinking stuffs, and stuffs that no one knows how they work like consciousness. 2020s: DL believers got rich, and a new bunch of students got over-trained.
  • 163. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 163 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 164. Final words • Deep neural networks are here to stay, maybe as part of the holistic solution to human-level AI. • Gradient-based learning is still without parallel. • DL will be much more general/universal/versatile (e.g., dynamic architectures, with the Transformer as a relaxed approximation). • Higher cognitive capabilities will be there, maybe with symbol-manipulation capacity. • Better generalization capability (e.g., extreme generalization). • We have to deal with the consequences of its own success. • Negative effects; Jevons' paradox. • DL is now an industry, and is still going strong. But students may be over-fitted to particular DL ways of thinking. • The industry will need to keep the highly trained (overfitted) DL workforce busy! 3/07/2023 164
  • 165. Second bitter lesson Little priors (innateness?) + lots of experiments > strong priors (theory of intelligence) + trying to prove it. => Chomsky would disagree here. 3/07/2023 165 Source: QuestionPro