Deep learning and reasoning:
Recent advances
3/07/2023 1
A/Prof Truyen Tran
Deakin University
@truyenoz
truyentran.github.io
truyen.tran@deakin.edu.au
letdataspeak.blogspot.com
goo.gl/3jJ1O0
RADL Summer School 2023
3/07/2023 2
Cartoonist Zach Weinersmith, Science:
Abridged Beyond the Point of
Usefulness, 2017
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 3
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
2012
2016
Turing Awards 2018
11 years snapshot
Picture taken from Bommasani et al, 2021
Source: @walidsaba
2023
3/07/2023 5
“[By 2023] …
Emergence of the
generally agreed upon
"next big thing" in AI
beyond deep learning.”
Rodney Brooks
rodneybrooks.com
“[…] general-purpose computer
programs, built on top of far richer
primitives than our current
differentiable layers—[…] we will
get to reasoning and abstraction,
the fundamental weakness of
current models.”
Francois Chollet
blog.keras.io
“Software 2.0 is written in
neural network weights”
Andrej Karpathy
medium.com/@karpathy
Why (still) DL in 2023?
Practical
• Generality: Applicable to many
domains.
• Competitive: DL is hard to beat as
long as there are data to train.
• Scalability: DL is better with more
data, and it is very scalable.
Theoretical
Expressiveness: Neural nets
can approximate any function.
Learnability: Neural nets are
trained easily.
Generalisability: Neural nets
generalize surprisingly well to
unseen data.
3/07/2023 7
ICLR 2023
Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 8
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
y = f(x; W)
3/07/2023 9
Machine learning in a nutshell
• Most machine learning tasks reduce to
estimating a mapping f from x to y
• The estimation is more accurate with more
experiences, e.g., seeing more pair (x,y) in
training data.
• The mapping f is often parameterized by W.
• When y is a token/scalar/vector/tensor ->
prediction task.
• When y is a program ->
translation/synthesis task.
• When y is an intermediate form ->
representation learning.
❖ Much of ML is in specifying x,
a.k.a feature engineering.
❖ Much of DL is to specify
skeleton of W, a.k.a
architecture engineering.
❖ Much of LLMs is to specify x
again, but with fixed W, a.k.a
prompt engineering.
1980s: Parallel Distributed Processing
• Information is stored in many places
(distributed)
• Activations are sparse (enabling
selectivity and invariance)
• Factors of variation can be coded
efficiently
• Popular these days: Word & doc
embedding (word2vec, glove,
anything2vec)
Credit: Geoff Hinton
Symbolic vs. Distributed Representations
• Symbolic Representation
• Distributed Representation
(Figure: example entities (Megan_Rapinoe, Ian_McKellen, Play, Game) shown first as discrete symbols, then as distributed vector embeddings.)
Slide credit: Pacheco & Goldwasser, 2021
Deep models via layer stacking
Theoretically powerful, but limited in practice
Integrate-and-fire neuron
andreykurenkov.com
Feature detector
Block representation
3/07/2023 12
http://torch.ch/blog/2016/02/04/resnets.html
Practice
Shorten path length with skip-connections
Easier information and gradient flows
3/07/2023 13
http://qiita.com/supersaiakujin/items/935bbc9610d0f87607e8
Theory
Sequence model with recurrence
Assume the stationary world
Classification
Image captioning
Sentence classification
Neural machine translation
Sequence labelling
Source: http://karpathy.github.io/assets/rnn/diags.jpeg
3/07/2023 14
Spatial model with convolutions
Assume filters/motifs are translation
invariant
http://colah.github.io/posts/2015-09-NN-Types-FP/
Learnable kernels
andreykurenkov.com
Feature detector,
often many
Convolutional networks
Summarizing filter responses, destroying
locations
adeshpande3.github.io
3/07/2023 16
Operator on sets/bags: Attentions
Not everything is created equal for a goal
• Need attention model to select or
ignore certain computations or inputs
• Can be “soft” (differentiable) or “hard”
(requires RL)
• Attention provides a short-cut → long-
term dependencies
• Also encourages sparsity if done right!
http://distill.pub/2016/augmented-rnns/
Why attention?
• Visual attention in human: Focus on specific
parts of visual inputs to compute the
adequate responses.
• Examples:
• We focus on objects rather than the background
of an image.
• We skim text by looking at important words.
• In neural computation, we need to select
the most relevant pieces of information and
ignore the rest (a minimal attention sketch follows below)
Slide credit: Trang Pham
Photo: programmersought
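Below is a minimal numpy sketch of soft (differentiable) attention, i.e., scaled dot-product attention over a set of inputs; the shapes and the toy data are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """Soft attention: weight each value by how well its key matches the query."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)        # (n,) similarity of the query to each key
    weights = softmax(scores)                 # (n,) differentiable selection
    return weights @ values, weights          # weighted sum of values + the weights themselves

# Toy example: 4 input items with 8-dim keys/values, one 8-dim query.
rng = np.random.default_rng(0)
keys, values, query = rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), rng.normal(size=8)
context, w = attend(query, keys, values)
print(w)   # the weights sum to 1; large weights mark the "attended" items
```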
Transformer
Slide credit: Adham Beykikhoshk
• Tokenization
• Token encoding
• Position coding
• Sparsity
• Exploit spatio-
temporal structure
Transformer: Key ideas
• Use self-similarity to refine token’s representation (embedding).
• “June is happy” -> June is represented as a person’s name.
• Hidden contexts are borrowed from other sentences that share
tokens/motifs/patterns, e.g., “She is happy”, “Her name is June”, etc.
• Akin to retrieval: matching query to key.
• Context is simply other tokens co-occurring in the same text segment.
• Related to “co-location”.
• How big is context? → Small window, a sentence, a paragraph, the whole doc.
• What about relative position? → Positional encoding.
3/07/2023 20
Positional Encoding
• The Transformer relaxes the sequentiality of data
• Positional encoding to embed sequential order in model
Slide credit: Adham Beykikhoshk
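A minimal sketch of the sinusoidal positional encoding from the original Transformer paper (Vaswani et al., 2017), added to token embeddings so that order information survives the otherwise permutation-invariant attention; the sizes below are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(num_positions)[:, None]               # (P, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # (P, d/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(num_positions=128, d_model=64)
print(pe.shape)   # (128, 64); this is added element-wise to the token embeddings
```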
Theory: Transformers are (new) Hopfield networks
3/07/2023 22
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint
arXiv:2008.02217 (2020).
Speed up: Vanilla Transformers are not efficient
Slide credit: Hung Le
Speed up: Efficient Transformers
3/07/2023 24
Tay, Yi, et al. "Efficient transformers: A survey." arXiv
preprint arXiv:2009.06732 (2020).
Speed up: Kernelization and associative tricks
Same index,
reusable sum
Reduce
complexity
The idea links back to
Efficient Attention: Attention with Linear Complexities by Shen et al., 2018.
Slide credit: Hung Le
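A sketch of the kernelization trick behind linear attention (in the spirit of Shen et al., 2018 and later work): with a positive feature map φ, softmax(QKᵀ)V is approximated by φ(Q)(φ(K)ᵀV), so the key-value sum is computed once and reused by every query, reducing complexity from O(n²) to O(n). The elu-based feature map is one common choice, used here only for illustration.

```python
import numpy as np

def phi(x):
    """A positive feature map, elu(x) + 1: one common choice in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Approximate attention in O(n): the key/value sums are shared by all queries."""
    Qf, Kf = phi(Q), phi(K)                    # (n, d)
    kv = Kf.T @ V                              # (d, d_v): same index, reusable sum
    z = Kf.sum(axis=0)                         # (d,):     shared normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]       # (n, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
print(linear_attention(Q, K, V).shape)         # (6, 3)
```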
Computation verification
Slide credit: Hung Le
Fast weights | HyperNet
The model world is recursive
• Early ideas in early 1990s by Juergen Schmidhuber and collaborators.
• Data-dependent weights | Using a controller to generate weights of the
main net.
3/07/2023 27
Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
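A minimal sketch of the hypernetwork / fast-weight idea: a small controller maps a task or context embedding to the weights of the main net, so the weights become data-dependent. All sizes and names are illustrative, not from the Ha et al. paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_task = 8, 4, 3

# Controller ("hypernetwork"): maps a task/context embedding to the main net's parameters.
W_hyper = rng.normal(scale=0.1, size=(d_task, d_in * d_out + d_out))

def main_net(x, task_embedding):
    """Main net whose weights are generated on the fly (data-dependent weights)."""
    params = task_embedding @ W_hyper                   # flat parameter vector
    W = params[: d_in * d_out].reshape(d_in, d_out)     # generated weight matrix
    b = params[d_in * d_out:]                           # generated bias
    return np.tanh(x @ W + b)

x = rng.normal(size=d_in)
print(main_net(x, task_embedding=np.array([1.0, 0.0, 0.0])))
print(main_net(x, task_embedding=np.array([0.0, 1.0, 0.0])))   # same input, different weights
```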
Neural networks vs Electronic circuits
• Computational graph → Circuit
• Compositionality → Modular design
• Neuron as feature detector → SENSOR, FILTER
• Multiplicative gates → AND gate, Transistor,
Resistor
• Attention mechanism → SWITCH gate
• Memory + forgetting → Capacitor + leakage
• Skip-connection → Short circuit
3/07/2023 28
Module composition
The system is modular, composable
3/07/2023 29
Source: https://www.ruder.io/modular-deep-learning/
Neural architecture search
When design is cheap and non-creative
• The space is huge and discrete
• Can be done through meta-heuristics (e.g., genetic algorithms) or
Reinforcement learning (e.g., one discrete change in model structure
is an action).
3/07/2023 30
Bello, Irwan, et al. "Neural optimizer search with reinforcement learning." arXiv preprint arXiv:1709.07417 (2017).
Neural networks design goals
•Capture long-term
dependencies in time and
space
•Capture invariances
natively
•Capture equivariance
3/07/2023 31
• Expressivity
• Scalability
• Reusability/modularity
• Compositionality
• Universality
Neural networks design goals (2)
3/07/2023 32
• Easy to train / learnability
• Use (almost) no labels => Unsupervised learning
• Resource adaptive
• Ability to extrapolate => Must go beyond surface statistics
• Support fast and slow learning (Complementary learning)
• Support fast and slow inference (Dual system theory)
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 33
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Graph Structures in real world – Network Science
Internet
Social networks
World wide web
Communication Citations Biological networks
credit: Jure Leskovec
Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
#REF: Penmatsa, Aravind, Kevin H. Wang,
and Eric Gouaux. "X-ray structure of
dopamine transporter elucidates
antidepressant
mechanism." Nature 503.7474 (2013): 85-
90.
Biology, pharmacy &
chemistry, materials
• Molecule/crystal as graph:
atoms as nodes, chemical
bonds as edges
• Computing molecular
properties
• Chemical-chemical
interaction
• Chemical reaction
3/07/2023 35
Gilmer, Justin, et al. "Neural message passing for quantum
chemistry." arXiv preprint arXiv:1704.01212 (2017).
Scene graphs as intermediate representation for image
captioning
Yao et al. Exploring Visual Relationship for Image Captioning, ECCV 2018
Fei-Fei Li, Ranjay Krishna, Danfei Xu
GNN in videos: Space-time region graphs
(Abhinav Gupta et al, ECCV’18)
Transformer is a special type of GNN
3/07/2023 38
Image credit: Chaitanya Joshi
chain-like wiring
patterns
LeNet
AlexNet
VGGNet
The evolution of graph structures in modern
NN design (Unintentional!)
multiple wiring paths
Inception
ResNet
DenseNet
ResNeXt
Credit: Saining Xie
Natural evolution of representing the world
• Vector → Embedding, MLP
• Sequence → RNN (LSTM, GRU)
• Grid → CNN (AlexNet, VGG, ResNet, EfficientNet, etc)
• Set → Word2vec, Attention, Transformer
• Graph → GNN (node2vec, DeepWalk, GCN, Graph Attention Net,
Column Net, MPNN etc)
• ResNet is a special case of GNN on grid!
• Transformer is a special case of GNN on fully connected graph.
3/07/2023 40
• Graphs are pervasive
in many scientific
disciplines.
• The sub-area of graph
representation has
reached a certain
maturity, with
multiple reviews,
workshops and papers
at top AI/ML venues.
3/07/2023 41
GNN in research
Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
Deep Graph Learning: Foundations, Advances and
Applications
Graph Neural Network as a solution
Graph Neural Network
Graph/Node
Representation
Applications
Node
Classification
Link Prediction
Community
Detection
Graph
Generation
………
Neural network model that can deal with graph data.
Yu Rong, Wenbing Huang, Tingyang Xu, Hong Cheng, Junzhou
Huang 2020
Two Main Operations in GNN
43
Graph Filtering
Graph Filtering
Graph filtering refines the node features
Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
Two Main Operations in GNN
44
Graph Pooling
Graph Pooling
Graph pooling generates a smaller graph
Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
General GNN Framework
45
… …
…
𝐵1 𝐵𝑛
Filtering Layer Activation Pooling Layer (Optional)
Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
Generalizing 2D convolutions to Graph Convolutions
- Graph convolutions involve similar local
operations on nodes.
- Nodes are now object representations
and not activations
- The ordering of neighbors should not
matter.
- The number of neighbors should not
matter.
- N(i) are the neighbors of node i
- Attention can be employed for edge
selection
Kipf & Welling (ICLR 2017)
Fei-Fei Li, Ranjay Krishna, Danfei Xu
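A minimal numpy sketch of one graph convolution layer in the style of Kipf & Welling (2017): each node mixes its own and its neighbours' features through a symmetrically normalized adjacency, then applies a shared linear map and nonlinearity, so neither the order nor the number of neighbours matters.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt        # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)          # neighbourhood mixing + shared weights

# Toy graph: 4 nodes with 3-dim features, mapped to 2-dim features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
H, W = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
print(gcn_layer(A, H, W).shape)   # (4, 2)
```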
Generalizing GNNs through message passing
3/07/2023 47
#REF: Pham, Trang, et al. "Column Networks for Collective Classification." AAAI. 2017.
Relation graph
Generalized message passing
Message Passing Neural Net
48
(Figure: a graph with nodes v1…v8, each carrying hidden features h_i and labels l_i, exchanging messages along edges.)
Message Passing
Feature Updating
𝑀𝑘() and 𝑈𝑘() are functions to be designed
Neural Message Passing for Quantum Chemistry. ICML 2017.
Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
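A minimal sketch of one round of generalized message passing (in the spirit of Gilmer et al., 2017). The message function M and the update function U below are deliberately simple illustrative choices, not the specific ones from the paper.

```python
import numpy as np

def message_passing_step(edges, h, M, U):
    """One round: every node aggregates messages from its neighbours, then updates its feature."""
    n, d = h.shape
    messages = np.zeros((n, d))
    for (i, j) in edges:                     # undirected edge list
        messages[i] += M(h[i], h[j])         # message from j to i
        messages[j] += M(h[j], h[i])         # message from i to j
    return np.array([U(h[v], messages[v]) for v in range(n)])

# Illustrative choices: the message is the neighbour's feature, the update is a tanh of the sum.
M = lambda h_self, h_nb: h_nb
U = lambda h_self, m: np.tanh(h_self + m)

edges = [(0, 1), (1, 2), (2, 3)]
h = np.random.default_rng(0).normal(size=(4, 5))
for _ in range(3):                           # stacking rounds propagates information further
    h = message_passing_step(edges, h, M, U)
print(h.shape)   # (4, 5)
```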
Neural graph morphism
• Input: Graph
• Output: A new graph.
Same nodes, different
edges.
• Model: Graph
morphism
• Method: Graph
transformation policy
network (GTPN)
3/07/2023 49
Kien Do, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for
Chemical Reaction Prediction." KDD’19.
Neural graph recurrence
• Graphs that represent interaction between entities through
time
• Spatial edges are node interaction at a time step
• Temporal edges are consistency relationship through time
Challenges
• The addition of temporal edges makes the graphs
bigger and more complex
• Relying on context specific constraints to reduce the
complexity by approximations
• Through time, structures of the graph may change
• Hard to solve, most methods model short sequences to
avoid this
ASSIGN: Asynchronous, Sparse Interaction Graph
Network
(Morais et al, 2021 @ A2I2, Deakin – CVPR’21)
3/07/2023 52
GraphRNN to generate graphs
• A case of graph
dynamics: nodes
and edges are
added
sequentially.
• Solve tractability
using BFS
3/07/2023 53
You, Jiaxuan, et al.
"GraphRNN: Generating
realistic graphs with deep
auto-regressive
models." ICML (2018).
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 54
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Representation learning, a bit of history
•“Representation is the use of signs that stand in
for and take the place of something else”
It has been a goal of neural networks since the 1980s and the current
wave of deep learning (2005-present) → Replacing feature engineering
Between 2006-2012, many unsupervised learning models with varying
degree of success: RBM, DBN, DBM, DAE, DDAE, PSD
Between 2013-2018, most models were supervised, following AlexNet
Since 2018, unsupervised learning has become competitive (with
contrastive learning, self-supervised learning, BERT)!
3/07/2023 55
Criteria for a good representation
• Separates factors of variation (aka disentanglement), which are
linearly correlated with desired outputs of downstream tasks.
• Provides abstraction that is invariant against deformations and
small variations.
• Is distributed (one concept is represented by multiple units), which
is compact and good for interpolation.
• Optionally, offers dimensionality reduction.
• Optionally, is sparse, giving room for emerging symbols.
3/07/2023 56
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new
perspectives." IEEE transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
Why neural unsupervised learning?
• Neural nets have representational richness:
• FFNs are function approximators
• RNNs are program approximators: they can estimate a program's behaviour and generate strings
• CNNs capture translation invariance
• Transformers are powerful contextual encoders
• Compactness: Representations are (sparse and) distributed.
• Essential to perception, compact storage and reasoning
• Accounting for uncertainty: Neural nets can be stochastic to model
distributions
• Symbolic representation: realisation through sparse activations and gating
mechanisms
3/07/2023 57
Generative models:
Discover the underlying process that generates
data
3/07/2023 58
Many applications:
• Text to speech
• Simulate data that are hard to obtain/share in
real life (e.g., healthcare)
• Generate meaningful sentences conditioned on
some input (foreign language, image, video)
• Semi-supervised learning
• Planning
Deep (Denoising) AutoEncoder:
Self-reconstruction of data
3/07/2023 59
Auto-encoder
Feature detector
Representation
Raw data
(optionally
with added
noise)
Reconstruction
Deep Auto-encoder
Encoder
Decoder
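A minimal sketch of a denoising autoencoder's forward pass and loss: corrupt the input, encode it into a compact representation, decode, and reconstruct the clean data. The one-layer encoder/decoder and Gaussian noise are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8                                    # data dimension, bottleneck (representation) dimension
W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))

def denoising_autoencoder_loss(x, noise_std=0.1):
    x_noisy = x + rng.normal(scale=noise_std, size=x.shape)   # corrupt the raw data
    z = np.tanh(x_noisy @ W_enc)                              # representation (feature detector)
    x_hat = z @ W_dec                                         # reconstruction
    return np.mean((x_hat - x) ** 2)                          # reconstruct the CLEAN data

x = rng.normal(size=(32, d))                    # a mini-batch of raw data
print(denoising_autoencoder_loss(x))            # W_enc, W_dec are trained by minimizing this
```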
FSDL 2022
• "Latent Diffusion" model: diffuse in
lower-dimensional latent space, then
decode back into pixel space
• Frozen CLIP ViT-L/14, trained 860M
UNet, 123M text encoder
• Trained on LAION-5B on 256 A100s for
24 days ($600K)
• FULLY OPEN-SOURCE
StableDiffusion
60
Slide credit: Karayev, 2022
Credit: kvfrans.com
Gaussian
hidden
variables
Data
Generative
net
Recognising net
Variational Autoencoder
Approximating the posterior by a neural net
• Two separate processes: generative (hidden → visible) versus
recognition (visible → hidden)
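A minimal numpy sketch of the VAE objective: the recognition net produces the mean and log-variance of the approximate posterior, a latent sample is drawn with the reparameterization trick, and the loss combines reconstruction with a KL term against the Gaussian prior. The linear encoder/decoder are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4
W_mu = rng.normal(scale=0.1, size=(d, k))
W_logvar = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))

def vae_loss(x):
    # Recognition net (visible -> hidden): parameters of q(z|x).
    mu, logvar = x @ W_mu, x @ W_logvar
    # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    # Generative net (hidden -> visible): reconstruct x from z.
    x_hat = z @ W_dec
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=1))
    # KL(q(z|x) || N(0, I)), in closed form for Gaussians.
    kl = np.mean(-0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return recon + kl

print(vae_loss(rng.normal(size=(8, d))))
```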
GAN: Generative Adversarial nets
Matching data statistics
• Instead of modeling the entire distribution of data, learns to
map ANY random distribution into the region of data, so that
there is no discriminator that can distinguish sampled data
from real data.
Any random distribution
in any space
Binary discriminator,
usually a neural
classifier
Neural net that maps
z → x
Generative adversarial networks
(Adapted from Goodfellow’s, NIPS 2014)
3/07/2023 63
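A sketch of the two objectives of the adversarial game (loosely adapted from Goodfellow et al., 2014): the discriminator is trained to separate real from generated samples, while the generator is trained to fool it (the non-saturating form is shown). The toy numbers stand in for discriminator outputs.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Discriminator: push D(x_real) -> 1 and D(G(z)) -> 0."""
    eps = 1e-8
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake):
    """Generator (non-saturating form): push D(G(z)) -> 1, i.e., fool the discriminator."""
    eps = 1e-8
    return -np.mean(np.log(d_fake + eps))

# Placeholder discriminator outputs on a real batch and on a generated batch.
d_real = np.array([0.9, 0.8, 0.95])   # D is fairly sure these are real
d_fake = np.array([0.1, 0.2, 0.05])   # D is fairly sure these are generated
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
# In training, the two losses are minimized alternately w.r.t. D's and G's parameters.
```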
BERT
Transformer that predicts its own masked
parts
• BERT is like parallel
approximate pseudo-
likelihood
• ~ Maximizing the conditional
likelihood of some variables
given the rest.
• When the number of variables is large, this converges to the MLE (maximum likelihood estimate).
3/07/2023 64
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
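A minimal sketch of the masked-token training signal behind BERT: mask a fraction of tokens at random and maximize the conditional likelihood of the masked tokens given the rest. The `dummy_encoder` is a placeholder for any bidirectional encoder; only the masking and loss bookkeeping are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK_ID = 1000, 0

def dummy_encoder(tokens):
    """Placeholder for a bidirectional encoder (e.g., a Transformer): per-position logits."""
    return rng.normal(size=(len(tokens), VOCAB))

def masked_lm_loss(tokens, mask_prob=0.15):
    tokens = np.array(tokens)
    mask = rng.random(len(tokens)) < mask_prob        # pick roughly 15% of positions
    corrupted = np.where(mask, MASK_ID, tokens)        # replace them with [MASK]
    logits = dummy_encoder(corrupted)                  # every position can look BOTH ways
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the masked tokens, conditioned on the rest.
    return -np.mean(log_probs[mask, tokens[mask]]) if mask.any() else 0.0

print(masked_lm_loss(rng.integers(1, VOCAB, size=32)))
```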
Neural
autoregressive
models:
Predict the next
step given the
history
• The keys: (a) long-term dependencies, (b) ordering, & (c)
parameter sharing.
• Can be realized using:
• RNN
• CNN: One-sided CNN, dilated CNN (e.g., WaveNet), PixelCNN
• Transformers → GPT-X family
• Masked autoencoder → MADE
• Pros: General, good quality thus far
• Cons: Slow – needs better inductive biases for scalability
3/07/2023 65
lyusungwon.github.io/studies/2018/07/25/nade/
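A minimal sketch of the autoregressive factorization p(x) = Π_t p(x_t | x_<t): each step predicts the next token from the history only, with the same parameters shared across steps. The random `next_token_probs` is a deliberately trivial stand-in for an RNN, one-sided CNN or Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def next_token_probs(history):
    """Stand-in for an RNN/CNN/Transformer: returns a normalized p(x_t | x_<t)."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def autoregressive_nll(sequence):
    """Negative log-likelihood under the chain rule: sum_t -log p(x_t | x_<t)."""
    nll = 0.0
    for t in range(1, len(sequence)):
        p = next_token_probs(sequence[:t])   # condition only on the past: ordering matters
        nll -= np.log(p[sequence[t]])
    return nll

print(autoregressive_nll(rng.integers(0, VOCAB, size=20)))
```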
FSDL 2022
• Generative Pre-trained Transformer
• Decoder-only (uses masked self-attention)
• Trained on 8M web pages, largest model is 1.5B
GPT / GPT-2 (2019)
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
66
Slide credit: Karayev, 2022
Contrastive
learning:
Comparing
samples
3/07/2023 67
Le-Khac, Phuc H., Graham Healy, and
Alan F. Smeaton. "Contrastive
Representation Learning: A Framework
and Review." arXiv preprint
arXiv:2010.05113 (2020).
• 400M image-text pairs
crawled from the Internet
• Transformer to encode
text, ResNet or Visual
Transformer to encode
image
• Contrastive training:
maximize cosine similarity
of correct image-text pairs
(32K pairs per batch)
79
CLIP: Image-pair vs the rest
https://arxiv.org/pdf/2103.00020.pdf
Slide credit: Karayev, 2022
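A minimal sketch of the symmetric contrastive (InfoNCE-style) objective used by CLIP: within a batch, each image embedding should be most similar to its own caption embedding, and vice versa. The random embeddings are placeholders for the outputs of the image and text encoders.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; the diagonal pairs are the targets."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)   # unit-normalize
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                               # (B, B) similarity matrix
    targets = np.arange(len(img))                                    # image i matches text i
    loss_i2t = -np.mean(log_softmax(logits)[targets, targets])
    loss_t2i = -np.mean(log_softmax(logits.T)[targets, targets])
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
print(clip_contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```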
Unsupervised
learning: A few
more points
• No external labels, but rich training signals (thousands of bits per sample, as opposed to a
few bits in supervised learning). A few techniques:
• Compressing data as much as possible with little loss
• Energy-based, i.e., pull down the energy of observed data, pull up everywhere else
• Filling the missing slots (aka predictive learning, self-supervised learning)
• We have not covered unsupervised learning on graphs (e.g., DeepWalk, GPT-GNN), but
the general principles should hold.
• Question: Multiple objectives, or no objective at all?
• Question: Emergence from many simple interacting elements?
3/07/2023 69
Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." arXiv preprint arXiv:2006.08218 (2020).
Assran, Mahmoud, et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive
Architecture." arXiv preprint arXiv:2301.08243 (2023).
Picture taken from (Bommasani et al, 2021)
A Tipping Point: Foundation Models
70
• A foundation model is a model trained at broad scale that can be adapted to a wide range of downstream tasks
• Scale and the ability to
perform tasks beyond
training
Slide credit: Samuel Albanie, 2022
Slide credit: Chris Ré, Stanford, 2022
word2vec
2013
Two key ideas underpin foundation models
Emergence
•system behaviour is implicitly induced rather than explicitly constructed
•cause of scientific excitement and anxiety of unanticipated consequences
Homogenisation
•consolidation of methodology for building machine learning system across many applications
•provides strong leverage for many tasks, but also creates single points of failure
Slide credit: Samuel Albanie, 2022
Homogenisation
Learning instead of algorithm: Many applications can be powered by the
same learning algorithm.
• => Feature engineering
Deep architecture engineering: Instead of hand-crafting features, the same
architecture could be used widely.
• => Architecture engineering
Modern Transformer is universal: Same architecture, just different data!
• => Data & Prompt engineering
Slide credit: Samuel Albanie, 2022
3/07/2023 74
(Figure: the Deepr pipeline: (1) a medical record as visits/admissions separated by time gaps, converted into phrases with time-gap/transfer tokens; (2) word-vector embedding; (3) convolution for motif detection; (4) max-pooling into a record vector; (5) prediction at the prediction point.)
Homogenisation-Deepr
Nguyen, Phuoc, Truyen Tran,
Nilmini Wickramasinghe, and
Svetha Venkatesh. Deepr: a
convolutional net for medical
records." IEEE journal of
biomedical and health
informatics 21, no. 1 (2016): 22-30.
Concept: Stringify() – everything as a string
3/07/2023 75
Credit: AvePoint
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 76
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
1960s-1990s
▪ Hand-crafting rules, domain-
specific, logic-based
▪ High in reasoning
▪ Can’t scale.
▪ Fail on unseen cases.
3/07/2023
77
2020s-2030s
 Learning + reasoning, general
purpose, human-like
 Has contextual and common-
sense reasoning
 Requires less data
 Adapt to change
 Explainable
1990s-2020s
 Machine learning, general
purpose, statistics-based
 Low in reasoning
 Needs lots of data
 Less adaptive
 Little explanation
Photo credit: DARPA
From ML to Machine Reasoning
3/07/2023 78
(Figure: object detection of coloured shapes (cylinders, cubes, spheres in cyan, brown, orange, red) followed by a reasoning stage.)
Slide credit: Tin Pham
What is missing in deep
learning?
• Modern neural networks are good at
interpolating
→ Data hungry to cover all variations and smooth
local manifolds
→Little systematic generalization (novel
combinations)
• Lack of human-perceived reasoning capability
• Lack of logical inference
• Lack of natural mechanism to incorporate prior
knowledge, e.g., common sense
• No built-in causal mechanisms
3/07/2023 79
Machine reasoning
Reasoning is concerned with arriving at a deduction
about a new combination of circumstances.
Reasoning is to deduce new knowledge from
previously acquired knowledge in response to a
query.
3/07/2023 80
Leslie Valiant
Leon Bottou
Machine reasoning
• Two-part process
• manipulate previously acquired knowledge
• to draw novel inferences or answer new questions
• Example:
• Premise:
• A is to the left of B
• B is to the left of C
• D is in front of A
• E is in front of C
• Conclusion: what is the relation between D and E?
3/07/2023 81
Slide credit: Tin Pham
Geometry example
3/07/2023 82
Premise
• AM = MN (1)
• BM = MC (2)
• ∠AMB = ∠NMC (3)
Solution:
From (1), (2), (3)
➔△AMB = △NMC (4)
➔AB = CN
From (1), (2) ➔ ABNC is
a parallelogram (5)
→ AB // CN
Existing
knowledge
Conclusion
• AB = CN?
• AB // CN?
Slide credit: Tin Pham
Is reasoning always formal/logical?
3/07/2023 83
Bottou, Léon. "From machine learning to machine
reasoning." Machine learning 94.2 (2014): 133-149.
Leon Bottou
• “When we observe a visual scene, when we hear a complex
sentence, we are able to explain in formal terms the
relation of the objects in the scene, or the precise meaning
of the sentence components.
• However, there is no evidence that such a formal analysis
necessarily takes place: we see a scene, we hear a
sentence, and we just know what they mean.
• This suggests the existence of a middle layer, already a
form of reasoning, but not yet formal or logical.”
Why not just neural reasoning?
Central to reasoning is composition rules to guide the combinations of modules to
address new tasks
Bottou:
• Reasoning is not necessarily achieved by making logical inferences
• There is a continuity between [algebraically rich inference] and [connecting
together trainable learning systems]
→Neural networks are a plausible candidate!
→But still not natural to represent abstract discrete concepts and relations.
Hinton/Bengio/LeCun: Neural networks can do everything!
The rest: Not so fast! => Neurosymbolic systems!
3/07/2023 84
Bottou, Léon. "From machine learning to machine
reasoning." Machine learning 94.2 (2014): 133-149.
Learning to reason
• Learning is to improve itself by experiencing ~ acquiring
knowledge & skills
• Reasoning is to deduce knowledge from previously acquired
knowledge in response to a query (or cues)
• Learning to reason is to improve the ability to decide if a
knowledge base entails a predicate.
• E.g., given a video f, determine whether the person with the hat turns
before singing.
• Hypotheses:
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
• It minimises an energy function, or maximises the compatibility
between input (prompt) and output.
3/07/2023 85
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM
(JACM) 44.5 (1997): 697-725.
(Dan Roth; ACM
Fellow; IJCAI John
McCarthy Award)
Reasoning as a skill
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
• Attention & transformers.
• Dynamic neural networks, conditional computation & differentiable programming.
• Reasoning as iterative representation refinement & query-driven program
synthesis and execution.
• Compositional attention networks.
• Reasoning as Neural module networks.
3/07/2023 86
Practical setting:
(query,database,answer) triplets
• Classification: Query = what is this? Database = data.
• Regression: Query = how much? Database = data.
• QA: Query = NLP question. Database = context/image/text.
• Multi-task learning: Query = task ID. Database = data.
• Zero-shot learning: Query = task description. Database = data.
• Drug-protein binding: Query = drug. Database = protein.
• Recommender system: Query = User (or item). Database = inventories (or user
base);
3/07/2023 87
The two approaches to neural reasoning
• Implicit chaining of predicates through recurrence:
• Step-wise query-specific attention to relevant concepts & relations.
• Iterative concept refinement & combination, e.g., through a working memory.
• Answer is computed from the last memory state & question embedding.
• Explicit program synthesis:
• There is a set of modules, each performing a pre-defined operation.
• The question is parsed into a symbolic program.
• The program is implemented as a computational graph constructed by chaining
separate modules.
• The program is executed to compute an answer.
3/07/2023 88
MACNet: Composition-
Attention-Control
(reasoning by progressive
refinement of selected data)
3/07/2023 89
Hudson, Drew A., and Christopher D. Manning.
"Compositional attention networks for machine
reasoning." arXiv preprint arXiv:1803.03067 (2018).
LOGNet: Relational object reasoning with language
binding
90
• Key insight: Reasoning is chaining of relational predicates to arrive
at a final conclusion
→ Needs to uncover spatial relations, conditioned on query
→ Chaining is query-driven
→ Objects/language needs binding
→ Object semantics is query-dependent
→ Everything is end-to-end differentiable
Thao Minh Le, Vuong Le, Svetha Venkatesh, and
Truyen Tran, “Dynamic Language Binding in
Relational Visual Reasoning”, IJCAI’20.
91
LOGNet for VQA
Thao Minh Le, Vuong Le,
Svetha Venkatesh, and
Truyen Tran, “Dynamic
Language Binding in
Relational Visual
Reasoning”, IJCAI’20.
Visual QA in action
What is about Transformer?
• Reasoning as (free-) energy minimisation
• The classic Belief Propagation algorithm is minimization algorithm
of the Bethe free-energy!
• Transformer performs relational, iterative state refinement, which makes
it a great candidate for implicit relational reasoning.
3/07/2023 93
Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free
energy." Advances in neural information processing systems. 2003.
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint
arXiv:2008.02217 (2020).
3/07/2023 94
http://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/
On SQuAD, Answer = start/end positions
Module networks
(reasoning by constructing and executing neural programs)
• Reasoning as laying out
modules to reach an
answer
• Composable neural
architecture → question
parsed as program (layout
of modules)
• A module is a function (x
→ y), could be a sub-
reasoning process ((x, q)
→ y).
3/07/2023 95
https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
Program execution
• Work on object-based visual
representation
• An intermediate set of objects is represented by a vector, i.e., an attention
mask over all objects in the scene. For example, Filter(Green_cube) outputs a
mask (0,1,0,0).
• The output mask is fed into the next module (e.g., Relate).
96
Source: @rao2z
What is about reasoning in LLMs?
• LLMs have HUGE associative memory.
• With “Let’s think step-by-step”?
• With “Chain of Thought”?
• Or is it just pattern recognition over chains of
reasoning?
• Finding short-cuts to approximate provably
correct reasoning procedure.
• => Very poor OOD generalisation.
3/07/2023 97
A general framework
3/07/2023 98
Explicit Knowledge Graphs
+
Large Language Models
(implicit common sense knowledge,
associative database)
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 99
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
3/07/2023 100
Learning a Turing
machine
→ Can we learn a (neural)
program that learns to
program from data?
Memory networks • Input is a set → Load into memory,
which is NOT updated.
• State is a RNN with attention reading
from inputs
• Concepts: Query, key and content +
Content addressing.
• Deep models, but constant path length
from input to output.
• Equivalent to a RNN with shared input
set.
• => Seq2seq with attention is a Memory
Network (Memory = input seq).
• => Transformer is a kind of Memory
Network with Parallel Memory Update!
3/07/2023 101
Sukhbaatar, Sainbayar, Jason Weston, and Rob
Fergus. "End-to-end memory networks." Advances in
neural information processing systems. 2015.
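A minimal sketch of content-based addressing in an end-to-end memory network (in the spirit of Sukhbaatar et al., 2015): the memory is loaded once from the input set and never updated, while the controller state attends over it for several hops. Sizes and the state-update rule are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(query, mem_keys, mem_values):
    """Content addressing: match the query against the keys, read a blend of the contents."""
    weights = softmax(mem_keys @ query)          # (n_slots,)
    return weights @ mem_values                  # (d,)

rng = np.random.default_rng(0)
mem_keys = rng.normal(size=(10, 16))             # memory loaded from the input set ...
mem_values = rng.normal(size=(10, 16))           # ... and NOT updated afterwards
state = rng.normal(size=16)                      # controller / query state

for hop in range(3):                             # multiple hops = multiple reasoning steps
    read = memory_read(state, mem_keys, mem_values)
    state = np.tanh(state + read)                # illustrative state update
print(state.shape)
```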
MANN: Memory-Augmented Neural Networks
(a constant path length)
• Long-term dependency
• E.g., outcome depends on the far past
• Memory is needed (e.g., as in LSTM)
• => This is what makes Transformers powerful!
• Complex program requires multiple computational steps
• Each step can be selective (attentive) to certain memory cell
• Operations: Encoding | Decoding | Retrieval
MANN: Neural Turing machine (NTM)
(simulating a differentiable Turing machine)
• A controller that takes
input/output and talks to an
external memory module.
• Memory has read/write
operations.
• The main issue is where to write,
and how to update the memory
state.
• All operations are differentiable.
Source: rylanschaeffer.github.io
3/07/2023 104
NTM unrolled in time with LSTM as controller
#Ref: https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315
MANN for reasoning
• Three steps:
• Store data into memory
• Read query, process sequentially, consult memory
• Output answer
• Behind the scene:
• Memory contains data & results of intermediate steps
• Drawbacks of current MANNs:
• No memory of controllers → Less modularity and
compositionality when query is complex
• No memory of relations → Much harder to chain predicates.
3/07/2023 105
Source: rylanschaeffer.github.io
Failures of item-only MANNs for
reasoning
• Relational representation is NOT stored → Can’t reuse later in the
chain
• A single memory of items and relations → Can’t understand how
relational reasoning occurs
• The memory-memory relationship is coarse since it is represented as
either dot product, or weighted sum.
3/07/2023 106
Self-attentive associative memories (SAM)
Learning relations automatically over time
3/07/2023 107
Hung Le, Truyen Tran, Svetha Venkatesh, “Self-
attentive associative memory”, ICML'20.
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 108
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Neural nets are
powerful but we still
want:
• Learning with less data and zero-shot
learning;
• Generalization of the solutions to
unseen tasks and unforeseen data
distributions;
• Explainability by construction;
3/07/2023 109
https://ibm.github.io/neuro-symbolic-ai/events/ns-
workshop2023
Self-Aware Learning
• Deeper learning for challenging tasks
• Integrating continuous and symbolic
representations
• Diversified learning modalities
Credit: Yolanda Gil, Bart Selman
AI to Understand Human
Intelligence
• 5 years: AI systems could be designed to
study psychological models of complex
intelligent phenomena that are based on
combinations of symbolic processing and
artificial neural networks.
Symbolic forms
• Words in Wordnet
• Syntax in NLP & Code
• Logic, propositional and first-order
• Variables, equations
• Knowledge structure: Semantic nets, knowledge graphs
• Graphical models: Bayesian networks, Markov random fields, Markov
logic networks.
• Function (names), indirection, pointer in C/C++.
3/07/2023 110
Henry Kautz's taxonomy (1)
• Symbolic Neural symbolic—is the current approach of many neural models in
natural language processing, where words or subword tokens are both the
ultimate input and output of large language models. Examples include BERT,
RoBERTa, and GPT-3.
3/07/2023 111
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI
Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Representing Context and Structure
Known as contextualized language models
10
Devlin et-al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” 2019
Slide credit: Pacheco & Goldwasser, 2021
What does BERT learn?
Emergent linguistic structure in artificial neural networks trained by self-supervision. PNAS Manning et-al, 2020
Linguistic structure emerges without direct supervision
Slide credit: Pacheco & Goldwasser, 2021
Using BERT for Reasoning Tasks
• BERT-based near-human performance on Winograd Schema
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale. Sakaguchi et-al,
AAAI’20
Can “thinking-slow” tasks be accomplished with “thinking-fast” systems?
Not a panacea (McCoy et al ACL’19, others), often relies on simple heuristics when
learning complex decisions
12
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. McCoy et-al, ACL’19
World Knowledge and
Commonsense inferences
reflected in coref
decisions
Slide credit: Pacheco & Goldwasser, 2021
Henry Kautz's taxonomy (2)
• Symbolic[Neural]—is exemplified by
AlphaGo, where symbolic techniques are
used to call neural techniques. In this case,
the symbolic approach is Monte Carlo tree
search and the neural techniques learn
how to evaluate game positions.
3/07/2023 115
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Henry Kautz's taxonomy (3)
• Neural | Symbolic—uses a neural architecture to interpret perceptual data as
symbols and relationships that are reasoned about symbolically. The Neural-
Concept Learner is an example.
3/07/2023 116
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
End-to-End Module Networks
• Construct the program internally
• The two parts are jointly learnable
3/07/2023 End-to-End Module Networks, Hu et al., ICCV'17 117
Slide credit: Vuong Le
Henry Kautz's taxonomy (4)
• Neural: Symbolic → Neural—relies on symbolic reasoning to generate or label
training data that is subsequently learned by a deep learning model, e.g., to train
a neural model for symbolic computation by using a Macsyma-like symbolic
mathematics system to create or label examples.
3/07/2023 118
Kautz, H., 2022. The third AI summer: AAAI
Robert S. Engelmore memorial lecture. AI
Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Lample, Guillaume, and François Charton. 2020.
“Deep Learning For Symbolic Mathematics.”
In Proceedings of the International Conference on
Learning Representations.
Henry Kautz's taxonomy (5)
• Neural_{Symbolic}—uses a
neural net that is generated
from symbolic rules. An
example is the Neural
Theorem Prover, which
constructs a neural network
from an AND-OR proof tree
generated from knowledge
base rules and terms. Logic
Tensor Networks also fall
into this category.
3/07/2023 119
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
Henry Kautz's taxonomy (6)
• Neural[Symbolic]—allows a
neural model to directly call a
symbolic reasoning engine, e.g.,
to perform an action or evaluate
a state. An example would be
ChatGPT using a plugin to query
Wolfram Alpha.
3/07/2023 120
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
LLMs for
calling tools
• Information retriever
• Symbolic/math module & code interpreters
• Virtual agents
• Robotic arms. See https://palm-e.github.io/
3/07/2023 121
Credit: Khattab et al
Symbols via Indirection
3/07/2023 122
Z = X + Y (values bound: Z=3, X=1, Y=2)
Bind symbols with values
Pointer in Computer Science
Information binding in the brain
https://www.linkedin.com/pulse/unsolved-problems-ai-part-2-binding-problem-eberhard-schoeneburg/
Indirection binds two objects together and uses one to refer to the other.
Slide credit: Kha Pham
Indirection is a key design principle in
software engineering
3/07/2023 123
Client
Indirectional
Layer
Target
https://medium.com/@nmckinnonblog/indirection-fba1857630e2
Indirection removes direct coupling
between units and promotes:
• Extensibility
• Control
• Evolvability
• Encapsulation of code and design
complexity
Every computer science
problem can be solved with a
higher level of indirection.
Andrew Koenig, Butler Lampson, David J. Wheeler
Slide credit: Kha Pham
Leveraging indirection to improve OOD
generalization
3/07/2023 124
Why
indirection?
Indirection binds concrete data to abstract symbols, and
reasoning on symbols is likely to improve generalization.
What
to bind?
Concrete information of data, e.g., representations,
functional relations between data, etc.
Functional
indirection
Structural
indirection
How
to bind?
During indirection, some concrete information of
data will be ignored, and thus we have to decide
what to maintain, i.e., invariances across data.
→ Indirection connects invariance and symbolic
approaches.
Slide credit: Kha Pham
Structural Indirection: InLay
3/07/2023 125
• InLay simultaneously leverages indirection and data internal relationships to
construct indirection representations, which respect the similarities between
internal relationships.
• InLay connects invariance and symbolic approaches:
• InLay constructs indirection representations from a fixed set of symbolic
vectors.
• InLay assumes two invariances:
• The data internal relationships are invariant through indirection.
• The set of symbolic vectors to compute indirection representations is
invariant across train and test samples.
Slide credit: Kha Pham Pham, K., Le, H., Ngo, M. and Tran, T., Improving Out-of-distribution
Generalization with Indirection Representations. In The Eleventh
International Conference on Learning Representations.
Structure-Mapping Theory (SMT)
3/07/2023 126
• Improves on previous theories of analogy, e.g.,
Tversky's contrast theory, which assumed that an
analogy is stronger the more attributes the base
and target share in common.
• SMT [1] argued that it is not object attributes
which are mapped in an analogy, but relationships
between objects.
(Figure: Rutherford's analogy: the Solar system is literally similar to the X12 star system, but analogous to the hydrogen atom.)
Literal similarity: many attributes mapped, many relations mapped.
Analogy: few attributes mapped, many relations mapped.
[1] Gentner, Dedre. "Structure-mapping: A theoretical framework for analogy." Cognitive science 7.2 (1983): 155-170.
Slide credit: Kha Pham
Structure-Mapping Theory (SMT) (cont.)
3/07/2023 127
Which will be chosen to be mapped in an analogy?
Systematicity Principle: A predicate that belongs to a mappable system of mutually
interconnecting relationships is more likely to be imported into the target than is an isolated
predicate.
Solar system
Distance
Attractive
force
Revolves
around
Color Temperature
Hydrogen atom
Distance
Attractive
force
Revolves
around
Color Temperature
Slide credit: Kha Pham
Model architecture
3/07/2023 128
• Concrete data representation is viewed as a complete graph
with weighted edges.
• The indirection operator maps this graph to a symbolic graph
with the same edge weights; however, the vertices are fixed and
trainable.
• This symbolic graph is propagated, and the updated node
features are the indirection representations.
Slide credit: Kha Pham
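A rough numpy sketch of the mechanism described above, under my reading of the slide (not the authors' code): pairwise relationships are computed from the concrete data, those edge weights are transferred onto a graph whose vertices are fixed, trainable symbolic vectors, and one round of propagation over that symbolic graph yields the indirection representations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 16
symbols = rng.normal(size=(n_tokens, d))      # fixed, trainable symbolic vectors (shared across samples)
W = rng.normal(scale=0.1, size=(d, d))        # propagation weights

def indirection_representation(x):
    """Edge weights come from the concrete data; node features come from the symbols."""
    sim = x @ x.T                                              # pairwise relationships of the data
    sim = sim - sim.max(axis=1, keepdims=True)
    A = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)   # normalized weighted complete graph
    return np.tanh(A @ symbols @ W)                            # propagate over the symbolic graph

x = rng.normal(size=(n_tokens, d))            # concrete token representations
print(indirection_representation(x).shape)    # (6, 16) indirection representations
```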
Experiments on IQ datasets – RAVEN dataset
3/07/2023 129
An IQ problem in RAVEN [1] dataset
Model Accuracy
LSTM 30.1/39.2
Transformers 15.1/42.5
RelationNet 12.5/46.4
PrediNet 13.8/15.6
Average test accuracies (%) without/with InLay in
different OOD testing scenarios on RAVEN
[1] Zhang, Chi, et al. "Raven: A dataset for relational and analogical visual reasoning."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
• The original paper of RAVEN dataset proposes
different OOD testing scenarios, in which models
are trained on one configuration and tested on
another (but related) configuration.
Slide credit: Kha Pham
Experiments on OOD image classification tasks
3/07/2023 130
Dog Dog?
OOD image classification,
in which test images are distorted.
• When test images are injected with different kinds
of distortions other than ones in training, deep
neural networks may fail drastically in image
classification tasks. [1]
[1] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and
Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in neural
information processing systems, 31, 2018.
Dataset ViT accuracy
SVHN 65.9/68.8
CIFAR10 38.2/43.1
CIFAR100 17.1/20.4
Average test accuracies (%) without/with InLay of Vision
Transformers (ViT) on different types of distortions
Slide credit: Kha Pham
Here “physics”
refers to
empirical or
theoretical laws
that exist in
nature.
(Chart: number of papers on physics-informed ML per year, 2015-2022, on a log scale from 1 to 100,000; trend fit R² = 0.989.)
Physics-informed NN
Integrate-and-fire neuron
andreykurenkov.com
Priors that work
• Neuron as trainable feature
detector
• Depth + Skip-connection
• Invariance/equivariance:
• Convolution (Translation)
• Recurrence (Time travel)
• Attention (Permutation)
• Analogy
• Kernel, case-based reasoning,
• Attention, memory
Feature detector
Source: http://karpathy.github.io/assets/rnn/diags.jpeg
Physics invariance
• Newton's laws
• Symmetry
• Conservation laws
• Noether’s Theorem linking symmetry and
conservation.
First page of Emmy Noether's
article "Invariante
Variationsprobleme" (1918).
Source: Wikipedia
ML, data & physics
• Data collection/annotation for ML is expensive
• ML solutions don’t respect symmetries and conservation laws
• Physics laws are universal (up to scale) | ML only generalizes in-distribution.
Karniadakis, George Em, et al. "Physics-informed machine learning." Nature Reviews Physics 3.6 (2021): 422-440.
Embedding physics into ML
https://medium.com/@zhaoshuai1989/why-do-we-need-physics-informed-machine-learning-piml-d11fe0c4436c
Physics guides neural architecture
• Physics-informed neural networks (PINN)
Figure from talk by Perdikaris & Wang, 2020.
Physics guides learning dynamics
• Physics-informed neural networks (PINN)
Figure from talk by Perdikaris & Wang, 2020.
Case study: Damped harmonic oscillation
Source: https://benmoseley.blog/my-research/so-what-is-a-physics-informed-neural-network/
Case study: COVID-19 in VN 2021
• Failed to contain the new exponential growth
due to Delta variant.
• The cost: 20 thousand lives within 3 months!!
• At the peak, the daily mortality ~ Vietnam War’s
rate.
• What worked in 2020 didn’t in 2021.
3/07/2023 139
SIR family for pandemics
• N = Population
• S = Susceptible
• I = Infectious
• R = Recovered
Source: Wikipedia
Basic reproduction number
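A minimal sketch of the standard SIR dynamics, dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI, integrated with simple Euler steps; the parameter values are illustrative (the basic reproduction number is R0 = β/γ).

```python
import numpy as np

def simulate_sir(N, I0, beta, gamma, days, dt=0.1):
    """Euler integration of dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    S, I, R = N - I0, I0, 0.0
    trajectory = []
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N * dt
        new_rec = gamma * I * dt
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        trajectory.append((S, I, R))
    return np.array(trajectory)

# Illustrative parameters: R0 = beta / gamma = 2.5.
traj = simulate_sir(N=1e7, I0=100, beta=0.25, gamma=0.1, days=200)
print("peak number of infectious people:", int(traj[:, 1].max()))
```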
Covid-19 infections
• SIR: Closed-form solutions are hard to calculate
• Parameters change over time due to intervention → Need more flexible
framework.
• Solution: Richards equation → Richards curve | Gompertz curve
• Task: 10-20 data points → Extrapolate 150 more.
Model design
• Remember often we have only 20-30 highly correlated data points to
learn from!
• Model is sum of 2-3 “waves” – each is a 3-param Gompertz curve
• Height of the peak
• Location of the peak
• Scale of the wave (the effective width)
• The number of waves accounts for the observed waves, plus some
hypothetical future waves (see the sketch after this slide).
• Model can be thought as a special neural network, each hidden unit is a
wave, but with Gompertz-based kernel.
3/07/2023 142
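A minimal sketch of the wave model described above: cumulative counts modelled as a sum of two or three Gompertz curves, each with three parameters. The parametrization below uses total wave size, peak location and scale (the peak height of the daily curve then equals size/(e·scale)); all numbers are made up for illustration, and in practice the parameters are fitted to the 20-30 observed points under priors.

```python
import numpy as np

def gompertz(t, size, peak_time, scale):
    """Cumulative Gompertz curve: size * exp(-exp(-(t - peak_time) / scale))."""
    return size * np.exp(-np.exp(-(t - peak_time) / scale))

def multi_wave(t, waves):
    """Sum of Gompertz 'waves': each hidden unit of the model is one wave."""
    return sum(gompertz(t, *w) for w in waves)

# Illustrative parameters for two waves: (total size, peak location in days, scale in days).
waves = [(50_000, 60, 12), (20_000, 120, 15)]
t = np.arange(0, 180)
cumulative = multi_wave(t, waves)
daily = np.diff(cumulative, prepend=0.0)        # daily counts = derivative of the cumulative curve
print(int(cumulative[-1]), int(daily.max()))    # projected total and peak daily count
```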
Estimating the model priors
• Impossible to know without assumptions!
• Need priors on wave size & possibly, the scale (e.g., min-max)
• One solution:
• Look for other countries, with adjustment in population size.
• Hopefully the culture, economic structure & actions are similar.
• It depends on:
• The virus variant (original != Delta != Omicron)
• Health/border capacity (closed border + lockdown in the beginning)
• Vaccination coverage (80% tended to be the threshold for opening)
• Total cases/population.
3/07/2023 143
Case of HCM City
(Chart: estimated vs. recorded Covid-19 deaths in Ho Chi Minh City, July-October 2021. Series: recorded deaths, estimated deaths, cumulative deaths (actual). Prediction made on 11/8; peak 20-21/8; total-cases curve annotated at 16/10.)
Case of Binh Duong province
(Chart: prediction made on 17/8; peak 28-30/8; total-cases curve annotated at 25/10.)
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 146
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
In 2022-2023, DL reached new heights: GPT-4,
PaLM-E, GATO, etc.
3/07/2023 147
Major remaining problems of DL
• Massive associative machine
→ Lacks a causality prior; prone to learning the wrong things, or working for the wrong
reasons.
→Overconfident for the wrong reasons (e.g., prone to adversarial attacks).
→Exploits short-cuts => poor on OOD generalisation
→Sample inefficient
→Approximate reasoning patterns, not from the first principles.
• Inference separated from learning
→No built-in adaptation other than retraining
→Catastrophic forgetting
• Limited theoretical understanding
3/07/2023 148
Are limitations inherent?
• YES, statistical systems tend to memorize data and find short-cuts.
• We need lots of data to cover all possible variations, hence lots of compute.
• But aren’t we great copiers?
• NO, neural nets were founded on the basis of distributed
representation and parallel processing. These are robust, fast and
energy efficient.
• We still need to find “binding” tricks that do all sorts of things without relying
on statistical training signals + backprop.
3/07/2023 149
Dimensions of progress
• Continuation of current works/paths
• Expansion/optimisation
• Industrialisation: Scale up & scale out
• Challenge fundamental assumptions
• DL as part of more holistic solution to Human-Level AI (HLAI)
• Dealing with the unexpected: Uncertainty, safety, security
3/07/2023 150
Continuation
• Enabling techs: Data, compute, network
• Work with noisy quantum computing (which will take time to mature)
• DL fundamentals: Representation, learning & inference
• Rep = data rep + computational graph + symmetry
• Learning as pre-training to extract as much knowledge from data as possible
• Learning as on-the-fly inference (Bayesian, hypernetwork/fast weight)
• Extreme inference = dynamic computational graph on-the-fly.
3/07/2023 151
Continuation (2)
• DL applications
• Data-rich & data-poor
• Cognitive domains (vision, NLP)
• Improve manufacturing
• Accelerate science
3/07/2023 152
Expansion/optimisation
• New inductive biases (for vision, NLP, living things, science, social AI,
ethical AI)
• Cutting the statistical/associative short-cuts
• Shifting from feature space to function space.
• Pushing for high-level analogy (rather than just feature-based
kernel/template matching)
• Binding, indirection, symbols
• Injection of knowledge into models.
3/07/2023 153
Expansion (2)
• Expanding to classical AI areas (planning, reasoning, knowledge
representation, symbol manipulation).
• Needs to solve symbol grounding for that to happen.
• Physics-informed neural networks (e.g., my work in Covid-19
forecasting)
• Social dimensions, human-in-the-loop
3/07/2023 154
Industrialisation: Scaling - success
formula thus far
Data + knowledge + compute + generic scalable algorithms
3/07/2023 155
Scaling - Rich Sutton’s Bitter Lesson (2019)
3/07/2023 156
“The biggest lesson that can be read from 70 years of AI research is
that general methods that leverage computation are ultimately the
most effective, and by a large margin. ”
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
“The two methods that seem to scale arbitrarily in this way
are search and learning.”
DeepMind: Scale (up) is enough
3/07/2023 157
But …
• Scaling is like building a taller ladder to get to the Moon.
• We need rockets and the science of escape velocity.
• The human brain is big (1e+14 synapses) but does exactly the opposite:
it maximizes entropy reduction using minimum energy (think of the
most efficient heat engine).
• Just 20W is enough for human-level intelligence!
• => We must use different principles rather than just (sample-inefficient) statistics!
• No need to take the computer's detour: analog -> digital/sequential -> parallel
analog simulation.
3/07/2023 158
DL is part of Broad AI
3/07/2023 159
Hochreiter, S., 2022.
Toward a broad AI.
Communications of
the ACM, 65(4), pp.56-
57.
DL is part of Integrated Intelligence
LeCun’s plan
3/07/2023 160
https://ai.facebook.com/blog/yann-lecun-advances-in-ai-research/
Knowledge?
Summary
3/07/2023 161
DL “accidental” history
3/07/2023 162
Source: rikochet_band
1950s: Rosenblatt wired the first trainable perceptron, hyping AI up.
1970-1980s: Minsky and Papert almost killed it until Rumelhart et al. worked out high-school
math to train multi-layer perceptron.
1980-1990s: LeCun managed to get CNN work for something real.
1990s: RNN was proved to be Turing-equivalent. Schmidhuber got excited and bombarded the
field with lots of cool ideas.
1990s-2000s: But the models were shallow and hard to train. Almost no one worked on it for 2
decades until the Canadian mafia fought back with new tricks to train deeper models.
2010s: Accidentally, DL took off like a rocket, thanks to gamers.
2020s: Now DL works on everything, except for:
small data, shifted data, noisy data, artificially twisted data, deep stuffs,
exact stuffs, abstract stuffs, causal stuffs, symbolic stuffs, thinking stuffs, and
stuffs that no one knows how they work like consciousness.
2020s: DL believers got rich, and a new bunch of students got over trained.
Differentiable programming
Neuro-symbolic systems
Neural reasoning
Post DL
What needs work
3/07/2023 163
Agenda
Overview
Neural building blocks
Graph neural networks
Unsupervised learning
What works
Final words
• Deep neural networks are here to stay, may be as a part of the holistic solution to
human-level AI.
• Gradient-based learning is still without parallel.
• DL will be much more general/universal/versatile (e.g., dynamic architectures,
of which the Transformer is a relaxed approximation)
• Higher cognitive capabilities will be there, may be with symbol manipulation
capacity.
• Better generalization capability (e.g., extreme)
• We have to deal with the consequences of its own success.
• Negative effects; Jevons paradox
• DL is now an industry, and is still going strong. But students may be over-fitted to
particular DL ways of thinking.
• The industry will need to keep the highly trained (overfitted) DL workforce busy!
3/07/2023 164
Second
bitter lesson
Little priors (innateness?) + lots of
experiments > strong priors (theory of
intelligence) + trying to prove it.
=> Chomsky would disagree here.
3/07/2023 165
Source: QuestionPro
3/07/2023 166
Credit: AvePoint
AI/ML as an empirical science
Deakin University
 
Machine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin UniversityMachine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin University
Deakin University
 
AI in the Covid-19 pandemic
AI in the Covid-19 pandemicAI in the Covid-19 pandemic
AI in the Covid-19 pandemic
Deakin University
 
Visual reasoning
Visual reasoningVisual reasoning
Visual reasoning
Deakin University
 
AI for tackling climate change
AI for tackling climate changeAI for tackling climate change
AI for tackling climate change
Deakin University
 
AI for drug discovery
AI for drug discoveryAI for drug discovery
AI for drug discovery
Deakin University
 
Deep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains IDeep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains I
Deakin University
 
Deep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains IIDeep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains II
Deakin University
 
Deep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains IIIDeep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains III
Deakin University
 
Deep learning for episodic interventional data
Deep learning for episodic interventional dataDeep learning for episodic interventional data
Deep learning for episodic interventional data
Deakin University
 
Deep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilitiesDeep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilities
Deakin University
 
Deep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining IDeep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining I
Deakin University
 
Deep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining IIDeep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining II
Deakin University
 
AI that/for matters
AI that/for mattersAI that/for matters
AI that/for matters
Deakin University
 
Representation learning on graphs
Representation learning on graphsRepresentation learning on graphs
Representation learning on graphs
Deakin University
 
Empirical AI Research
Empirical AI Research Empirical AI Research
Empirical AI Research
Deakin University
 
Deep learning for genomics: Present and future
Deep learning for genomics: Present and futureDeep learning for genomics: Present and future
Deep learning for genomics: Present and future
Deakin University
 

More from Deakin University (20)

Machine Learning and Reasoning for Drug Discovery
Machine Learning and Reasoning for Drug DiscoveryMachine Learning and Reasoning for Drug Discovery
Machine Learning and Reasoning for Drug Discovery
 
Deep learning 1.0 and Beyond, Part 2
Deep learning 1.0 and Beyond, Part 2Deep learning 1.0 and Beyond, Part 2
Deep learning 1.0 and Beyond, Part 2
 
Machine reasoning
Machine reasoningMachine reasoning
Machine reasoning
 
AI/ML as an empirical science
AI/ML as an empirical scienceAI/ML as an empirical science
AI/ML as an empirical science
 
Machine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin UniversityMachine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin University
 
AI in the Covid-19 pandemic
AI in the Covid-19 pandemicAI in the Covid-19 pandemic
AI in the Covid-19 pandemic
 
Visual reasoning
Visual reasoningVisual reasoning
Visual reasoning
 
AI for tackling climate change
AI for tackling climate changeAI for tackling climate change
AI for tackling climate change
 
AI for drug discovery
AI for drug discoveryAI for drug discovery
AI for drug discovery
 
Deep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains IDeep learning and applications in non-cognitive domains I
Deep learning and applications in non-cognitive domains I
 
Deep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains IIDeep learning and applications in non-cognitive domains II
Deep learning and applications in non-cognitive domains II
 
Deep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains IIIDeep learning and applications in non-cognitive domains III
Deep learning and applications in non-cognitive domains III
 
Deep learning for episodic interventional data
Deep learning for episodic interventional dataDeep learning for episodic interventional data
Deep learning for episodic interventional data
 
Deep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilitiesDeep learning for detecting anomalies and software vulnerabilities
Deep learning for detecting anomalies and software vulnerabilities
 
Deep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining IDeep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining I
 
Deep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining IIDeep learning for biomedical discovery and data mining II
Deep learning for biomedical discovery and data mining II
 
AI that/for matters
AI that/for mattersAI that/for matters
AI that/for matters
 
Representation learning on graphs
Representation learning on graphsRepresentation learning on graphs
Representation learning on graphs
 
Empirical AI Research
Empirical AI Research Empirical AI Research
Empirical AI Research
 
Deep learning for genomics: Present and future
Deep learning for genomics: Present and futureDeep learning for genomics: Present and future
Deep learning for genomics: Present and future
 

Recently uploaded

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 

Recently uploaded (20)

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 

Deep learning and reasoning: Recent advances

  • 1. Deep learning and reasoning: Recent advances 3/07/2023 1 A/Prof Truyen Tran Deakin University @truyenoz truyentran.github.io truyen.tran@deakin.edu.au letdataspeak.blogspot.com goo.gl/3jJ1O0 RADL Summer School 2023
  • 2. 3/07/2023 2 Cartoonist Zach Weinersmith, Science: Abridged Beyond the Point of Usefulness, 2017
  • 3. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 3 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 4. 2012 2016 Turing Awards 2018 11 years snapshot Picture taken from Bommasani et al, 2021 Source: @walidsaba 2023
  • 5. 3/07/2023 5 “[By 2023] … Emergence of the generally agreed upon "next big thing" in AI beyond deep learning.” Rodney Brooks rodneybrooks.com “[…] general-purpose computer programs, built on top of far richer primitives than our current differentiable layers—[…] we will get to reasoning and abstraction, the fundamental weakness of current models.” Francois Chollet blog.keras.io “Software 2.0 is written in neural network weights” Andrej Karpathy medium.com/@karpathy
  • 6. Why (still) DL in 2023? Practical • Generality: Applicable to many domains. • Competitive: DL is hard to beat as long as there are data to train. • Scalability: DL is better with more data, and it is very scalable. Theoretical Expressiveness: Neural nets can approximate any function. Learnability: Neural nets are trained easily. Generalisability: Neural nets generalize surprisingly well to unseen data.
  • 7. 3/07/2023 7 ICLR 2023 Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
  • 8. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 8 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 9. y = f(x; W) 3/07/2023 9 Machine learning in a nutshell • Most machine learning tasks reduce to estimating a mapping f from x to y • The estimation is more accurate with more experiences, e.g., seeing more pair (x,y) in training data. • The mapping f is often parameterized by W. • When y is a token/scalar/vector/tensor -> prediction task. • When y is a program -> translation/synthesis task. • When y is an intermediate form -> representation learning. ❖ Much of ML is in specifying x, a.k.a feature engineering. ❖ Much of DL is to specify skeleton of W, a.k.a architecture engineering. ❖ Much of LLMs is to specify x again, but with fixed W, a.k.a prompt engineering.
  • 10. 1980s: Parallel Distributed Processing • Information is stored in many places (distributed) • Activations are sparse (enabling selectivity and invariance) • Factors of variation can be coded efficiently • Popular these days: Word & doc embedding (word2vec, glove, anything2vec) Credit: Geoff Hinton
  • 11. Symbolic vs.Distributed Representations • Symbolic Representation • Distributed Representation 6 Megan_Rapinoe Ian_McKellen Play Game Game Play M egan_Rapinoe Ian_McKellen Slide credit: Pacheco & Goldwasser, 2021
  • 12. Deep models via layer stacking Theoretically powerful, but limited in practice Integrate-and-fire neuron andreykurenkov.com Feature detector Block representation 3/07/2023 12
  • 13. http://torch.ch/blog/2016/02/04/resnets.html Practice Shorten path length with skip-connections Easier information and gradient flows 3/07/2023 13 http://qiita.com/supersaiakujin/items/935bbc9610d0f87607e8 Theory
  • 14. Sequence model with recurrence Assume the stationary world Classification Image captioning Sentence classification Neural machine translation Sequence labelling Source: http://karpathy.github.io/assets/rnn/diags.jpeg 3/07/2023 14
  • 15. Spatial model with convolutions Assume filters/motifs are translation invariant http://colah.github.io/posts/2015-09-NN-Types-FP/ Learnable kernels andreykurenkov.com Feature detector, often many
  • 16. Convolutional networks Summarizing filter responses, destroying locations adeshpande3.github.io 3/07/2023 16
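To make the convolution-plus-pooling idea concrete, here is a minimal, illustrative sketch (not from the slides): a single learnable filter bank followed by max-pooling, showing how filter responses are summarised while exact locations are partly discarded. The layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)       # one RGB image (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)  # keep only the strongest response per 2x2 window

feature_maps = torch.relu(conv(x))  # (1, 16, 32, 32): local, translation-equivariant detectors
summary = pool(feature_maps)        # (1, 16, 16, 16): coarser map, precise location is lost
print(summary.shape)
```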
  • 17. Operator on sets/bags: Attentions Not everything is created equal for a goal • Need attention model to select or ignore certain computations or inputs • Can be “soft” (differentiable) or “hard” (requires RL) • Attention provides a short-cut → long- term dependencies • Also encourages sparsity if done right! http://distill.pub/2016/augmented-rnns/
  • 18. Why attention? • Visual attention in humans: focus on specific parts of the visual input to compute adequate responses. • Examples: • We focus on objects rather than the background of an image. • We skim text by looking at important words. • In neural computation, we need to select the most relevant pieces of information and ignore the rest. Slide credit: Trang Pham Photo: programmersought
  • 19. Transformer Slide credit: Adham Beykikhoshk • Tokenization • Token encoding • Position coding • Sparsity • Exploit spatio- temporal structure
  • 20. Transformer: Key ideas • Use self-similarity to refine token’s representation (embedding). • “June is happy” -> June is represented as a person’s name. • Hidden contexts are borrowed from other sentences that share tokens/motifs/patterns, e.g., “She is happy”, “Her name is June”, etc. • Akin to retrieval: matching query to key. • Context is simply other tokens co-occurring in the same text segment. • Related to “co-location”. • How big is context? → Small window, a sentence, a paragraph, the whole doc. • What is about relative position? → Position coding. 3/07/2023 20
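The query-key-value retrieval view of self-attention can be written in a few lines. The following is a hedged sketch of a single attention head over a toy sequence; the projection matrices Wq, Wk, Wv and the token count are illustrative assumptions, not values from the talk.

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values per token
    scores = Q @ K.T / K.shape[-1] ** 0.5      # match every query against every key
    weights = F.softmax(scores, dim=-1)        # how much context each token borrows
    return weights @ V                         # refined (contextual) token representations

d = 8
X = torch.randn(5, d)                          # 5 token embeddings, e.g. "June is happy ..."
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```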
  • 21. Positional Encoding • The Transformer relaxes the sequentiality of data • Positional encoding to embed sequential order in model Slide credit: Adham Beykikhoshk
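One common way to embed sequential order, used by the original Transformer, is the fixed sinusoidal encoding sketched below; the sequence length and model width are arbitrary assumptions.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]                  # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe                                              # added to the token embeddings

print(positional_encoding(50, 16).shape)                   # (50, 16)
```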
  • 22. Theory: Transformers are (new) Hopfield net 3/07/2023 22 Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
  • 23. Speed up: Vanilla Transformers are not efficient Slide credit: Hung Le
  • 24. Speed up: Efficient Transformers 3/07/2023 24 Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020).
  • 25. Speed up: Kernelization and associative tricks. Same index, reusable sum; reduce complexity. The idea links back to Efficient Attention: Attention with Linear Complexities by Shen et al., 2018. Slide credit: Hung Le
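A rough sketch of the associativity trick behind kernelized (linear) attention: with a feature map phi, softmax attention is replaced by phi(Q) @ (phi(K).T @ V), so the N x N score matrix is never materialised. The feature map below (a shifted ReLU) is only a stand-in for the kernels used in the literature.

```python
import numpy as np

def phi(x):
    return np.maximum(x, 0.0) + 1e-6          # a simple positive feature map (assumption)

def linear_attention(Q, K, V):
    Kp, Qp = phi(K), phi(Q)
    context = Kp.T @ V                         # (d, d_v) summary shared by all queries
    norm = Qp @ Kp.sum(axis=0)                 # per-query normaliser ("same index, reusable sum")
    return (Qp @ context) / norm[:, None]      # O(N d d_v) instead of O(N^2 d)

N, d = 1000, 16
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)         # (1000, 16), without a 1000 x 1000 matrix
```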
  • 27. Fast weights | HyperNet The model world is recursive • Early ideas in early 1990s by Juergen Schmidhuber and collaborators. • Data-dependent weights | Using a controller to generate weights of the main net. 3/07/2023 27 Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
  • 28. Neural networks vs Electronic circuits • Computational graph → Circuit • Compositionality → Modular design • Neuron as feature detector → SENSOR, FILTER • Multiplicative gates → AND gate, Transistor, Resistor • Attention mechanism → SWITCH gate • Memory + forgetting → Capacitor + leakage • Skip-connection → Short circuit 3/07/2023 28
  • 29. Module composition The system is modular, composable 3/07/2023 29 Source: https://www.ruder.io/modular-deep-learning/
  • 30. Neural architecture search When design is cheap and non-creative • The space is huge and discrete • Can be done through meta-heuristics (e.g., genetic algorithms) or Reinforcement learning (e.g., one discrete change in model structure is an action). 3/07/2023 30 Bello, Irwan, et al. "Neural optimizer search with reinforcement learning." arXiv preprint arXiv:1709.07417 (2017).
  • 31. Neural networks design goals •Capture long-term dependencies in time and space •Capture invariances natively •Capture equivariance 3/07/2023 31 • Expressivity • Scalability • Reusability/modularity • Compositionality • Universality
  • 32. Neural networks design goals (2) 3/07/2023 32 • Easy to train / learnability • Use (almost) no labels => Unsupervised learning • Resource adaptive • Ability to extrapolate => Must go beyond surface statistics • Support fast and slow learning (Complementary learning) • Support fast and slow inference (Dual system theory)
  • 33. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 33 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 34. Graph Structures in real world – Network Science Internet Social networks World wide web Communication Citations Biological networks credit: Jure Leskovec Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
  • 35. #REF: Penmatsa, Aravind, Kevin H. Wang, and Eric Gouaux. "X-ray structure of dopamine transporter elucidates antidepressant mechanism." Nature 503.7474 (2013): 85- 90. Biology, pharmacy & chemistry, materials • Molecule/crystal as graph: atoms as nodes, chemical bonds as edges • Computing molecular properties • Chemical-chemical interaction • Chemical reaction 3/07/2023 35 Gilmer, Justin, et al. "Neural message passing for quantum chemistry." arXiv preprint arXiv:1704.01212 (2017).
  • 36. Scene graphs as intermediate representation for image captioning Yao et al. Exploring Visual Relationship for Image Captioning, ECCV 2018 Fei-Fei Li, Ranjay Krishna, Danfei Xu
  • 37. GNN in videos: Space-time region graphs (Abhinav Gupta et al, ECCV’18)
  • 38. Transformer is a special type of GNN 3/07/2023 38 Image credit: Chaitanya Joshi
  • 39. chain-like wiring patterns LeNet AlexNet VGGNet The evolution of graph structures in modern NN design (Unintentional!) multiple wiring paths Inception ResNet DenseNet ResNeXt Credit: Saining Xie
  • 40. Natural evolution of representing the world • Vector → Embedding, MLP • Sequence → RNN (LSTM, GRU) • Grid → CNN (AlexNet, VGG, ResNet, EfficientNet, etc) • Set → Word2vec, Attention, Transformer • Graph → GNN (node2vec, DeepWalk, GCN, Graph Attention Net, Column Net, MPNN etc) • ResNet is a special case of GNN on grid! • Transformer is a special case of GNN on fully connected graph. 3/07/2023 40
  • 41. • Graphs are pervasive in many scientific disciplines. • The sub-area of graph representation has reached a certain maturity, with multiple reviews, workshops and papers at top AI/ML venues. 3/07/2023 41 GNN in research Source: https://github.com/EdisonLeeeee/ICLR2023-OpenReviewData
  • 42. Graph Neural Network as a solution: a neural network model that can deal with graph data, mapping graphs/nodes to representations for applications such as node classification, link prediction, community detection and graph generation. From Deep Graph Learning: Foundations, Advances and Applications, Yu Rong, Wenbing Huang, Tingyang Xu, Hong Cheng, Junzhou Huang, 2020
  • 43. Two Main Operations in GNN: Graph Filtering. Graph filtering refines the node features. Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
  • 44. Two Main Operations in GNN: Graph Pooling. Graph pooling generates a smaller graph. Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
  • 45. General GNN Framework [figure: a stack of blocks B_1 ... B_n, each consisting of a filtering layer and an activation, with optional pooling layers in between] Slide credit: Yao Ma and Yiqi Wang, Tyler Derr, Lingfei Wu and Tengfei Ma
  • 46. Generalizing 2D convolutions to Graph Convolutions - Graph convolutions involve similar local operations on nodes. - Nodes are now object representations and not activations. - The ordering of neighbors should not matter. - The number of neighbors should not matter. - N(i) denotes the neighbors of node i. - Attention can be employed for edge selection. Kipf & Welling (ICLR 2017) Fei-Fei Li, Ranjay Krishna, Danfei Xu
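A minimal sketch in the spirit of the Kipf & Welling graph convolution: add self-loops, symmetrically normalise the adjacency, average neighbour features, then apply a shared linear map and a nonlinearity. The toy graph and feature sizes are assumptions for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                   # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt         # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)           # aggregate neighbours, transform, ReLU

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)               # a 3-node path graph
H = np.random.randn(3, 4)                            # node features
W = np.random.randn(4, 8)                            # shared weights
print(gcn_layer(A, H, W).shape)                      # (3, 8): neighbour order and count do not matter
```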
  • 47. Generalizing GNNs through message passing 3/07/2023 47 #REF: Pham, Trang, et al. "Column Networks for Collective Classification." AAAI. 2017. Relation graph Generalized message passing
  • 48. Message Passing Neural Net [figure: a graph over nodes v_1 ... v_8, each carrying a hidden state h_i and label l_i] Two phases: message passing, then feature updating; M_k() and U_k() are functions to be designed. Neural Message Passing for Quantum Chemistry. ICML 2017. Slide credit: Yao Ma, Wei Jin, Yiqi Wang, Jiliang Tang, Tyler Derr, AAAI21
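The generalized message-passing step can be sketched as below. The message function M_k and update function U_k here are simple stand-ins (a ReLU map and a tanh map) for the learnable functions the slide says are to be designed; the edge list and sizes are toy assumptions.

```python
import numpy as np

def mpnn_step(h, edges, W_msg, W_upd):
    messages = np.zeros_like(h)
    for i, j in edges:                                       # edge (i, j): node j sends to node i
        messages[i] += np.maximum(h[j] @ W_msg, 0.0)         # M_k: message from neighbour j
    return np.tanh(np.concatenate([h, messages], axis=1) @ W_upd)  # U_k: update node states

n, d = 4, 6
h = np.random.randn(n, d)                                    # node states h_i
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]     # a small undirected path graph
W_msg, W_upd = np.random.randn(d, d), np.random.randn(2 * d, d)
print(mpnn_step(h, edges, W_msg, W_upd).shape)               # (4, 6)
```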
  • 49. Neural graph morphism • Input: Graph • Output: A new graph. Same nodes, different edges. • Model: Graph morphism • Method: Graph transformation policy network (GTPN) 3/07/2023 49 Kien Do, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for Chemical Reaction Prediction." KDD’19.
  • 50. Neural graph recurrence • Graphs that represent interaction between entities through time • Spatial edges are node interaction at a time step • Temporal edges are consistency relationship through time
  • 51. Challenges • The addition of temporal edges make the graphs bigger, more complex • Relying on context specific constraints to reduce the complexity by approximations • Through time, structures of the graph may change • Hard to solve, most methods model short sequences to avoid this
  • 52. ASSIGN: Asynchronous, Sparse Interaction Graph Network (Morais et al, 2021 @ A2I2, Deakin – CVPR’21) 3/07/2023 52
  • 53. GraphRNN to generate graphs • A case of graph dynamics: nodes and edges are added sequentially. • Solve tractability using BFS 3/07/2023 53 You, Jiaxuan, et al. "GraphRNN: Generating realistic graphs with deep auto-regressive models." ICML (2018).
  • 54. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 54 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 55. Representation learning, a bit of history •“Representation is the use of signs that stand in for and take the place of something else” It has been a goal of neural networks since the 1980s and the current wave of deep learning (2005-present) → Replacing feature engineering Between 2006-2012, many unsupervised learning models with varying degree of success: RBM, DBN, DBM, DAE, DDAE, PSD Between 2013-2018, most models were supervised, following AlexNet Since 2018, unsupervised learning has become competitive (with contrastive learning, self-supervised learning, BERT)! 3/07/2023 55
  • 56. Criteria for a good representation • Separates factors of variation (aka disentanglement), which are linearly correlated with desired outputs of downstream tasks. • Provides abstraction that is invariant against deformations and small variations. • Is distributed (one concept is represented by multiple units), which is compact and good for interpolation. • Optionally, offers dimensionality reduction. • Optionally, is sparse, giving room for emerging symbols. 3/07/2023 56 Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
  • 57. Why neural unsupervised learning? • Neural nets have representational richness: • FFNs are function approximators • RNNs are program approximators: they can estimate a program's behaviour and generate strings • CNNs capture translation invariance • Transformers are powerful contextual encoders • Compactness: Representations are (sparse and) distributed. • Essential to perception, compact storage and reasoning • Accounting for uncertainty: Neural nets can be stochastic to model distributions • Symbolic representation: realisation through sparse activations and gating mechanisms 3/07/2023 57
  • 58. Generative models: Discover the underlying process that generates data 3/07/2023 58 Many applications: • Text to speech • Simulate data that are hard to obtain/share in real life (e.g., healthcare) • Generate meaningful sentences conditioned on some input (foreign language, image, video) • Semi-supervised learning • Planning
  • 59. Deep (Denoising) AutoEncoder: Self-reconstruction of data 3/07/2023 59 [figure: an auto-encoder whose encoder (feature detector) maps raw data, optionally with added noise, to a representation, and whose decoder produces a reconstruction; a deep auto-encoder stacks several encoder and decoder layers]
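A minimal sketch of one training step of a denoising auto-encoder: corrupt the input, encode it to a low-dimensional representation, and train the decoder to reconstruct the clean input. The 784-dimensional input and 64-dimensional code are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())   # feature detector -> representation
decoder = nn.Linear(64, 784)                             # reconstruction
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, 784)                        # a batch of clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)        # optionally added noise
recon = decoder(encoder(x_noisy))
loss = nn.functional.mse_loss(recon, x)        # reconstruct the *clean* data
loss.backward()
opt.step()
print(loss.item())
```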
  • 60. FSDL 2022 • "Latent Diffusion" model: diffuse in lower-dimensional latent space, then decode back into pixel space • Frozen CLIP ViT-L/14, trained 860M UNet, 123M text encoder • Trained on LAION-5B on 256 A100s for 24 days ($600K) • FULLY OPEN-SOURCE StableDiffusion 60 Slide credit: Karayev, 2022
  • 61. Variational Autoencoder: approximating the posterior by a neural net [figure: Gaussian hidden variables, data, a generative net and a recognising net; credit: kvfrans.com] • Two separate processes: generative (hidden → visible) versus recognition (visible → hidden)
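The recognition net of a VAE outputs a Gaussian over the hidden variables; the reparameterisation step below keeps sampling differentiable so both nets can be trained jointly. This is a generic sketch with toy shapes, not the exact model in the figure.

```python
import torch

def reparameterise(mu, log_var):
    eps = torch.randn_like(mu)                     # noise from a standard Gaussian
    return mu + torch.exp(0.5 * log_var) * eps     # differentiable sample z ~ N(mu, sigma^2)

mu, log_var = torch.zeros(8, 4), torch.zeros(8, 4) # outputs of the recognition (encoder) net
z = reparameterise(mu, log_var)                    # fed to the generative (decoder) net
# KL term of the ELBO against a standard-normal prior:
kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1).mean()
print(z.shape, kl.item())
```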
  • 62. GAN: Generative Adversarial nets Matching data statistics • Instead of modeling the entire distribution of data, learns to map ANY random distribution into the region of data, so that there is no discriminator that can distinguish sampled data from real data. Any random distribution in any space Binary discriminator, usually a neural classifier Neural net that maps z → x
  • 63. Generative adversarial networks (Adapted from Goodfellow’s, NIPS 2014) 3/07/2023 63
  • 64. BERT Transformer that predicts its own masked parts • BERT is like parallel approximate pseudo-likelihood • ~ Maximizing the conditional likelihood of some variables given the rest. • When the number of variables is large, this converges to MLE (maximum likelihood estimation). 3/07/2023 64 https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
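The masked-prediction objective can be sketched as below: hide a random subset of tokens, predict them from the remaining context, and score only the hidden positions. The tiny embedding-plus-linear "model" is a placeholder, not BERT; the vocabulary size, mask rate and the use of id 0 as the mask token are assumptions.

```python
import torch
import torch.nn as nn

vocab, d = 100, 32
embed = nn.Embedding(vocab, d)                         # placeholder for a deep Transformer encoder
head = nn.Linear(d, vocab)

tokens = torch.randint(1, vocab, (4, 10))              # a batch of token ids
mask = torch.rand(4, 10) < 0.15                        # hide ~15% of positions
inputs = tokens.masked_fill(mask, 0)                   # id 0 plays the role of [MASK]

logits = head(embed(inputs))                           # (4, 10, vocab)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # predict only the masked tokens
print(loss.item())
```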
  • 65. Neural autoregressive models: Predict the next step given the history • The keys: (a) long-term dependencies, (b) ordering, & (c) parameter sharing. • Can be realized using: • RNN • CNN: One-sided CNN, dilated CNN (e.g., WaveNet), PixelCNN • Transformers → GPT-X family • Masked autoencoder → MADE • Pros: General, good quality thus far • Cons: Slow – needs better inductive biases for scalability 3/07/2023 65 lyusungwon.github.io/studies/2018/07/25/nade/
  • 66. FSDL 2022 • Generative Pre-trained Transformer • Decoder-only (uses masked self-attention) • Trained on 8M web pages, largest model is 1.5B GPT / GPT-2 (2019) https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf 66 Slide credit: Karayev, 2022
  • 67. Contrastive learning: Comparing samples 3/07/2023 67 Le-Khac, Phuc H., Graham Healy, and Alan F. Smeaton. "Contrastive Representation Learning: A Framework and Review." arXiv preprint arXiv:2010.05113 (2020).
  • 68. CLIP: matching image-text pairs against the rest • 400M image-text pairs crawled from the Internet • Transformer to encode text, ResNet or Vision Transformer to encode image • Contrastive training: maximize cosine similarity of correct image-text pairs (32K pairs per batch) https://arxiv.org/pdf/2103.00020.pdf Slide credit: Karayev, 2022
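The contrastive objective behind CLIP can be sketched as a symmetric cross-entropy over the in-batch similarity matrix: matching image-text pairs sit on the diagonal and every other pairing is a negative. The embedding size, batch size and temperature below are illustrative assumptions, and the random embeddings stand in for the image and text encoders.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.T / temperature            # pairwise cosine similarities
    targets = torch.arange(img.shape[0])          # the i-th image matches the i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

image_emb = torch.randn(32, 512)                  # stand-in for ResNet / ViT image features
text_emb = torch.randn(32, 512)                   # stand-in for Transformer text features
print(clip_loss(image_emb, text_emb).item())
```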
  • 69. Unsupervised learning: A few more points • No external labels, but rich training signals (thousands of bits per sample, as opposed to a few bits in supervised learning). A few techniques: • Compressing data as much as possible with little loss • Energy-based, i.e., pull down the energy of observed data, pull up everything else • Filling in the missing slots (aka predictive learning, self-supervised learning) • We have not covered unsupervised learning on graphs (e.g., DeepWalk, GPT-GNN), but the general principles should hold. • Question: Multiple objectives, or no objective at all? • Question: Emergence from many simple interacting elements? 3/07/2023 69 Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." arXiv preprint arXiv:2006.08218 (2020). Assran, Mahmoud, et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." arXiv preprint arXiv:2301.08243 (2023).
  • 70. Picture taken from (Bommasani et al, 2021) A Tipping Point: Foundation Models 70 • A foundation model is a model trained at broad scale that can be adapted to a wide range of downstream tasks • Scale and the ability to perform tasks beyond training Slide credit: Samuel Albanie, 2022
  • 71. Slide credit: Chris Ré, Stanford, 2022 word2vec 2013
  • 72. Two key ideas underpin foundation models: Emergence • system behaviour is implicitly induced rather than explicitly constructed • a cause of scientific excitement and anxiety about unanticipated consequences Homogenisation • consolidation of methodology for building machine learning systems across many applications • provides strong leverage for many tasks, but also creates single points of failure Slide credit: Samuel Albanie, 2022
  • 73. Homogenisation Learning instead of algorithm: Many applications can be powered by the same learning algorithm. • => Feature engineering Deep architecture engineering: Instead of hand-crafting features, the same architecture could be used widely. • => Architecture engineering Modern Transformer is universal: Same architecture, just different data! • => Data & Prompt engineering Slide credit: Samuel Albanie, 2022
  • 74. Homogenisation: Deepr 3/07/2023 74 [figure: a medical record is treated as a sequence of visits/admissions with time gaps; steps: (1) sequencing with time-gap/transfer phrases per admission, (2) word-vector embedding, (3) convolution for motif detection, (4) max-pooling into a record vector, (5) prediction] Nguyen, Phuoc, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. "Deepr: a convolutional net for medical records." IEEE Journal of Biomedical and Health Informatics 21, no. 1 (2016): 22-30. Concept: Stringify() – everything as a string
  • 76. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 76 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 77. 1960s-1990s ▪ Hand-crafting rules, domain- specific, logic-based ▪ High in reasoning ▪ Can’t scale. ▪ Fail on unseen cases. 3/07/2023 77 2020s-2030s  Learning + reasoning, general purpose, human-like  Has contextual and common- sense reasoning  Requires less data  Adapt to change  Explainable 1990s-2020s  Machine learning, general purpose, statistics-based  Low in reasoning  Needs lots of data  Less adaptive  Little explanation Photo credit: DARPA
  • 78. From ML to Machine Reasoning 3/07/2023 78 [figure: object detection over a scene of cylinders, cubes and spheres in cyan, brown, orange and red, followed by a reasoning stage] Slide credit: Tin Pham
  • 79. What is missing in deep learning? • Modern neural networks are good at interpolating → Data hungry to cover all variations and smooth local manifolds →Little systematic generalization (novel combinations) • Lack of human-perceived reasoning capability • Lack of logical inference • Lack of natural mechanism to incorporate prior knowledge, e.g., common sense • No built-in causal mechanisms 3/07/2023 79
  • 80. Machine reasoning Reasoning is concerned with arriving at a deduction about a new combination of circumstances. Reasoning is to deduce new knowledge from previously acquired knowledge in response to a query. 3/07/2023 80 Leslie Valiant Leon Bottou
  • 81. Machine reasoning • Two-part process • manipulate previously acquired knowledge • to draw novel inferences or answer new questions • Example: • Premise: • A is to the left of B • B is to the left of C • D is in front of A • E is in front of C • Conclusion: what is the relation between D and E? 3/07/2023 81 Slide credit: Tin Pham
  • 82. Geometry example 3/07/2023 82 Premise: • AM = MN (1) • BM = MC (2) • ∠AMB = ∠NMC (3) Conclusion sought: • AB = CN? • AB // CN? Solution (applying existing knowledge): From (1), (2), (3) → △AMB = △NMC (4) → AB = CN. From (1), (2) → ABNC is a parallelogram (5) → AB // CN. Slide credit: Tin Pham
  • 83. Is reasoning always formal/logical? 3/07/2023 83 Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149. Leon Bottou • “When we observe a visual scene, when we hear a complex sentence, we are able to explain in formal terms the relation of the objects in the scene, or the precise meaning of the sentence components. • However, there is no evidence that such a formal analysis necessarily takes place: we see a scene, we hear a sentence, and we just know what they mean. • This suggests the existence of a middle layer, already a form of reasoning, but not yet formal or logical.”
  • 84. Why not just neural reasoning? Central to reasoning are composition rules that guide the combination of modules to address new tasks. Bottou: • Reasoning is not necessarily achieved by making logical inferences • There is a continuity between [algebraically rich inference] and [connecting together trainable learning systems] → Neural networks are a plausible candidate! → But they are still not natural for representing abstract discrete concepts and relations. Hinton/Bengio/LeCun: Neural networks can do everything! The rest: Not so fast! => Neurosymbolic systems! 3/07/2023 84 Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149.
  • 85. Learning to reason • Learning is to improve oneself by experiencing ~ acquiring knowledge & skills • Reasoning is to deduce knowledge from previously acquired knowledge in response to a query (or a cue) • Learning to reason is to improve the ability to decide if a knowledge base entails a predicate. • E.g., given a video f, determine if the person with the hat turns before singing. • Hypotheses: • Reasoning as just-in-time program synthesis. • It employs conditional computation. • It minimises an energy function, or maximises the compatibility between input (prompt) and output. 3/07/2023 85 Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725. (Dan Roth; ACM Fellow; IJCAI John McCarthy Award)
  • 86. Reasoning as a skill • Reasoning as a prediction skill that can be learnt from data. • Question answering as zero-shot learning. • Neural network operations for learning to reason: • Attention & transformers. • Dynamic neural networks, conditional computation & differentiable programming. • Reasoning as iterative representation refinement & query-driven program synthesis and execution. • Compositional attention networks. • Reasoning as Neural module networks. 3/07/2023 86
  • 87. Practical setting: (query,database,answer) triplets • Classification: Query = what is this? Database = data. • Regression: Query = how much? Database = data. • QA: Query = NLP question. Database = context/image/text. • Multi-task learning: Query = task ID. Database = data. • Zero-shot learning: Query = task description. Database = data. • Drug-protein binding: Query = drug. Database = protein. • Recommender system: Query = User (or item). Database = inventories (or user base); 3/07/2023 87
  • 88. The two approaches to neural reasoning • Implicit chaining of predicates through recurrence: • Step-wise query-specific attention to relevant concepts & relations. • Iterative concept refinement & combination, e.g., through a working memory. • Answer is computed from the last memory state & question embedding. • Explicit program synthesis: • There is a set of modules, each performing a pre-defined operation. • The question is parsed into a symbolic program. • The program is implemented as a computational graph constructed by chaining separate modules. • The program is executed to compute an answer. 3/07/2023 88
  • 89. MACNet: Composition- Attention-Control (reasoning by progressive refinement of selected data) 3/07/2023 89 Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning." arXiv preprint arXiv:1803.03067 (2018).
  • 90. LOGNet: Relational object reasoning with language binding 90 • Key insight: Reasoning is chaining of relational predicates to arrive at a final conclusion → Needs to uncover spatial relations, conditioned on the query → Chaining is query-driven → Objects/language need binding → Object semantics is query-dependent → Everything is end-to-end differentiable Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran, “Dynamic Language Binding in Relational Visual Reasoning”, IJCAI’20.
  • 91. 91 LOGNet for VQA Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran, “Dynamic Language Binding in Relational Visual Reasoning”, IJCAI’20.
  • 92. Visual QA in action
  • 93. What about the Transformer? • Reasoning as (free-)energy minimisation • The classic Belief Propagation algorithm is a minimization algorithm for the Bethe free energy! • The Transformer's relational, iterative state refinement makes it a great candidate for implicit relational reasoning. 3/07/2023 93 Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free energy." Advances in neural information processing systems. 2003. Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
  • 95. Module networks (reasoning by constructing and executing neural programs) • Reasoning as laying out modules to reach an answer • Composable neural architecture → question parsed as a program (layout of modules) • A module is a function (x → y), which could be a sub-reasoning process ((x, q) → y). 3/07/2023 95 https://bair.berkeley.edu/blog/2017/06/20/learning-to-reason-with-neural-module-networks/
  • 96. Program execution • Works on an object-based visual representation • An intermediate set of objects is represented by a vector, as an attention mask over all objects in the scene. For example, Filter(Green_cube) outputs a mask (0,1,0,0). • The output mask is fed into the next module (e.g., Relate) 96
  • 97. Source: @rao2z What about reasoning in LLMs? • LLMs have a HUGE associative memory. • With “Let’s think step-by-step”? • With “Chain of Thought”? • Or is it just pattern recognition of chains of reasoning? • Finding short-cuts that approximate a provably correct reasoning procedure. • => Very poor OOD generalisation. 3/07/2023 97
  • 98. A general framework 3/07/2023 98 Explicit Knowledge Graphs + Large Language Models (implicit common sense knowledge, associative database)
  • 99. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 99 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 100. 3/07/2023 100 Learning a Turing machine → Can we learn a (neural) program that learns to program from data?
  • 101. Memory networks • Input is a set → Load into memory, which is NOT updated. • State is a RNN with attention reading from inputs • Concepts: Query, key and content + Content addressing. • Deep models, but constant path length from input to output. • Equivalent to a RNN with shared input set. • => Seq2seq with attention is a Memory Network (Memory = input seq). • => Transformer is a kind of Memory Network with Parallel Memory Update! 3/07/2023 101 Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
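One memory-network read step, in the spirit of end-to-end memory networks: the query is matched against the keys of the stored (fixed) memory by content, and the answer is a softmax-weighted sum of the contents. The slot count and dimensions are toy assumptions.

```python
import torch
import torch.nn.functional as F

def memory_read(query, keys, values):
    scores = keys @ query                  # content addressing: match the query to each key
    weights = F.softmax(scores, dim=0)     # soft attention over memory slots
    return weights @ values                # retrieved content

keys = torch.randn(10, 16)                 # 10 memory slots (addresses), loaded once, not updated
values = torch.randn(10, 16)               # their contents
query = torch.randn(16)
print(memory_read(query, keys, values).shape)   # (16,)
```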
  • 102. MANN: Memory-Augmented Neural Networks (a constant path length) • Long-term dependency • E.g., the outcome depends on the far past • Memory is needed (e.g., as in LSTM) • => This is what makes Transformers powerful! • Complex programs require multiple computational steps • Each step can be selective (attentive) to certain memory cells • Operations: Encoding | Decoding | Retrieval
  • 103. MANN: Neural Turing machine (NTM) (simulating a differentiable Turing machine) • A controller that takes input/output and talks to an external memory module. • Memory has read/write operations. • The main issue is where to write, and how to update the memory state. • All operations are differentiable. Source: rylanschaeffer.github.io
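A rough sketch of NTM-style content-based addressing and a soft write: cosine similarity between a key and every memory row decides where to focus, and erase/add vectors then update all rows in proportion to that focus, so the whole step remains differentiable. The slot count, key sharpness beta and the erase/add vectors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def address(memory, key, beta=5.0):
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # similarity to each slot
    return F.softmax(beta * sim, dim=0)                         # sharper focus for larger beta

def write(memory, w, erase, add):
    memory = memory * (1 - w[:, None] * erase[None, :])         # soft erase where we focus
    return memory + w[:, None] * add[None, :]                   # soft add where we focus

M = torch.randn(8, 16)                                           # 8 memory slots of width 16
w = address(M, key=torch.randn(16))                              # where to write
M = write(M, w, erase=torch.sigmoid(torch.randn(16)), add=torch.randn(16))
print(M.shape)                                                   # (8, 16), all operations differentiable
```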
  • 104. 3/07/2023 104 NTM unrolled in time with LSTM as controller #Ref: https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315
  • 105. MANN for reasoning • Three steps: • Store data into memory • Read query, process sequentially, consult memory • Output answer • Behind the scene: • Memory contains data & results of intermediate steps • Drawbacks of current MANNs: • No memory of controllers → Less modularity and compositionality when query is complex • No memory of relations → Much harder to chain predicates. 3/07/2023 105 Source: rylanschaeffer.github.io
  • 106. Failures of item-only MANNs for reasoning • Relational representation is NOT stored → Can’t reuse later in the chain • A single memory of items and relations → Can’t understand how relational reasoning occurs • The memory-memory relationship is coarse since it is represented as either dot product, or weighted sum. 3/07/2023 106
  • 107. Self-attentive associative memories (SAM) Learning relations automatically over time 3/07/2023 107 Hung Le, Truyen Tran, Svetha Venkatesh, “Self-attentive associative memory”, ICML'20.
  • 108. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 108 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 109. Neural nets are powerful but we still want: • Learning with less data, and zero-shot learning; • Generalization of solutions to unseen tasks and unforeseen data distributions; • Explainability by construction. 3/07/2023 109 https://ibm.github.io/neuro-symbolic-ai/events/ns-workshop2023 Self-Aware Learning • Deeper learning for challenging tasks • Integrating continuous and symbolic representations • Diversified learning modalities Credit: Yolanda Gil, Bart Selman AI to Understand Human Intelligence • 5 years: AI systems could be designed to study psychological models of complex intelligent phenomena that are based on combinations of symbolic processing and artificial neural networks.
  • 110. Symbolic forms • Words in Wordnet • Syntax in NLP & Code • Logic, prepositional and first-order • Variables, equations • Knowledge structure: Semantic nets, knowledge graphs • Graphical models: Bayesian networks, Markov random fields, Markov logic networks. • Function (names), indirection, pointer in C/C++. 3/07/2023 110
  • 111. Henry Kautz's taxonomy (1) • Symbolic Neural symbolic—is the current approach of many neural models in natural language processing, where words or subword tokens are both the ultimate input and output of large language models. Examples include BERT, RoBERTa, and GPT-3. 3/07/2023 111 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 112. Representing Context and Structure: known as contextualized language models. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” 2019 Slide credit: Pacheco & Goldwasser, 2021
  • 113. What does BERT learn? Emergent linguistic structure in artificial neural networks trained by self-supervision. PNAS, Manning et al., 2020. Linguistic structure emerges without direct supervision. Slide credit: Pacheco & Goldwasser, 2021
  • 114. Using BERT for Reasoning Tasks • BERT-based near-human performance on the Winograd Schema. WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale. Sakaguchi et al., AAAI’20. Can “thinking-slow” tasks be accomplished with “thinking-fast” systems? Not a panacea (McCoy et al., ACL’19, others): often relies on simple heuristics when learning complex decisions. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. McCoy et al., ACL’19. World knowledge and commonsense inferences are reflected in coreference decisions. Slide credit: Pacheco & Goldwasser, 2021
  • 115. Henry Kautz's taxonomy (2) • Symbolic[Neural]—is exemplified by AlphaGo, where symbolic techniques are used to call neural techniques. In this case, the symbolic approach is Monte Carlo tree search and the neural techniques learn how to evaluate game positions. 3/07/2023 115 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 116. Henry Kautz's taxonomy (3) • Neural | Symbolic—uses a neural architecture to interpret perceptual data as symbols and relationships that are reasoned about symbolically. The Neural- Concept Learner is an example. 3/07/2023 116 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 117. End-to-End Module Networks • Construct the program internally • The two parts are jointly learnable 3/07/2023 End-to-End Module Networks, Hu et al., ICCV 2017 117 Slide credit: Vuong Le
  • 118. Henry Kautz's taxonomy (4) • Neural: Symbolic → Neural—relies on symbolic reasoning to generate or label training data that is subsequently learned by a deep learning model, e.g., to train a neural model for symbolic computation by using a Macsyma-like symbolic mathematics system to create or label examples. 3/07/2023 118 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI Lample, Guillaume, and François Charton. 2020. “Deep Learning For Symbolic Mathematics.” In Proceedings of the International Conference on Learning Representations.
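A minimal sketch of the Symbolic → Neural pattern, assuming SymPy is available: a symbolic algebra engine generates exactly labelled (expression, derivative) pairs that a neural sequence model could then be trained on. The training itself is omitted, and the expression grammar is an illustrative choice.

```python
import random
import sympy as sp

x = sp.symbols('x')
ATOMS = [x, sp.sin(x), sp.cos(x), sp.exp(x)]

def random_expression(depth=2):
    """Build a small random symbolic expression."""
    if depth == 0:
        return random.choice(ATOMS)
    left = random_expression(depth - 1)
    right = random_expression(depth - 1)
    return random.choice([left + right, left * right])

# The symbolic engine provides exact labels (derivatives here) that a
# neural seq2seq model would be trained to reproduce token-by-token.
dataset = [(str(e), str(sp.diff(e, x)))
           for e in (random_expression() for _ in range(5))]
for src, tgt in dataset:
    print(src, '->', tgt)
```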
  • 119. Henry Kautz's taxonomy (5) • Neural_{Symbolic}—uses a neural net that is generated from symbolic rules. An example is the Neural Theorem Prover, which constructs a neural network from an AND-OR proof tree generated from knowledge base rules and terms. Logic Tensor Networks also fall into this category. 3/07/2023 119 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
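A rough sketch of the Neural_{Symbolic} idea in the style of Logic Tensor Networks (not their actual API): a symbolic rule is compiled into a differentiable loss over fuzzy truth values, so gradients flow back into the neural predicates. The predicate tensors, the product t-norm, and the Reichenbach implication are illustrative choices.

```python
import torch

# Fuzzy truth values that would normally be produced by neural predicates
# (random stand-ins here): smokes(x) for 5 people, friends(x, y) pairwise.
smokes = torch.rand(5, requires_grad=True)
friends = torch.rand(5, 5, requires_grad=True)

def implies(a, b):
    # Reichenbach fuzzy implication: 1 - a + a*b.
    return 1 - a + a * b

# Compile the rule  friends(x, y) AND smokes(x) -> smokes(y)
# using the product t-norm for AND, evaluated for every pair (x, y).
body = friends * smokes.unsqueeze(1)            # shape (5, 5): body truth per (x, y)
rule_truth = implies(body, smokes.unsqueeze(0))

loss = 1 - rule_truth.mean()  # encourage the rule to hold on average
loss.backward()               # gradients reach the (neural) predicate parameters
print(float(loss))
```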
  • 120. Henry Kautz's taxonomy (6) • Neural[Symbolic]—allows a neural model to directly call a symbolic reasoning engine, e.g., to perform an action or evaluate a state. An example would be ChatGPT using a plugin to query Wolfram Alpha. 3/07/2023 120 Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), pp.105-125. https://en.wikipedia.org/wiki/Neuro-symbolic_AI
  • 121. LLMs for calling tools • Information retriever • Symbolic/math module & code interpreters • Virtual agents • Robotic arms. See https://palm-e.github.io/ 3/07/2023 121 Credit: Khattab et al
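A minimal sketch of the tool-calling loop (Neural[Symbolic]): the neural model either answers directly or requests a symbolic tool, whose result is fed back into its context. Both `call_llm` and `call_calculator` are placeholders here, not a real LLM API or Wolfram Alpha.

```python
import re

def call_llm(prompt):
    # Placeholder for a large language model. A real model would decide,
    # from the prompt, whether to answer or to emit a tool request.
    if "Tool result:" in prompt:
        return "The answer is " + prompt.rsplit("Tool result: ", 1)[-1]
    return "CALL calculator: 17*23"

def call_calculator(expression):
    # Placeholder symbolic/math module (stands in for e.g. Wolfram Alpha).
    return str(eval(expression, {"__builtins__": {}}))

def answer(question, max_steps=3):
    prompt = question
    for _ in range(max_steps):
        reply = call_llm(prompt)
        request = re.match(r"CALL calculator: (.*)", reply)
        if request is None:
            return reply  # the model answered without a tool
        result = call_calculator(request.group(1))
        prompt = f"{prompt}\nTool result: {result}"  # symbolic result back to the LLM
    return reply

print(answer("What is 17 * 23?"))  # -> "The answer is 391"
```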
  • 122. Symbols via Indirection 3/07/2023 122 Example: Z = X + Y with bindings X = 1, Y = 2, Z = 3: bind symbols with values. Pointers in computer science; information binding in the brain. https://www.linkedin.com/pulse/unsolved-problems-ai-part-2-binding-problem-eberhard-schoeneburg/ Indirection binds two objects together and uses one to refer to the other. Slide credit: Kha Pham
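A minimal sketch of indirection as binding, mirroring the Z = X + Y example above: symbols refer to values through a binding table (a pointer-like mechanism) rather than holding the values themselves, so the same abstract rule can be re-bound to new data.

```python
def evaluate(expr, bindings):
    """Evaluate a symbolic expression by dereferencing each symbol."""
    left, op, right = expr                   # e.g. ("X", "+", "Y")
    a, b = bindings[left], bindings[right]   # indirection: symbol -> current value
    return a + b if op == "+" else a - b

bindings = {"X": 1, "Y": 2}                  # bind symbols with values
print(evaluate(("X", "+", "Y"), bindings))   # -> 3 (i.e. Z = 3)

# Rebinding reuses the same abstract rule on new data; reasoning at the
# symbol level is what is hoped to generalize.
bindings = {"X": 10, "Y": 32}
print(evaluate(("X", "+", "Y"), bindings))   # -> 42
```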
  • 123. Indirection is a key design principle in software engineering 3/07/2023 123 Client → Indirection Layer → Target https://medium.com/@nmckinnonblog/indirection-fba1857630e2 Indirection removes direct coupling between units and promotes: • Extensibility • Control • Evolvability • Encapsulation of code and design complexity Every computer science problem can be solved with a higher level of indirection. Andrew Koenig, Butler Lampson, David J. Wheeler Slide credit: Kha Pham
  • 124. Leveraging indirection to improve OOD generalization 3/07/2023 124 Why indirection? Indirection binds concrete data to abstract symbols, and reasoning on symbols is likely to improve generalization. What to bind? Concrete information of data, e.g., representations, functional relations between data, etc. Functional indirection Structural indirection How to bind? During indirection, some concrete information of data will be ignored, and thus we have to decide what to maintain, i.e., invariances across data. → Indirection connects invariance and symbolic approaches. Slide credit: Kha Pham
  • 125. Structural Indirection: InLay 3/07/2023 125 • InLay simultaneously leverages indirection and data internal relationships to construct indirection representations, which respect the similarities between internal relationships. • InLay connects invariance and symbolic approaches: • InLay constructs indirection representations from a fixed set of symbolic vectors. • InLay assumes two invariances: • The data internal relationships are invariant through indirection. • The set of symbolic vectors to compute indirection representations is invariant across train and test samples. Slide credit: Kha Pham Pham, K., Le, H., Ngo, M. and Tran, T., Improving Out-of-distribution Generalization with Indirection Representations. In The Eleventh International Conference on Learning Representations.
  • 126. Structure-Mapping Theory (SMT) 3/07/2023 126 • Improves on previous theories of analogy, e.g., Tversky's contrast theory, which assumed that an analogy is stronger the more attributes the base and target share in common. • SMT [1] argued that it is not object attributes that are mapped in an analogy, but relationships between objects. Example: the X12 star system vs. the Solar system is a literal similarity, while the hydrogen atom vs. the Solar system is Rutherford's analogy. Literal similarity: many attributes mapped, many relations mapped. Analogy: few attributes mapped, many relations mapped. [1] Gentner, Dedre. "Structure-mapping: A theoretical framework for analogy." Cognitive Science 7.2 (1983): 155-170. Slide credit: Kha Pham
  • 127. Structure-Mapping Theory (SMT) (cont.) 3/07/2023 127 Which predicates will be chosen to be mapped in an analogy? Systematicity Principle: A predicate that belongs to a mappable system of mutually interconnecting relationships is more likely to be imported into the target than is an isolated predicate. Example: from the Solar system to the hydrogen atom, interconnected relational predicates (distance, attractive force, revolves around) are mapped, while isolated attributes (color, temperature) are not. Slide credit: Kha Pham
  • 128. Model architecture 3/07/2023 128 • The concrete data representation is viewed as a complete graph with weighted edges. • The indirection operator maps this graph to a symbolic graph with the same edge weights, but whose vertices are fixed and trainable. • This symbolic graph is propagated, and the updated node features are the indirection representations. Slide credit: Kha Pham
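A rough PyTorch sketch of the architecture described above, written from the slide's description rather than the authors' code (the layer sizes, softmax-based soft binding, and single propagation step are assumptions): relations computed on the concrete input graph are carried over to a fixed set of trainable symbolic vertices, which are then propagated to give indirection representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IndirectionLayer(nn.Module):
    """Sketch of structural indirection: map the relational structure of the
    input onto a fixed set of trainable symbolic vectors (assumed design,
    not the official InLay implementation)."""
    def __init__(self, num_symbols, dim):
        super().__init__()
        # Fixed, trainable symbolic vertices shared across all samples.
        self.symbols = nn.Parameter(torch.randn(num_symbols, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, num_items, dim) -- concrete item representations.
        # Edge weights of the complete graph over concrete items.
        rel = torch.softmax(x @ x.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        # Soft binding of each concrete item to the symbolic vertices.
        assign = torch.softmax(x @ self.symbols.t(), dim=-1)   # (b, n, k)
        # Carry the relational structure over to the symbolic graph.
        sym_rel = assign.transpose(1, 2) @ rel @ assign         # (b, k, k)
        # One step of propagation over the symbolic graph.
        sym_nodes = self.symbols.unsqueeze(0).expand(x.shape[0], -1, -1)
        return F.relu(self.proj(sym_rel @ sym_nodes))           # (b, k, dim)

layer = IndirectionLayer(num_symbols=8, dim=16)
print(layer(torch.randn(2, 5, 16)).shape)  # torch.Size([2, 8, 16])
```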
  • 129. Experiments on IQ datasets – RAVEN dataset 3/07/2023 129 An IQ problem from the RAVEN [1] dataset. Average test accuracies (%) without/with InLay in different OOD testing scenarios on RAVEN: LSTM 30.1/39.2; Transformers 15.1/42.5; RelationNet 12.5/46.4; PrediNet 13.8/15.6. [1] Zhang, Chi, et al. "RAVEN: A dataset for relational and analogical visual reasoning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. • The original RAVEN paper proposes different OOD testing scenarios, in which models are trained on one configuration and tested on another (but related) configuration. Slide credit: Kha Pham
  • 130. Experiments on OOD image classification tasks 3/07/2023 130 OOD image classification, in which test images are distorted (e.g., a distorted dog image: "Dog" vs. "Dog?"). • When test images are injected with kinds of distortion other than those seen in training, deep neural networks may fail drastically at image classification. [1] [1] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 31, 2018. Average test accuracies (%) without/with InLay of Vision Transformers (ViT) on different types of distortions: SVHN 65.9/68.8; CIFAR10 38.2/43.1; CIFAR100 17.1/20.4. Slide credit: Kha Pham
  • 131. Here “physics” refers to empirical or theoretical laws that exist in nature. Chart: the number of papers on physics-informed ML (PIML) and physics-informed NNs grew roughly exponentially from 2015 to 2022 (log-scale counts from 1 to 100,000; exponential fit R² = 0.989).
  • 132. Integrate-and-fire neuron andreykurenkov.com Priors that work • Neuron as trainable feature detector • Depth + Skip-connection • Invariance/equivariance: • Convolution (Translation) • Recurrence (Time travel) • Attention (Permutation) • Analogy • Kernel, case-based reasoning, • Attention, memory Feature detector Source: http://karpathy.github.io/assets/rnn/diags.jpeg
  • 133. Physics invariance • Newton's laws • Symmetry • Conservation laws • Noether’s theorem linking symmetry and conservation. First page of Emmy Noether's article "Invariante Variationsprobleme" (1918). Source: Wikipedia
  • 134. ML, data & physics • Data collection/annotation for ML is expensive • ML solutions don’t respect symmetries and conservation laws • Physical laws are universal (up to scale) | ML only generalizes in-distribution. Karniadakis, George Em, et al. "Physics-informed machine learning." Nature Reviews Physics 3.6 (2021): 422-440.
  • 135. Embedding physics into ML https://medium.com/@zhaoshuai1989/why-do-we-need-physics-informed-machine-learning-piml-d11fe0c4436c
  • 136. Physics guides neural architecture • Physics-informed neural networks (PINN) Figure from talk by Perdikaris & Wang, 2020.
  • 137. Physics guides learning dynamics • Physics-informed neural networks (PINN) Figure from talk by Perdikaris & Wang, 2020.
  • 138. Case study: Damped harmonic oscillation Source: https://benmoseley.blog/my-research/so-what-is-a-physics-informed-neural-network/
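A compact PyTorch sketch of a physics-informed network for the damped oscillator u'' + μu' + ku = 0 (the coefficients, network size, and training schedule are illustrative; this is not the blog's code): a data loss fits a few early observations, while the ODE residual, obtained with autograd, constrains the network over the whole domain.

```python
import torch
import torch.nn as nn

mu, k = 0.4, 4.0  # illustrative damping and stiffness coefficients

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# A few observations near t = 0 (here from the known analytic solution),
# plus collocation points over the full domain for the physics residual.
omega = (k - mu ** 2 / 4) ** 0.5
t_obs = torch.linspace(0, 1, 10).reshape(-1, 1)
u_obs = torch.exp(-mu * t_obs / 2) * torch.cos(omega * t_obs)
t_col = torch.linspace(0, 10, 200).reshape(-1, 1).requires_grad_(True)

for step in range(2000):
    opt.zero_grad()
    loss_data = ((net(t_obs) - u_obs) ** 2).mean()          # fit the data
    u = net(t_col)
    du = torch.autograd.grad(u.sum(), t_col, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), t_col, create_graph=True)[0]
    loss_phys = ((d2u + mu * du + k * u) ** 2).mean()       # ODE residual
    (loss_data + loss_phys).backward()
    opt.step()
```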
  • 139. Case study: COVID-19 in Vietnam, 2021 • Failed to contain the new exponential growth due to the Delta variant. • The cost: ~20,000 lives within 3 months!! • At the peak, the daily mortality was comparable to the Vietnam War's rate. • What worked in 2020 didn't work in 2021. 3/07/2023 139
  • 140. SIR family for pandemics • N = Population • S = Susceptible • I = Infectious • R = Recovered Source: Wikipedia Basic reproduction number
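A minimal numerical sketch of the SIR dynamics above, dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI, integrated with SciPy; the population and rate values are illustrative only, and the basic reproduction number is R0 = β/γ.

```python
import numpy as np
from scipy.integrate import solve_ivp

N = 1_000_000            # population (illustrative)
beta, gamma = 0.3, 0.1   # transmission and recovery rates; R0 = beta/gamma = 3

def sir(t, y):
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

sol = solve_ivp(sir, (0, 160), y0=[N - 10, 10, 0],
                t_eval=np.linspace(0, 160, 161))
print("Peak infectious:", int(sol.y[1].max()))
```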
  • 141. Covid-19 infections • SIR: closed-form solutions are hard to calculate • Parameters change over time due to interventions → need a more flexible framework. • Solution: Richards equation → Richards curve | Gompertz curve • Task: from 10-20 data points, extrapolate 150 more.
  • 142. Model design • Remember, often we have only 20-30 highly correlated data points to learn from! • The model is a sum of 2-3 “waves”, each a 3-parameter Gompertz curve: • Height of the peak • Location of the peak • Scale of the wave (the effective width) • The number of waves covers the observed waves plus some hypothetical future waves. • The model can be thought of as a special neural network in which each hidden unit is a wave with a Gompertz-based kernel (a sketch follows below). 3/07/2023 142
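A hedged sketch of the wave-sum model described above, not the actual forecasting code: cumulative cases are modelled as a sum of Gompertz waves, each with a height, peak location, and scale, fitted with SciPy and then extrapolated. The synthetic data, the two-wave choice, and the bounds (standing in for the priors discussed on the next slide) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, height, peak, scale):
    """Cumulative Gompertz wave: total size 'height', daily peak at 'peak',
    effective width controlled by 'scale'."""
    return height * np.exp(-np.exp(-(t - peak) / scale))

def two_waves(t, h1, p1, s1, h2, p2, s2):
    # Sum of waves: each wave acts like a hidden unit with a Gompertz kernel.
    return gompertz(t, h1, p1, s1) + gompertz(t, h2, p2, s2)

# Synthetic "observed" cumulative counts; in practice only ~20-30 points
# near the start of a wave were available.
t_obs = np.arange(70)
y_obs = two_waves(t_obs, 8000, 15, 5, 20000, 45, 8)

# Bounds play the role of priors on wave height and scale.
lower = [1000, 5, 2, 1000, 20, 2]
upper = [15000, 40, 20, 50000, 120, 30]
params, _ = curve_fit(two_waves, t_obs, y_obs,
                      p0=[5000, 20, 5, 10000, 60, 10], bounds=(lower, upper))

t_future = np.arange(150)
forecast = two_waves(t_future, *params)
print("Projected total cases:", int(forecast[-1]))
```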
  • 143. Estimating the model priors • Impossible to know without assumptions! • Need priors on wave size and, possibly, the scale (e.g., min-max) • One solution: • Look at other countries, adjusting for population size. • Hopefully the culture, economic structure & actions are similar. • It depends on: • The virus variant (original != Delta != Omicron) • Health/border capacity (closed border + lockdown in the beginning) • Vaccination coverage (80% tended to be the threshold for opening) • Total cases/population. 3/07/2023 143
  • 144. Case of HCM City Chart: estimated Covid-19 deaths in Ho Chi Minh City, July-October 2021 (series: recorded deaths, estimated deaths, cumulative deaths (actual)). 11/8: predicting date; 20-21/8: peak; total cases by 16/10.
  • 145. Case of Binh Duong province 145 17/8: predicting date; 28-30/8: peak; total cases by 25/10.
  • 146. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 146 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 147. In 2022-2023, DL reached new heights: GPT-4, PaLM-E, GATO, etc. 3/07/2023 147
  • 148. Major remaining problems of DL • Massive associative machine → Lacks a causality prior, prone to learning the wrong things or working for the wrong reasons. → Overconfident for the wrong reasons (e.g., prone to adversarial attacks). → Exploits short-cuts => poor OOD generalisation. → Sample inefficient. → Approximate reasoning patterns, not derived from first principles. • Inference separated from learning → No built-in adaptation other than retraining → Catastrophic forgetting • Limited theoretical understanding 3/07/2023 148
  • 149. Are limitations inherent? • YES, statistical systems tend to memorize data and find short-cuts. • We need lots of data to cover all possible variations, hence lots of compute. • But aren’t we great copiers? • NO, neural nets were founded on the basis of distributed representation and parallel processing. These are robust, fast and energy efficient. • We still need to find “binding” tricks that do all sorts of things without relying on statistical training signals + backprop. 3/07/2023 149
  • 150. Dimensions of progress • Continuation of current works/paths • Expansion/optimisation • Industrialisation: Scale up & scale out • Challenge fundamental assumptions • DL as part of more holistic solution to Human-Level AI (HLAI) • Dealing with the unexpected: Uncertainty, safety, security 3/07/2023 150
  • 151. Continuation • Enabling techs: Data, compute, network • Work with noisy quantum computing (which will take time to mature) • DL fundamentals: Representation, learning & inference • Rep = data rep + computational graph + symmetry • Learning as pre-training to extract as much knowledge from data as possible • Learning as on-the-fly inference (Bayesian, hypernetwork/fast weight) • Extreme inference = dynamic computational graph on-the-fly. 3/07/2023 151
  • 152. Continuation (2) • DL applications • Data-rich & data-poor • Cognitive domains (vision, NLP) • Improve manufacturing • Accelerate science 3/07/2023 152
  • 153. Expansion/optimisation • New inductive biases (for vision, NLP, living things, science, social AI, ethical AI) • Cutting the statistical/associative short-cuts • Shifting from feature space to function space. • Pushing for high-level analogy (rather than just feature-based kernel/template matching) • Binding, indirection, symbols • Injection of knowledge into models. 3/07/2023 153
  • 154. Expansion (2) • Expanding to classical AI areas (planning, reasoning, knowledge representation, symbol manipulation). • Needs to solve symbol grounding for that to happen. • Physics-informed neural networks (e.g., my work in Covid-19 forecasting) • Social dimensions, human-in-the-loop 3/07/2023 154
  • 155. Industrialisation: Scaling - success formula thus far Data + knowledge + compute + generic scalable algorithms 3/07/2023 155
  • 156. Scaling - Rich Sutton’s Bitter Lesson (2019) 3/07/2023 156 “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. ” http://www.incompleteideas.net/IncIdeas/BitterLesson.html “The two methods that seem to scale arbitrarily in this way are search and learning.”
  • 157. DeepMind: Scale (up) is enough 3/07/2023 157
  • 158. But … • Scaling is like building a taller ladder to get to the Moon. • We need a rocket and the science of escape velocity. • The human brain is big (1e+14 synapses) but does exactly the opposite: it maximizes entropy reduction using minimum energy (think of the most efficient heat engine). • Just 20W is enough for human-level intelligence! • => Must use different principles rather than just (sample-inefficient) statistics! • No need to take the computer's detour: analog -> digital/sequential -> parallel analog simulation. 3/07/2023 158
  • 159. DL is part of Broad AI 3/07/2023 159 Hochreiter, S., 2022. Toward a broad AI. Communications of the ACM, 65(4), pp.56-57.
  • 160. DL is part of Integrated Intelligence LeCun’s plan 3/07/2023 160 https://ai.facebook.com/blog/yann-lecun-advances-in-ai-research/ Knowledge?
  • 162. DL “accidental” history 3/07/2023 162 Source: rikochet_band 1950s: Rosenblatt wired the first trainable perceptron, hyping AI up. 1970-1980s: Minsky and Papert almost killed it until Rumelhart et al. worked out high-school math to train multi-layer perceptrons. 1980-1990s: LeCun managed to get CNNs to work for something real. 1990s: RNNs were proved to be Turing-equivalent. Schmidhuber got excited and bombarded the field with lots of cool ideas. 1990s-2000s: But the models were shallow and hard to train. Almost no one worked on it for 2 decades until the Canadian mafia fought back with new tricks to train deeper models. 2010s: Accidentally, DL took off like a rocket, thanks to gamers. 2020s: Now DL works on everything, except for: small data, shifted data, noisy data, artificially twisted data, deep stuffs, exact stuffs, abstract stuffs, causal stuffs, symbolic stuffs, thinking stuffs, and stuffs that no one knows how they work like consciousness. 2020s: DL believers got rich, and a new bunch of students got over-trained.
  • 163. Differentiable programming Neuro-symbolic systems Neural reasoning Post DL What needs work 3/07/2023 163 Agenda Overview Neural building blocks Graph neural networks Unsupervised learning What works
  • 164. Final words • Deep neural networks are here to stay, maybe as part of the holistic solution to human-level AI. • Gradient-based learning is still without parallel. • DL will be much more general/universal/versatile (e.g., dynamic architectures, with the Transformer as a relaxed approximation). • Higher cognitive capabilities will be there, maybe with symbol-manipulation capacity. • Better generalization capability (e.g., extreme generalization). • We have to deal with the consequences of its own success. • Negative effects; Jevons' paradox. • DL is now an industry, and is still going strong. But students may be over-fitted to particular DL ways of thinking. • The industry will need to keep the highly trained (overfitted) DL workforce busy! 3/07/2023 164
  • 165. Second bitter lesson Little priors (innateness?) + lots of experiments > strong priors (theory of intelligence) + trying to prove it. => Chomsky would disagree here. 3/07/2023 165 Source: QuestionPro