Deep learning has recently reached the heights that pioneers in the field had aspired to, serving as the driving force behind recent breakthroughs in AI, which have arguably passed the Turing test. At present, the spotlight is on scaling Transformers and diffusion models on Internet-scale data. In this talk, I will provide an overview of the fundamental principles of deep learning, its powers and limitations, and explore the new era of post-deep learning. This new era encompasses novel objectives, dynamic architectures, abstract reasoning, neurosymbolic hybrid systems, and LLM-based agent systems.
Artificial intelligence in the post-deep learning era
1. AI in the post-deep learning era
Prof Truyen Tran
Head of AI, Health and Science
Applied AI Institute (A2I2), Deakin University
truyen.tran@deakin.edu.au
12/04/2024 1
2. “[By 2023] … Emergence of the generally agreed upon "next big thing" in AI beyond deep learning.”
Rodney Brooks, Jan 2018
3. “Deep learning is going to be able to do everything”
(Geoff Hinton, Nov 2020)
4. The quest of modern AI: Learning a Turing machine
[Figure: A mechanical Turing machine]
Can we design a (neural) program that learns to program from data?
5. Three kinds of AI
• Cognitive automation: encoding human abstractions → automate tasks normally performed by humans.
  • The majority of current machine learning and symbolic AI falls into this category.
• Cognitive assistance: AI helps us make sense of the world (perceive, think, understand).
  • This is where the true potential of AI lies.
  • Some applications of ML fall into this category at present.
• Cognitive autonomy: Artificial minds thrive independently of us, exist for their own sake.
  • Science fiction!
François Chollet
8. Depth refers to the number of steps between input and output
[Figure panels: Integrate-and-fire neuron; Feature detector; Block representation. Source: andreykurenkov.com]
11. What DL really means
• Functional view: Nested function composition. Base functions are feature transformations.
  • Depth is the number of transformation steps between raw data and output.
• State view: Layered data abstraction, distributed representation.
• Kernel view: Nested kernels, aka “glorified template matching”.
• Programming view: Differentiable programming, dynamic modular composition, trainable computational graphs.
• Memory view: An associative way to compress data/world model into weights, and decompress data when prompted.
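The functional view can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the tanh layers, the layer sizes and the random weights are hypothetical choices, not anything prescribed by the slides. It builds a network as a nested composition of feature transformations and reads off depth as the number of composed steps.

```python
import math
import random

def make_layer(weights, biases):
    """Return one feature transformation: x -> tanh(W x + b)."""
    def layer(x):
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)
        ]
    return layer

def compose(layers):
    """Nest the layers: the output of one is the input of the next."""
    def network(x):
        for layer in layers:
            x = layer(x)
        return x
    return network

def random_layer(n_in, n_out, rng):
    # Random weights stand in for trained ones (assumption for illustration).
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return make_layer(weights, [0.0] * n_out)

rng = random.Random(0)
sizes = [4, 8, 8, 2]  # raw data dimension -> hidden -> hidden -> output
layers = [random_layer(i, o, rng) for i, o in zip(sizes[:-1], sizes[1:])]
network = compose(layers)

depth = len(layers)  # number of transformation steps between input and output
output = network([1.0, 0.5, -0.5, 2.0])
```

Here `depth` is 3 because three transformations sit between the raw input and the output; adding a layer to `sizes` deepens the composition without changing `compose`.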
12. Advances in the past 10 years
• Architectures – CNN/RNN family, attention/Transformers, memory/differentiable programming,
native data structures (sequence, tree, grid, graph, set), skip-connection, hypernetwork/fast
weight.
• Training techniques (Param initialization, Adam, RMSProp, BERT, self-supervised learning,
contrastive learning).
• Robustness (Dropout, normalization).
• Large models/compute (GPT-X, etc).
• Deep generative models (VAE, GAN, Normalizing flows, Diffusion).
• New theoretical understanding (overparameterization, role of depth, nature of gradient learning).
• Hardware to support parallelization (GPU, TPU).
13. Picture taken from (Bommasani et al, 2021)
A tipping point: Foundation models
• A foundation model is a model trained at broad scale that can be adapted to a wide range of downstream tasks
• Scale and the ability to perform tasks beyond training
Slide credit: Samuel Albanie, 2022
14. Key concepts that make DL work
• Distributed representation
• Associative learning
• Layers + backprop
→ DL picks up contextual information easily, as long as there are signals (numerical or textual).
→ DL mimics training signals. At the extreme, it will be indistinguishable from human expression.
→ Cross-modal association isn’t hard. Symbol grounding “appears” to be solved (it isn’t).
→ DL scales arbitrarily with data and compute (really key for modern AI)
15. DL works on almost all modalities
SIGNALS STRINGS TABLES
16. “Software 2.0 is written in neural network weights”
Andrej Karpathy, Nov 2017
17. Why is DL so powerful?
Practical
• Generality: Applicable to many domains.
• Competitive: DL is hard to beat as long as there are data to train on.
• Scalability: DL gets better with more data, and it is very scalable.
Theoretical
• Expressiveness: Neural nets can approximate any continuous function (universal approximation).
• Learnability: Neural nets are trained easily.
• Generalisability: Neural nets generalize surprisingly well to unseen data.
18. Why are deep generative models (DGMs) so powerful?
• DGMs are compression engines: Prompting is conditioning for the (preference-guided) decompression.
• DGMs are approximate program databases: Prompting is retrieving an approximate program that takes input and delivers output.
• DGMs are world models: We can live entirely in simulation!
19. The power comes from arbitrary scaling - Rich Sutton’s Bitter Lesson (2019)
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
“The two methods that seem to scale arbitrarily in this way are search and learning.”
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
20. DL can learn everything
…
as long as we have the right architecture and clean data
21. What are the limitations of deep learning?
• Modern neural networks are massive curve fitters
• Good at interpolating
→ Data hungry to cover all variations and smooth local manifolds
→ Very sample/energy inefficient, low rate of data-knowledge conversion.
→ Little systematic generalization (novel combinations)
• Inference separated from learning
→ No built-in adaptation other than retraining
→ Catastrophic forgetting
• Lack of human-perceived reasoning capability
• Lack of logical inference
• Lack of natural mechanism to incorporate prior knowledge, e.g., common sense
• No built-in causal mechanisms
• Limited theoretical understanding.
22. Are these limitations inherent?
• YES, statistical systems tend to memorize data and find short-cuts.
• We need lots of data to cover all possible variations, hence lots of compute.
• But aren’t we great copiers?
• NO, neural nets were founded on the basis of distributed
representation and parallel processing. These are robust, fast and
energy efficient.
• We still need to find “binding” tricks that do all sorts of things without relying
on statistical training signals + backprop.
24. Dual system: A possible architecture
System 1: Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2: Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
[Figure labels: Perception; Theory of mind; Recursive reasoning; Facts; Semantics; Events and relations; Working space; Memory. Image credit: VectorStock | Wikimedia]
25. Continuation of System 1
• Industry has invested heavily in DL
• → It needs to reap the benefits for years to come, on both the hardware and software sides.
• Enabling techs: Data, compute, networking
• → Scaling up (bigger) & scaling out (mixture)
• → One model for all
• DL fundamentals: Representation, learning & inference
• Rep = data rep + computational graph + symmetry
• Learning as pre-training to extract as much knowledge from data as possible
• Learning as on-the-fly inference (Bayesian, hypernetwork/fast weight)
• Extreme inference = dynamic computational graph on-the-fly.
27. But …
• Scaling is like building a taller ladder to get to the Moon.
• We need a rocket and the science of escape velocity.
• The human brain is big (about 1e+14 synapses) but does exactly the opposite: it maximizes entropy reduction using minimum energy (think of the most efficient heat engine).
28. One model for all – our early attempt
• Settings shown: (a) multi-label, (b) multi-view, (c) multi-view/multi-label and (d) multi-instance
• Columns are a generic message-passing scheme between entities
Pham, Trang, Truyen Tran, and Svetha Venkatesh. "One size fits many: Column
bundle for multi-x learning." arXiv preprint arXiv:1702.07021 (2017).
29. Our early attempt (2): Deepr
[Figure: Deepr pipeline. Labels: medical record (visits/admissions, time gaps/transfers) → (1) sequencing into phrases/admissions → (2) embedding (word vectors) → (3) convolution (motif detection) → (4) max-pooling (record vector) → (5) prediction at the prediction point (output)]
Nguyen, Phuoc, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. "Deepr: a convolutional net for medical records." IEEE Journal of Biomedical and Health Informatics 21, no. 1 (2016): 22-30.
Concept: Stringify() – everything as a string
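The "everything as a string" idea can be sketched as follows. This is a minimal illustration, not the Deepr implementation: the function names, the code tokens such as `DIAG_E11`, and the time-gap buckets are all hypothetical. It turns a list of dated visits into one token sequence, inserting a discretised time-gap token between consecutive visits.

```python
from datetime import date

# Hypothetical gap buckets in days; the discretisation used in Deepr may differ.
GAP_BUCKETS = [(30, "GAP_0-1m"), (90, "GAP_1-3m"), (180, "GAP_3-6m"), (365, "GAP_6-12m")]

def gap_token(days):
    """Map a gap in days to a discrete token."""
    for upper, token in GAP_BUCKETS:
        if days <= upper:
            return token
    return "GAP_12m+"

def stringify(visits):
    """visits: list of (date, [codes]) sorted by date -> flat list of tokens."""
    tokens = []
    previous = None
    for when, codes in visits:
        if previous is not None:
            tokens.append(gap_token((when - previous).days))
        tokens.extend(codes)
        previous = when
    return tokens

# Made-up record for illustration only.
record = [
    (date(2015, 1, 10), ["DIAG_E11", "PROC_1916"]),
    (date(2015, 3, 1), ["DIAG_I10"]),
    (date(2016, 6, 20), ["DIAG_E11", "DIAG_N18"]),
]
sentence = " ".join(stringify(record))
```

Once a record is a "sentence", the same embedding, convolution and pooling machinery used for text can be applied to it unchanged.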
31. Why is one-model-for-all possible?
• The world is regular: Rules, patterns, motifs, grammars, recurrence
• World models are learnable from data!
• Advances in ML:
  • Model flexibility
  • Powerful training and inference machines
  • Smart tricks
• The human brain gives an example
  • One brain, but capable of processing all modalities, doing plenty of tasks, and learning from different kinds of training signals.
  • Thinking at a high level is independent of input modalities and task-specific skills.
32. RL Team: Reward is enough
Silver, David, Satinder Singh, Doina Precup, and Richard S. Sutton.
"Reward is enough." Artificial Intelligence 299 (2021): 103535.
35. Machine reasoning
Reasoning is concerned with arriving at a deduction about a new combination of circumstances.
Reasoning is to deduce new knowledge from previously acquired knowledge in response to a query.
Leslie Valiant
Leon Bottou
36. Hypotheses
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
• Reasoning is recursive, e.g., mental travel.
37. Neural reasoning: Two methods
• Implicit chaining of predicates through recurrence:
• Step-wise query-specific attention to relevant concepts & relations.
• Iterative concept refinement & combination, e.g., through a working memory.
• Answer is computed from the last memory state & question embedding.
• Explicit program synthesis:
• There is a set of modules, each performing a pre-defined operation.
• The question is parsed into a symbolic program.
• The program is implemented as a computational graph constructed by chaining separate modules.
• The program is executed to compute an answer.
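The explicit program synthesis route can be sketched in a few lines. This is a toy under loud assumptions: the scene, the module set and the rule-based parser are hypothetical stand-ins (real systems use learned neural modules and a learned parser). It shows the three stages: parse the question into a symbolic program, chain modules into a computation, and execute it.

```python
# Hypothetical toy scene; a real system would work on image features.
SCENE = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

# Each module performs one pre-defined operation on the running state.
MODULES = {
    "scene": lambda objs, arg: list(SCENE),
    "filter_color": lambda objs, arg: [o for o in objs if o["color"] == arg],
    "filter_shape": lambda objs, arg: [o for o in objs if o["shape"] == arg],
    "count": lambda objs, arg: len(objs),
}

def parse(question):
    """Toy parser: only handles 'How many <color> <shape>s are there?'."""
    words = question.lower().rstrip("?").split()
    color, shape = words[2], words[3].rstrip("s")
    return [("scene", None), ("filter_color", color),
            ("filter_shape", shape), ("count", None)]

def execute(program):
    """Chain the modules in program order and return the final state."""
    state = None
    for name, arg in program:
        state = MODULES[name](state, arg)
    return state

program = parse("How many red cubes are there?")
answer = execute(program)
```

A new question reuses the same modules in a different arrangement, which is where the compositionality of this method comes from.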
38. Learning to reason: Reasoning as a skill
• Reasoning as a prediction skill that can be learnt from data.
• Question answering as zero-shot learning.
• Neural network operations for learning to reason:
  • Attention & transformers.
  • Dynamic neural networks, conditional computation & differentiable programming.
  • Module networks
  • LLMs that generate programs on the fly, plus feedback
(Dan Roth; ACM Fellow; IJCAI John McCarthy Award)
39. Example: LOGNet
Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran, “Dynamic Language
Binding in Relational Visual Reasoning”, IJCAI’20.
40. Deliberative reasoning implies memory
• Three steps:
• Store data/representations into memory
• Read query, process sequentially, consult/update memory
• Output answer
• But data memory isn’t enough:
• No memory of controllers → Less modularity and compositionality when the query is complex
• No memory of relations → Much harder to chain predicates.
• Still iterative refinement → Prone to curve fitting
Source: rylanschaeffer.github.io
42. Program memory → Program synthesis
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory."
In International Conference on Learning Representations. 2019.
Slide credit: Hung Le
Neural stored-program memory (NSM) stores keys (the addresses) and values (the weights).
The weight is selected and loaded into the controller of the NTM (Neural Turing Machine).
The stored NTM weights and the weights of the NUTM (Neural Universal Turing Machine) are learnt end-to-end by backpropagation.
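The key/value reading of a program memory can be sketched as below. This is only an illustration, assuming dot-product attention with a softmax and a one-neuron controller; the names `read_program` and `controller`, the keys and the stored weights are hypothetical, and the sketch omits the end-to-end training that the slide describes.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_program(query, keys, programs):
    """Attend over the keys with the query and blend the stored weight vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    attention = softmax(scores)
    n = len(programs[0])
    weights = [sum(a * p[i] for a, p in zip(attention, programs)) for i in range(n)]
    return attention, weights

def controller(weights, x):
    """A one-neuron controller whose weights are loaded from program memory."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))

keys = [[5.0, 0.0], [0.0, 5.0]]        # addresses (hypothetical)
programs = [[1.0, 1.0], [1.0, -1.0]]   # stored controller weights: "add" vs "subtract"

attention, weights = read_program([1.0, 0.0], keys, programs)
output = controller(weights, [0.3, 0.2])
```

With the query aligned to the first key, the loaded weights are close to the first stored program, so the same controller behaves like a different "program" depending on what is read from memory.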
44. Symbolic processing is desirable in System 2
• Learning with less and zero-shot learning;
• Generalization of the solutions to unseen tasks and unforeseen data distributions;
• Explainability by construction.
https://ibm.github.io/neuro-symbolic-ai/events/ns-workshop2023
Self-Aware Learning
• Deeper learning for challenging tasks
• Integrating continuous and symbolic representations
• Diversified learning modalities
Credit: Yolanda Gil, Bart Selman
AI to Understand Human Intelligence
• 5 years: AI systems could be designed to study psychological models of complex intelligent phenomena that are based on combinations of symbolic processing and artificial neural networks.
45. Henry Kautz's taxonomy (2)
• Symbolic[Neural]: exemplified by AlphaGo, where symbolic techniques are used to call neural techniques. In this case, the symbolic approach is Monte Carlo tree search and the neural techniques learn how to evaluate game positions.
Kautz, H., 2022. The third AI summer: AAAI Robert S. Engelmore
memorial lecture. AI Magazine, 43(1), pp.105-125.
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
46. Henry Kautz's taxonomy (3)
• Neural | Symbolic: uses a neural architecture to interpret perceptual data as symbols and relationships that are reasoned about symbolically. The Neural-Concept Learner is an example.
47. Henry Kautz's taxonomy (6)
• Neural[Symbolic]: allows a neural model to directly call a symbolic reasoning engine, e.g., to perform an action or evaluate a state. An example would be ChatGPT using a plugin to query Wolfram Alpha.
48. Symbols via Indirection
Z = X + Y (bound values: Z = 3, X = 1, Y = 2)
Bind symbols with values
Pointer in Computer Science
https://www.linkedin.com/pulse/unsolved-problems-ai-part-2-binding-problem-eberhard-schoeneburg/
Indirection binds two objects together and uses one to refer to the other.
Slide credit: Kha Pham
“Every computer science problem can be solved with a higher level of indirection.”
Andrew Koenig, Butler Lampson, David J. Wheeler
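A minimal sketch of binding symbols to values through indirection, using the Z = X + Y example from this slide. The `evaluate` helper is a hypothetical name introduced for illustration: the rule is written once over symbols, and each symbol is dereferenced through a table of bindings only when the rule is applied.

```python
def evaluate(expression, bindings):
    """Evaluate 'A + B' by dereferencing each symbol through the bindings."""
    left, op, right = expression.split()
    if op != "+":
        raise ValueError("this sketch only supports '+'")
    return bindings[left] + bindings[right]

rule = "X + Y"                             # defined once, over symbols
z1 = evaluate(rule, {"X": 1, "Y": 2})      # the slide's binding: Z = 3
z2 = evaluate(rule, {"X": 10, "Y": -4})    # same rule, new bindings
```

Because the rule never mentions concrete values, it applies unchanged to bindings it has never seen, which is the kind of generalization that symbols offer.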
49. InLay: Indirection layer
• Concrete data representation is viewed as a complete graph
with weighted edges.
• The indirection operator maps this graph to a symbolic graph
with the same weight edges, however the vertices are fixed and
trainable.
• This symbolic graph is propagated and the updated node
features are indirection representations
Slide credit: Kha Pham
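The three bullets above can be sketched roughly as follows. This is not the authors' implementation: the softmax-normalised dot-product edge weights, the single propagation step and all names are assumptions made for illustration. The data only contributes the edge weights, while the vertices are data-independent symbol vectors (random here, trainable parameters in a real model).

```python
import math
import random

def edge_weights(objects):
    """Complete graph over concrete objects: softmax-normalised dot products."""
    weights = []
    for a in objects:
        scores = [sum(x * y for x, y in zip(a, b)) for b in objects]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def indirection(objects, symbols):
    """Keep the data graph's edges, swap its vertices for symbols, propagate once."""
    weights = edge_weights(objects)
    dim = len(symbols[0])
    return [
        [sum(w * s[d] for w, s in zip(row, symbols)) for d in range(dim)]
        for row in weights
    ]

rng = random.Random(0)
# Symbol vertices: fixed across inputs; one per object position (assumption).
symbols = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]

objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
rotated = [[-y, x] for x, y in objects]  # different values, same pairwise relations

rep = indirection(objects, symbols)
rep_rotated = indirection(rotated, symbols)
```

In this sketch `rep` and `rep_rotated` coincide, because the rotation changes every concrete value but preserves the pairwise similarities, and only those similarities reach the symbolic graph.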
50. Experiments on IQ datasets – RAVEN dataset
[Figure: An IQ problem in the RAVEN [1] dataset]
Model          Accuracy (%), without/with InLay
LSTM           30.1/39.2
Transformers   15.1/42.5
RelationNet    12.5/46.4
PrediNet       13.8/15.6
Average test accuracies (%) without/with InLay in different OOD testing scenarios on RAVEN
[1] Zhang, Chi, et al. "Raven: A dataset for relational and analogical visual reasoning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
• The original paper of the RAVEN dataset proposes different OOD testing scenarios, in which models are trained on one configuration and tested on another (but related) configuration.
Slide credit: Kha Pham
51. Experiments on OOD image classification tasks
[Figure: OOD image classification, in which test images are distorted. A clean image is labelled "Dog"; its distorted version is labelled "Dog?"]
• When test images are injected with kinds of distortions other than those seen in training, deep neural networks may fail drastically in image classification tasks. [1]
[1] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 31, 2018.
Dataset    ViT accuracy (%), without/with InLay
SVHN       65.9/68.8
CIFAR10    38.2/43.1
CIFAR100   17.1/20.4
Average test accuracies (%) without/with InLay of Vision Transformers (ViT) on different types of distortions
Slide credit: Kha Pham
54. Case study: Covid-19 infections in VN 2021
• Classic SIR model: closed-form solutions are hard to calculate
• Parameters change over time due to intervention → Need a more flexible framework.
• Solution: Richards equation → Mixture of Gompertz curves
• Task: 10-20 data points → Extrapolate 150 more.
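The functional form behind a mixture of Gompertz curves can be sketched as below. The Gompertz curve K·exp(−b·exp(−c·t)) is a limiting case of the Richards growth curve. Everything else here is an assumption for illustration: the parameter values are hypothetical and not fitted to any data, and the fitting step (estimating K, b, c from the 10-20 observed points) is left out.

```python
import math

def gompertz(t, K, b, c):
    """Cumulative count at day t: K * exp(-b * exp(-c * t))."""
    return K * math.exp(-b * math.exp(-c * t))

def mixture(t, components):
    """Sum of Gompertz curves, e.g. one per wave or intervention regime."""
    return sum(gompertz(t, K, b, c) for K, b, c in components)

def daily(t, components):
    """New counts on day t, taken as the increment of the cumulative curve."""
    return mixture(t, components) - mixture(t - 1, components)

# Hypothetical (K, b, c) for two overlapping waves; not fitted to real data.
components = [(10000.0, 20.0, 0.08), (6000.0, 400.0, 0.07)]

horizon = range(0, 170)  # roughly 20 observed days plus 150 extrapolated ones
cumulative = [mixture(t, components) for t in horizon]
peak_day = max(range(1, 170), key=lambda t: daily(t, components))
```

Each component saturates at its own K, so the extrapolated cumulative curve is bounded by the sum of the K values, and the peak of the daily curve can be read off directly once the parameters are known.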
55. Case of HCM City
[Figure: Estimated number of Covid-19 deaths, Ho Chi Minh City, weekly ticks from 3/07/21 to 30/10/21. Series: recorded deaths; estimated deaths; cumulative deaths (actual). Annotations: 11/8: prediction date; 20-21/8: peak; 16/10: total cases.]
57. DL pushes changes in practice of AI
2000s
Focus: Model
Flow: Data → Feature → Model → Deploy
Reception: Skeptical
2010s
Focus: Data
Flow: Data → Model → Deploy
Reception: Accelerating
2020s
Focus: Prompt
Flow: Prompt → Deploy
Reception: Responsible
58. New behaviours
Emergence
• System behaviour is implicitly induced rather than explicitly constructed
• Cause of scientific excitement and anxiety about unanticipated consequences
Homogenisation
• Consolidation of methodology for building machine learning systems across many applications
• Provides strong leverage for many tasks, but also creates single points of failure
Slide credit: Samuel Albanie, 2022
59. The shift towards science
Engineering
Design man-made systems
AI
Discover emergent behaviours
Science
Discover laws in nature.
60. Example: Data → Prompt → Deploy
Long Dang, Thao Le, Vuong Le, Tu Minh Phuong, Truyen Tran, SADL: An Effective In-Context
Learning Method for Compositional Visual QA, 2023
61. Example: LLM agent for scientific discovery
User request: Design a material that:
- <Requirement 1>
- <Requirement 2>
- …
Crystal LLM Agent
Designed prompt:
- Task description
- Tools description
- Few-shot examples
- …
High-level tasks:
- Search for template
- Generate from template
- Evaluate requirement 1
- Evaluate requirement 2
Tools set: Tool 1, Tool 2, Tool 3
Selected tools: Tool 1, Tool 2, Tool 3, Tool 3
Stages: Execution; Reflect; Correction; Final answer
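The control flow of such an agent (plan high-level tasks, pick a tool per task, execute, reflect, correct) can be sketched with plain functions. This is a toy under loud assumptions: no LLM is called, the planner returns a fixed task list, and every tool, candidate name and acceptance rule is a hypothetical stand-in.

```python
# Hypothetical tools; a real agent would wrap search, generators and simulators.
def search_template(state):
    state["template"] = "template-1"
    return state

def generate_from_template(state):
    # Each attempt proposes the next candidate from a fixed, made-up list.
    candidates = ["cand-A", "cand-B", "cand-C"]
    state["candidate"] = candidates[state["attempt"] % len(candidates)]
    return state

def evaluate_requirements(state):
    # Stand-in evaluator: pretend only 'cand-B' meets all requirements.
    state["ok"] = state["candidate"] == "cand-B"
    return state

TOOLS = {
    "search": search_template,
    "generate": generate_from_template,
    "evaluate": evaluate_requirements,
}

def plan(request):
    """Stand-in for the LLM planner: returns a fixed list of high-level tasks."""
    return ["search", "generate", "evaluate"]

def run_agent(request, max_attempts=5):
    state = {"request": request, "attempt": 0}
    tasks = plan(request)
    for attempt in range(max_attempts):
        state["attempt"] = attempt
        for task in tasks:          # execution: run the tool selected for each task
            state = TOOLS[task](state)
        if state["ok"]:             # reflect: did the result meet the requirements?
            return state["candidate"], attempt + 1
        # correction: fall through and regenerate on the next attempt
    return None, max_attempts

answer, attempts = run_agent("Design a material that meets the requirements")
```

In a real system the planner, the tool selection and the reflection step would each be driven by prompts to the LLM; the loop structure is the part this sketch is meant to show.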
62. LLM social agents
• Extended actions space:
• APIs
• RAG
• Architectures with LLM
and external memory
• Long-term
• Short-term/sensory
• Working
[Figure: an LLM agent with Memory, interacting with the World and with other agents]
• Social Interactions
• Working as a team in cooperative tasks
• Effective Communications:
• When/Who/What to communicate?
• Via other’s actions and messages
• What is others’ knowledge or belief?
• Should others’ knowledge be corrected by
communication?
64. Conclusion
• DL reached its peak in 2022 with ChatGPT. This changed AI practice dramatically.
• Deep neural networks are here to stay, maybe as a part of the holistic solution to human-level AI.
• Gradient-based learning is still without parallel.
• DL will be much more general/universal/versatile
• Higher cognitive capabilities will be there, maybe with symbol manipulation capacity.
• Better generalization capability (e.g., extreme)
• We have to deal with the consequences of its success.
• The industry will need to keep the highly trained (overfitted) DL workforce busy!