Deep Learning has taken the digital world by storm. As a general-purpose technology, it is now present in all walks of life. Although fundamental developments in methodology have slowed in the past few years, applications are flourishing, with major breakthroughs in Computer Vision, NLP and Biomedical Sciences. The primary successes can be attributed to the availability of large labelled datasets, powerful GPU servers and programming frameworks, and advances in neural architecture engineering. This combination enables rapid construction of large, efficient neural networks that scale to the real world. But the fundamental questions of unsupervised learning, deep reasoning, and rapid contextual adaptation remain unsolved. We shall call what we currently have Deep Learning 1.0, and the next possible breakthroughs Deep Learning 2.0.
This is part 1 of the Tutorial delivered at IEEE SSCI 2020, Canberra, December 1st (Virtual).
Deep Learning 1.0 and Beyond: A Tutorial on Classic and Emerging Models (Part I)
1. 16/11/2020 1
A/Prof Truyen Tran
With contributions from Vuong Le, Hung Le, Thao Le, Tin Pham & Dung Nguyen
Deakin University
December 2020
Deep learning 1.0 and Beyond
A tutorial
Part I
@truyenoz
truyentran.github.io
truyen.tran@deakin.edu.au
letdataspeak.blogspot.com
goo.gl/3jJ1O0
linkedin.com/in/truyen-tran
3. Why (still) DL?
Practical
Generality: Applicable to many domains.
Competitive: DL is hard to beat as long as there are data to train on.
Scalability: DL gets better with more data, and it is very scalable.
Theoretical
Expressiveness: Neural nets can approximate any function.
Learnability: Neural nets are trained easily.
Generalisability: Neural nets generalize surprisingly well to unseen data.
4. It is easy to get lost in the current DL zoo
Vietnam News
AAAI’20
6. Model design goals
Resource adaptive, compressible
Easy to train
Use (almost) no labels
Ability to extrapolate
Support both fast and slow learning
Support both fast and slow inference
Uniformity
Universality
Scalability
Reusability
Capture long-term dependencies in time and space
Capture invariances natively
7. Agenda
Deep learning 1.0: Classic models, Transformers, Graph neural networks, Unsupervised learning
Deep learning 2.0: Neural memories, Theory of mind, Neural reasoning, A system view
8. Deep models via layer stacking
Theoretically powerful, but limited in practice
Integrate-and-fire neuron
andreykurenkov.com
Feature detector
Block representation
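Layer stacking can be sketched in a few lines. The following is a minimal illustration in plain Python, not the tutorial's own code: the layer sizes, the tanh nonlinearity and the random initialisation are all assumptions chosen for readability.

```python
import math
import random

def dense(x, weights, biases):
    """One layer (a bank of feature detectors): affine map followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights; sizes and scale are illustrative choices."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """A deep model is the composition of its stacked layers."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

rng = random.Random(0)
sizes = [4, 8, 8, 2]  # input -> two hidden layers -> output (hypothetical sizes)
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = forward([1.0, 0.5, -0.3, 2.0], layers)
```

Each entry of `layers` plays the role of one block in the slide's block representation; going deeper means appending to `sizes`.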
10. Sequence model with recurrence
Assumes a stationary world
Classification
Image captioning
Sentence classification
Neural machine translation
Sequence labelling
Source: http://karpathy.github.io/assets/rnn/diags.jpeg
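The stationarity assumption amounts to reusing the same parameters at every time step. A minimal sketch of a vanilla recurrent update follows; the weight values and the tanh nonlinearity are assumptions for illustration only.

```python
import math

def rnn_step(h, x, w_hh, w_xh, b):
    """One recurrent update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    return [math.tanh(sum(wh * hj for wh, hj in zip(w_hh[i], h))
                      + sum(wx * xj for wx, xj in zip(w_xh[i], x))
                      + b[i])
            for i in range(len(h))]

def run_rnn(xs, w_hh, w_xh, b):
    """Apply the same parameters at every step, for a sequence of any length."""
    h = [0.0] * len(b)
    states = []
    for x in xs:
        h = rnn_step(h, x, w_hh, w_xh, b)
        states.append(h)
    return states

# Hypothetical parameters: 2 hidden units, 1 input dimension.
w_hh = [[0.5, -0.2], [0.1, 0.3]]
w_xh = [[1.0], [-1.0]]
b = [0.0, 0.0]
short = run_rnn([[1.0], [0.0]], w_hh, w_xh, b)
longer = run_rnn([[1.0], [0.0], [2.0], [-1.0]], w_hh, w_xh, b)
```

Because the parameters are shared, the same function handles both sequences, and the first two states of `longer` coincide with `short`.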
11. Spatial model with convolutions
Assumes filters/motifs are translation invariant
http://colah.github.io/posts/2015-09-NN-Types-FP/
Learnable kernels
andreykurenkov.com
Feature detector, often many
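A small sketch of why a shared kernel suits translated motifs: sliding one kernel over the input means a shifted motif produces a shifted response (equivariance), and a global max over the response is then unchanged. The kernel here is hand-picked rather than learned, which is an assumption for illustration.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: slide the same kernel over every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1.0, -1.0]  # hand-picked edge detector, standing in for a learned kernel
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]  # same motif, moved two steps to the right

response = conv1d(signal, kernel)
response_shifted = conv1d(shifted, kernel)

# Global max pooling discards position, giving the same score for both inputs.
pooled = max(response)
pooled_shifted = max(response_shifted)
```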
13. Operator on sets/bags: Attentions
Not everything is created equal for a goal
Need an attention model to select or ignore certain computations or inputs
Can be “soft” (differentiable) or “hard” (requires RL)
Attention provides a short-cut for long-term dependencies
Also encourages sparsity if done right!
http://distill.pub/2016/augmented-rnns/
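Soft attention over a set can be sketched as a softmax-weighted average. The dot-product scoring function and the toy keys and values below are assumptions; other scoring functions are equally valid.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(query, keys, values):
    """Weight each value by how well its key matches the query (dot product)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
read, weights = soft_attention([4.0, 0.0], keys, values)
```

The second item receives almost no weight here, which is the "select or ignore" behaviour in differentiable form.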
14. Fast weights | HyperNet
The world is recursive
Early ideas date to the early 1990s, by Juergen Schmidhuber and collaborators.
Data-dependent weights | Using a controller to generate weights of the main net.
Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
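The idea of a controller generating the weights of the main net can be sketched as follows. This is a toy linear controller with arbitrary parameter values, an assumption for illustration and not the architecture of the cited paper.

```python
def linear(x, weights):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def controller(context, n_out, n_in, ctrl_weights):
    """Map a context vector to the main net's weights (flattened, then reshaped)."""
    flat = linear(context, ctrl_weights)
    return [flat[i * n_in:(i + 1) * n_in] for i in range(n_out)]

n_in, n_out = 3, 2
# Controller parameters: one row per generated weight (values are arbitrary).
ctrl_weights = [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(n_out * n_in)]

x = [1.0, 2.0, 3.0]
w_a = controller([1.0, 0.0], n_out, n_in, ctrl_weights)
w_b = controller([0.0, 1.0], n_out, n_in, ctrl_weights)
y_a = linear(x, w_a)  # same input, weights generated from context A
y_b = linear(x, w_b)  # same input, weights generated from context B
```

Only `ctrl_weights` would be trained; the main net's weights are a function of the data-dependent context.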
15. Neural architecture search
When design is cheap and non-creative
The space is huge and discrete
Can be done through meta-heuristics (e.g., genetic algorithms) or reinforcement learning (e.g., one discrete change in model structure is an action).
Bello, Irwan, et al. "Neural optimizer search with reinforcement learning." arXiv preprint arXiv:1709.07417 (2017).
17. Motivations
RNN is theoretically powerful, but purely sequential, hence slow, and has limited effective memory for a finite size.
Augmenting with external memories solves some problems, but is still slow.
CNN is a feed-forward net and can be parallelized, but is theoretically not too strong: random long-term dependencies are hard to encode.
Prior to 2017, most architectures were mixtures of FNN, RNN and CNN: non-uniform and hard to scale to a large number of tasks.
We need support for
Parallel computation
Long-range dependency encoding (constant path length)
Uniform construction (e.g., like the columnar structure of the neocortex)
18. Prelim: Memory networks
Input is a set, loaded into memory, which is NOT updated.
State is an RNN with attention reading from inputs.
Concepts: Query, key and content + Content addressing.
Deep models, but constant path length from input to output.
Equivalent to an RNN with a shared input set.
Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
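The read-only memory with a recurrent state can be sketched as repeated content-addressed reads. This is a simplified illustration: the additive state update, the number of hops and the toy memory are assumptions, and the learned embeddings of the cited model are omitted.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_hop(state, keys, contents):
    """One read: content-address the memory with the state, then update the state."""
    weights = softmax([sum(s * k for s, k in zip(state, key)) for key in keys])
    read = [sum(w * c[d] for w, c in zip(weights, contents))
            for d in range(len(state))]
    return [s + r for s, r in zip(state, read)]

def answer_state(query, keys, contents, hops=3):
    """The memory is never written; only the controller state changes across hops."""
    state = list(query)
    for _ in range(hops):
        state = memory_hop(state, keys, contents)
    return state

keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
contents = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]
query = [1.0, 0.0]
final_state = answer_state(query, keys, contents, hops=3)
```

Every hop attends directly to all inputs, so the path from any input to the output stays constant regardless of set size.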
19. Transformers: The triumph of self-attention
[Figure labels: State, Key, Query, Memory]
Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020).
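Self-attention lets every element act as query, key and memory content at once. Below is a minimal single-head sketch of scaled dot-product self-attention; the identity projection matrices and toy inputs are assumptions, and multi-head attention, positional encoding and feed-forward sublayers are left out.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(inputs, w_q, w_k, w_v):
    """Each element queries all elements; one output state per input element."""
    queries = [matvec(w_q, x) for x in inputs]
    keys = [matvec(w_k, x) for x in inputs]
    values = [matvec(w_v, x) for x in inputs]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(inputs, identity, identity, identity)
outputs_reversed = self_attention(inputs[::-1], identity, identity, identity)
```

Reversing the inputs simply reverses the outputs: without positional information the layer treats its input as a set.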
20. Transformers are (new) Hopfield nets
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
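The connection can be illustrated with a sketch of a continuous Hopfield retrieval step, written here as a softmax-weighted average of the stored patterns, which has the same form as attention with the patterns serving as both keys and values. The patterns, the noisy probe and the inverse temperature `beta` are assumptions chosen for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(state, patterns, beta=4.0):
    """New state = average of stored patterns, weighted by softmax(beta * similarity)."""
    sims = [sum(p_d * s_d for p_d, s_d in zip(p, state)) for p in patterns]
    weights = softmax([beta * s for s in sims])
    return [sum(w * p[d] for w, p in zip(weights, patterns))
            for d in range(len(state))]

patterns = [[1.0, 1.0, -1.0, -1.0],
            [-1.0, 1.0, -1.0, 1.0],
            [1.0, -1.0, -1.0, 1.0]]
noisy = [0.9, 0.6, -0.2, -1.0]  # corrupted version of the first pattern
retrieved = hopfield_update(noisy, patterns)
```

With well-separated patterns, one update already lands very close to the stored pattern nearest the probe.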
21. Transformer vs. memory networks
Memory network:
Attention to input set
One hidden state update at a time.
Final state integrates information of the set, conditioned on the query.
Transformer:
Loads all inputs into working memory
Assigns one hidden state per input element.
All hidden states (including those from the query) are used to compute the answer.
23. Efficient Transformers
Transformer is quadratic in time with respect to the number of input elements
Cannot deal with large sets (or sequences)
Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020).
25. Why graphs?
Graphs are pervasive in many scientific disciplines.
Deep learning needs to move beyond vector, fixed-size data.
The sub-area of graph representation has reached a certain maturity, with multiple reviews, workshops and papers at top AI/ML venues.
NeurIPS 2020
27. Biology, pharmacy & chemistry
Molecule as graph: atoms as nodes, chemical bonds as edges
Computing molecular properties
Chemical-chemical interaction
Chemical reaction
Penmatsa, Aravind, Kevin H. Wang, and Eric Gouaux. "X-ray structure of dopamine transporter elucidates antidepressant mechanism." Nature 503.7474 (2013): 85-90.
Gilmer, Justin, et al. "Neural message passing for quantum chemistry." arXiv preprint arXiv:1704.01212 (2017).
28. Materials science
• Crystal properties
• Exploring/generating solid structures
• Inverse design
Xie, Tian, and Jeffrey C. Grossman. "Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties." Physical Review Letters 120.14 (2018): 145301.
30. Basic neural graph mechanism: Message passing
Pham, Trang, et al. "Column Networks for Collective Classification." AAAI. 2017.
Relation graph
GCN update rule, vector form
GCN update rule, matrix form
Generalized message passing
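The slide's update equations did not survive extraction. As a stand-in, here is a sketch of one commonly used matrix form of the GCN update, H' = ReLU(Â H W) with Â the symmetrically normalised adjacency including self-loops. This particular normalisation, the toy graph and the weights are assumptions and may differ from the form shown on the slide.

```python
import math

def normalized_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalisation."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, features, weights):
    """Matrix form: aggregate neighbour messages, transform, apply ReLU."""
    messages = matmul(normalized_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(messages, weights)]

# Path graph 0 - 1 - 2 with one-hot node features (toy example).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = gcn_layer(adj, features, weights)
```

After one layer, node 0 has received nothing from node 2, which is two hops away; stacking layers widens each node's receptive field.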
31. Attention: Not all messages are created equal
(Do et al., arXiv'17; Veličković et al., ICLR'18)
Do, Kien, Truyen Tran, and Svetha Venkatesh. "Learning deep matrix representations." arXiv preprint arXiv:1703.01454 (2017).
32. Neural graph morphism
Input: Graph
Output: A new graph. Same nodes, different edges.
Model: Graph morphism
Method: Graph transformation policy network (GTPN)
Kien Do, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for Chemical Reaction Prediction." KDD’19.
33. Neural graph recurrence
Graphs that represent interactions between entities through time
Spatial edges capture node interactions at a single time step
Temporal edges capture consistency relationships through time
34. ASSIGN: Asynchronous, Sparse Interaction Graph Network
(Morais et al, 2021 @ A2I2, Deakin – Work in Progress)
35. Graph generation
No regular structures (e.g., grid, sequence, …)
Graphs are permutation invariant: the number of node orderings grows factorially (n!) with the number of nodes
The probability of a generated graph G needs to be marginalized over all possible permutations
Generating graphs with variable size
Aim for diversity of generated graphs
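The permutation issue can be made concrete with a tiny example: relabelling nodes changes the adjacency matrix but not the graph, and the number of orderings explodes with graph size. The 3-node path graph is an arbitrary choice for illustration.

```python
import math
from itertools import permutations

def permute(adj, order):
    """Relabel nodes: new node i is old node order[i]."""
    return [[adj[a][b] for b in order] for a in order]

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph 0 - 1 - 2

# All adjacency matrices that describe this same graph.
matrices = {tuple(map(tuple, permute(adj, p))) for p in permutations(range(3))}
n_orderings = math.factorial(3)
n_distinct = len(matrices)

# Number of node orderings a generative model must account for at 20 nodes.
orderings_for_20_nodes = math.factorial(20)
```

A model that generates adjacency matrices must spread probability over every matrix in `matrices`, which is why exact marginalisation is intractable for graphs of realistic size.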
36. Generation methods
Classical random graph models, e.g., "An exponential family of probability distributions for directed graphs" (Holland and Leinhardt, 1981)
Deep generative models: GraphVAE, Graphite, Junction Tree VAE, GAN variants, etc.
Sequence-based & RL methods
37. GraphRNN
A case of graph dynamics: nodes and edges are added sequentially.
Tractability is addressed using BFS node orderings.
You, Jiaxuan, et al. "GraphRNN: Generating realistic graphs with deep auto-regressive models." ICML (2018).
38. Step-wise graph construction using reinforcement learning
Graph rep (message passing) | graph validation (RL) | graph faithfulness (GAN)
You, Jiaxuan, et al. "Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation." NeurIPS (2018).
41. Representation learning, a bit of history
“Representation is the use of signs that stand in for
and take the place of something else”
It has been a goal of neural networks since the 1980s and of the current wave of deep learning (2005-present): replacing feature engineering.
Between 2006 and 2012, many unsupervised learning models appeared, with varying degrees of success: RBM, DBN, DBM, DAE, DDAE, PSD
Between 2013 and 2018, most models were supervised, following AlexNet
Since 2018, unsupervised learning has become competitive (with contrastive learning, self-supervised learning, BERT)!
43. Criteria for a good representation
Separates factors of variation (aka disentanglement), which are linearly correlated with desired outputs of downstream tasks.
Provides abstraction that is invariant against deformations and small variations.
Is distributed (one concept is represented by multiple units), which is compact and good for interpolation.
Optionally, offers dimensionality reduction.
Optionally, is sparse, giving room for emerging symbols.
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
44. Why neural unsupervised learning?
Neural nets have representational richness:
FFNs are function approximators
RNNs are program approximators; they can estimate a program's behaviour and generate a string
CNNs are for translation invariance
Transformers are powerful contextual encoders
Compactness: Representations are (sparse and) distributed.
Essential to perception, compact storage and reasoning
Accounting for uncertainty: Neural nets can be stochastic to model distributions
Symbolic representation: realisation through sparse activations and gating mechanisms
45. Neural autoregressive models: Predict the next step given the history
The keys: (a) long-term dependencies, (b) ordering, & (c) parameter sharing.
Can be realized using:
RNN
CNN: One-sided CNN, dilated CNN (e.g., WaveNet), PixelCNN
Transformers: GPT-X family
Masked autoencoder: MADE
Pros: General, good quality thus far
Cons: Slow; needs better inductive biases for scalability
lyusungwon.github.io/studies/2018/07/25/nade/
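The autoregressive principle is the chain rule: log p(x) = Σ_t log p(x_t | x_<t). The sketch below uses a count-based bigram model as a deliberately simple stand-in for the neural conditional; truncating the history to one previous symbol, the add-one smoothing and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(sequences, vocab):
    """Estimate p(next | previous) by counting, with add-one smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(["<s>"] + seq, seq):
            counts[prev][nxt] += 1

    def prob(nxt, prev):
        return (counts[prev][nxt] + 1) / (sum(counts[prev].values()) + len(vocab))

    return prob

def log_likelihood(seq, prob):
    """Chain rule: sum of log p(x_t | history), with history = previous symbol."""
    return sum(math.log(prob(nxt, prev)) for prev, nxt in zip(["<s>"] + seq, seq))

vocab = ["a", "b"]
prob = fit_bigram([list("abab"), list("abab"), list("aabb")], vocab)
ll_seen = log_likelihood(list("abab"), prob)
ll_unseen = log_likelihood(list("bbbb"), prob)
```

The same conditional `prob` is reused at every position (parameter sharing), and scoring or sampling proceeds one step at a time, which is where the slowness noted above comes from.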
46. Generative models: Discover the underlying process that generates data
Many applications:
• Text to speech
• Simulate data that are hard to obtain/share in real life (e.g., healthcare)
• Generate meaningful sentences conditioned on some input (foreign language, image, video)
• Semi-supervised learning
• Planning
47. Deep (Denoising) AutoEncoder: Self-reconstruction of data
[Figure labels: Auto-encoder; Feature detector; Representation; Raw data (optionally with added noise); Reconstruction; Deep Auto-encoder; Encoder; Decoder]
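The denoising objective can be sketched as: corrupt the input, encode, decode, and compare the reconstruction with the clean input. The forward pass below is untrained and makes several assumptions for brevity: masking noise, a single tanh encoder layer, tied decoder weights and squared error.

```python
import math
import random

def corrupt(x, rng, drop_prob=0.3):
    """Masking noise: zero out each input dimension with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def encode(x, w, b):
    """Encoder: h = tanh(W x + b)."""
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

def decode(h, w, c):
    """Decoder with tied weights: x_hat = W^T h + c."""
    return [sum(w[i][j] * h[i] for i in range(len(h))) + c[j]
            for j in range(len(c))]

def denoising_loss(clean, w, b, c, rng):
    """Reconstruct from the noisy input, but score against the clean input."""
    noisy = corrupt(clean, rng)
    recon = decode(encode(noisy, w, b), w, c)
    return sum((r - x) ** 2 for r, x in zip(recon, clean)) / len(clean)

rng = random.Random(0)
n_in, n_hidden = 4, 2  # hypothetical sizes
w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
b = [0.0] * n_hidden
c = [0.0] * n_in
clean = [1.0, 0.0, 1.0, 0.0]
noisy = corrupt(clean, rng)
loss = denoising_loss(clean, w, b, c, rng)
```

Training would minimise `loss` over many samples with a gradient-based optimiser; stacking several encoder and decoder layers gives the deep variant.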
49. GAN: Generative Adversarial nets
Matching data statistics
Yann LeCun: GAN is one of the best ideas in the past 10 years!
Instead of modeling the entire distribution of data, a GAN learns to map ANY random distribution into the region of data, so that no discriminator can distinguish sampled data from real data.
[Figure labels: Any random distribution in any space; Neural net that maps z → x; Binary discriminator, usually a neural classifier]
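The adversarial game can be summarised by its two losses. The sketch below computes the standard binary cross-entropy discriminator loss and the commonly used non-saturating generator loss from discriminator outputs; the probability values are made up for illustration, and no networks are trained here.

```python
import math

def discriminator_loss(d_real, d_fake):
    """-E[log D(x)] - E[log(1 - D(G(z)))], averaged over each batch."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: -E[log D(G(z))]."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# D outputs the probability that a sample is real (values are made up).
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
g_losing = generator_loss([0.1, 0.2])
g_winning = generator_loss([0.5, 0.5])
```

When the discriminator can do no better than output 0.5 everywhere, its loss sits at 2 log 2, the point at which sampled data are indistinguishable from real data.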
51. Progressive GAN: Generated images
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
52. BERT
Transformer that predicts its own masked parts
BERT is like parallel approximate pseudo-likelihood, i.e., maximizing the conditional likelihood of some variables given the rest.
When the number of variables is large, this converges to MLE (maximum likelihood estimate).
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
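Predicting masked parts from the rest can be sketched as a data-preparation step: hide a random subset of tokens and keep the originals as targets only at the hidden positions. This is a simplification; the mask rate used here, whole-word tokens and always substituting a `[MASK]` symbol are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (inputs, targets); targets are None where no prediction is required."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)   # the model must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return inputs, targets

rng = random.Random(0)
tokens = "deep learning has taken the digital world by storm".split()
inputs, targets = mask_tokens(tokens, rng, mask_prob=0.3)
```

All masked positions are predicted in parallel from the same corrupted input, each conditioned on the visible rest, which is the pseudo-likelihood flavour described above.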
54. Unsupervised learning: A few more points
No external labels, but rich training signals (thousands of bits per sample, as opposed to a few bits in supervised learning).
A few techniques:
Compressing data as much as possible with little loss
Energy-based, i.e., pull down the energy of observed data, pull up everything else
Filling the missing slots (aka predictive learning, self-supervised learning)
We have not covered unsupervised learning on graphs (e.g., DeepWalk, GPT-GNN), but the general principles should hold.
Question: Multiple objectives, or no objective at all?
Question: Emergence from many simple interacting elements?
Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." arXiv preprint arXiv:2006.08218 (2020).