2019 4-nn-and-dl-tao wang@unc-v2

Copyright © SAS Institute Inc. All rights reserved.
Artificial Neural Networks and Deep Learning
Tao Wang
2019
Guest Lecture, UNC Chapel Hill
This presentation is based on information in the public domain
Opinions expressed are solely my own and may not represent the views of my employer

Part 1:
Introduction

AI vs. Machine Learning
Take 1
Source: https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55

Take 2
Machine learning is a “field of study that gives computers the
ability to learn without being explicitly programmed.”
– Arthur Samuel, 1959
“Artificial intelligence (AI), sometimes called machine
intelligence, is intelligence demonstrated by machines.”
– Wikipedia, retrieved 2018

Take 3

Take 4 – my own version
AI: goal
Analytics: business
Machine Learning: means

Machine Learning and Deep Learning
Machine Learning
Deep Learning
Examples: AlphaGo, Bidirectional Encoder Representations from Transformers (BERT), AlphaFold
Image sources: Siliconangle, googleblog, profacgen

Part 2:
Artificial Neural Networks

Brief History
Source: Lukas Masuch, Deep Learning, The Past, Present and Future of Artificial Intelligence

Artificial Neural Network
Neuron
Source: Lukas Masuch, Deep Learning, The Past, Present and Future of Artificial Intelligence & http://cs231n.github.io/neural-networks-1/

Biological Neural Networks
Pigeons as art experts (Watanabe et al., 1995)
Source: Yan Xu, Building an artificial neural network, 2017
• Pigeons discriminate between Van Gogh and Chagall with 95% accuracy (on the training dataset)
• 85% accuracy for unseen paintings (validation dataset)
• Pigeons can learn to recognize "style" using their biological neural networks
• Artificial neural networks should be able to do the same!

Start with the vanilla (multilayer perceptron) version
MNIST dataset, 28x28 grayscale, handwritten digits (0-9)
Source: But what *is* a Neural Network?
Input layer: 28x28 = 784 neurons; 2 hidden layers with 16 neurons each in this example; output layer: 10 neurons (digits 0-9)
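A minimal NumPy sketch of the forward pass through the network on this slide (784 inputs, two hidden layers of 16 neurons, 10 outputs). The sigmoid activation, the random initialization and the helper names init_params and forward are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    """Create one (weights, biases) pair per connection between layers."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def forward(params, x):
    """Propagate one input vector through every layer."""
    activation = x
    for weights, biases in params:
        activation = sigmoid(weights @ activation + biases)
    return activation

layer_sizes = [784, 16, 16, 10]           # input, hidden 1, hidden 2, output
params = init_params(layer_sizes)
n_params = sum(w.size + b.size for w, b in params)

x = np.random.default_rng(1).random(784)  # stand-in for a flattened 28x28 image
output = forward(params, x)

print(n_params)      # 13002 weights and biases in total
print(output.shape)  # (10,), one activation per digit
```

Even this small network has 13,002 trainable numbers, which is what "learning" has to find.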

Why does it have 2 hidden layers?
The hope: hidden layer 1 = edges/pieces, hidden layer 2 = parts

How about the connections?
Connections = Weights
Activation function: Sigmoid/ReLU/ELU (Rectified/Exponential Linear Units)
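The three activation functions named above can be written in a few lines of NumPy. This is a sketch using their standard definitions; the alpha=1.0 default for ELU is an assumption (a common choice, not stated on the slide).

```python
import numpy as np

def sigmoid(z):
    """Smooth squashing function with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: pass positives through, clamp negatives to 0."""
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    """Exponential Linear Unit: like ReLU, but smooth and negative below 0."""
    z = np.asarray(z, dtype=float)
    # expm1 is only applied to the non-positive part to avoid overflow.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(relu(z))
print(elu(z))
```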

What is Machine Learning in ANN?
ML in ANN = find the right weights and biases without over-fitting

How does ANN learn?
Minimize the cost function using gradient descent
Cost function: average training error

How does ANN learn, exactly?
Backpropagation, one neuron per layer in this example

Now, multiple neurons per layer
Result of backpropagation: weight and bias change
New weight = old weight – learning rate * gradient
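The update rule above can be tried on the smallest possible model. The sketch below fits a single weight and bias to a toy line with plain (full-batch) gradient descent on a mean-squared-error cost; the toy data, the learning rate of 0.1 and the function names are assumptions made for illustration.

```python
import numpy as np

def cost(w, b, x, y):
    """Cost function: average squared training error."""
    return np.mean((w * x + b - y) ** 2)

def gradients(w, b, x, y):
    """Partial derivatives of the cost with respect to w and b."""
    error = w * x + b - y
    return 2.0 * np.mean(error * x), 2.0 * np.mean(error)

def gradient_descent(x, y, learning_rate=0.1, steps=500):
    w, b = 0.0, 0.0
    history = [cost(w, b, x, y)]
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b, x, y)
        # New weight = old weight - learning rate * gradient
        w = w - learning_rate * grad_w
        b = b - learning_rate * grad_b
        history.append(cost(w, b, x, y))
    return w, b, history

x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0                 # toy target: weight 2, bias 1
w, b, history = gradient_descent(x, y)
print(round(w, 3), round(b, 3))   # close to 2.0 and 1.0
```

In a real network, backpropagation supplies the gradients for every weight and bias; the update step itself stays the same.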

Wait, you may find some local minima
If you use SGD for efficiency, results may also not be repeatable

Part 3:
Deep Learning

Deep Learning: Model with Depth
Shallow learning
• Model with one or a few layers
• Data → Model → Output
Deep learning
• Multiple layers, layer-by-layer processing
• Feature extraction/transformation
• Learn complex structures
• Data → Model (Layer → Layer → Layer) → Output
Deep Learning (DL) = Deep Neural Networks (DNN), ignoring subtle stuff
Source: [5] X. Hunt, et al.

Pros and cons
• Advantages
1. Requires minimal feature engineering (end-to-end ML)
2. Flexible structures
3. Learning often improves with more data
4. Proven track records in speech/text processing and image/video recognition
• Disadvantages
1. Difficult to interpret – often treated as a “black-box” model
2. Long training time, over-fitting
3. Hard to train, non-repeatable results, numerous architectures/hyper-parameters
4. Requires a large amount of training data to get good models

Why so popular?
1. End-to-end/distributed feature learning
2. Advances in algorithms/optimizations (mini-batch, drop-out, BN, SGD, etc.)
3. Cloud computing and GPU made it possible to train very deep models
4. Proven track records in speech/text processing and image/video recognition
Source: [6] D. Silver

More about DNN
• When should I use DNN?
• Deal with image/video/text/speech
• Works for small-medium data, but prefers big data
• The underlying model is complex and non-linear
• OK with non-interpretability, and/or have cloud/GPU
• Common DNN architectures
• Deep Forward Nets
• Convolutional neural networks (CNN)
• Recurrent neural networks (RNN)
• Stacked auto-encoders

Deep Forward Net
• A flat architecture
• Regression and classification
DNN architectures: #1
Source: [4] W. Thompson

Convolutional neural network (CNN)
• A feedforward neural net with conv layers
• 3D volumes of neurons
• Feature extraction
• Memorize the training data
• Applications: image/video recognition
• GPU can be useful (parallel processing of pixels)
DNN architectures: #2

AlexNet: opening the eyes of AI
2012-2018, CV moment for DL: AI can see
• ImageNet-1000, 5 conv layers, 3 max pooling layers, 3 dense layers
• Convolution Layer: feature extraction using image convolution
• Pooling layer: downsamples its input (image or feature map)
• Dense (fully connected) layer: prediction
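A small NumPy sketch of the first two layer types listed above: a "valid" 2D convolution (implemented as cross-correlation, as is usual in deep learning) followed by 2x2 max pooling. The toy 6x6 image and the hand-written vertical-edge kernel are assumptions for illustration; in a trained CNN such as AlexNet the kernels are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each spot."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Downsample by keeping the maximum of each size x size block."""
    h, w = feature_map.shape
    h2, w2 = h // size, w // size
    trimmed = feature_map[:h2 * size, :w2 * size]
    return trimmed.reshape(h2, size, w2, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # dark left half, bright right half
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # responds to vertical edges

features = conv2d_valid(image, kernel)   # shape (4, 4), peaks at the edge
pooled = max_pool2d(features)            # shape (2, 2)
print(features)
print(pooled)
```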
Source: Lukas Masuch, Deep Learning, The Past, Present and Future of Artificial Intelligence, page 38, AlexNet

Source: Angjoo Kanazawa, Convolutional Neural Networks, 2015
SIFT: Scale-Invariant Feature Transform

How about ColorNet?
Maybe your black-box CNN already did the color space transformation
• ColorNet, https://arxiv.org/abs/1902.00267, 1 Feb 2019

How fast is the training of CNN?
With ResNet-50 on ImageNet
• Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds
• Mar 29, 2019
• MXNet, an open source deep learning framework written in C++ and CUDA C languages

Recurrent neural network (RNN)
• Contains at least one feedback connection
• Memorize the sequence/history of training data
• Time-series forecasting, speech recognition, NLP
• GPU gives limited speedup (sequential processing)
DNN architectures: #3
Figure: feedback connection with a delay, from h1(t-1) to h1(t)

LSTM and GRU
Long Short-Term Memory and Gated Recurrent Unit RNN
• Vanishing gradient in a plain RNN = short-term memory
• Gradient becomes too small in backpropagation = forget longer history
• LSTM: learn what to remember, what to forget = memorize longer history
• GRU: simpler structure
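The vanishing gradient can be seen numerically with a one-neuron recurrent network. The sketch below assumes the recurrence h(t) = tanh(w * h(t-1) + x(t)) and tracks how strongly the state at step t still depends on the initial state; the recurrent weight of 0.5, the sequence length and the function name are illustrative assumptions.

```python
import numpy as np

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h(t) / d h(0) after each step of a scalar tanh RNN."""
    h, grad, grads = h0, 1.0, []
    for x in inputs:
        h = np.tanh(w * h + x)
        # Chain rule: each step multiplies by w * tanh'(.) = w * (1 - h^2).
        grad *= w * (1.0 - h ** 2)
        grads.append(grad)
    return grads

rng = np.random.default_rng(0)
inputs = rng.normal(0.0, 0.5, size=50)
grads = gradient_through_time(w=0.5, inputs=inputs)

print(abs(grads[0]))   # influence of h(0) after 1 step
print(abs(grads[-1]))  # after 50 steps: vanishingly small
```

Because every factor has magnitude at most 0.5 here, the product shrinks at least geometrically, so the early history contributes almost nothing to the weight update. LSTM and GRU gates are designed to keep this path from collapsing.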
Source: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

Attention Models
2014-2017
• Neural Machine Translation by Jointly Learning to Align and Translate
• Attention Is All You Need
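For reference, a NumPy sketch of scaled dot-product attention, the core operation of the second paper: softmax(Q K^T / sqrt(d_k)) V. The random toy matrices and their sizes are assumptions; a real Transformer adds learned projections, multiple heads and masking, none of which are shown here.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Each query gets a weighted average of the values."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)  # similarity of query and key
    weights = softmax(scores)                 # one distribution per query
    return weights @ values, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions, dimension 4
K = rng.normal(size=(5, 4))   # 5 key positions, dimension 4
V = rng.normal(size=(5, 2))   # 5 values, dimension 2

context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape)          # (3, 2)
print(weights.sum(axis=1))    # each row sums to 1
```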

Some Recent Achievements, 2018-2019Q1
ELMo, GPT, BERT, GPT-2
• ELMo, Deep contextualized word representations, Feb 2018
• OpenAI GPT, Improving Language Understanding by Generative Pre-Training, Jun 2018
• BERT, Pre-training of Deep Bidirectional Transformers for Language Understanding, Oct 2018
• Outperformed the human baseline on some NLP benchmarks
• OpenAI GPT-2, Feb 2019
• Only a much smaller version of GPT-2 was released, along with sampling code, due to concerns about it being used to generate deceptive language at scale

More about BERT
Because many SOTA models = BERT + something

More about BERT
Two Steps
• Pre-training Step
• Task #1: Masked LM
Input: The man went to the [MASK1]. He bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon.
• Task #2: Next sentence prediction
Example 1: Input A: The man went to the store. Input B: He bought a gallon of milk. Label: IsNext
Example 2: Input A: The man went to the store. Input C: Penguins are birds. Label: NotNext
• Fine-tuning step
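A toy sketch of how a masked-LM training example like the one in Task #1 can be built: chosen tokens are replaced by a mask token and kept as labels. The whitespace tokenization, the hand-picked positions and the function name mask_tokens are simplifying assumptions; BERT itself uses WordPiece tokens and selects positions at random.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Hide the tokens at the given positions and remember them as labels."""
    masked = list(tokens)
    labels = {}
    for position in positions:
        labels[position] = tokens[position]
        masked[position] = mask_token
    return masked, labels

sentence = "the man went to the store . he bought a gallon of milk ."
tokens = sentence.split()

masked, labels = mask_tokens(tokens, positions=[5, 10])
print(" ".join(masked))
print(labels)  # {5: 'store', 10: 'gallon'}
```

During pre-training the model sees only the masked sequence and is trained to predict the label at each masked position.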
Source: L. Cai, From Word Embeddings to BERT, 2019

NLP moment for DL has arrived: AI can read
2014/17-present, Attention, Transformer, BERT, GPT-2 and beyond
Source: https://gluebenchmark.com/leaderboard, retrieved in Apr 2019

Auto-encoder
• A generative graphical model
• Feature coding, dimension reduction and compression
DNN architectures: #4

2018 ACM A.M. Turing Award (announced in 2019)
Yoshua Bengio, Geoffrey Hinton, Yann LeCun
• They saved AI, changed the world
Source: https://awards.acm.org/about/2018-turing
• But what about RNN/LSTM?
• Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory (LSTM), 1997
• Kyunghyun Cho et al., Gated Recurrent Unit (GRU), 2014
• Bloomberg: This Man Is the Godfather the AI Community Wants to Forget

SOTA
State-Of-The-Art
• https://paperswithcode.com/sota

Can you talk about GPU?
RAPIDS
• Open GPU Data Science from NVIDIA
• Place your bet, CPU or GPU

DNN supported by SAS
Source: [7] White paper: How to Do Deep Learning With SAS?

SAS platform for DL

SAS® Visual Data Mining and Machine Learning (VDMML)
Visual “drag & drop” GUI

Applications
Input → DNN
• Military surveillance
• Speech recognition
• Fraud detection
• Image classification
• Autonomous vehicles
• Patient identification

Autonomous vehicles
An application of DNN
The tipping point: level 3 Partial Autonomy
Source: https://iq.intel.com/autonomous-cars-road-ahead/
Expected Timeline for Full Autonomy?
Source: https://thelastdriverlicenseholder.com/2016/12/29/expected-timeline-for-full-autonomy/
Focus on Level 3 and deliver!

Navigant Research Leaderboard
Automated Driving Vehicles
Source: https://www.navigantresearch.com/research/navigant-research-leaderboard-automated-driving-vehicles, retrieved in 2018

End to End Learning for Self-Driving Cars
• arXiv:1604.07316, Apr 2016, from NVIDIA
• Basic idea: behavioral cloning, train the car to drive like you do
• Uses CNN to map images from cameras to steering commands
• Never explicitly train the CNN to detect/follow lanes, path planning, etc.
Figures: high-level view of the data collection system; training the CNN; self-driving
Source: [1] M. Bojarski, et al.

CNN architecture and the core source code
Read it from the bottom up: input layer, normalization layer, 5 Conv2D layers (feature extraction), 3 fully-connected layers, output (controller).
27M connections, 250K parameters, 3MB in size. Source: arXiv:1604.07316
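The connection and parameter counts can be checked with a little arithmetic. The layer shapes below (a 3x66x200 input, three 5x5 stride-2 convolutions with 24, 36 and 48 filters, two 3x3 stride-1 convolutions with 64 filters, then dense layers of 100, 50, 10 and 1 units) are my reading of the architecture figure in arXiv:1604.07316 and should be treated as assumptions; the slides do not list them.

```python
def conv_output_size(size, kernel, stride):
    """Output length of a 'valid' convolution along one dimension."""
    return (size - kernel) // stride + 1

# (filters, kernel size, stride) per convolutional layer -- assumed shapes.
CONV_LAYERS = [(24, 5, 2), (36, 5, 2), (48, 5, 2), (64, 3, 1), (64, 3, 1)]
# Units per fully-connected layer, ending with the single steering output.
DENSE_LAYERS = [100, 50, 10, 1]

def count_parameters_and_connections(input_shape=(3, 66, 200)):
    channels, height, width = input_shape
    params = connections = 0
    for filters, kernel, stride in CONV_LAYERS:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
        weights_per_output = kernel * kernel * channels
        params += weights_per_output * filters + filters     # shared weights + biases
        connections += weights_per_output * filters * height * width
        channels = filters
    n_in = channels * height * width                         # flattened features
    for n_out in DENSE_LAYERS:
        params += n_in * n_out + n_out
        connections += n_in * n_out
        n_in = n_out
    return params, connections

params, connections = count_parameters_and_connections()
print(params)       # 252219, about 250K
print(connections)  # 26876342, about 27M
```

Under these assumed shapes the totals agree with the 250K parameters and 27M connections quoted above; the gap between the two numbers comes from weight sharing in the convolutional layers.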
Source: github, the NVIDIA 2016 paper implementation

Part 4:
Some interesting stuff
AI with THE POWER OF THE PACK
AI with THE POWER OF DIVERSITY
AI with THE POWER OF TRUST

Rediscover Deep Learning
1. End to End
2. Distributed Feature Learning
3. Big Data, Big Model

Capsule Networks – power of the pack
Sources: Yoshua Bengio; Pablo Picasso; CB Insights, State of AI; Forbes

Capsule Network paper
Yep, talking about papers again
• S. Sabour, N. Frosst, G. Hinton, Dynamic Routing Between Capsules,
Google Brain, NIPS 2017, https://arxiv.org/abs/1710.09829
• Introduced years ago by Hinton, but did not work properly until now
• Widely considered the beginning of a new chapter of deep learning
• Some follow-up papers, such as Matrix Capsules With EM Routing
• https://openreview.net/pdf?id=HJWLfGWRb, ICLR 2018
• Introduced capsule convolution layer and more sophisticated routing
Source: http://www.cs.toronto.edu/~hinton

Dynamic Routing Between Capsules
• Idea #1: capsule is an encapsulated vector/matrix in the network
• A capsule is a group of neurons that represents the parameters of some specific feature.
• It extends a neuron's scalar output to a vector or matrix
• The length of the vector represents the probability that a feature or an object is present
• Each dimension within the capsule represents detailed information such as location, size, orientation, etc.
• Idea #2: routing by agreement
• A lower-level capsule (near the input) prefers to send its output to higher-level capsules (near the output) with a “similar” prediction
• Cosine similarity is used to measure the agreement
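Both ideas can be illustrated in a few lines of NumPy. The squash function below follows the formula in the Dynamic Routing paper, which rescales a capsule vector so that its length lies between 0 and 1 and can be read as a probability. Agreement is measured here with cosine similarity, as on the slide; as far as I recall, the paper itself uses a plain scalar product, so treat the similarity choice and the toy vectors as assumptions.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Keep the direction of s, but map its length into the range [0, 1)."""
    squared_norm = np.sum(s ** 2)
    norm = np.sqrt(squared_norm + eps)
    return (squared_norm / (1.0 + squared_norm)) * (s / norm)

def cosine_similarity(u, v, eps=1e-9):
    """Agreement between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

raw_capsule = np.array([3.0, 4.0])           # length 5 before squashing
capsule_output = squash(raw_capsule)         # length 25/26 after squashing

agreeing_prediction = np.array([0.6, 0.8])   # same direction as the capsule
orthogonal_prediction = np.array([-0.8, 0.6])

print(np.linalg.norm(capsule_output))
print(cosine_similarity(capsule_output, agreeing_prediction))
print(cosine_similarity(capsule_output, orthogonal_prediction))
```

In routing by agreement, predictions that agree strongly with a higher-level capsule's output get a larger share of the lower-level capsule's output in the next iteration.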

CapsNet Architecture
▪ Input: MNIST dataset
▪ ReLU conv1: extracts local features
▪ PrimaryCaps: forms new neural unit (capsule)
▪ DigitCaps: contains 10 capsules (number 0 to 9)
▪ Cosine similarity (routing) is applied between PrimaryCaps and DigitCaps
▪ Reconstruction: a regularization method to encourage the capsules to encode the input digit
Figure 1: A simple CapsNet with 3 layers. Figure 2: Reconstructing a digit from the DigitCaps layer representation.
source: https://arxiv.org/abs/1710.09829

Core source code
Source: github, the NIPS 2017 paper implementation

Numerical results of the NIPS paper
source: https://arxiv.org/abs/1710.09829

Deep Forest – power of diversity
$E = \bar{E} - \bar{D}$ (ensemble error = average individual error − average diversity)

Deep Forest paper series
• Deep Forest [10], using RF to do DL with the “3 key ingredients”:
• In-model feature extraction and transformation, end-to-end machine learning
• Layer by layer processing, distributed representation learning
• Complex model
• AutoEncoder by Forest [11]
• The first tree ensemble based auto-encoder
• Multi-Layered Gradient Boosting Decision Trees [12]
• A variant of target propagation, pseudo-mapping F, pseudo-inverse-mapping G, pseudo-label Z (F-G-Z framework)
• More to come?
Why always Neural Nets? We can do DL using Decision Trees!

Deep Forest paper
• IJCAI 2017 paper [10], by Zhou and his student
• DeepForest = Forest ensemble, double-happiness (ensemble of ensembles)
1. Multi-grain scanning: a sliding window to extract features
2. Cascade of multiple random forest layers for prediction
• Very few hyper-parameters (how nice!) and reported as competitive with DNNs
• Default settings are good for many applications
• Non-differentiable model, no back propagation
Source: https://en.wikipedia.org/wiki/Zhi-Hua_Zhou

Deep Forest paper
Problems of DNN
• Too many hyper-parameters (like an art rather than science)
• Does not work well for small data
• Model architecture/complexity is determined in advance (via tuning)
• Often overly complicated
• Shortcut connection, pruning, binarization, etc. are often applied

Deep Forest paper
Why deep forest? Motivations?
• Decision trees
• Architecture learning (grow/split until done)
• Data driven
• Almost unbeatable on tabular data in Kaggle
• Motivations
• DL = DNN?
• Can we do DL with non-differentiable models (no back-propagation)?
• Maybe repeatable results (think SGD)?

Deep Forest paper
Inspiration from DNN
• Distributed representation learning (end to end, in-model feature transformation)
• Layer-by-layer processing
• Model complexity
Source: [10] Z. Zhou and J. Feng

Deep Forest paper
Multi-Grained Scanning for Feature Engineering
• Sequential relationships are important
• Spatial relationships are important
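The sliding-window step of multi-grained scanning can be sketched as follows for sequence data. The 400-dimensional raw feature vector, the window size of 100 and the function name sliding_windows are illustrative assumptions; in Deep Forest each extracted window is then passed to forests, which is not shown here.

```python
import numpy as np

def sliding_windows(features, window, stride=1):
    """Cut a 1-D feature vector into overlapping windows of equal length."""
    n = len(features)
    starts = range(0, n - window + 1, stride)
    return np.stack([features[start:start + window] for start in starts])

raw_features = np.arange(400, dtype=float)   # stand-in for one raw sample
instances = sliding_windows(raw_features, window=100)

print(instances.shape)  # (301, 100): 301 windows of 100 features each
```

Each window keeps neighbouring features together, which is how the scan preserves sequential relationships; for images the same idea applies with a 2-D window to preserve spatial relationships.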

Deep Forest paper
Cascade Forest Structure for Prediction
• Ensemble of ensembles
• K-fold cross validation
• Architecture learning (stop growing when satisfied)

Deep Forest paper
Class Vector Generation

Deep Forest paper
Overall Architecture

Deep Forest paper
Hyper-parameters and default settings

Deep Forest paper
Experimental results
Panels: Image Categorization; Face Recognition; Music Classification; Hand Movement Recognition

Deep Forest paper
More experimental results
Panels: Sentiment Classification; Low-Dimensional Data; High-Dimensional Data
(It is hard to beat a successful method at its killer app with a brand-new algorithm)

Deep Forest paper
Hyper-parameter sensitivity

Blockchain – power of trust
A Unified Framework for Trustable AI, Machine Learning and Analytics
Diagram: AI, Analytics, Machine Learning

Proposed framework
A Unified Analytical Framework for Trustable ML Running with Blockchain

What’s next for AI & Deep Learning?
CV, NLP, then?
• CV moment for DL: AI can see
• 2012 - 2018
• NLP moment for DL: AI can read
• 2014/17 - present
• Blockchain moment for DL: AI can trust
• TBD?

Closing Remarks
AI and deep learning are very hard – just keep trying!

More photos like this
Just google it
Source: https://www.npr.org/sections/thesalt/2016/03/11/470084215/canine-or-cuisine-this-photo-meme-is-fetching

Selected References
• [1] M. Bojarski, et al., End to End Learning for Self-Driving Cars, arXiv:1604.07316, 2016.
• [2] S. Sabour, N. Frosst, G. Hinton, Dynamic Routing Between Capsules, Google Brain, NIPS 2017, https://arxiv.org/abs/1710.09829
• [3] D. Silver, A. Huang, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (7587): 484–489, 2016.
• [4] W. Thompson, et al., Introduction to Deep learning, SAS, 2016.
• [5] X. Hunt, et al., Machine Learning Landscape, SAS, 2017.
• [6] D. Silver, Tutorial: Deep Reinforcement Learning, 2017.
• [7] White paper: How to Do Deep Learning With SAS? 2018.
• [8] Y. LeCun, et al., Deep learning, Nature, 2015.
• [9] I. Goodfellow, et al., Generative Adversarial Nets, https://arxiv.org/abs/1406.2661, 2014.
• [10] Z. Zhou and J. Feng, Deep Forest, IJCAI 2017.
• [11] J. Feng and Z. Zhou, AutoEncoder by Forest, AAAI 2018.
• [12] J. Feng, Y. Yu, Z. Zhou, Multi-Layered Gradient Boosting Decision Trees, https://arxiv.org/abs/1806.00007, 2018
• [13] R. Tanno, et al., Adaptive Neural Trees, https://arxiv.org/abs/1807.06699, 17 Jul 2018.
• [14] T. Wang, A Unified Analytical Framework for Trustable Machine Learning and Automation Running with Blockchain, IEEE Big Data Workshops, https://arxiv.org/abs/1903.08801, 2018.

Upcoming Events, and AMA (Ask Me Anything)
Shameless ads
• Running for 2019 ACM SIGAI Vice-Chair
• Vote for Tao Wang
• RTP ACM Chapter is up & running, join us!
• AutoML 2019 workshop, recruiting PC
• Call For Papers
• IEEE SMC 2019 Special Sessions (Human Perception in Multimedia Computing, code: bf856), Oct 2019, Bari, Italy
• ICSM 2019, Dec 2019, San Diego, CA, USA
