Deep Learning
and Automatic Differentiation
from Theano to PyTorch
About me
Postdoctoral researcher, University of Oxford
● Working with Frank Wood, Department of Engineering Science
● Working on probabilistic programming and its applications in science
(high-energy physics and HPC)
Long-term interests:
● Automatic (algorithmic) differentiation (e.g. http://diffsharp.github.io )
● Evolutionary algorithms
● Computational physics
2
Deep learning
3
Deep learning
A reincarnation/rebranding of artificial neural networks, with roots in
● Threshold logic (McCulloch & Pitts, 1943)
● Hebbian learning (Hebb, 1949)
● Perceptron (Rosenblatt, 1957)
● Backpropagation in NNs (Werbos, 1975; Rumelhart, Hinton, Williams,
1986)
Frank Rosenblatt with the Mark I Perceptron, holding a set of neural network weights
4
Deep learning
State of the art in computer vision
● ImageNet classification with deep convolutional neural networks
(Krizhevsky et al., 2012)
○ Halved the error rate achieved with pre-deep-learning methods
● Replacing hand-engineered features
● Modern systems surpass human-level top-5 accuracy on ImageNet classification
Faster R-CNN
(Ren et al., 2015)
Top-5 error rate for ImageNet
https://devblogs.nvidia.com/parallelforall/mocha-jl-deep-learning-julia/
5
Deep learning
State of the art in speech recognition
● Seminal work by Hinton et al. (2012)
○ First major industrial application of deep learning
● Replacing HMM-GMM-based models
Recognition word error rates (Huang et al., 2014)
6
Deep learning
State of the art in machine translation
● Based on RNNs and CNNs (Bahdanau et al., 2015; Wu et al., 2016)
● Replacing statistical translation with engineered features
● Google (Sep 2016), Microsoft (Nov 2016), Facebook (Aug 2017) moved to
neural machine translation
https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/
Google Neural Machine Translation System (GNMT)
7
What makes deep learning tick?
8
Deep neural networks
An artificial “neuron” is loosely based on the biological one
● Receives a set of weighted inputs (dendrites)
● Integrates and transforms them (cell body)
● Passes the output further (axon)
9
Deep neural networks
Three main building blocks
● Feedforward (Rosenblatt, 1957)
● Convolutional (LeCun et al., 1989)
● Recurrent (Hopfield, 1982; Hochreiter & Schmidhuber, 1997)
10
Deep neural networks
Newer additions introduce (differentiable) algorithmic elements
● Neural Turing Machine (Graves et al., 2014)
○ Can infer algorithms: copy, sort, recall
● Stack-augmented RNN (Joulin & Mikolov, 2015)
● End-to-end memory network (Sukhbaatar et al., 2015)
● Stack, queue, deque (Grefenstette et al., 2015)
● Discrete interfaces (Zaremba & Sutskever, 2015)
Neural Turing Machine on copy task (Graves et al., 2014)
11
Deep neural networks
● Deep, distributed representations learned end-to-end (also called
“representation learning”)
○ From raw pixels to object labels
○ From raw waveforms of speech to text
(Masuch, 2015)
Labels
12
Deep neural networks
Deeper is better (Bengio, 2009)
● Deep architectures consistently beat shallow ones
● Shallow architectures can be exponentially less efficient for the same performance
● Hierarchical, distributed representations achieve non-local
generalization
13
Data
● Deep learning needs massive amounts of data
○ The data must be labeled for supervised learning
● A rough rule of thumb: “deep learning will generally match or exceed
human performance with a dataset containing at least 10 million
labeled examples.” (The Deep Learning book, Goodfellow et al., 2016)
Dataset sizes (Goodfellow et al., 2016)
14
GPUs
Layers of NNs are conveniently expressed as series of
matrix multiplications
One input vector, n neurons of k inputs
A batch of m input vectors
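As a rough illustration of the point above, the sketch below writes one fully connected layer as a matrix product in plain Python. The helper `matmul`, the names `W`, `x`, `X` and the concrete sizes are illustrative assumptions, not taken from the slides; a real framework would hand this product to a BLAS GEMM routine.

```python
def matmul(A, B):
    """Multiply an (r x k) matrix A by a (k x c) matrix B, both given as lists of rows."""
    k = len(B)
    assert all(len(row) == k for row in A), "inner dimensions must match"
    cols = len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(k)) for j in range(cols)] for row in A]

# Hypothetical layer: n = 2 neurons, each with k = 3 inputs, so W is (n x k).
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]

# One input vector as a (k x 1) column: the layer's pre-activation is W x.
x = [[1.0], [2.0], [3.0]]
single = matmul(W, x)          # shape (n x 1)

# A batch of m = 2 input vectors stacked as the columns of a (k x m) matrix:
# a single matrix-matrix product handles the whole batch.
X = [[1.0, 0.0],
     [2.0, 1.0],
     [3.0, 0.0]]
batch = matmul(W, X)           # shape (n x m)
```

The first column of `batch` equals `single`, which is the sense in which batching turns many matrix-vector products into one matrix-matrix product.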
15
GPUs
● BLAS (mainly GEMM) is at the heart of mainstream deep learning,
commonly running on off-the-shelf graphics processing units
● Rapid adoption after
○ Nvidia released CUDA (2007)
○ Raina et al. (2009) and Ciresan et al. (2010)
● ASICs such as tensor processing units (TPUs) are being introduced
○ As low as 8-bit integer precision (first-generation TPU), better power efficiency
Nvidia Titan Xp (2017)
Google Cloud TPU server
16
Deep learning frameworks
● Modern tools make it extremely easy to implement / reuse models
● Off-the-shelf components
○ Simple: linear, convolution, recurrent layers
○ Complex: compositions of complex models (e.g., CNN + RNN)
● Base frameworks: Torch (2002), Theano (2011), Caffe (2014),
TensorFlow (2015), PyTorch (2016)
● Higher-level model-building libraries: Keras (Theano & TensorFlow),
Lasagne (Theano)
● High-performance low-level bindings: BLAS, Intel MKL, CUDA, Magma
17
Learning: gradient-based optimization
Loss function
(Ruder, 2017)
http://ruder.io/optimizing-gradient-descent/
18
Parameter update
Stochastic gradient descent (SGD)
In practice we use SGD or adaptive-learning-rate varieties such as
Adam, RMSProp
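The formulas for the loss and the parameter update did not survive in this transcript. As a minimal sketch of the standard SGD rule (parameters move against the gradient of the loss on one example, scaled by a learning rate), the code below fits a line to toy data. The model, the data, the function name `sgd_linear` and the hyperparameters are assumptions chosen for illustration, and the gradients are written by hand.

```python
import random

def sgd_linear(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by SGD on the squared error of one example at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)              # visit the examples in a random order
        for x, y in data:
            err = (w * x + b) - y      # prediction error for this example
            grad_w = 2.0 * err * x     # d/dw of err**2
            grad_b = 2.0 * err         # d/db of err**2
            w -= lr * grad_w           # update: theta <- theta - lr * gradient
            b -= lr * grad_b
    return w, b

# Hypothetical noiseless data generated from y = 3x + 1.
data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_linear(data)
print(w, b)
```

Adaptive methods such as Adam and RMSProp keep the same loop and change only how the step is scaled per parameter.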
Deep learning
19
Neural networks
+ Data
+ Gradient-based optimization
Deep learning
20
Neural networks
+ Data
+ Gradient-based optimization
We need derivatives
How do we compute derivatives?
21
Manual
● Calculus 101, rules of differentiation
...
22
Manual
● Analytical derivatives are needed for theoretical insight
○ analytic solutions, proofs
○ mathematical analysis, e.g., stability of fixed points
● They are unnecessary when we just need numerical derivatives for
optimization
● Until very recently, machine learning looked like this:
23
Symbolic derivatives
● Symbolic computation with Mathematica, Maple, Maxima, also deep
learning frameworks such as Theano
● Main issue: expression swell
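Expression swell can be seen with a toy symbolic differentiator that applies the sum and product rules without simplifying. The tuple encoding of expressions and the choice of test function (repeated composition of l -> 4 l (1 - l)) are assumptions made for this sketch; they are not taken from the slide.

```python
def diff(e):
    """Differentiate expression e with respect to 'x', with no simplification."""
    if e == 'x':
        return 1
    if isinstance(e, (int, float)):
        return 0
    op, a, b = e
    if op == '+':
        return ('+', diff(a), diff(b))
    if op == '*':
        # Product rule: both factors are copied into the result.
        return ('+', ('*', diff(a), b), ('*', a, diff(b)))
    raise ValueError(f"unknown operator: {op}")

def size(e):
    """Number of nodes in the expression tree."""
    if not isinstance(e, tuple):
        return 1
    return 1 + size(e[1]) + size(e[2])

def logistic(n):
    """n-fold composition of l -> 4 * l * (1 - l), starting from 'x'."""
    e = 'x'
    for _ in range(n):
        e = ('*', ('*', 4, e), ('+', 1, ('*', -1, e)))
    return e

for n in range(1, 5):
    e = logistic(n)
    print(n, size(e), size(diff(e)))   # the derivative grows faster than the expression
```

Each application of the product rule duplicates its operands, so the derivative tree outgrows the original expression as the nesting deepens.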
24
Symbolic derivatives
● Symbolic computation with, e.g., Mathematica, Maple, Maxima
● Main limitation: only applicable to closed-form expressions
You can find the derivative of:
But not of:
In deep learning, symbolic graph builders such as Theano and TensorFlow
face issues with control flow, loops, recursion
Numerical differentiation
26
Numerical differentiation
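The content of the numerical differentiation slides is missing from this transcript. As a hedged sketch of the usual technique, the code below approximates a derivative with one-sided and central finite differences and prints the error for several step sizes. The test function `sin` and the step sizes are illustrative choices.

```python
import math

def forward_diff(f, x, h):
    """One-sided finite difference: (f(x + h) - f(x)) / h, truncation error O(h)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Central finite difference: (f(x + h) - f(x - h)) / (2h), truncation error O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x0 = 1.0
exact = math.cos(x0)                       # d/dx sin(x) = cos(x)
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    err_fwd = abs(forward_diff(math.sin, x0, h) - exact)
    err_ctr = abs(central_diff(math.sin, x0, h) - exact)
    print(f"h={h:.0e}  forward error={err_fwd:.2e}  central error={err_ctr:.2e}")
```

Large steps suffer from truncation error and very small steps from floating-point round-off, so the result is always an approximation. Each partial derivative also needs its own function evaluations, which scales poorly with the number of parameters.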
27
Automatic differentiation
● Small but established subfield of scientific computing
http://www.autodiff.org/
● Traditional application domains:
○ Computational fluid dynamics
○ Atmospheric sciences
○ Engineering design optimization
○ Computational finance
AD has shared roots with the backpropagation algorithm for neural networks, but it is more general
28
Automatic differentiation
29
Automatic differentiation
Exact derivatives, not an approximation
30
Automatic differentiation
AD has two main modes:
● Forward mode: straightforward
● Reverse mode: slightly more difficult, when you see it for
the first time
31
Forward mode
32
Forward mode
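The worked example from the forward mode slides is not in this transcript. One common way to implement forward mode is with dual numbers and operator overloading, sketched below. The class name `Dual`, the helper `derivative` and the test function are assumptions for illustration only.

```python
import math

class Dual:
    """A value paired with its derivative (tangent) with respect to one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    @staticmethod
    def lift(x):
        """Treat plain numbers as constants (zero derivative)."""
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule, applied to numbers rather than symbols.
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    x = Dual.lift(x)
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)   # chain rule

def derivative(f, x):
    """Derivative of a scalar function f at x from one forward pass."""
    return f(Dual(x, 1.0)).dot

f = lambda x: x * x * x + sin(x) * 2.0      # f(x) = x^3 + 2 sin(x)
df = derivative(f, 0.5)                     # analytically: 3 x^2 + 2 cos(x)
print(df)
```

The derivative is carried alongside the value through every elementary operation, so the result is exact up to floating-point rounding and no step size is involved.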
33
Reverse mode
34
Reverse mode
35
Reverse mode
36
Reverse mode
37
Reverse mode
38
Reverse mode
39
More on this in the afternoon exercise session
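The step-by-step reverse mode example from the slides is not in this transcript. A minimal sketch of the idea follows: record the computation graph during the forward pass, then sweep it once in reverse to accumulate adjoints. The class `Var`, the function `backward` and the test function are hypothetical names chosen here, not the talk's own code.

```python
import math

class Var:
    """A node in the computation graph recorded during the forward pass."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents    # tuples of (parent Var, local partial derivative)
        self.grad = 0.0           # adjoint, filled in by backward()

    @staticmethod
    def lift(x):
        return x if isinstance(x, Var) else Var(float(x))

    def __add__(self, other):
        other = Var.lift(other)
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = Var.lift(other)
        return Var(self.val * other.val, ((self, other.val), (other, self.val)))
    __rmul__ = __mul__

def sin(x):
    return Var(math.sin(x.val), ((x, math.cos(x.val)),))

def backward(out):
    """Propagate adjoints from the output to every input in one reverse sweep."""
    order, seen = [], set()
    def visit(node):                      # depth-first topological sort
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

# f(x1, x2) = x1 * x2 + sin(x1); both partial derivatives come from one backward pass.
x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2 + sin(x1)
backward(y)
print(x1.grad, x2.grad)
```

Backpropagation in neural networks is this procedure applied to a scalar loss.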
Forward vs reverse
40
Forward vs reverse
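The comparison on these slides is not in the transcript. The standard cost argument can be summarized as follows, using notation assumed here: f maps R^n to R^m and J_f(x) is its m-by-n Jacobian.

```latex
% Assumed notation: f : R^n -> R^m with Jacobian J_f(x) in R^{m x n}.
\begin{aligned}
\text{forward mode (one pass):}\quad & \dot{y} = J_f(x)\,\dot{x}, & \dot{x} \in \mathbb{R}^n \\
\text{reverse mode (one pass):}\quad & \bar{x} = J_f(x)^{\top}\,\bar{y}, & \bar{y} \in \mathbb{R}^m \\
\text{full Jacobian:}\quad & n \text{ forward passes} \quad \text{or} \quad m \text{ reverse passes} \\
\text{scalar loss } (m = 1):\quad & \nabla f(x) = J_f(x)^{\top} \cdot 1 \quad \text{from a single reverse pass}
\end{aligned}
```

Deep learning typically has a scalar loss and a very large number of parameters (m = 1, n large), which is why reverse mode is the default there. Forward mode is the better fit when there are few inputs and many outputs.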
41
Derivatives and
deep learning frameworks
42
Deep learning frameworks
Two main families:
● Symbolic graph builders
● Dynamic graph builders (general-purpose AD)
43
Symbolic graph builders
● Implement models using symbolic placeholders, using a
mini-language
● Severely limited (and unintuitive) control flow and expressivity
● The graph gets “compiled” to take care of expression swell
44
Symbolic graph builders
Graph compilation in Theano
45
Theano, TensorFlow, CNTK
Dynamic graph builders (general-purpose AD)
46
● Use general-purpose automatic differentiation via operator overloading
● Implement models as regular programs, with full support for control
flow, branching, loops, recursion, procedure calls
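To make the control-flow point concrete, here is a small sketch in which the differentiated function is ordinary Python with a data-dependent loop and a branch. It uses a minimal forward-mode `Dual` class for brevity; the class and the function `model` are illustrative assumptions, and frameworks such as PyTorch apply the same operator-overloading idea in reverse mode.

```python
class Dual:
    """Minimal forward-mode value: carries val and its derivative dot."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __mul__(self, other):
        if not isinstance(other, Dual):
            other = Dual(float(other))
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def model(x):
    """Ordinary Python with a data-dependent loop and a branch."""
    y = x
    while y.val < 100.0:        # the number of iterations depends on the input value
        y = y * x
    if y.val > 200.0:           # the branch taken depends on the input value
        y = 0.5 * y
    return y

out = model(Dual(3.0, 1.0))     # seed dx/dx = 1
print(out.val, out.dot)
```

For the input 3.0 the loop multiplies four times and the branch is taken, so the program computes 0.5 * x^5 on this run and the derivative follows that executed path. No graph mini-language or symbolic loop construct is needed.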
Dynamic graph builders (general-purpose AD)
a general-purpose AD system that allows you to implement models as
regular Python programs (more on this in the exercise session)
47
autograd (Python) by Harvard Intelligent Probabilistic Systems Group
https://github.com/HIPS/autograd
torch-autograd by Twitter Cortex
https://github.com/twitter/torch-autograd
PyTorch
http://pytorch.org/
Summary
48
Summary
● Neural networks + data + gradient descent = deep learning
● General-purpose AD is the future of deep learning
● More distance to cover:
○ Implementations better than operator overloading exist in AD
literature (source transformation) but not in machine learning
○ Forward AD is not currently present in any mainstream machine
learning framework
○ Nesting of forward and reverse AD enables efficient higher-order
derivatives such as Hessian-vector products
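The last point can be sketched by nesting the two modes: run reverse mode on values that carry forward-mode tangents, so the tangent part of each adjoint is an entry of the Hessian-vector product. The classes and the function `hvp` below are hypothetical and written for this note; they support only addition and multiplication between graph nodes.

```python
class Dual:
    """Forward-mode value: val plus its directional derivative dot."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

class Var:
    """Reverse-mode graph node; val may be a float or a Dual."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0
    def __add__(self, other):
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))
    def __mul__(self, other):
        return Var(self.val * other.val, ((self, other.val), (other, self.val)))

def backward(out, seed):
    """Reverse sweep over the recorded graph, starting from the given seed adjoint."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = seed
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad = parent.grad + local * node.grad

def hvp(f, x, v):
    """Hessian-vector product H(x) v: reverse mode run on forward-mode Dual values."""
    xs = [Var(Dual(xi, vi)) for xi, vi in zip(x, v)]
    backward(f(*xs), Dual(1.0))
    # Each adjoint is a Dual: val is the gradient entry, dot is the entry of H v.
    return [getattr(xi.grad, "dot", 0.0) for xi in xs]

f = lambda x, y: x * x * y + y * y * y     # Hessian: [[2y, 2x], [2x, 6y]]
hv = hvp(f, [1.0, 2.0], [1.0, 0.0])
print(hv)
```

One combined pass yields H v without ever forming the full Hessian, at a cost that is a small constant multiple of evaluating f.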
49
Thank you!
50
References
Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2015. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.
Baydin, A.G., Pearlmutter, B.A. and Siskind, J.M., 2016. Tricks from deep learning. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.
Baydin, A.G., Pearlmutter, B.A. and Siskind, J.M., 2016. DiffSharp: An AD library for .NET languages. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.
Maclaurin, D., Duvenaud, D. and Adams, R., 2015. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning (pp. 2113-2122).
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M. and Dean, J., 2016. Google’s neural machine translation system: bridging the gap between human and machine translation. Technical report.
Gehring, J., Auli, M., Grangier, D., Yarats, D. and Dauphin, Y.N., 2017. Convolutional sequence to sequence learning. arXiv.
Bahdanau, D., Cho, K. and Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Huang, X., Baker, J. and Reddy, R., 2014. A historical perspective of speech recognition. Communications of the ACM, 57(1), pp. 94-103. doi:10.1145/2500887.
Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), pp. 1-127.
51

More Related Content

What's hot

Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
MLconf
 

What's hot (20)

Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
 
Summarizing videos with Attention
Summarizing videos with AttentionSummarizing videos with Attention
Summarizing videos with Attention
 
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
 
Nips 2017 in a nutshell
Nips 2017 in a nutshellNips 2017 in a nutshell
Nips 2017 in a nutshell
 
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
 
Introduction to neural networks and Keras
Introduction to neural networks and KerasIntroduction to neural networks and Keras
Introduction to neural networks and Keras
 
Centernet
CenternetCenternet
Centernet
 
Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)
Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)
Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
3D human body modeling from RGB images
3D human body modeling from RGB images3D human body modeling from RGB images
3D human body modeling from RGB images
 
Weakly supervised semantic segmentation of 3D point cloud
Weakly supervised semantic segmentation of 3D point cloudWeakly supervised semantic segmentation of 3D point cloud
Weakly supervised semantic segmentation of 3D point cloud
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
 
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
 
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Visualization of Deep Learning Models (D1L6 2017 UPC Deep Learning for Comput...
Visualization of Deep Learning Models (D1L6 2017 UPC Deep Learning for Comput...Visualization of Deep Learning Models (D1L6 2017 UPC Deep Learning for Comput...
Visualization of Deep Learning Models (D1L6 2017 UPC Deep Learning for Comput...
 
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
 
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
 
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
 

Similar to Deep Learning and Automatic Differentiation from Theano to PyTorch

Presentation
PresentationPresentation
Presentation
butest
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 

Similar to Deep Learning and Automatic Differentiation from Theano to PyTorch (20)

Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
Unsupervised Deep Learning (D2L1 Insight@DCU Machine Learning Workshop 2017)
Unsupervised Deep Learning (D2L1 Insight@DCU Machine Learning Workshop 2017)Unsupervised Deep Learning (D2L1 Insight@DCU Machine Learning Workshop 2017)
Unsupervised Deep Learning (D2L1 Insight@DCU Machine Learning Workshop 2017)
 
Finding the best solution for Image Processing
Finding the best solution for Image ProcessingFinding the best solution for Image Processing
Finding the best solution for Image Processing
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
IoT Tech Expo 2023_Pedro Trancoso presentation
IoT Tech Expo 2023_Pedro Trancoso presentationIoT Tech Expo 2023_Pedro Trancoso presentation
IoT Tech Expo 2023_Pedro Trancoso presentation
 
Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会
Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会
Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会
 
Presentation
PresentationPresentation
Presentation
 
Machine Learning and Deep Learning with R
Machine Learning and Deep Learning with RMachine Learning and Deep Learning with R
Machine Learning and Deep Learning with R
 
Deep learning 1.0 and Beyond, Part 1
Deep learning 1.0 and Beyond, Part 1Deep learning 1.0 and Beyond, Part 1
Deep learning 1.0 and Beyond, Part 1
 
The Future is Big Graphs: A Community View on Graph Processing Systems
The Future is Big Graphs: A Community View on Graph Processing SystemsThe Future is Big Graphs: A Community View on Graph Processing Systems
The Future is Big Graphs: A Community View on Graph Processing Systems
 
Reservoir computing fast deep learning for sequences
Reservoir computing   fast deep learning for sequencesReservoir computing   fast deep learning for sequences
Reservoir computing fast deep learning for sequences
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
 
Learning Pulse - paper presentation at LAK17
Learning Pulse - paper presentation at LAK17Learning Pulse - paper presentation at LAK17
Learning Pulse - paper presentation at LAK17
 
DRESD In a Nutshell July07
DRESD In a Nutshell July07DRESD In a Nutshell July07
DRESD In a Nutshell July07
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
polystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdfpolystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdf
 
deepnet-lourentzou.ppt
deepnet-lourentzou.pptdeepnet-lourentzou.ppt
deepnet-lourentzou.ppt
 

More from inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Deep Learning and Automatic Differentiation from Theano to PyTorch

  • 1. Deep Learning and Automatic Differentiation from Theano to PyTorch ş
  • 2. About me Postdoctoral researcher, University of Oxford ● Working with Frank Wood, Department of Engineering Science ● Working on probabilistic programming and its applications in science (high-energy physics and HPC) Long-term interests: ● Automatic (algorithmic) differentiation (e.g. http://diffsharp.github.io ) ● Evolutionary algorithms ● Computational physics 2
  • 4. Deep learning A reincarnation/rebranding of artificial neural networks, with roots in ● Threshold logic (McCulloch & Pitts, 1943) ● Hebbian learning (Hebb, 1949) ● Perceptron (Rosenblatt, 1957) ● Backpropagation in NNs (Werbos, 1975; Rumelhart, Hinton, Williams, 1986) Frank Rosenblatt with the Mark I Perceptron, holding a set of neural network weights 4
  • 5. Deep learning State of the art in computer vision ● ImageNet classification with deep convolutional neural networks (Krizhevsky et al., 2012) ○ Halved the error rate achieved with pre-deep-learning methods ● Replacing hand-engineered features ● Modern systems surpass human performance Faster R-CNN (Ren et al., 2015) Top-5 error rate for ImageNet https://devblogs.nvidia.com/parallelforall/mocha-jl-deep-learning-julia/ 5
  • 6. Deep learning State of the art in speech recognition ● Seminal work by Hinton et al. (2012) ○ First major industrial application of deep learning ● Replacing HMM-GMM-based models Recognition word error rates (Huang et al., 2014) 6
  • 7. Deep learning State of the art in machine translation ● Based on RNNs and CNNs (Bahdanau et al., 2015; Wu et al., 2016) ● Replacing statistical translation with engineered features ● Google (Sep 2016), Microsoft (Nov 2016), Facebook (Aug 2017) moved to neural machine translation https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/ Google Neural Machine Translation System (GNMT) 7
  • 8. What makes deep learning tick? 8
  • 9. Deep neural networks An artificial “neuron” is loosely based on the biological one ● Receives a set of weighted inputs (dendrites) ● Integrates and transforms them (cell body) ● Passes the output onward (axon) 9
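The three steps on this slide can be sketched in a few lines of Python. This is only an illustration: the sigmoid nonlinearity and the example weights are assumptions, not taken from the slides.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid nonlinearity."""
    # "Dendrites" and "cell body": integrate the weighted inputs.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # "Axon": the transformed value is passed on as the output.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs and weights, chosen only for illustration.
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.0)
```

Any other differentiable nonlinearity (tanh, ReLU) could replace the sigmoid without changing the structure.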
  • 10. Deep neural networks Three main building blocks ● Feedforward (Rosenblatt, 1957) ● Convolutional (LeCun et al., 1989) ● Recurrent (Hopfield, 1982; Hochreiter & Schmidhuber, 1997) 10
  • 11. Deep neural networks Newer additions introduce (differentiable) algorithmic elements ● Neural Turing Machine (Graves et al., 2014) ○ Can infer algorithms: copy, sort, recall ● Stack-augmented RNN (Joulin & Mikolov, 2015) ● End-to-end memory network (Sukhbaatar et al., 2015) ● Stack, queue, deque (Grefenstette et al., 2015) ● Discrete interfaces (Zaremba & Sutskever, 2015) Neural Turing Machine on copy task (Graves et al., 2014) 11
  • 12. Deep neural networks ● Deep, distributed representations learned end-to-end (also called “representation learning”) ○ From raw pixels to object labels ○ From raw waveforms of speech to text (Masuch, 2015) Labels 12
  • 13. Deep neural networks Deeper is better (Bengio, 2009) ● Deep architectures consistently beat shallow ones ● Shallow architectures can be exponentially less efficient for the same performance ● Hierarchical, distributed representations achieve non-local generalization 13
  • 14. Data ● Deep learning needs massive amounts of data ○ The data must be labeled for supervised learning ● A rough rule of thumb: “deep learning will generally match or exceed human performance with a dataset containing at least 10 million labeled examples.” (The Deep Learning book, Goodfellow et al., 2016) Dataset sizes (Goodfellow et al., 2016) 14
  • 15. GPUs Layers of NNs are conveniently expressed as series of matrix multiplications One input vector, n neurons of k inputs A batch of m input vectors 15
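The equations on this slide did not survive in the transcript. As a rough sketch of the idea, a layer of n neurons with k inputs each can be stored as a k x n weight matrix, so that one input vector, or a whole batch of m vectors, is handled by a single matrix product. The sizes and values below are hypothetical, and biases and activations are left out for brevity.

```python
def matmul(A, B):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical sizes: k = 3 inputs, n = 2 neurons, m = 2 examples in the batch.
W = [[0.1, 0.4],
     [0.2, 0.5],
     [0.3, 0.6]]          # k x n: column j holds the weights of neuron j
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]     # m x k: one input vector per row

# One input vector: a (1 x k) by (k x n) product gives n pre-activations.
single = matmul([X[0]], W)

# The whole batch at once: one (m x k) by (k x n) product (a GEMM call in BLAS).
batch = matmul(X, W)
```

In practice the nested-list `matmul` would be replaced by a BLAS-backed routine; the pure-Python version is used here only to keep the sketch dependency-free.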
  • 16. GPUs ● BLAS (mainly GEMM) is at the heart of mainstream deep learning, commonly running on off-the-shelf graphics processing units ● Rapid adoption after ○ Nvidia released CUDA (2007) ○ Raina et al. (2009) and Ciresan et al. (2010) ● ASICs such as tensor processing units (TPUs) are being introduced ○ Reduced precision (as low as 8 bits), better power efficiency Nvidia Titan Xp (2017) Google Cloud TPU server 16
  • 17. Deep learning frameworks ● Modern tools make it extremely easy to implement / reuse models ● Off-the-shelf components ○ Simple: linear, convolution, recurrent layers ○ Complex: compositions of complex models (e.g., CNN + RNN) ● Base frameworks: Torch (2002), Theano (2011), Caffe (2014), TensorFlow (2015), PyTorch (2016) ● Higher-level model-building libraries: Keras (Theano & TensorFlow), Lasagne (Theano) ● High-performance low-level bindings: BLAS, Intel MKL, CUDA, Magma 17
  • 18. Learning: gradient-based optimization ● Loss function ● Parameter update ● Stochastic gradient descent (SGD) ● In practice we use SGD or adaptive-learning-rate varieties such as Adam, RMSProp (Ruder, 2017) http://ruder.io/optimizing-gradient-descent/ 18
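The slide's formulas are not preserved in the transcript. A minimal sketch of the usual SGD recipe is shown below: pick a loss, compute its gradient on one example at a time, and step the parameters against the gradient. The squared-error loss, learning rate, and toy data are all assumptions made for this example.

```python
import random

def sgd_linear_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by SGD on the squared error, one example at a time."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled in place
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y      # residual of the current prediction
            grad_w = 2.0 * err * x     # d/dw of err**2
            grad_b = 2.0 * err         # d/db of err**2
            w -= lr * grad_w           # parameter update: theta <- theta - lr * grad
            b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
points = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_linear_fit(points)
```

Here the gradients are written by hand, which is exactly the step that the rest of the deck sets out to automate.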
  • 19. Deep learning Neural networks + Data + Gradient-based optimization 19
  • 20. Deep learning Neural networks + Data + Gradient-based optimization We need derivatives 20
  • 21. How do we compute derivatives? 21
  • 22. Manual ● Calculus 101, rules of differentiation ... 22
  • 23. Manual ● Analytical derivatives are needed for theoretical insight ○ analytic solutions, proofs ○ mathematical analysis, e.g., stability of fixed points ● They are unnecessary when we just need numerical derivatives for optimization ● Until very recently, machine learning looked like this: 23
  • 24. Symbolic derivatives ● Symbolic computation with Mathematica, Maple, Maxima, also deep learning frameworks such as Theano ● Main issue: expression swell 24
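Expression swell can be seen in a small worked case. The example below uses iterations of the logistic map, a common illustration in the AD literature. It is not necessarily the example shown on the original slide.

```latex
\begin{aligned}
l_1 &= x, \qquad l_{n+1} = 4\, l_n (1 - l_n) \\
l_2 &= 4x(1-x) \\
\frac{dl_2}{dx} &= 4(1-x) - 4x \\
l_3 &= 16x(1-x)(1-2x)^2 \\
\frac{dl_3}{dx} &= 16(1-x)(1-2x)^2 - 16x(1-2x)^2 - 64x(1-x)(1-2x)
\end{aligned}
```

Each further iteration makes the unsimplified derivative expression grow much faster than the expression for the function itself, even though the numerical value at any given x is cheap to compute.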
  • 25. Symbolic derivatives ● Symbolic computation with, e.g., Mathematica, Maple, Maxima ● Main limitation: only applicable to closed-form expressions You can find the derivative of: But not of: In deep learning, symbolic graph builders such as Theano and TensorFlow face issues with control flow, loops, recursion
  • 28. Automatic differentiation ● Small but established subfield of scientific computing http://www.autodiff.org/ ● Traditional application domains: ○ Computational fluid dynamics ○ Atmospheric sciences ○ Engineering design optimization ○ Computational finance AD has shared roots with the backpropagation algorithm for neural networks, but it is more general 28
  • 31. Automatic differentiation AD has two main modes: ● Forward mode: straightforward ● Reverse mode: slightly more difficult, when you see it for the first time 31
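Forward mode can be sketched with dual numbers: every value carries its derivative along with it, and each arithmetic operation updates both. The `Dual` class, the `derivative` helper, and the test function below are hypothetical names for illustration, not part of any framework mentioned in the deck.

```python
import math

class Dual:
    """Number a + b*eps with eps**2 == 0; `dot` carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (derivative 0).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule applied alongside the ordinary product.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    x = Dual.lift(x)
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def derivative(f, x):
    """Evaluate f at x with seed derivative 1 and read off df/dx."""
    return f(Dual(x, 1.0)).dot

def f(x):
    return x * x * x + 2 * x + sin(x)

df = derivative(f, 1.5)   # analytically 3*x**2 + 2 + cos(x)
```

One forward pass yields the derivative with respect to one input, which is why forward mode suits functions with few inputs and many outputs, while reverse mode suits the scalar losses of deep learning.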
  • 39. Reverse mode More on this in the afternoon exercise session 39
  • 43. Deep learning frameworks Two main families: ● Symbolic graph builders ● Dynamic graph builders (general-purpose AD) 43
  • 44. Symbolic graph builders ● Implement models using symbolic placeholders, using a mini-language ● Severely limited (and unintuitive) control flow and expressivity ● The graph gets “compiled” to take care of expression swell Graph compilation in Theano 44
  • 46. Dynamic graph builders (general-purpose AD) ● Use general-purpose automatic differentiation via operator overloading ● Implement models as regular programs, with full support for control flow, branching, loops, recursion, procedure calls 46
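A minimal sketch of the operator-overloading approach follows, assuming a toy scalar `Var` class (hypothetical, and far simpler than autograd or PyTorch). The graph is recorded while an ordinary Python function runs, so the loop and the branch need no special treatment, and a reverse sweep then accumulates the gradients.

```python
class Var:
    """Scalar that records how it was computed so gradients can flow backward."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local partial derivative)
        self.grad = 0.0

    @staticmethod
    def lift(x):
        return x if isinstance(x, Var) else Var(x)

    def __add__(self, other):
        other = Var.lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = Var.lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

    def backward(self):
        """Reverse sweep: visit nodes in reverse topological order, applying the chain rule."""
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

def model(x):
    """Ordinary Python with a data-dependent loop and a branch."""
    y = x
    while y.value < 100.0:
        y = y * x
    if y.value > 200.0:
        y = y + 3 * x
    return y

x = Var(3.0)
y = model(x)      # for x = 3 the program computes x**5 + 3*x
y.backward()      # x.grad now holds 5*x**4 + 3 evaluated at x = 3
```

The recorded graph depends on the input: a different `x` may take a different number of loop iterations, and the derivative is always that of the path actually executed.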
  • 47. Dynamic graph builders (general-purpose AD) ● autograd (Python) by Harvard Intelligent Probabilistic Systems Group https://github.com/HIPS/autograd ● torch-autograd by Twitter Cortex https://github.com/twitter/torch-autograd ● PyTorch http://pytorch.org/ a general-purpose AD system that allows you to implement models as regular Python programs (more on this in the exercise session) 47
  • 49. Summary ● Neural networks + data + gradient descent = deep learning ● General-purpose AD is the future of deep learning ● More distance to cover: ○ Implementations better than operator overloading exist in AD literature (source transformation) but not in machine learning ○ Forward AD is not currently present in any mainstream machine learning framework ○ Nesting of forward and reverse AD enables efficient higher-order derivatives such as Hessian-vector products 49
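The last point can be made concrete with the standard forward-over-reverse identity for a Hessian-vector product, where f is a scalar function, x the point of evaluation, and v an arbitrary direction vector (the symbols are chosen here for illustration):

```latex
H v \;=\; \nabla^2 f(x)\, v \;=\; \left.\frac{\partial}{\partial \varepsilon}\, \nabla f(x + \varepsilon v)\right|_{\varepsilon = 0}
```

Reverse mode supplies the gradient, and a single forward-mode pass through that gradient computation in the direction v yields Hv, so the full Hessian is never formed.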
  • 51. References
      Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2015. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.
      Baydin, A.G., Pearlmutter, B.A. and Siskind, J.M., 2016. Tricks from deep learning. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.
      Baydin, A.G., Pearlmutter, B.A. and Siskind, J.M., 2016. DiffSharp: An AD library for .NET languages. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.
      Maclaurin, D., Duvenaud, D. and Adams, R., 2015. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning (pp. 2113-2122).
      Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M. and Dean, J., 2016. Google’s neural machine translation system: bridging the gap between human and machine translation. Technical Report.
      Gehring, J., Auli, M., Grangier, D., Yarats, D. and Dauphin, Y.N., 2017. Convolutional sequence to sequence learning. arXiv.
      Bahdanau, D., Cho, K. and Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
      Huang, X., Baker, J. and Reddy, R., 2014. A historical perspective of speech recognition. Communications of the ACM, 57(1), pp. 94-103. doi:10.1145/2500887
      Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), pp. 1-127.