Deep Learning: AI Breakthrough
Mohsen Fayyaz
Sensifai
Tehran University – 15 Dey 1395 (4 Jan 2017)
Video Processing and Deep Learning
What is Video?
• Batches of Frames
• Can we process video as batches of frames?
Motion cannot be inferred from single frame
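A minimal sketch of why a single frame is not enough, assuming two toy one-row grayscale frames and hypothetical helper names (`brightest_index`, `displacement`) that are not part of the talk. One frame gives only a position; a pair of frames gives a direction of motion.

```python
# Two tiny 1x5 grayscale "frames": a bright pixel moves one step to the right.
frame_a = [0, 0, 255, 0, 0]
frame_b = [0, 0, 0, 255, 0]


def brightest_index(frame):
    """Position of the brightest pixel in a single frame."""
    return max(range(len(frame)), key=lambda i: frame[i])


def displacement(prev_frame, next_frame):
    """Signed shift of the brightest pixel between two consecutive frames."""
    return brightest_index(next_frame) - brightest_index(prev_frame)


# A single frame only tells us where the pixel is.
position = brightest_index(frame_a)
# Two frames tell us which way it moved (positive = to the right).
shift = displacement(frame_a, frame_b)
```

Real video models learn this kind of temporal cue from data instead of tracking one pixel; the toy only illustrates that the information lives between frames.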
Why do we need video processing?
• Self-Driving Cars: Video Semantic Segmentation
Feature Space Optimization for Semantic Video Segmentation, Kundu et al., 2016
Why do we need video processing?
• Robots: Action Recognition
Simonyan et al., 2014
Why do we need video processing?
• Google, YouTube, Aparat: Video Tagging
DenseCap, Johnson et al., 2016 (image captioning)
Why do we need video processing?
• Network Video Broadcasting: Frame Prediction
Patraucean et al., 2016
From Images to Video
Image: Image -> CNN -> Extracted Features
Video: Frames -> ? -> Extracted Features
From Images to Video
Frames -> CNN -> LSTM -> Extracted Spatio-Temporal Features
Donahue et al., 2015
From Images to Video
Frames -> CNN -> LSTM -> Extracted Spatio-Temporal Features
Donahue et al., 2015
What if we want regional features?
From Images to Video - STFCN
Frames -> CNN -> Convolutional LSTM -> Extracted Regional Spatio-Temporal Features
Fayyaz et al., 2016
From Images to Video – C3D
Frames -> 3D CNN -> Extracted Regional Spatio-Temporal Features
Tran et al., 2015
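A rough sketch of the core operation in a 3D CNN: a kernel that slides over time as well as over height and width. This is an illustration in plain Python, not the C3D implementation; the function name `conv3d_valid`, the toy clip, and the 2x1x1 temporal-difference kernel are all assumptions made for the example.

```python
def conv3d_valid(clip, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation.

    clip:   nested list indexed [t][y][x]
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the temporal and both spatial kernel axes.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 3 frames of 2x2 pixels whose brightness rises by 1 per frame.
clip = [[[float(t)] * 2 for _ in range(2)] for t in range(3)]
# Hypothetical kernel spanning 2 frames: next frame minus current frame.
kernel = [[[-1.0]], [[1.0]]]
response = conv3d_valid(clip, kernel)
```

Because the kernel spans two frames, every output value here measures the brightness change over time, which a 2D kernel applied to one frame cannot see.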
Now that we have the appropriate tools,
let’s see some real-world applications
Video Semantic Segmentation - STFCN
Fayyaz et al., 2016
Video Semantic Segmentation – C3D
Tran et al., 2015
Action Recognition & Video Classification
Simonyan et al., 2014
Does video have visual data only?
Action Recognition & Video Classification
Audio + Vision
Wu et al., 2015
Let’s briefly take a look at some state-of-the-art
image-based networks
Extremely Deep Networks
Residual Networks
• Problem: Gradients Vanish in Back-propagation
• Solution: Let’s make a shortcut for them!
• $Y = H(X, W_H) \rightarrow Y = H(X, W_H) + X$
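The shortcut can be sketched in a few lines. This is a minimal illustration, assuming H is a single ReLU layer with a square weight matrix so that its output has the same size as X; the helper names (`linear`, `relu`, `residual_block`) are hypothetical.

```python
def linear(x, weights):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, weights):
    """Y = H(X, W_H) + X, with H chosen here as ReLU(W_H X)."""
    h = relu(linear(x, weights))
    # The shortcut: add the untouched input back onto the transformed one.
    return [hv + xv for hv, xv in zip(h, x)]


x = [1.0, -2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights H contributes nothing, so the block is the identity.
y = residual_block(x, zero_weights)
```

Since the input is added back unchanged, the block falls back to an identity mapping when H outputs zero, and the gradient has a direct path through the addition.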
Extremely Deep Networks
Highway Networks
• Similar to ResNets
• The shortcuts are controlled by a learnable gate T, giving a better
trade-off between transforming the input and carrying it through unchanged
• $Y = H(X, W_H) \cdot T(X, W_T) + X \cdot (1 - T(X, W_T))$
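The gating formula can be sketched as below, assuming an elementwise sigmoid gate and taking H(X, W_H) and the gate's pre-sigmoid values as already computed inputs. The function and argument names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def highway_layer(x, h, gate_logits):
    """Y = H * T + X * (1 - T), applied elementwise.

    x:           layer input X
    h:           transformed input H(X, W_H), already computed
    gate_logits: pre-sigmoid output of the gate T(X, W_T)
    """
    out = []
    for xv, hv, g in zip(x, h, gate_logits):
        t = sigmoid(g)  # t near 1 favors H, t near 0 carries X through
        out.append(hv * t + xv * (1.0 - t))
    return out


x = [1.0, 2.0]
h = [10.0, 20.0]
half_open = highway_layer(x, h, [0.0, 0.0])         # T = 0.5: equal mix
mostly_carry = highway_layer(x, h, [-20.0, -20.0])  # T close to 0: output ~ X
```

When the gate saturates near 0 the layer behaves like the plain shortcut path, and near 1 it behaves like an ordinary layer.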
Extremely Deep Networks
DenseNets
• If ResNet works by connecting only the previous layer, why
not connect all of them?!
• $Y = F(X_n, X_{n-1}, \ldots, X_0)$
• Improvements in both the forward and backward passes
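A small sketch of the dense connectivity pattern, assuming each layer receives the concatenation of the block input and all earlier layer outputs. The toy layers below are placeholders chosen only to make the data flow visible; they are not real convolutional layers.

```python
def dense_block(x0, layers):
    """Run layers where each one receives all earlier feature lists, concatenated."""
    features = [x0]
    for layer in layers:
        # Concatenate X_0 and every earlier layer output into one input.
        concatenated = [v for feat in features for v in feat]
        features.append(layer(concatenated))
    return features


# Hypothetical toy layers: each maps its whole input to a single new feature.
layers = [
    lambda inp: [sum(inp)],
    lambda inp: [max(inp)],
    lambda inp: [float(len(inp))],
]
outputs = dense_block([1.0, 2.0], layers)
```

The last toy layer returns the length of its input, so its output of 4.0 shows that it saw the two input values plus one feature from each of the two earlier layers.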
Now what if we use the idea of propagating
data and gradients between shallow and
deep layers in video-based networks?
Up to here everything was Supervised
But there is a lot of data across the
Internet with weak labels …
Let’s go through Weakly-Supervised
methods
Weakly Supervised Learning
Weakly Supervised Learning with CNNs
• Multiple Labeling
• Weak Localization
• Data can be crawled from the Internet
• Can be adapted to Video
Oquab et al., 2015
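One way image-level labels can give rough locations is to max-pool a per-class score map: the maximum is the image-level score, and its position hints at where the object is. The sketch below assumes such score maps already exist; the maps, class names, and the zero threshold are made up for illustration and are not results from the cited paper.

```python
def global_max_pool(score_map):
    """Return (max score, (row, col) of the max) for one class score map."""
    best_score, best_pos = float("-inf"), None
    for r, row in enumerate(score_map):
        for c, score in enumerate(row):
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos


# Hypothetical per-class score maps, as a fully convolutional net might output.
score_maps = {
    "dog": [[0.1, 0.2, 0.1], [0.3, 2.5, 0.4]],
    "car": [[-1.0, -0.5, -0.8], [-0.9, -0.7, -0.6]],
}
predictions = {label: global_max_pool(m) for label, m in score_maps.items()}
# Multiple labels per image: every class scoring above the assumed threshold.
present = [label for label, (score, _) in predictions.items() if score > 0.0]
```

Training needs only the image-level label for each class, yet the position of the maximum comes out as a by-product.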
How about some Unsupervised methods …
Unsupervised Learning
Anticipating Visual Representations From Unlabeled Video
• Training on a huge amount of unlabeled video from across the Internet
• Training classifiers on the final output
Vondrick et al., 2016
Practical considerations
What Hardware do I use?
• NVIDIA GPU + SSD + HDD
• More info on:
http://www.DeepLearning.ir
What framework do I use?
Caffe
Torch
TensorFlow
Theano
Keras
Microsoft CNTK
Deeplearning4j
…
What framework do I use?
TensorFlow, Torch, Theano
From Karpathy’s slides
Distributed Training:
Will be covered in my next presentation
at Sharif University of Technology
on 22 Dey 1395 (11 Jan 2017)
From Karpathy’s slides
Thank You
Fayyaz@Sensifai.com
