
Deep Learning for Computer Vision

Published on 2017/05/26
Published in: Education

  1. Deep Learning for Computer Vision. Yuan-Kai Wang, Fu Jen Catholic University, 2017/05/26
  2. What Is Deep Learning
  3. (image-only slide)
  4. Google Lens
  5. Google I/O 2017
  6. Why Does Deep Learning Succeed? (1/4) Big Data
  7. Why Does Deep Learning Succeed? (2/4) Beast Processor (2017 Google TPU)
  8. Why Does Deep Learning Succeed? (3/4) Technical Breakthrough: Algorithm. Stochastic gradient descent (SGD): fast convergence for learning. ReLU activation function: solves the vanishing gradient problem. Dropout: regularization. (Figure: gradient descent (batch) vs. stochastic gradient descent)
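The batch-vs-stochastic contrast on this slide can be sketched on a toy least-squares problem. Everything below (data sizes, learning rates, seed) is illustrative and mine, not from the lecture: batch gradient descent makes one update per pass over the data, while SGD updates after every sample.

```python
# Toy comparison of batch gradient descent vs. SGD on noiseless
# linear regression; both recover the true weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def batch_gd(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient over the full batch
        w -= lr * grad                       # one update per epoch
    return w

def sgd(X, y, lr=0.05, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(len(y)):              # one update per sample
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

w_batch = batch_gd(X, y)
w_sgd = sgd(X, y)
```

On this consistent system both converge; SGD's many small updates are what make it practical when the dataset is too large for full-batch gradients.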
  9. Why Does Deep Learning Succeed? (4/4) Technical Breakthrough: Architecture, hierarchical representation. ("The Extraordinary Link Between Deep Neural Networks and the Nature of the Universe," MIT Technology Review, 2016/09)
  10. (image-only slide)
  11. Neural Network Evolution
  12. (image-only slide)
  13. Hubel/Wiesel Architecture. D. Hubel and T. Wiesel (1959, 1962; Nobel Prize 1981): the visual cortex consists of a hierarchy of simple, complex, and hyper-complex cells.
  14. Neocognitron (Fukushima, 1980): a "sandwich" architecture (SCSCSC…). Simple cells: modifiable parameters; complex cells: perform pooling.
  15. Computer Vision Features. Traditional recognition: image/video pixels → hand-designed feature extraction → trainable classifier → object class. Features: SIFT, HoG, Haar, Textons, SURF, MSER, LBP, Color-SIFT, color histograms, GLOH, and many others.
  16. Shallow vs. Deep Architectures. Traditional recognition ("shallow" architecture): image/video pixels → hand-designed feature extraction (edges, SIFT, HOG, etc.) → trainable classifier → object class. Deep learning ("deep" architecture): image/video pixels → layer 1 → … → layer N → simple classifier → object class.
  17. Learn Feature Hierarchy: fill in the representation gap in recognition. Pixels → 1st layer ("edges") → 2nd layer ("object parts") → 3rd layer ("objects") → simple classifier. Learning algorithm: SGD, ReLU, dropout. ("Object Detectors Emerge in Deep Scene CNNs," B. Zhou et al., ICLR 2015)
  18. No More Handcrafted Features!
  19. Taxonomy of Feature Learning Methods. Shallow and supervised: Support Vector Machine, Logistic Regression, Perceptron. Deep and supervised: Deep Neural Net, Convolutional Neural Net (CNN), Recurrent Neural Net, Siamese Net. Shallow and unsupervised: Restricted Boltzmann machines*, Sparse coding*. Deep and unsupervised: Autoencoder, Generative Adversarial Net (GAN)*, Deep Belief Nets*, Deep Boltzmann machines*, Hierarchical Sparse Coding*. (* supervised version exists)
  20. CNN Applications (1/3): e.g. Google Photos search; face verification (Taigman et al. 2014, FAIR); self-driving cars; [Goodfellow et al. 2014]; Ciresan et al. 2013; Turaga et al. 2010
  21. CNN Applications (2/3): ATARI game playing (Mnih 2013); AlphaGo (Silver et al. 2016); VizDoom; StarCraft
  22. CNN Applications (3/3): DeepDream (reddit.com/r/deepdream); NeuralStyle (Gatys et al. 2015); deepart.io, Prisma, etc.
  23. CNN Example: Recognition. OCR, house numbers, traffic signs. (Taigman et al., "DeepFace: Closing the Gap to Human-Level Performance in Face Verification," CVPR 2014)
  24. CNN Example: Object/Pedestrian Detection
  25. CNN Example: Scene Labeling. (Farabet et al., "Learning Hierarchical Features for Scene Labeling," PAMI 2013 (LeCun); Pinheiro et al., "Recurrent Convolutional Neural Networks for Scene Labeling," ICML 2014)
  26. CNN Example: Action Recognition from Videos. (Simonyan et al., "Two-Stream Convolutional Networks for Action Recognition in Videos," NIPS 2014; A. Karpathy et al., "Large-scale Video Classification with Convolutional Neural Networks," CVPR 2014)
  27. Convolutional Neural Networks (CNN, ConvNets)
  28. CNN (ConvNet) by LeCun in 1998. A neural network with a specialized connectivity structure: stack multiple stages of feature extractors, where higher stages compute more global, more invariant features, with a classification layer at the end. (Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE 86(11): 2278–2324, 1998)
  29. Basic Module in CNN. Feed-forward: convolve the input image with learned filters, apply a non-linearity (rectified linear), then pooling (local max) to produce feature maps. Supervised learning: train the convolutional filters by back-propagating the classification error. (LeCun et al. 1998)
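The convolve → non-linearity → pool module on this slide can be sketched in plain numpy. This is a toy single-filter version of my own, not the lecture's code; the vertical-edge kernel is an arbitrary stand-in for a learned filter.

```python
# One CNN "basic module": valid convolution, ReLU, then 2x2 max pooling.
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation (the 'convolution' used in CNNs)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    H, W = x.shape
    H, W = H - H % size, W - W % size            # truncate to a multiple of size
    x = x[:H, :W].reshape(H // size, size, W // size, size)
    return x.max(axis=(1, 3))

image = np.random.rand(9, 9)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)        # a vertical-edge filter
feature_map = max_pool(relu(conv2d(image, kernel)))
```

A 9x9 input with a 3x3 filter gives a 7x7 map, and 2x2 pooling reduces it to 3x3; in a real CNN the kernel values are what back-propagation learns.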
  30. Components of Each Layer: pixels/features → filter with dictionary (convolutional or tiled) → non-linearity → spatial/feature pooling (sum or max) [optional] → normalization between feature responses [optional] → output features. (Slide: R. Fergus)
  31. Convolutional Filtering (input → feature map): dependencies are local; translation equivariance; tied filter weights (few parameters); stride 1, 2, … (faster, less memory).
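The stride point can be made concrete: a stride-s filter visits every s-th position, so the output has (H - k) // s + 1 rows, which is why larger strides are faster and use less memory. A small illustrative sketch (mine, not from the slides):

```python
# Strided valid convolution; stride subsamples the filter positions.
import numpy as np

def conv2d_strided(image, kernel, stride=1):
    kh, kw = kernel.shape
    H, W = image.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

img = np.ones((8, 8))
k = np.ones((3, 3))
shape1 = conv2d_strided(img, k, stride=1).shape
shape2 = conv2d_strided(img, k, stride=2).shape
print(shape1, shape2)  # (6, 6) (3, 3)
```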
  32. Non-Linearity. Every neuron performs a non-linear operation σ(wx + b): tanh; sigmoid 1/(1+exp(-x)); rectified linear unit (ReLU), the preferred option: it simplifies backprop, makes learning faster, and avoids saturation issues.
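A quick sketch of the three activations named above. The key contrast for the slide's "avoids saturation" point: the sigmoid's gradient vanishes for large inputs, while ReLU's gradient stays 1 for any positive input.

```python
# tanh, sigmoid, and ReLU on the same inputs, plus their saturation behavior.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-5.0, 0.0, 5.0])
print(np.tanh(x))    # saturates near -1 and 1
print(sigmoid(x))    # saturates near 0 and 1
print(relu(x))       # [0. 0. 5.]

# Gradient at a large input: sigmoid's nearly vanishes, ReLU's stays 1.
g_sigmoid = sigmoid(5.0) * (1 - sigmoid(5.0))   # ~0.0066
g_relu = 1.0
```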
  33. Pooling: sum or max, over non-overlapping or overlapping regions. (See Boureau et al., ICML 2010 for a theoretical analysis.)
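Sum and max pooling over non-overlapping 2x2 regions, as on the slide, can be written with a reshape trick (my sketch, using an arbitrary 4x4 input):

```python
# Sum vs. max pooling over non-overlapping 2x2 blocks of a 4x4 map.
import numpy as np

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 1., 0.],
              [2., 3., 0., 1.]])

blocks = x.reshape(2, 2, 2, 2).swapaxes(1, 2)   # index as [row, col, 2, 2]
sum_pool = blocks.sum(axis=(2, 3))
max_pool = blocks.max(axis=(2, 3))
print(sum_pool)   # [[10. 26.] [ 6.  2.]]
print(max_pool)   # [[4. 8.] [3. 1.]]
```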
  34. Normalization: contrast normalization (across feature maps). Local mean = 0, local std. = 1 (7x7 Gaussian window); equalizes the feature maps.
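A simplified sketch of the effect: subtract the mean and divide by the standard deviation. The slide's version uses a local 7x7 Gaussian window; this global per-map version (my simplification) shows the same mean-0, std-1 outcome.

```python
# Per-feature-map contrast normalization (global, simplified version).
import numpy as np

def contrast_normalize(fmap, eps=1e-8):
    return (fmap - fmap.mean()) / (fmap.std() + eps)

fmap = np.random.rand(16, 16) * 50 + 100   # arbitrary scale and offset
out = contrast_normalize(fmap)
print(out.mean(), out.std())               # approximately 0 and 1
```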
  35. Compare: SIFT Descriptor. Image pixels → apply Gabor filters → spatial pooling (sum) → normalize to unit length → feature vector. (Lowe, IJCV 2004; slide: R. Fergus)
  36. An Example of CNN: an input of e.g. 200K numbers mapped to an output of e.g. 10 numbers.
  37. CNN Classical Models Comparison: AlexNet, GoogLeNet, VGG, ResNet
  38. ImageNet Challenge (ILSVRC): ~14 million labeled images in 20k classes, gathered from the Internet and labeled by humans via Amazon Mechanical Turk. Challenge: 1.2 million training images, 1000 classes. (Karpathy et al., "Large-scale Video Classification with Convolutional Neural Networks," CVPR 2014; Fei-Fei Li)
  39. ("car, 99%") The ILSVRC 2011 winner had a 25.8% error rate.
  40. Going Deeper from 2012. (Clarifai; He et al., "Deep Residual Learning for Image Recognition," CVPR 2016)
  41. AlexNet (2012 Winner). Similar framework to LeCun '98, but: more data (10^6 vs. 10^3 images); a bigger model (7 hidden layers, 650,000 units, 60,000,000 parameters); GPU implementation (50x speedup over CPU), trained on two GPUs for a week; better regularization for training (dropout). (A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," NIPS 2012)
  42. How Important Is Depth? (1/2) AlexNet: 8 layers total (Conv+Pool, Conv+Pool, Conv, Conv, Conv+Pool, Full, Full, Softmax), trained on ImageNet, 16.4% top-5 error. Remove the top fully connected layer (layer 7): 16 million fewer parameters, only a 1.1% drop in performance. Remove layers 6 and 7: 50 million fewer parameters, a 5.7% drop in performance.
  43. How Important Is Depth? (2/2) Remove layers 3 and 4: 1 million fewer parameters, a 3.0% drop in performance. Remove layers 3, 4, 6, and 7: a 33.5% drop in performance. Depth of the network is key.
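The "16 million parameters" figure for removing layer 7 checks out with simple arithmetic (my back-of-the-envelope check, not from the slides): AlexNet's fc7 maps 4096 units to 4096 units.

```python
# Parameter count of AlexNet's fc7 (4096 -> 4096 fully connected layer).
fc7_params = 4096 * 4096
print(fc7_params)          # 16777216, i.e. ~16.8 million
```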
  44. ZFNet (2013 2nd place; an improved AlexNet). CONV1: change from 11x11 stride 4 to 7x7 stride 2. CONV3,4,5: instead of 384, 384, 256 filters, use 512, 1024, 512. ImageNet top-5 error: 16.4% → 14.8%. Also visualizes the meaning of each layer. (M. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional Networks," ECCV 2014)
  45. VGGNet (2014 2nd place): 19 layers, 7.3% top-5 error (best model). Only 3x3 CONV (stride 1, pad 1) and 2x2 MAX POOL (stride 2). Total memory: 24M values * 4 bytes ≈ 93 MB per image (forward only; roughly 2x for backward). Total parameters: 138M. (Simonyan & Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR 2015)
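One standard argument for VGG's all-3x3 design, worth making explicit here (my addition, not stated on the slide): two stacked 3x3 layers cover a 5x5 receptive field with fewer parameters than a single 5x5 layer.

```python
# Parameter comparison for C input channels and C output channels.
C = 256
params_two_3x3 = 2 * (3 * 3 * C * C)   # two stacked 3x3 conv layers
params_one_5x5 = 5 * 5 * C * C         # one 5x5 conv layer
print(params_two_3x3, params_one_5x5)  # 1179648 vs. 1638400
```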
  46. GoogLeNet (2014 Winner): 22 layers, 6.7% top-5 error, built from Inception modules. Important feature: only 5 million parameters (removes fully connected layers completely). Compared to AlexNet: 12x fewer parameters, 2x more compute, 6.7% error (vs. 16.4%). (Szegedy et al., "Going Deeper with Convolutions," CVPR 2015)
  47. ResNet (2015 Winner): 152 layers, 3.6% top-5 error; the spatial dimension drops to only 56x56 early on. (He et al., "Deep Residual Learning for Image Recognition," CVPR 2016)
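ResNet's key idea can be sketched in a few lines (a minimal numpy sketch of my own, not the paper's architecture): each block learns a residual F(x) and adds it back to its input, y = F(x) + x, so gradients can flow through the identity path even in a 152-layer stack.

```python
# A toy fully-connected residual block: ReLU(W2 @ ReLU(W1 @ x) + x).
import numpy as np

def residual_block(x, W1, W2):
    h = np.maximum(W1 @ x, 0.0)           # inner layer + ReLU
    return np.maximum(W2 @ h + x, 0.0)    # add the identity shortcut

rng = np.random.default_rng(1)
x = rng.normal(size=4)

# With zero weights the block reduces to ReLU(x): the shortcut passes the
# input through wherever it is non-negative, which is why stacking extra
# residual blocks is easy to optimize.
zero = np.zeros((4, 4))
y = residual_block(x, zero, zero)
```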
  48. DL Trend: CNN Models. (Research by arxiv-sanity on a database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04)
  49. DL Trend: Optimization Algorithms. (Same arxiv-sanity study, 2017/04)
  50. DL Trend: Top Hot Keywords. (Same arxiv-sanity study, 2017/04)
  51. CNN Architectures for Different Applications
  52. CNN for Classification: image → CNN features → fully connected layer → a 1000-dim vector giving probabilities for the different classes (e.g. "tabby cat"); end-to-end learning.
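The final step of the classification pipeline, turning the fully connected layer's 1000 scores into class probabilities, is a softmax. A small sketch (the random scores stand in for real CNN outputs):

```python
# Softmax: map raw class scores to a probability distribution.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # shift by the max for stability
    return e / e.sum()

scores = np.random.randn(1000)          # stand-in for CNN output scores
probs = softmax(scores)
print(probs.sum())                       # sums to 1: a distribution over classes
```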
  53. Localization / Detection. Localization: image → CNN features → fully connected layer → class probabilities plus 4 numbers (x coordinate, y coordinate, width, height). Detection, e.g. YOLO (You Only Look Once; demo: http://pjreddie.com/darknet/yolo/): image → CNN features → 1x1 CONV → a 7x7x(5*B+C) output; for each of the 7x7 locations, B boxes of [x, y, width, height, confidence] plus class scores.
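The 7x7x(5*B+C) output size above is easy to verify; with B = 2 boxes per cell and C = 20 classes (the values used in the YOLO paper), the network emits 1470 numbers per image.

```python
# YOLO output tensor size: S x S grid, B boxes of 5 numbers each, C classes.
S, B, C = 7, 2, 20
output_size = S * S * (5 * B + C)
print(output_size)   # 7 * 7 * 30 = 1470
```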
  54. CNN for Pedestrian Detection: Faster R-CNN, a CNN with Region Proposal Networks (RPNs). (Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," NIPS 2015)
  55. Multi-scale Architecture. (Farabet et al., "Learning Hierarchical Features for Scene Labeling," PAMI 2013 (LeCun))
  56. Multi-modal Architecture. (Frome et al., "DeViSE: A Deep Visual-Semantic Embedding Model," NIPS 2013 (Bengio))
  57. Multi-task Architecture. (Zhang et al., "PANDA: Pose Aligned Networks for Deep Attribute Modeling," CVPR 2014)
  58. Semantic Segmentation: pixels in, pixels out. An NxNx3 image → CNN features → deconv layers → an NxNx20 array of class probabilities at each pixel, i.e. a class "map" of the image.
  59. Convolution and Deconvolution: ConvNet (CNN) vs. DeconvNet (deconvolutional layers), as used in convolutional autoencoders, variational autoencoders, and Generative Adversarial Nets.
  60. Image Denoising by Generative Adversarial Net
  61. Object Tracking by CNN and RNN
  62. Person Re-identification by CNN, RNN, and Siamese Net
  63. Deep Neural Networks in Practice
  64. CNN Libraries (open source): TensorFlow (Google): C++, Python; Torch: Lua; Keras: Python; Cuda-convnet (Google): C/C++, Python; Caffe2 (Facebook): C/C++, Matlab, Python; Caffe (Berkeley): C/C++, Matlab, Python; Overfeat (NYU): C/C++; ConvNetJS: JavaScript; MatConvNet (VLFeat): Matlab; DeepLearn Toolbox: Matlab
  65. Hardware. Buy your own GPU machine: NVIDIA DIGITS DevBox (TITAN X) or NVIDIA DGX-1 (P100 GPUs). GPUs in the cloud: Google Cloud Platform (GPU/TPU, TensorFlow), Amazon AWS EC2, Microsoft Azure. Training time: VGG, ~2-3 weeks with 4 GPUs; ResNet-101, 2-3 weeks with 4 GPUs.
  66. Q: How do I know what architecture to use? A: Don't be a hero. (1) Take whatever works best on ILSVRC (the latest ResNet); (2) download a pretrained model; (3) potentially add/delete some parts of it; (4) fine-tune it on your application. (Andrej Karpathy, Bay Area Deep Learning School, 2016)
  67. Q: How do I know what hyperparameters to use? A: Don't be a hero. Use whatever is reported to work best on ILSVRC, and play with the regularization strength (dropout rates). (Andrej Karpathy, Bay Area Deep Learning School, 2016)
  68. Thank you!
