
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020


Deep neural networks have achieved outstanding results in various applications such as vision, language, audio, speech, or reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. During the last year, multiple solutions have been proposed to address this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.



  1. 1. Slides @DocXavi [https://deep-self-supervised-learning.carrd.co/] Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center
  2. 2. Slides @DocXavi 2 Acknowledgements Víctor Campos Junting Pan Xunyu Lin Sebastian Palacio Carlos Arenas Oscar Mañas
  3. 3. Slides @DocXavi 3
  4. 4. Slides @DocXavi 4 Source: OpenSelfSup by MMLAB (CUHK & Nanyang)
  5. 5. Slides @DocXavi 5 Outline 1. Representation Learning 2. Unsupervised Learning 3. Self-supervised Learning 4. Predictive methods 5. Contrastive methods
  6. 6. Slides @DocXavi 6 Source: OpenSelfSup by MMLAB (CUHK & Nanyang)
  7. 7. Slides @DocXavi Representation + Learning pipeline Raw data (ex: images) Feature Extraction Classifier Decisor y = ‘CAT’ X1: weight X2: age Probabilities: CAT: 0.7 DOG: 0.3 7 Slide concept: Santiago Pascual (UPC 2019). Garfield is a character created by Jim Davis. Garfield and Odie are characters created by Jim Davis.
  8. 8. Slides @DocXavi Raw data (ex: images) Feature Extraction Classifier Decisor y = ‘CAT’ X1: weight X2: age Probabilities: CAT: 0.7 DOG: 0.3 Neural Network Shall we extract features now? 8 Representation + Learning pipeline Slide concept: Santiago Pascual (UPC 2019). Garfield is a character created by Jim Davis. Garfield and Odie are characters created by Jim Davis.
  9. 9. Slides @DocXavi Raw data (ex: images) Classifier Decisor y = ‘CAT’ Probabilities: CAT: 0.7 DOG: 0.3 Neural Network We CAN inject the raw data, and features will be learned in the NN !! End to End concept 9 End-to-end Representation Learning Slide concept: Santiago Pascual (UPC 2019). Garfield is a character created by Jim Davis. Garfield and Odie are characters created by Jim Davis.
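
A minimal PyTorch sketch of this end-to-end idea (the tiny network, random tensors and the two-class cat/dog setup are illustrative placeholders, not anything from the talk):

import torch
import torch.nn as nn

# End-to-end sketch: features and the decision layer are learned jointly by
# backprop from raw pixels, instead of hand-crafting a feature extractor.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),                      # two classes: CAT vs DOG
)
images = torch.randn(8, 3, 224, 224)       # a toy batch of raw images
labels = torch.randint(0, 2, (8,))         # toy labels
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                            # gradients reach every layer
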
  10. 10. Slides @DocXavi 10 Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012. 64,963 citations (June 2020)
  11. 11. Slides @DocXavi 11 ● 1,000 object classes (categories). ● Labeled Images: ○ 1.2 M train ○ 100k test. Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]
  12. 12. Slides @DocXavi 12 Outline 1. Representation Learning 2. Unsupervised Learning 3. Self-supervised Learning 4. Predictive methods 5. Contrastive methods
  13. 13. Slides @DocXavi 13 Source: OpenSelfSup by MMLAB (CUHK & Nanyang)
  14. 14. Slides @DocXavi Types of machine learning Yann LeCun's Black Forest cake 14
  15. 15. Slides @DocXavi 15 A broader picture of types of learning... Slide inspired by Alex Graves (Deepmind) at “Unsupervised Learning Tutorial” @ NeurIPS 2018. ...with a teacher ...without a teacher Active agent... Reinforcement learning (with extrinsic reward) Intrinsic motivation / Exploration. Passive agent... Supervised learning Unsupervised learning
  16. 16. Slides @DocXavi Supervised learning 16 Slide credit: Kevin McGuinness ŷ
  17. 17. Slides @DocXavi 17 A broader picture of types of learning... Slide inspired by Alex Graves (Deepmind) at “Unsupervised Learning Tutorial” @ NeurIPS 2018. ...with a teacher ...without a teacher Active agent... Reinforcement learning (with extrinsic reward) Intrinsic motivation / Exploration. Passive agent... Supervised learning Unsupervised learning
  18. 18. Slides @DocXavi Unsupervised learning 18 Slide credit: Kevin McGuinness
  19. 19. Slides @DocXavi Unsupervised learning 19 ● It is the natural way in which intelligent beings perceive the world. ● It can save enormous effort in building an AI agent compared to a fully supervised approach. ● Vast amounts of unlabelled data are available.
  20. 20. Slides @DocXavi 20 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive methods 4. Contrastive methods
  21. 21. Slides @DocXavi 21 Source: OpenSelfSup by MMLAB (CUHK & Nanyang)
  22. 22. Slides @DocXavi 22 Self-supervised learning (...) “Doing this properly and reliably is the greatest challenge in ML and AI of the next few years in my opinion.”
  23. 23. Slides @DocXavi 23
  24. 24. Slides @DocXavi 24 Self-supervised learning Reference: Andrew Zisserman (PAISS 2018) Self-supervised learning is a form of unsupervised learning where the data provides the supervision. ● A pretext (or surrogate) task must be invented by withholding a part of the unlabeled data and training the NN to predict it. Unlabeled data (X)
  25. 25. Slides @DocXavi 25 Self-supervised learning Reference: Andrew Zisserman (PAISS 2018) Self-supervised learning is a form of unsupervised learning where the data provides the supervision. ● By defining a proxy loss, the NN learns representations, which should be valuable for the actual downstream task. ŷ loss Representations learned without labels
  26. 26. Slides @DocXavi 26 Source: OpenSelfSup by MMLAB (CUHK & Nanyang)
  27. 27. Slides @DocXavi 27 Transfer Learning Target data E.g. PASCAL Target model Target labels Small amount of data/labels Instead of training a deep network from scratch for your task... Slide credit: Kevin McGuinness
  28. 28. Slides @DocXavi 28 Transfer Learning Source data E.g. ImageNet Source model Source labels Target data E.g. PASCAL Target model Target labels Transfer Learned Knowledge Large amount of data/labels Small amount of data/labels Slide credit: Kevin McGuinness Instead of training a deep network from scratch for your task: ● Take a network trained on a different source domain and/or task ● Adapt it for your target domain and/or task
  29. 29. Slides @DocXavi 29 Transfer Learning Slide credit: Kevin McGuinness labels Train deep net on “nearby” task using standard backprop ● Supervised learning (e.g. ImageNet). ● Self-supervised learning (e.g. Language Model) Cut off top layer(s) of network and replace with supervised objective for target domain Fine-tune network using backprop with labels for target domain until validation loss starts to increase conv2 conv3 fc1 conv1 surrogate loss surrogate data fc2 + softmax real labelsreal data real loss my_fc2 + softmax
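
A hedged PyTorch sketch of the fine-tuning recipe on this slide (ResNet-18 and the 20-class head are arbitrary illustrative choices; the source weights could equally come from a self-supervised model):

import torch
import torch.nn as nn
from torchvision import models

# Reuse a backbone trained on a "nearby" task, replace its top layer with a new
# head for the target domain, and fine-tune with the target labels.
backbone = models.resnet18(pretrained=True)            # older torchvision API; newer versions use weights=...
backbone.fc = nn.Linear(backbone.fc.in_features, 20)   # new head, e.g. 20 PASCAL classes

# Option: freeze everything except the new head before fine-tuning.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith("fc")
optimizer = torch.optim.SGD(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)
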
  30. 30. Slides @DocXavi 30 Self-supervised + Transfer Learning #Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin):
  31. 31. Slides @DocXavi 31 Transfer Learning: Video-lectures Kevin McGuinness (UPC DLCV 2016) Ramon Morros (UPC DLAI 2018)
  32. 32. Slides @DocXavi 32 Predictive vs Contrastive Methods Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
  33. 33. Slides @DocXavi 33 Predictive vs Contrastive Methods Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
  34. 34. Slides @DocXavi 34 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive Methods ○ Autoencoder
  35. 35. Slides @DocXavi 35 Autoencoder (AE) Fig: “Deep Learning Tutorial” Stanford Autoencoders: ● Predict at their output the same data they receive at the input. ● Do not need labels.
  36. 36. Slides @DocXavi 36 Autoencoder (AE) Slide: Kevin McGuinness (DLCV UPC 2017) Encoder W1 Decoder W2 hdata reconstruction Loss (reconstruction error) Latent variables (representation/features) 1. Initialize a NN by solving an autoencoding problem.
  37. 37. Slides @DocXavi 37 Autoencoder (AE) Slide: Kevin McGuinness (DLCV UPC 2017) Latent variables (representation/features) Encoder W1 hdata Classifier WC prediction y Loss (cross entropy) 1. Initialize a NN solving an autoencoding problem. 2. Train for final task with “few” labels.
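
A minimal PyTorch sketch of this two-step recipe (a linear encoder/decoder and random tensors stand in for a real network and dataset):

import torch
import torch.nn as nn

# Step 1: initialize the encoder by solving an autoencoding (reconstruction) problem.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())   # W1 -> latent variables h
decoder = nn.Linear(64, 784)                             # W2
x = torch.rand(32, 784)                                  # unlabeled toy batch
recon_loss = nn.MSELoss()(decoder(encoder(x)), x)        # reconstruction error
recon_loss.backward()

# Step 2: keep the encoder, add a classifier head, and train with the few labels.
classifier = nn.Linear(64, 10)                           # WC
y = torch.randint(0, 10, (32,))
ce_loss = nn.CrossEntropyLoss()(classifier(encoder(x)), y)
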
  38. 38. Slides @DocXavi 38 Autoencoder (AE) Fig: “Deep Learning Tutorial” Stanford What other application does the AE address?
  39. 39. Slides @DocXavi 39 Autoencoder (AE) Fig: “Deep Learning Tutorial” Stanford Dimensionality reduction: Use the hidden layer as a feature extractor of any desired size.
  40. 40. Slides @DocXavi 40 Denoising Autoencoder (DAE) Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features with denoising autoencoders." ICML 2008. 1. Corrupt the input signal, but predict the clean version at its output. 2. Train for final task with “few” labels. Encoder W1 Decoder W2 hdata reconstruction Loss (reconstruction error) corrupted data
  41. 41. Slides @DocXavi 41 Latent variables (representation/features) Encoder W1 hdata Classifier WC prediction y Loss (cross entropy) 1. Corrupt the input signal, but predict the clean version at its output. 2. Train for final task with “few” labels. Denoising Autoencoder (DAE)
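
The same sketch with the denoising twist (the 30% masking corruption is just one possible choice of noise):

import torch
import torch.nn as nn

# Denoising autoencoder sketch: corrupt the input, but reconstruct the clean signal.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)
clean = torch.rand(16, 784)
corrupted = clean * (torch.rand_like(clean) > 0.3).float()   # zero out ~30% of the inputs
loss = nn.MSELoss()(decoder(encoder(corrupted)), clean)      # target is the clean version
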
  42. 42. Slides @DocXavi 42 Denoising Autoencoder (DAE) Source: “Tutorial: Your First Model (DAE)” by Opendeep.org
  43. 43. Slides @DocXavi 43 Eigen, David, Dilip Krishnan, and Rob Fergus. "Restoring an image taken through a window covered with dirt or rain." ICCV 2013.
  44. 44. Slides @DocXavi 44 #BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019 (best paper award). Take a sequence of words as input, mask out 15% of the words, and ask the system to predict the missing words (or a distribution over words). Masked Autoencoder (MAE)
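
A toy sketch of the masking step (the token ids, mask id and masking rate are placeholders; the transformer model and vocabulary are omitted):

import torch

MASK_ID = 0                                        # hypothetical id of the [MASK] token
tokens = torch.randint(1, 1000, (4, 32))           # toy batch of token-id sequences
mask = torch.rand(tokens.shape) < 0.15             # pick ~15% of the positions
inputs = tokens.masked_fill(mask, MASK_ID)         # hide the selected words
# training: cross-entropy between model(inputs) and tokens at the masked positions only
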
  45. 45. Slides @DocXavi 45 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive Methods ○ Spatial relations
  46. 46. Slides @DocXavi 46 Spatial verification Predict the relative position between two image patches. #RelPos Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context prediction." ICCV 2015.
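
A rough sketch of how such a pretext pair could be built (patch size, image size and the 8-neighbour labelling are illustrative; the actual paper also adds gaps and jitter between patches to avoid trivial shortcuts):

import torch

# Cut a centre patch and one of its 8 neighbours; the label is which neighbour it is.
def make_relpos_pair(image, patch=64):
    # image: (3, H, W) tensor, large enough to hold a 3x3 grid of patches
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    cy, cx = image.shape[1] // 2, image.shape[2] // 2
    label = torch.randint(0, 8, (1,)).item()
    dy, dx = offsets[label]
    centre = image[:, cy - patch // 2:cy + patch // 2, cx - patch // 2:cx + patch // 2]
    neigh = image[:, cy - patch // 2 + dy * patch:cy + patch // 2 + dy * patch,
                     cx - patch // 2 + dx * patch:cx + patch // 2 + dx * patch]
    return centre, neigh, label     # a two-branch CNN is trained to predict `label`
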
  47. 47. Slides @DocXavi 47 Spatial verification #Jigsaw Noroozi, M., & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV 2016. Train a neural network to solve a jigsaw puzzle.
  48. 48. Slides @DocXavi 48 Spatial verification #Jigsaw Noroozi, M., & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV 2016. Train a neural network to solve a jigsaw puzzle.
  49. 49. Slides @DocXavi 49 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive Methods ○ Temporal relations
  50. 50. Slides @DocXavi 50 Temporal coherence #Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin): Temporal order of frames is exploited as the supervisory signal for learning.
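
A toy sketch of how an order-verification sample could be drawn from a clip (the triplet size and shuffling rule are placeholders; the actual paper is more careful about which frames it samples):

import torch

# Draw three frames in increasing temporal order; half the time keep them ordered
# (positive), half the time shuffle them (negative).
def make_order_sample(clip):
    # clip: (T, C, H, W) tensor of video frames, T >= 3
    idx = torch.sort(torch.randperm(clip.shape[0])[:3]).values
    if torch.rand(1).item() < 0.5:
        return clip[idx], 1                       # correct temporal order
    return clip[idx[torch.tensor([1, 0, 2])]], 0  # shuffled order
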
  51. 51. Slides @DocXavi 51 Temporal sorting Lee, Hsin-Ying, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. "Unsupervised representation learning by sorting sequences." ICCV 2017. Sort the sequence of frames.
  52. 52. Slides @DocXavi 52#T-CAM Wei, Donglai, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. "Learning and using the arrow of time." CVPR 2018. Predict whether the video moves forward or backward. Arrow of time
  53. 53. Slides @DocXavi 53 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive Methods ○ Cycle Consistency
  54. 54. Slides @DocXavi 54 #TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros, “Learning Correspondence from the Cycle-Consistency of Time”. CVPR 2019. [code] [Slides by Carles Ventura] Cycle-consistency Training: track backward and then forward with pixel embeddings, then train a NN to match the pairs of associated pixel embeddings.
  55. 55. Slides @DocXavi 55#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros, “Learning Correspondence from the Cycle-Consistency of Time”. CVPR 2019. [code] [Slides by Carles Ventura] Cycle-consistency Inference (video object segmentation): Track pixels across video sequences.
  56. 56. Slides @DocXavi 56 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive Methods ○ Video Frame Prediction
  57. 57. Slides @DocXavi 57 Video Frame Prediction Slide credit: Yann LeCun
  58. 58. Slides @DocXavi 58 Next word prediction (language model) Figure: TensorFlow tutorial Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of machine learning research 3, no. Feb (2003): 1137-1155. Self-supervised learning
  59. 59. Slides @DocXavi 59 Next sentence prediction (TRUE/FALSE) #BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019 (best paper award). BERT is also pre-trained on a binarized next sentence prediction task: sentence pairs (A, B) are sampled such that in 50% of the cases B is the actual sentence that follows A, and in the remaining 50% B is a random sentence from the corpus; the model learns to tell the two apart.
  60. 60. Slides @DocXavi 60 Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." In ICML 2015. [Github] Learning video representations (features) by... Video Frame Prediction
  61. 61. Slides @DocXavi 61 (1) frame reconstruction (AE): Learning video representations (features) by... Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015. [Github] Video Frame Prediction
  62. 62. Slides @DocXavi 62 Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015. [Github] Learning video representations (features) by... (2) frame prediction Video Frame Prediction
  63. 63. Slides @DocXavi 63 Unsupervised learned features (lots of data) are fine-tuned for activity recognition (small data). Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015. [Github] Video Frame Prediction
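
A minimal sketch of the frame-prediction variant from the previous slides (flattened frames and a single-layer LSTM are stand-ins for the encoder-decoder LSTMs of the paper):

import torch
import torch.nn as nn

# Encode past frames with an LSTM and regress the next frame; the hidden state
# is the video representation that later gets fine-tuned for recognition.
frames = torch.rand(16, 8, 1024)                 # (batch, time, flattened frame) toy clip
lstm = nn.LSTM(1024, 256, batch_first=True)
head = nn.Linear(256, 1024)
out, _ = lstm(frames[:, :-1])                    # read frames 1 .. T-1
loss = nn.MSELoss()(head(out[:, -1]), frames[:, -1])   # predict frame T
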
  64. 64. Slides @DocXavi 64 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive Methods ○ Colorization
  65. 65. Slides @DocXavi 65 Zhang, R., Isola, P., & Efros, A. A. Colorful image colorization. ECCV 2016. Colorization (still image) Surrogate task: Colorize a gray scale image.
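
A heavily simplified sketch of this surrogate task (the paper works in Lab colour space and treats colour prediction as classification over quantized bins; a plain RGB regression stands in here):

import torch
import torch.nn as nn

# The "label" is the colour information thrown away by the grayscale conversion.
rgb = torch.rand(8, 3, 64, 64)                          # toy colour images
gray = rgb.mean(dim=1, keepdim=True)                    # crude grayscale input
colorizer = nn.Conv2d(1, 3, kernel_size=3, padding=1)   # stand-in for a deep CNN
loss = nn.MSELoss()(colorizer(gray), rgb)               # predict the colours back
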
  66. 66. Slides @DocXavi 66 Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by colorizing videos." ECCV 2018. [blog] Pixel tracking Pre-training: a NN is trained to point to the location in a reference frame with the most similar color...
  67. 67. Slides @DocXavi 67 Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by colorizing videos." ECCV 2018. [blog] Pre-training: a NN is trained to point to the location in a reference frame with the most similar color… computing a loss with respect to the actual color. Pixel tracking
  68. 68. Slides @DocXavi 68Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by colorizing videos." ECCV 2018. [blog] For each grayscale pixel to colorize, find the color of most similar pixel embedding (representation) from the first frame. Pixel tracking
  69. 69. Slides @DocXavi 69Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by colorizing videos." ECCV 2018. [blog] Pixel tracking Application: colorization from a reference frame. Reference frame Grayscale video Colorized video
  70. 70. Slides @DocXavi 70 Outline 1. Unsupervised Learning 2. Self-supervised Learning 3. Predictive Methods 4. Contrastive Methods
  71. 71. Slides @DocXavi 71 Predictive vs Contrastive Methods Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
  72. 72. Slides @DocXavi 72
  73. 73. Slides @DocXavi The Gelato Bet "If, by the first day of autumn (Sept 23) of 2015, a method will exist that can match or beat the performance of R-CNN on Pascal VOC detection, without the use of any extra, human annotations (e.g. ImageNet) as pre-training, Mr. Malik promises to buy Mr. Efros one (1) gelato (2 scoops: one chocolate, one vanilla)." Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
  74. 74. Slides @DocXavi Hadsell, Raia, Sumit Chopra, and Yann LeCun. "Dimensionality reduction by learning an invariant mapping." CVPR 2006. Learn G_W(X) with pairs (X_1, X_2) which are similar (Y=0) or dissimilar (Y=1). Contrastive Loss: similar pair (X_1, X_2) vs. dissimilar pair (X_1, X_2)
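
A direct sketch of that margin-based loss (the margin value and embedding shapes are arbitrary; the scaling constant of the original formula is dropped):

import torch
import torch.nn.functional as F

# Pull similar pairs (Y=0) together, push dissimilar pairs (Y=1) at least
# `margin` apart in embedding space.
def contrastive_loss(g1, g2, y, margin=1.0):
    # g1, g2: (N, D) embeddings G_W(X_1), G_W(X_2); y: (N,) with 0=similar, 1=dissimilar
    d = F.pairwise_distance(g1, g2)
    return torch.mean((1 - y) * d.pow(2) + y * F.relu(margin - d).pow(2))

loss = contrastive_loss(torch.randn(16, 32), torch.randn(16, 32),
                        torch.randint(0, 2, (16,)).float())
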
  75. 75. Slides @DocXavi Contrastive loss between: ● Encoded features of a query image. ● A running queue of encoded keys. Losses over representations (instead of pixels) facilitate abstraction. #MoCo He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. . Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020. [code] Momentum Contrast (MoCo)
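
A toy sketch of MoCo's two distinctive mechanics, the momentum-updated key encoder and the running queue of keys (linear encoders, sizes and the momentum value are placeholders):

import torch
import torch.nn as nn

query_encoder = nn.Linear(512, 128)
key_encoder = nn.Linear(512, 128)
key_encoder.load_state_dict(query_encoder.state_dict())

m, queue = 0.999, torch.randn(4096, 128)              # momentum and a toy key queue
with torch.no_grad():
    # the key encoder trails the query encoder as an exponential moving average
    for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
        p_k.mul_(m).add_((1 - m) * p_q)
    new_keys = key_encoder(torch.randn(32, 512))      # encode the current batch as keys
    queue = torch.cat([new_keys, queue])[:4096]       # enqueue newest, drop oldest
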
  76. 76. Slides @DocXavi Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. Unsupervised feature learning via non-parametric instance discrimination. CVPR 2018. Pretext task: Treat each image instance as a distinct class of its own and train a classifier to distinguish whether two crops (views) belong to the same image. Momentum Contrast (MoCo)
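
A sketch of the InfoNCE-style loss used for this kind of instance discrimination (the feature batches are assumed to be already encoded and are random here; sizes and temperature are placeholders):

import torch
import torch.nn.functional as F

# A query should match its positive key (another view of the same image) against
# a set of negative keys (views of other images, e.g. MoCo's queue).
def info_nce(q, k_pos, k_neg, temperature=0.07):
    q, k_pos, k_neg = F.normalize(q, dim=1), F.normalize(k_pos, dim=1), F.normalize(k_neg, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (N, 1) similarity to the positive
    l_neg = q @ k_neg.t()                               # (N, K) similarities to the negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.shape[0], dtype=torch.long)  # the positive is always class 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128), torch.randn(4096, 128))
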
  77. 77. Slides @DocXavi 77 Transfer Learning from ImageNet labels (super. IM-1M) Source data E.g. ImageNet Source model Source labels Target data E.g. PASCAL Target model Target labels Transfer Learned Knowledge Large amount of data/labels Small amount of data/labels Slide credit: Kevin McGuinness
  78. 78. Slides @DocXavi 78 Transfer Learning without ImageNet labels (MoCo IN-1M) Source data E.g. ImageNet Source model Source labels Target data E.g. PASCAL Target model Target labels Transfer Learned Knowledge Large amount of data Small amount of data/labels Slide concept: Kevin McGuinness
  79. 79. Slides @DocXavi 79 Transfer Learning (super. IN-1M vs MoCo IN-1M) Transferring to VOC object detection: a Faster R-CNN detector (C4-backbone) is fine-tuned end-to-end.
  80. 80. Slides @DocXavi 80
  81. 81. Slides @DocXavi 81
  82. 82. Slides @DocXavi 82 Correlated views by data augmentation Original image cc-by: Von.grzanka #SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
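
A sketch of how the two correlated views could be generated (the particular augmentations and their strengths are illustrative; the SimCLR paper studies which combinations actually matter):

from torchvision import transforms

# Two independent random augmentations of the same image form a positive pair;
# every other image in the batch provides the negatives.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])
# given a PIL image `img`: view_1, view_2 = augment(img), augment(img)
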
  83. 83. Slides @DocXavi 83 https://paperswithcode.com/sota/self-supervised-image-classification-on Coming next...
  84. 84. Slides @DocXavi Unsupervised Representation Learning OpenSelfSup by MMLab [tweet] CMC: https://arxiv.org/abs/1906.05849 MoCo: https://arxiv.org/abs/1911.05722 PIRL: https://arxiv.org/abs/1912.01991 SimCLR: https://arxiv.org/abs/2002.05709 MoCov2: https://arxiv.org/abs/2003.04297 Shortcut removal lens: https://arxiv.org/abs/2002.08822 SSL study: https://arxiv.org/abs/2003.14323 SimCLRv2: https://arxiv.org/abs/2006.10029 SwAV: https://arxiv.org/abs/2006.09882 DCL: https://arxiv.org/abs/2007.00224 Linear probing SOTA: https://paperswithcode.com/sota/self-supervised-image-classification-on Fine-tuning SOTA (1%): https://paperswithcode.com/sota/semi-supervised-image-classification-on-1 Fine-tuning SOTA (10%): https://paperswithcode.com/sota/semi-supervised-image-classification-on-2 By Oscar Mañas
  85. 85. Slides @DocXavi 85 Outline 1. Representation Learning 2. Unsupervised Learning 3. Self-supervised Learning a. Predictive Methods b. Contrastive Methods
  86. 86. Slides @DocXavi 86 Suggested reading Alex Graves, Kelly Clancy, “Unsupervised learning: The curious pupil”. Deepmind blog 2019.
  87. 87. Slides @DocXavi 87 Suggested talks Demis Hassabis (Deepmind) Aloysha Efros (UC Berkeley) Andrew Zisserman (Oxford/Deepmind) Yann LeCun (Facebook AI)
  88. 88. Slides @DocXavi 88 Related lecture Aravind Srinivas, “Deep Unsupervised Learning”, UC Berkeley 2020
  89. 89. Slides @DocXavi 89 Video lectures on Unsupervised Learning Kevin McGuinness, UPC DLCV 2016 Xavier Giró, UPC DLAI 2017
  90. 90. Slides @DocXavi Extended (and a bit outdated) version of this talk 90 Lecture from the Master in Computer Vision Barcelona 2019
  91. 91. Slides @DocXavi 91 Coming related tutorial Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. ICMR 2020. Monday 26th October, 2020
  92. 92. Slides @DocXavi 92 Related lecture Self-supervised Audiovisual Learning [slides] [video]
  93. 93. Slides @DocXavi Deep Learning @ Barcelona, Catalonia
  94. 94. Slides @DocXavi Deep Learning @ UPC TelecomBCN Foundations ● MSc course [2017] [2018] [2019] ● BSc course [2018] [2019] [2020] Multimedia Vision: [2016] [2017][2018][2019] Language & Speech: [2017] [2018] [2019] Reinforcement Learning ● [2020 Spring] [2020 Autumn]
  95. 95. Slides @DocXavi 4th (face-to-face) & 5th edition (online) start November 2020. Sign up here.
  96. 96. Slides @DocXavi Our research
  97. 97. Slides @DocXavi Thank you xavier.giro@upc.edu @DocXavi Amanda Duarte Xavi Giró Míriam Bellver Benet Oriol Carles Ventura Oscar Mañas Maria Gonzalez Laia Tarrés Peter Muschick Giannis Kazakos Lucas Ventura Andreu Girbau Dèlia Fernández Eduard Ramon Pol Caselles Victor Campos Cristina Puntí Juanjo Nieto
