This document is a slide presentation on recent advances in deep learning. It discusses self-supervised learning, which involves using unlabeled data to learn representations by predicting structural information within the data. The presentation covers pretext tasks, invariance-based approaches, and generation-based approaches for self-supervised learning in computer vision and natural language processing. It provides examples of specific self-supervised methods like predicting image rotations, clustering representations to generate pseudo-labels, and masked language modeling.
BERT: Bidirectional Encoder Representations from Transformers.
BERT is a pretrained model from Google for state-of-the-art NLP tasks.
BERT can take into account both the syntactic and semantic meaning of text.
An introduction to the Transformers architecture and BERT (Suman Debnath)
The transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, used mostly for natural language processing (NLP) tasks. Since its advent, the transformer has replaced RNNs and LSTMs for many tasks. It was a major breakthrough in NLP and paved the way for revolutionary architectures such as BERT.
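To make the architecture concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the transformer; the shapes and the toy inputs are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core attention operation: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k) arrays; V: (seq_len, d_v) array.
    Returns one weighted sum of value rows per query position.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # scaled pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over key positions
    return weights @ V

# Toy usage: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)               # self-attention: Q = K = V = x
print(out.shape)                                          # (4, 8)
```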
PR-355: Masked Autoencoders Are Scalable Vision Learners (Jinwon Lee)
- Masked Autoencoders Are Scalable Vision Learners presents a new self-supervised learning method called Masked Autoencoder (MAE) for computer vision.
- MAE works by masking random patches of input images, encoding only the visible patches, and decoding to reconstruct the full image (the masking step is sketched after this list). This forces the model to learn visual representations from incomplete views of images.
- Experiments on ImageNet show that MAE achieves superior results compared to supervised training from scratch as well as other self-supervised methods, and it scales effectively to larger models. MAE representations also transfer well to downstream tasks such as object detection, instance segmentation, and semantic segmentation.
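As a rough illustration of the masking step (not the authors' implementation), the sketch below keeps a random 25% of flattened patches; the patch count and dimensions assume a 224x224 image split into 16x16 patches.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches, as in MAE's encoder input.

    patches: (num_patches, patch_dim) array of flattened image patches.
    Returns the visible patches plus the index sets needed to restore
    the original patch order after decoding.
    """
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)            # random shuffle of patch positions
    keep_idx = perm[:n_keep]             # visible patches fed to the encoder
    mask_idx = perm[n_keep:]             # masked patches reconstructed by the decoder
    return patches[keep_idx], keep_idx, mask_idx

# A 224x224 RGB image in 16x16 patches gives 196 patches of dim 16*16*3 = 768.
patches = np.zeros((196, 768))
visible, keep_idx, mask_idx = random_masking(patches)
print(visible.shape)                     # (49, 768): only 25% of patches are encoded
```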
Survey of Attention mechanism & Use in Computer Vision (SwatiNarkhede1)
This presentation gives an overview of attention models. It also covers the stand-alone self-attention models used for computer vision tasks.
Continual learning involves building machine learning systems that can learn continuously over time from new data and tasks while retaining knowledge from previous learning. This mimics how humans learn throughout their lives. However, continual learning faces challenges like catastrophic forgetting where new learning interferes with past knowledge. Potential solutions involve balancing plasticity to learn new things with stability to retain old knowledge. The field is still new with experiments focused on simple tasks, but continual learning could enable increasingly intelligent systems that learn forever.
The document provides an introduction to diffusion models. It notes that diffusion models have achieved state-of-the-art performance in image generation, density estimation, and image editing. Specifically, it covers the Denoising Diffusion Probabilistic Model (DDPM), which reparametrizes the reverse distributions of diffusion models to make them more efficient. It also discusses the Denoising Diffusion Implicit Model (DDIM), which generates rough sketches of images and then refines them, significantly reducing the number of sampling steps needed compared to DDPM. In summary, diffusion models have emerged as a highly effective approach for generative modeling tasks.
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021) (Sergey Karayev)
This document covers a lecture on transfer learning and transformers. It begins with an outline of the topics: transfer learning in computer vision, embeddings and language models, ELMo/ULMFiT as "NLP's ImageNet moment", transformers, attention in detail, and BERT, GPT-2, DistilBERT, and T5. It then provides slides and explanations on these topics, discussing how transfer learning works, word embeddings, models such as Word2Vec, ELMo, and ULMFiT, the transformer architecture, attention mechanisms, and prominent transformer models.
The document discusses the BERT model for natural language processing. It begins with an introduction to BERT and how it achieved state-of-the-art results on 11 NLP tasks in 2018. The document then covers related work on language representation models including ELMo and GPT. It describes the key aspects of the BERT model, including its bidirectional Transformer architecture, pre-training using masked language modeling and next sentence prediction, and fine-tuning for downstream tasks. Experimental results are presented showing BERT outperforming previous models on the GLUE benchmark, SQuAD 1.1, SQuAD 2.0, and SWAG. Ablation studies examine the importance of the pre-training tasks and the effect of model size.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Scaling Instruction-Finetuned Language Models (taeseon ryu)
The document discusses improving the performance of language models on unseen tasks through instruction finetuning, wherein models are finetuned on a large collection of tasks described as instructions rather than examples. It finds that scaling both the number of finetuning tasks and the size of the model improves performance, and finetuning on chain-of-thought annotations particularly helps the model's reasoning abilities. Instruction finetuning is shown to generalize across models and improve usability while mitigating potential harms.
Emerging Properties in Self-Supervised Vision Transformers (Sungchul Kim)
The document summarizes the DINO self-supervised learning approach for vision transformers. DINO uses a teacher-student framework where the teacher's predictions are used to supervise the student through knowledge distillation. Two global and several local views of an image are passed through the student, while only global views are passed through the teacher. The student is trained to match the teacher's predictions for local views. DINO achieves state-of-the-art results on ImageNet with linear evaluation and transfers well to downstream tasks. It also enables vision transformers to discover object boundaries and semantic layouts.
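A minimal sketch of the distillation objective DINO uses, assuming projection-head outputs for one student/teacher view pair; the multi-crop pairing, the EMA teacher update, and the running-center update are omitted here.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions, DINO-style.

    The teacher output is centered and sharpened with a low temperature;
    gradients flow only through the student (the teacher target is detached).
    """
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

# Toy usage: a batch of 8 samples with 256-dim projection-head outputs.
student_out = torch.randn(8, 256)
teacher_out = torch.randn(8, 256)
center = teacher_out.mean(dim=0)   # stands in for the running center of the real method
print(dino_loss(student_out, teacher_out, center).item())
```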
Unsupervised Data Augmentation for Consistency Training (Sungchul Kim)
This document discusses semi-supervised learning and unsupervised data augmentation (UDA). It begins by explaining techniques in semi-supervised learning like entropy minimization and consistency regularization. It then introduces UDA, which trains models to be less sensitive to noise by minimizing the divergence between predictions on original and augmented data. The document reports on experiments applying UDA and comparing it to other methods on image datasets, finding it achieves better performance. It also explores techniques like training signal annealing and discusses ablation studies.
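The consistency term can be sketched as a KL divergence between the model's (fixed) prediction on the original example and its prediction on the augmented view; the noise injected below merely stands in for a real data augmentation.

```python
import torch
import torch.nn.functional as F

def uda_consistency_loss(logits_orig, logits_aug):
    """Unsupervised consistency term in the spirit of UDA.

    The prediction on the original example is treated as a fixed target
    (no gradient), and the prediction on the augmented view is pulled
    toward it by minimizing the KL divergence.
    """
    p_orig = F.softmax(logits_orig, dim=-1).detach()    # fixed target distribution
    log_p_aug = F.log_softmax(logits_aug, dim=-1)
    return F.kl_div(log_p_aug, p_orig, reduction="batchmean")

# Toy usage on a batch of 16 unlabeled examples with 10 classes.
logits_orig = torch.randn(16, 10)
logits_aug = logits_orig + 0.1 * torch.randn(16, 10)    # stands in for augmentation
print(uda_consistency_loss(logits_orig, logits_aug).item())
```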
1) Transformers use self-attention to address RNN weaknesses such as vanishing gradients and poor parallelization, combining ideas from CNNs and attention.
2) Transformers have encoder and decoder blocks: the encoder models the input and the decoder models the output. Variants drop the encoder (GPT) or the decoder (BERT) for language modeling.
3) GPT-3 is a large Transformer with 175B parameters that can perform many NLP tasks but still has safety and bias issues.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention (Po-Chuan Chen)
This paper proposes LLaMA-Adapter, a lightweight method to efficiently fine-tune the LLaMA language model into an instruction-following model. It uses learnable adaption prompts prepended to word tokens in higher transformer layers. Additionally, it introduces zero-initialized attention with a gating mechanism that incorporates instructional signals while preserving pre-trained knowledge. Experiments show LLaMA-Adapter can generate high-quality responses comparable to fully fine-tuned models, and it can be extended to multi-modal reasoning tasks.
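A simplified sketch of the zero-init gating idea, under the assumption that we model only the attention from word tokens to the adaption prompt; in the actual method the gate sits inside the self-attention of the topmost transformer layers.

```python
import torch
import torch.nn as nn

class ZeroInitGate(nn.Module):
    """Gating idea from LLaMA-Adapter (simplified sketch).

    Attention over the adaption prompt is scaled by a learnable gate that
    starts at zero, so at initialization the adapter contributes nothing
    and the pre-trained model's behavior is preserved.
    """
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate

    def forward(self, prompt_values, attn_scores):
        # prompt_values: (batch, n_prompt, dim) values from the adaption prompt
        # attn_scores:   (batch, n_tokens, n_prompt) raw attention to the prompt
        weights = torch.softmax(attn_scores, dim=-1) * torch.tanh(self.gate)
        return weights @ prompt_values  # zero at init, learned contribution later

gate = ZeroInitGate()
out = gate(torch.randn(2, 10, 64), torch.randn(2, 5, 10))
print(out.abs().max().item())  # 0.0 at initialization
```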
Google's Pathways Language Model and Chain-of-Thought (Vaclav1)
By Václav Košař (GLAMI): the Pathways Language Model (PaLM) is a 540-billion-parameter model with an architecture similar to GPT-3. Published April 4th, 2022, it achieves breakthrough capabilities on language understanding and generation, reasoning, and coding tasks. For reasoning tasks, for example, PaLM used chain-of-thought prompting, which applies a simulated inner monologue to solve grade-school-level math questions. In this talk, we discuss both a general-public-accessible intuition for how knowledge and reasoning can be represented in computers and the technical details of the PaLM architecture.
A Simple Introduction to Word Embeddings (Bhaskar Mitra)
In information retrieval there is a long history of learning vector representations for words. In recent times, neural word embeddings have gained significant popularity for many natural language processing tasks, such as word analogy and machine translation. The goal of this talk is to introduce the basic intuitions behind these simple but elegant models of text representation. We start the discussion with classic vector space models and then make our way to recently proposed neural word embeddings. We see how these models can be used for analogical reasoning as well as applied to many information retrieval tasks.
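Analogical reasoning with embeddings reduces to vector arithmetic plus a nearest-neighbor search. The tiny hand-made embedding table below exists only to exercise the function; real vectors would come from word2vec, GloVe, or a similar model.

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Solve "a is to b as c is to ?" by vector arithmetic.

    Returns the vocabulary word whose embedding is closest (by cosine
    similarity) to vec(b) - vec(a) + vec(c), excluding the query words.
    """
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

emb = {"king":  np.array([0.9, 0.8]), "queen": np.array([0.9, 0.2]),
       "man":   np.array([0.1, 0.8]), "woman": np.array([0.1, 0.2])}
print(analogy("man", "king", "woman", emb))  # "queen"
```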
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Vitaly Bondar)
1. This document describes Imagen, a new state-of-the-art photorealistic text-to-image diffusion model with deep language understanding.
2. Key contributions include using large frozen language models as effective text encoders, a new dynamic thresholding sampling technique for more photorealistic images (sketched after this list), and an efficient U-Net architecture.
3. On various benchmarks, including COCO FID and the new DrawBench, human evaluations found that Imagen generates images that align better with text prompts and outperforms other models, including DALL-E 2.
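A sketch of dynamic thresholding as the paper describes it: at each sampling step, clip the predicted clean image to [-s, s], where s is a high percentile of the absolute pixel values, then rescale by s. This keeps pixels in range under large guidance weights without the over-saturation that fixed clipping to [-1, 1] causes. The percentile value and array shapes below are illustrative.

```python
import numpy as np

def dynamic_threshold(x0_pred, percentile=99.5):
    """Dynamic thresholding for diffusion sampling (Imagen-style sketch).

    s is a high percentile of |x0_pred|; values are clipped to [-s, s]
    and rescaled by s so the output lies in [-1, 1].
    """
    s = np.percentile(np.abs(x0_pred), percentile)
    s = max(s, 1.0)                       # never shrink values already in range
    return np.clip(x0_pred, -s, s) / s

x = np.random.normal(scale=2.0, size=(3, 64, 64))  # out-of-range prediction
y = dynamic_threshold(x)
print(y.min(), y.max())                             # within [-1, 1]
```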
GPT-2: Language Models are Unsupervised Multitask Learners (Young Seok Kim)
This document summarizes a technical paper about GPT-2, an unsupervised language model created by OpenAI. GPT-2 is a transformer-based model trained on a large corpus of internet text using byte-pair encoding. The paper describes experiments showing GPT-2 can perform various NLP tasks like summarization, translation, and question answering with limited or no supervision, though performance is still below supervised models. It concludes that unsupervised task learning is a promising area for further research.
Transformer Seq2Seq Models: Concepts, Trends & Limitations (DLI) (Deep Learning Italia)
This document provides an overview of transformer seq2seq models, including their concepts, trends, and limitations. It discusses how transformer models have replaced RNNs for seq2seq tasks due to being more parallelizable and effective at modeling long-term dependencies. Popular seq2seq models like T5, BART, and Pegasus are introduced. The document reviews common pretraining objectives for seq2seq models and current trends in larger model sizes, task-specific pretraining, and long-range modeling techniques. Limitations discussed include the need for grounded representations and efficient generation for seq2seq models.
PR-409: Denoising Diffusion Probabilistic Models (Hyeongmin Lee)
This paper is Denoising Diffusion Probabilistic Models (DDPM), the work that first popularized the currently hot diffusion approach. Diffusion was first proposed at ICML 2015; DDPM elegantly resolved several of its practical issues and kicked off the trend. We look at the different families of generative models, diffusion itself, and what DDPM changed (a sketch of the training objective follows the links below).
Paper: https://arxiv.org/abs/2006.11239
Video: https://youtu.be/1j0W_lu55nc
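A minimal sketch of DDPM's training objective (Algorithm 1 of the paper): noise a clean image to a random timestep and train the network to predict the injected noise. The linear beta schedule and the placeholder model below are illustrative.

```python
import torch

def ddpm_training_loss(model, x0, alphas_cumprod):
    """Simplified DDPM noise-prediction objective.

    Sample a timestep t and Gaussian noise eps, form the noised image
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, and train the
    model to predict eps from (x_t, t).
    """
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return torch.mean((model(x_t, t) - eps) ** 2)

# Toy usage with a linear beta schedule and a dummy "model".
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
model = lambda x, t: torch.zeros_like(x)              # placeholder network
print(ddpm_training_loss(model, torch.randn(4, 3, 32, 32), alphas_cumprod).item())
```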
This document describes the CIFAR-10 dataset for classifying images into 10 categories. It contains 60,000 32x32 color images split into 50,000 training and 10,000 test images. Two methods are proposed: Method 1 extracts patches and features from each image and uses SVM/kNN, while Method 2 uses LoG and HoG features to preserve shape before SVM/kNN classification. Experiments test different parameters, with the best accuracy around 42% using a 13-dimensional Fisher vector and RBF SVM kernel.
The document discusses the application of transformers to computer vision tasks. It first introduces the standard transformer architecture and its use in natural language processing. It then summarizes recent works on applying transformers to object detection (DETR) and image classification (ViT). DETR proposes an end-to-end object detection method using a CNN-Transformer encoder-decoder architecture. Deformable DETR improves on DETR by incorporating deformable attention mechanisms. ViT represents images as sequences of patches and applies a standard Transformer encoder for image recognition, exceeding state-of-the-art models with less pre-training computation. While promising results have been achieved, challenges remain regarding model parameters and expanding transformer applications to other computer vision tasks.
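ViT's "images as sequences of patches" idea is essentially a reshape before a linear projection. Here is a sketch under the standard assumption of a 224x224 image and 16x16 patches.

```python
import torch

def image_to_patch_tokens(img, patch_size=16):
    """Turn an image into a sequence of flattened patches, as in ViT.

    img: (channels, height, width) tensor. Returns (num_patches, patch_dim),
    ready to be linearly projected and fed to a standard Transformer encoder.
    """
    c, h, w = img.shape
    patches = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # patches: (c, h/p, w/p, p, p) -> (h/p * w/p, c * p * p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
    return patches

tokens = image_to_patch_tokens(torch.randn(3, 224, 224))
print(tokens.shape)  # torch.Size([196, 768])
```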
Natural language processing and transformer models (Ding Li)
The document discusses several approaches for text classification using machine learning algorithms:
1. Count the frequency of individual words in tweets and sum for each tweet to create feature vectors for classification models like regression. However, this loses some word context information.
2. Use Bayes' rule and calculate word probabilities conditioned on class to perform naive Bayes classification. Laplacian smoothing is used to handle zero probabilities (see the sketch after this list).
3. Incorporate word n-grams and context by calculating word probabilities within n-gram contexts rather than independently. This captures more linguistic information than the first two approaches.
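A self-contained sketch of the naive Bayes approach with add-one (Laplacian) smoothing; the toy corpus and the choice to ignore out-of-vocabulary words at prediction time are illustrative simplifications.

```python
import numpy as np
from collections import Counter

def train_naive_bayes(docs, labels):
    """Multinomial naive Bayes with Laplacian (add-one) smoothing.

    docs: list of token lists; labels: list of class ids.
    Returns log priors and per-class log likelihoods over the vocabulary.
    """
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    log_prior, log_lik = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        log_prior[c] = np.log(len(class_docs) / len(docs))
        # Add-one smoothing: an unseen word gets probability 1 / (total + |V|).
        log_lik[c] = {w: np.log((counts[w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik

def predict(tokens, log_prior, log_lik):
    # Out-of-vocabulary tokens contribute 0.0 (i.e., they are ignored).
    scores = {c: log_prior[c] + sum(log_lik[c].get(w, 0.0) for w in tokens)
              for c in log_prior}
    return max(scores, key=scores.get)

docs = [["great", "movie"], ["terrible", "plot"], ["great", "plot"]]
labels = [1, 0, 1]
print(predict(["great", "movie"], *train_naive_bayes(docs, labels)))  # 1
```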
Searching for magic formula by deep learning (James Ahn)
Presented at a recent Samsung Open Source Conference, this talk applies deep learning to fintech to test whether machine-driven investing is actually feasible.
The first attempt ignored the data and relied entirely on the algorithm, ending in complete failure.
The second attempt started from an understanding of the data and applied an algorithm suited to it, producing far better results than the disastrous first try.
It confirmed that when working with data, understanding the data is the first and the last requirement.
Deep learning text NLP and Spark Collaboration: Korean deep learning Text NLP & Spark (hoondong kim)
This slide deck explains deep learning text NLP for the Korean language. It also discusses scaling the deep learning approach to big-data-scale datasets using Spark.
4. Personal opinion
- Narrow AI seems almost conquered
- What problems are still hard for today's deep learning?
- Cases with too little data
- Cases where the problem itself is very complex (long-term dependencies, highly multimodal)
- Can these be solved by simple scaling or architecture changes?
6. How do we get there?
- Observe the world like a baby
- Must be able to plan while interacting with the environment
- Must be compatible with gradient-based learning
7. Observe the world like a baby
- The world a baby sees has no labels → self-supervised learning
- A baby perceives the world through multiple senses → multimodal deep learning
- Must be able to plan while interacting with the environment → RL / Decision Transformer
10. What is self-supervised learning?
- A type of unsupervised learning
- Pick a pretext task and train on an unlabeled dataset
- Lightly transform or reuse information in the data itself as the supervision signal
- Choosing the pretext task is the most important part! (a rotation-prediction example is sketched after this list)
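As one classic pretext task (RotNet-style rotation prediction, used here purely as an example), supervision can be manufactured by rotating each image and asking the model to classify the rotation:

```python
import torch

def rotation_pretext_batch(images):
    """Build a rotation-prediction pretext batch (RotNet-style sketch).

    images: (batch, channels, height, width) tensor. Each image is rotated
    by 0/90/180/270 degrees; the rotation index is the pseudo-label, so
    supervision comes from the data itself.
    """
    rotated, labels = [], []
    for k in range(4):                                   # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))  # (H, W) axes of NCHW
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x, y = rotation_pretext_batch(torch.randn(8, 3, 32, 32))
print(x.shape, y.shape)  # (32, 3, 32, 32), (32,)
```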
18. Could we just reuse the two pretext tasks that succeeded in NLP?
- Autoregressive learning does not apply directly, since images are by nature not time-series data
- If a foundational model for video appears, the AR approach might be usable
- What about masked autoencoders?
19. Why mask prediction is hard for images
- Uncertainty is the biggest problem
- In NLP, the words that can fill a mask are discrete and finite
- So it can be approached as a classification task
- In CV, the masked content is high-dimensional and continuous → the uncertainty is far too severe
20. A closer look at mask prediction (denoising autoencoders)
- Mask prediction can be viewed as a kind of energy-based model
- Further reading: http://helper.ipam.ucla.edu/publications/mlpws4/mlpws4_15927.pdf
- An energy-based model means that, given a data pair, the model can tell whether the two items are a compatible pair or not
- What this means exactly: on the whiteboard…
26. The two most important things in Data2Vec
- Avoids the trivial solution with a BYOL-style momentum encoder (sketched after this list)
- Frames the task as predicting latent network representations rather than reconstruction
- This resolves the uncertainty problem CV/audio face in high dimensions
- And performance is strong too (beating RoBERTa on NLP tasks was surprising)
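A minimal sketch of the momentum (EMA) teacher update shared by BYOL- and Data2Vec-style methods; the linear modules and the momentum value below are placeholders.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Momentum (EMA) teacher update used by BYOL/Data2Vec-style methods.

    The teacher is never trained by gradient descent; its weights track an
    exponential moving average of the student's, which avoids the trivial
    collapsed solution a single shared network would find.
    """
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

student = torch.nn.Linear(8, 8)
teacher = torch.nn.Linear(8, 8)
teacher.load_state_dict(student.state_dict())  # start from the same weights
ema_update(teacher, student)
```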
28. Let's try it anyway!
- Encode with the ViT architecture
- Split the image into 16x16 patches and randomly mask 75% of them
- The masking ratio is surprisingly high: if it were low, the model could fill the masks in by interpolation, making it hard to learn semantic information
29. Limitations
A school bus is parked on a grey road
A school bus is parked on a [mask] road
The two are handled completely differently (a semantic segment vs. raw pixels)
30. Discussion points
- Self-supervised learning is the future
- Mask prediction and autoregressive learning in particular seem very human-like
- Unlike NLP, CV/audio still has no clearly right pretext task
- The Siamese-network approaches feel unsatisfying → not simple enough, somehow
- Isn't that why the Masked Autoencoder paper appeared despite acknowledging these limitations?
- Maybe multimodal self-supervised learning is the answer?
- Next time: CLIP, CoCa