Show and Tell: A Neural Image Caption Generator - Hojin Yang
The document describes a neural image caption generator built on a CNN-RNN model: a CNN such as GoogLeNet encodes the image into a vector, which is then fed to an RNN decoder that generates the caption word by word. The RNN is trained with a cross-entropy loss to maximize the probability of the correct caption. The document also discusses techniques for handling variable-length captions, sampling methods for inference, and using attention to focus on parts of the image.
This document discusses multi-layer perceptrons (MLPs) as an alternative to self-attention in transformer models. It introduces gated MLP (gMLP) blocks that maintain spatial information across tokens through a spatial gating unit, without positional encodings. Experiments on image classification and masked language modeling show that gMLPs can match or exceed self-attention performance given increased model capacity. A hybrid model combining a small amount of self-attention with the spatial gating unit performs best, suggesting that spatial interactions can substitute for positional encodings.
Deep Learning Based Object Detection Basics - Brodmann17
The document discusses different deep learning approaches to object detection in images. It begins by framing detection as classification, where an image is classified according to which objects are present. It then discusses approaches that separate detection into a classification head and a localization head. The document also covers improvements such as R-CNN, which uses region proposals to generate candidate object regions before running classification and bounding-box regression on those regions using CNN features. This addresses shortcomings of earlier approaches, such as the cost of running the CNN over the entire image at multiple locations and scales.
The document discusses the FaceNet paper, which proposes a unified embedding for face recognition and clustering using a deep neural network. Some key points:
- FaceNet uses a triplet loss during training to learn an embedding space in which distances between faces correspond to whether they belong to the same person.
- This eliminates the need for complex multi-stage training pipelines used by previous works.
- On standard benchmarks, FaceNet achieves over 99% accuracy for face verification, outperforming prior state-of-the-art models.
- The unified embedding allows for face recognition via distance thresholding and face clustering via k-means in the learned space.
Tracking is the problem of estimating the trajectory of an object as it moves around a scene. Motion tracking involves collecting data on human movement using sensors to control outputs like music or lighting based on performer actions. Motion tracking differs from motion capture in that it requires less equipment, is less expensive, and is concerned with qualities of motion rather than highly accurate data collection. Optical flow estimates the pixel-wise motion between frames in a video by calculating velocity vectors for each pixel.
This document provides an overview of single image super resolution using deep learning. It discusses how super resolution can be used to generate a high resolution image from a low resolution input. Deep learning models like SRCNN were early approaches for super resolution but newer models use deeper networks and perceptual losses. Generative adversarial networks have also been applied to improve perceptual quality. Key applications are in satellite imagery, medical imaging, and video enhancement. Metrics like PSNR and SSIM are commonly used but may not correlate with human perception. Overall, deep learning has advanced super resolution techniques but challenges remain in fully evaluating perceptual quality.
Super Resolution in the Deep Learning Era - Jaejun Yoo
1) The document discusses super-resolution techniques in deep learning, including inverse problems, image restoration problems, and different deep learning models.
2) Early models like SRCNN used convolutional networks for super-resolution but were shallow, while later models incorporated residual learning (VDSR), recursive learning (DRCN), and became very deep and dense (SRResNet).
3) Key developments included EDSR which provided a strong backbone model and GAN-based approaches like SRGAN which aimed to generate more realistic textures but require new evaluation metrics.
Photo-realistic Single Image Super-resolution using a Generative Adversarial ... - Hansol Kang
* Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
PR-231: A Simple Framework for Contrastive Learning of Visual Representations - Jinwon Lee
The document presents SimCLR, a framework for contrastive learning of visual representations using simple data augmentation. Key aspects of SimCLR include random cropping and color distortion to generate positive pairs for the contrastive loss, a nonlinear projection head on top of the learned representation, and large batch sizes. Evaluation shows that SimCLR learns representations that outperform prior self-supervised methods on downstream tasks and achieves state-of-the-art results using only view augmentation and a contrastive loss.
PR-207: YOLOv3: An Incremental Improvement - Jinwon Lee
YOLOv3 makes the following incremental improvements over previous versions of YOLO:
1. It predicts bounding boxes at three different scales to detect objects more accurately at a variety of sizes.
2. It uses Darknet-53 as its feature extractor, which provides better performance than ResNet while being faster to evaluate.
3. It predicts far more bounding boxes overall (over 10,000), compared to around 800 for YOLOv2, to localize objects more precisely.
YOLO is an end-to-end, real-time object detection system that uses a single convolutional neural network to predict bounding boxes and class probabilities directly from full images. YOLOv3 uses the deeper Darknet-53 backbone and multi-scale predictions to achieve state-of-the-art accuracy while running faster than competing detectors. The backbone is pretrained on ImageNet and the detector is trained on COCO; boxes are predicted from predefined anchor boxes, together with class probabilities, at three different scales, localizing and classifying objects in a single pass through the network.
Recent Progress on Single-Image Super-Resolution - Hiroto Honda
This document summarizes recent progress in single image super resolution (SISR) techniques using deep convolutional neural networks. It discusses early networks like SRCNN and VDSR, as well as more advanced models such as SRResNet, SRGAN, and EDSR that utilize residual blocks and perceptual loss functions. The document notes that while SISR accuracy has improved significantly in recent years, achieving both high PSNR and natural perceptual quality remains challenging due to a distortion-perception tradeoff. It concludes that the application determines whether more accurate or plausible output is preferred.
Image Segmentation Using Deep Learning: A Survey - NUPUR YADAV
1. The document discusses various deep learning models for image segmentation, including fully convolutional networks, encoder-decoder models, multi-scale pyramid networks, and dilated convolutional models.
2. It provides details on popular architectures like U-Net, SegNet, and models from the DeepLab family.
3. The document also reviews datasets commonly used to evaluate image segmentation methods and reports accuracies of different models on the Cityscapes dataset.
The document describes improvements made to the YOLO object detection model, resulting in YOLOv2 and YOLO9000. YOLOv2 uses techniques like batch normalization, anchor boxes, and multi-scale training to improve accuracy while maintaining speed. YOLO9000 further trains the model on a combination of detection and classification datasets totaling over 9,000 classes, using a word hierarchy to relate labels. This allows training an object detector on a much larger scale than typical detection datasets alone. Evaluation shows YOLO9000 can detect new classes it wasn't directly trained on, bringing object detection closer to the scale of image classification tasks.
The 17th BOAZ Big Data Conference - [6시내고양포CAT몬]: Cat Anti-aging Project based Style... - BOAZ Bigdata
The 6시내고양포CAT몬 team carried out the following data analysis project:
Cat Anti-aging Project based StyleGAN2
18th cohort: Park Gyu-yeon, School of Software, Kookmin University
18th cohort: Kim Ga-young, Department of Statistics, Sookmyung Women's University
18th cohort: Seo Eun-yu, Department of Information Statistics, Dongduk Women's University
18th cohort: Lee Gi-won, Department of Food and Resource Economics, Korea University
Object Detection Methods using Deep Learning - Sungjoon Choi
The document discusses object detection techniques including R-CNN, SPPnet, Fast R-CNN, and Faster R-CNN. R-CNN uses region proposals and CNN features to classify each region. SPPnet improves efficiency by computing CNN features once for the whole image. Fast R-CNN further improves efficiency by sharing computation and using a RoI pooling layer. Faster R-CNN introduces a region proposal network to generate proposals, achieving end-to-end training. The techniques showed improved accuracy and processing speed over prior methods.
PR-132: SSD: Single Shot MultiBox Detector - Jinwon Lee
SSD is a single-shot object detector that processes the entire image at once rather than proposing regions of interest. It uses a base VGG16 network with additional convolutional layers to predict bounding boxes and class probabilities from feature maps at multiple scales simultaneously. SSD achieves state-of-the-art accuracy while running significantly faster than two-stage detectors like Faster R-CNN. It introduces techniques such as default boxes, hard negative mining, and data augmentation to address class imbalance and improve results on small objects. On PASCAL VOC 2007, SSD detects objects at 59 FPS with 74.3% mAP, comparable in accuracy to Faster R-CNN but much faster.
Slides by Amaia Salvador at the UPC Computer Vision Reading Group.
Source document on GDocs with clickable links:
https://docs.google.com/presentation/d/1jDTyKTNfZBfMl8OHANZJaYxsXTqGCHMVeMeBe5o1EL0/edit?usp=sharing
Based on the original work:
Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems, pp. 91-99. 2015.
Paper introduction: Flamingo: a Visual Language Model for Few-Shot Learning - Toru Tamaki
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karén Simonyan, "Flamingo: a Visual Language Model for Few-Shot Learning" NeurIPS2022
https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector - Jinwon Lee
This is the 270th paper review from the TensorFlow Korea paper-reading group PR12.
The paper is PP-YOLO: An Effective and Efficient Implementation of Object Detector, from Baidu. By applying a variety of techniques to YOLOv3, it achieves both very high accuracy and very high speed. The review takes a closer look at the various tricks used in the paper. If you are interested in object detection techniques such as deformable convolution, exponential moving average, DropBlock, IoU-aware prediction, grid sensitivity elimination, Matrix NMS, and CoordConv, please refer to the video and slides.
Paper: https://arxiv.org/abs/2007.12099
Video: https://youtu.be/7v34cCE5H4k
This document summarizes Deep Q-Networks (DQN), a deep reinforcement learning algorithm that was able to achieve human-level performance on many Atari 2600 games. The key ideas of DQN include using a deep neural network to approximate the Q-function, experience replay to increase data efficiency, and a separate target network to stabilize learning. DQN has inspired many follow up algorithms, including double DQN, dueling DQN, prioritized experience replay, and noisy networks for better exploration. DQN was able to learn human-level policies directly from pixels and rewards for many Atari games using the same hyperparameters and network architecture.
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Gen... - ssuserffe940
DreamBooth overcomes a limitation of large text-to-image models, which excel at synthesizing high-quality, diverse images from a text prompt: existing models cannot mimic the appearance of a subject from a given reference set and generate novel renditions of it in other contexts.
Personalized image generation
DreamBooth takes only a few images of a subject as input and fine-tunes a pretrained text-to-image model. Through this, the model learns to bind a unique identifier to the specific subject. Using this identifier, the subject embedded in the model's output domain can be synthesized into novel photorealistic images in a variety of scenes.
Applications of the technique
The technique has been applied to several previously intractable tasks, including subject recontextualization, text-guided view synthesis, and rendering. The model succeeds in synthesizing the subject in diverse scenes, poses, viewpoints, and lighting conditions that do not appear in the reference images, while preserving the subject's key features.
For today's paper review, 김준철 from the image processing team kindly prepared a detailed review. Thank you in advance for your interest!
https://youtu.be/jq85UXiJEXk
This talk covers Natural Language Processing (NLP), a subfield of AI/machine learning. In computer vision and image processing, understanding and handling linguistic context plays an important role: a variety of algorithms process images and video by mapping them to language, and this is one of the hot areas at conferences such as CVPR and ICCV. Representative tasks include image/video captioning, description, and visual Q&A.
Among these, the talk introduces Word2Vec, the core background. Word2Vec is closely related not only to language processing but also to generative models, and can be regarded as a core building block across all of NLP.
(Papers Review) CNN for Sentence Classification - MYEONGGYU LEE
review date: 2017/10/10 (by Meyong-Gyu.LEE @Soongsil Univ.)
Korean review of 'Convolutional Neural Networks for Sentence Classification'(EMNLP2014) and 'A Syllable-based Technique for Word Embeddings of Korean Words'(HCLT 2017)
1. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
IEEE Transactions on Pattern Analysis and Machine Intelligence (2016)
Oriol Vinyals, Alexander Toshev, Dumitru Erhan and Samy Bengio (Google Inc.)
Natural Language Processing Lab
Presenter: Heo Gwang-ho
Show and Tell: A Neural Image Caption Generator (CVPR 2015, Google Inc., 830 citations)
Oriol Vinyals
Alexander Toshev Dumitru Erhan Samy Bengio
2. Image Captioning
• Given an input image, describe its content using properly formed natural language such as English.
• This task is significantly harder than image classification or object recognition.
• 1) A description must capture not only the objects, (object recognition)
• 2) it must also express how these objects relate to each other, (relations between objects)
• 3) as well as their attributes (object attributes: color, shape)
• 4) and the activities they are involved in. (activities of objects)
[Figure: an input image mapped to a natural-language output]
The above semantic knowledge has to be expressed in a natural language like English.
3. Neural Image Caption (NIC Model)
• Present a single joint model
• Takes an image I as input and is trained to output the sentence S = {S1, S2, ... Sn} that maximizes the likelihood p(S|I).
• Inspiration comes from machine translation
• Before DNNs, machine translation was solved by splitting it into several independent tasks
• (e.g. translating words individually, aligning words, reordering, etc.)
• Recently, RNN-based machine translation reached state-of-the-art performance with a much simpler approach.
[Figure: RNN machine translation (S → RNN "encoder" → fixed-length vector → RNN "decoder" → T) versus the NIC model, where a CNN encoder replaces the RNN encoder]
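Written out, the training objective sketched above is the paper's log-likelihood formulation (reconstructed here in standard notation):

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
```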
4. Related Works
• Detect triplets of relations between objects and generate text from templates. (Farhadi et al. 2010)
• Replace the triplets with more complex graphs, still with template-based text generation. (Kulkani et al. 2011)
• Replace template-based generation with language models based on language parsing. (5 papers, 2010~2013)
• Co-embedding of images and text in the same vector space. (5 papers, 2013~2015)
• Or even image crops and sub-sentences.
• The paper's main claimed distinction:
• the visual input is connected directly to the RNN model, so the RNN can keep track of the objects mentioned in the text.
• A paper that analyzes these related methods in depth: Devlin et al. (ACL 2015)
Heavily hand-designed and rigid.
Do not attempt to generate novel descriptions.
5. Model Architecture (1)
• Generate descriptions from images in an "end-to-end" fashion.
• RNN model → h_{t+1} = f(h_t, x_t)
• 1) How should the non-linear function f be chosen?
• Use a Long Short-Term Memory (LSTM) network
• 2) How should the image and the words be represented as the inputs x_t?
• Images: a CNN (the winner of the ILSVRC 2014 competition)
• Words: word embeddings (word2vec)
Very similar to a standard language model:
p(S|I) can be modeled with an RNN.
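A minimal sketch of how the decoder inputs are ordered under this design: the CNN image embedding is fed once at t = -1, and only word embeddings follow. All names, shapes, and the random stand-in encoder below are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                                   # embedding/hidden size (assumed)
V = 10000                                 # vocabulary size (assumed)
W_embed = rng.normal(0, 0.01, (V, D))     # word embedding matrix (word2vec-style)

def cnn_encode(image):
    """Stand-in for the GoogLeNet encoder: maps an image to a D-dim vector."""
    return rng.normal(0, 0.01, D)         # placeholder feature vector

def decoder_inputs(image, caption_ids):
    """Yield x_{-1} = CNN(I), then x_t = W_e S_t for each caption token."""
    yield cnn_encode(image)               # t = -1: image embedding, used once
    for token in caption_ids:             # t = 0..N-1: word embeddings
        yield W_embed[token]

# Usage: feed the vectors into the LSTM one step at a time.
for x_t in decoder_inputs(image=None, caption_ids=[3, 17, 42]):
    pass                                  # state = lstm_step(x_t, *state, W)
```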
6. Model Architecture (2)
• LSTM-based Sentence Generator
* The hidden state (memory) is denoted m_t.
The gate equations (reconstructed from the paper, in the slide's notation):
i_t = σ(W_ix x_t + W_im m_{t-1})   (input gate)
f_t = σ(W_fx x_t + W_fm m_{t-1})   (forget gate)
o_t = σ(W_ox x_t + W_om m_{t-1})   (output gate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ h(W_cx x_t + W_cm m_{t-1})   (input and forget gates applied)
m_t = o_t ⊙ c_t   (output gate applied)
p_{t+1} = Softmax(m_t)
m_{t-1} enters every gate through the recurrent connections.
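A NumPy sketch of one LSTM step in the same notation (m_t for the hidden output, c_t for the cell state); the weight-dictionary layout is an assumption for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, W):
    """One LSTM step: every gate sees the input x_t and the recurrent m_{t-1}."""
    i = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)   # input gate
    f = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)   # forget gate
    o = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)   # output gate
    g = np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)   # candidate cell update
    c_t = f * c_prev + i * g                        # apply input & forget gates
    m_t = o * c_t                                   # apply output gate (paper's form)
    return m_t, c_t
```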
7. Model Architecture (3)
• Training
• Loss function: the sum of the negative log likelihoods of the correct word at each step, L(I, S) = -Σ_t log p_t(S_t), minimized over the parameters of the LSTM, the image encoder, and the word embeddings.
• Inference
• Sampling
• Beam Search
• Iteratively consider the k best sentences up to time t as candidates to generate sentences of size t+1.
Why is the image representation used only at t = -1? (The paper reports that feeding the image at every time step yields inferior results, since the network can exploit noise in the image and overfits more easily.)
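A hedged sketch of the beam search just described, keeping the k best partial sentences at each step. `log_prob_next` is a hypothetical callback returning (token, log-probability) pairs for the next word given a partial caption:

```python
import heapq

def beam_search(log_prob_next, bos, eos, beam_size=3, max_len=20):
    """Return the highest-scoring token sequence under the model."""
    beams = [(0.0, [bos])]                        # (cumulative log prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:                    # finished captions carry over
                candidates.append((score, seq))
                continue
            for tok, lp in log_prob_next(seq):    # expand each beam
                candidates.append((score + lp, seq + [tok]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]
```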
8. Experiments (1)
• Evaluation Metrics
• Human raters (Amazon Mechanical Turk)
• BLEU score
• CIDEr (R. Vedantam 2015), introduced by the MS COCO challenge organizers
• METEOR (S. Banerjee 2005)
• ROUGE (C. Y. Lin 2004)
• Correlation between the automatic metrics and human rankings
9. Experiments (2)
• Datasets
PASCAL (A. Farhadi 2010)
Flickr8k (C. Rashtchian 2010)
Flickr30k (P. Young 2014)
MSCOCO (T. Y. Lin 2014)
SBU (V. Ordonez 2011)
10. Experiments (3)
• Scores on the MSCOCO dev set
• BLEU-1 scores
NIC (Google's first version, 2015)
NICv2 (the current version, 2016)
* The differences between NIC and NICv2 are discussed later.
11. Experiments (4)
• Human evaluation results
• (x-axis: score, y-axis: cumulative fraction with scores > x)
• 1 point: Unrelated to the image
• 2 points: Somewhat related to the image
• 3 points: Describes with minor errors
• 4 points: Describes without errors
[Example that received 4 points]
12. Generation Diversity
• Questions about generation diversity
• 1) Does the model generate novel captions?
• 2) Are the generated captions diverse and of high quality?
• Findings from the experiments
• When only the single best sentence is considered, 80% overlap with training-set captions.
• When the top 15 sentences are considered, about 50% are novel descriptions.
13. Improvements Over CVPR15 Model (1)
• Newly applied techniques and the resulting BLEU-4 gains
• Image model improvement (+2% BLEU-4)
• At the time, the GoogLeNet model was used (22 layers, winner of the 2014 ImageNet competition).
• Later, applying the "Batch Normalization" technique
• reduced the top-5 error on the ImageNet task from 6.67% to 4.8% (a 2% improvement).
• Batch Normalization (S. Ioffe 2015)
• Normalize each layer of a neural network with respect to the current batch of examples.
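A minimal sketch of the batch-normalization idea described above (training-time statistics only; gamma and beta are the learned scale and shift, and the names are assumed):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize activations x of shape (batch, features) over the current batch."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # restore representational capacity
```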
14. Improvements Over CVPR15 Model (2)
• Image model fine-tuning (+1% BLEU-4)
• Previously, the CNN parameters learned on ImageNet were kept frozen (for generalization)
• and only the LSTM parameters were trained on the MS COCO data.
• In NICv2, the CNN is fine-tuned on the MS COCO data.
• Observations made while fine-tuning:
• Train the LSTM parameters to a reasonably stable point first, then fine-tune the CNN.
• Why? → The LSTM's initial gradients corrupt the pre-trained CNN parameters.
• 1) Train for 500K steps with the CNN parameters frozen,
• 2) then jointly train the CNN and LSTM for 100K steps (see the sketch below).
• Each step took about 3 seconds, roughly 3 weeks in total (single GPU, Nvidia K20).
• Training the CNN and LSTM in parallel from the start converges faster but ends with lower performance.
• After fine-tuning, the model captures color features better, producing sentences like "A blue and yellow train...".
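A sketch of the two-phase schedule above, under assumed names (parameter objects with a `requires_grad` flag, a `step_fn` that runs one optimization step); this illustrates the freeze-then-joint idea, not the authors' actual training code:

```python
def set_trainable(params, flag):
    for p in params:
        p.requires_grad = flag                  # assumed parameter attribute

def train(model, batches, step_fn, frozen_steps=500_000, joint_steps=100_000):
    set_trainable(model.cnn_params, False)      # phase 1: LSTM only, CNN frozen
    for _ in range(frozen_steps):
        step_fn(model, next(batches))
    set_trainable(model.cnn_params, True)       # phase 2: joint CNN + LSTM fine-tuning
    for _ in range(joint_steps):
        step_fn(model, next(batches))
```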
15. Improvements Over CVPR15 Model (3)
• Scheduled Sampling (+1.5% BLEU-4) (S. Bengio NIPS 2015)
• There is a mismatch between how the LSTM model is trained and how it runs at inference:
• p(S_t | S_1, S_2, ..., S_{t-1})
• During training, the previous words (S_1, S_2, ..., S_{t-1}) conditioned on when learning S_t are all ground truth.
• At inference, however, the model can only condition on the words it has generated so far.
• A curriculum learning strategy is introduced to resolve this:
• instead of the fully guided scheme that always conditions on ground-truth previous words,
• training gradually shifts to a less guided scheme that conditions on generated previous words.
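A hedged sketch of scheduled sampling: with probability p_truth the decoder is fed the ground-truth previous token, otherwise its own sample, and p_truth decays over training (a linear decay is assumed here; the schedule shape is a choice, not specified on the slide):

```python
import random

def next_input_token(gold_prev, sampled_prev, p_truth):
    """Choose the previous-word token fed to the decoder at this step."""
    return gold_prev if random.random() < p_truth else sampled_prev

def p_truth_schedule(step, total_steps, floor=0.5):
    """Linearly decay from fully guided (1.0) toward a less guided floor."""
    frac = min(step / total_steps, 1.0)
    return 1.0 - (1.0 - floor) * frac
```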
16. Improvements Over CVPR15 Model (4)
• Ensemble (+1.5% BLEU)
• 5 models trained with scheduled sampling.
• 10 models trained with a fine-tuned image model.
• Beam Size Reduction (+2% BLEU)
• The beam size was reduced from 20 to 3.
• In general a larger beam size scores higher, but the outputs were found to overfit the training set.
• The overlap with training captions dropped from 80% to 60% (i.e., novel captions rose from 20% to 40%).
18. Future Works
• "Despite the exciting results on captioning, we believe it is just the beginning."
• Build a system capable of more targeted descriptions:
• anchoring the descriptions to given image properties and locations.
• being a response to a user-specified question or task.
• Further research directions
• better evaluation metrics
• evaluation through higher-level goals (e.g. applications such as robotics)
TensorFlow source code: https://github.com/tensorflow/models/tree/master/im2txt