The document discusses several image enhancement techniques:
1. WCT2, which uses wavelet transforms for photorealistic style transfer, achieving faster and lighter models than previous techniques.
2. CutBlur, a new data augmentation method that improves performance on super-resolution and other low-level vision tasks by cutting a low-resolution patch and pasting it into the corresponding high-resolution region (and vice versa).
3. SimUSR, a simple but strong baseline for unsupervised super-resolution that achieves state-of-the-art results using only low-resolution images during training.
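The cut-and-paste idea behind CutBlur can be sketched in a few lines. The snippet below is a minimal illustration, assuming the low-resolution image has already been upsampled to the high-resolution size; the function and parameter names are ours, not from the paper's code:

```python
import numpy as np

def cutblur(hr, lr_up, ratio=0.5, rng=None):
    """CutBlur-style augmentation (sketch): paste a low-resolution patch
    into the high-resolution image. `lr_up` is the LR image already
    upsampled to the HR size; `ratio` sets the patch side length."""
    rng = rng or np.random.default_rng()
    h, w = hr.shape[:2]
    ch, cw = int(h * ratio), int(w * ratio)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    out = hr.copy()
    out[y:y+ch, x:x+cw] = lr_up[y:y+ch, x:x+cw]  # HR image with an LR patch inside
    return out
```

The reverse direction (an HR patch inside the LR-upsampled image) is obtained by swapping the two arguments.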
A beginner's guide to Style Transfer and recent trends - JaeJun Yoo
Style transfer techniques have evolved from matching gram matrices to using neural networks. Early methods matched gram statistics of CNN features to transfer texture styles. Recent work uses adaptive instance normalization and feed-forward networks. WCT2 achieves photorealistic transfer using wavelet transforms that satisfy the perfect reconstruction condition, enabling high resolution stylization and temporal consistency in videos without post-processing.
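The adaptive instance normalization step mentioned above can be sketched compactly: it re-normalizes each channel of the content features to the style features' per-channel statistics. A minimal NumPy illustration (names are ours, not from any particular implementation):

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization (sketch): align the per-channel
    mean/std of the content features to those of the style features.
    Both inputs are (C, H, W) feature maps from some encoder."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    # normalize content statistics, then re-scale to style statistics
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

A decoder then maps the re-statisticized features back to image space; WCT2 additionally replaces pooling/unpooling with wavelet transforms so that this round trip is lossless.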
Super resolution in deep learning era - JaeJun Yoo
1) The document discusses super-resolution techniques in deep learning, including inverse problems, image restoration problems, and different deep learning models.
2) Early models like SRCNN used convolutional networks for super-resolution but were shallow, while later models incorporated residual learning (VDSR), recursive learning (DRCN), and became very deep and dense (SRResNet).
3) Key developments included EDSR which provided a strong backbone model and GAN-based approaches like SRGAN which aimed to generate more realistic textures but require new evaluation metrics.
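The residual-learning idea behind VDSR can be illustrated in a few lines: the network is trained to predict only the difference between the interpolated input and the ground truth. A toy sketch, with nearest-neighbor upscaling standing in for the bicubic interpolation actually used:

```python
import numpy as np

def upscale_nn(lr, scale):
    """Stand-in for bicubic interpolation: nearest-neighbor upscaling."""
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

def residual_target(hr, lr, scale):
    """VDSR-style training target: the network regresses hr - upscale(lr),
    which is mostly near zero and easier for deep models to learn than hr."""
    return hr - upscale_nn(lr, scale)

def reconstruct(lr, predicted_residual, scale):
    """At test time, add the predicted residual back onto the upscaled input."""
    return upscale_nn(lr, scale) + predicted_residual
```

With a perfect residual prediction the reconstruction recovers the HR image exactly, which is what makes the residual a well-posed regression target.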
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Benchmarks - Jinwon Lee
This is the 258th paper review for the TensorFlow Korea PR12 reading group.
The paper is From ImageNet to Image Classification: Contextualizing Progress on Benchmarks, from MIT.
Anyone working in deep learning knows ImageNet; this paper discusses the limitations and problems of ImageNet's labeling process and points out that top-1-accuracy-based evaluation can also be problematic.
More than 20% of ImageNet images contain multiple objects, yet only one of them is accepted as the correct answer, and due to limitations of the annotation process many images are labeled with a class different from what a person would actually choose. There are also many labels that are hard to judge without expert knowledge, such as the more than 20 terrier breeds. Through a variety of experiments, the paper combines quantitative analysis with human-in-the-loop evaluation to assess how far current models have actually come, and what data-labeling challenges must be solved to push performance further. The paper is fairly long but light on technical content, so it reads easily; if you want the details, please see the video!
Paper link: https://arxiv.org/abs/2005.11295
Video link: https://youtu.be/CPMgX5ikL_8
This is the 243rd paper review for the TensorFlow Korea PR12 reading group.
The paper is Designing Network Design Spaces from Facebook AI Research, better known as RegNet.
When designing a CNN, are bottleneck layers really beneficial? Do more layers always mean higher accuracy? When the activation map's width and height are halved (stride 2 or pooling), the channels are conventionally doubled, but is that really the best choice? Might a network without bottleneck layers do better, is there a magic number of layers that maximizes performance, and might tripling the channels beat doubling them when the activations shrink by half?
Rather than designing a single good neural network, this paper is about designing a good design space: a space in which techniques like AutoML can find good networks, i.e., a space where good networks live. It proposes a human-in-the-loop procedure that progressively narrows an almost unconstrained design space down to a good one. The video below shows which design space produced RegNet, which outperforms EfficientNet, and which of the design choices we took for granted turn out to be wrong.
Video link: https://youtu.be/bnbKQRae_u4
Paper link: https://arxiv.org/abs/2003.13678
Deep learning for image super resolution - Prudhvi Raj
Using deep convolutional networks, the machine can learn an end-to-end mapping between low- and high-resolution images. Unlike traditional methods, this approach jointly optimizes all layers. A lightweight CNN structure is used, which is simple to implement and offers a favorable trade-off compared to existing methods.
Explores the type of structure learned by Convolutional Neural Networks, the applications where they're most valuable and a number of appropriate mental models for understanding deep learning.
Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15750-15758).
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector - Jinwon Lee
This is the 270th paper review for the TensorFlow Korea PR12 reading group.
The paper is PP-YOLO: An Effective and Efficient Implementation of Object Detector, from Baidu. By applying a range of techniques to YOLOv3, it catches both rabbits at once: very high accuracy and very fast speed. The review takes a deeper look at the various tricks used in the paper. If you are interested in object detection techniques such as deformable convolution, exponential moving average, DropBlock, IoU-aware prediction, grid sensitivity elimination, Matrix NMS, or CoordConv, the video and slides are worth a look!
Paper link: https://arxiv.org/abs/2007.12099
Video link: https://youtu.be/7v34cCE5H4k
This paper targets accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a key capability for technologies such as HCI and AR. Many researchers have proposed methods to improve accuracy, but the similar appearance of fingers, occlusion, and the complexity of diverse finger motions have limited progress. To overcome the limitations of prior work, this paper changes both the input and output representations that existing methods use: unlike most prior methods, which take a 2D depth image and directly regress the 3D coordinates of the hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps, using an encoder-decoder 3D CNN. Thanks to these changed representations, the model achieves the highest accuracy on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset, and won the HANDS 2017 challenge held at ICCV 2017.
We present a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with intra-class appearance and shape variations that commonly exist among different instances within the same object category,
we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner so that the smoothness constraint is naturally imposed within deep networks.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitations of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervisions by leveraging a correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing infinite continuous affine transformation fields.
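The level-by-level composition of residual affine transformations can be written out in homogeneous coordinates; the sketch below shows only the composition step, not the regression networks themselves, and the helper names are ours:

```python
import numpy as np

def to_homogeneous(A, t):
    """Pack a 2x2 linear part and a 2-vector translation into a 3x3 matrix."""
    M = np.eye(3)
    M[:2, :2] = A
    M[:2, 2] = t
    return M

def compose_affines(transforms):
    """Coarse-to-fine composition (sketch): each pyramid level contributes
    a residual affine (A, t); the final transform is their product,
    applied coarsest-first."""
    M = np.eye(3)
    for A, t in transforms:
        M = to_homogeneous(A, t) @ M
    return M
```

Because each level only refines the previous estimate, the smoothness of the coarse levels carries over to the fine ones, which is the pyramidal regularization described above.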
PR-231: A Simple Framework for Contrastive Learning of Visual Representations - Jinwon Lee
The document presents SimCLR, a framework for contrastive learning of visual representations using simple data augmentation. Key aspects of SimCLR include using random cropping and color distortions to generate positive sample pairs for the contrastive loss, a nonlinear projection head to learn representations, and large batch sizes. Evaluation shows SimCLR learns representations that outperform supervised pretraining on downstream tasks and achieves state-of-the-art results with only view augmentation and contrastive loss.
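The contrastive (NT-Xent) loss at the heart of SimCLR is compact enough to sketch in NumPy; this is an illustrative reimplementation of the loss only, not the authors' code:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss (sketch). z1[i] and z2[i] are embeddings of the two
    augmented views of example i; every other sample in the batch serves
    as a negative. Lower loss = positive pairs are more similar than negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # a sample is not its own pair
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), targets]))
```

Perfectly aligned positive pairs minimize the loss; anti-aligned pairs maximize it, which is what drives the two views of the same image together in embedding space.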
Self-supervised learning uses unlabeled data to learn visual representations through pretext tasks like predicting relative patch location, solving jigsaw puzzles, or image rotation. These tasks require semantic understanding to solve but only use unlabeled data. The features learned through pretraining on pretext tasks can then be transferred to downstream tasks like image classification and object detection, often outperforming supervised pretraining. Several papers introduce different pretext tasks and evaluate feature transfer on datasets like ImageNet and PASCAL VOC. Recent work combines multiple pretext tasks and shows improved generalization across tasks and datasets.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state of the art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Emerging Properties in Self-Supervised Vision Transformers - Sungchul Kim
The document summarizes the DINO self-supervised learning approach for vision transformers. DINO uses a teacher-student framework where the teacher's predictions are used to supervise the student through knowledge distillation. Two global and several local views of an image are passed through the student, while only global views are passed through the teacher. The student is trained to match the teacher's predictions for local views. DINO achieves state-of-the-art results on ImageNet with linear evaluation and transfers well to downstream tasks. It also enables vision transformers to discover object boundaries and semantic layouts.
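The teacher in DINO is not trained by gradients; its weights track the student's via an exponential moving average. A one-line sketch of that update (parameter shapes and the momentum value here are illustrative):

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """DINO-style teacher update (sketch): each teacher parameter is an
    exponential moving average of the corresponding student parameter.
    No gradient ever flows into the teacher."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Because the teacher averages many recent versions of the student, its predictions are more stable, which is what makes it a usable supervision signal for the student.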
Attentive semantic alignment with offset aware correlation kernels - NAVER Engineering
Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class. One recent approach to this problem is to estimate parameters of a global transformation model that densely aligns one image to the other. Since an entire correlation map between all feature pairs across images is typically used to predict such a global transformation, noisy features from different backgrounds, clutter, and occlusion distract the predictor from correct estimation of the alignment. This is a challenging issue, in particular, in the problem of semantic correspondence, where a large degree of image variation is often involved. In this paper, we introduce an attentive semantic alignment method that focuses on reliable correlations, filtering out distractors. For effective attention, we also propose an offset-aware correlation kernel that learns to capture translation-invariant local transformations in computing correlation values over spatial locations. Experiments demonstrate the effectiveness of the attentive model and offset-aware kernel, and the proposed model combining both techniques achieves state-of-the-art performance.
Seed net: automatic seed generation with deep reinforcement learning for robust interactive segmentation - NAVER Engineering
This paper proposes a seed generation technique using deep reinforcement learning to solve the interactive segmentation problem. One of the central issues in interactive segmentation is minimizing user intervention; the proposed system generates artificial seeds on the user's behalf, so the user only needs to provide the initial seed information. Because the ambiguity in defining an optimal seed point makes supervised training difficult, the authors overcome this with reinforcement learning: they define an MDP suited to the seed generation problem and successfully train a deep Q-network. Trained on the MSRA10K dataset, the system shows superior performance compared to the inaccurate initial results of existing segmentation algorithms.
In this presentation we discuss the convolution operation, the architecture of a convolutional neural network, and the different layers such as pooling. This presentation draws heavily from A. Karpathy's Stanford course CS 231n.
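The convolution operation the presentation covers reduces to sliding a kernel over the image and summing elementwise products (CNN "convolution" is technically cross-correlation, since the kernel is not flipped). A minimal valid-mode sketch:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid-mode 2D convolution (sketch; cross-correlation, as in CNNs).
    Slides the kernel over the image and sums elementwise products."""
    kh, kw = kernel.shape
    h = (image.shape[0] - kh) // stride + 1
    w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out
```

Real frameworks add padding, multiple input/output channels, and vectorized implementations, but the arithmetic per output element is exactly this.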
The presentation covers convolutional neural network (CNN) design. First, the main building blocks of CNNs are introduced. Then we systematically investigate the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following architectural choices: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolution, fully-connected, SPP), and image pre-processing, as well as learning parameters: learning rate, batch size, cleanliness of the data, etc.
CPlaNet: Enhancing Image Geolocalization by Combinatorial Partitioning of Maps - NAVER Engineering
Image geolocalization is the task of identifying the location depicted in a photo based only on its visual information. This task is inherently challenging since many photos have only few, possibly ambiguous cues to their geolocation. Recent work has cast this task as a classification problem by partitioning the earth into a set of discrete cells that correspond to geographic regions. The granularity of this partitioning presents a critical trade-off; using fewer but larger cells results in lower location accuracy while using more but smaller cells reduces the number of training examples per class and increases model size, making the model prone to overfitting. To tackle this issue, we propose a simple but effective algorithm, combinatorial partitioning, which generates a large number of fine-grained output classes by intersecting multiple coarse-grained partitionings of the earth. Each classifier votes for the fine-grained classes that overlap with their respective coarse-grained ones. This technique allows us to predict locations at a fine scale while maintaining sufficient training examples per class. Our algorithm achieves the state-of-the-art performance in location recognition on multiple benchmark datasets.
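The voting scheme can be sketched as follows: each coarse partitioning contributes its cell's score to every candidate fine region contained in that cell, and the region with the highest total wins. A toy example with illustrative names, not CPlaNet's actual implementation:

```python
import numpy as np

def vote(coarse_scores, memberships):
    """Combinatorial-partitioning vote (sketch).
    coarse_scores: one score vector per coarse partitioning (classifier output).
    memberships: for each partitioning, an array mapping every candidate
    fine region (a cell intersection) to its coarse cell index.
    Each partitioning votes for the fine regions overlapping its cells;
    votes are summed and the best fine region is returned."""
    total = np.zeros(len(memberships[0]))
    for scores, member in zip(coarse_scores, memberships):
        total += scores[member]
    return int(np.argmax(total)), total
```

The fine region that two partitionings jointly single out can win even though neither coarse classifier, on its own, distinguishes it from its neighbors.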
This document discusses domain transfer and domain adaptation in deep learning. It begins with introductions to domain transfer, which learns a mapping between domains, and domain adaptation, which learns a mapping between domains with labels. It then covers several approaches for domain transfer, including neural style transfer, instance normalization, and GAN-based methods. It also discusses general approaches for domain adaptation such as source/target feature matching and target data augmentation.
The document describes a vehicle detection system using a fully convolutional regression network (FCRN). The FCRN is trained on patches from aerial images to predict a density map indicating vehicle locations. The proposed system is evaluated on two public datasets and achieves higher precision and recall than comparative shallow and deep learning methods for vehicle detection in aerial images. The system could help with applications like urban planning and traffic management.
This document discusses real-time image processing. It begins with an introduction and definitions of real-time and non-real-time processing. It then discusses the requirements for a real-time image processing platform, including high resolution/frame rate video input and low latency. The document outlines some advantages of real-time image processing such as immediate results and automation. It then provides an overview of an object detection system using Viola-Jones detection with integral images, AdaBoost learning, and a cascade classifier structure. Experimental results show the cascade classifier can detect faces in real-time.
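The integral image (summed-area table) used by Viola-Jones is what makes the Haar-feature evaluation fast: after one pass over the image, any rectangular sum costs four lookups. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) using four table lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```

A Haar-like feature is then just a signed combination of two or three such box sums, which is why the detector can evaluate thousands of features per window in real time.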
This document describes research on 3D reconstruction of solder balls on printed circuit boards (PCBs). 360 X-ray images of a PCB were taken every 2.81 degrees and reconstructed using the simultaneous algebraic reconstruction technique (SART) and iterative algorithms to generate a 3D model. Unity software was used to build a 3D visualization with zoom and rotation capabilities. Google Cardboard VR was used to create a mobile application to view the 3D model. The reconstruction aims to detect defects in solder balls without damaging the PCBs.
Realtime pothole detection system using improved CNN Modelsnithinsai2992
The document summarizes work on a real-time pothole detection system using improved CNN models. It discusses using the YOLOv5 model for pothole detection and training YOLOv5m6, YOLOv5s6, and YOLOv5n6 models on a dataset, achieving mAP scores of 80.8%, 82.2%, and 82.5% respectively. It also proposes further improving the system through techniques like better image processing during nighttime and enhancing detection of distant objects.
BIG DATA-DRIVEN FAST REDUCING THE VISUAL BLOCK ARTIFACTS OF DCT COMPRESSED IM...IJDKP
1) The document proposes a new simple method to reduce visual block artifacts in images compressed using DCT (used in JPEG) for urban surveillance systems.
2) The method smooths only the connection edges between adjacent blocks while keeping other image areas unchanged.
3) Simulation results show the proposed method achieves better image quality as measured by PSNR compared to median and wiener filters, while using significantly less computational resources.
An improved image compression algorithm based on daubechies wavelets with ar...Alexander Decker
This document summarizes an academic article that proposes a new image compression algorithm using Daubechies wavelets and arithmetic coding. It first discusses existing image compression techniques and their limitations. It then describes the proposed algorithm, which applies Daubechies wavelet transform followed by 2D Walsh wavelet transform on image blocks and arithmetic coding. Results show the proposed method achieves higher compression ratios and PSNR values than existing algorithms like EZW and SPIHT. Future work aims to improve results by exploring different wavelets and compression techniques.
This document provides an overview of image processing algorithms for real-time embedded systems. It discusses objectives like image enhancement, restoration, feature extraction and compression. Technologies applied include the TMS320C6713 DSP, Code Composer Studio, MATLAB and OpenCV. Image enhancement algorithms covered are contrast stretching, window-level slicing, and histogram equalization. Image restoration techniques include low pass, high pass and rank order filtering. Feature extraction methods include edge detection and image segmentation. Wavelet-based techniques are also discussed for edge detection and denoising. Implementation challenges for real-time embedded systems are addressed.
The slides for the techniques used in the Temporal Segment Network (TSN), including the basic ideas, recall of BN-Inception, optical flow and tricks in application. Used in group paper reading in University of Sydney.
The document introduces various computer vision topics including convolutional neural networks, popular CNN architectures, data augmentation, transfer learning, object detection, neural style transfer, generative adversarial networks, and variational autoencoders. It provides overviews of each topic and discusses concepts such as how convolutions work, common CNN architectures like ResNet and VGG, why data augmentation is important, how transfer learning can utilize pre-trained models, how object detection algorithms like YOLO work, the content and style losses used in neural style transfer, how GANs use generators and discriminators, and how VAEs describe images with probability distributions. The document aims to discuss these topics at a practical level and provide insights through examples.
This document discusses objective video quality measurement based on the human visual system. It introduces various deblocking algorithms used to improve the quality of reconstructed video by reducing blocking artifacts. It also discusses limitations of traditional PSNR metrics and proposes a no-reference quality assessment method. The proposed method considers aspects of the human visual system like masking effects and uses algorithms in the DCT domain and post-processing to evaluate video quality in a way that correlates better with subjective human perception. Experimental results on distorted video sets demonstrate the effectiveness of the proposed no-reference quality measurement approach.
Real Time Sign Language Recognition Using Deep LearningIRJET Journal
The document describes a study that used the YOLOv5 deep learning model to perform real-time sign language recognition. The researchers trained and tested the model on the Roboflow dataset along with additional images. They achieved 88.4% accuracy, 76.6% precision, and 81.2% recall. For comparison, they also trained a CNN model which achieved lower accuracy of 52.98%. The YOLOv5 model was able to detect signs in complex environments and perform accurate real-time detection, demonstrating its advantages over CNN for this task.
This document proposes a region-based object tracking method that uses both global-viewed and local-viewed trackers with Adaboost-based feature selection. The global-viewed tracker uses seed features and Adaboost to track objects at the pixel level. The local-viewed tracker regionalizes the image using k-means clustering, then applies seed features and Adaboost within each region to provide compensation. A manual refinement tool and confidence measurement are used to combine the trackers' results. Experimental results on a test video sequence demonstrate the method can track multiple non-rigid objects.
This document proposes and evaluates methods for video-to-video translation using CycleGAN. It begins with a baseline method that applies CycleGAN to each video frame independently, resulting in inconsistent translations between frames. An improved method adds a flow-guided loss term to CycleGAN that considers optical flow between frames, producing more temporally coherent translations. Evaluation shows the flow-guided method generates higher-quality translations that better preserve details and consistency across frames when translating videos between day and night domains. Further optimizations to the model are suggested to improve results.
DEEP NEURAL NETWORKS APPLIED TO LOW POWER ONBOARD IMAGE COMPRESSION
Over the past decade, rapid developments in digital technologies and access to space have enabled unprecedented capabilities of monitoring our planet and, more generally, our Universe.
This new space race is pushing for a paradigm shift in order to respond to the ever-increasing challenge of delivering useful information to end users. With a huge number of satellites, greater spatial and spectral resolutions, higher temporal cadence, and shrinking spectrum resources, on-board data reduction becomes not only a cost-saving solution but, in many cases, also a key enabling technology to achieve viable missions.
https://atpi.eventsair.com/obpdc2022/
The document discusses Arabic optical character recognition (AOCR). It introduces AOCR and its challenges. It then describes the preprocessing steps of image rotation, segmentation, and enhancement. It explains the feature extraction process and the features selected. It details the implementation of an AOCR system using Hidden Markov Models in HTK, including data preparation, model creation, and recognition. It presents experimental results on isolated character recognition with variations in font, size, and length. Recognition accuracy was highest using vertical histograms and modeling each character.
Build Your Own 3D Scanner: 3D Scanning with Structured Lighting — Douglas Lanman
http://mesh.brown.edu/byo3d/
SIGGRAPH 2009 Courses
Douglas Lanman and Gabriel Taubin
This course provides a beginner with the necessary mathematics, software, and practical details to leverage projector-camera systems in their own 3D scanning projects. An example-driven approach is used throughout; each new concept is illustrated using a practical scanner implemented with off-the-shelf parts. The course concludes by detailing how these new approaches are used in rapid prototyping, entertainment, cultural heritage, and web-based applications.
Questions Log: Dynamic Cubes – Set to Retire Transformer?Senturus
This document contains a questions log from a webinar about optimizing Cognos performance. It includes questions from webinar attendees about topics like using virtual cubes and dynamic cubes to address large data volumes, optimizing in-memory aggregates, hardware sizing requirements for dynamic cubes, and configuration considerations when using dynamic cubes. The questions are answered in detail to help attendees understand how to best implement and optimize dynamic cubes in Cognos.
This document discusses using fully convolutional neural networks for defect inspection. It begins with an agenda that outlines image segmentation using FCNs and defect inspection. It then provides details on data preparation including labeling guidelines, data augmentation, and model setup using techniques like deconvolution layers and the U-Net architecture. Metrics for evaluating the model like Dice score and IoU are also covered. The document concludes with best practices for successful deep learning projects focusing on aspects like having a large reusable dataset, feasibility of the problem, potential payoff, and fault tolerance.
DALL-E is a large AI model that can generate images from text descriptions. It was trained on a dataset of text-image pairs using a two-stage process: 1) A discrete variational autoencoder (dVAE) learned a visual codebook to represent images as discrete latent codes, and 2) A Transformer model learned the joint distribution between text captions and latent image codes to generate new images. The model achieved impressive zero-shot image generation capabilities, generalizing to new concepts and combining ideas in novel ways, as demonstrated through both quantitative and qualitative evaluation.
Similar to [CVPR2020] Simple but effective image enhancement techniques (20)
This paper proposes AmbientGAN, which trains a generative adversarial network using partial or noisy observations rather than fully observed samples. AmbientGAN trains the discriminator on the measurement domain rather than the raw data domain, allowing the generator to be trained without needing large amounts of good training data. The paper proves it is theoretically possible to recover the original data distribution even when the measurement process is not invertible. It presents experimental results showing AmbientGAN can generate high quality samples and recover the underlying data distribution from various types of lossy and noisy measurements.
[PR12] categorical reparameterization with gumbel softmaxJaeJun Yoo
(Korean) Introduction to (paper1) Categorical Reparameterization with Gumbel Softmax and (paper2) The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
Video: https://youtu.be/ty3SciyoIyk
Paper1: https://arxiv.org/abs/1611.01144
Paper2: https://arxiv.org/abs/1611.00712
[PR12] understanding deep learning requires rethinking generalizationJaeJun Yoo
The document discusses a paper that argues traditional theories of generalization may not fully explain why large neural networks generalize well in practice. It summarizes the paper's key points:
1) The paper shows neural networks can easily fit random labels, calling into question traditional measures of complexity.
2) Regularization helps but is not the fundamental reason for generalization. Neural networks have sufficient capacity to memorize data.
3) Implicit biases in algorithms like SGD may better explain generalization by driving solutions toward minimum norm.
4) The paper suggests rethinking generalization as the effective capacity of neural networks may differ from theoretical measures. Understanding finite sample expressivity is important.
The document discusses capsule networks, a type of neural network proposed by Geoff Hinton in 2017 as an alternative to convolutional neural networks (CNNs) for computer vision tasks. Capsule networks aim to address some limitations of CNNs, such as their inability to capture spatial relationships and pose information. The key concepts discussed include dynamic routing between capsules, which allows for parts-based representation, and equivariance, where capsules can learn transformation properties like position and orientation. The document provides an overview of a capsule network architecture and routing algorithm proposed in a 2017 paper by Sabour et al.
[PR12] Inception and Xception - Jaejun YooJaeJun Yoo
This document discusses Inception and Xception models for computer vision tasks. It describes the Inception architecture, which uses 1x1, 3x3 and 5x5 convolutional filters arranged in parallel to capture correlations at different scales more efficiently. It also describes the Xception model, which entirely separates cross-channel correlations and spatial correlations using depthwise separable convolutions. The document compares different approaches for reducing computational costs like pooling and strided convolutions.
Introduction to domain adversarial training of neural network.
(Kor) video : https://www.youtube.com/watch?v=n2J7giHrS-Y&t=1s
Papers: A survey on transfer learning, SJ Pan 2009 / A theory of learning from different domains, S Ben-David et al. 2010 / Domain-Adversarial Training of Neural Networks, Y Ganin 2016
Slides I refered:
http://www.di.ens.fr/~germain/talks/nips2014_dann_slides.pdf
http://john.blitzer.com/talks/icmltutorial_2010.pdf (DA theory part)
https://epat2014.sciencesconf.org/conference/epat2014/pages/slides_DA_epat_17.pdf (DA theory part)
https://www.slideshare.net/butest/ppt-3860159 (DA theory part)
A curated list of GAN variants that provided insights to the community (GANs, Improved GANs, DCGAN, Unrolled GAN, InfoGAN, f-GAN, EBGAN, WGAN).
After a short introduction to GANs, we look at the remaining difficulties of standard GANs and their interim solutions (Improved GANs). The slides then cover other approaches that tried to resolve these problems in various ways, e.g., careful architecture selection (DCGAN), a slight change to the update rule (Unrolled GAN), additional constraints (InfoGAN), generalization of the loss function to various divergences (f-GAN), a new framework based on energy-based models (EBGAN), and a further generalization of the loss function (WGAN).
[CVPR2020] Simple but effective image enhancement techniques
1. Image Enhancement
via a Simple but (very) Effective way
Style transfer your image in a “photographic” way, e.g., day2sunset. “CutBlur”: a powerful data augmentation method for various low-level vision tasks.
Jaejun Yoo AI research scientist / Clova AI
Postdoctoral researcher / EPFL
Code, generated images,
and pre-trained models
are all available at
github.com/clovaai/WCT2
Code, generated images,
and pre-trained models
are all available at
github.com/clovaai/cutblur
Leave your contact information
and feedback here. Join Clova !
2. Image enhancement
Image enhancement is the process of adjusting digital images so that the results are more suitable for
display or further image analysis.
Traditionally…
[Figure: denoising and super-resolution examples — low-resolution input vs. baseline vs. proposed]
3. Image enhancement (extended)
Image enhancement is the process of adjusting digital images so that the results are more suitable for
display or further image analysis.
Traditionally… + generate (or translate into) an authentic image
4. • WCT2: Photorealistic Style Transfer via Wavelet Transforms (ICCV’19)
Clearer (authentic) output with an 840× faster and 51% lighter (in memory) model (current SOTA)
• CutBlur: Rethinking Data Augmentation for Image Super-resolution (CVPR’20)
Current SOTA in Real-world Super-resolution (RealSR)
• SimUSR: A Simple but Strong Baseline for Unsupervised Image Super-resolution (CVPRW’20)
Ranked PSNR 1st and SSIM 2nd in unsupervised Super-resolution Competition (NTIRE)
Contents
Our goal: Solving these problems in a simple and intuitive way while also
achieving huge improvement over the previous SOTA.
5. WCT2 (ICCV’19)
: A simple correction of the network architecture using wavelets
Clova AI Research 1 Yonsei University 2
Jaejun Yoo1* Youngjung Uh1* Sanghyuk Chun1* Byeongkyu Kang1,2 Jung-Woo Ha1
6. Artistic Style Transfer
Gatys et al., CVPR ’16: using CNN representations (VGG)
9. Transfer your style in “photographic way”, e.g., day2sunset, day2night, etc.
Photorealistic Style Transfer
10. Artistic: Whitening and Coloring Transforms (WCT)
orthogonal matrix of eigenvectors
diagonal matrix with the eigenvalues of the covariance matrix
Whitening
centered content feature
- from Li et al., NIPS 2017
12. Artistic: Whitening and Coloring Transforms (WCT)
orthogonal matrix of eigenvectors
Coloring
centered style feature
diagonal matrix with the eigenvalues of the covariance matrix
- from Li et al., NIPS 2017
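The whitening and coloring equations that these slide annotations describe (the formula images were lost in this transcript) can be reconstructed from the standard WCT formulation of Li et al. (NIPS 2017); here f_c is the centered content feature and f_s the centered style feature:

```latex
% Whitening: decorrelate the centered content feature f_c, using the
% eigendecomposition of its covariance, f_c f_c^{\top} = E_c D_c E_c^{\top}:
\hat{f}_c = E_c \, D_c^{-1/2} \, E_c^{\top} f_c
% Coloring: impose the style covariance f_s f_s^{\top} = E_s D_s E_s^{\top}
% on the whitened feature:
\hat{f}_{cs} = E_s \, D_s^{1/2} \, E_s^{\top} \hat{f}_c
```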
13. Artistic: Whitening and Coloring Transforms (WCT)
[Figure: content images vs. single-level vs. multi-level stylization]
- from Li et al., NIPS 2017
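As a sanity check on the whitening and coloring steps, here is a minimal NumPy sketch (the `wct` helper and the toy features are our own, not from the slides or the official code): it whitens a content feature map and then colors it with the style statistics, so the output covariance matches the style covariance:

```python
import numpy as np

def wct(fc, fs, eps=1e-8):
    """Whitening-and-coloring transform on flattened features of shape (C, H*W)."""
    mc = fc.mean(axis=1, keepdims=True)
    ms = fs.mean(axis=1, keepdims=True)
    fc, fs = fc - mc, fs - ms
    # Whitening: decorrelate the centered content feature.
    Ec, Dc, _ = np.linalg.svd(fc @ fc.T / (fc.shape[1] - 1))
    f_white = Ec @ np.diag((Dc + eps) ** -0.5) @ Ec.T @ fc
    # Coloring: impose the style covariance on the whitened feature.
    Es, Ds, _ = np.linalg.svd(fs @ fs.T / (fs.shape[1] - 1))
    f_cs = Es @ np.diag((Ds + eps) ** 0.5) @ Es.T @ f_white
    return f_cs + ms  # re-center with the style mean

rng = np.random.default_rng(0)
fc = rng.normal(size=(4, 1024))                             # "content" feature
fs = rng.normal(size=(4, 4)) @ rng.normal(size=(4, 1024))   # correlated "style"
out = wct(fc, fs)
print(np.allclose(np.cov(out), np.cov(fs), atol=1e-6))  # True
```

In the actual method this transform is applied to VGG encoder features, and the result is fed to a decoder to produce the stylized image.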
14. PhotoWCT
WCT (artistic model): “the VGG decoder uses nearest-neighbor upsampling”
PhotoWCT (photorealistic model): “provide the decoder with the locations where the pooling operation happened (unpooling)”
- from Li et al., ECCV 2018
17. [Figure: content vs. PhotoWCT, PhotoWCT + smoothing, and ours, with runtimes in seconds — Yoo et al., ICCV 2019]
“Our new model shows better performance even without any post-processing!”
18. WCT via Wavelet Corrected Transforms (WCT2)
To force the encoder-decoder to learn a function with the following properties:
1. It should play a role similar to the original pooling (a global filter),
2. It should not lose information during the encoding and decoding process,
3. It should be able to represent the features of input images.
19. WCT via Wavelet Corrected Transforms (WCT2)
* Theoretical motivation: Perfect reconstruction (PR) condition
To force the encoder-decoder to learn a function with the following properties:
1. It should play a role similar to the original pooling (a global filter),
2. It should not lose information during the encoding and decoding process,
3. It should be able to represent the features of input images.
20. WCT via Wavelet Corrected Transforms (WCT2)
To force the encoder-decoder to learn a function with the following properties:
1. It should play a role similar to the original pooling (a global filter),
2. It should not lose information during the encoding and decoding process,
3. It should be able to represent the features of input images.
21. WCT via Wavelet Corrected Transforms (WCT2)
Haar wavelet kernels, built from the 1-D low-pass filter Lᵀ = (1/√2)[1, 1] and high-pass filter Hᵀ = (1/√2)[1, −1], so that each 2×2 kernel carries an overall factor of 1/2:

LL = ½ [[1, 1], [1, 1]],  LH = ½ [[1, −1], [1, −1]],  HL = ½ [[1, 1], [−1, −1]],  HH = ½ [[1, −1], [−1, 1]]
To force the encoder-decoder to learn a function with the following properties:
1. It should play a role similar to the original pooling (a global filter),
2. It should not lose information during the encoding and decoding process,
3. It should be able to represent the features of input images.
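The perfect-reconstruction property that motivates these kernels can be checked directly. Below is a small NumPy sketch (our own toy code, not the official WCT2 implementation at github.com/clovaai/WCT2): it decomposes an image into the four Haar subbands with stride 2 and then reconstructs it exactly, which is precisely what max-pooling cannot do:

```python
import numpy as np

# Haar analysis filters: 1-D low-pass L and high-pass H,
# combined into the four 2x2 kernels LL, LH, HL, HH.
L = np.array([1.0, 1.0]) / np.sqrt(2.0)
H = np.array([1.0, -1.0]) / np.sqrt(2.0)
KERNELS = {"LL": np.outer(L, L), "LH": np.outer(L, H),
           "HL": np.outer(H, L), "HH": np.outer(H, H)}

def haar_decompose(x):
    """Split an image with even height/width into four stride-2 subbands."""
    return {name: sum(k[i, j] * x[i::2, j::2]
                      for i in range(2) for j in range(2))
            for name, k in KERNELS.items()}

def haar_reconstruct(subbands):
    """Invert the decomposition exactly (perfect reconstruction condition)."""
    h, w = subbands["LL"].shape
    out = np.zeros((2 * h, 2 * w))
    for name, k in KERNELS.items():
        for i in range(2):
            for j in range(2):
                out[i::2, j::2] += k[i, j] * subbands[name]
    return out

img = np.random.default_rng(0).random((8, 8))
rec = haar_reconstruct(haar_decompose(img))
print(np.allclose(img, rec))  # True: no information is lost in encoding/decoding
```

Because the four kernels form an orthonormal basis of 2×2 patches, the synthesis step recovers every pixel exactly; in WCT2 the LL branch plays the role of pooling while the high-frequency branches carry the detail to the decoder.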
22. WCT via Wavelet Corrected Transforms (WCT2)
- removing multi-level stylization (less error propagation)
1. Better stylization (preferred 6× more often by users)
2. Faster model (840× faster than the previous SOTA)
3. Lighter model (51% less memory than the previous SOTA)
4. Stronger model (the only model that can process 1k-resolution images in under 4 seconds)
To force the encoder-decoder to learn a function with the following properties:
25. WCT via Wavelet Corrected Transforms (WCT2)
“Photorealistic video stylization results (from day-to-sunset). Given
a style and video frames (top), we show the results by WCT2
(middle) and PhotoWCT (bottom) without semantic segmentation
and post-processing. Despite the lack of segmentation map, WCT2
shows photorealistic results while keeping temporal consistency. On
the other hand, PhotoWCT generates spotty and varying artifacts
over frames, which harm the photorealism.”
* Video Style Transfer
Sequential consistency over the frames without imposing
further constraints.
26. WCT via Wavelet Corrected Transforms (WCT2)
[Figures: user study results (40 pairs of content and style, 41 subjects); computational cost in seconds; SSIM index vs. style loss, with the ideal case marked]
Substitute all pooling layers with wavelet filters (wavelet corrected transform)
• Note that this is a general architectural change, not bound to the stylization method!
• You can use our wavelet corrected model with other methods such as AdaIN.
Enjoy the power of a lossless image-reconstructing network!
• Note that this also opens a new venue for other applications such as image restoration tasks (e.g., denoising, super-resolution, dehazing, etc.)!
Summary
Simple solution,
But Effective !
28. CutBlur (CVPR’20)
: The first data augmentation method for various low-level vision tasks
Clova AI Research 1 Ajou University 2
Jaejun Yoo1* Namhyuk Ahn1,2* Kyung-Ah Sohn2
29. Some spoilers :)
• First to provide a comprehensive analysis of recent DA methods on super-resolution (SR)
• A new data augmentation (DA) strategy, “CutBlur”, is proposed.
• Our method provides consistent and significant improvements on the SR task.
• By simply applying our DA method, a model from ’17 can already achieve state-of-the-art (SOTA) performance in the RealSR competition.
• Last but not least, our method also improves other low-level vision tasks, such as denoising and compression artifact removal.
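In code, the core CutBlur operation is just a cut-and-paste between the HR ground truth and its LR counterpart upsampled to the same size, in a randomly chosen direction. The sketch below is a simplification under our own assumptions — a fixed cut size controlled by `alpha`, whereas the official implementation at github.com/clovaai/cutblur samples the cut ratio randomly:

```python
import numpy as np

def cutblur(lr_up, hr, alpha=0.7, rng=None):
    """CutBlur sketch: swap a random patch between the LR input (already
    upsampled to HR size) and the HR ground truth. The target stays HR."""
    rng = rng or np.random.default_rng()
    h, w = hr.shape[:2]
    ch, cw = int(h * alpha), int(w * alpha)   # cut-region size
    cy = rng.integers(0, h - ch + 1)          # random top-left corner
    cx = rng.integers(0, w - cw + 1)
    if rng.random() < 0.5:
        # paste an HR patch into the LR input (LR -> HR direction) ...
        aug = lr_up.copy()
        aug[cy:cy + ch, cx:cx + cw] = hr[cy:cy + ch, cx:cx + cw]
    else:
        # ... or an LR patch into the HR image used as input (HR -> LR)
        aug = hr.copy()
        aug[cy:cy + ch, cx:cx + cw] = lr_up[cy:cy + ch, cx:cx + cw]
    return aug, hr

lr_up = np.zeros((8, 8))   # stand-in for a bicubically upsampled LR image
hr = np.ones((8, 8))       # stand-in for the HR ground truth
x, y = cutblur(lr_up, hr, alpha=0.5, rng=np.random.default_rng(0))
```

Because the cut content stays pixel-aligned with the target, there are no sharp boundary artifacts of the kind that CutMix-style augmentation introduces, and the model is encouraged to learn where, and how much, to super-resolve.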
37. Analysis on existing DA methods
"Sharp transitions, mixed image contents, or losing the relationships of pixels can
degrade SR performance."
e.g., Cutout fails (it discards pixels) and every feature-space method fails (feature manipulation).
Training curves when feature-space DAs are applied
38. Analysis on existing DA methods
• DA methods in pixel space bring
some improvements when applied
very carefully.
39. Analysis on existing DA methods
• DA methods in pixel space bring
some improvements when applied
very carefully.
• Cutout:
The original setting (dropping 25% of pixels in a
rectangular shape) significantly degrades
performance because it erases too much spatial
information. However, erasing a tiny amount of
pixels (0.1% random pixels, i.e., 2~3 pixels of a
48x48 input patch) boosts performance.
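The tiny-Cutout variant is easy to sketch. This is an illustrative NumPy version (function name and the zero fill value are my choices, not from the paper's code):

```python
import numpy as np

def cutout_pixels(img, ratio=0.001, rng=None):
    """Erase a tiny fraction of random pixels by setting them to zero.

    With ratio=0.001 on a 48x48 patch this erases about 2~3 pixels,
    matching the setting that the slides report as beneficial.
    """
    rng = np.random.default_rng(rng)
    out = img.copy()
    h, w = img.shape[:2]
    n = max(1, int(h * w * ratio))       # number of pixels to erase
    ys = rng.integers(0, h, size=n)      # random coordinates (may collide)
    xs = rng.integers(0, w, size=n)
    out[ys, xs] = 0.0
    return out
```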
40. Analysis on existing DA methods
• DA methods in pixel space bring
some improvements when applied
very carefully.
• Mixup & CutMix:
The improvement from CutMix is marginal. We
suspect this happens because CutMix generates a
drastically sharp transition between two different
images.
The improvement from Mixup is better than
CutMix, but it still generates unrealistic images and
affects the image structure.
Mixup / CutMix
41. Analysis on existing DA methods
• DA methods in pixel space bring
some improvements when applied
very carefully.
• CutMixup:
To verify our hypothesis, we combine the benefits of
Mixup and CutMix into CutMixup. CutMixup
provides various boundary cases while minimizing
the sharp transition, by retaining partial cues as
Mixup does.
CutMixup
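The combination can be sketched as pasting a Mixup-blended box into the first image, so the boundary is softened by the blend. A minimal single-image NumPy sketch (the box size and Beta parameter here are illustrative; the paper applies the same operation consistently to the LR input and HR target):

```python
import numpy as np

def cutmixup(img1, img2, mix_alpha=0.7, cut_ratio=0.5, rng=None):
    """Paste a rectangular region of mixup(img1, img2) into img1.

    Inside the box the two images are blended (as in Mixup), so the
    transition at the box boundary is softer than in plain CutMix.
    """
    rng = np.random.default_rng(rng)
    lam = rng.beta(mix_alpha, mix_alpha)          # Mixup coefficient
    h, w = img1.shape[:2]
    ch, cw = int(h * cut_ratio), int(w * cut_ratio)
    y = rng.integers(0, h - ch + 1)               # random box position
    x = rng.integers(0, w - cw + 1)
    out = img1.copy()
    box = slice(y, y + ch), slice(x, x + cw)
    out[box] = lam * img1[box] + (1 - lam) * img2[box]
    return out
```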
42. Analysis on existing DA methods
• DA methods in pixel space bring
some improvements when applied
very carefully.
• Blend & RGB permutation:
To push further, we tried constant blending and
RGB channel permutation, which turn out to be
very simple but effective strategies showing a big
performance enhancement (in dB).
Note that neither method incurs any structural
modification to the image.
Blend / RGB perm.
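Both structure-preserving augmentations fit in a few lines each. A NumPy sketch (the blend-strength range and random color are illustrative choices, assuming images in [0, 1]):

```python
import numpy as np

def blend(img, alpha_range=(0.6, 1.0), rng=None):
    """Blend the image with a random constant color.

    Pixel relationships are preserved: every pixel moves by the same
    affine map, so no spatial structure is modified.
    """
    rng = np.random.default_rng(rng)
    alpha = rng.uniform(*alpha_range)
    color = rng.uniform(0.0, 1.0, size=(1, 1, 3))  # one constant RGB color
    return alpha * img + (1 - alpha) * color

def rgb_permute(img, rng=None):
    """Randomly shuffle the RGB channels (no spatial change at all)."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(3)
    return img[..., perm]
```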
44. CutBlur
• What does the model learn from CutBlur?
• CutBlur prevents the SR model from over-sharpening an image and helps it super-resolve only the
necessary regions.
Super-resolution results of a model (EDSR) trained without CutBlur, with its error residual (Δ)
45. CutBlur
• What does the model learn from CutBlur?
• CutBlur prevents the SR model from over-sharpening an image and helps it super-resolve only the
necessary regions.
Super-resolution results of a model (EDSR) trained with CutBlur, with its error residual (Δ)
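The CutBlur operation itself is a cut-and-paste between the HQ image and its LQ counterpart upsampled to the same size (so the pair is pixel-aligned). A minimal NumPy sketch; the box ratio and the 50/50 direction flip are illustrative choices:

```python
import numpy as np

def cutblur(lr_up, hr, cut_ratio=0.4, rng=None):
    """Cut a random box and paste LQ content into HQ, or vice versa.

    lr_up is the low-quality image upsampled to HR size, so both
    inputs share resolution and content; only local degradation
    differs inside vs. outside the box.
    """
    rng = np.random.default_rng(rng)
    h, w = hr.shape[:2]
    ch, cw = int(h * cut_ratio), int(w * cut_ratio)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    if rng.random() < 0.5:
        out = hr.copy()
        out[y:y+ch, x:x+cw] = lr_up[y:y+ch, x:x+cw]  # LQ patch into HQ
    else:
        out = lr_up.copy()
        out[y:y+ch, x:x+cw] = hr[y:y+ch, x:x+cw]     # HQ patch into LQ
    return out
```

Because the target stays the full HQ image, the model must learn "how much" to super-resolve each region rather than blindly sharpening everything.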
49. Mixture of Augmentation (MoA)
• During the training phase …
• Randomly select a single augmentation at
every step (from the curated DA list).
• Apply it!
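The MoA loop reduces to one uniform draw per training step. A toy NumPy sketch with a two-entry pool (the real curated list in the paper also includes CutBlur, CutMixup, Blend, Cutout, etc.; the function names here are illustrative):

```python
import numpy as np

def identity(lr, hr, rng):
    """No-op augmentation (keeps some clean pairs in training)."""
    return lr, hr

def rgb_permute_pair(lr, hr, rng):
    """Apply the same channel permutation to LR input and HR target."""
    perm = rng.permutation(lr.shape[-1])
    return lr[..., perm], hr[..., perm]

# Toy pool standing in for the paper's curated DA list.
AUG_POOL = [identity, rgb_permute_pair]

def mixture_of_augmentation(lr, hr, pool=AUG_POOL, rng=None):
    """Randomly select a single augmentation per step and apply it."""
    rng = np.random.default_rng(rng)
    aug = pool[rng.integers(len(pool))]
    return aug(lr, hr, rng)
```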
50. Comparison on diverse benchmark models and datasets
• SRCNN (0.07M) – ECCV’14, CARN (1.14M) – ECCV’18, RCAN (15.6M) – ECCV’18, EDSR (43.1M) – CVPRW’17
• DIV2K (synthetic), RealSR (real-world)
• Our method shows consistent improvements across models (parameter counts) and
datasets (different environments and sizes):
51. Use CutBlur: cut and paste LQ patches into the corresponding HQ images (or vice versa)!
Use our curated list of augmentation methods to further improve performance!
Mixture of Augmentation
Enjoy a performance boost of at least 0.22 dB for your model :)
+ additional positive side effects as well
Summary
Simple solution,
But Effective !
52. SimUSR (CVPRW’20)
: Simple but Strong Baseline for Unsupervised Image Super-resolution
Clova AI Research 1 Ajou University 2
Jaejun Yoo1* Namhyuk Ahn1,2* Kyung-Ah Sohn2
57. Zero-shot SR (ZSSR); previous SOTA
• Tackles the truly unsupervised SR task (no I_HR given)
• Uses only a single LR image (I_LR)
• Runs both training and inference online (at runtime)
• Training: optimize an image-specific network on the pair
(I_LR↓, I_LR), where I_LR↓ is a downsampled version of I_LR
• Inference: same as supervised SR
• Pros and cons:
• (+) Only a single image is required; learns internal statistics
• (−) Extremely high latency
• (−) Hard to benefit from a large-capacity network
59. Our method: SimUSR
• Relax the constraint of ZSSR by assuming that
LR images (I_LR^1, ..., I_LR^N) are easy to collect
• Train the model offline, run inference online
• Generate pairs (I_LR^i↓, I_LR^i) following ZSSR
• Now the unsupervised SR task turns into a supervised one
• Benefits of SimUSR
• (+) Low latency
• (+) Enjoys every advantage of the supervised framework
• Can use any network, unlike ZSSR and MZSR [1]
• Can apply data augmentation [2]
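Building the offline training set is a one-liner over the LR collection. A NumPy sketch (the box-average downsampler stands in for bicubic):

```python
import numpy as np

def downsample(img, scale=2):
    """Stand-in downsampler (box average; the method would use bicubic)."""
    h = img.shape[0] - img.shape[0] % scale
    w = img.shape[1] - img.shape[1] % scale
    img = img[:h, :w]
    return img.reshape(h // scale, scale, w // scale, scale, -1).mean(axis=(1, 3))

def simusr_dataset(lr_images, scale=2):
    """Turn LR images (I_LR^1 ... I_LR^N) into supervised pairs.

    Each LR image plays the role of the HR target one scale down, so
    any off-the-shelf supervised SR model can be trained offline on
    these (input, target) pairs.
    """
    return [(downsample(img, scale), img) for img in lr_images]
```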
60. Our method: SimUSR
• Yes! We are just doing supervised learning at one scale lower and
relying on its generalizability across scales.
61. Our method: SimUSR
• Though this line of study is easy to think of, and thus SHOULD HAVE BEEN investigated
before any complicated unsupervised methods, surprisingly, no such work currently exists.
62. Our method: SimUSR
• Even better, this simple method outperforms the SOTA method with dramatically shorter
runtime latency, and significantly reduces the gap to the supervised models.
63. Experiment: Bicubic SR
• To analyze multiple methods simultaneously in the supervised setup
• Compare supervised SR / ZSSR / SimUSR
• For SimUSR, we use CARN / RCAN / EDSR as a backbone
• SimUSR shows a large improvement over ZSSR (Table 1)
• A larger network achieves better performance (e.g., CARN vs. RCAN)
• SimUSR further reduces the gap to supervised SR using augmentation (Table 2)
Note that ours achieves performance almost on par with the supervised SR.
64. Experiment: Real-world SR
• Compare ZSSR and our SimUSR on NTIRE 2020 dataset
: We improved the previous SOTA based on our observations
1) ZSSR suffers from noise
→ add BM3D as pre-processing
2) Certain data augmentation harms the performance
→ remove Affine transformation
65. Experiment: Real-world SR
• Our SimUSR outperforms ZSSR by a huge margin
in both SR performance and latency
66. Qualitative comparison and competition results
NTIRE 20 Real-world SR challenge
Track 1 (image processing artifacts)
• 1st rank of PSNR
• 2nd rank of SSIM
• 13th rank of LPIPS
Bicubic dataset
NTIRE 2020 dataset
67. Solve the supervised learning problem at one scale lower and apply the model to the original problem!
Use our data augmentation methods for better generalization :)
Enjoy SOTA performance in the unsupervised SR world :)
Summary
Simple solution,
But Effective !
68. Style-transfer your images in a "photographic way", e.g., day2sunset.
"CutBlur": a powerful data augmentation method for various low-level vision tasks.
That’s all!
Enjoy your CVPR!
Code, generated images,
and pre-trained models
are all available at
github.com/clovaai/WCT2
Code, generated images,
and pre-trained models
are all available at
github.com/clovaai/cutblur