The document discusses FactorVAE, a method for disentangling latent representations in variational autoencoders (VAEs). It introduces Total Correlation (TC) as a penalty term that encourages independence between latent variables. TC is added to the standard VAE objective function to guide the model to learn disentangled representations. The document provides details on how TC is defined and computed based on the density-ratio trick from generative adversarial networks. It also discusses how FactorVAE uses TC to learn disentangled representations and can be evaluated using a disentanglement metric.
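FactorVAE estimates the TC term with the density-ratio trick: a discriminator learns to tell samples of the aggregate posterior q(z) from samples whose dimensions have been independently permuted across the batch, which approximates the product of marginals. A minimal numpy sketch of that permutation step and the resulting TC estimate (function names are illustrative, not from the paper):

```python
import numpy as np

def permute_dims(z, rng):
    """Shuffle each latent dimension independently across the batch.

    If the rows of z are samples from q(z), the rows of the result are
    (approximately) samples from the product of marginals prod_j q(z_j),
    which is what the FactorVAE discriminator is trained to tell apart
    from the joint.
    """
    z_perm = np.empty_like(z)
    n, d = z.shape
    for j in range(d):
        z_perm[:, j] = z[rng.permutation(n), j]
    return z_perm

def tc_estimate(logits_joint):
    """Density-ratio estimate of total correlation.

    logits_joint: discriminator logits D(z) on samples of q(z), where
    sigmoid(logit) is the probability the sample came from the joint.
    Then TC(z) ~= E[log D/(1-D)] = E[logit].
    """
    return float(np.mean(logits_joint))

rng = np.random.default_rng(0)
z = rng.normal(size=(512, 6))
z_perm = permute_dims(z, rng)
# Permuting within each column preserves every per-dimension marginal exactly.
assert np.array_equal(np.sort(z[:, 0]), np.sort(z_perm[:, 0]))
```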
The document summarizes a research paper that compares the performance of MLP-based models to Transformer-based models on various natural language processing and computer vision tasks. The key points are:
1. Gated MLP (gMLP) architectures can achieve performance comparable to Transformers on most tasks, demonstrating that attention mechanisms may not be strictly necessary.
2. However, attention still provides benefits for some NLP tasks, as models combining gMLP and attention outperformed pure gMLP models on certain benchmarks.
3. For computer vision, gMLP achieved results close to Vision Transformers and CNNs on image classification, indicating gMLP can match their data efficiency.
[DL Reading Group] Efficiently Modeling Long Sequences with Structured State Spaces (Deep Learning JP)
This document summarizes a research paper on modeling long-range dependencies in sequence data using structured state space models and deep learning. The proposed S4 model (1) derives recurrent and convolutional representations of state space models, (2) improves long-term memory using HiPPO matrices, and (3) efficiently computes state space model convolution kernels. Experiments show S4 outperforms existing methods on various long-range dependency tasks, achieves fast and memory-efficient computation comparable to efficient Transformers, and performs competitively as a general sequence model.
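The dual recurrent/convolutional view that S4 builds on can be illustrated with a plain discrete state space model. This numpy sketch (dense matrix powers, not S4's efficient kernel computation) checks that the two representations produce identical outputs:

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """Run the discrete state space model x_k = A x_{k-1} + B u_k, y_k = C x_k."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

def ssm_kernel(A, B, C, L):
    """Convolution kernel K_i = C A^i B for i = 0..L-1 (the object S4 computes efficiently)."""
    K, Ai = [], np.eye(A.shape[0])
    for _ in range(L):
        K.append(C @ Ai @ B)
        Ai = Ai @ A
    return np.array(K)

def ssm_convolutional(A, B, C, u):
    """Same outputs via causal convolution of u with the kernel."""
    K = ssm_kernel(A, B, C, len(u))
    return np.array([sum(K[i] * u[k - i] for i in range(k + 1)) for k in range(len(u))])

rng = np.random.default_rng(0)
A = 0.5 * rng.normal(size=(3, 3)); B = rng.normal(size=3); C = rng.normal(size=3)
u = rng.normal(size=10)
# Unrolling the recurrence gives y_k = sum_i (C A^i B) u_{k-i}: a convolution.
assert np.allclose(ssm_recurrent(A, B, C, u), ssm_convolutional(A, B, C, u))
```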
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
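For the masked-prediction family above, the core data flow is: hide a large fraction of patch tokens, encode only the visible ones, and score reconstructions on the masked set. A hedged numpy sketch of that bookkeeping (no actual ViT; the function names are illustrative):

```python
import numpy as np

def masked_patch_targets(patches, mask_ratio, rng):
    """Split patch tokens into visible/masked sets, as in masked-prediction
    pretraining for ViTs (a sketch of the data flow, not a full model).

    patches: (num_patches, dim) array of flattened image patches.
    Returns the visible patches (encoder input), the masked indices, and
    the masked patches (reconstruction targets).
    """
    n = len(patches)
    n_mask = int(n * mask_ratio)
    idx = rng.permutation(n)
    masked, visible = idx[:n_mask], idx[n_mask:]
    return patches[visible], masked, patches[masked]

def reconstruction_loss(pred, target):
    """Mean squared error, evaluated only on the masked patches."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))          # a 14x14 grid of 64-dim tokens
visible, masked_idx, targets = masked_patch_targets(patches, 0.75, rng)
assert len(visible) + len(masked_idx) == 196
assert reconstruction_loss(targets, targets) == 0.0
```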
Paper introduction: Learning With Neighbor Consistency for Noisy Labels (Toru Tamaki)
Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid, "Learning With Neighbor Consistency for Noisy Labels" CVPR2022
https://openaccess.thecvf.com/content/CVPR2022/html/Iscen_Learning_With_Neighbor_Consistency_for_Noisy_Labels_CVPR_2022_paper.html
Malicious software is categorized into families based on static and dynamic characteristics, infection methods, and the nature of the threat. Visual exploration of malware instances and families in a low-dimensional space gives a first overview of dependencies and relationships among instances, reveals their groups, and isolates outliers. Furthermore, visually exploring different feature sets helps assess whether a set carries a valid abstract representation, which classification and clustering algorithms can later exploit to achieve high accuracy. We investigate t-SNE, one of the best-known dimensionality reduction techniques, to reduce the malware representation from a high-dimensional space of thousands of features to a low-dimensional one. We experiment with different feature sets and depict malware clusters in 2-D.
This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how they can generate images without explicitly defining a probability distribution by using an adversarial training process. The second half discusses how GANs are related to actor-critic models and inverse reinforcement learning in reinforcement learning. It explains how GANs can be viewed as training a generator to fool a discriminator, similar to how policies are trained in reinforcement learning.
The document summarizes the author's computer vision research from 2020 to the present. It covers areas of research including image segmentation, 3D reconstruction, image restoration, and lip generation. Specific projects are mentioned under each area, such as YOLACT and MODNet for image segmentation, PIFu and SMPL for 3D reconstruction, and Wav2Lip and SyncTalkFace for lip generation from speech. The author also outlines plans for future research directions involving multimodal learning, generative models, and representing scenes with neural radiance fields.
Slides of the talk at http://www.meetup.com/R-Users-Sydney/events/223867196/
There is a web version here: http://wush978.github.io/FeatureHashing/index.html
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
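The reformulation in the abstract, learning a residual F(x) and outputting F(x) + x, reduces to a few lines. A minimal fully-connected sketch (the paper's blocks use convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x): the block learns the residual F(x) = W2 relu(W1 x)
    with reference to its input, instead of an unreferenced mapping.
    (Plain fully-connected sketch; the paper uses conv + batch norm layers.)
    """
    return relu(W2 @ relu(W1 @ x) + x)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
# With zero weights the residual branch vanishes and the block reduces to
# relu(x): deep stacks of such blocks can default to (near-)identity, which
# is why extreme depth stays optimizable.
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
assert np.allclose(y, relu(x))
```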
A simple framework for contrastive learning of visual representations (Devansh16)
Link: https://machine-learning-made-simple.medium.com/learnings-from-simclr-a-framework-contrastive-learning-for-visual-representations-6c145a5d8e99
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
Comments: ICML'2020. Code and pretrained models at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as: arXiv:2002.05709 [cs.LG]
(or arXiv:2002.05709v3 [cs.LG] for this version)
Submission history
From: Ting Chen [view email]
[v1] Thu, 13 Feb 2020 18:50:45 UTC (5,093 KB)
[v2] Mon, 30 Mar 2020 15:32:51 UTC (5,047 KB)
[v3] Wed, 1 Jul 2020 00:09:08 UTC (5,829 KB)
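SimCLR's contrastive objective is the NT-Xent (normalized temperature-scaled cross entropy) loss over the 2N augmented views in a batch. A hedged numpy sketch, assuming a batch layout where rows 2k and 2k+1 are the two views of example k:

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent loss used by SimCLR.

    z: (2N, d) embeddings where rows 2k and 2k+1 are the two augmented
    views of example k. Returns the mean contrastive loss: each view must
    identify its partner among all other views in the batch.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine-similarity space
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z)
    pos = np.array([i + 1 if i % 2 == 0 else i - 1 for i in range(n)])
    logsumexp = np.log(np.sum(np.exp(sim), axis=1))    # softmax denominator
    return float(np.mean(logsumexp - sim[np.arange(n), pos]))

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 16))   # 4 examples x 2 views, 16-dim projections
loss = nt_xent(views)
assert loss > 0
```

Larger batches enlarge the set of negatives in the denominator, which is one reason the paper finds contrastive learning benefits from bigger batch sizes.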
Multimodal Residual Learning for Visual Question-Answering (NAVER D2)
The document summarizes Jin-Hwa Kim's paper on multimodal residual learning for visual question answering (VQA). It describes the VQA task, the vision and question modeling parts of the proposed approach, and how multimodal residual networks are used to combine the vision and question representations. Evaluation results on the VQA test-dev dataset show the proposed approach achieves state-of-the-art performance.
The document evaluates the bag of features technique for visual object detection. It compares the performance of support vector machines, k-nearest neighbors, and decision trees on 6 object classes using SURF keypoints and SIFT descriptors. SVM achieved the best accuracy rate of 91.9% while decision trees performed worst at 71.1%. The author proposes two enhancements: 1) a hybrid algorithm combining bag of features with geometric constraints or 2) reimplementing the algorithm with a convolutional neural network to incorporate spatial information.
A brief lesson on what constitutes computational decision making, from simple regression via various classification methods to deep learning. No maths, only basic concepts to teach the lingo of machine learning to a lay audience.
Offline Character Recognition Using Monte Carlo Method and Neural Network (ijaia)
Human-machine interfaces are constantly improving thanks to the ongoing development of computer tools. Handwritten character recognition has various significant applications, such as form scanning, verification, validation, and check reading. Because of the importance of these applications, intensive research in off-line handwritten character recognition is ongoing. The challenge in recognizing handwriting lies in human nature: everyone has a unique style in terms of font, contours, etc. This paper presents a novel approach to identifying offline characters, which we call the character-divider approach, applied after the pre-processing stage. We also devise an innovative feature-extraction approach known as vector contour, and discuss the pros and cons, including the limitations, of our approach.
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures (MLAI2)
MetaPerturb is a meta-learned perturbation function that can enhance generalization of neural networks on different tasks and architectures. It proposes a novel meta-learning framework involving jointly training a main model and perturbation module on multiple source tasks to learn a transferable perturbation function. This meta-learned perturbation function can then be transferred to improve performance of a target model on an unseen target task or architecture, outperforming baselines on various datasets and architectures.
Online Coreset Selection for Rehearsal-based Continual Learning (MLAI2)
A dataset is a shred of crucial evidence to describe a task. However, each data point in the dataset does not have the same potential, as some of the data points can be more representative or informative than others. This unequal importance among the data points may have a large impact in rehearsal-based continual learning, where we store a subset of the training examples (coreset) to be replayed later to alleviate catastrophic forgetting. In continual learning, the quality of the samples stored in the coreset directly affects the model's effectiveness and efficiency. The coreset selection problem becomes even more important under realistic settings, such as imbalanced continual learning or noisy data scenarios. To tackle this problem, we propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration and trains them in an online manner. Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting. We validate the effectiveness of our coreset selection mechanism over various standard, imbalanced, and noisy datasets against strong continual learning baselines, demonstrating that it improves task adaptation and prevents catastrophic forgetting in a sample-efficient manner.
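One way to score "representative" samples in the spirit of OCS is to compare each per-sample gradient with the minibatch's mean gradient. The sketch below shows only that representativeness term; the full method also scores within-batch diversity and affinity to past tasks, so this is an illustrative reduction, not the paper's algorithm:

```python
import numpy as np

def select_coreset(grads, k):
    """Pick the k most representative samples of a minibatch: those whose
    per-sample gradients point most in the direction of the mean gradient.
    (Representativeness term only; OCS additionally scores batch diversity
    and affinity to samples from past tasks.)
    """
    g = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    mean = grads.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    scores = g @ mean                       # cosine similarity to mean gradient
    return np.argsort(scores)[-k:][::-1]    # indices of the top-k scores

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))   # 32 samples, 10-dim (toy) gradients
idx = select_coreset(grads, 5)
assert len(idx) == 5
```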
This document discusses using fully convolutional neural networks for defect inspection. It begins with an agenda that outlines image segmentation using FCNs and defect inspection. It then provides details on data preparation including labeling guidelines, data augmentation, and model setup using techniques like deconvolution layers and the U-Net architecture. Metrics for evaluating the model like Dice score and IoU are also covered. The document concludes with best practices for successful deep learning projects focusing on aspects like having a large reusable dataset, feasibility of the problem, potential payoff, and fault tolerance.
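The two segmentation metrics mentioned, Dice score and IoU, are straightforward to compute on binary masks; they are monotonically related by dice = 2·iou / (1 + iou), so they rank models identically:

```python
import numpy as np

def iou(pred, target):
    """Intersection over union for binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

def dice(pred, target):
    """Dice score 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total else 1.0

a = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
b = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
# intersection = 2, union = 4, |a| = 3, |b| = 3
assert iou(a, b) == 0.5
assert dice(a, b) == 2 * 2 / 6
```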
Mesh Simplification via a Volume Cost Measure (ijcga)
We develop a polygonal mesh simplification algorithm based on a novel analysis of the mesh geometry. In particular, we first propose a characterization of vertices as hyperbolic or non-hyperbolic depending upon their discrete local geometry. Subsequently, the simplification process computes a volume cost for each non-hyperbolic vertex, in analogy with spherical volume, to capture the loss of fidelity if that vertex is decimated. Vertices of least volume cost are then successively deleted and the resulting holes re-triangulated using a method based on a novel heuristic. Preliminary experiments indicate a performance comparable to that of the best known mesh simplification algorithms.
The document discusses several approaches for learning molecular representations using deep learning networks, including:
1. Improved Neural Fingerprint (A-NFP), which addresses limitations in the original Neural Fingerprint approach by differentiating between bond types. A-NFP achieves better performance than NFP on toxicity, solubility, and drug efficacy prediction tasks.
2. Message Passing Neural Networks (MPNN), which provide a general framework for graph-structured data like molecules. MPNNs consist of a message-passing phase and a readout phase; the CNN for learning molecular fingerprints is a special case of this framework.
3. Mol2vec, which learns molecular representations without labels using a skip-gram model similar to word2vec.
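The message-passing and readout phases mentioned under MPNN can be sketched in a few lines of numpy. This is a simplified illustration: messages here ignore bond types, and the update is a tanh rather than the GRU used in the paper:

```python
import numpy as np

def message_passing_step(h, adj, W_msg, W_upd):
    """One MPNN step on a molecular graph (simplified sketch).

    h:   (num_atoms, d) node features
    adj: (num_atoms, num_atoms) 0/1 adjacency (bond) matrix
    Message phase: each atom sums transformed neighbor features;
    update phase: the node state is refreshed from its aggregated messages.
    Real MPNNs also condition messages on the bond (edge) type.
    """
    messages = adj @ (h @ W_msg)           # sum of messages from neighbors
    return np.tanh(h @ W_upd + messages)   # GRU update in the paper; tanh here

def readout(h):
    """Permutation-invariant readout: sum node states into a molecule vector."""
    return h.sum(axis=0)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                # 5 atoms, 8-dim features
adj = np.zeros((5, 5)); adj[[0, 1, 1, 2], [1, 0, 2, 1]] = 1   # a small chain
W = rng.normal(size=(8, 8)) * 0.1
h2 = message_passing_step(h, adj, W, W)
assert h2.shape == h.shape
```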
Image classification is perhaps the most important part of digital image analysis. In this paper, we compare the most widely used models, the CNN (Convolutional Neural Network) and the MLP (Multilayer Perceptron). We aim to show how the two models differ and how each approaches the final goal of image classification. Souvik Banerjee | Dr. A Rengarajan, "Hand-Written Digit Classification", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42444.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42444/handwritten-digit-classification/souvik-banerjee
Compressed Sensing using Generative Model (kenluck2001)
This document summarizes Kenneth Emeka Odoh's presentation on using generative models for compressed sensing. It introduces compressed sensing and describes how generative models like variational autoencoders (VAEs) and generative adversarial networks (GANs) can be used to solve the underdetermined linear systems that arise in compressed sensing. The document also compares VAEs to standard autoencoders and discusses how generative models may require fewer training examples due to regularization imposing structure on the problem.
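The underdetermined recovery problem becomes tractable once solutions are constrained to a generator's range: minimize ||A G(z) - y||^2 over the low-dimensional latent z. A numpy sketch with a linear "generator" standing in for a trained decoder (with a VAE/GAN decoder the same loop would run via autodiff):

```python
import numpy as np

def recover(G, A, y, steps=2000):
    """Recover a signal from compressed measurements y = A x by searching the
    latent space of a generator: min_z ||A G(z) - y||^2.

    G is a fixed linear 'generator' here, so the gradient is available in
    closed form; the step size is set from the spectral norm of A G.
    """
    M = A @ G
    lr = 1.0 / np.linalg.norm(M, 2) ** 2
    z = np.zeros(G.shape[1])
    for _ in range(steps):
        z -= lr * M.T @ (M @ z - y)    # gradient of 0.5 * ||M z - y||^2
    return G @ z

rng = np.random.default_rng(0)
G = rng.normal(size=(50, 5))                  # signals lie on a 5-dim generator range
x_true = G @ rng.normal(size=5)
A = rng.normal(size=(15, 50)) / np.sqrt(15)   # only 15 measurements of a 50-dim signal
x_hat = recover(G, A, A @ x_true)
# 15 equations cannot pin down 50 unknowns, but they can pin down 5 latents.
assert np.linalg.norm(x_hat - x_true) < 0.05 * np.linalg.norm(x_true)
```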
This document introduces Julia and provides an overview of its key features. It begins with introductions and background on why Julia was created. It then covers basic Julia concepts like variables, arithmetic operators, control flow, arrays, functions, and parallelization capabilities. The document also discusses Julia's built-in package ecosystem and provides examples of packages like DataFrames. It aims to provide attendees with foundational knowledge of the Julia programming language.
Similar to: Trends in unsupervised image feature representation learning, {Un, Self}-supervised representation learning (CVPR 2018 complete-reading challenge report)
Digital Marketing Trends in 2024 | Guide for Staying Ahead (Wask)
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Webinar: Designing a schema for a Data Warehouse (Federico Razzoli)
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features that provide convenience and capability also sacrifice security. This best-practices guide outlines steps users can take to better protect personal devices and information.
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
OpenID AuthZEN Interop Read Out - Authorization (David Brossard)
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
13. cvpaper.challenge
■ Exemplar CNN
➤ Pretext task: instance-level image classification robust to (geometric and color) transformations
➤ Since the number of classes equals the number of training image instances and classification is done with a plain softmax, the usable dataset size does not scale well
➤ It is in fact close to Instance Discrimination (described later), already as of 2014
➤ Better results than SIFT on tasks such as geometric matching
(Jumping ahead) Other methods (up to CVPR 2017)
Dosovitskiy et al., "Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks", NIPS 2014.
Fig. 2. Several random transformations applied to one of the patches extracted from the STL unlabeled dataset. The original ('seed') patch is in the top left corner.
[Slide annotation] A single image instance after various transformations; this is defined as one class.
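The surrogate-class construction behind Exemplar-CNN (each seed patch becomes its own class, populated by random transformations of that seed) can be sketched as follows. The transformation set here is a reduced, illustrative subset; the paper also uses rotation, color, and contrast changes:

```python
import numpy as np

def make_surrogate_dataset(seeds, n_views, rng):
    """Build Exemplar-CNN style surrogate classes: each seed patch becomes
    its own class, populated by random transformations of that seed.
    """
    xs, ys = [], []
    for label, seed in enumerate(seeds):
        for _ in range(n_views):
            v = seed
            if rng.random() < 0.5:
                v = v[:, ::-1]                  # horizontal flip
            v = np.roll(v, rng.integers(-3, 4), axis=rng.integers(0, 2))  # shift
            v = v * rng.uniform(0.7, 1.4)       # crude brightness/contrast scale
            xs.append(v)
            ys.append(label)                    # class = seed instance index
    return np.stack(xs), np.array(ys)

rng = np.random.default_rng(0)
seeds = rng.normal(size=(8, 32, 32))   # 8 seed patches -> 8 surrogate classes
X, y = make_surrogate_dataset(seeds, n_views=10, rng=rng)
assert X.shape == (80, 32, 32) and set(y.tolist()) == set(range(8))
```

A softmax classifier trained on (X, y) then has one output per image instance, which is exactly why the scheme does not scale to large datasets.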
TABLE 1. Classification accuracies on several datasets (in percent).

Algorithm                                | STL-10     | CIFAR-10(400) | CIFAR-10  | Caltech-101  | Caltech-256(30) | #features
Convolutional K-means Network [32]       | 60.1 ± 1   | 70.7 ± 0.7    | 82.0      | —            | —               | 8000
Multi-way local pooling [33]             | —          | —             | —         | 77.3 ± 0.6   | 41.7            | 1024 × 64
Slowness on videos [14]                  | 61.0       | —             | —         | 74.6         | —               | 556
Hierarchical Matching Pursuit (HMP) [34] | 64.5 ± 1   | —             | —         | —            | —               | 1000
Multipath HMP [35]                       | —          | —             | —         | 82.5 ± 0.5   | 50.7            | 5000
View-Invariant K-means [16]              | 63.7       | 72.6 ± 0.7    | 81.9      | —            | —               | 6400
Exemplar-CNN (64c5-64c5-128f)            | 67.1 ± 0.2 | 69.7 ± 0.3    | 76.5      | 79.8 ± 0.5 * | 42.4 ± 0.3      | 256
Exemplar-CNN (64c5-128c5-256c5-512f)     | 72.8 ± 0.4 | 75.4 ± 0.2    | 82.2      | 86.1 ± 0.5 † | 51.2 ± 0.2      | 960
Exemplar-CNN (92c5-256c5-512c5-1024f)    | 74.2 ± 0.4 | 76.6 ± 0.2    | 84.3      | 87.1 ± 0.7 ‡ | 53.6 ± 0.2      | 1884
Supervised state of the art              | 70.1 [36]  | —             | 92.0 [37] | 91.44 [38]   | 70.6 [2]        | —

* Average per-class accuracy 78.0% ± 0.4%. † Average per-class accuracy 85.0% ± 0.7%. ‡ Average per-class accuracy 85.8% ± 0.7%.
4.3 Detailed Analysis
We performed additional experiments using the 64c5-64c5-
128f network to study the effect of various design choices in
Exemplar-CNN training and validate the invariance proper-
ties of the learned features.
4.3.1 Number of Surrogate Classes
We varied the number N of surrogate classes between 50
and 32000. As a sanity check, we also tried classification
with random filters. The results are shown in Fig. 3.
Clearly, the classification accuracy increases with the
number of surrogate classes until it reaches an optimum at
about 8000 surrogate classes after which it did not change or
even decreased. This is to be expected: the larger the number
of surrogate classes, the more likely it is to draw very similar
or even identical samples, which are hard or impossible
to discriminate. Few such cases are not detrimental to the classification performance.
[Fig. 3. Influence of the number of surrogate training classes: classification accuracy on STL-10 (± σ) and the validation error on the surrogate data (shown in red) against the number of classes (log scale, 50–32000). Note the different y-axes for the two curves.]
The number of classes (= number of image instances) hits a ceiling at around 8000.
14. cvpaper.challenge 14
n Context Prediction (CP)
➤ Pretext task : split an image into a 3×3 grid and classify the relative position of two patches (8 classes)
- The two patches are fed into a Siamese network with weight-shared branches
- The branch CNN is then used as a pretrained model
➤ Fine-tuning results are only slightly better than random initialization
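The 8-way labelling can be sketched as follows; `sample_pair` and the offset ordering are illustrative assumptions, not the paper's actual indexing:

```python
import random

# The eight neighbor offsets (row, col) around the center patch of a
# 3x3 grid; the classification target is the index of the sampled offset.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def sample_pair(grid_size=3):
    """Pick the center patch and one of its 8 neighbors;
    return both grid positions and the 8-way class label."""
    center = (grid_size // 2, grid_size // 2)
    label = random.randrange(len(OFFSETS))
    dr, dc = OFFSETS[label]
    neighbor = (center[0] + dr, center[1] + dc)
    return center, neighbor, label

random.seed(0)
c, nb, lab = sample_pair()
```

The network never sees the grid itself, only the two patch crops, so it must infer the arrangement from content alone.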
cover clusters of, say, foliage. A few subsequent works have
attempted to use representations more closely tied to shape
[36, 43], but relied on contour extraction, which is difficult
in complex images. Many other approaches [22, 29, 16]
focus on defining similarity metrics which can be used in
more standard clustering algorithms; [45], for instance,
re-casts the problem as frequent itemset mining. Geometry may also be used for verifying links between images [44, 6, 23], although this can fail for deformable objects.
Video can provide another cue for representation learn-
ing. For most scenes, the identity of objects remains un-
changed even as appearance changes with time. This kind
of temporal coherence has a long history in visual learning
literature [18, 59], and contemporaneous work shows strong
improvements on modern detection datasets [57].
Finally, our work is related to a line of research on dis-
criminative patch mining [13, 50, 28, 37, 52, 11], which has
emphasized weak supervision as a means of object discov-
ery. Like the current work, they emphasize the utility of
learning representations of patches (i.e. object parts) before
learning full objects and scenes, and argue that scene-level
labels can serve as a pretext task. For example, [13] trains
detectors to be sensitive to different geographic locales, but
the actual goal is to discover specific elements of architec-
tural style.
3. Learning Visual Context Prediction
[Figure 3 diagram: Patch 1 and Patch 2 each pass through a weight-shared branch — conv1 (11x11,96,4), pool1 (3x3,96,2), LRN1, conv2 (5x5,384,2), pool2 (3x3,384,2), LRN2, conv3 (3x3,384,1), conv4 (3x3,384,1), conv5 (3x3,256,1), pool5 (3x3,256,2), fc6 (4096) — whose outputs are merged into fc7 (4096), fc8 (4096), and fc9 (8).]
Figure 3. Our architecture for pair classification. Dotted lines in-
dicate shared weights. ‘conv’ stands for a convolution layer, ‘fc’
stands for a fully-connected one, ‘pool’ is a max-pooling layer, and
‘LRN’ is a local response normalization layer. Numbers in paren-
theses are kernel size, number of outputs, and stride (fc layers have
only a number of outputs). The LRN parameters follow [32]. All
conv and fc layers are followed by ReLU nonlinearities, except fc9
which feeds into a softmax classifier.
…semantic reasoning for each patch separately. When designing the network, we followed AlexNet where possible. To obtain training examples given an image, we sample the first patch uniformly, without any reference to image content…
SiameseNet
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
Figure 2. The algorithm receives two patches in one of these eight
possible spatial arrangements, without any context, and must then
classify which configuration was sampled.
model (e.g. a deep network) to predict, from a single word,
the n preceding and n succeeding words. In principle, sim-
ilar reasoning could be applied in the image domain, a kind
of visual “fill in the blank” task, but, again, one runs into the…
Fine-tuning on Pascal VOC
Discriminative
Doersch et al., “Unsupervised visual representation learning by context prediction”, ICCV 2015.
(~ CVPR2017)
15. cvpaper.challenge 15
n Jigsaw Puzzle (JP)
➤ Pretext task : feed the patches in a random order and classify the correct permutation
- Nine patches are fed into the Siamese network simultaneously
- The number of permutations is huge, so training uses 1000 classes
chosen so that their pairwise Hamming distances are large
➤ CP can be quite ambiguous depending on the patches (figure below)
➤ The more patches the network can see, the less the ambiguity
➤ Accuracy improves considerably over CP
Discriminative
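The Hamming-distance-based permutation selection can be sketched greedily; `select_permutations` is a small-scale illustration (4 patches, 6 classes) of the strategy the paper applies to 9 patches and 1000 classes:

```python
import itertools
import random

def hamming(p, q):
    """Number of positions where two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def select_permutations(n_patches=4, n_classes=6):
    """Greedily pick a permutation set with large pairwise Hamming
    distance: repeatedly add the candidate whose minimum distance to
    the already-chosen set is largest."""
    pool = list(itertools.permutations(range(n_patches)))
    random.seed(0)
    chosen = [pool.pop(random.randrange(len(pool)))]
    while len(chosen) < n_classes:
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen

perms = select_permutations()
```

Keeping the chosen permutations far apart in Hamming distance keeps the classes visually distinguishable.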
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
JP 67.7 53.2 —
Estimating the relative position of ① or ② with respect to ⑤ is quite difficult.
[Figure (Noroozi & Favaro): learning representations by solving Jigsaw puzzles. (a) the image from which the patches (marked with green lines) are extracted; (b) a puzzle obtained by shuffling them — patches such as ① ➁ ⑤ illustrate the ambiguity.]
Noroozi et al., “Unsupervised learning of visual representations by solving jigsaw puzzles ”, ECCV 2016.
(~ CVPR2017)
16. cvpaper.challenge 16
n Solving the pretext task without high-level information
➤ However, what we actually want the network to capture is high-level (semantic) information
➤ Could the relative position be estimated from low-level information
at the patch boundaries alone?
- Leave a gap between patches
- Jitter the patch positions
➤ Could the relative position be estimated from chromatic aberration?
- Randomly replace 2 of the channels with Gaussian noise
Such shortcuts are called trivial solutions.
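The channel-replacement countermeasure can be sketched as follows; `drop_color_channels` and its noise parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_color_channels(img):
    """Replace two randomly chosen color channels of an HxWx3 image
    with Gaussian noise, so the network cannot align channels to read
    off chromatic aberration."""
    out = img.copy()
    keep = int(rng.integers(0, 3))            # the one channel left intact
    for c in range(3):
        if c != keep:
            out[..., c] = rng.normal(0.5, 0.1, size=img.shape[:2])
    return out

img = rng.random((8, 8, 3))
out = drop_color_channels(img)
```

With only one informative channel left, inter-channel "aberration" cues are no longer available during training.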
occur in a specific spatial configuration (if there is no spe-
cific configuration of the parts, then it is “stuff” [1]). We
present a ConvNet-based approach to learn a visual repre-
sentation from this task. We demonstrate that the resulting
visual representation is good for both object detection, pro-
viding a significant boost on PASCAL VOC 2007 compared
to learning from scratch, as well as for unsupervised object
discovery / visual data mining. This means, surprisingly,
that our representation generalizes across images, despite
being trained using an objective function that operates on a
single image at a time. That is, instance-level supervision
appears to improve performance on category-level tasks.
2. Related Work
One way to think of a good image representation is as
the latent variables of an appropriate generative model. An
ideal generative model of natural images would both gener-
ate images according to their natural distribution, and be
concise in the sense that it would seek common causes
for different images and share information between them.
However, inferring the latent structure given an image is in-
tractable for even relatively simple models. To deal with
these computational issues, a number of works, such as
the wake-sleep algorithm [25], contrastive divergence [24],
deep Boltzmann machines [48], and variational Bayesian
methods [30, 46] use sampling to perform approximate in-
ference. Generative models have shown promising per-
formance on smaller datasets such as handwritten dig-
its [25, 24, 48, 30, 46], but none have proven effective for
…one runs into the problem of determining whether the predictions themselves are correct [12], unless one cares about predicting only very low-level features [14, 33, 53]. To address this, [39] predicts the appearance of an image region by consensus voting of the transitive nearest neighbors of its surrounding regions. Our previous work [12] explicitly formulates a statistical…
For example…
An example of chromatic aberration.
Keep the network from exploiting the inter-channel "aberration" during training.
Keep it from judging by the patch boundaries or their extrapolation.
17. cvpaper.challenge 17
n Context Encoder (CE)
➤ Pretext task : inpainting images with missing regions
- An Adversarial Loss + L2 Loss is proposed, but the representation-learning
experiments use the L2 Loss only
- i.e. plain regression
➤ During representation learning the network only ever sees images with missing regions
- Yet the target task feeds it intact images
Reconstruction-based
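Since the representation-learning variant reduces to plain regression on the missing region, the objective can be sketched as a masked L2; `masked_l2_loss` is an illustrative name, not the paper's code:

```python
import numpy as np

def masked_l2_loss(pred, target, mask):
    """L2 reconstruction penalty applied only on the dropped (masked)
    region that the network must inpaint; mask is 1 where pixels were
    removed and 0 elsewhere."""
    diff = (pred - target) * mask
    return float((diff ** 2).sum() / max(mask.sum(), 1))

target = np.ones((4, 4))
pred = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1          # the dropped 2x2 region
loss = masked_l2_loss(pred, target, mask)
```

Only the masked pixels contribute, which is what makes the task inpainting rather than plain autoencoding.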
Cls. Det. Seg.
random 53.3 43.4 19.8
CE 56.5 44.5 29.7
JP 67.7 53.2 —
Figure 2: Context Encoder. The context image is passed through the encoder to obtain features which are connected to the decoder using a channel-wise fully-connected layer…
Pathak et al., “Context encoders: Feature learning by inpainting ”, CVPR 2016.
(~ CVPR2017)
18. cvpaper.challenge 18
n Colorful Image Colorization (CC)
➤ Pretext task : colorizing grayscale images {L => ab}
➤ Instead of plain regression, solve a classification problem over a quantized ab space
➤ Since representation learning assumes grayscale input, when color images
must be handled the input is Lab and the ab channels are randomly initialized
n Split-Brain (SB)
➤ Split the network in two along the channel dimension and
ensemble {L => ab, ab => L}
➤ Quantizing and classifying, rather than regressing,
gave better feature representations
Reconstruction-based
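The quantization of the ab plane can be sketched as below; `quantize_ab`, the 10-unit grid, and the [-110, 110] range are illustrative assumptions (the paper bins the in-gamut ab space into 313 classes):

```python
import numpy as np

def quantize_ab(ab, grid=10, lo=-110, hi=110):
    """Map continuous (a, b) chroma values to a discrete class id by
    binning each axis into (hi - lo) / grid cells, turning colorization
    from regression into classification."""
    n = (hi - lo) // grid                                # bins per axis
    idx = np.clip(((ab - lo) // grid).astype(int), 0, n - 1)
    return idx[..., 0] * n + idx[..., 1]                 # one class id per pixel
```

A softmax over these bins can represent multi-modal color choices (e.g. a car could be red or blue), which plain regression averages away.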
Fig. 2. Our network architecture. Each conv layer refers to a block of 2 or 3 repeated
conv and ReLU layers, followed by a BatchNorm [30] layer. The net has no pool layers.
All changes in resolution are achieved through spatial downsampling or upsampling
between conv blocks.
[29]. In Section 3.1, we provide quantitative comparisons to Larsson et al., and encourage interested readers to investigate both concurrent papers.
2 Approach
We train a CNN to map from a grayscale input to a distribution over quantized color value outputs using the architecture shown in Figure 2. Architectural details are described in the supplementary materials on our project webpage, and the model is publicly available. In the following, we focus on the design of the objective function, and our technique for inferring point estimates of color from…
Cls. Det. Seg.
random 53.3 43.4 19.8
CC 65.9 46.9 35.6
SB 67.1 46.7 36.0
JP 67.7 53.2 —
[Figure 2 (Split-Brain Autoencoders, Lab images): the grayscale channel X_L predicts the color channels X_ab, and the color channels predict the grayscale channel; the two predicted halves are combined into the predicted image.]
Zhang et al., “Colorful Image Colorization”, ECCV 2016.
Zhang et al., “Split-brain autoencoders: Unsupervised learning by cross-channel prediction”, CVPR 2017.
(~ CVPR2017)
19. cvpaper.challenge 19
n DCGAN
➤ Pretext task : training an image generative model
- Proposes techniques, mainly architectural, that enable
high-quality generation
- Modeling the data distribution well => the features captured should be good
➤ The Discriminator's intermediate outputs are used as the representation
➤ No ImageNet => Pascal VOC experiment
➤ Compared with Exemplar CNN on CIFAR-10
Generative-model-based
on CIFAR-10
acc. (%) #features
Ex CNN 84.3 1024
DCGAN 82.8 512
Radford et al., “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016.
(~ CVPR2017)
The architectures and the datasets used for representation learning differ, so this cannot be called a fair comparison.
22. cvpaper.challenge 22
n BiGAN
➤ Unlike a standard GAN, which models only p(x|z) (the Generator), BiGAN also
models inference of the latent variable, p(z|x) (the Encoder)
➤ The Generator's joint distribution (p_G(x, z) = p_G(x|z)p(z)) and the Encoder's joint distribution
(p_E(x, z) = p_E(z|x)p(x)) are brought together within the same framework as a standard GAN
➤ Better results than a standard GAN that uses D's intermediate outputs as the feature representation
- D is not a general-purpose discriminator between the data distribution and everything else
Generative-model-based
Cls. Det. Seg.
random 53.3 43.4 19.8
BiGAN 60.3 46.9 35.2
JP 67.7 53.2 —
Donahue et al., “ADVERSARIAL FEATURE LEARNING”, ICLR 2017.
(~ CVPR2017)
[Figure 1: The structure of Bidirectional Generative Adversarial Networks (BiGAN): G maps latent samples z to data G(z), E maps data x to features E(x), and a discriminator D scores the pairs (G(z), z) against (x, E(x)).]
…the generator maps latent samples to generated data, but the framework does not include an inverse mapping from data to latent representation. Hence, we propose a novel unsupervised feature learning framework, Bidirectional Generative Adversarial Networks…
23. cvpaper.challenge 23
n Learning to Count (LC)
➤ Pretext task : learn features satisfying the following constraint
➤ Constraint: feed each tile and the original image into the same CNN; the output
feature of the original image must equal the sum of the output features of all tiles
=> the constraint can be satisfied when each dimension of the output feature represents
the amount of some high-level "primitive" in the image
➤ Personally a very interesting idea
Others
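The counting constraint can be written as a squared difference between the whole-image feature and the sum of the tile features; `counting_loss` is an illustrative sketch (the actual objective also adds a contrastive term so the features do not collapse to zero):

```python
import numpy as np

def counting_loss(feat_whole, feat_tiles):
    """Squared violation of the counting constraint: the feature of the
    full image should equal the element-wise sum of the features of its
    tiles, as if each dimension counted a visual primitive."""
    return float(((feat_whole - feat_tiles.sum(axis=0)) ** 2).sum())

whole = np.array([4.0, 2.0])                     # e.g. 4 of primitive A, 2 of B
tiles = np.array([[1.0, 1.0], [3.0, 1.0]])       # counts split across two tiles
loss = counting_loss(whole, tiles)
```

When the constraint holds exactly, the loss is zero; any mismatch between the whole-image "counts" and the tile sums is penalized.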
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP 67.7 53.2 —
Figure 3: Average response of our trained network on the ImageNet validation set. Despite its sparsity (30 non-zero entries), the hidden representation in the trained network performs well when transferred to the classification, detection and segmentation tasks.
Method Ref Class. Det.
Supervised [20] [43] 79.9 56.8
Random [33] 53.3 43.4
Context [9] [19] 55.3 46.6
Context [9]∗ [19] 65.3 51.1
Jigsaw [30] [30] 67.6 53.2
ego-motion [1] [1] 52.9 41.8
ego-motion [1]∗ [1] 54.2 43.9
Adversarial [10]∗ [10] 58.6 46.2
ContextEncoder [33] [33] 56.5 44.5
Sound [31] [44] 54.4 44.0
Sound [31]∗ [44] 61.3 -
Video [41] [19] 62.8 47.4
Video [41]∗ [19] 63.1 47.2
Colorization [43]∗ [43] 65.9 46.9
Split-Brain [44]∗ [44] 67.1 46.7
ColorProxy [22] [22] 65.9 -
The feature vector becomes something like a histogram of primitives.
Noroozi et al., “Representation Learning by Learning to Count”, ICCV 2017.
Same author as JP
(~ CVPR2018)
24. cvpaper.challenge 24
n Noise as target (NAT)
➤ Pretext task : put the outputs computed from each image into one-to-one correspondence
with uniformly sampled target vectors, and pull each output toward its target
- Targets should be assigned so that the total error over all samples is minimized
- A full search is infeasible, so the assignment is approximated per batch with the Hungarian algorithm
➤ At first glance it seems nonsensical, but spreading the image feature vectors uniformly
over the feature space apparently matters (see the Appendix)
Others
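The per-batch assignment can be sketched by brute force on a tiny batch; `assign_targets` is an illustrative sketch of the objective the Hungarian algorithm optimizes at scale:

```python
import itertools
import numpy as np

def assign_targets(features, targets):
    """Match each feature vector to a distinct fixed target vector so
    the total squared distance is minimal. Brute force over
    permutations for a tiny batch; at scale this assignment problem is
    what the Hungarian algorithm solves per mini-batch."""
    n = len(features)
    cost = ((features[:, None, :] - targets[None, :, :]) ** 2).sum(-1)
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)

features = np.array([[1.0, 0.0], [0.0, 1.0]])
targets = np.array([[0.0, 1.0], [1.0, 0.0]])
assignment = assign_targets(features, targets)
```

After the assignment is fixed for the batch, the network is trained to move each feature toward its assigned target.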
Cls. Det. Seg.
random 53.3 43.4 19.8
NAT 65.3 49.4 36.6
JP 67.7 53.2 —
Bojanowski et al., “Unsupervised Learning by Predicting Noise”, ICML 2017.
Unsupervised Learning by Predicting Noise
[Figure 1: images are mapped by a CNN to features P_f(X), which are assigned to fixed target vectors c_j in the target space. Caption (truncated): "Our approach takes a set of images, computes their deep…"]
Choosing the loss function. In the supervised setting, a popular choice for the loss ℓ is the softmax function. However, computing this loss is linear in the number of targets, making it impractical for large output spaces (Goodman, 2001). While there are workarounds to scale these losses to large output spaces, Tygert et al. (2017) has recently shown that using a squared ℓ2 distance works well in many supervised settings, as long as the final activations are unit normalized. This loss only requires access to a single target per sample, making its computation independent of the number of targets. This leads to the following problem:

min_θ min_{Y ∈ ℝ^{n×d}} (1/2n) ‖f_θ(X) − Y‖²_F,   (2)

where we still denote by f_θ(X) the unit normalized features.
Sampled from a uniform distribution, one target per data point, and kept fixed.
Figure 3. Images and their 3 nearest neighbors in ImageNet according to our model using an ℓ2 distance. The query images are shown on the top row, and the nearest neighbors are sorted from the closest to the furthest. Our features seem to capture global distinctive structures.
Figure 4. Filters from the first layer of an AlexNet trained on ImageNet with supervision (left) or with NAT (right). The filters are in grayscale, since we use grayscale gradient images as input. This visualization shows the composition of the gradients with…
4.2. Comparison with the state of the art
We report results on the transfer task both on ImageNet and
PASCAL VOC 2007. In both cases, the model is trained on
ImageNet.
ImageNet classification. In this experiment, we evaluate
the quality of our features for the object classification task
of ImageNet. Note that in this setup, we build the unsuper-
vised features on images that correspond to predefined im-
age categories. Even though we do not have access to cat-
egory labels, the data itself is biased towards these classes.
In order to evaluate the features, we freeze the layers up
to the last convolutional layer and train the classifier with
supervision. This experimental setting follows Noroozi &
Favaro (2016).
Nearest Neighbor
(~ CVPR2018)
26. cvpaper.challenge 26
n Spot Artifact (SA)
➤ Pretext task : repairing images corrupted at the feature-map level
- Adversarial training between repair layers that fill in the corruption and a discriminator
- Uses the feature maps of a model pretrained as an
autoencoder
- The discriminator is expected to acquire a good feature representation
➤ Corrupting the feature maps is expected to corrupt higher-level information
(the corruption is hard to notice when looking at the actual corrupted images)
Reconstruction-based
Cls. Det. Seg.
random 53.3 43.4 19.8
SA 69.8 52.5 38.1
JP 67.7 53.2 —
Figure 2. The proposed architecture. Two autoencoders {E, D1, D2, D3, D4, D5} output either real images (top row) or images with
artifacts (bottom row). A discriminator C is trained to distinguish them. The corrupted images are generated by masking the encoded
feature φ(x) and then by using a repair network {R1, R2, R3, R4, R5} distributed across the layers of the decoder. The mask is also used
by the repair network to change only the dropped entries of the feature (see Figure 5 for more details). The discriminator and the repair
network (both shaded in blue) are trained in an adversarial fashion on the real/corrupt classification loss. The discriminator is also trained
to output the mask used to drop feature entries, so that it learns to localize all artifacts.
Repair layers are inserted between the decoder layers.
The discriminator also estimates the corrupted locations.
Wu et al., “Self-Supervised Feature Learning by
Learning to Spot Artifacts ”, CVPR 2018.
Red: corrupt, Green: real
(~ CVPR2018)
CVPR2018
27. cvpaper.challenge 27
n Jigsaw Puzzle++
➤ Pretext task : JP with 1–3 patches replaced by patches from another image
- Fewer informative patches are visible, and patches from the other image must be identified as such
- This raises the difficulty of the pretext task
- Permutations are selected with the Hamming distance in mind so that no sample belongs to multiple classes
Discriminative
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP++ 69.8 55.5 38.1
JP 67.7 53.2 —
Figure 3: The Jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation…
Noroozi et al., “Boosting Self-Supervised Learning via
Knowledge Transfer ”, CVPR 2018.
Same author as JP
(~ CVPR2018)
CVPR2018
28. cvpaper.challenge 28
n Classify Rotation (CR)
➤ Pretext task : estimating the rotation of an image
- 4-class classification over 0°, 90°, 180°, 270°
- Finer-grained angles would require interpolation after rotation
=> artifacts arise and become a source of trivial solutions
➤ Estimating an object's rotation angle requires high-level information about the object
➤ Best accuracy so far (Cls., Det.) & the simplest implementation
Discriminative
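The four-way rotation task is simple to sketch; `make_rotation_batch` is an illustrative name:

```python
import numpy as np

def make_rotation_batch(img):
    """RotNet-style pretext batch (sketch): the four 90-degree rotations
    of one image, labelled 0..3. Multiples of 90 degrees need no
    interpolation, so no resampling artifacts can leak the label."""
    xs = [np.rot90(img, k) for k in range(4)]
    ys = list(range(4))
    return xs, ys

img = np.arange(16).reshape(4, 4)
xs, ys = make_rotation_batch(img)
```

All four rotated copies go through the same ConvNet, which must predict which rotation it is seeing.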
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
[Figure 2 diagram: the input image X is rotated by 0, 90, 180, and 270 degrees (g(X, y) for y = 0…3); each rotated copy X_y is fed to the same ConvNet model F(.), whose objective is to maximize the probability F_y(X_y) of the correct rotation class.]
Figure 2: Illustration of the self-supervised task that we propose for semantic feature learning.
Given four possible geometric transformations, the 0, 90, 180, and 270 degrees rotations, we train
a ConvNet model F(.) to recognize the rotation that is applied to the image that it gets as input.
Gidaris et al., “Unsupervised Representation Learning by predicting Image Rotation”, ICLR 2018.
(~ CVPR2018)
29. cvpaper.challenge 29
n Classify Rotation (CR)
➤ Dependence on the data structure
➤ In some image domains, couldn't the rotation be estimated from low-level features?
- Indeed, it underperforms on the Places scene classification task
➤ There should also be images for which a rotation is not even well defined
- e.g. aerial photographs
Discriminative
Gidaris et al., “Unsupervised Representation Learning by predicting Image Rotation”, ICLR 2018.
Method Conv1 Conv2 Conv3 Conv4 Conv5
Random 11.6 17.1 16.9 16.3 14.1
Random rescaled Krähenbühl et al. (2015) 17.5 23.0 24.5 23.2 20.6
Context (Doersch et al., 2015) 16.2 23.3 30.2 31.7 29.6
Context Encoders (Pathak et al., 2016b) 14.1 20.7 21.0 19.8 15.5
Colorization (Zhang et al., 2016a) 12.5 24.5 30.4 31.5 30.3
Jigsaw Puzzles (Noroozi & Favaro, 2016) 18.2 28.8 34.0 33.9 27.1
BIGAN (Donahue et al., 2016) 17.7 24.5 31.0 29.9 28.0
Split-Brain (Zhang et al., 2016b) 17.7 29.3 35.4 35.2 32.8
Counting (Noroozi et al., 2017) 18.0 30.6 34.3 32.5 25.7
(Ours) RotNet 18.8 31.7 38.7 38.2 36.5
Table 6: Task & Dataset Generalization: Places top-1 classification with linear layers. We
compare our unsupervised feature learning approach with other unsupervised approaches by training
logistic regression classifiers on top of the feature maps of each layer to perform the 205-way Places
classification task (Zhou et al., 2014). All unsupervised methods are pre-trained (in an unsupervised
way) on ImageNet. All weights are frozen and feature maps are spatially resized (with adaptive max
pooling) so as to have around 9000 elements. All approaches use AlexNet variants and were pre-
trained on ImageNet without labels except the Place labels, ImageNet labels, and Random entries.
Method Conv1 Conv2 Conv3 Conv4 Conv5
Places labels Zhou et al. (2014) 22.1 35.1 40.2 43.3 44.6
ImageNet labels 22.7 34.8 38.4 39.4 38.7
Random 15.7 20.3 19.8 19.1 17.5
Random rescaled Krähenbühl et al. (2015) 21.4 26.2 27.1 26.1 24.0
Context (Doersch et al., 2015) 19.7 26.7 31.9 32.7 30.9
Context Encoders (Pathak et al., 2016b) 18.2 23.2 23.4 21.9 18.4
Colorization (Zhang et al., 2016a) 16.0 25.7 29.6 30.3 29.7
Jigsaw Puzzles (Noroozi & Favaro, 2016) 23.0 31.9 35.0 34.2 29.3
BIGAN (Donahue et al., 2016) 22.0 28.7 31.8 31.3 29.7
Split-Brain (Zhang et al., 2016b) 21.3 30.7 34.0 34.1 32.5
Counting (Noroozi et al., 2017) 23.3 33.9 36.3 34.7 29.6
(Ours) RotNet 21.5 31.0 35.1 34.6 33.7
classification tasks of ImageNet, Places, and PASCAL VOC datasets and on the object detection and object segmentation tasks of PASCAL VOC.
Implementation details: For those experiments we implemented our RotNet model with an AlexNet architecture. Our implementation of the AlexNet model does not have local response normalization units, dropout units, or groups in the convolutional layers while it includes batch normalization…
Places: for example, the rotation can be estimated from the position of the sky alone.
(~ CVPR2018)
33. cvpaper.challenge 33
n Deep Cluster (DC)
➤ Repeat the following two steps:
① k-means clustering based on the CNN's intermediate features
② train a classification problem using the assigned clusters as pseudo-labels
➤ In the first iteration, clustering is based on the outputs of a randomly initialized CNN
- Even training an MLP on those outputs yields 12%
=> the input information is preserved to some extent
➤ In the ImageNet experiments, k = 10000 (> 1000) works best
➤ A simple yet very powerful method
Recent trends
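The two alternating steps can be sketched with a minimal k-means; `kmeans_labels` and its farthest-point initialization are illustrative simplifications of the actual pipeline:

```python
import numpy as np

def kmeans_labels(feats, k=2, iters=10):
    """Minimal k-means returning cluster assignments (sketch).
    Farthest-point initialization avoids degenerate starts."""
    centers = [feats[0]]
    while len(centers) < k:
        d = np.min([((feats - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(feats[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(0)
    return labels

# DeepCluster alternation (sketch): labels = kmeans_labels(cnn_features, k)
# then serve as pseudo-labels for a supervised classification step, and
# the two steps repeat.
feats = np.vstack([np.zeros((5, 2)), 10 * np.ones((5, 2))])
labels = kmeans_labels(feats, k=2)
```

Each round of clustering refreshes the pseudo-labels, and each round of classification refreshes the features the next clustering uses.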
Caron et al., “Deep Clustering for Unsupervised Learning of Visual Features ”, ECCV 2018.
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
DC 73.7 55.4 45.1
34. cvpaper.challenge 34
[Figure (Deep Clustering paper): (a) clustering quality; (b) cluster reassignment; (c) influence of k.]
The mutual information between ImageNet labels and the cluster assignments increases over training.
The mutual information between consecutive epochs increases
=> the cluster assignments stabilize.
35. cvpaper.challenge 35
n Deep INFOMAX (DIM)
➤ Learn to maximize the mutual information I(x; z) between the input x and the feature vector z
- Roughly speaking, make the dependence between x and z large
- In practice, maximizing the mutual information between z and each patch of x is what delivers the large gains
➤ Simply attaching a discriminator that classifies (x, z) pairs as positive or negative and
training end-to-end maximizes a lower bound on I(x; z)
➤ No GAN-style alternating optimization, so implementation and training are easy
➤ Not compared against all methods, but accuracy close to supervised learning
Recent trends
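One common way to turn this pair discrimination into an MI lower bound is a GAN-style binary cross-entropy over positive and negative (input, feature) pairs; `dim_pair_loss` is an illustrative sketch, not the paper's exact estimator:

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def dim_pair_loss(pos_scores, neg_scores):
    """Pair-discrimination loss (sketch): positives are true
    (patch, feature) pairs, negatives pair the feature with patches
    from another image; minimizing this binary cross-entropy pushes
    the discriminator scores apart and tightens an MI lower bound."""
    pos = sum(softplus(-s) for s in pos_scores) / len(pos_scores)
    neg = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return pos + neg
```

Because the encoder and discriminator are trained jointly on the same loss, no alternating optimization is needed.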
Hjelm et al., “Learning deep representations by mutual information estimation and maximization”, arXiv 8/2018.
Figure 1: The base encoder model in the context of image data. An image (in this case) is encoded into a convolutional network until reaching a feature map of M × M feature vectors corresponding to M × M input patches. These vectors are summarized (for instance, using additional convolutions and fully-connected layers) into a single feature vector, Y. Our goal is to train this network such that relevant information about the input is extractable from the high-level features.
Figure 2: Deep INFOMAX (DIM) with a global MI(X; Y) objective. Here, we pass both the high-level feature vector, Y, and the lower-level M × M feature map (see Figure 1) through a discriminator composed of additional convolutions, flattening, and fully-connected layers to get the score. Fake samples are drawn by combining the same feature vector with an M × M feature map from another image.
Table 2: Classification accuracy (top 1) results on Tiny ImageNet and STL. DIM with the local objective outperforms all other models presented, and approaches the accuracy of a fully-supervised classifier with a similar architecture…

Method | Tiny ImageNet conv | Tiny ImageNet fc (4096) | Tiny ImageNet Y (64) | STL conv
Fully supervised | 36.60 | — | — | —
VAE | 18.63 | 16.88 | 11.93 | 58.27
AAE | 18.04 | 17.27 | 11.49 | 59.54
BiGAN | 24.38 | 20.21 | 13.06 | 71.53
NAT | 13.70 | 11.62 | 1.20 | 64.32
DIM(G) | 11.32 | 6.34 | 4.95 | 42.03
DIM(L) | 33.8 | 34.5 | 30.7 | 71.82

Table 3: Extended comparisons on CIFAR10. Linear classification… MS-SSIM is estimated by training a separate decoder…
Accuracy close to supervised learning on Tiny ImageNet.
36. cvpaper.challenge 36
n Contrastive Predictive Coding (CPC)
➤ For sequential data, maximize the mutual information between the feature vector c_t at some
time step and the future inputs x_{t+k}
➤ Here the lower bound on the mutual information is maximized by having the discriminator solve
an N-class problem: identify the single positive pair among N pairs
➤ For images, the feature map is treated as a sequence running from top to bottom, as in the figure
➤ Not compared against all methods, but by far the best accuracy within the reported experiments
Recent trends
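The N-way classification bound is the InfoNCE loss; `info_nce` sketches it for raw pair scores:

```python
import math

def info_nce(scores, pos_index=0):
    """InfoNCE (sketch): -log softmax of the positive pair's score
    among N candidate pairs. Minimizing it maximizes a lower bound on
    the mutual information; the bound saturates at log N, so more
    negatives allow a tighter bound."""
    m = max(scores)                                           # stabilize the log-sum-exp
    logsumexp = m + math.log(sum(math.exp(s - m) for s in scores))
    return logsumexp - scores[pos_index]
```

The higher the positive pair scores relative to the N-1 negatives, the smaller the loss.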
Oord et al., “Representation Learning with Contrastive Predictive Coding”, arXiv 6/2018.
[Figure 4: Visualization of Contrastive Predictive Coding for images (2D adaptation of Figure 1): 64 px patches with 50% overlap are cropped from a 256 px image, encoded by g_enc, and summarized top-down by the autoregressive g_ar into c_t, which predicts z_{t+2}, z_{t+3}, z_{t+4}.]
To understand the representations extracted by CPC, we measure the phone prediction performance with a linear classifier trained on top of these features, which shows how linearly separable the…
Method Top-1 ACC
Using AlexNet conv5
Video [27] 29.8
Relative Position [11] 30.4
BiGan [34] 34.8
Colorization [10] 35.2
Jigsaw [28] * 38.1
Using ResNet-V2
Motion Segmentation [35] 27.6
Exemplar [35] 31.5
Relative Position [35] 36.2
Colorization [35] 39.6
CPC 48.7
Table 3: ImageNet top-1 unsupervised classification results. *Jigsaw is not directly comparable to the other AlexNet results because of architectural differences.