Slides presented in the All Japan Computer Vision Study Group on May 15, 2022. Methods for disentangling the relationship between multimodal data are discussed.
VAEs for multimodal disentanglement
1. AGREEMENT
• If you plan to share these slides or to use the content in these slides for your own work,
please include the following reference:
• If you publish these slides or use them for your own work, please include the reference below:
Tejero-de-Pablos A. (2022) “VAEs for multimodal disentanglement”. All Japan Computer Vision Study Group.
7. What is a VAE?
• Auto-encoder
• Variational auto-encoder
With the proper regularization (a KL term that keeps the approximate posterior q(z|x) close to the prior p(z)), the latent space becomes smooth and can be sampled from to generate new data.
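As a minimal illustration (a sketch, not the implementation used in any of the papers discussed later), the VAE objective is a reconstruction term plus this KL regularizer; the encoder/decoder sizes below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: Gaussian encoder q(z|x), standard-normal prior p(z)."""
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy_with_logits(x_hat, x, reduction="sum")  # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())           # pull q(z|x) toward N(0, I)
    return recon + kl

model = VAE()
x = torch.rand(32, 784)                      # toy batch in [0, 1]
x_hat, mu, logvar = model(x)
loss = vae_loss(x_hat, x, mu, logvar)
```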
8. There is more!
• Vector Quantized-VAE
Quantize the bottleneck using a discrete codebook
There are a number of algorithms (like transformers) that are designed to work on discrete data, so we would like a discrete representation of the data for these algorithms to use.
Advantages of VQ-VAE:
- Simplified latent space (easier to train)
- Likelihood-based model: does not suffer from the problems of mode collapse and lack of diversity
- Real-world data favors a discrete representation (the number of images that make sense is, in a way, finite)
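A minimal sketch of the quantization step described above (a simplified, assumed form; the codebook size and commitment weight beta are illustrative): the encoder output is snapped to its nearest codebook entry, and a straight-through estimator lets gradients bypass the discrete lookup.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor lookup into a discrete codebook (VQ-VAE bottleneck)."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z_e):                               # z_e: (batch, code_dim) encoder output
        d = torch.cdist(z_e, self.codebook.weight)        # distances to every codebook entry
        idx = d.argmin(dim=1)                             # discrete code indices
        z_q = self.codebook(idx)                          # quantized vectors
        # Codebook loss (move codes toward encodings) + commitment loss (and vice versa)
        loss = ((z_q - z_e.detach()) ** 2).mean() + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        # Straight-through estimator: gradients flow to z_e as if quantization were the identity
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss

vq = VectorQuantizer()
z_q, codes, vq_loss = vq(torch.randn(8, 64))
```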
9. Why are VAEs cool?
• Usage of VAEs (state-of-the-art)
Multimodal generation (DALL-E)
Representation learning, latent space disentanglement
11. Today I’m introducing:
1) Shi, Y., Paige, B., & Torr, P. (2019). Variational mixture-of-experts autoencoders for multi-modal
deep generative models. Advances in Neural Information Processing Systems, 32.
2) Lee, M., & Pavlovic, V. (2021). Private-shared disentangled multimodal VAE for learning of
latent representations. Conference on Computer Vision and Pattern Recognition (pp. 1692-
1700).
3) Joy, T., Shi, Y., Torr, P. H., Rainforth, T., Schmon, S. M., & Siddharth, N. (2022). Learning
Multimodal VAEs through Mutual Supervision. International Conference on Learning
Representations.
12. Motivation and goal
• Importance of multimodal data
Learning in the real world involves multiple perspectives: visual, auditory, linguistic
Understanding them individually allows only a partial learning of concepts
• Understanding how different modalities work together is not trivial
A similar joint-embedding process happens in the brain for reasoning and understanding
• Multimodal VAEs facilitate representation learning on data with multiple views/modalities
Capture common underlying factors between the modalities
13. Motivation and goal
• Normally, only the shared aspects of modalities are modeled
The private information of each modality is totally LOST
E.g., image captioning
• Leverage VAE’s latent space for disentanglement
Private spaces are leveraged for modeling the disjoint properties of each
modality, and cross-modal generation
• Basically, such disentanglement can be used as:
An analytical tool to understand how modalities intertwine
A way of cross-generating modalities
14. Motivation and goal
• [1] and [2] propose a similar methodology
According to [1], a true multimodal generative model should meet four criteria:
Today I will introduce [2] (the most recent), and briefly explain the differences with respect to [3]
15. Dataset
• Digit images: MNIST & SVHN
- Shared features: Digit class
- Private features: Number style, background, etc.
Image domains as different modalities?
• Flower images and text description: Oxford-102 Flowers
- Shared features: Words and image features present in both modalities
- Private features: Words and image features exclusive to their own modality
16. Related work
• Multimodal generation and joint multimodal VAEs (e.g., JMVAE, MVAE)
The learning of a common disentangled embedding (i.e., private-shared) is often ignored
Only some works in image-to-image translation separate "content" (~shared) and "style" (~private) in the latent space (e.g., via an adversarial loss)
These apply exclusively between image modalities, and are not suitable for heterogeneous modalities such as image and text
• Domain adaptation
Learning joint embeddings of multimodal observations
17. Proposed method: DMVAE
• Generative variational model: Introducing separate shared and private spaces
Usage: Cross-generation (analytical tool)
• Representations are induced using a separate encoder-decoder pair per modality
• Consistency of representations via Product of Experts (PoE). For a number of modalities N:
q(z_s \mid x_1, x_2, \dots, x_N) \propto p(z_s) \prod_{n=1}^{N} q(z_s \mid x_n)
In VAE, inference networks and priors assume conditional Gaussian forms:
p(z) = \mathcal{N}(z \mid 0, I), \qquad q(z \mid x_n) = \mathcal{N}(z \mid \mu_n, C_n)
Each modality's posterior sample splits into a private and a shared part:
z_1 \sim q_{\phi_1}(z \mid x_1), \quad z_2 \sim q_{\phi_2}(z \mid x_2), \qquad z_1 = (z_{p_1}, z_{s_1}), \quad z_2 = (z_{p_2}, z_{s_2})
We want z_s = z_{s_1} = z_{s_2} → Product of Experts (PoE)
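For Gaussian experts this product has a closed form: precisions add and means are precision-weighted. A small sketch under that assumption (diagonal covariances; shapes are illustrative):

```python
import torch

def gaussian_poe(mus, logvars):
    """q(z_s | x_1..x_N) ∝ p(z_s) ∏_n q(z_s | x_n) for diagonal Gaussians,
    with a standard-normal prior expert N(0, I) included."""
    mu = torch.stack([torch.zeros_like(mus[0])] + list(mus))                       # prior + experts
    var = torch.stack([torch.ones_like(mus[0])] + [lv.exp() for lv in logvars])
    precision = 1.0 / var
    var_s = 1.0 / precision.sum(dim=0)                 # fused variance
    mu_s = var_s * (precision * mu).sum(dim=0)         # precision-weighted mean
    return mu_s, var_s.log()

# e.g., fusing the shared posteriors of two modalities (batch of 4, z_dim 10)
mu_s, logvar_s = gaussian_poe([torch.randn(4, 10), torch.randn(4, 10)],
                              [torch.randn(4, 10), torch.randn(4, 10)])
```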
18. Proposed method: DMVAE
• Reconstruction inference
PoE-induced shared inference allows for inference when one or more modalities are missing
Thus, we consider three reconstruction tasks:
- Reconstruct both modalities at the same time: x_1, x_2 → x̂_1, x̂_2 using (z_{p_1}, z_{p_2}, z_s)
- Reconstruct a single modality from its own input: x_1 → x̂_1 using (z_{p_1}, z_s), or x_2 → x̂_2 using (z_{p_2}, z_s)
- Reconstruct a single modality from the opposite modality's input: x_2 → x̂_1 using (z_{p_1}, z_s), or x_1 → x̂_2 using (z_{p_2}, z_s)
• Loss function
Accuracy of reconstruction for jointly learned shared latent + KL-divergence of each normal distribution
Accuracy of cross-modal and self reconstruction + KL-divergence
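Putting the pieces together, here is a toy two-modality sketch of these terms (linear encoders/decoders as stand-ins for DMVAE's networks, with an inline Gaussian PoE; an assumed simplification, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kl_std_normal(mu, logvar):
    # KL( N(mu, exp(logvar)) || N(0, I) )
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def sample(mu, logvar):
    # Reparameterized sample
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# Toy per-modality encoders/decoders (illustrative stand-ins only)
x_dim, p_dim, s_dim = 20, 4, 4
enc1, dec1 = nn.Linear(x_dim, 2 * (p_dim + s_dim)), nn.Linear(p_dim + s_dim, x_dim)
enc2, dec2 = nn.Linear(x_dim, 2 * (p_dim + s_dim)), nn.Linear(p_dim + s_dim, x_dim)

def split(h):
    # Encoder output -> (private mu, private logvar, shared mu, shared logvar)
    mu, logvar = h.chunk(2, dim=-1)
    return mu[:, :p_dim], logvar[:, :p_dim], mu[:, p_dim:], logvar[:, p_dim:]

x1, x2 = torch.randn(8, x_dim), torch.randn(8, x_dim)
mu_p1, lv_p1, mu_s1, lv_s1 = split(enc1(x1))
mu_p2, lv_p2, mu_s2, lv_s2 = split(enc2(x2))

# Gaussian PoE over the shared posteriors, with a N(0, I) prior expert
prec = 1.0 / lv_s1.exp() + 1.0 / lv_s2.exp() + 1.0
lv_s = (1.0 / prec).log()
mu_s = (1.0 / prec) * (mu_s1 / lv_s1.exp() + mu_s2 / lv_s2.exp())

z_p1, z_p2 = sample(mu_p1, lv_p1), sample(mu_p2, lv_p2)
z_s, z_s1, z_s2 = sample(mu_s, lv_s), sample(mu_s1, lv_s1), sample(mu_s2, lv_s2)

def rec(dec, z_p, z_sh, x):
    return F.mse_loss(dec(torch.cat([z_p, z_sh], dim=-1)), x, reduction="sum")

loss = (
    rec(dec1, z_p1, z_s, x1) + rec(dec2, z_p2, z_s, x2)      # joint reconstruction (PoE shared code)
    + rec(dec1, z_p1, z_s1, x1) + rec(dec2, z_p2, z_s2, x2)  # self reconstruction
    + rec(dec1, z_p1, z_s2, x1) + rec(dec2, z_p2, z_s1, x2)  # cross reconstruction
    + kl_std_normal(mu_p1, lv_p1) + kl_std_normal(mu_p2, lv_p2) + kl_std_normal(mu_s, lv_s)
)
loss.backward()
```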
19. Experiments: Digits (image-image)
• Evaluation
Qualitative: Cross-generation between modalities
Quantitative: Accuracy of the cross-generated images using a pre-trained classifier for each modality
- Joint: A sample from zs generates two image modalities that must be assigned the same class
[Figure: cross-generation examples. For an input from one modality, outputs are generated with different samples of z_{p_2}; for an input from the other modality, outputs are generated with different samples of z_{p_1}.]
21. Experiments: Flowers (image-text)
• This task is more complex
Instead of the raw image and text, intermediate feature representations are reconstructed
• Quantitative evaluation
Class recognition (image-to-text) and cosine-similarity retrieval (text-to-image) on the shared latent space
• Qualitative evaluation
Retrieval
22. Conclusions
• Multimodal VAE for disentangling private and shared spaces
Improve the representational performance of multimodal VAEs
Successful application to image-image and image-text modalities
• Shaping a latent space into subspaces that capture the private-shared aspects of the
modalities
“is important from the perspective of downstream tasks, where better decomposed representations are more
amenable for using on a wider variety of tasks”
23. [3] Multimodal VAEs via mutual supervision
• Main differences with [1] and [2]
A type of multimodal VAE, without private-shared disentanglement
Does not rely on factorizations such as MoE or PoE for modeling modality-shared information
Instead, it repurposes semi-supervised VAEs for combining inter-modality information
- Allows learning from partially-observed modalities (the regularizer is a KL divergence)
• Proposed method: Mutually supErvised Multimodal vaE (MEME)
24. [3] Multimodal VAEs via mutual supervision
• Qualitative evaluation
Cross-modal generation
• Quantitative evaluation
Coherence: Percentage of matching predictions of the cross-generated modality using a pretrained classifier
Relatedness: Wasserstein distance between the representations of two modalities (closer if they belong to the same class)
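The coherence metric reduces to a classification accuracy over cross-generated samples; a tiny sketch with placeholder names (classifier, cross_generated, labels are assumptions, not from the paper):

```python
import torch

@torch.no_grad()
def coherence(classifier, cross_generated, labels):
    """Percentage of cross-generated samples whose predicted class (from a
    pretrained classifier) matches the class of the source input."""
    preds = classifier(cross_generated).argmax(dim=1)
    return 100.0 * (preds == labels).float().mean().item()
```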
26. Final remarks
• VAEs are useful not only for generation but also for reconstruction and disentanglement tasks
Recommended textbook: “An Introduction to Variational Autoencoders”, Kingma & Welling
• Private-shared latent spaces as an effective tool for analyzing multimodal data
• There is still a lot of potential for this research
So far it has been applied only to a limited number of multimodal problems
• PhD students interested in this topic → we are recruiting interns
https://www.cyberagent.co.jp/news/detail/id=27453
Other topics are also fine!
Collaborative research is also very welcome!