Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
Tutorial page:
https://imatge.upc.edu/web/publications/one-perceptron-rule-them-all-language-vision-audio-and-speech-tutorial
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures to encode and decode vision, text and audio, and later review those models that have successfully translated information across modalities.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
https://mcv-m6-video.github.io/deepvideo-2020/
Self-supervised audiovisual learning exploits the synchronization between pixels and audio recorded in video files. This lecture reviews the state of the art in deep neural networks trained with this approach, which does not require any manual annotation from humans.
https://github.com/mcv-m6-video/deepvideo-2019
The synchronization of the visual and audio tracks recorded in videos can be used as a supervisory signal for machine learning. This presentation reviews some recent research on this topic exploiting the capabilities of deep neural networks.
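To make the supervisory signal concrete, below is a minimal sketch, with toy tensors and stand-in linear encoders (all hypothetical placeholders for real video/audio networks), of how synchronization can be turned into a free binary classification task: aligned clip/audio pairs are positives, and pairing a clip with another clip's audio yields negatives.

```python
# A minimal sketch of audio-visual synchronization as self-supervision.
import torch
import torch.nn as nn

class SyncClassifier(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Stand-ins for real video/audio encoders (e.g., a 3D CNN and a 1D CNN).
        self.video_enc = nn.Linear(512, dim)
        self.audio_enc = nn.Linear(128, dim)
        self.head = nn.Linear(2 * dim, 1)  # binary: in-sync vs out-of-sync

    def forward(self, video_feat, audio_feat):
        v = self.video_enc(video_feat)
        a = self.audio_enc(audio_feat)
        return self.head(torch.cat([v, a], dim=-1)).squeeze(-1)

model = SyncClassifier()
video = torch.randn(8, 512)            # features of 8 video clips
audio = torch.randn(8, 128)            # features of the aligned audio
neg_audio = audio.roll(1, dims=0)      # misaligned pairs by shifting the batch
logits = torch.cat([model(video, audio), model(video, neg_audio)])
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()  # no human-generated labels were needed anywhere
```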
http://ixa2.si.ehu.es/deep_learning_seminar/
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language and vision. Image captioning, visual question answering or multimodal translation are some of the first applications of a new and exciting field that exploits the generalization properties of deep neural representations. This talk will provide an overview of how vision and language problems are addressed with deep neural networks, and the exciting challenges being addressed nowadays by the research community.
These slides review the research of our lab since 2016 on applied deep learning, starting from our participation in the TRECVID Instance Search 2014, moving into video analysis with CNN+RNN architectures, and our current efforts in sign language translation and production.
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://mcv-m6-video.github.io/deepvideo-2019/
These slides provide an overview of how deep neural networks can be used to solve an object tracking task.
https://mcv-m6-video.github.io/deepvideo-2019/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
Master in Computer Vision Barcelona, 2019
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
https://imatge-upc.github.io/wav2pix/
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or one-hot encoding). Our model is trained in a self-supervised fashion by exploiting the audio and visual signals naturally aligned in videos. With the purpose of training from video data, we present a novel dataset collected for this work, with high-quality videos of ten YouTubers with notable expressiveness in both the speech and visual signals.
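As an illustration of speech-conditioned generation, here is a minimal sketch, not the actual wav2pix architecture, of a generator conditioned on an embedding computed from the raw waveform; all layer sizes and shapes are arbitrary assumptions.

```python
# A minimal sketch of conditioning a GAN generator on raw speech.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=15, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=15, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, wav):                  # wav: (batch, 1, samples)
        return self.fc(self.conv(wav).squeeze(-1))

class Generator(nn.Module):
    def __init__(self, dim=128, noise=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + noise, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64 * 3), nn.Tanh(),  # tiny 64x64 RGB face
        )

    def forward(self, speech_emb, z):
        out = self.net(torch.cat([speech_emb, z], dim=-1))
        return out.view(-1, 3, 64, 64)

enc, gen = SpeechEncoder(), Generator()
wav = torch.randn(4, 1, 16000)               # one second of 16 kHz audio
faces = gen(enc(wav), torch.randn(4, 100))   # faces conditioned on the voice
```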
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://mcv-m6-video.github.io/deepvideo-2018/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
Prepared for the Master in Computer Vision Barcelona:
http://pagines.uab.cat/mcv/
Language and speech technologies are rapidly evolving thanks to the current advances in artificial intelligence. The convergence of large-scale datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Applications such as machine translation or speech recognition can be tackled from a neural perspective with novel architectures that combine convolutional and/or recurrent models with attention. This winter school overviews the state of the art on deep learning for speech and language and introduces the programming skills and techniques required to train these systems.
https://telecombcn-dl.github.io/2019-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://mcv-m6-video.github.io/deepvideo-2018/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
Prepared for the Master in Computer Vision Barcelona:
http://pagines.uab.cat/mcv/
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://mcv-m6-video.github.io/deepvideo-2020/
Self-supervised techniques define surrogate tasks to train machine learning algorithms without the need of human generated labels. This lecture reviews the state of the art in the field of computer vision, including the baseline techniques based on visual feature learning from ImageNet data.
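One of those baseline surrogate tasks can be sketched in a few lines: predicting which of four rotations was applied to an unlabeled image (RotNet-style), so the labels come for free from the transformation itself. The snippet below is a toy illustration, not the exact published setup.

```python
# A minimal sketch of a rotation-prediction surrogate task.
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=4)  # classes: 0/90/180/270 deg
images = torch.randn(16, 3, 224, 224)               # unlabeled images
rot = torch.randint(0, 4, (16,))                    # free "labels"
rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                       for img, k in zip(images, rot)])
loss = torch.nn.functional.cross_entropy(model(rotated), rot)
loss.backward()  # the features learned this way transfer to real tasks
```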
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://mcv-m6-video.github.io/deepvideo-2019/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
This lecture reviews methods that allow interpreting the outcomes of a deep convolutional neural network. It presents some of the techniques proposed in the literature.
https://mcv-m6-video.github.io/deepvideo-2018/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
Prepared for the Master in Computer Vision Barcelona:
http://pagines.uab.cat/mcv/
https://mcv-m6-video.github.io/deepvideo-2019/
This lecture provides an overview of how the temporal information encoded in video sequences can be exploited to learn visual features from a self-supervised perspective. Self-supervised learning is a type of unsupervised learning in which the data itself provides the necessary supervision to estimate the parameters of a machine learning algorithm; see the sketch after the course link below.
Master in Computer Vision Barcelona 2019.
http://pagines.uab.cat/mcv/
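A minimal sketch of this idea, with a hypothetical per-frame encoder standing in for a real CNN: the network learns to tell correctly ordered frame triplets from shuffled ones, so the video's own temporal structure provides the labels.

```python
# A minimal sketch of temporal order verification as self-supervision.
import torch
import torch.nn as nn

frame_enc = nn.Linear(512, 64)           # stand-in for a per-frame CNN
order_head = nn.Linear(3 * 64, 1)        # in-order vs shuffled

def score(triplet):                      # triplet: (batch, 3, 512)
    feats = frame_enc(triplet).flatten(1)
    return order_head(feats).squeeze(-1)

clips = torch.randn(8, 3, 512)                      # ordered frame features
shuffled = clips[:, torch.tensor([1, 0, 2])]        # permute the frame order
logits = torch.cat([score(clips), score(shuffled)])
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
```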
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep learning. In this talk we will review the latest results on how convolutional and recurrent neural networks are combined to find the most hidden patterns in multimedia.
https://telecombcn-dl.github.io/2018-dlmm/
Machine learning and deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. Ever wondered what all the fuss is about? Or what these technologies can do for you? Are you working in the field and wish to enhance your current knowledge of some specific techniques?
Insight@DCU will host a 2 day workshop on Machine Learning on May 21st and 22nd, which will help to answer your questions, whether a novice or knowledgeable in the field.
This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
https://mcv-m6-video.github.io/deepvideo-2019/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
Master in Computer Vision Barcelona, 2019
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalisation properties of deep learning. Get the latest results on how convolutional and recurrent neural networks are combined to find the most hidden patterns in multimedia.
https://re-work.co/events/deep-learning-summit-london-2017/
Machine translation and computer vision have greatly benefited from the advances in deep learning. A large and diverse amount of textual and visual data have been used to train neural networks, whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, such as low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses.
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Machine translation and computer vision have greatly benefited from the advances in deep learning. Large and diverse amounts of textual and visual data have been used to train neural networks, whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, such as low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses. This talk will present these challenges and the How2Sign dataset (https://how2sign.github.io) recorded at CMU in collaboration with UPC, BSC, Gallaudet University and Facebook.
https://imatge.upc.edu/web/publications/sign-language-translation-and-production-multimedia-and-multimodal-challenges-all
Modeling perceptual similarity and shift invariance in deep networks (NAVER Engineering)
Abstract: While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on ImageNet classification have been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new dataset of human perceptual similarity judgments. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by large margins on our dataset. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
Despite their strong transfer performance, deep convolutional representations surprisingly lack a basic low-level property -- shift-invariance, as small input shifts or translations can cause drastic changes in the output. Commonly used downsampling methods, such as max-pooling, strided-convolution, and average-pooling, ignore the sampling theorem. The well-known signal processing fix is anti-aliasing by low-pass filtering before downsampling. However, simply inserting this module into deep networks degrades performance; as a result, it is seldom used today. We show that when integrated correctly, it is compatible with existing architectural components, such as max-pooling and strided-convolution. We observe increased accuracy in ImageNet classification, across several commonly-used architectures, such as ResNet, DenseNet, and MobileNet, indicating effective regularization. Furthermore, we observe better generalization, in terms of stability and robustness to input corruptions. Our results demonstrate that this classical signal processing technique has been undeservingly overlooked in modern deep networks.
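The signal-processing fix described above is easy to sketch: blur with a fixed low-pass kernel, then subsample. The snippet below is a simplified illustration of the idea, not the paper's exact implementation.

```python
# A minimal sketch of anti-aliased downsampling (blur, then subsample).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Blur with a fixed 3x3 binomial kernel, then subsample by `stride`."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1., 2., 1.])
        k = torch.outer(k, k)
        k = k / k.sum()
        # One copy of the kernel per channel (depthwise filtering).
        self.register_buffer("kernel", k.repeat(channels, 1, 1, 1))
        self.stride = stride
        self.channels = channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")  # keep borders sensible
        return F.conv2d(x, self.kernel, stride=self.stride,
                        groups=self.channels)

x = torch.randn(1, 64, 32, 32)
y = BlurPool2d(64)(x)  # (1, 64, 16, 16), with far less aliasing
```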
Deep neural networks have revolutionized the data analytics scene by improving results in several diverse benchmarks with the same recipe: learning feature representations from data. These achievements have raised the interest across multiple scientific fields, especially in those where large amounts of data and computation are available. This change of paradigm in data analytics has several ethical and economic implications that are driving large investments, political debates and sounding press coverage under the generic label of artificial intelligence (AI). This talk will present the fundamentals of deep learning through the classic example of image classification, and point at how the same principle has been adopted for several tasks. Finally, some of the forthcoming potentials and risks of AI will be pointed out.
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
This talk introduces computational social science as a new research discipline, gives a brief introduction to natural language processing and explains how word vector representations are computed and how to use them in Python.
Word vector representations like word2vec encode semantic relationships like gender and "is the capital city of". This makes it easy to find similar words and compare them visually. In this talk, I'm comparing Wikipedia article revisions using word2vec.
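A minimal sketch of this workflow with gensim, on a toy corpus (a real analysis would use a large pretrained model or corpus, where the analogy queries become reliable):

```python
# A minimal sketch of training and querying word2vec vectors with gensim.
from gensim.models import Word2Vec

sentences = [["paris", "is", "the", "capital", "of", "france"],
             ["berlin", "is", "the", "capital", "of", "germany"],
             ["madrid", "is", "the", "capital", "of", "spain"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=200, seed=1)

# Nearest neighbours in the embedding space.
print(model.wv.most_similar("paris", topn=3))

# The classic analogy pattern: paris - france + germany ~ berlin
# (meaningful only with much larger training corpora than this toy one).
print(model.wv.most_similar(positive=["paris", "germany"],
                            negative=["france"], topn=1))
```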
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)
June 10 (Thu), 14:30-15:00
Lecturer: Huy H. Nguyen (SOKENDAI / National Institute of Informatics)
Abstract: Advances in machine learning and their interplay with computer graphics allow us to easily generate high-quality images and videos. State-of-the-art manipulation methods enable the real-time manipulation of videos obtained from social networks. It is also possible to generate videos from a single portrait image. By combining these methods with speech synthesis, attackers can create a realistic video of a person saying something that they never said and distribute it on the internet. This results in a loss of social trust, confusion, and harm to people's reputations. Several countermeasures have been proposed to tackle this problem, from hand-crafted features to convolutional neural networks. Some countermeasures use images as input and others leverage temporal information in videos. Their output can be binary (bona fide or fake), multi-class (deepfake detection), or segmentation masks (manipulation localization). Since deepfake methods evolve rapidly, dealing with unseen ones is still a challenging problem. Some solutions have been proposed; however, the problem is not completely solved. In this talk, I will provide an overview of both deepfake generation and deepfake detection/localization. I will mainly focus on the image and video domains and also introduce some audiovisual-based methods on both sides. Some open discussions and future directions are also included.
This document provides an overview of deep generative learning and summarizes several key generative models including GANs, VAEs, diffusion models, and autoregressive models. It discusses the motivation for generative models and their applications such as image generation, text-to-image synthesis, and enhancing other media like video and speech. Example state-of-the-art models are provided for each application. The document also covers important concepts like the difference between discriminative and generative modeling, sampling techniques, and the training procedures for GANs and VAEs.
The transformer is the neural architecture that has received the most attention in the early 2020s. It removes the recurrence of RNNs, replacing it with an attention mechanism between the input and output tokens of a sequence (cross-attention) and among the tokens composing the input (and output) sequences, named self-attention; a minimal sketch of the latter follows below.
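A minimal sketch of single-head scaled dot-product self-attention, with random projection matrices standing in for learned weights:

```python
# A minimal sketch of scaled dot-product self-attention (single head, no mask).
import math
import torch

def self_attention(x, wq, wk, wv):
    """x: (batch, tokens, dim); wq/wk/wv: (dim, dim) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # (batch, tokens, tokens)
    return weights @ v                        # every token attends to all tokens

x = torch.randn(2, 5, 64)                     # 2 sequences of 5 tokens
w = [torch.randn(64, 64) / 8 for _ in range(3)]
out = self_attention(x, *w)                   # same shape as x: (2, 5, 64)
```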
https://imatge-upc.github.io/synthref/
Integrating computer vision with natural language processing has achieved significant progress over the last years owing to the continuous evolution of deep learning. A novel vision and language task, which is tackled in the present Master thesis, is referring video object segmentation, in which a language query defines which instance to segment from a video sequence. One of the biggest challenges for this task is the lack of relatively large annotated datasets, since a tremendous amount of time and human effort is required for annotation. Moreover, existing datasets suffer from poor-quality annotations, in the sense that approximately one out of ten language expressions fails to uniquely describe the target object.

The present Master thesis addresses these challenges by proposing a novel method for generating synthetic referring expressions for an image (video frame). This method produces synthetic referring expressions by using only the ground-truth annotations of the objects as well as their attributes, which are detected by a state-of-the-art object detection deep neural network. One of the advantages of the proposed method is that its formulation allows its application to any object detection or segmentation dataset.

By using the proposed method, the first large-scale dataset with synthetic referring expressions for video object segmentation is created, based on an existing large benchmark dataset for video instance segmentation. A statistical analysis and comparison of the created synthetic dataset with existing ones is also provided in the present Master thesis.

The experiments conducted on three different datasets used for referring video object segmentation prove the efficiency of the generated synthetic data. More specifically, the results demonstrate that pre-training a deep neural network with the proposed synthetic dataset improves the ability of the network to generalize across different datasets, without any additional annotation cost.
Master MATT thesis defense by Juan José Nieto
Advised by Víctor Campos and Xavier Giro-i-Nieto.
27th May 2021.
Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to learn and discover meaningful skills in high-dimensional state-spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation by making use of variational or contrastive techniques. We demonstrate that both allow learning a set of basic navigation skills by maximizing an information theoretic objective. We assess our method in Minecraft 3D maps with different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations.
https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft
Peter Muschick MSc thesis
Universitat Politècnica de Catalunya, 2020
Sign language recognition and translation has been an active research field in recent years, with most approaches using deep neural networks to extract information from sign language data. This work investigates the mostly disregarded approach of combining human keypoint estimation from image and video data with OpenPose with a transformer network architecture. Firstly, it was shown that it is possible to recognize individual signs (4.5% word error rate (WER)). Continuous sign language recognition, though, was more error-prone (77.3% WER), and sign language translation was not possible using the proposed methods, which might be due to the low accuracy of OpenPose's keypoint estimation and the accompanying loss of information, or to insufficient capacity of the transformer model used. Results may improve with datasets containing higher repetition rates of individual signs, or by focusing more precisely on keypoint extraction for the hands.
https://github.com/telecombcn-dl/lectures-all/
These slides review techniques for interpreting the behavior of deep neural networks. The talk reviews basic techniques such as the display of filters and tensors, as well as more advanced ones that try to interpret which part of the input data is responsible for the predictions, or generate data that maximizes the activation of certain neurons.
https://telecombcn-dl.github.io/dlai-2020/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
https://telecombcn-dl.github.io/drl-2020/
This course presents the principles of reinforcement learning as an artificial intelligence tool based on the interaction of the machine with its environment, with applications to control tasks (e.g., robotics, autonomous driving) and decision-making (e.g., resource optimization in wireless communication networks). It also covers the development of deep neural networks trained with little or no supervision, for both discriminative and generative tasks, with special attention to multimedia applications (vision, language and speech).
Image segmentation is a classic computer vision task that aims at labeling pixels with semantic classes. These slides provide an overview of the basic approaches applied from the deep learning field to tackle this challenge and presents the basic subtasks (semantic, instance and panoptic segmentation) and related datasets.
Presented at the International Summer School on Deep Learning (ISSonDL) 2020 held online and organized by the University of Gdansk (Poland) between the 30th August and 2nd September.
http://2020.dl-lab.eu/virtual-summer-school-on-deep-learning/
https://imatge-upc.github.io/rvos-mots/
Video object segmentation can be understood as a sequence-to-sequence task that can benefit from curriculum learning strategies for better and faster training of deep neural networks. This work explores different schedule sampling and frame-skipping variations to significantly improve the performance of a recurrent architecture. Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse schedule sampling is a better option than a classic forward one. Also, a progressive skipping of frames during training is beneficial, but only when training with the ground-truth masks instead of the predicted ones.
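A minimal sketch of the schedule sampling mechanism, with a hypothetical recurrent step (a GRU cell and a linear readout standing in for the real segmentation architecture): at each step the model is fed either the ground-truth previous output (teacher forcing) or its own prediction, and the probability of ground truth follows either a forward (decaying) or inverse (growing) schedule.

```python
# A minimal sketch of (forward and inverse) schedule sampling.
import random
import torch
import torch.nn as nn

cell = nn.GRUCell(16, 32)          # stand-in for the recurrent segmenter
readout = nn.Linear(32, 16)        # maps the hidden state back to output space

def run_sequence(targets, p_ground_truth):
    """targets: (steps, batch, 16). Feed ground truth with prob. p_ground_truth."""
    h = torch.zeros(targets.size(1), 32)
    prev, outputs = targets[0], []
    for t in range(1, targets.size(0)):
        h = cell(prev, h)
        pred = readout(h)
        outputs.append(pred)
        # Teacher forcing vs. feeding back the model's own prediction.
        prev = targets[t] if random.random() < p_ground_truth else pred.detach()
    return torch.stack(outputs)

def schedule(step, total_steps, inverse=False):
    frac = step / total_steps
    return frac if inverse else 1.0 - frac   # probability of ground truth

out = run_sequence(torch.randn(5, 2, 16), schedule(100, 1000, inverse=True))
```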
Deep neural networks have achieved outstanding results in various applications such as vision, language, audio, speech, or reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. During the last year, multiple solutions have been proposed to address this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.
Deep neural networks have revolutionized the data analytics scene by improving results in several diverse benchmarks with the same recipe: learning feature representations from data. These achievements have raised the interest across multiple scientific fields, especially in those where large amounts of data and computation are available. This change of paradigm in data analytics has several ethical and economic implications that are driving large investments, political debates and sounding press coverage under the generic label of artificial intelligence (AI). This talk will present the fundamentals of deep learning through the classic example of image classification, and point at how the same principle has been adopted for several tasks. Finally, some of the forthcoming potentials and risks of AI will be pointed out.
Benet Oriol, Jordi Luque, Ferran Diego, Xavier Giro-i-Nieto
Telefonica Research / Universitat Politecnica de Catalunya (UPC)
CVPR 2020 Workshop on Egocentric Perception, Interaction and Computing
In this work, we propose an effective approach for training a shared embedding representation by combining three simultaneous modalities: image, spoken narratives and textual narratives. The proposed methodology departs from a baseline system that spawns an embedding space trained with only spoken narratives and image cues. Our experiments on the EPIC-Kitchens and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps the training procedure, yielding better embedding representations. The triad of speech, image and words allows for a better estimate of the embedding point and shows improved performance on tasks like image and speech retrieval, even when the third modality, text, is not present in the task.
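A minimal sketch of the idea with hypothetical linear encoders: all three modalities are projected into one normalized embedding space and pulled together pairwise with a simple margin-based contrastive loss.

```python
# A minimal sketch of a tri-modal (image/speech/text) shared embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

img_enc = nn.Linear(512, 128)   # stand-ins for real image/speech/text encoders
spk_enc = nn.Linear(256, 128)
txt_enc = nn.Linear(300, 128)

def embed(enc, feats):
    return F.normalize(enc(feats), dim=-1)   # unit-norm embeddings

img = embed(img_enc, torch.randn(8, 512))
spk = embed(spk_enc, torch.randn(8, 256))
txt = embed(txt_enc, torch.randn(8, 300))

def contrastive(a, b, margin=0.2):
    sim = a @ b.t()                           # (8, 8) similarity matrix
    pos = sim.diag().unsqueeze(1)             # matching pairs on the diagonal
    # Penalize mismatched pairs that come within `margin` of the matched one.
    return F.relu(margin + sim - pos).fill_diagonal_(0).mean()

loss = contrastive(img, spk) + contrastive(img, txt) + contrastive(spk, txt)
```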
These slides provide an overview of the most popular approaches to date for solving the task of object detection with deep neural networks. They review both two-stage approaches such as R-CNN, Fast R-CNN and Faster R-CNN, and one-stage approaches such as YOLO and SSD. They also contain pointers to relevant datasets (Pascal VOC, COCO, ILSVRC, Open Images) and the definition of the Average Precision (AP) metric, sketched in code after the program link below.
Full program:
https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgraduate-course-artificial-intelligence-deep-learning/
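As referenced above, here is a simplified sketch of the AP computation; the actual benchmarks use interpolated variants (e.g., 11-point for Pascal VOC), but the core idea of integrating precision over recall is the same.

```python
# A minimal sketch of Average Precision (AP) for detection evaluation.
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    order = np.argsort(-np.asarray(scores))          # highest confidence first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_ground_truth
    precision = tp_cum / (tp_cum + fp_cum)
    # Integrate the precision-recall curve (trapezoidal, non-interpolated).
    return float(np.trapz(precision, recall))

# 5 detections, 4 annotated objects: AP rewards ranking TPs above FPs.
# Whether a detection is a TP would come from IoU matching with ground truth.
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5],
                        [1, 1, 0, 1, 0], num_ground_truth=4))
```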
These slides summarize the main trends in deep neural networks for video encoding, including single-frame models, spatiotemporal convolutions, long-term sequence modeling with RNNs, and their combination with optical flow.
This lecture provides an introduction to recurrent neural networks, whose hidden state at each time step depends on both the current input and the hidden state from the previous time step; a minimal sketch follows the course link below.
These slides were used in the Master in Computer Vision Barcelona 2019/2020, in the Module 6 dedicated to Video Analysis.
http://pagines.uab.cat/mcv/
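As referenced above, a minimal sketch of the recurrence using PyTorch's built-in RNN, followed by the same computation written out step by step:

```python
# A minimal sketch of a vanilla RNN and its explicit recurrence.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(4, 7, 10)        # 4 sequences, 7 time steps, 10 features
out, h_n = rnn(x)                # out: (4, 7, 20); h_n: (1, 4, 20)

# The same computation written explicitly, one step at a time: the new hidden
# state mixes the current input with the hidden state of the previous step.
h = torch.zeros(4, 20)
for t in range(7):
    h = torch.tanh(x[:, t] @ rnn.weight_ih_l0.t() + rnn.bias_ih_l0
                   + h @ rnn.weight_hh_l0.t() + rnn.bias_hh_l0)
# `h` now matches out[:, -1] up to numerical precision.
```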
7. 7
#ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image Captioning with RNN
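As a rough illustration of the Show-and-Tell recipe (a CNN encodes the image, an RNN decodes the caption word by word), here is a hedged PyTorch sketch; the backbone choice, dimensions, and class names are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn
from torchvision import models

class CaptionerSketch(nn.Module):
    """Show-and-Tell-style captioner (illustrative only): a CNN image
    encoder feeds an LSTM decoder that emits one word per step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                      # any CNN backbone works
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).unsqueeze(1)    # (B, 1, embed_dim)
        words = self.embed(captions)                 # (B, T, embed_dim)
        # The image feature is prepended as the first "word" of the sequence.
        inputs = torch.cat([feats, words], dim=1)
        out, _ = self.lstm(inputs)
        return self.head(out)                        # word logits per time-step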
8. 8
#DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions."
CVPR 2015 (Slides by Marc Bolaños)
Image Captioning with RNN
9. 9
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
Image Captioning with RNN & Attention
10. 10
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
Image Captioning with RNN & Attention
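The attention in Show, Attend and Tell can be sketched as additive (soft) attention over the flattened CNN feature grid: at every decoding step, the decoder state scores each spatial location and the context vector is their weighted average. The module below is an illustrative approximation with assumed names and sizes.

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Soft attention over a spatial feature grid (sketch only)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (B, L, feat_dim), the CNN grid flattened to L locations
        # h:     (B, hidden_dim), the current decoder hidden state
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hid_proj(h).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)           # attention weights over the grid
        context = (alpha * feats).sum(dim=1)      # (B, feat_dim) weighted average
        return context, alpha.squeeze(-1)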
11. 11
Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image
Captioning." CVPR 2020. [tweet]
Image Captioning with Transformers
12. 12
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
13. 13
[Figure: DenseCap example. Region captions on XAVI: “man has short hair”, “man with short hair”; on AMAIA: “a woman wearing a black shirt”; on BOTH: “two men wearing black glasses”]
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
14. 14
Recipe Generation
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
15. 15
Recipe Generation
Title: Edamame corn salad
Ingredients: pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil
Instructions:
- In a large bowl, combine edamame, corn, red onion, cilantro, avocado, and red bell pepper.
- In a small bowl, whisk together olive oil, vinegar, salt, and pepper.
- Pour dressing over edamame mixture and toss to coat.
- Cover and refrigerate for at least 1 hour before serving.
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
16. 16
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
Fighting Data Bias in Captioning
17. 17
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
Fighting Data Bias in Captioning
18. 18
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
Video Captioning
19. 19
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, CVPR 2016.
[Figure: hierarchical recurrent encoder. A 2nd-layer LSTM unit reads, along the time axis from t = 1 to t = T, the 1st layer's hidden state at t = T for each chunk of image frames.]
Captioning: Video
21. 21
Multimodal Machine Translation
Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann.
"Multimodal machine translation through visuals and speech." Machine Translation (2020): 1-51. [tweet]
22. 22
Sign Language Translation with RNN+Att
Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
23. 23
Sign Language Translation with Transformers
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden, “Sign Language Transformers: Joint
End-to-end Sign Language Recognition and Translation” CVPR 2020.
24. 24
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading."
(2016).
25. 25
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).
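LipNet is trained end-to-end with the CTC loss, which removes the need for frame-level alignment between video and transcript. Below is a minimal usage sketch of PyTorch's CTCLoss on dummy tensors; the shapes and vocabulary size are illustrative only, not LipNet's configuration.

import torch
import torch.nn as nn

T, B, C = 75, 2, 28            # frames, batch, characters (index 0 = blank)
log_probs = torch.randn(T, B, C).log_softmax(dim=2)   # model outputs per frame
targets = torch.randint(1, C, (B, 20))                # character labels, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)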
26. 26
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild."
CVPR 2017
27. 27
Lipreading: Watch, Listen, Attend & Spell
[Figure: audio features and image features are extracted by separate encoders.]
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
28. 28
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over output states from audio and video is computed at each timestep.
30. 30
Image Captioning Grounded on Detected Objects
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
31. 31
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
Image Captioning Grounded on Detected Objects
32. 32
Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Image Captioning Grounded on Heatmaps
33. 33
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
35. 35
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Image Generation
36. 36
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Generation
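The core conditioning trick in Reed et al. is to compress the sentence embedding and concatenate it with the noise vector before upsampling to pixels. The toy generator below only illustrates that idea; the layer sizes and the deliberately small output resolution are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class TextToImageG(nn.Module):
    """Text-conditional GAN generator (illustrative sketch)."""
    def __init__(self, txt_dim=1024, cond_dim=128, z_dim=100):
        super().__init__()
        self.compress = nn.Sequential(nn.Linear(txt_dim, cond_dim),
                                      nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + cond_dim, 256, 4, 1, 0),  # 1x1 -> 4x4
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),               # 4x4 -> 8x8
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, 2, 1),                 # 8x8 -> 16x16
            nn.Tanh())

    def forward(self, z, txt_emb):
        c = self.compress(txt_emb)                     # condition on the text
        x = torch.cat([z, c], dim=1)[..., None, None]  # (B, z+c, 1, 1)
        return self.net(x)                             # RGB image in [-1, 1]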
37. 37
Image Synthesis
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
38. 38
Image Synthesis with Cycle Consistency
#MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by
redescription." CVPR 2019. [code]
39. 39
Image Synthesis with Cycle Consistency
#MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by
redescription." CVPR 2019. [code]
40. 40
Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Generation via Scene Graphs
41. 41
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
42. 42
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
Video Generation by Composition
43. 43
Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production.
ECCV 2020.
Sign Language Generation with Transformers
44. 44
Lucas Ventura, Amanda Duarte, Xavier Giro-i-Nieto, “Can Everybody Sign Now? Exploring Sign Language
Video Generation from 2D Poses”. ECCV SLRTP Workshop 2020.
Sign Language Generation (pose 2 pixels)
45. 45
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
47. 47
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." CVPR 2015.
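A common baseline from the VQA paper fuses an LSTM question encoding with image features by element-wise product before classifying over a fixed answer set. The sketch below follows that spirit; all dimensions and names are assumptions for illustration.

import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """LSTM question encoder + image projection + element-wise fusion."""
    def __init__(self, img_dim=4096, vocab=10000, n_answers=1000, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, 300)
        self.q_rnn = nn.LSTM(300, d, batch_first=True)
        self.img_proj = nn.Linear(img_dim, d)
        self.classifier = nn.Linear(d, n_answers)

    def forward(self, img_feat, question_tokens):
        _, (h, _) = self.q_rnn(self.embed(question_tokens))
        q = h[-1]                                # final hidden state, (B, d)
        v = torch.tanh(self.img_proj(img_feat))  # projected image feature
        return self.classifier(q * v)            # element-wise fusion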
48. 48
Visual Question Answering (VQA)
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual
Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
49. 49
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter
prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)
50. 50
VQA: Dynamic Memory Networks
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for
Visual and Textual Question Answering." ICML 2016
51. 51
Visual Reasoning
#Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick.
"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
52. 52
Visual Reasoning: Programming
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, Ross Girshick, “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
[Figure: a Program Generator feeding an Execution Engine.]
53. 53
Visual Reasoning: Relation Networks
#RN Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate all possible pairs of objects with an encoded question, and an MLP then predicts the answer.
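That pairwise construction can be written down directly. The sketch below (assumed sizes and module names) concatenates every ordered pair of object embeddings with the question encoding, applies a shared MLP g, sums over pairs, and decodes the answer with a second MLP f; the sum makes the module invariant to object ordering.

import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """Relation Network core (illustrative sketch)."""
    def __init__(self, obj_dim, q_dim, hidden=256, n_answers=10):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers))

    def forward(self, objects, question):
        # objects: (B, N, obj_dim), question: (B, q_dim)
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)
        o_j = objects.unsqueeze(1).expand(B, N, N, D)
        q = question.unsqueeze(1).unsqueeze(1).expand(B, N, N, -1)
        pairs = torch.cat([o_i, o_j, q], dim=-1)   # all N*N object pairs
        rel = self.g(pairs).sum(dim=(1, 2))        # permutation-invariant sum
        return self.f(rel)                         # answer logits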
54. 54
Visual Dialog
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual
Dialog." CVPR 2017 [Project]
57. 57
Hate Speech Detection in Memes
Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech in Pixels: Detection of Offensive Memes
towards Automatic Moderation”. NeurIPS 2019 AI for Good Workshop. [code]
Hate Speech Detection
60. 60
Niu, Yulei, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. "Variational Context: Exploiting Visual and Textual Context for
Grounding Referring Expressions." arXiv preprint arXiv:1907.03609 (2019).
Objects from Referring Expressions
61. 61
Video Objects from Referring Expressions
Li, Zhenyang, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. "Tracking by natural language
specification." CVPR 2017. [code]
62. 62
Video Object Detection with Transformers
Sadhu, A., Chen, K., & Nevatia, R. (2020). Video Object Grounding using Semantic Roles in Language Description. CVPR 2020.
63. 63
#Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular
attention network for referring expression comprehension." CVPR 2018. [code]
Segments from Referring Expressions
64. 64
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV
2018.
Video Objects from Referring Expressions
65. 65
Herrera-Palacio, Alba, Carles Ventura, and Xavier Giro-i-Nieto. "Video object linguistic grounding." ACM Multimedia
Workshops 2019.
Video Objects from Referring Expressions
66. 66
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto.
"RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint
arXiv:2010.00263 (2020).
Video Objects from Referring Expressions
67. 67
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto.
"RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint
arXiv:2010.00263 (2020).
Video Objects from Referring Expressions
68. 68
#SynthRef Ioannis Kazakos, Miriam Bellver, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation
of Synthetic Referring Expressions for Object Segmentation” (submitted)
Synthetic Expressions w/ Scene Graphs
69. 69
#SynthRef Ioannis Kazakos, Miriam Bellver, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation
of Synthetic Referring Expressions for Object Segmentation” (submitted)
70. Segments from Questions
Gan, Chuang, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. "VQS: Linking segmentations to questions and
answers for supervised attention in vqa and question-focused semantic segmentation." ICCV 2017.
73. 73
Joint Representations (Embeddings)
#Devise Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013
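Joint embeddings of this kind are typically trained with a max-margin ranking loss that pulls matching image/text pairs together and pushes mismatched ones apart by at least a margin. A hedged sketch follows (in the spirit of DeViSE, not its exact objective); names and the margin value are assumptions.

import torch

def ranking_loss(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (B, D) L2-normalised embeddings; row i of each
    # tensor is a matching image/text pair.
    scores = img_emb @ txt_emb.T               # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)           # scores of the true pairs
    hinge = (margin - pos + scores).clamp(min=0)
    mask = ~torch.eye(scores.size(0), dtype=torch.bool,
                      device=scores.device)    # drop the diagonal terms
    return hinge[mask].mean()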
74. 74
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images from “cat” in the training set... but they can still be recognised as “cats” thanks to the representations learned from text.
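The mechanism fits in a few lines: image features are mapped into the word-embedding space and classified by the nearest class word vector, which is what lets unseen classes such as “cat” be recognised. All tensors below are random stand-ins for real features and word vectors.

import torch
import torch.nn.functional as F

img_feat = torch.randn(4, 512)                 # CNN features for 4 test images
W = torch.randn(512, 300, requires_grad=True)  # learned linear map to word space
class_vecs = torch.randn(10, 300)              # word vectors, incl. unseen classes

proj = F.normalize(img_feat @ W, dim=1)        # project and L2-normalise
sims = proj @ F.normalize(class_vecs, dim=1).T # cosine similarity to each class
pred = sims.argmax(dim=1)                      # nearest class word vector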
75. 75
Multimodal Retrieval
Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural
language models." NeurIPS 2014 Deep Learning Workshop.
76. 76
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
77. 77
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
78. 78
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
79. 79
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba,
“Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
Joint Neural Embeddings
80. 80
Joint Neural Embeddings
[Figure: the recipe and image encoders (an LSTM and a bidirectional LSTM on the text side) are projected into a joint embedding.]
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba,
“Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
82. 82
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Visual Task: predict the visual categories for the masked video frame.
Language Task: predict the masked word (same as in language-only BERT).
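The two masked objectives can be sketched as two prediction heads over the co-attended states: cross-entropy for the masked word, and (ViLBERT-style) a KL divergence against the object detector's class distribution for the masked visual input. Everything below (sizes, random states, targets) is an assumed stand-in, shown only to make the losses concrete.

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab, n_visual_cats = 768, 30522, 1601   # BERT-like sizes (assumed)
word_head = nn.Linear(hidden, vocab)              # masked word prediction
region_head = nn.Linear(hidden, n_visual_cats)    # masked visual categories

text_state = torch.randn(2, hidden)     # co-attended state at a masked word
visual_state = torch.randn(2, hidden)   # co-attended state at a masked frame/region

word_loss = F.cross_entropy(word_head(text_state),
                            torch.randint(0, vocab, (2,)))
target_dist = F.softmax(torch.randn(2, n_visual_cats), dim=1)  # detector output
visual_loss = F.kl_div(F.log_softmax(region_head(visual_state), dim=1),
                       target_dist, reduction='batchmean')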
83. 83
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Multimodal Task: predict whether the video frames correspond to the caption.
84. 84
Representations
#VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video
and language representation learning." ICCV 2019.
85. 85
Representations
#VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video
and language representation learning." ICCV 2019.
Rich representations can be used to retrieve matching video frames, which are encoded after vector
quantization.
87. 87
Learning Language from Video
Doughty, Hazel, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. "Action Modifiers: Learning from Adverbs in
Instructional Videos." CVPR 2020.
88. 88
Learning Language from Video
Surís, Dídac, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. "Learning to Learn Words from Visual Scenes." ECCV
2020.
90. 90
Platforms for Embodied AI
#Habitat Savva, Manolis, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub et
al. "Habitat: A platform for embodied ai research." ICCV 2019. [site]
91. 91
Navigation
Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor
Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language
Navigation." NeurIPS 2018.
92. 92
Navigation
#R2R Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & van den Hengel, A. Vision-and-language
navigation: Interpreting visually-grounded navigation instructions in real environments. CVPR 2018. [tweet]
93. 93
Navigation
#RxR Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason Baldridge, “Room-Across-Room:
Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding” EMNLP 2020.
94. 94
Navigation
Ünal, Emre, Ozan Arkan Can, and Yücel Yemez. "Visually Grounded Language Learning For Robot Navigation." ACMMM
Workshops 2019.
95. 95
Object manipulation
Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. Environmental drivers of
systematicity and generalization in a situated agent. ICLR 2020. [talk]
97. 97
My take-home message
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Feature Learning
4. Control Tasks
98. Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
Was this tutorial helpful? Please consider citing:
Go raibh maith agat / Thank you
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language,
Vision, Audio and Speech. In Proceedings of the 2020
International Conference on Multimedia Retrieval (pp. 7-8).