Video has become ubiquitous on the Internet, on TV, and on personal devices. Recognition of video content has been a fundamental challenge in computer vision for decades, with previous research predominantly focused on recognizing videos using a predefined yet limited vocabulary. Thanks to the recent development of deep learning techniques, researchers in multiple communities are now striving to bridge videos with natural language in order to move beyond classification to interpretation, which should be regarded as the ultimate goal of video understanding. We will present recent advances in exploring the synergy of video understanding and language processing techniques.
Many learning tasks can be summarized as learning a mapping from a structured input to a structured output, such as machine translation, image captioning, image style transfer, and image dehazing. Such mappings are usually learned on paired training data, where an input sample and its corresponding output are both provided. Collecting paired training data often involves expensive human annotation, and the scale of paired training data is therefore often limited. As a result, the generalization ability of models trained on paired data is also limited. One way to mitigate this issue is learning with unpaired data, which is far less expensive to collect. Taking machine translation as an example, unpaired training data can be collected separately from newspapers in the source language and the target language without any annotation. The challenge of unpaired learning then becomes how to align the unpaired data. With carefully designed objectives, unpaired learning has achieved remarkable progress on several tasks. This talk will cover the data collection and training methods of several unpaired learning tasks to illustrate the power of learning with unpaired data.
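As an illustrative sketch of one such carefully designed objective, cycle consistency (popularized by CycleGAN-style methods), consider the toy example below; the tiny MLP generators and random tensors are placeholders standing in for real translation models and real unpaired data:

```python
# Minimal sketch of a cycle-consistency objective for unpaired learning.
# Translating a sample to the other domain and back should reconstruct
# the input, even though x and y are never paired.
import torch
import torch.nn as nn

G_xy = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))  # X -> Y
G_yx = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))  # Y -> X
l1 = nn.L1Loss()
opt = torch.optim.Adam(list(G_xy.parameters()) + list(G_yx.parameters()), lr=2e-4)

x = torch.randn(32, 64)  # unpaired samples from domain X (placeholder)
y = torch.randn(32, 64)  # unpaired samples from domain Y (placeholder)

loss = l1(G_yx(G_xy(x)), x) + l1(G_xy(G_yx(y)), y)  # cycle consistency
# A full method would add adversarial losses so that G_xy(x) resembles
# domain Y and G_yx(y) resembles domain X; omitted here for brevity.
opt.zero_grad()
loss.backward()
opt.step()
```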
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
SSII2021 [SS2] Deepfake Generation and Detection – An Overview (ディープフェイクの生成と検出)
June 10 (Thu), 14:30–15:00
Lecturer: Huy H. Nguyen (The Graduate University for Advanced Studies, SOKENDAI / National Institute of Informatics)
Abstract: Advances in machine learning and their interplay with computer graphics allow us to easily generate high-quality images and videos. State-of-the-art manipulation methods enable the real-time manipulation of videos obtained from social networks. It is also possible to generate videos from a single portrait image. By combining these methods with speech synthesis, attackers can create a realistic video of a person saying something they never said and distribute it on the internet. This results in a loss of social trust, causes confusion, and harms people's reputations. Several countermeasures have been proposed to tackle this problem, from hand-crafted features to convolutional neural networks. Some countermeasures use images as input, while others leverage temporal information in videos. Their output can be binary (bona fide or fake), multi-class (deepfake detection), or segmentation masks (manipulation localization). Since deepfake methods evolve rapidly, dealing with unseen ones remains a challenging problem. Some solutions have been proposed; however, the problem is not completely solved. In this talk, I will provide an overview of both deepfake generation and deepfake detection/localization. I will mainly focus on the image and video domains and also introduce some audiovisual-based methods on both sides. Some open discussions and future directions are also included.
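As a minimal, hypothetical sketch of the simplest countermeasure family mentioned above (a convolutional network with a binary bona fide/fake output), not any specific published detector:

```python
# Illustrative sketch of a binary (bona fide vs. fake) deepfake detector:
# a standard CNN backbone with a 2-class head. Real systems differ in
# data, augmentation, and architecture; this only shows the basic shape.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)          # generic backbone
model.fc = nn.Linear(model.fc.in_features, 2)  # bona fide / fake

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Placeholder batch standing in for face crops and labels (0=bona fide, 1=fake).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

logits = model(images)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```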
https://mcv-m6-video.github.io/deepvideo-2020/
Self-supervised audiovisual learning exploits the synchronization between pixels and audio recorded in video files. This lecture reviews the state of the art in deep neural networks trained with this approach, which does not require any manual annotation from humans.
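A minimal sketch of how that synchronization can serve as a supervisory signal, assuming a simple contrastive setup in which each clip's visual embedding must match its own audio embedding; the linear encoders and random features are placeholders for real networks and data:

```python
# Sketch of audiovisual self-supervision: embeddings of a clip and its
# own audio track should match (in sync), while audio from other clips
# in the batch serves as negatives. No human labels are needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_net = nn.Linear(512, 128)  # stand-in for a visual CNN encoder
audio_net = nn.Linear(128, 128)   # stand-in for an audio encoder

frames = torch.randn(16, 512)  # per-clip visual features (placeholder)
audio = torch.randn(16, 128)   # corresponding audio features (placeholder)

v = F.normalize(visual_net(frames), dim=1)
a = F.normalize(audio_net(audio), dim=1)

# Contrastive objective: the i-th clip matches the i-th audio track.
logits = v @ a.t() / 0.07      # similarity matrix with temperature
targets = torch.arange(16)     # in-sync pairs lie on the diagonal
loss = F.cross_entropy(logits, targets)
loss.backward()
```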
Language and speech technologies are rapidly evolving thanks to the current advances in artificial intelligence. The convergence of large-scale datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Applications such as machine translation or speech recognition can be tackled from a neural perspective with novel architectures that combine convolutional and/or recurrent models with attention. This winter school overviews the state of the art in deep learning for speech and language and introduces the programming skills and techniques required to train these systems.
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
Tutorial page:
https://imatge.upc.edu/web/publications/one-perceptron-rule-them-all-language-vision-audio-and-speech-tutorial
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures to encode and decode vision, text and audio, and later review those models that have successfully translated information across modalities.
Video has become ubiquitous on the Internet, on TV, and on personal devices. Recognition of video content has been a fundamental challenge in computer vision for decades, with previous research predominantly focused on recognizing videos using a predefined yet limited vocabulary. Thanks to the recent development of deep learning and knowledge graph techniques, researchers in multiple communities are now striving to bridge videos with natural language in order to move beyond classification to interpretation, which should be regarded as the ultimate goal of video understanding. We will present recent advances in exploring the synergy of video understanding and language processing techniques, including video entity linking, video-language alignment, and video captioning, and discuss how domain knowledge can fit in to improve the performance.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
http://ixa2.si.ehu.es/deep_learning_seminar/
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language and vision. Image captioning, visual question answering or multimodal translation are some of the first applications of a new and exciting field that exploits the generalization properties of deep neural representations. This talk will provide an overview of how vision and language problems are addressed with deep neural networks, and the exciting challenges being addressed nowadays by the research community.
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
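As a minimal sketch of the kind of recurrent model the course analyzes, here is a small LSTM sequence classifier; vocabulary size, dimensions, and the random batch are arbitrary placeholders:

```python
# Minimal sketch of a recurrent model for sequence tasks: an LSTM that
# encodes a token sequence and predicts one label per sequence.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128, classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, classes)

    def forward(self, tokens):
        x = self.embed(tokens)       # (batch, time, emb)
        _, (h, _) = self.lstm(x)     # h: final hidden state
        return self.head(h[-1])      # one prediction per sequence

model = LSTMClassifier()
tokens = torch.randint(0, 1000, (4, 20))  # batch of 4 sequences, length 20
print(model(tokens).shape)                # torch.Size([4, 10])
```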
These slides review the research of our lab since 2016 on applied deep learning, starting from our participation in the TRECVID Instance Search 2014, moving into video analysis with CNN+RNN architectures, and our current efforts in sign language translation and production.
https://github.com/mcv-m6-video/deepvideo-2019
The synchronization of the visual and audio tracks recorded in videos can be used as a supervisory signal for machine learning. This presentation reviews some recent research on this topic exploiting the capabilities of deep neural networks.
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://mcv-m6-video.github.io/deepvideo-2019/
These slides provide an overview of how deep neural networks can be used to solve an object tracking task.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Video has become ubiquitous on the Internet, on TV, and on personal devices. Recognition of video content has been a fundamental challenge in computer vision for decades, with previous research predominantly focused on understanding videos using a predefined yet limited vocabulary. Thanks to the recent development of deep learning techniques, researchers in both the computer vision and multimedia communities are now striving to bridge videos with natural language, which can be regarded as the ultimate goal of video understanding. We will present recent advances in exploring the synergy of video understanding and language processing techniques, including video-language alignment and video captioning.
Does deep learning solve all the machine learning problems? Where would domain knowledge fit in? While it is common in medical data analytics to incorporate domain knowledge, we focus on one emerging area in computer vision and language processing, video+language, to answer these questions.
The speed of Deep Neural Networks (DNNs), in both training and inference, is important for their practical usage. This talk presents adaptive deep reuse, a novel optimization that enhances the speed of DNNs by efficiently and effectively identifying unnecessary computations in DNN training on the fly. By avoiding these computations, the technique cuts DNN training time by 69% and inference time by 50%, with virtually no accuracy loss. The method is fully automatic and ready to be adopted, requiring neither manual code changes nor extra computing resources. It offers a promising way to substantially reduce both the time and energy cost of developing and deploying AI products. Since its recent publication, the technique has drawn a lot of interest from the media, industry practitioners, and the research community.
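The paper's exact mechanism is not detailed in this abstract; as a rough, hypothetical illustration of the general computation-reuse idea (grouping similar activation rows and computing each expensive product once per group), consider this toy sketch:

```python
# Toy illustration of the idea behind deep reuse (NOT the paper's
# implementation): rows of an activation matrix that are nearly identical
# are grouped, the matrix product is computed once per group, and the
# result is reused for every member, skipping redundant computation.
import numpy as np
from scipy.cluster.vq import kmeans2

X = np.random.rand(1024, 256).astype(np.float32)  # activations (placeholder)
W = np.random.rand(256, 128).astype(np.float32)   # layer weights (placeholder)

# Group similar rows; fewer clusters = more reuse, more approximation error.
centroids, labels = kmeans2(X, 64, minit="points")

reused = centroids @ W    # 64 products instead of 1024
approx = reused[labels]   # each row reuses its cluster's result

exact = X @ W
print("mean abs error:", np.abs(approx - exact).mean())
```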
A practical talk by Anirudh Koul on how to get Deep Neural Networks running on memory- and energy-constrained devices like smartphones. It highlights some frameworks and best practices.
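The talk's specific recommendations are not reproduced here; as one widely used practice for constrained devices, here is a hedged sketch of post-training quantization with TensorFlow Lite, where the tiny Keras model is a placeholder for a real trained network:

```python
# Sketch of a common technique for fitting a DNN onto a constrained
# device: post-training quantization with TensorFlow Lite.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weight quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:  # ready to ship to a mobile device
    f.write(tflite_model)
```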
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B... (by Edureka!)
This Edureka tutorial on "Introduction to TensorFlow" provides you an insight into one of the top Deep Learning frameworks that you should consider learning!
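As a taste of the framework, here is a minimal Keras example of the kind such an introduction typically walks through (the tutorial's exact content may differ):

```python
# Minimal TensorFlow/Keras example: train a small classifier on MNIST.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1)
print(model.evaluate(x_test, y_test))
```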
Large-scale Video Classification with Convolutional Neural Networks
Andrej Karpathy (1,2), George Toderici (1), Sanketh Shetty (1), Thomas Leung (1), Rahul Sukthankar (1), Li Fei-Fei (2)
(1) Google Research, (2) Computer Science Department, Stanford University
http://cs.stanford.edu/people/karpathy/deepvideo

Abstract
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).

1. Introduction
Images and videos have become ubiquitous on the internet, which has encouraged the development of algorithms that can analyze their semantic content for various applications, including search and summarization. Recently, Convolutional Neural Networks (CNNs) [15] have been demonstrated as an effective class of models for understanding image content, giving state-of-the-art results on image recognition, segmentation, detection and retrieval [11, 3, 2, 20, 9, 18]. The key enabling factors behind these results were techniques for scaling up the networks to tens of millions of parameters and massive labeled datasets that can support the learning process. Under these conditions, CNNs have been shown to learn powerful and interpretable image features [28]. Encouraged by positive results in the domain of images, we study the performance of CNNs in large-scale video classification, where the networks have access to not only the appearance information present in single, static images, but also their complex temporal evolution. There are several challenges to extending and applying CNNs in this setting.
From a practical standpoint, there are currently no video classification benchmarks that match the scale and variety of existing image datasets because videos are significantly more difficult to collect, annotate and store. To obtain sufficient amount of data needed to train our CNN architectures, we collected a new Sports-1M dataset, which consists of 1 million YouTube videos belonging to a taxonomy ...
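As a rough sketch of the contrast the abstract draws between single-frame models and temporally fused networks; the paper fuses frame features with learned layers rather than the simple average used here, and the tiny encoder, 487-class head, and random clip are placeholders:

```python
# Sketch of the "late fusion" idea: run a shared 2D CNN on individual
# frames and merge per-frame features across time, versus a single-frame
# model that sees only one frame.
import torch
import torch.nn as nn

frame_encoder = nn.Sequential(       # shared 2D CNN applied per frame
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(16, 487)      # 487 classes, as in Sports-1M

clip = torch.randn(2, 10, 3, 64, 64)  # (batch, frames, C, H, W), placeholder
b, t = clip.shape[:2]

feats = frame_encoder(clip.flatten(0, 1)).view(b, t, -1)
late_fusion_logits = classifier(feats.mean(dim=1))  # merge across time
single_frame_logits = classifier(feats[:, 0])       # one frame only
print(late_fusion_logits.shape, single_frame_logits.shape)
```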
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an... (by AI Frontiers)
This talk will present some recent advances in video understanding at Google. It will cover the technology behind progress in applications such as large-scale video annotation for YouTube, video summarization and Motion Stills, as well as our research in weakly-supervised learning, domain adaptation from YouTube to Google Photos and action recognition. I will also give my perspective on promising directions for future research in video.
Keynote VariVolution/VM4ModernTech@SPLC 2022
At compile time or at runtime, varying software is a powerful means to achieve optimal functional and performance goals. An observation is that considering only the software layer might be naive when tuning the performance of a system or testing that its functionality behaves correctly. In fact, many layers (hardware, operating system, input data, build process, etc.), themselves subject to variability, can alter the performance of software configurations. For instance, configuration options may have very different effects on execution time or energy consumption when used with different input data, depending on the way the software has been compiled and the hardware on which it is executed.
In this talk, I will introduce the concept of “deep software variability” which refers to the interactions of all external layers modifying the behavior or non-functional properties of a software system. I will show how compile-time options, inputs, and software evolution (versions), some dimensions of deep variability, can question the generalization of the variability knowledge of popular configurable systems like Linux, gcc, xz, or x264.
I will then argue that machine learning (ML) is particularly suited to manage very large variants space. The key idea of ML is to build a model based on sample data -- here observations about software variants in variable settings -- in order to make predictions or decisions. I will review state-of-the-art solutions developed in software engineering and software product line engineering while connecting with works in ML (e.g., transfer learning, dimensionality reduction, adversarial learning). Overall, the key challenge is to leverage the right ML pipeline in order to harness all variability layers (and not only the software layer), leading to more efficient systems and variability knowledge that truly generalizes to any usage and context.
From this perspective, we are starting an initiative to collect data, software, reusable artefacts, and body of knowledge related to (deep) software variability: https://deep.variability.io
Finally, I will open a broader discussion on how machine learning and deep software variability relate to the reproducibility, replicability, and robustness of scientific, software-based studies (e.g., in neuroimaging and climate modelling).
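As a hedged sketch of the ML idea described above (building a model from sampled observations of software variants to predict performance), here is a toy example with a random-forest regressor; the binary options and the synthetic timing function are hypothetical, not measurements of any real system such as x264:

```python
# Toy sketch: learn a performance model from sampled configurations.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
configs = rng.integers(0, 2, size=(500, 8))  # 8 binary options (placeholder)
# Hypothetical ground truth: options interact with each other.
times = (10 + 3 * configs[:, 0] - 2 * configs[:, 1] * configs[:, 2]
         + rng.normal(0, 0.3, 500))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(configs[:400], times[:400])
print("R^2 on held-out configs:", model.score(configs[400:], times[400:]))
# Deep variability: a model trained on one hardware/input/version may not
# transfer to another, motivating transfer learning across these layers.
```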
Tagging based Efficient Web Video Event Categorization (by Editor IJCATR)
Web video categorization is one of the emerging research fields in the computer vision domain due to the massive growth of video volume on the internet, which demands event discovery. Insufficient, noisy information and large intra-class disparity make it a daunting task to recognize events. Most recent works focus on constrained (fixed camera, known environment) videos with supervised labelling to categorize web videos. In this paper, we propose a subject-based Part-of-Speech (POS) tagging technique, assisted by Named Entity Recognition (NER) and WordNet, applied to YouTube video titles to discover events based on the subject, not on the objects visualized in the videos. An unsupervised learning method is used on high-level features (titles) because incoming videos are not known in advance and exhibit large intra-class variations. For the experiment, we chose topics from Google Zeitgeist and downloaded the related videos from YouTube. A novel conclusion is derived from the experimental result: the use of low-level features leads to poor classification when discovering intra-class events based on the subject of the videos.
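As a hypothetical illustration of the described pipeline (POS tagging with NER and WordNet applied to a video title) using NLTK; the example title is invented, the paper's exact pipeline may differ, and NLTK resource names vary across library versions:

```python
# Sketch: POS-tag a YouTube title, extract named entities, and look up
# the remaining nouns in WordNet as candidate subjects of the video.
import nltk
from nltk.corpus import wordnet as wn

for pkg in ["punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words", "wordnet"]:
    nltk.download(pkg, quiet=True)  # resource names may vary by NLTK version

title = "Barack Obama speech at the White House"  # made-up example title
tagged = nltk.pos_tag(nltk.word_tokenize(title))
tree = nltk.ne_chunk(tagged)

entities = [" ".join(w for w, _ in st.leaves())
            for st in tree.subtrees() if st.label() != "S"]
nouns = [w for w, t in tagged if t.startswith("NN")]
senses = {w: wn.synsets(w)[:1] for w in nouns}

print("entities:", entities)   # candidate subjects of the video
print("noun senses:", senses)  # WordNet senses for disambiguation
```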
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve... (by Amazon Web Services)
The IARPA Machine Intelligence from Cortical Networks (MICrONS) program is a research endeavor created to improve neurally-plausible machine-learning algorithms by understanding data representations and learning rules used by the brain through structurally and functionally interrogating a cubic millimeter of mammalian neocortex. This effort requires efficiently storing, visualizing, and processing petabytes of neuroimaging data. The Johns Hopkins University Applied Physics Laboratory (APL) has developed an open-source, highly available service to manage these data, called the Boss. The Boss uses AWS to provide a cloud-native spatial database with an innovative storage hierarchy and auto-scaling capability to balance cost and performance. This system extensively uses serverless components to meet both scalability and cost requirements. In this session, we provide an overview of the Boss, and we focus on how the APL used Amazon DynamoDB, AWS Lambda, and AWS Step Functions for several high-throughput components of the system. We discuss both the challenges and successes with serverless technologies.
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even... (by Piyush Yadav)
This work was presented at the IEEE Graph Computing Conference 2019 in Laguna Hills, California. The work focuses on graph-based structured representation of video streams and the creation of complex event rules for pattern matching.
How can you manage defects? If you are in a factory, production may yield objects with defects. Or sensor values may tell you, over time, that some readings are not "normal". What can you do as a developer (not as a Data Scientist) with .NET or Azure to detect these anomalies? Let's see how in this session.
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil... (by Ijripublishers Ijri)
Global interconnect planning becomes a challenge as semiconductor technology continuously scales. Because of the increasing wire resistance and higher capacitive coupling in smaller features, the delay of global interconnects becomes large compared with the delay of a logic gate, introducing a huge performance gap that needs to be resolved. A novel equalized global link architecture and driver–receiver co-design flow are proposed for high-speed and low-energy on-chip communication by utilizing a continuous-time linear equalizer (CTLE). The proposed global link is analyzed using a linear system method, and the formula of the CTLE eye opening is derived to provide high-level design guidelines and insights. Compared with the separate driver–receiver design flow, over 50% energy reduction is observed.
Serena Yeung, PhD, Stanford, at MLconf Seattle 2017 (by MLconf)
Serena is a Ph.D. student in the Stanford Vision Lab, advised by Prof. Fei-Fei Li. Her research interests are in computer vision, machine learning, and deep learning. She is particularly interested in the areas of video understanding, human action recognition, and healthcare applications. She interned at Facebook AI Research in Summer 2016.
Before starting her Ph.D., she received a B.S. in Electrical Engineering in 2010, and an M.S. in Electrical Engineering in 2013, both from Stanford. She also worked as a software engineer at Rockmelt (acquired by Yahoo) from 2009-2011.
Abstract summary
Towards Scaling Video Understanding:
The quantity of video data is vast, yet our capabilities for visual recognition and understanding in videos lag significantly behind those for images. In this talk, I will first discuss some of the challenges of scale in labeling, modeling, and inference behind this gap. I will then present some of our recent work towards addressing these challenges, in particular using reinforcement learning-based formulations to tackle efficient inference in videos and learning classifiers from noisy web search results. Finally, I will conclude with a discussion of promising future directions towards scaling video understanding.
With the explosive growth in AI-related fields, top conferences and journals are struggling to keep up with the tremendous number of paper submissions. More and more new or inexperienced reviewers are rising to the occasion. How can you become a good reviewer and contribute to the health and growth of the field we all invest in? We will share our perspectives and suggestions.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (by SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (by Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Communications Mining Series - Zero to Hero - Session 1 (by DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Climate Impact of Software Testing at Nordic Testing Days (by Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (by James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
GridMate - End to end testing is a critical piece to ensure quality and avoid...
Video + Language
Jiebo Luo
Department of Computer Science
What Is Computer Vision?
• A major branch of Artificial Intelligence
• Vision is the process of discovering from images what is present in the world, and where it is. -- David Marr, Vision (1982)
Computer vision: 50 years of progress
• feature engineering + model learning: 𝐟 = f(I), 𝒚 = g(𝐟, θ)
• deep learning: 𝒚 = g(I, θ), where the features are learned jointly with the model
[Figure: scale of video recognition datasets, #Example (1,000 to 10,000,000) vs. #Class (10 to 10,000): Hollywood, HMDB51, UCF101, CCV, MSR-VTT, FCVID, ActivityNet, Sports-1M, YouTube-8M]
Computer vision: 50 years of progress
[Figure: scale of image recognition and captioning datasets, #Example vs. #Class: Caltech 101, Caltech 256, Pascal, SUN, ImageNet, COCO, Visual Genome, Open Images, KBK-1M, Flickr 8K, Flickr 30K, MSVD, MPII-MD, M-VAD, SBU, TGIF, YFCC 100M]
Introduction
• Video has become ubiquitous on the Internet, TV, as well as personal devices.
• Recognition of video content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding videos using a predefined yet limited vocabulary.
• Thanks to the recent development of deep learning techniques, researchers in both the computer vision and multimedia communities are now striving to bridge video with natural language, which can be regarded as the ultimate goal of video understanding.
• We present recent advances in exploring the synergy of video understanding and language processing, including video-language alignment, video captioning, and video emotion analysis.
From Classification to Description
Recognizing Realistic Actions from Videos "in the Wild"
UCF-11 to UCF-101
(CVPR 2009)
Similarity between Videos; Cross-Domain Learning
Visual Event Recognition in Videos by Learning from Web Data (CVPR 2010 Best Student Paper)
Heterogeneous Feature Machine For Visual Recognition
(ICCV 2009)
From Classification to Description
Semantic Video Entity Linking
(ICCV 2015)
Aligning Language Descriptions with Videos
Iftekhar Naim, Young Chol Song, Qiguang Liu, Jiebo Luo, Dan Gildea, Henry Kautz
Overview
Unsupervised alignment of video with text
Motivations:
• Generate labels from data (reduce the burden of manual labeling)
• Learn new actions from only parallel video+text
• Extend noun/object matching to verbs and actions
[Figure: overview of the text and video alignment framework (Naim et al., 2015): matching nouns to objects and verbs to actions, e.g., "The person takes out a knife and cutting board"]
Hyperfeatures for Actions
• High-level features are required for alignment with text, but motion features are generally low-level
• Hyperfeatures, originally used for image recognition, are extended for use with motion features: vector quantization (clustering) is done over the temporal domain instead of the spatial domain
• Originally described for images in "Hyperfeatures: Multilevel Local Coding for Visual Recognition" (Agarwal, ECCV 2006)
Hyperfeatures for Actions
• From low-level motion features, create high-level representations that can easily align with verbs in text
[Figure: hyperfeature construction: each STIP point is vector quantized (color coded); a histogram of the quantized STIP points is accumulated over the frame at time t; the cluster assignments are then accumulated over the window (t-w/2, t+w/2] and vector quantized again to produce first-level hyperfeatures, which are aligned with verbs from text using an LCRF]
Latent-variable CRF Alignment
• CRF where the latent variable is the alignment
• N pairs of video/text observations {(x_i, y_i)}_{i=1}^N (indexed by i)
• X_{i,m} represents the nouns and verbs extracted from the m-th sentence
• Y_{i,n} represents the blobs and actions in interval n of the video
• Conditional likelihood: L(w) = Σ_i log p(y_i | x_i; w), where the conditional probability marginalizes over latent alignments h, p(y | x; w) ∝ Σ_h exp(wᵀ φ(x, y, h)), with feature function φ
• Weights w are learned by stochastic gradient descent
More details in Naim et al., "Discriminative unsupervised alignment of natural language instructions with corresponding video segments," NAACL 2015.
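As a toy illustration of the latent-alignment idea (not the paper's implementation), the sketch below enumerates monotone sentence-to-interval alignments, scores each with simple word/label co-occurrence features, and returns the softmax posterior over alignments, the quantity a stochastic-gradient learner of w needs; exhaustive enumeration is only feasible at this toy scale:

import itertools
import numpy as np

def alignment_posterior(sent_tokens, interval_labels, w, vocab):
    # sent_tokens: list of token sets, one per sentence
    # interval_labels: list of label sets, one per video interval
    # w: weight vector; vocab: {(token, label): index into w}
    M, Nv = len(sent_tokens), len(interval_labels)
    aligns, scores = [], []
    for a in itertools.product(range(Nv), repeat=M):
        if list(a) != sorted(a):
            continue  # keep alignments temporally monotone
        s = sum(w[vocab[tok, lab]]
                for m, n in enumerate(a)
                for tok in sent_tokens[m]
                for lab in interval_labels[n]
                if (tok, lab) in vocab)
        aligns.append(a)
        scores.append(s)
    scores = np.array(scores, dtype=float)
    post = np.exp(scores - scores.max())
    return aligns, post / post.sum()  # posterior over latent alignments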
Experiments: Wetlab Dataset
• RGB-Depth video paired with lab protocols in text
• Compare the addition of hyperfeatures generated from motion features to the previous results (Naim et al., 2015)
• Small improvement over the previous results: the activities are already highly correlated with object use
[Figure: objects are detected in 3D space using color and point cloud; results compare object/noun alignment alone against the addition of different types of motion features (2DTraj: dense trajectories), using hyperfeature window size w=150]
Experiments: TACoS Dataset
• RGB video with crowd-sourced text descriptions of activities such as "making a salad" or "baking a cake"
• No object recognition; alignment uses actions only
• Uniform: assume each sentence takes the same amount of time over the entire sequence
• Segmented LCRF: assume the segmentation of actions is known, infer only the action labels
• Unsupervised LCRF: both segmentation and alignment are unknown
• Effect of window size and number of clusters: the best setting is consistent with the average action length of 150 frames (hyperfeature window size w=150, d(2)=64)
Experiments: TACoS Dataset
[Figure: segmentation from a sequence in the dataset with crowd-sourced descriptions; example of the text and video alignment generated by the system on the TACoS corpus for sequence s13-d28]
Image Captioning with Semantic Attention (CVPR 2016)
Quanzeng You, Jiebo Luo
Hailin Jin, Zhaowen Wang and Chen Fang
Image Captioning
• Motivations
– Real-world usability: help visually impaired and learning-impaired people
– Improving image understanding: classification, object detection
– Image retrieval
Example captions for two images:
• "a young girl inhales with the intent of blowing out a candle" / "girl blowing out the candle on an ice cream"
• "A shot from behind home plate of children playing baseball" / "A group of children playing baseball in the rain" / "Group of baseball players playing on a wet field"
Introduction of Image Captioning
• Machine learning as an approach to solve the problem
Overview
• Brief overview of current approaches
• Our main motivation
• The proposed semantic attention model
• Evaluation results
Brief Introduction of Recurrent Neural Networks
• Different from a CNN, the hidden state is recurrent:
  h_t = f(x_t, h_{t-1}) = σ(A x_t + B h_{t-1}),  y_t = C h_t
• Unfolding over time turns the recurrence into a feedforward network, trained with backpropagation through time
[Figure: an RNN unfolded over time steps t-2, t-1, t: inputs x_t feed the hidden units through A, the previous hidden state through B, and outputs y_t are read out through C]
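A minimal runnable sketch of this recurrence; the dimensions and random weights are illustrative:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 8, 16, 4, 10
A = rng.normal(scale=0.1, size=(d_h, d_in))   # input-to-hidden weights
B = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-to-hidden weights
C = rng.normal(scale=0.1, size=(d_out, d_h))  # hidden-to-output weights

x = rng.normal(size=(T, d_in))  # an input sequence
h = np.zeros(d_h)               # h_0
ys = []
for t in range(T):
    h = np.tanh(A @ x[t] + B @ h)  # h_t = f(x_t, h_{t-1})
    ys.append(C @ h)               # y_t = C h_t
ys = np.stack(ys)                  # (T, d_out) outputs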
Applications of Recurrent Neural Networks
• Machine Translation
• Reads input sentence “ABC” and produces “WXYZ”
[Figure: Encoder RNN reads the input sentence; Decoder RNN produces the output sentence]
Encoder-Decoder Framework for Captioning
• Inspired by neural network based machine translation
• Loss function: L = -log p(w | I) = -Σ_{t=1}^{N} log p(w_t | I, w_0, …, w_{t-1})
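A minimal sketch of this loss, with random logits standing in for the decoder's per-step outputs:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, N = 1000, 12
caption = rng.integers(vocab_size, size=N)  # ground-truth word ids w_1..w_N
logits = rng.normal(size=(N, vocab_size))   # decoder scores at each step

# Stable log-softmax, then pick out log p(w_t | I, w_<t) at each step
m = logits.max(axis=1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(N), caption].sum()  # -sum_t log p(w_t | I, w_<t)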
Our Motivation
• Additional textual information
– Own noisy title, tags or captions (Web)
– Visually similar nearest neighbor images
– Success of low-level tasks
• Visual attributes detection
Image Captioning with Semantic Attention
• Big idea
First Idea
• Provide additional knowledge at each input node
• Concatenate the input word and the extra attributes K; each image has a fixed keyword list:
  h_t = f(x_t, h_{t-1}) = f([w_t; W_k K + b_k], h_{t-1})
• Visual features: 1024-d GoogleNet; LSTM hidden states: 512
• Training details: batches of 256 image/sentence pairs, RMSProp
Using Attributes along with Visual Features
• Provide additional knowledge at each input node
• Concatenate the visual embedding and the keywords to initialize the hidden state:
  h_0 = f([W_v v; W_k K + b])
Attention Model on Attributes
• Instead of using the same set of attributes at every step, select the attributes at each step (attention):
  α_{t,m} ∝ exp(w_tᵀ V k_m),  att(w_t, K) = Σ_m α_{t,m} k_m
  h_t = f(x_t, h_{t-1}) = f([x_t; att(w_t, K)], h_{t-1})
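A minimal sketch of this attention step, assuming a bilinear score between the current word embedding and each attribute embedding; dimensions and weights are illustrative:

import numpy as np

rng = np.random.default_rng(0)
d_w, d_k, m = 32, 32, 5
V = rng.normal(scale=0.1, size=(d_k, d_w))  # bilinear interaction matrix
K = rng.normal(size=(m, d_k))               # m attribute embeddings k_1..k_m
w_t = rng.normal(size=d_w)                  # current word embedding

scores = K @ (V @ w_t)                      # alpha_{t,m} ~ exp(w_t^T V k_m)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                        # softmax attention weights
att = alpha @ K                             # att(w_t, K) = sum_m alpha_m k_m
x_aug = np.concatenate([w_t, att])          # fed to the RNN as [x_t; att]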
Overall Framework
• Training with a bilinear/bilateral attention model
[Figure: overall framework: a CNN extracts the visual feature v from the image, attribute detectors AttrDet 1…N produce attributes {A_i}, and at each step the RNN combines the input word x_t with input attention φ and output attention ϕ over the attributes to produce h_t, p_t, and the predicted word Y_t; v is fed in at t = 0]
Visual Attributes
• A secondary contribution
• We try different approaches
Performance
• Examples showing the impact of visual attributes on captions
Performance on the Testing Dataset
• Publicly available split
Performance
• MS-COCO Image Captioning Challenge
Captioning with Emotion and Style: A Simple Framework
Examples
Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification (Xiang Bai et al.)
TGIF: A New Dataset and Benchmark on Animated GIF Description
Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault,
Larry Goldberg, Jiebo Luo
Overview
Comparison with Existing Datasets
Examples
• a skate boarder is doing trick on his skate board.
• a gloved hand opens to reveal a golden ring.
• a sport car is swinging on the race playground
• the vehicle is moving fast into the tunnel
Contributions
• A large-scale animated GIF description dataset for promoting image sequence modeling and research
• Automatic validation to collect natural language descriptions from crowd workers
• Baseline image captioning methods for future benchmarking
• Comparison with existing datasets, highlighting the benefits of animated GIFs
In Comparison with Existing Datasets
• The language in our dataset is closer to common language
• Our dataset has an emphasis on verbs
• Animated GIFs are more coherent and self-contained
• Our dataset can be used to tackle the more difficult movie description problem
Machine Generated Sentence Examples
Comparing Professionals and Crowd-workers
• Crowd worker: "two people are kissing on a boat." Professional: "someone glances at a kissing couple then steps to a railing overlooking the ocean an older man and woman stand beside him."
• Crowd worker: "two men got into their car and not able to go anywhere because the wheels were locked." Professional: "someone slides over the camaros hood then gets in with his partner he starts the engine the revving vintage car starts to backup then lurches to a halt."
• Crowd worker: "a man in a shirt and tie sits beside a person who is covered in a sheet." Professional: "he makes eye contact with the woman for only a second."
More: http://beta-web2.cloudapp.net/lsmdc_sentence_comparison.html
Movie Descriptions versus TGIF
• Crowd workers are encouraged to describe the major visual content directly, and not to use overly descriptive language
• Because our animated GIFs are presented to crowd workers without any context, the sentences in our dataset are more self-contained
• Animated GIFs are perfectly segmented, since they are carefully curated by online users to create a coherent visual story
Code & Dataset
• Yahoo! webscope (Coming soon!)
• Animated GIFs and sentences
• Code and models for LSTM baseline
• Pipeline for syntactic and semantic validation to collect natural language from crowd workers
Image Captioning with Unpaired Data
• Supervised and semi-supervised captioning methods rely on paired training data, which are expensive to acquire (i.e., the bottleneck)
• Image captioning trained with unpaired data is a promising direction because unpaired training data are much easier to obtain
[Figure: (a) supervised captioning: every image is paired with captions; (b) semi-supervised captioning: only part of the data is paired; (c) unpaired captioning: disjoint sets of images and captions]
Image Captioning with Unpaired Data
• How to align images with sentences?
Example sentence: "Little girl looking down at leaves with her bicycle with training wheels parked next to her."
[Figure: object detection is applied to the image so that the detected objects can be matched against words in the sentence]
Image Captioning with Unpaired Data
• Detected concepts: bench, bike, girl. How to generate a sentence?
[Figure: a generator produces candidate sentences (e.g., "A young child in a park next to a red bench and red bicycle that as training wheels."), which are fed together with real sentences (e.g., "A woman that is standing in front of an oven.") to a discriminator that classifies them as real/fake]
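A toy sketch of this adversarial signal, with a bag-of-words logistic-regression discriminator standing in for the actual sentence discriminator; the two example sentences are taken from the slide, everything else is illustrative:

import numpy as np

def tokens(s):
    return s.lower().replace('.', ' ').split()

def bow(s, vocab):
    # Bag-of-words vector; assumes every token of s appears in vocab
    v = np.zeros(len(vocab))
    for t in tokens(s):
        v[vocab[t]] += 1.0
    return v

real = ["a woman that is standing in front of an oven ."]
fake = ["a young child in a park next to a red bench and red bicycle ."]
vocab = {t: i for i, t in enumerate(sorted({t for s in real + fake for t in tokens(s)}))}

X = np.array([bow(s, vocab) for s in real + fake])
y = np.array([1.0] * len(real) + [0.0] * len(fake))
w = np.zeros(X.shape[1])
for _ in range(200):                 # train the discriminator
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - p)         # gradient ascent on the log-likelihood
# The generator's reward for a sampled sentence is the discriminator's score
reward = 1.0 / (1.0 + np.exp(-bow(fake[0], vocab) @ w))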
Image Captioning with Unpaired Data
Unpaired Image Captioning
Detected concepts: person, tree, helmet
• con2sen: a person in a helmet is riding a tree .
• feat2sen: a person wearing a helmet is skateboarding on a ramp .
• adv: a man riding a skateboard down a street .
• adv + con: a person on a court with a racket
• adv + con + im: a man riding a skateboard on a ramp .
• Ours without init: a man riding a surfboard on top of a wave .
• Ours: a man on a skateboard performing a trick .
• BRNN: a man riding a skateboard down the side of a ramp
Detected concepts: boat, watercraft
• con2sen: a group of people in a boat with a lighthouse .
• feat2sen: a boat that is floating in the water .
• adv: a man riding a wave on top of a surfboard .
• adv + con: a large boat that is in the water .
• adv + con + im: a large boat in the middle of a lake .
• Ours without init: a boat that is sitting in the water
• Ours: a view of a boat in the ocean
• BRNN: a boat floating on top of a body of water
Unpaired Image Captioning
Detected concepts: boat
• con2sen: a boat that is floating in the water .
• feat2sen: a boat that is floating in the water .
• adv: a small boat is floating in the water .
• adv + con: a boat that is floating in the water .
• adv + con + im: a boat that is parked in the water .
• Ours without init: a boat that is sitting in the water .
• Ours: a boat is docked in a harbor .
• BRNN: a boat floating on top of a body of water
Detected concepts: animal, frame
• con2sen: a stuffed animal sitting on top of a frame .
• feat2sen: a group of people standing around a table with a picture of a large piece of cloth and
• adv: a man riding a wave on top of a surfboard .
• adv + con: a group of people that are standing on the beach
• adv + con + im: a person riding a surf board on the waves
• Ours without init: a child is flying a kite on the beach .
• Ours: a group of people surfing on a wave .
• BRNN: a man standing on top of a sandy beach
Sports Video Captioning by Attentive Motion Representation-based Hierarchical Recurrent Neural Networks
Mengshi Qi1, Yunhong Wang1, Annan Li1, Jiebo Luo2
1Beijing Advanced Innovation Center for Big Data and Brain Computing,
School of Computer Science and Engineering, Beihang University
2Department of Computer Science, University of Rochester
Introduction
Sports video captioning is the task of automatically generating a textual description for sports events (e.g., soccer, basketball, or volleyball games).
[Figure: example frames from soccer, basketball, and volleyball games]
Problem Definition and Challenges
• How to design an effective sports video captioning system? Conventional video captioning methods can only generate coarse descriptions and overlook the motion details of players' actions and group-level activities
• How to capture fine-grained motion information of sports players? Players' articulated movements, pose estimation, and motion trajectories matter, and the key player's action is critical in a sports event
• How to build a dataset that focuses on sports video captioning across different sports games?
Contributions
• We introduce a novel framework for sports video captioning with attentive motion representation-based hierarchical recurrent neural networks.
• A motion representation module is employed to extract players' pose and trajectory information, where we capture semantic attributes from players' skeletons and cluster trajectories from their dynamic movements.
• We annotate a new dataset called the Sports Video Captioning Dataset-Volleyball.
Framework
(1) Action Proposal Module
• Given a video, the first task is to retrieve and localize temporal segments that likely contain spatio-temporal group events (e.g., attack, defend) or individual actions (e.g., spiking, passing)
• We use the Deep Action Proposals (DAP) method [1]
Motion Representation Module
(1) Pose Attribute Detection Part
(2) Trajectory Clustering Part
Pose Attribute Detection Part
• Extract pose-based CNN (Convolutional Neural Network) features [2] from each body part of a player
• Build an attribute vocabulary from the annotated sentences in the dataset (e.g., UCF-101 [3], Volleyball [4]), and use the top k high-frequency words
Trajectory Clustering Part
• Extract the dense point trajectories Tra = {Tra_1, …, Tra_M} for a sequence of frames with trajectory-pooled deep-convolutional descriptors [5], where M is the number of trajectories
• Partition all detected trajectories into groups by computing the affinity matrix between each trajectory pair and applying a graph clustering method [6], obtaining m clusters
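A rough illustrative sketch of the trajectory-grouping step, using scikit-learn's SpectralClustering as a stand-in for the graph clustering method [6]; the Gaussian affinity and the number of clusters are assumptions:

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
M, L = 40, 15  # M trajectories, each a sequence of L (x, y) points
tracks = rng.normal(size=(M, L, 2)).cumsum(axis=1)  # random-walk trajectories

# Affinity between trajectory pairs: Gaussian kernel on mean point distance
d = np.linalg.norm(tracks[:, None] - tracks[None, :], axis=-1).mean(axis=-1)
affinity = np.exp(-d**2 / (2 * d.std() ** 2))

# Cluster the affinity graph into m groups of trajectories
labels = SpectralClustering(n_clusters=3, affinity='precomputed',
                            random_state=0).fit_predict(affinity)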
Databases
General Video Captioning:
1) Microsoft Video Description Dataset (MSVD)
2) MSR Video-to-Text Dataset (MSR-VTT)
Sports Video Captioning:
1) Sports Video Captioning Dataset-Volleyball (SVCDV)
Sports Video Captioning Dataset-Volleyball (SVCDV)
• 55 videos with 4,830 short clips from Olympic volleyball games
• About 44,436 sentences, 9.2 sentences per clip on average
• The verb ratio is nearly 16.2%, with 1.72 verbs per sentence
• 9 individual action labels: waiting/setting/digging/falling/spiking/blocking/jumping/moving/standing
• 8 group activity labels: right set/right spike/right pass/right winpoint/left winpoint/left pass/left spike/left set
General Video Captioning Task
Table 1. Captioning performance on the MSVD and MSR-VTT datasets
General Video Captioning Task
Sports Video Captioning and Component Analysis
Table 2. Captioning Performance on SVCDV
Sports Video Captioning
Demo Video
Video Re-localization (ECCV 2018)
• How to localize the part of the long video on the right that is semantically relevant to the clip on the left?
Video Re-localization
Possible solutions
• Copy detection: the contents can be largely different in both foreground and background
• Action localization: only one sample is available
Video Re-localization
• We propose a matching framework
Cross Gating & Bilinear Matching
• Cross gating removes irrelevant contents
• Bilinear matching captures the interactions between the query and the reference
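A minimal sketch of the general form of these two operations, under assumed feature shapes (per-segment query features q and reference features r); the exact parameterization in the paper may differ:

import numpy as np

rng = np.random.default_rng(0)
d = 16
q, r = rng.normal(size=d), rng.normal(size=d)
Wg_q = rng.normal(scale=0.1, size=(d, d))  # gate parameters (illustrative)
Wg_r = rng.normal(scale=0.1, size=(d, d))
W_bi = rng.normal(scale=0.1, size=(d, d))  # bilinear interaction matrix

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# Cross gating: each stream is gated by the other, suppressing irrelevant content
q_gated = q * sigmoid(Wg_r @ r)
r_gated = r * sigmoid(Wg_q @ q)
# Bilinear matching: a scalar interaction score between query and reference
score = q_gated @ W_bi @ r_gated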
Video Re-localization
• Results (predicted start/end frames)
Shuffleboard (query length = 14, reference length = 49): Ground truth 17-27, Ours 17-27, Frame 15-23, Video 17-19
Pole vault (query length = 15, reference length = 49): Ground truth 10-37, Ours 10-38, Frame 20-32, Video 17-38
Video Re-localization
Localizing Language in Video (AAAI 2019)
• How to localize the part of the long video on the right that is semantically relevant to the query on the left?
Localizing Language in Video (AAAI 2019)
Increasing Attention (CVPR 2019)