Introduction of the CVPR2022 paper: Balanced multimodal learning via on-the-fly gradient modulation @ The All Japan Computer Vision Study Group (2022/08/07)
1. AGREEMENT
• If you plan to share these slides or to use the content in these slides for your own work,
please include the following reference:
Tejero-de-Pablos A. (2022) “Paper reading: Balanced multimodal learning via on-the-fly
gradient modulation”. The 11th All Japan Computer Vision Study Group.
7. The power of multimodal data
• The real world is multimodal
• Understanding the world in a comprehensive way requires more than one sense
Driving: Image of the road + Voices of children
Diagnosis: Image of the heart + ECG signal
8. What is multimodal learning?
• Neural networks can learn different types of data
But deciding when and how such data should be mixed is not trivial
Also, not all modalities are learned at the same rate → CHALLENGE
(Figure: two alternative fusion options — Option 1, Option 2, etc. — each predicting the label ・Car.)
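The fusion options above typically differ in where the modalities are combined. A minimal, purely illustrative sketch (the helper names and the use of plain Python lists are my assumptions, not the slide's; real models fuse learned feature tensors):

```python
def early_fusion(feat_a, feat_b):
    # Option 1: concatenate the two modality features into a single
    # vector before a shared classifier head.
    return feat_a + feat_b  # list concatenation stands in for torch.cat

def late_fusion(logits_a, logits_b):
    # Option 2: keep separate per-modality heads and average their
    # predictions at the end.
    return [(a + b) / 2 for a, b in zip(logits_a, logits_b)]
```

Where the fusion happens, and how each branch is weighted, is exactly the non-trivial design choice the slide points at.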
9. Paper introduction
Balanced Multimodal Learning via On-the-fly
Gradient Modulation
Peng, X., Wei, Y., Deng, A., Wang, D., & Hu, D. (2022). In Proc. Computer Vision and Pattern Recognition (pp. 8238-8247).
10. Problem setting
• Differences between modalities hinder simultaneous learning
Multimodal information is not fully utilized
11. Problem setting
• Intuitively, leveraging multiple modalities should increase performance, however…
Unimodal representations end up stronger because the multimodal model is under-optimized
Reason: different modalities converge at different rates → balance the learning speeds!
(Figure: under the common learning scheme, modality B overfits the training data while modality A still underfits it; each modality reaches its optimal learning point at a different time.)
12. Related work
• Gradient blending
Obtain an optimal blending of modalities based on their overfitting behaviors
Optimize a metric to understand the problem quantitatively: the overfitting-to-generalization ratio (OGR)
Wang, W., Tran, D., & Feiszli, M. (2020). What makes training multi-modal classification networks hard? In Proc. Computer Vision and Pattern Recognition (pp. 12695-12705).
(Figure: training curves for uni-modal, multimodal, and multimodal with weighted blending.)
The OGR between two training checkpoints measures the change in overfitting versus the change in generalization (small ∆O/∆V is better)
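Using the slide's ∆O/∆V notation, the metric can be sketched as follows (my paraphrase of Wang et al.'s definitions, not the slide's own formula, with $L_T$ and $L_V$ the training and validation losses after $N$ steps):

```latex
O_N = L_V(N) - L_T(N), \qquad
\mathrm{OGR} = \left| \frac{\Delta O}{\Delta V} \right|
             = \left| \frac{O_{N+n} - O_N}{L_V(N) - L_V(N+n)} \right|
```

∆O captures how much more the model overfits between the two checkpoints, while ∆V captures how much validation loss it sheds; a small ratio means the extra fitting actually generalized.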
13. Proposed method
• Pipeline of the On-the-fly Gradient Modulation with Generalization Enhancement strategy
Adaptively modulate the backward gradient according to the performance discrepancy between modalities
14. Proposed method
• Step 1: On-the-fly gradient modulation
Stochastic Gradient Descent (for modality “u”)
• Step 2: Generalization enhancement
Step 1 may reduce the gradient noise
↓
The generalization ability of SGD is weakened
Solution:
Add random Gaussian noise
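Steps 1 and 2 can be sketched in a few lines of plain Python (the function names, the tanh-based coefficient, and the hyper-parameter alpha follow my reading of the paper and are illustrative, not the authors' code):

```python
import math
import random

def modulation_coeff(score_u, score_other, alpha=0.1):
    # Step 1: discrepancy ratio between this modality's performance
    # score and the other modality's; shrink the gradient of whichever
    # modality is currently dominant.
    rho = score_u / score_other
    if rho > 1.0:                      # modality u is ahead
        return 1.0 - math.tanh(alpha * rho)
    return 1.0                         # lagging modality keeps its gradient

def modulate_gradient(grad, coeff, noise_std):
    # Step 2 (generalization enhancement): after scaling, re-inject
    # zero-mean Gaussian noise so SGD keeps the stochasticity that
    # helps it generalize.
    return [coeff * g + random.gauss(0.0, noise_std) for g in grad]
```

In the paper the injected noise is tied to the gradient statistics; the scalar `noise_std` here is a simplification.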
15. Experiments
• Datasets
CREMA-D: audio-visual (video) dataset for emotion recognition
Kinetics-Sounds: audio-visual (video) dataset for action recognition
VGGSound: audio-visual (video) dataset for event recognition
• Implementation
Encoders are ResNet18-based backbones. Input:
- Visual: Subsampled video frames (~3)
- Audio: Spectrogram transformation of the signal
Optimizer: SGD with momentum 0.9, weight decay 1e-4, learning rate 1e-3
16. Experiments
• Comparison on the multimodal task (dataset: CREMA-D)
(Tables: comparison with conventional fusion methods, with other modulation strategies, and when applied to existing recognition methods.)
17. Conclusions
• The proposed method is effective in solving the optimization imbalance problem
Validation accuracy on VGGSound during training:
• Limitations
The unimodal variant of OGM-GE could not outperform the base unimodal model
Other modalities and fusion strategies should be investigated
Class-wise performance is not addressed
(Figure: accuracy curves for the audio modality, the visual modality, and the multimodal model.)
19. Final remarks
• There are still many unsolved problems related to multimodal learning
Why can't multimodal models achieve optimal performance?
What kind of features from each modality is the model actually learning?
Is there a way to design an optimal fusion strategy for a given task and set of modalities?
• To researchers and professors interested in this topic:
Joint research is very welcome!