https://imatge-upc.github.io/rvos-mots/
Video object segmentation can be understood as a sequence-to-sequence task that can benefit from curriculum learning strategies for better and faster training of deep neural networks. This work explores different scheduled sampling and frame skipping variations to significantly improve the performance of a recurrent architecture. Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse scheduled sampling is a better option than the classic forward one, and that progressively skipping frames during training is beneficial, but only when training with the ground-truth masks instead of the predicted ones.
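The forward and inverse scheduled sampling variants compared above can be sketched as follows; the linear schedules and the function names are illustrative assumptions, not the exact schedules used in the work.

```python
import random

def forward_schedule(epoch, num_epochs):
    # Classic scheduled sampling: start by feeding ground-truth masks,
    # then progressively switch to the model's own predictions.
    return 1.0 - epoch / num_epochs   # probability of using ground truth

def inverse_schedule(epoch, num_epochs):
    # Inverse scheduled sampling: start from the model's predictions,
    # then progressively introduce the ground-truth masks.
    return epoch / num_epochs

def pick_input(prev_gt_mask, prev_pred_mask, epoch, num_epochs, schedule):
    """Choose which previous-frame mask is fed to the recurrent model."""
    if random.random() < schedule(epoch, num_epochs):
        return prev_gt_mask
    return prev_pred_mask
```

At epoch 0 the forward schedule always feeds the ground truth, while the inverse one always feeds the prediction; both cross over as training progresses.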
A presentation by SMART Infrastructure Facility Research Director Dr Pascal Perez to the 11th International Multidisciplinary Modeling and Simulation Multiconference (I3M), Bordeaux, September 2014.
Coordinated by the OER Foundation, OERu is an independent, not-for-profit organization with 35 participating higher education institutions worldwide, making higher education accessible to everyone by offering free online courses and “affordable ways for learners to gain academic credit towards qualifications from recognised institutions” (McGreal et al. 2014). The 2015 OERu evaluation follows the CIPP (context, input, process, and product) evaluation framework (Stufflebeam 2003) and focuses on “input analysis” at this stage. The evaluation aims to assess different design options and identify major challenges in online curriculum development, covering the nomination of open courses by participating institutions, open business models, open governance, and other aspects. Issues raised in the evaluation process are not unique to OERu and will have relevance to other practitioners designing open education.
Slides EMOOCs 2016 European MOOC Stakeholders Summit, Olivier Bernaert
SLIDES - Presentation by Lucie Dhorne
The past few years have seen an exponential growth in the number of MOOCs worldwide. However, the available completion rate data shows that motivation can quickly fade even for students who are highly motivated at the beginning of the courses. Faced with this reality, it seems crucial for the future of MOOCs to address this motivational issue and to find ways to improve completion rates.
IFP School launched two MOOCs – “Sustainable Mobility” and “Oil & Gas” – which saw unusually high completion rates.
In this paper we analyze the results obtained within these two MOOCs. Our goal is to identify the factors that made such completion rates possible and to understand how these key issues help to produce a successful MOOC. Through this analysis, we are able to give some tips in terms of video recording, interactive assignment design (such as serious games or mini-games) and participant mentoring to promote motivation. Applying these tips when designing a MOOC will minimize the chance of participant withdrawal and thus lead to high completion rates.
It is important to understand the basic concept of a staff development program and its significance when applied within an organisation to build a stronger and more efficient workforce.
This document provides an overview of deep generative learning and summarizes several key generative models including GANs, VAEs, diffusion models, and autoregressive models. It discusses the motivation for generative models and their applications such as image generation, text-to-image synthesis, and enhancing other media like video and speech. Example state-of-the-art models are provided for each application. The document also covers important concepts like the difference between discriminative and generative modeling, sampling techniques, and the training procedures for GANs and VAEs.
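As a small illustration of one of the training procedures mentioned, the standard VAE objective combines a reconstruction term with a KL regularizer; the squared-error reconstruction below is a common simplification and an assumption here, not necessarily the formulation used in the slides.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Standard VAE training objective (negative ELBO): a reconstruction
    term plus the KL divergence of the approximate posterior
    N(mu, diag(exp(log_var))) from the standard normal prior N(0, I)."""
    recon = np.sum((x - x_recon) ** 2)  # squared-error reconstruction
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```

When the posterior matches the prior (mu = 0, log_var = 0) and the reconstruction is perfect, the loss is zero.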
Machine translation and computer vision have greatly benefited from the advances in deep learning. A large and diverse amount of textual and visual data has been used to train neural networks, whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, such as scarce video resources, limitations in hand pose estimation, or 3D spatial grounding from poses.
The transformer is the neural architecture that has received the most attention in the early 2020s. It removed the recurrence in RNNs, replacing it with an attention mechanism across the input and output tokens of a sequence (cross-attention) and between the tokens composing the input (and output) sequences, named self-attention.
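The attention mechanism described above can be sketched in a few lines of NumPy; this is a single head without learned projections or masking, a minimal illustration rather than a full transformer layer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    In self-attention, Q, K and V are projections of the same token
    sequence; in cross-attention, Q comes from one sequence and K, V
    from another."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values
```

Each output token is a convex combination of the value vectors, weighted by how similar its query is to every key.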
These slides review the research of our lab since 2016 on applied deep learning, starting from our participation in the TRECVID Instance Search 2014, moving into video analysis with CNN+RNN architectures, and our current efforts in sign language translation and production.
Machine translation and computer vision have greatly benefited from the advances in deep learning. A large and diverse amount of textual and visual data has been used to train neural networks, whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, such as scarce video resources, limitations in hand pose estimation, or 3D spatial grounding from poses. This talk will present these challenges and the How2✌️Sign dataset (https://how2sign.github.io) recorded at CMU in collaboration with UPC, BSC, Gallaudet University and Facebook.
https://imatge.upc.edu/web/publications/sign-language-translation-and-production-multimedia-and-multimodal-challenges-all
https://imatge-upc.github.io/synthref/
Integrating computer vision with natural language processing has achieved significant progress over the last years owing to the continuous evolution of deep learning. A novel vision and language task, which is tackled in the present Master thesis, is referring video object segmentation, in which a language query defines which instance to segment from a video sequence. One of the biggest challenges for this task is the lack of relatively large annotated datasets, since a tremendous amount of time and human effort is required for annotation. Moreover, existing datasets suffer from poor-quality annotations, in the sense that approximately one out of ten language expressions fails to uniquely describe the target object.
The purpose of the present Master thesis is to address these challenges by proposing a novel method for generating synthetic referring expressions for an image (video frame). This method produces synthetic referring expressions by using only the ground-truth annotations of the objects as well as their attributes, which are detected by a state-of-the-art object detection deep neural network. One of the advantages of the proposed method is that its formulation allows its application to any object detection or segmentation dataset.
By using the proposed method, the first large-scale dataset with synthetic referring expressions for video object segmentation is created, based on an existing large benchmark dataset for video instance segmentation. A statistical analysis and comparison of the created synthetic dataset with existing ones is also provided in the present Master thesis.
The conducted experiments on three different datasets used for referring video object segmentation prove the efficiency of the generated synthetic data. More specifically, the obtained results demonstrate that by pre-training a deep neural network with the proposed synthetic dataset one can improve the ability of the network to generalize across different datasets, without any additional annotation cost.
Master MATT thesis defense by Juan José Nieto
Advised by Víctor Campos and Xavier Giro-i-Nieto.
27th May 2021.
Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to learn and discover meaningful skills in high-dimensional state-spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation by making use of variational or contrastive techniques. We demonstrate that both allow learning a set of basic navigation skills by maximizing an information theoretic objective. We assess our method in Minecraft 3D maps with different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations.
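The information-theoretic objective mentioned above can be illustrated with a DIAYN-style intrinsic reward; this is an assumption for illustration, not necessarily the exact objective used in the thesis.

```python
import numpy as np

def skill_discovery_reward(log_q_z_given_s, num_skills):
    """Intrinsic reward in the style of DIAYN: r = log q(z|s) - log p(z).

    Maximizing it maximizes a variational lower bound on the mutual
    information I(S; Z) between visited states and the active skill z.
    log_q_z_given_s is a learned discriminator's log-probability of the
    active skill given the current state; p(z) is uniform over skills."""
    log_p_z = -np.log(num_skills)
    return log_q_z_given_s - log_p_z
```

The reward is positive whenever the discriminator identifies the active skill better than chance, pushing skills to visit distinguishable states.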
https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft
Peter Muschick MSc thesis
Universitat Politècnica de Catalunya, 2020
Sign language recognition and translation has been an active research field in recent years, with most approaches using deep neural networks to extract information from sign language data. This work investigates the mostly disregarded approach of using human keypoint estimation from image and video data with OpenPose in combination with a transformer network architecture. Firstly, it was shown that it is possible to recognize individual signs (4.5% word error rate (WER)). Continuous sign language recognition, though, was more error-prone (77.3% WER), and sign language translation was not possible using the proposed methods, which might be due to low accuracy scores of human keypoint estimation by OpenPose and the accompanying loss of information, or insufficient capacity of the used transformer model. Results may improve with the use of datasets containing higher repetition rates of individual signs, or by focusing more precisely on keypoint extraction of the hands.
https://github.com/telecombcn-dl/lectures-all/
These slides review techniques for interpreting the behavior of deep neural networks. The talk reviews basic techniques such as the display of filters and tensors, as well as more advanced ones that try to interpret which part of the input data is responsible for the predictions, or generate data that maximizes the activation of certain neurons.
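As a toy illustration of the input-attribution idea mentioned above, consider a linear classifier, where the input gradient is available in closed form; real saliency methods backpropagate through a deep network, so this is only a sketch of the principle.

```python
import numpy as np

def linear_saliency(W, target_class):
    """Saliency of each input dimension for a linear classifier s = W x + b.

    For a linear model, the input gradient d s[target] / d x is simply the
    corresponding weight row, so |W[target]| indicates which input
    dimensions most influence the prediction -- the core idea behind
    gradient-based saliency maps."""
    return np.abs(W[target_class])

# Toy example: 4-dimensional input, 2 classes.
W = np.array([[0.0, 2.0, -1.0, 0.5],
              [1.0, 0.0,  0.0, 0.0]])
saliency = linear_saliency(W, target_class=0)  # -> [0., 2., 1., 0.5]
```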
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and a computational perspective.
https://telecombcn-dl.github.io/dlai-2020/
https://telecombcn-dl.github.io/drl-2020/
This course presents the principles of reinforcement learning as an artificial intelligence tool based on the interaction of the machine with its environment, with applications to control tasks (e.g. robotics, autonomous driving) or decision making (e.g. resource optimization in wireless communication networks). It also advances in the development of deep neural networks trained with little or no supervision, both for discriminative and generative tasks, with special attention to multimedia applications (vision, language and speech).
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
Tutorial page:
https://imatge.upc.edu/web/publications/one-perceptron-rule-them-all-language-vision-audio-and-speech-tutorial
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will firstly review the basic neural architectures to encode and decode vision, text and audio, to later review those models that have successfully translated information across modalities.
Image segmentation is a classic computer vision task that aims at labeling pixels with semantic classes. These slides provide an overview of the basic approaches applied from the deep learning field to tackle this challenge and presents the basic subtasks (semantic, instance and panoptic segmentation) and related datasets.
Presented at the International Summer School on Deep Learning (ISSonDL) 2020 held online and organized by the University of Gdansk (Poland) between the 30th August and 2nd September.
http://2020.dl-lab.eu/virtual-summer-school-on-deep-learning/
Deep neural networks have achieved outstanding results in various applications such as vision, language, audio, speech, or reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. During the last years, multiple solutions have been proposed to alleviate this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.
Deep neural networks have revolutionized the data analytics scene by improving results in several and diverse benchmarks with the same recipe: learning feature representations from data. These achievements have raised the interest across multiple scientific fields, especially in those where large amounts of data and computation are available. This change of paradigm in data analytics has several ethical and economic implications that are driving large investments, political debates and sounding press coverage under the generic label of artificial intelligence (AI). This talk will present the fundamentals of deep learning through the classic example of image classification, and point at how the same principle has been adopted for several other tasks. Finally, some of the forthcoming potentials and risks for AI will be pointed out.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT … by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
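The first technique, skipping computation on vertices that have already converged, can be sketched as follows; the per-vertex freezing heuristic and tolerances here are illustrative assumptions, not the exact criteria of STICD, and the sketch models the idea rather than the actual computational saving.

```python
import numpy as np

def pagerank_skip_converged(adj, d=0.85, tol=1e-8, vertex_tol=1e-10):
    """Power-iteration PageRank that freezes vertices once their rank
    change falls below vertex_tol, skipping further updates to them.

    adj[u] lists the out-neighbours of vertex u."""
    n = len(adj)
    rank = np.full(n, 1.0 / n)
    out_deg = np.array([max(len(adj[u]), 1) for u in range(n)])
    converged = np.zeros(n, dtype=bool)
    while True:
        new_rank = np.full(n, (1.0 - d) / n)  # teleport contribution
        for u in range(n):
            share = d * rank[u] / out_deg[u]
            for v in adj[u]:
                new_rank[v] += share           # push rank along out-links
        new_rank[converged] = rank[converged]  # frozen vertices keep rank
        delta = np.abs(new_rank - rank)
        converged |= delta < vertex_tol        # freeze newly converged ones
        rank = new_rank
        if delta.sum() < tol:
            return rank
```

On a symmetric graph such as a 3-cycle, every vertex converges immediately to the uniform rank 1/3.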
4. INTRODUCTION
Curriculum Learning for Recurrent VOS - 4 of 144
Curriculum Learning:
Methodology inspired by the learning process of humans. The training data is presented in a meaningful order, from simple to complex concepts.
4 curriculums, applied to THE DATASET and THE MODEL.
Yoshua Bengio et al. “Curriculum Learning”, ICML 2009.
11. INTRODUCTION
THE TASK
Semi-supervised or “one-shot” Video Object Segmentation
[Figure labels: given to the model / estimated by the model]
13. KITTI-MOTS
DATASET
Andreas Geiger, Philip Lenz, and Raquel Urtasun. “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite”, CVPR 2012.
Its video sequences present challenges.
19. THE MODEL
End-to-End Recurrent Network for video object segmentation: RVOS
Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
Ali Athar et al. “STEM-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos”, ECCV 2020.
22. SETS OF EXPERIMENTS
All techniques tested on two sets of experiments:
- Set 1: Resolution 287x950, Batch Size 2, Clip Length 3
- Set 2: Resolution 256x448, Batch Size 4, Clip Length 5
23. METRICS
The results have been evaluated on the official metrics of the MOTS Challenge.
- sMOTSA has been defined as the reference metric.
Paul Voigtlaender et al. “MOTS: Multi-Object Tracking and Segmentation”, CVPR 2019.
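For reference (the slide showed the formula as an image, so this is reconstructed from the MOTS paper by Voigtlaender et al.), sMOTSA combines a soft true-positive count with false positives and identity switches:

```latex
\mathrm{sMOTSA} = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|},
\qquad
\widetilde{TP} = \sum_{h \in \mathrm{TP}} \mathrm{IoU}\big(h, c(h)\big)
```

where M is the set of ground-truth masks, FP the false positives, IDS the identity switches, and c(h) the ground-truth mask matched to hypothesis h.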
59. FRAME SKIPPING
Ideally: train on all N frames of the sequence.
But we have limitations (e.g. memory constraints).
75. FRAME SKIPPING
Frame skipping configurations tested (skip range x schedule):
- From 0 to 9: applied during all training, or only during the first half of training
- From 1 to 5: applied during all training, or only during the first half of training
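A progressive frame-skipping schedule like the ones above can be sketched as follows. This is a hedged illustration, not the authors' training code: the linear schedule, the function names, and the uniform clip sampling are assumptions.

```python
# Progressive frame-skipping curriculum: the maximum allowed skip grows
# linearly over training, so early clips are densely sampled and later
# clips span longer time ranges.
import random

def max_skip(epoch, total_epochs, start_skip=0, end_skip=9, first_half_only=False):
    """Maximum allowed skip at a given epoch (linear schedule).

    With first_half_only=True the schedule finishes ramping by mid-training
    and then stays at end_skip.
    """
    horizon = total_epochs // 2 if first_half_only else total_epochs
    progress = min(epoch / max(horizon - 1, 1), 1.0)
    return round(start_skip + progress * (end_skip - start_skip))

def sample_clip(num_frames, clip_len, skip):
    """Pick clip_len frame indices separated by `skip` intermediate frames."""
    stride = skip + 1
    span = (clip_len - 1) * stride
    start = random.randint(0, max(num_frames - 1 - span, 0))
    return [start + i * stride for i in range(clip_len)]

# At the start of training, clips are consecutive frames (skip 0);
# by the last epoch of a 10-epoch run, the skip reaches 9.
print(max_skip(0, 10))   # 0
print(max_skip(9, 10))   # 9
print(sample_clip(100, 5, max_skip(0, 10)))  # 5 consecutive frame indices
```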
83. TEMPORAL AND SPATIAL RECURRENCES
KITTI-MOTS is a crowded dataset:
- Temporal recurrence: along time (the frame sequence).
- Spatial recurrence: along space (the object sequence).
Proposed curriculum, four configurations of the temporal and spatial recurrences:
- Spatio-temporal during all training
- Only temporal during all training
- Only temporal during the first half of training
- Only temporal during the second half of training
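The four configurations above can be sketched as a helper that decides, per epoch, whether the spatial (object-sequence) recurrence is active alongside the temporal one. The mode names and the epoch-based switching are illustrative assumptions, not the authors' exact implementation.

```python
# Recurrence curriculum: decide per epoch whether the spatial recurrence
# is enabled. "Only temporal during the first half" means the spatial
# recurrence is switched on only for the second half, and vice versa.

def use_spatial_recurrence(epoch, total_epochs, mode):
    half = total_epochs // 2
    if mode == "spatio-temporal":        # spatial + temporal, all training
        return True
    if mode == "temporal-only":          # temporal only, all training
        return False
    if mode == "temporal-first-half":    # temporal alone early, spatial added later
        return epoch >= half
    if mode == "temporal-second-half":   # spatial early, temporal alone later
        return epoch < half
    raise ValueError(f"unknown mode: {mode}")
```

In a training loop this flag would gate the spatial (object-sequence) hidden-state updates while the temporal recurrence always stays on.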
92. TEMPORAL AND SPATIAL RECURRENCES
[Qualitative results: ground-truth masks compared against predictions from Only Spatio-Temporal, Only Temporal, Only Temporal first half, and Only Temporal second half.]
119. YouTube-VOS
Ning Xu et al. “YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark”, ECCV 2018.
- Training parameters: Resolution 256x448, Batch Size 4, Clip Length 5.
- Evaluated with the official metrics of the YouTube-VOS challenge.
129. CONCLUSIONS
- SCHEDULE SAMPLING
- FRAME SKIPPING
- TEMPORAL AND SPATIAL RECURRENCES
- LOSS PENALIZATION BY OBJECT AREA
140. FUTURE WORK
- Schedule Sampling
- Frame Skipping
- Loss penalization by object area
- Other curriculums
- Combination of the best curriculums