Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017

The document discusses medical practice as a recommender system. It outlines how medical decisions can be improved with recommender systems by making better use of data through algorithms and machine learning to provide more personalized recommendations. Current medical decision support systems are discussed, including knowledge-based approaches built from medical literature and data from electronic health records and ontologies. Machine learning techniques can be used to build diagnostic systems from data. The company Curai is working on combining AI/ML with good UX to build a medical tool for patients, leveraging techniques discussed in the document. Challenges include algorithmic approaches, data quality and bias, trustworthy UX, and legal issues. Medicine is an area that can greatly benefit from recommender system approaches.

PR-386: Light Field Networks: Neural Scene Representations with Single-Evalua...

The paper I am presenting this time performs view synthesis, as NeRF does. Since NeRF, many methods have appeared to address its shortcomings, while other work has taken a step back and tried approaches that differ from NeRF altogether. I will explain one of the most representative of these, Neural Light Field Rendering. Paper link: https://arxiv.org/abs/2106.02634 Video link: https://youtu.be/gxag8uvA2Sc

Image Restoration

Poonam Seth

The document discusses image restoration techniques. It describes how images can become degraded through phenomena like motion, improper camera focusing, and noise. The goal of image restoration is to recover the original high quality image from its degraded version using knowledge about the degradation process and types of noise. Common noise models include Gaussian, Rayleigh, Erlang, exponential, and impulse noise. Filtering techniques like mean, order statistics, and adaptive filters can be used for restoration by smoothing the image while preserving edges. The adaptive filters change based on local image statistics to better reduce noise with less blurring than regular filters.
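
As a rough illustration of the order-statistics filters mentioned above, the sketch below applies a median filter to a small grayscale image stored as a list of lists. The function name `median_filter`, the border handling (the window is clipped at the image edge) and the toy image are assumptions made for this example, not details taken from the slides.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2D list of pixel values.

    Border pixels use only the neighbours that fall inside the image
    (an assumption of this sketch; padding is another common choice).
    """
    height, width = len(image), len(image[0])
    half = size // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped to the image bounds.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            result[y][x] = median(window)
    return result


# Toy image: a flat region corrupted by one "salt" and one "pepper" pixel.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 0],
    [10, 10, 10, 10],
]
restored = median_filter(noisy)
```

Because the median ignores extreme values in the window, isolated impulse-noise pixels are replaced by the surrounding intensity, with less blurring than a mean filter would introduce.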

Deep learning for 3 d point clouds presentation

VijaylaxmiNagurkar

This document summarizes deep learning techniques for 3D point clouds. It discusses methods for 3D shape classification, object detection and tracking, and segmentation. For classification, projection-based and point-based networks are examined. Point-based networks include MLP, graph-based, and convolution networks. Object detection methods include region proposal-based and single shot detection. Segmentation explores semantic, instance, and part segmentation using point-based networks.

PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

PR12 Season 4 has finally begun! The first paper I am presenting this season is "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis". View synthesis is the task of taking images of a subject captured from a few viewpoints and synthesizing images of that subject as seen from positions and directions that were not captured. To do this, the paper has a neural network memorize the subject's entire 3D information. This approach is becoming well known under the name Implicit Neural Representation, and attempts to apply it to 2D images are also increasing. Video link: https://youtu.be/zkeh7Tt9tYQ Paper link: https://arxiv.org/abs/2003.08934

Visual Saliency Prediction with Deep Learning - Kevin McGuinness - UPC Barcel...

Silversparro Technologies

https://telecombcn-dl.github.io/2018-dlcv/ Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.

Video Classification Basic

Deep Learning for Personalized Search and Recommender Systems

Benjamin Le

Past, present, and future of Recommender Systems: an industry perspective

Keynote for the ACM Intelligent User Interface conference in 2016 in Sonoma, CA. I start with the past by talking about the Recommender Problem, and the Netflix Prize. Then I go into the Present and the Future by talking about approaches that go beyond rating prediction and ranking and by finishing with some of the most important lessons learned over the years. Throughout my talk I put special emphasis on the relation between algorithms and the User Interface.

Personalized Page Generation for Browsing Recommendations

Justin Basilico

Skeleton-based Human Action Recognition with Recurrent Neural Network

Luong Vo

This document presents a thesis on using recurrent neural networks for skeleton-based human action recognition. The proposed method uses two RNNs - a temporal RNN to model the temporal dynamics of joints over time, and a spatial RNN to model the dependencies between joints spatially. The RNNs are trained on skeleton data extracted from video datasets like NTU RGB+D and Kinetics. Experimental results show the method achieves state-of-the-art accuracy on the NTU datasets and can recognize actions in real-time from new video inputs. Future work involves exploring more advanced temporal modeling and evaluating on larger datasets.

Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)

Keeping Up with Recent Research Trends - Focusing on Deep Learning - (最近の研究情勢についていくために - Deep Learningを中心に -)

Hiroshi Fukui

This document summarizes key developments in deep learning for object detection from 2012 onwards. It begins with a timeline showing that 2012 was a turning point, as deep learning achieved record-breaking results in image classification. The document then provides overviews of 250+ contributions relating to object detection frameworks, fundamental problems addressed, evaluation benchmarks and metrics, and state-of-the-art performance. Promising future research directions are also identified.

What's hot

Single Image Super Resolution Overview

LEE HOSEONG

A Multi-Armed Bandit Framework For Recommendations at Netflix

Jaya Kawale

Automatic Skin Lesion Segmentation and Melanoma Detection: Transfer Learning ...

Zabir Al Nazi Nabil

A Deep Journey into Super-resolution

Ronak Mehta

GraphSage vs Pinsage #InsideArangoDB

ArangoDB Database

Deep Learning for Recommender Systems

inovex GmbH

Super Resolution

alokahuti

Computer Vision: Correlation, Convolution, and Gradient

Ahmed Gad

Super resolution

Federico D'Amato

Medical advice as a Recommender System

Similar to Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017

Learning where to look: focus and attention in deep vision

Fellowship at Vodafone FutureLab

This document summarizes Kevin McGuinness' presentation on deep learning for computer vision. It discusses visual attention models and their ability to predict eye gaze, applications in image cropping, retrieval and classification. It also covers medical image analysis using deep learning for knee osteoarthritis grading and neonatal brain segmentation. Deep crowd analysis is examined for crowd counting. Finally, interactive deep vision for image segmentation using user interactions is presented.

Semantic segmentation with Convolutional Neural Network Approaches

In this project, we propose methods for semantic segmentation using state-of-the-art deep learning models. We also restrict the segmentation to a specific object class for a specific application: rather than segmenting unnecessary objects, the model focuses on the objects of interest, which makes it more specialized and efficient for that purpose. In particular, we use models suited to face segmentation, namely Mask-RCNN and DeepLabv3. The experimental results indicate that the approach is efficient and robust in the segmentation task compared with previous work in the field. The models reach mean Intersection over Union (mIoU) scores of 74.4 and 86.6. Visual results are shown in the Appendix.
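
The mean Intersection over Union metric quoted above can be sketched as follows. This is a minimal illustration, assuming flattened label maps given as plain lists of class indices; the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions of this example, not the project's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are equal-length sequences of integer class labels,
    one per pixel (a flattened segmentation map).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map assigns class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes (0 = background, 1 = face).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In the toy example each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs, and their mean, are 0.5.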

Scene recognition using Convolutional Neural Network

DhirajGidde

The document discusses scene recognition using convolutional neural networks. It begins with an abstract stating that scene recognition allows context for object recognition. While object recognition has improved due to large datasets and CNNs, scene recognition performance has not reached the same level of success. The document then discusses using a new scene-centric database called Places with over 7 million images to train CNNs for scene recognition. It establishes new state-of-the-art results on several scene datasets and allows visualization of network responses to show differences between object-centric and scene-centric representations.

Artificial Intelligence for Vision: A walkthrough of recent breakthroughs

Nikolas Markou

Principles of Data Visualization

Eamonn Maguire

The document discusses principles of data visualization. It provides an overview of Tamara Munzner's framework for visualization design, which involves four levels of analysis: the domain situation, data/task abstraction, visual encoding and interaction idioms, and algorithms. The framework aims to translate real-world problems into visual representations that help users accomplish tasks. The document also outlines different types of data visualization like scientific and information visualization. Finally, it notes discoverability as a key purpose of visualization, to gain new insights from data in an interactive manner.

Introduction talk to Computer Vision

Chen Sagiv

The document provides an introduction to computer vision. It discusses key topics including: - What computer vision is and why it is useful. It uses mathematical and computational tools to extract information from images and improve human vision. - Some basic concepts in computer vision including digital images, sampling, noise removal, segmentation, and feature extraction techniques. - Where computer vision is used such as healthcare, autonomous vehicles, augmented/virtual reality, industry, social media, security, agriculture, and fashion. - A brief history of computer vision including classical approaches and the revolution enabled by advances in artificial intelligence and deep learning.

dwdwd

mokamojah

This article proposes an omnisupervised learning framework to improve semantic segmentation of omnidirectional images using multiple data sources. The framework trains an efficient CNN using labeled pinhole images and unlabeled panoramic images. It generates panoramic labels using an ensemble method that considers the wide-angle and wrap-around features of panoramas. This allows the CNN to be trained on automatically generated panoramic data, bypassing costly manual annotation. Experiments show the approach outperforms state-of-the-art methods on challenging omnidirectional datasets, demonstrating improved generalizability of the CNN to unseen panoramic domains.

Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)

The document discusses content-based image retrieval. It begins with an overview of the problem of using a query image to retrieve similar images from a large dataset. Common techniques discussed include using SIFT features with bag-of-words models or convolutional neural network (CNN) features. The document outlines the classic SIFT retrieval pipeline and techniques for using features from pre-trained CNNs, such as max-pooling features from convolutional layers or encoding them with VLAD. It also discusses learning image representations specifically for retrieval using methods like the triplet loss to learn an embedding space that clusters similar images. The state-of-the-art methods achieve the best performance by learning global or regional image representations from CNNs trained on large, generated datasets.
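
To make the triplet loss mentioned above concrete, here is a minimal sketch on plain Python lists. The squared Euclidean distance, the default margin of 1.0 and the function names are assumptions chosen for illustration; the slides may use a different distance or margin.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (in squared distance).
    """
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


anchor = [0.0, 0.0]
positive = [0.0, 1.0]          # embedding of a similar image
far_negative = [3.0, 0.0]      # already well separated
near_negative = [1.0, 0.0]     # as close as the positive

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
```

Only the second triplet yields a non-zero loss, which is why training pushes dissimilar images away until they clear the margin and then stops penalizing them.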

Object Detetcion using SSD-MobileNet

IRJET Journal

This document presents a study on object detection using SSD-MobileNet. The researchers developed a lightweight object detection model using SSD-MobileNet that can perform real-time object detection on embedded systems with limited processing resources. They tested the model on images and video captured using webcams. The model was able to detect objects like people, cars, and animals with good accuracy. The SSD-MobileNet framework provides fast and efficient object detection for applications like autonomous driving assistance systems that require real-time performance on low-power devices.

Image Captioning Generator using Deep Machine Learning

ijtsrd

The scope of technology has evolved into one of the most powerful tools for human development in a variety of fields. AI and machine learning have become some of the most powerful tools for completing tasks quickly and accurately without the need for human intervention. This project demonstrates how deep machine learning can be used to create a caption or a sentence for a given picture. This can be used for visually impaired persons, as well as automobiles for self identification, and for various applications to verify quickly and easily. The Convolutional Neural Network (CNN) is used to describe the alphabet, and the Long Short-Term Memory (LSTM) is used to organize the right meaningful sentences in this model. The Flickr 8k and Flickr 30k datasets were used to train this. Sreejith S P | Vijayakumar A "Image Captioning Generator using Deep Machine Learning" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4, June 2021, URL: https://www.ijtsrd.com/papers/ijtsrd42344.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42344/image-captioning-generator-using-deep-machine-learning/sreejith-s-p

Remote Sensing Image Scene Classification

Gaurav Singh

This project proposes methods for classifying scenes in remote sensing images. It compares the accuracy of traditional bag-of-visual-words (BoVW) models using handcrafted features to a bag-of-convolutional features (BoCF) model using deep learning. It also applies a grey wolf optimizer (GWO) algorithm for image segmentation. Results show BoCF doubled the accuracy of BoVW, and combining BoVW with GWO improved accuracy over BoVW alone. The project concludes more work is needed to better combine remote sensing data and deep learning for classification.

Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019

https://telecombcn-dl.github.io/2019-dlcv/ Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.

Pratik ibm-open power-ppt

Vaibhav R

Generative adversarial networks (GANs) show promise for enhancing computer vision in low visibility conditions. GANs can learn to translate images from low visibility domains like hazy or low-light conditions to clear images without paired training data. Recent work has incorporated hyperspectral guidance to improve image-to-image translation for tasks like dehazing. A domain-aware model was proposed to address the distributional discrepancy between RGB and hyperspectral images. Additionally, optimizing the spectral profile in translation helps mitigate spectral aberrations in results. These techniques push the limits of machine learning for analyzing visual data in challenging conditions with applications like autonomous vehicles and medical imaging.

AaSeminar_Template.pptx

ManojGowdaKb

This document discusses semantic segmentation using fully convolutional networks (FCNs). 1. Semantic segmentation involves assigning each pixel in an image a class label, such as identifying objects. FCNs can perform pixel-wise segmentation by learning features at different scales through downsampling and then upsampling to generate predictions. 2. Experimental results found that FCNs with downsampling and upsampling improve segmentation accuracy by capturing features at different scales. Downsampling allows learning of more abstract features while upsampling restores resolution for precise predictions. 3. In conclusion, FCNs have become a highly effective approach for semantic segmentation tasks in various domains like medical imaging and autonomous driving due to learning multi-scale features and pixel-

ANALYSIS OF LUNG NODULE DETECTION AND STAGE CLASSIFICATION USING FASTER RCNN ...

IRJET Journal

This document presents a method for detecting and classifying lung nodules using Faster R-CNN technique. It first segments the lung from CT images and extracts features using Dual-Tree Complex Wavelet Transform. A Back Propagation Neural Network is then used to classify patterns of interstitial lung diseases detected in the images. Fuzzy clustering is also proposed to segment abnormal regions of the lung. The method aims to help identify and diagnose common lung diseases like pleural effusion and interstitial lung disease in an automated manner from CT images.

Google | Infinite Nature Zero Whitepaper

Alejandro Franceschi

InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images. Abstract. We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view, where this capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm, where we sample and render virtual camera trajectories, including cyclic ones, allowing our model to learn stable view generation from a collection of single views. At test time, despite never seeing a video during training, our approach can take a single image and generate long camera trajectories comprised of hundreds of new views with realistic and diverse contents. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos and demonstrate superior performance and synthesis quality.

Satellite Image Classification with Deep Learning Survey

ijtsrd

Satellite imagery is important for many applications, including disaster response, law enforcement and environmental monitoring. These applications require the manual identification of objects in the imagery. Because the geographic area to be covered is very large and the analysts available to conduct the searches are few, automation is required. Yet traditional object detection and classification algorithms are too inaccurate and unreliable to solve the problem. Deep learning is part of a broader family of machine learning methods that have shown promise for the automation of such tasks. It has achieved success in image understanding by means of convolutional neural networks. The problem of object and facility recognition in satellite imagery is considered. The system consists of an ensemble of convolutional neural networks and additional neural networks that integrate satellite metadata with image features. Roshni Rajendran | Liji Samuel "Satellite Image Classification with Deep Learning: Survey" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-2, February 2020, URL: https://www.ijtsrd.com/papers/ijtsrd30031.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/30031/satellite-image-classification-with-deep-learning-survey/roshni-rajendran

A Literature Survey on Image Linguistic Visual Question Answering

IRJET Journal

This document discusses a literature survey on image and linguistic visual question answering. It aims to develop a model that achieves higher performance than state-of-the-art solutions by exploring different existing models and developing a custom model. The paper reviews several existing models for visual question answering and image classification using convolutional neural networks. It also discusses developing a new dataset for visual question answering using automated question generation from image descriptions.

More from Universitat Politècnica de Catalunya

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)

This document provides an overview of deep generative learning and summarizes several key generative models including GANs, VAEs, diffusion models, and autoregressive models. It discusses the motivation for generative models and their applications such as image generation, text-to-image synthesis, and enhancing other media like video and speech. Example state-of-the-art models are provided for each application. The document also covers important concepts like the difference between discriminative and generative modeling, sampling techniques, and the training procedures for GANs and VAEs.

Deep Generative Learning for All

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...

The document discusses the Vision Transformer (ViT) model for computer vision tasks. It covers: 1. How ViT tokenizes images into patches and uses position embeddings to encode spatial relationships. 2. ViT uses a class embedding to trigger class predictions, unlike CNNs which have decoders. 3. The receptive field of ViT grows as the attention mechanism allows elements to attend to other distant elements in later layers. 4. Initial results showed ViT performance was comparable to CNNs when trained on large datasets but lagged CNNs trained on smaller datasets like ImageNet.
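
The patch tokenization step described in point 1 can be sketched in a few lines. This is an illustrative simplification, assuming a single-channel image stored as a list of lists and a hypothetical helper name `image_to_patches`; a real ViT additionally applies a learned linear projection and adds learned position embeddings, both omitted here.

```python
def image_to_patches(image, patch):
    """Split a 2D image into non-overlapping patch x patch tokens.

    Each token is the row-major flattening of one patch. Tokens are
    returned in raster order, so a token's list index can serve as the
    position that a position embedding would encode.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [
                    image[y][x]
                    for y in range(top, top + patch)
                    for x in range(left, left + patch)
                ]
            )
    return tokens


# A 4x4 image with pixel values 0..15, split into four 2x2 patches.
image = [[4 * row + col for col in range(4)] for row in range(4)]
tokens = image_to_patches(image, patch=2)
```

For the 4x4 toy image this yields four tokens of four values each; without position information the model would have no way to tell the top-left patch from the bottom-right one.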

Towards Sign Language Translation & Production | Xavier Giro-i-Nieto

Machine translation and computer vision have greatly benefited from the advances in deep learning. A large and diverse amount of textual and visual data have been used to train neural networks whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, like the low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses.

The Transformer - Xavier Giró - UPC Barcelona 2021

Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...

Open challenges in sign language translation and production

Machine translation and computer vision have greatly benefited from the advances in deep learning. Large and diverse amounts of textual and visual data have been used to train neural networks, whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, like the low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses. This talk will present these challenges and the How2✌️Sign dataset (https://how2sign.github.io) recorded at CMU in collaboration with UPC, BSC, Gallaudet University and Facebook. https://imatge.upc.edu/web/publications/sign-language-translation-and-production-multimedia-and-multimodal-challenges-all

Generation of Synthetic Referring Expressions for Object Segmentation in Videos

https://imatge-upc.github.io/synthref/ Integrating computer vision with natural language processing has achieved significant progress in recent years owing to the continuous evolution of deep learning. A novel vision and language task, which is tackled in the present Master thesis, is referring video object segmentation, in which a language query defines which instance to segment from a video sequence. One of the biggest challenges for this task is the lack of relatively large annotated datasets, since a tremendous amount of time and human effort is required for annotation. Moreover, existing datasets suffer from poor quality annotations in the sense that approximately one out of ten language expressions fails to uniquely describe the target object. The purpose of the present Master thesis is to address these challenges by proposing a novel method for generating synthetic referring expressions for an image (video frame). This method produces synthetic referring expressions by using only the ground-truth annotations of the objects as well as their attributes, which are detected by a state-of-the-art object detection deep neural network. One of the advantages of the proposed method is that its formulation allows its application to any object detection or segmentation dataset. By using the proposed method, the first large-scale dataset with synthetic referring expressions for video object segmentation is created, based on an existing large benchmark dataset for video instance segmentation. A statistical analysis and comparison of the created synthetic dataset with existing ones is also provided in the present Master thesis. The conducted experiments on three different datasets used for referring video object segmentation prove the efficiency of the generated synthetic data.
More specifically, the obtained results demonstrate that by pre-training a deep neural network with the proposed synthetic dataset one can improve the ability of the network to generalize across different datasets, without any additional annotation cost.

Discovery and Learning of Navigation Goals from Pixels in Minecraft

Master MATT thesis defense by Juan José Nieto, advised by Víctor Campos and Xavier Giro-i-Nieto. 27th May 2021. Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to learn and discover meaningful skills in high-dimensional state-spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation by making use of variational or contrastive techniques. We demonstrate that both allow learning a set of basic navigation skills by maximizing an information theoretic objective. We assess our method in Minecraft 3D maps with different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations. https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft

Learn2Sign : Sign language recognition and translation using human keypoint e...

Peter Muschick, MSc thesis, Universitat Politècnica de Catalunya, 2020. Sign language recognition and translation has been an active research field in recent years, with most approaches using deep neural networks to extract information from sign language data. This work investigates the mostly disregarded approach of using human keypoint estimation from image and video data with OpenPose in combination with a transformer network architecture. Firstly, it was shown that it is possible to recognize individual signs (4.5% word error rate (WER)). Continuous sign language recognition, though, was more error prone (77.3% WER), and sign language translation was not possible using the proposed methods, which might be due to low accuracy scores of human keypoint estimation by OpenPose and the accompanying loss of information, or to insufficient capacity of the transformer model used. Results may improve with the use of datasets containing higher repetition rates of individual signs or by focusing more precisely on keypoint extraction of hands.
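
For readers unfamiliar with the word error rate (WER) figures quoted above, the sketch below computes WER in the usual way: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions of this example, not the thesis's evaluation script.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,                      # deletion
                    cur[j - 1] + 1,                   # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("my name is anna", "my name anna")
```

Here one of the four reference words is missing from the hypothesis, giving a WER of 0.25. Since insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.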

Intepretability / Explainable AI for Deep Neural Networks

This document discusses interpretability and explainable AI (XAI) in neural networks. It begins by providing motivation for why explanations of neural network predictions are often required. It then provides an overview of different interpretability techniques, including visualizing learned weights and feature maps, attribution methods like class activation maps and guided backpropagation, and feature visualization. Specific examples and applications of each technique are described. The document serves as a guide to interpretability and explainability in deep learning models.

Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.

Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...

Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020

Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...

https://telecombcn-dl.github.io/dlai-2020/ Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.

Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020

https://telecombcn-dl.github.io/drl-2020/ This course presents the principles of reinforcement learning as an artificial intelligence tool based on the interaction of the machine with its environment, with applications to control tasks (e.g. robotics, autonomous driving) or decision making (e.g. resource optimization in wireless communication networks). It also covers advances in the development of deep neural networks trained with little or no supervision, for both discriminative and generative tasks, with special attention to multimedia applications (vision, language and speech).

Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)

Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8). Tutorial page: https://imatge.upc.edu/web/publications/one-perceptron-rule-them-all-language-vision-audio-and-speech-tutorial Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures to encode and decode vision, text and audio, and then review the models that have successfully translated information across modalities.

Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...

This document summarizes image segmentation techniques using deep learning. It begins with an overview of semantic segmentation and instance segmentation. It then discusses several techniques for semantic segmentation, including deconvolution/transposed convolution for learnable upsampling, skip connections to combine predictions from different CNN depths, and dilated convolutions to increase the receptive field without losing resolution. For instance segmentation, it covers proposal-based methods like Mask R-CNN, and single-shot and recurrent approaches as alternatives to proposal-based models.

Curriculum Learning for Recurrent Video Object Segmentation

https://imatge-upc.github.io/rvos-mots/ Video object segmentation can be understood as a sequence-to-sequence task that can benefit from curriculum learning strategies for better and faster training of deep neural networks. This work explores different scheduled sampling and frame skipping variations to significantly improve the performance of a recurrent architecture. Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse scheduled sampling is a better option than a classic forward one. They also indicate that a progressive skipping of frames during training is beneficial, but only when training with the ground truth masks instead of the predicted ones.

Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020