Defense of the Final Degree Project (Projecte de Final de Carrera) by Elisabet Carcel, within the Technical Engineering degree in Telecommunications, specialization in Sound and Image.
Co-advised by Xavier Giró-i-Nieto (UPC) and Xavier Vives (CCMA).
Escola d'Enginyeria de Terrassa, Universitat Politècnica de Catalunya. June 2011.
More details:
https://imatge.upc.edu/web/publications/rich-internet-application-semi-automatic-annotation-semantic-shots-keyframes-0
The Canvas Methodology (Business Model Canvas) describes in a logical way how organizations create, deliver and capture value.
It can be used by SMEs and large companies alike, regardless of their industry and target audience.
Ariel Valero Cruz
- Certified Professional, Consultant in Business Plans and Models [Canvas] (PC-CPMN)
- Lego® Serious Play, Certified Facilitator (LSP-CF).
This document provides an overview of deep generative learning and summarizes several key generative models including GANs, VAEs, diffusion models, and autoregressive models. It discusses the motivation for generative models and their applications such as image generation, text-to-image synthesis, and enhancing other media like video and speech. Example state-of-the-art models are provided for each application. The document also covers important concepts like the difference between discriminative and generative modeling, sampling techniques, and the training procedures for GANs and VAEs.
Machine translation and computer vision have greatly benefited from the advances in deep learning. A large and diverse amount of textual and visual data have been used to train neural networks whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, like the low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses.
The transformer is the neural architecture that has received the most attention in the early 2020s. It removed the recurrence of RNNs, replacing it with an attention mechanism across the input and output tokens of a sequence (cross-attention) and between the tokens composing the input (and output) sequences, named self-attention.
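As an illustration of the mechanism described above, here is a minimal NumPy sketch of scaled dot-product attention. The function name and dimensions are chosen for the example; a real transformer additionally uses learned projections, multiple heads and masking, which are omitted here.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention.

    Q: (n_queries, d); K, V: (n_keys, d). Self-attention uses the same
    sequence for Q, K and V; cross-attention takes Q from the output
    (decoder) tokens and K, V from the input (encoder) tokens.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values

# Self-attention: tokens of one sequence attend to each other.
x = np.random.randn(5, 16)                           # 5 tokens, 16-dim embeddings
self_attended = scaled_dot_product_attention(x, x, x)

# Cross-attention: output tokens attend to the input tokens.
y = np.random.randn(3, 16)                           # 3 decoder tokens
cross_attended = scaled_dot_product_attention(y, x, x)
```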
These slides review the research of our lab since 2016 on applied deep learning, starting from our participation in the TRECVID Instance Search 2014, moving into video analysis with CNN+RNN architectures, and our current efforts in sign language translation and production.
Machine translation and computer vision have greatly benefited from the advances in deep learning. A large and diverse amount of textual and visual data has been used to train neural networks, whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, like the low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses. This talk will present these challenges and the How2✌️Sign dataset (https://how2sign.github.io) recorded at CMU in collaboration with UPC, BSC, Gallaudet University and Facebook.
https://imatge.upc.edu/web/publications/sign-language-translation-and-production-multimedia-and-multimodal-challenges-all
https://imatge-upc.github.io/synthref/
Integrating computer vision with natural language processing has achieved significant progress over the last years owing to the continuous evolution of deep learning. A novel vision and language task, which is tackled in the present Master thesis, is referring video object segmentation, in which a language query defines which instance to segment from a video sequence. One of the biggest challenges for this task is the lack of relatively large annotated datasets, since a tremendous amount of time and human effort is required for annotation. Moreover, existing datasets suffer from poor-quality annotations, in the sense that approximately one out of ten language expressions fails to uniquely describe the target object.
The purpose of the present Master thesis is to address these challenges by proposing a novel method for generating synthetic referring expressions for an image (video frame). This method produces synthetic referring expressions by using only the ground-truth annotations of the objects as well as their attributes, which are detected by a state-of-the-art object detection deep neural network. One of the advantages of the proposed method is that its formulation allows its application to any object detection or segmentation dataset.
By using the proposed method, the first large-scale dataset with synthetic referring expressions for video object segmentation is created, based on an existing large benchmark dataset for video instance segmentation. A statistical analysis and comparison of the created synthetic dataset with existing ones is also provided in the present Master thesis.
The conducted experiments on three different datasets used for referring video object segmentation prove the efficiency of the generated synthetic data. More specifically, the obtained results demonstrate that pre-training a deep neural network with the proposed synthetic dataset improves the ability of the network to generalize across different datasets, without any additional annotation cost.
MATT Master thesis defense by Juan José Nieto
Advised by Víctor Campos and Xavier Giro-i-Nieto.
27th May 2021.
Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to learn and discover meaningful skills in high-dimensional state-spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation by making use of variational or contrastive techniques. We demonstrate that both allow learning a set of basic navigation skills by maximizing an information theoretic objective. We assess our method in Minecraft 3D maps with different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations.
https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft
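The abstract does not spell out the objective, but unsupervised skill discovery methods in this family typically maximize a variational lower bound on the mutual information between skills and the states they reach. A hedged sketch of that standard objective (the thesis may use a variant):

$$ \mathcal{I}(S; Z) \;=\; \mathcal{H}(Z) - \mathcal{H}(Z \mid S) \;\geq\; \mathbb{E}_{z \sim p(z),\; s \sim \pi_z}\!\left[ \log q_\phi(z \mid s) - \log p(z) \right], $$

where $p(z)$ is a fixed prior over skills, $\pi_z$ is the skill-conditioned policy, and $q_\phi$ is a learned discriminator that replaces the intractable posterior. The bound follows from the non-negativity of the KL divergence, and its right-hand side can be used as the reward for the policy.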
Peter Muschick MSc thesis
Universitat Politècnica de Catalunya, 2020
Sign language recognition and translation has been an active research field in recent years, with most approaches using deep neural networks to extract information from sign language data. This work investigates the mostly disregarded approach of using human keypoint estimation from image and video data with OpenPose, in combination with a transformer network architecture. Firstly, it was shown that it is possible to recognize individual signs (4.5% word error rate (WER)). Continuous sign language recognition, though, was more error-prone (77.3% WER), and sign language translation was not possible using the proposed methods. This might be due to the low accuracy of the human keypoint estimates produced by OpenPose, with the accompanying loss of information, or to insufficient capacity of the transformer model used. Results may improve with datasets containing higher repetition rates of individual signs, or by focusing more precisely on keypoint extraction for the hands.
https://github.com/telecombcn-dl/lectures-all/
These slides review techniques for interpreting the behavior of deep neural networks. The talk covers basic techniques such as the display of filters and tensors, as well as more advanced ones that try to identify which part of the input data is responsible for the predictions, or that generate data maximizing the activation of certain neurons.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
https://telecombcn-dl.github.io/dlai-2020/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
https://telecombcn-dl.github.io/drl-2020/
This course presents the principles of reinforcement learning as an artificial intelligence tool based on the interaction of the machine with its environment, with applications to control tasks (e.g. robotics, autonomous driving) or decision making (e.g. resource optimization in wireless communication networks). It also advances the development of deep neural networks trained with little or no supervision, both for discriminative and generative tasks, with special attention to multimedia applications (vision, language and speech).
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
Tutorial page:
https://imatge.upc.edu/web/publications/one-perceptron-rule-them-all-language-vision-audio-and-speech-tutorial
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will firstly review the basic neural architectures to encode and decode vision, text and audio, and later review those models that have successfully translated information across modalities.
Image segmentation is a classic computer vision task that aims at labeling pixels with semantic classes. These slides provide an overview of the basic approaches from the deep learning field to tackle this challenge and present the basic subtasks (semantic, instance and panoptic segmentation) and related datasets.
Presented at the International Summer School on Deep Learning (ISSonDL) 2020, held online and organized by the University of Gdansk (Poland) from 30th August to 2nd September.
http://2020.dl-lab.eu/virtual-summer-school-on-deep-learning/
https://imatge-upc.github.io/rvos-mots/
Video object segmentation can be understood as a sequence-to-sequence task that can benefit from curriculum learning strategies for better and faster training of deep neural networks. This work explores different schedule sampling and frame skipping variations to significantly improve the performance of a recurrent architecture. Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse schedule sampling is a better option than a classic forward one, and that a progressive skipping of frames during training is beneficial, but only when training with the ground truth masks instead of the predicted ones.
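A minimal sketch of the scheduled-sampling decision discussed above, assuming a linear schedule over training; the function name, the decay shape and the reading of "inverse" as the mirrored schedule are illustrative assumptions, not the paper's verified implementation.

```python
import random

def feed_ground_truth(step, total_steps, inverse=False):
    """Scheduled sampling for a recurrent video segmentation decoder:
    decide whether the next frame is conditioned on the ground-truth
    mask or on the network's own previous prediction.

    Forward schedule: start mostly from ground truth and decay towards
    predictions (linear decay assumed here). The inverse schedule
    mirrors it, starting from predictions and moving to ground truth.
    """
    p_gt = 1.0 - step / total_steps   # forward: 1.0 -> 0.0
    if inverse:
        p_gt = 1.0 - p_gt             # inverse: 0.0 -> 1.0
    return random.random() < p_gt

# Example: halfway through training, both schedules give p_gt = 0.5.
print(feed_ground_truth(step=500, total_steps=1000))
```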
Deep neural networks have achieved outstanding results in various applications such as vision, language, audio, speech, or reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. In recent years, multiple solutions have been proposed to alleviate this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.
Deep neural networks have revolutionized the data analytics scene by improving results in several diverse benchmarks with the same recipe: learning feature representations from data. These achievements have raised interest across multiple scientific fields, especially in those where large amounts of data and computation are available. This change of paradigm in data analytics has several ethical and economic implications that are driving large investments, political debates and resounding press coverage under the generic label of artificial intelligence (AI). This talk will present the fundamentals of deep learning through the classic example of image classification, and point out how the same principle has been adopted for several other tasks. Finally, some of the forthcoming potentials and risks of AI will be pointed out.
Benet Oriol, Jordi Luque, Ferran Diego, Xavier Giro-i-Nieto
Telefonica Research / Universitat Politecnica de Catalunya (UPC)
CVPR 2020 Workshop on Egocentric Perception, Interaction and Computing
In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: image and spoken and textual narratives. The proposed methodology departs from a baseline system that spawns an embedding space trained with only spoken narratives and image cues. Our experiments on the EPIC-Kitchen and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps the training procedure, yielding better embedding representations. The triad of speech, image and words allows for a better estimate of the point embedding and shows improved performance in tasks like image and speech retrieval, even when the third modality, text, is not present in the task.
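A minimal PyTorch sketch of one plausible way to pull the three modalities into a shared space with pairwise max-margin retrieval losses; the loss form, the margin value and all function names are assumptions for illustration, not the paper's confirmed objective.

```python
import torch
import torch.nn.functional as F

def pairwise_margin_loss(a, b, margin=0.1):
    """Max-margin retrieval loss between two batches of embeddings
    (batch, dim); matching pairs share the same row index. Only one
    retrieval direction is penalized, for brevity."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    sims = a @ b.t()                       # cosine similarities
    pos = sims.diag().unsqueeze(1)         # similarity of matching pairs
    cost = (margin + sims - pos).clamp(min=0)
    cost.fill_diagonal_(0)                 # the positive pair itself costs 0
    return cost.mean()

def trimodal_loss(image, speech, text):
    """Sum the pairwise losses over the image / speech / text triad."""
    return (pairwise_margin_loss(image, speech)
            + pairwise_margin_loss(image, text)
            + pairwise_margin_loss(speech, text))

# Example with random 32-sample batches of 256-d embeddings.
e = lambda: torch.randn(32, 256)
print(trimodal_loss(e(), e(), e()).item())
```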
4. Introduction
Digital content: audiovisual content must be indexed in order to allow its retrieval.
- AV production: recording of video in digital format.
- Ingest: storage and indexing of the videos in the database.
- Retrieval: search over the metadata in the database to obtain the video.
7. Introduction
Current solution and project proposal: unannotated videos are processed through a semi-automatic annotation graphical interface to obtain annotated videos.
Metadata per stratum:
- Start timecode
- End timecode
- Content description: description of the action, shot types, people
15. Design
System architecture.
Training: for each of the N classes, a trainer (Trainer 1, Trainer 2, ..., Trainer N) builds a class model (class 1 model, class 2 model, ..., class N model), together with a domain model.
Detection: each detector (Detector 1, Detector 2, ..., Detector N) compares a keyframe against its class model and outputs a confidence value (confidence value 1, ..., confidence value N).
Decision: if the maximum confidence value exceeds a threshold, the keyframe is assigned the class with the maximum confidence; otherwise it is labeled as "Cap Pla" (no shot).
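A minimal Python sketch of the decision stage just described, assuming the confidence values arrive as plain floats; the function name and data layout are illustrative.

```python
def decide(confidences, threshold):
    """Decision block: return the class with the maximum confidence
    value if it exceeds the threshold, otherwise 'Cap Pla' (no shot).

    confidences: dict mapping class name -> confidence value produced
    by that class's binary detector.
    """
    best = max(confidences, key=confidences.get)
    return best if confidences[best] > threshold else "Cap Pla"

# Example: three detectors scored one keyframe.
scores = {"Classe 1": 0.20, "Classe 2": 0.75, "Classe 3": 0.40}
print(decide(scores, threshold=0.5))   # -> Classe 2
print(decide(scores, threshold=0.9))   # -> Cap Pla
```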
18. Development
Annotation.
Football ontology (514 annotated instances):
- Class 1: Cromo (116)
- Class 2: Cromo PP (87)
- Class 3: Cromo Beauty (4)
- Class 4: Public (19)
- Class 5: Pancarta (4)
- Class 6: Llotja (6)
- Class 7: Cap Pla (278)
Parliament ontology (276 annotated instances):
- Class 1: PC President (16)
- Class 2: PC (63)
- Class 3: PM (68)
- Class 4: PG Mesa (13)
- Class 5: PG Hemicicle (23)
- Class 6: Cap Pla (84)
21. Evaluation
Measures used.
Precision and recall. Precision, or specificity, is a measure of the ability of a system to present only relevant instances; it measures the exactness or fidelity of the system:

$$ \text{Precision} = \frac{\text{number of correctly detected instances}}{\text{total number of detected instances}} \quad (4.1) $$

Recall, or sensitivity, is a measure of the ability of a system to present all relevant instances; it is used for evaluating the completeness of the results:

$$ \text{Recall} = \frac{\text{number of correctly detected instances}}{\text{number of instances in the collection}} \quad (4.2) $$

F1 and Fβ measures. The F-measure considers both precision and recall, providing a single measure and avoiding having two independent ones:

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

In order to give different weights to precision and recall, the Fβ measure can be used: Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision [8]:

$$ F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} $$

Confusion matrix. For predictive analysis, a two-row and two-column confusion table is commonly used for evaluating binary classifiers. It reports the number of true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn). True positives and true negatives are properly classified, while false positives and false negatives occur when the obtained classification does not match the correct one. This is illustrated in table 4.3:

                     Manual positive   Manual negative
Automatic positive         tp                fp
Automatic negative         fn                tn

The slide also shows a small three-class example (Classe1, Classe2, Classe3) of a confusion matrix between automatic and manual labels.
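For illustration, a small Python helper computing equations (4.1), (4.2) and the Fβ measure from the cells of the confusion table; the example counts at the bottom are hypothetical.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision (4.1), recall (4.2) and F-beta from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fbeta = ((1 + beta**2) * precision * recall
             / (beta**2 * precision + recall))
    return precision, recall, fbeta

# beta = 1 gives the F1 measure used later to tune the classifier.
print(precision_recall_fbeta(tp=4, fp=1, fn=2))
# -> (0.8, 0.666..., 0.727...)
```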
23. Evaluation
Data partition (see the sketch below):
- Training:
  - 80% of the instances of the class (+)
  - All the positive instances of the other classes as negatives (-)
  - Removal of the instances that do not belong to any shot type
- Detection:
  - 20% of the instances of the class (+)
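A minimal sketch of this partitioning protocol, under the assumption that removing "instances of no shot type" means excluding the Cap Pla class from the training negatives; function and variable names are illustrative.

```python
import random

def one_vs_rest_split(instances_by_class, target, train_ratio=0.8):
    """Split for one binary classifier: 80% of the target class as
    training positives, the positives of every other class as training
    negatives, and the remaining 20% of the target class for detection.
    """
    positives = list(instances_by_class[target])
    random.shuffle(positives)
    cut = int(train_ratio * len(positives))
    train_pos, test_pos = positives[:cut], positives[cut:]
    train_neg = [x for cls, xs in instances_by_class.items()
                 if cls not in (target, "Cap Pla")  # assumption: drop "no shot"
                 for x in xs]
    return train_pos, train_neg, test_pos
```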
24. Evaluation
Classifier variables, tuned with the F1 measure over 3 iterations (a grid search sketch follows):
- Maximum cluster distance: 0.1 to 1.0
- Minimum number of elements in a cluster: 0 to 5
- Minimum score (threshold): 0.0 to 1.0
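A sketch of an exhaustive search over these variables that keeps the combination with the best F1; `evaluate_f1` is a hypothetical placeholder standing in for training and evaluating one configuration (averaged over the 3 iterations mentioned in the slide).

```python
import itertools

def grid_search(evaluate_f1):
    """Try every combination of the classifier variables listed above
    and return the one that maximizes the F1 measure."""
    max_distances = [d / 10 for d in range(1, 11)]   # 0.1 .. 1.0
    min_elements = range(0, 6)                       # 0 .. 5
    thresholds = [t / 10 for t in range(0, 11)]      # 0.0 .. 1.0
    return max(itertools.product(max_distances, min_elements, thresholds),
               key=lambda cfg: evaluate_f1(*cfg))
```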
31. Graphical User Interface
Network configuration.
UPC web services:
- List of keyframes for a given asset. For example, asset 1257 returns the keyframes 00_00_09_24, 00_00_39_25, 00_00_54_48 and 00_00_59_41.
- Detection for a given keyframe. Each request carries the keyframe, the domain (Futbol) and the clustering parameters (minimum elements, maximum distance); the response contains the detected classes with their scores. For example: 00_00_09_24 returns Class 3 with score 0.83; 00_00_54_48 returns Class 5 with score 0.46 and Class 3 with score 0.75; 00_00_59_41 returns Class 2 with score 0.98 (00_00_39_25 is shown with no detection).
- Data exchange format: JSON (an illustrative example follows).
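An illustrative reconstruction in Python of the JSON exchanged with the detection service; the keyframe, domain, classes and scores come from the slide, while the field names are assumptions.

```python
import json

# Hypothetical response for keyframe 00_00_54_48 in the Futbol domain;
# field names are assumed, values are taken from the slide.
raw = """
{
  "keyframe": "00_00_54_48",
  "domain": "Futbol",
  "detections": [
    {"class": "Classe 5", "score": 0.46},
    {"class": "Classe 3", "score": 0.75}
  ]
}
"""
response = json.loads(raw)
best = max(response["detections"], key=lambda d: d["score"])
print(best["class"], best["score"])   # -> Classe 3 0.75
```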
37. Conclusions
Semantic classifier:
- Multi-class classification built from binary classifiers:
  - Optimized annotation (only positive instances)
  - Evaluation for multi-class systems
  - Metrics derived from the confusion matrix
  - Automatic search of the optimal parameters
  - New data partitioning method
- Creation of the two models (football and parliament)
- Good results
Graphical User Interface:
- Interface that can be integrated into Digition
- Flexible, easy-to-use and optimized interface:
  - Sorting by minimum score
  - Adjustable minimum score
  - Drag and drop
  - Validation per page / per semantic shot
- Reuse of the annotations:
  - Training
  - Keyframe highlighting