Multimedia annotation (DCU 2016)

Image Processing Group
Universitat Politècnica de Catalunya (UPC)
Xavier Giro-i-Nieto, “Multimedia Annotation”. Dublin City University (04/04/2016)
Multimedia Annotation
Lecturer: Xavier Giro-i-Nieto
Version 2016/1
@DocXavi

Densely linked slides

Introduction
Xavier Giro-i-Nieto
• Web: https://imatge.upc.edu/web/people/xavier-giro
Associate Professor at Universitat Politecnica de Catalunya (UPC)

Acknowledgements
Alan Smeaton, Professor at Dublin City University
Cathal Gurrin, Professor at Dublin City University
Horst Eidenberger, Professor at Vienna University of Technology

Outline
1. Motivation
2. Architecture
3. Metadata
4. Manual vs Automatic Annotation

Motivation
In previous lectures, you have learned how text retrieval works.

Motivation
This lecture expands to any type of multimedia documents.

Motivation
Exponential increase of generated multimedia content...

Motivation
...keeping a record of the memorable personal moments...
Pope Francis @ Philippines, 2015 (Source: AP Photo/Bullit Marquez)

Motivation
Pope Francis @ Ecuador, 2015 (Source: AP)
...keeping a record of the memorable personal moments...

Motivation
…(or not).
Pope Francis @ USA, 2015

Motivation
This data growth is motivated by ubiquitous mobile access to...

Motivation
...the Internet (for visual data transmission)...
Source: Cisco Visual Networking Index (VNI)

Motivation
...and people!
Time Person of the Year (2006)

Motivation
And it will keep growing with wearable devices...

Motivation
...that will generate a permanent memory record of our lives...
Black Mirror, “The entire history of you” (Season 1, Episode 3)

Motivation
...so that the challenge is to index and retrieve these data...

Motivation
...in the most user-friendly fashion.
Source: Si Liu, http://dx.doi.org/10.1109/CVPR.2012.6248071 (2012)

Motivation
The challenge is access to very large multimedia repositories.

Motivation
Open question: How do you store and retrieve your photos?

Outline
1. Motivation
2. Architecture
3. Metadata
4. Manual vs Automatic Annotation

Architecture
Three basic stages of the visual indexing process (personal collections):
Production, Upload, Retrieval

Architecture
Three basic stages of the visual indexing process (professional broadcasting):
• Capture: Digital multimedia data recording
• Storage: Indexing in a database
• Retrieval: Search based on the descriptive metadata

Architecture
• Example: CCMA Digiton (TV3, Public Catalan TV)

Architecture
The contents are stored in repositories and indexed in databases.
[Diagram labels: Client, Network, Server, Content, Metadata, Multimedia engine, Search engine]
Slide credit: Emili Bonilla, http://gps-tsc.upc.es/imatge/_Xgiro/teaching/thesis/2007-2008/EmiliBonilla.pdf
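The split between a content repository and a metadata database can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite table; the table layout, the file paths and the function names `build_index` and `search` are hypothetical and do not come from the system shown in the slide.

```python
import sqlite3

def build_index(records):
    """Create an in-memory metadata database from (path, title, tags) tuples."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE media (path TEXT, title TEXT, tags TEXT)")
    db.executemany("INSERT INTO media VALUES (?, ?, ?)", records)
    return db

def search(db, keyword):
    """Return repository paths whose title or tags contain the keyword."""
    pattern = f"%{keyword}%"
    rows = db.execute(
        "SELECT path FROM media WHERE title LIKE ? OR tags LIKE ? ORDER BY path",
        (pattern, pattern),
    )
    return [path for (path,) in rows]

# Hypothetical records: the files themselves would live in the repository,
# only their descriptive metadata is indexed here.
db = build_index([
    ("repo/news_001.mp4", "Evening news", "anchor studio"),
    ("repo/match_017.mp4", "Football match", "stadium crowd"),
])
print(search(db, "anchor"))
```

The search engine only touches the metadata table; the returned paths are then used to fetch the actual content from the repository.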

Outline
1. Motivation
2. Architecture
3. Metadata
4. Manual vs Automatic Annotation

Metadata
Metadata describe the content and enable its search and retrieval.

Metadata
• Example: Dublin Core
Source: B. Haslhofer, W. Klas: http://dx.doi.org/10.1145/1667062.1667064
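As a rough illustration of what a Dublin Core description can look like, the sketch below serialises a few elements of the Dublin Core element set (title, creator, date, language) as XML. The wrapping `metadata` element and the helper name `dublin_core_record` are assumptions made for this example; the field values are taken from this lecture's own title slide.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

def dublin_core_record(fields):
    """Serialise a dict of Dublin Core element names and values as XML."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

record = dublin_core_record({
    "title": "Multimedia Annotation",
    "creator": "Xavier Giro-i-Nieto",
    "date": "2016-04-04",
    "language": "en",
})
print(record)
```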

Metadata
Source: University of Oregon

Metadata
• Multiple options depending on their semantic level.
Level  | Nature     | Example
High   | Words      | Tags, keywords, title, author...
Medium | Sensor     | Geolocation, date, time, size...
Low    | Perceptual | (video) Colour, texture, shape; (audio) Pitch, frequency...

Metadata
• Example:
Text: “negre” (Catalan for “black”)
Mean colour: R=0, G=0, B=0
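The mean colour is one of the simplest low-level descriptors. A minimal sketch, assuming the image is given as a plain list of (R, G, B) tuples rather than loaded from a file; the function name `mean_colour` is chosen for this example.

```python
def mean_colour(pixels):
    """Average the R, G and B channels over a list of (r, g, b) pixels."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

# A completely black image yields the descriptor R=0, G=0, B=0.
black_image = [(0, 0, 0)] * 4
print(mean_colour(black_image))
```

The same three numbers carry no explicit link to the word “black”; bridging that difference is what annotation is about.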

Metadata
• Applications: Browsing by geolocation.
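Browsing by geolocation boils down to comparing the coordinates stored with each photo against a point of interest. The sketch below uses the haversine great-circle distance; the photo list, the coordinates and the function names are illustrative assumptions, not part of the application shown in the slide.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def photos_near(photos, lat, lon, radius_km):
    """Return the names of photos taken within radius_km of (lat, lon)."""
    return [name for name, p_lat, p_lon in photos
            if haversine_km(lat, lon, p_lat, p_lon) <= radius_km]

# Hypothetical geotagged photos: (file name, latitude, longitude).
photos = [
    ("dublin.jpg", 53.3498, -6.2603),
    ("barcelona.jpg", 41.3874, 2.1686),
]
print(photos_near(photos, 53.35, -6.26, radius_km=50))
```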

Metadata
• Applications: Grouping photos in events with metadata.
D. Manchon-Vizuete, Gris-Sarabia, I., and Giró-i-Nieto, X., “Photo Clustering of Social Events by
Extending PhotoTOC to a Rich Context”, in ICMR 2014 Workshop on Social Events in Web Multimedia
(SEWM), Glasgow, Scotland, 2014.
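A minimal sketch of grouping photos into events from their capture time. It starts a new event whenever the gap to the previous photo exceeds a fixed threshold; this is a deliberate simplification and not the method of the cited paper. The two-hour default and the function name `group_into_events` are assumptions for illustration.

```python
from datetime import datetime, timedelta

def group_into_events(timestamps, max_gap=timedelta(hours=2)):
    """Split capture times into events wherever consecutive photos are far apart."""
    events = []
    for ts in sorted(timestamps):
        if events and ts - events[-1][-1] <= max_gap:
            events[-1].append(ts)   # close enough in time: same event
        else:
            events.append([ts])     # large gap: a new event starts
    return events

# Hypothetical capture times read from the photos' metadata.
shots = [
    datetime(2016, 4, 4, 10, 0),
    datetime(2016, 4, 4, 10, 30),
    datetime(2016, 4, 4, 18, 0),
]
print([len(event) for event in group_into_events(shots)])
```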

Outline
1. Motivation
2. Architecture
3. Metadata
4. Manual vs Automatic Annotation

Manual vs Automatic Annotation
Task: Write down tags for this photo on a piece of paper.

Task: How do you think a computer sees this photo?

The semantic gap is the difference between a high-level and a low-level description of a document:
Humans are very good at abstraction using natural language (words)...
...while computers are really good at analysing perceptual features.

Annotation is the process of generating high-level (semantic) metadata.
How to generate semantic metadata?
• Manual Annotation
• Automatic Annotation

Manual Annotation
Explicit manual annotation:
• E.g. Hashtags on Twitter.
• E.g. Hashtags on Instagram.
• E.g. Hashtags on Flickr.
• E.g. Friends tagging on Facebook.
• E.g. Dedicated forms to collect structured metadata.
+info: http://www.youtube.com/terrassatsc

Manual Annotation
Problem: Manual Annotation is tedious.

Manual Annotation
Annotation can be split and assigned to the crowd as...
+info: http://www.crowdmm.org

Manual Annotation
...micro-tasks for online workers.
+info: https://www.mturk.com/mturk/
http://microworkers.com/
http://pallas-ludens.com/
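When the same micro-task is given to several workers, their answers have to be merged into one annotation. One simple possibility is majority voting, sketched below; the agreement threshold and the function name `aggregate_labels` are assumptions for this example, and the platforms listed above may use other quality-control schemes.

```python
from collections import Counter

def aggregate_labels(worker_answers, min_agreement=0.5):
    """Return the most voted label if more than min_agreement of workers chose it."""
    if not worker_answers:
        return None
    label, votes = Counter(worker_answers).most_common(1)[0]
    # Reject the item when the workers do not agree strongly enough.
    return label if votes / len(worker_answers) > min_agreement else None

print(aggregate_labels(["dog", "dog", "cat"]))
```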

Manual Annotation
...micro-games for online players: Games With A Purpose (GWAP).
Ref: Luis von Ahn and Laura Dabbish, “Labeling images with a computer game” (SIGCHI 2004).
Ref: Amaia Salvador et al., “Crowdsourced Object Segmentation with a Game” (CrowdMM 2013).

Annotation is the process of generating high-level (semantic) metadata.
How to generate semantic metadata?
• Latent Annotation
• Automatic Annotation

Latent Annotation
Text contained in the same document where the multimedia
content is presented.

Latent Annotation
Text associated with a publication sharing the multimedia content.
[Figure labels: Image, Text, Video]

Latent Annotation
Comments about the multimedia item.

• Problem 1: Most multimedia content has no other associated text.

• Problem 2: Manual annotation may be too expensive for large amounts of data.
Jean Le Tavernier: “Jean Miélot in his scriptorium” (1456)

• Solution: Automating the annotation process.
Johannes Gutenberg (1398-1468), inventor of mechanical movable-type printing
“Printer from the 15th century”, work of Jost Amman (1539-1591)

Challenge: the semantic gap is the difference between a high-level and a low-level description of a document:
Humans are very good at abstraction using natural language...
...while computers are really good at analysing perceptual features.

Annotation is the process of generating high-level (semantic) metadata.
How to generate semantic metadata?
• Manual Annotation
• Automatic Annotation

Automatic Annotation
Artificial intelligence algorithms can learn to perform this task.
[Diagram labels: Manual Annotations, Trainer, Model, New Image, Detector, Automatic Annotation (“Anchor”)]

Automatic Annotation: Categories
Li Fei-Fei, “How we’re teaching computers to understand pictures”, TEDTalks 2014.

Source: Horst Eidenberger, “Handbook of Multimedia Information Retrieval” (2012)

Automatic Annotation: Features

Descriptors for text documents: Word histogram
Source: C. Yu, D. Ballard, “A unified model of early word learning” (2004)
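A word histogram simply counts how often each word occurs in a document. A minimal sketch, assuming a very rough tokeniser (lower-casing and splitting on anything that is not a letter or apostrophe); the function name is chosen for this example.

```python
import re
from collections import Counter

def word_histogram(text):
    """Count the occurrences of each lower-cased word in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

histogram = word_histogram("The cat saw the dog")
print(histogram.most_common(2))
```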

Descriptors for text documents: Term Frequency-Inverse Document Frequency (TF-IDF)
E.g. the term “the” is not useful for distinguishing one type of document from another.
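The effect on a term such as “the” can be checked with a small sketch. It uses one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); other variants add smoothing or normalisation. The two toy documents and the function name `tf_idf` are made up for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(["the goal was scored", "the senate passed the bill"])
print(weights[0]["the"], weights[0]["goal"])
```

Because “the” occurs in every document, its inverse document frequency is log(1) = 0 and its weight vanishes, while “goal” keeps a positive weight.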

Descriptors for audio documents: Spectrogram
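A spectrogram is obtained by cutting the signal into short frames and computing the magnitude spectrum of each one. The sketch below does this with a naive discrete Fourier transform in pure Python, which is only practical for tiny signals; it omits the window function that real implementations usually apply, and the frame size, hop and test tone are arbitrary choices for illustration.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude of the DFT bins 0..N/2 of one frame (naive O(N^2) version)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal, frame_size, hop):
    """List of magnitude spectra, one per frame taken every `hop` samples."""
    return [
        dft_magnitudes(signal[start:start + frame_size])
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]

# Synthetic tone completing 4 cycles per 32-sample frame.
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
spec = spectrogram(signal, frame_size=32, hop=16)
print(len(spec), max(range(len(spec[0])), key=spec[0].__getitem__))
```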

Descriptors for audio documents: Mel-Frequency Cepstral Coefficients (MFCC)
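MFCCs rely on the mel scale, which spaces frequency bands according to perceived pitch. The sketch below shows only that first ingredient, using one widely used conversion formula (2595 · log10(1 + f/700)); the filterbank, logarithm and cosine transform that complete an MFCC computation are not shown. The function names and the 0 to 8000 Hz range are assumptions for this example.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (a commonly used formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Edges in Hz of n_bands bands equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges(0.0, 8000.0, n_bands=10)
print([round(edge) for edge in edges])
```

The bands come out narrow at low frequencies and wide at high frequencies, mirroring the resolution of human hearing.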

Descriptors for image documents: Textures around interest points (SIFT, HOG, SURF...)
Source: Sivic & Zisserman, “Video Google” (2003)

Deep learning
Instead of designing hand-crafted features (SIFT, SURF...) and then learning a classifier...
...features are learned from annotated data.
Slide credit: Marc’Aurelio Ranzato (Google)

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Machine learning
(Deep learning)

• An ontology is a set of related semantic concepts.
• Classification is performed with respect to one or more of these concepts.
• Example: WordNet (http://wordnet.princeton.edu/)
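The idea of related semantic concepts can be illustrated with a toy “is-a” hierarchy. The dictionary below is hand-made for this example; it is neither WordNet data nor the WordNet API, and the function names are hypothetical.

```python
# Toy ontology: each concept points to its more general parent concept.
HYPERNYM = {
    "dalmatian": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}

def ancestors(concept, hypernym=HYPERNYM):
    """Return the chain of increasingly general concepts above `concept`."""
    chain = []
    while concept in hypernym:
        concept = hypernym[concept]
        chain.append(concept)
    return chain

def is_a(concept, category, hypernym=HYPERNYM):
    """True if `concept` equals `category` or lies below it in the hierarchy."""
    return concept == category or category in ancestors(concept, hypernym)

print(ancestors("dalmatian"))
```

With such a hierarchy, an image labelled “dalmatian” can also be retrieved by a query for “dog” or “animal”.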

Automatic Annotation: Classes
Source: Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)

Source: Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)

Automatic Annotation: Classifier

Classification is the process of assigning a label to an observation based on its features.
Features must allow discrimination between samples from different categories.
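A minimal sketch of this train-then-classify loop, assuming a nearest-centroid classifier over mean-colour features. The classifier choice, the labels and the training samples are illustrative assumptions; the lecture itself does not prescribe a particular algorithm at this point.

```python
import math

def train_centroids(samples):
    """Learn one mean feature vector per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(model, features):
    """Assign the label whose centroid is closest to the observation."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical manual annotations: mean (R, G, B) of each image and its tag.
training = [
    ((0, 0, 0), "night"),
    ((20, 20, 30), "night"),
    ((250, 250, 240), "snow"),
    ((230, 240, 255), "snow"),
]
model = train_centroids(training)
print(classify(model, (10, 10, 10)))
```

The example works only because the mean colour separates the two categories well; with features that do not discriminate between categories, no classifier could recover the labels.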

Example: Visual detector of the camera viewpoint.

Source: Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional
neural networks." In Advances in neural information processing systems, pp. 1097-1105. 2012

Object detection
Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Region-based convolutional networks for
accurate object detection and segmentation." Pattern Analysis and Machine Intelligence, IEEE Transactions
on 38, no. 1 (2016): 142-158.

Object segmentation
Source: PASCAL Visual Object Classes (VOC) Challenge

Face detection and recognition
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face
Detection Using Deep Convolutional Neural Networks." ICMR (2015).

Activity Recognition
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International
Conference on Computer Vision, pp. 4489-4497. 2015

Learn more with Nat & Lo 20% Google Project:

Outline
1. Motivation
2. Architecture
3. Metadata
4. Manual vs Automatic Annotation

Bonus: Artificial intelligence
Nexi, from MIT Media Lab (Photo: Spencer Lowel)

Big data
Internet of things - IoT
Only learn to see ?

Only learn to see ?
Personal data
Big data

Visual saliency prediction
J. Pan, McGuinness, K., Sayrol, E., O'Connor, N., and Giró-i-Nieto, X., “Shallow and
Deep Convolutional Networks for Saliency Prediction”, in IEEE Conference on
Computer Vision and Pattern Recognition, CVPR, In Press.
LSUN Challenge
Only learn to see ?

Only learn to see ?
Atlas, from Boston Dynamics
Robust motion

Only learn to see ?
Games (reinforcement learning)
(Google) DeepMind

Only learn to see ?
Autonomous Driving
Google Self-driving car

Only learn to see ?
Visual arts
Google Research, “Going deeper into neural networks” - DeepDream (2015)

Only learn to see ?
Google Research, “Going deeper into neural networks” - DeepDream (2015)
Visual arts

Only learn to see ?
http://turing.deepart.io/
Visual arts

Only learn to see ?
Music composition
Manuel Araoz, “Training a Recurrent Neural Network to Compose Music” (2016).

Only learn to see ?
Poetry
Ross Goodwin, Neuralsnap (2016).

Only learn to see ?
“Scripts” (!?)
Darknet
JON
He leaned close and onions, barefoot from his shoulder. "I am not a purple
girl," he said as he stood over him. "The sight of you sell your father with you a
little choice."
"I say to swear up his sea or a boy of stone and heart, down," Lord Tywin
said. "I love your word or her to me."

Only learn to see ?
Public Health
Announcement of Google DeepMind Health (24/02/2016)

Only learn to see ?
Nacho Hernandez, “Why artificial intelligence will democratize
healthcare” (TEDx Talk, 2014)
Public health

Only learn to see ?
Nancy Lublin, “The heartbreaking text that inspired a crisis
helpline” (TED Talk 2015)
Mental health

Only learn to see ?
Affective computing
Rana el Kaliouby, “This app knows how you feel, from the look on
your face”, TEDTalks 2015.

Only learn to see ?
Affective computing
V. Campos, Salvador, A., Jou, B., and Giró-i-Nieto, X., “Diving Deep into
Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment
Prediction”, in 1st International Workshop on Affect and Sentiment in
Multimedia, Brisbane, Australia, 2015.
Visual maps of positive (green) or negative (red) sentiments:

Only learn to see ?
Affective computing
[video]
Nexi Project,
from MIT Media Lab
(Photos:
Spencer Lowel)

Only learn to see ?
Psychological support and counseling ?

Only learn to see ?
“Google’s chairman (Eric Schmidt) thinks artificial intelligence
will let scientists solve some of the world’s "hard problems," like
population growth, climate change, human development,
and education.” (Bloomberg Business, 11/01/2016)
[+info @ MIT Technology Review]

Only learn to see ?
The New York Times: “The Race Is On to Control Artificial
Intelligence, and Tech’s Future” (25/03/2016)

Only learn to see ?
The Economist, “Million-dollar babies” (02/04/2016)

Only learn to see ?
Jeremy Howard, “The wonderful and terrifying implications of
computers that can learn”, TEDTalks 2014.

Only learn to see ?
Stephen Hawking, “Artificial intelligence could spell the end of the
human race.” (2014)

Only learn to see ?
Elon Musk (Tesla), one of the promoters of OpenAI

Only learn to see ?
From Industry 1.0 to Industry 4.0
Source: DFKI (2011)

Thanks a lot !
Slides available at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi
/ProfessorXavi
