Semantic Indexing of Wearable Camera Images: Kids’Cam Concepts
Alan F. Smeaton
(Dublin City University)
… and …
... Kevin McGuinness and Cathal Gurrin and Jiang Zhou and Noel E. O’Connor and Peng Wang and Brian Davis and Lucas Azevedo and Andre Freitas and Louise Signal and Moira Smith and James Stanley and Michelle Barr and Tim Chambers and Cliona Ní Mhurchu
Overview
• Automatic concept detection, with one detector trained per concept class, is now commonplace.
• We’re interested in the challenging case of processing
images from wearable cameras where improvement is
necessary.
• We try to exploit some limited manual annotations to improve the accuracy of automatic concept weights.
• This work is not complete; it’s ongoing, but the story is interesting.
Analysis of Visual Media
• More progress has been made in the last few years than in the previous decade
• Incorporation of deep learning plus availability of huge searchable
image resources and training data
• Automatic image tagging is now hosted and offered by websites like Aylien, Imagga, Clarifai, and others, and is very cost-effective.
Analysis of Visual Media
• These developments are welcome … but … restrictive tagging
vocabularies.
• How do these map to the vocabulary users employ when formulating queries?
• An alternative approach is tagging at query time, but it’s expensive and not scalable to huge collections.
• Almost all work on concept detection based on one concept at a time.
• TRECVid tried simultaneous detection of concept pairs like “computer screen with telephone” and “airplane with clouds”.
• Limited success, but “Government Leader with Flag” was OK!
• Detecting concepts independently needs a course correction because it:
– Doesn’t avail of all available information sources
– Doesn’t map to a user’s search vocabulary
Long-term approach …
[Diagram: Images → Concept Set → Mapping → User Search Vocabulary]
How can a single image be mapped to two different vocabularies?
Using NL for image search … tagging
• NL is fraught with complexity and ambiguity at all levels …
– Lexical level: polysemy
– Syntactic level: structural ambiguity
– Semantic level: multiple interpretations
– Discourse level: pronoun resolution
• Plus vocabulary limitations when finding a word or phrase to describe something
• When using computers to help search for image data, language challenges are exacerbated, yet we assume a “simplistic” approach of tagging by a set of concepts, notwithstanding what we’re seeing with captioning here today
• Tagging is very useful for smaller, niche applications in restricted domains with manual tagging, but we see scalability problems
– Addressed by progress in automatic tagging, but we’re tolerant of inaccuracies!
In this paper …
• We are interested in images from wearable cameras with lots of juicy
challenges.
• Notoriously difficult to process automatically because …
– Blurring caused by wearer motion at the moment of image capture
– Occlusions from wearer’s hands
– Lighting conditions
– Fisheye lens for wider perspective causing distortion
– First-person viewpoint, but not exactly what the wearer sees
– Content varies hugely across subjects
• Applications in memory support, behaviour recording and analysis, security, other work-related uses, and quantified self (QS).
• In this paper we work with wearable camera data from school children,
for analysis of their environments
Wearable Camera Images
The Kids’Cam Project
• Child obesity is a significant public health concern, worldwide.
• There is unequivocal evidence that marketing of energy-dense and nutrient-poor foods and beverages is a causal factor in child obesity.
• Children’s total exposure to advertising of such foodstuffs has not been quantified.
• The Kids’Cam study aimed to determine the frequency, nature and duration of children’s exposure to such marketing.
• 169 randomly selected children aged 11 to 13, from 16 schools in Wellington, NZ, each wore an Autographer and carried a GPS logger for 4 days … images every 7 seconds, GPS every 5 seconds.
– 1.5M images, 2.5M GPS datapoints
• Manual annotation for food / beverage marketing using a 3-level, 53-concept ontology … inter-annotator reliability of 90%.
Manual Annotation
[Diagram: a table of per-image manual annotations over the 85-concept vocabulary, with GPS, date/time and user metadata attached to each image]
Shop front > sign > sugary drinks/juices
Convenience store indoors > in-store marketing > convenience store
School > sign > fast food
Processing the Kids’Cam Data
• Following integration of different data sources and after the
manual annotation of images, we processed the image
collection in the following way …
[Pipeline diagram, numbering 14 processing steps and repeated on the following slides with individual steps highlighted: the 1.5M images, each carrying GPS, date/time and user metadata plus the 85 manual concepts, are scored by GPU-hosted concept models covering some 6,000 concepts, and the resulting per-image concept weights are post-processed with Training-Free Refinement (TFR)]
7. Using a CNN to apply tags to images: we used the VGG-16 network, a deep CNN trained on 1,000 object classes using 1.2M images from ImageNet.
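To make step 7 concrete, here is a minimal sketch of tagging one frame with an ImageNet-trained VGG-16. The slides do not name a framework, so torchvision is an assumption here, as are the image path and the top-5 printout.

```python
# Hedged sketch of step 7: score one wearable-camera frame against the
# 1,000 ImageNet classes with a pretrained VGG-16 (torchvision assumed).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16(pretrained=True).eval()   # 1,000-class ImageNet model

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("frame_000001.jpg").convert("RGB")    # hypothetical path
x = preprocess(img).unsqueeze(0)                       # shape (1, 3, 224, 224)
with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)[0]          # per-class probabilities

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"class {idx.item()}: {p.item():.3f}")
```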
8. The trained models were used to predict probabilities for each concept in each of the 1.5M images, processed in batches of 64 on an NVIDIA GPU and taking 4 days to complete.
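A hedged sketch of step 8’s batched GPU scoring, again assuming torchvision; the directory path and layout are invented, and ImageFolder expects images grouped into subfolders.

```python
# Hedged sketch of step 8: batched inference at batch size 64 on a GPU.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.vgg16(pretrained=True).eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# "kidscam_frames/" is a hypothetical path; ImageFolder wants subdirectories.
dataset = datasets.ImageFolder("kidscam_frames/", transform=preprocess)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

chunks = []
with torch.no_grad():
    for batch, _ in loader:
        chunks.append(torch.softmax(model(batch.to(device)), dim=1).cpu())
scores = torch.cat(chunks)        # (num_images, 1000) concept probabilities
```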
Training Free Refinement
• Current concept-at-a-time classifiers do not consider inter-concept relationships or dependencies, yet these do exist
• To improve one-per-class detectors, we post-process the detection scores
– We take advantage of concept co-occurrence and re-occurrence, which depend on the particular collection
– We take advantage of local (temporal) neighbourhood information, where concepts are likely to re-occur close in time
– We use GPS location information, where concepts identified by a person at a location may re-occur subsequently at that same location
• TFR is based on non-negative matrix factorisation, described elsewhere; a sketch of the core idea follows below
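The published TFR method is described elsewhere; the sketch below only illustrates the underlying low-rank smoothing idea under simplifying assumptions: factorise the image-by-concept score matrix with NMF at low rank, so the reconstruction smooths scores using collection-wide co-occurrence structure. The matrix here is random stand-in data.

```python
# Illustrative sketch of the low-rank smoothing idea behind TFR, not the
# published algorithm. X stands in for per-image concept detection scores.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((500, 1000))          # 500 images x 1,000 concept scores

nmf = NMF(n_components=50, init="nndsvda", max_iter=300)
W = nmf.fit_transform(X)             # per-image factor weights
H = nmf.components_                  # factors capturing concept co-occurrence

X_refined = W @ H                    # reconstruction fills gaps, damps outliers
```

Temporal or GPS neighbourhood information could be folded in by, for example, averaging each image’s row with its neighbours before factorising.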
9. As previously described, we then applied Training-Free Refinement to improve the probability assignments.
• We do not know the accuracy of the assignment of the 1,000 concepts, but we do know the accuracy of the assignment of the 53 concepts … and we have 1.5M images, each mapped into 2 concept spaces
• Can we adjust the values in (b), anchored on and pivoting around (a), in addition to having already used local, within-collection distributions?
[Figure: two 2-D concept spaces plotted side by side: (a) manual annotations with points a1, a2, known to be correct, and (b) automatic concepts with points b1, b2, of unknown accuracy; a second version of the figure adds panel (c), showing adjusted values a1’, a2’, b1’, b2’ after pivoting around the manual anchors]
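Purely as an illustration of the open question above (this is not a method from the paper), one could fit an affine map on a few (automatic, manual) anchor pairs and apply it to the remaining automatic scores. The anchor values and the helper name pivot_rescale are invented.

```python
# Hypothetical sketch: rescale automatic scores (b) so that chosen anchors
# line up with trusted manual values (a). All numbers are invented.
import numpy as np

def pivot_rescale(b, anchors_b, anchors_a):
    """Least-squares affine map fitted on (automatic, manual) anchor pairs."""
    A = np.vstack([anchors_b, np.ones_like(anchors_b)]).T
    slope, intercept = np.linalg.lstsq(A, anchors_a, rcond=None)[0]
    return slope * np.asarray(b) + intercept

b = np.array([0.12, 0.40, 0.75, 0.90])           # automatic, unknown accuracy
print(pivot_rescale(b, np.array([0.2, 0.8]), np.array([0.35, 0.90])))
```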
Cross-mapping concept spaces
• Distributional semantics – a corpus-driven approach – is based on the hypothesis that words co-occurring in similar contexts have similar meaning
• Using word2vec in DINFRA, we can map all words in a vocabulary to an n-dimensional vector space, where we can obtain relatedness scores among the words
• The figure illustrates an example
• For each image in Kids’Cam we can evaluate the relatedness between the human annotation and the highest-probability automatic concepts; a sketch follows below
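The slides use word2vec via DINFRA; as a stand-in, here is a hedged sketch with gensim and pretrained Google News vectors. The vector-file path and the example words are assumptions.

```python
# Hedged sketch: word2vec relatedness between a manual tag and automatic
# concepts, using gensim in place of DINFRA. Vector file path is assumed.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

manual_tag = "school"                                  # from the human annotation
auto_concepts = ["classroom", "whiteboard", "bottle"]  # example CNN outputs

for concept in auto_concepts:
    print(concept, kv.similarity(manual_tag, concept))  # cosine relatedness
```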
School > availability > drink bottle
• We have the top-ranked concepts, their confidences, and their relatedness to the manual tags …
• A first effort is to simply multiply them, as in the table, but it’s hard to see the impact of this; a sketch follows below
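A minimal sketch of that first effort: multiply each concept’s detector confidence by its relatedness to the manual tag and re-rank. All numbers are invented for illustration.

```python
# Sketch of the "simply multiply" combination; all values are invented.
concepts = [
    ("classroom",  0.62, 0.71),   # (concept, confidence, relatedness)
    ("whiteboard", 0.44, 0.55),
    ("bottle",     0.38, 0.30),
]

reranked = sorted(((name, conf * rel) for name, conf, rel in concepts),
                  key=lambda pair: pair[1], reverse=True)
for name, score in reranked:
    print(f"{name}: {score:.3f}")
```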
And the result is …
• … and that’s where we currently are!
Conclusions and Future Work
• Since automatic concept detection using pre-defined models has made so much progress recently, we’re seeing vocabulary / concept-space mismatches
• Using 1.5M Kids’Cam images from wearable cameras, we have used
within-collection distributions to “smooth” concept weights (outliers and
gaps) in TFR
• We are trying to pivot around some manual annotations in order to
improve concept accuracies
• But, we need …
– More concepts – a richer vocabulary of them
– More varied manual annotations, not just fast food adverts
– A more global or collection-wide way to combine concept confidences and relatedness to known manual annotations
– Some validation of accuracy of automatic concepts to measure
accuracy of our post-processing
Finally, a plug …
• TRECVid Video Captioning Pilot task, 2016
• 2,000 Vine videos, each manually annotated with captions, twice
• 8 participating groups (CMU, CUHK, DCU, GMU,
NII, UvA, Sheffield)
• Two tasks …
– For each video, rank the 2,000 captions – the metric is MRR (see the sketch below)
– For each video, generate your own caption – the metrics are BLEU, METEOR, and the UMBC STS (Semantic Textual Similarity) service
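For reference, MRR is just the mean of the reciprocal ranks at which each video’s correct caption appears; a tiny sketch with invented ranks:

```python
# Mean Reciprocal Rank over 1-based ranks of each video's correct caption.
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mrr([1, 3, 2, 10]))   # 0.483... for these invented ranks
```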
• Lots of lessons learned, which we will build on for a full task in 2017, probably again using Vine videos