"Semantic Indexing of Wearable Camera Images: Kids’Cam Concepts"

244 views

Published on

Slides from Workshop on Vision and Language Integration Meets Multimedia Fusion at ACM MULTIMEDIA Conference, Amsterdam, August 2016

Published in: Data & Analytics

"Semantic Indexing of Wearable Camera Images: Kids’Cam Concepts"

1. Semantic Indexing of Wearable Camera Images: Kids’Cam Concepts
   Alan F. Smeaton (Dublin City University) … and …
2. ... Kevin McGuinness and Cathal Gurrin and Jiang Zhou and Noel E. O’Connor and Peng Wang and Brian Davis and Lucas Azevedo and Andre Freitas and Louise Signal and Moira Smith and James Stanley and Michelle Barr and Tim Chambers and Cliona Ní Mhurchu
3. Overview
   • Automatic assignment of one-per-class concept detectors is now commonplace.
   • We’re interested in the challenging case of processing images from wearable cameras, where improvement is necessary.
   • We try to exploit some limited manual annotations to improve the accuracy of automatic concept weights.
   • This work is not complete, it’s ongoing, but the story is interesting.
4. Analysis of Visual Media
   • More progress has been made within the last few years than in the previous decade.
   • Incorporation of deep learning, plus the availability of huge searchable image resources and training data.
   • Automatic image tagging is now hosted and offered by websites like Aylien, Imagga, Clarifai, and others, and is very cost-effective.
5. Analysis of Visual Media
   • These developments are welcome … but … restrictive tagging vocabularies.
   • How do these map to the vocabulary of users formulating queries?
   • An alternative approach is tagging at query time, but it’s expensive and not scalable to huge collections.
   • Almost all work on concept detection is based on one concept at a time.
   • TRECVid tried simultaneous detection of concept pairs like “computer screen with telephone” and “airplane with clouds”.
   • Limited success, but “Government Leader with Flag” was OK!
   • Detecting concepts independently needs a course-correction because:
     – It doesn’t avail of all available information sources
     – It doesn’t map to a user’s search vocabulary
6. Long-term approach …
   [Diagram: Images are assigned a Concept Set, which a Mapping links to the User Search vocabulary]
   How can a single image be mapped to two different vocabularies?
7. Using NL for image search … tagging
   • NL is fraught with complexities and ambiguities at all levels:
     – Lexical level: polysemy
     – Syntactic level: structural ambiguity
     – Semantic interpretations
     – Discourse level: pronoun resolution
   • Plus vocabulary limitations when finding a word or phrase to describe something.
   • When using computers to help search for image data, language challenges are exacerbated, yet we assume a “simplistic” approach of tagging by a set of concepts, notwithstanding what we’re seeing with captioning here today.
   • Tagging is very useful for smaller, niche applications in restricted domains with manual tagging, but we see scalability problems.
     – Addressed with progress in automatic tagging, but we’re tolerant of inaccuracies!
8. In this paper …
   • We are interested in images from wearable cameras, with lots of juicy challenges.
   • Notoriously difficult to process automatically because of …
     – Blurring caused by the wearer moving at image capture
     – Occlusions from the wearer’s hands
     – Lighting conditions
     – Fisheye lens for a wider perspective, causing distortion
     – First-person viewpoint, but not what the wearer sees
     – Content varies hugely across subjects
   • Applications in memory support, behaviour recording and analysis, security, other work-related uses, and QS (quantified self).
   • In this paper we work with wearable camera data from school children, for analysis of their environments.
9. Wearable Camera Images
10. The Kids’Cam Project
    • Child obesity is a significant public health concern worldwide.
    • Unequivocal evidence that marketing of energy-dense and nutrient-poor foods and beverages is a causal factor in child obesity.
    • Evidence of children’s total exposure to advertising of poor foodstuffs is not quantified.
    • The Kids’Cam study aimed to determine the frequency, nature and duration of children’s exposure to such marketing.
    • 169 randomly selected children aged 11 to 13, from 16 schools in Wellington, NZ, each wore an Autographer and carried a GPS for 4 days … images every 7 seconds, GPS every 5 seconds.
      – 1.5M images, 2.5M GPS datapoints
    • Manual annotation for food/beverage marketing using a 3-level, 53-concept ontology … inter-annotator reliability of 90%.
11. Manual Annotation
    [Diagram: each image carries GPS, date/time and user metadata, plus manual annotations drawn from 85 concepts]
12. Shop front > sign > sugary drinks/juices
13. Convenience store indoors > in-store marketing > convenience store
14. School > sign > fast food
15. Processing the Kids’Cam Data
    • Following integration of the different data sources, and after the manual annotation of images, we processed the image collection in the following way …
16. [Pipeline diagram: 1.5M images (+ GPS + Date/Time + User) with 85 manually annotated concepts; GPU concept models covering 6,000 concepts produce per-image concept scores, which TFR then refines; the build-up is numbered in steps 1–13]
17. [Same pipeline diagram, next build-up step highlighted]
18. [Pipeline diagram, step 7 highlighted]
    7. Using a CNN to apply tags to images. We used the VGG-16 network, a deep CNN, trained on 1,000 object classes using 1.2M images from ImageNet.
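As a minimal sketch of this tagging step (the slides give no implementation details, so the preprocessing and top-5 reporting below are illustrative, not the authors’ exact pipeline), using torchvision’s pretrained VGG-16:

```python
# Minimal sketch: score one image against the 1,000 ImageNet classes
# with a pretrained VGG-16, as described on slide 18.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16(pretrained=True).eval()

# Standard ImageNet preprocessing for VGG-16.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("frame.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)[0]  # 1,000 probabilities

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"class {idx.item()}: {p.item():.3f}")
```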
19. [Pipeline diagram, step 8 highlighted]
    8. The trained models were used to predict probabilities for each concept in each of the 1.5M images, processed in batches of 64 on an NVIDIA GPU and taking 4 days to complete.
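The batch-inference step might look like the following sketch; only the batch size of 64 and the GPU come from the slide, and `FrameDataset` is a hypothetical Dataset yielding (preprocessed tensor, image id) pairs:

```python
# Sketch: batched GPU inference over the collection (batch size 64,
# per the slide). FrameDataset is hypothetical, standing in for
# whatever loads and preprocesses the 1.5M Kids'Cam frames.
import torch
from torch.utils.data import DataLoader
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.vgg16(pretrained=True).eval().to(device)

loader = DataLoader(FrameDataset("/path/to/frames"),  # hypothetical
                    batch_size=64, num_workers=4)

concept_probs = {}
with torch.no_grad():
    for batch, image_ids in loader:
        probs = torch.softmax(model(batch.to(device)), dim=1)
        for image_id, p in zip(image_ids, probs):
            concept_probs[image_id] = p.cpu()  # 1,000 scores per image
```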
20. Training-Free Refinement
    • Current concept-at-a-time classifiers do not consider inter-concept relationships or dependencies, yet these do exist.
    • To improve one-per-class detectors, we post-process detection scores:
      – We take advantage of concept co-occurrence and re-occurrence, which depend on the particular collection
      – We take advantage of local (temporal) neighbourhood information, where concepts are likely to re-occur close in time
      – We use GPS location information, where concepts identified by a person at a location may re-occur subsequently at that same location
    • TFR is based on non-negative matrix factorisation, described elsewhere (a sketch of the idea follows below).
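The actual TFR algorithm is “described elsewhere”; purely as a hedged sketch of the underlying idea, the snippet below smooths an image-by-concept score matrix by reconstructing it from a low-rank non-negative factorisation, so each refined score borrows strength from concepts that co-occur across the collection. The matrix sizes, the rank, and the random scores are all illustrative.

```python
# Hedged sketch of NMF-based score refinement (not the authors' TFR):
# factorise the image-by-concept score matrix at low rank, then
# reconstruct it; the reconstruction smooths outliers and fills gaps
# according to collection-wide co-occurrence structure.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
scores = rng.random((1000, 85))  # stand-in for raw detector outputs

nmf = NMF(n_components=20, init="nndsvda", max_iter=500)
W = nmf.fit_transform(scores)    # images x latent factors
H = nmf.components_              # latent factors x concepts
refined = W @ H                  # smoothed image-by-concept scores
```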
21. [Pipeline diagram, step 9 highlighted]
    9. As previously described, we then applied Training-Free Refinement to improve the probability assignments.
22. • We do not know the accuracy of assignment of the 1,000 concepts, but we do know the accuracy of assignment of the 53 concepts … and we have 1.5M images, each mapped into 2 concept spaces.
    • Can we adjust values in (b), anchored and pivoting around (a), in addition to having already used local, within-collection distributions?
    [Diagram: two 2-D concept spaces – (a) manual, correct, with points a1 and a2; (b) automatic, unknown accuracy, with points b1 and b2]
23. [Diagram: spaces (a) and (b) from the previous slide, plus (c) an adjusted space with shifted points a1', a2', b1', b2']
24. Cross-mapping concept spaces
    • Distributional semantics – a corpus-driven approach – is based on the hypothesis that words co-occurring in similar contexts have similar meanings.
    • Using word2vec in DINFRA, we can map all words in a vocabulary to an n-dimensional vector space, where we can obtain relatedness scores among the words.
    • The figure illustrates an example.
    • For each image in Kids’Cam we can evaluate the relatedness between the human annotation and the automatic concepts with the highest probability.
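A sketch of the relatedness computation, using gensim’s word2vec interface in place of the DINFRA service named on the slide; the pretrained GoogleNews vectors and the word pairs are illustrative assumptions.

```python
# Sketch: word2vec relatedness between a manual tag and an automatic
# concept, via gensim (standing in for DINFRA). Assumes pretrained
# vectors, e.g. the GoogleNews word2vec binary, are on disk.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Cosine similarity in the embedding space = relatedness score.
print(vectors.similarity("bottle", "drink"))
print(vectors.similarity("school", "classroom"))
```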
25. School > availability > drink bottle
26. • We have top-ranked concepts, their confidences, and their relatedness to the manual tags …
    • Our first effort is to simply multiply, as in the table, but it’s hard to see the impact of this.
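That “first effort” rescoring, as a sketch (reusing `vectors` from the previous snippet; the concept names and confidences are made up for illustration):

```python
# Sketch: rescale each automatic concept's confidence by its
# relatedness to the manual tag, then re-rank. Values illustrative.
top_concepts = [("bottle", 0.62), ("desk", 0.41), ("screen", 0.33)]
manual_tag = "drink"

rescored = sorted(
    ((concept, conf * vectors.similarity(concept, manual_tag))
     for concept, conf in top_concepts),
    key=lambda pair: pair[1], reverse=True)
print(rescored)
```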
27. And the result is …
    • … and that’s where we currently are!
28. Conclusions and Future Work
    • Since automatic concept detection using pre-defined models has made so much progress recently, we’re seeing vocabulary / concept-space mismatches.
    • Using 1.5M Kids’Cam images from wearable cameras, we have used within-collection distributions to “smooth” concept weights (outliers and gaps) in TFR.
    • We are trying to pivot around some manual annotations in order to improve concept accuracies.
    • But we need …
      – More concepts – a richer vocabulary of them
      – More varied manual annotations, not just fast-food adverts
      – A more global or collection-wide way to combine concept confidences and relatedness to known manual annotations
      – Some validation of the accuracy of the automatic concepts, to measure the accuracy of our post-processing
29. Finally, a plug …
    • TRECVid Video Captioning Pilot task, 2016
    • 2,000 Vine videos, manually annotated with captions, twice
    • 8 participating groups (CMU, CUHK, DCU, GMU, NII, UvA, Sheffield)
    • Two tasks …
      – For each video, rank the 2,000 captions – the metric is MRR (see the sketch below)
      – For each video, generate your own caption – the metrics are BLEU, METEOR, and the UMBC STS (Semantic Textual Similarity) Service
    • Lots of lessons learned, which we will build upon for a full task in 2017, probably using Vine videos.
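For reference, the ranking metric named above, mean reciprocal rank, is simple enough to state as code:

```python
# Mean reciprocal rank: for each video, take 1 / (1-based rank of its
# correct caption among the 2,000 candidates), then average.
def mean_reciprocal_rank(true_ranks):
    return sum(1.0 / r for r in true_ranks) / len(true_ranks)

print(mean_reciprocal_rank([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```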
