2. Tags:
Vision
canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.
Language
Humans
It's the perfect party dress. With
distinctly feminine details such as a wide
sash bow around an empire waist and a
deep scoopneck, this linen dress will
keep you comfortable and feeling
elegant all evening long.
3. Visually Descriptive Text
“It was an arresting face, pointed of chin, square of jaw. Her eyes
were pale green without a touch of hazel, starred with bristly black
lashes and slightly tilted at the ends. Above them, her thick black
brows slanted upward, cutting a startling oblique line in her
magnolia-white skin–that skin so prized by Southern women and so
carefully guarded with bonnets, veils and mittens against hot
Georgia suns” – Gone with the Wind
How do people describe the world?
Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.
How does the world work?
What should we recognize?
4. Visually Descriptive Text
“It was an arresting face, pointed of chin, square of jaw. Her eyes
were pale green without a touch of hazel, starred with bristly black
lashes and slightly tilted at the ends. Above them, her thick black
brows slanted upward, cutting a startling oblique line in her
magnolia-white skin–that skin so prized by Southern women and so
carefully guarded with bonnets, veils and mittens against hot
Georgia suns” – from Gone with the Wind by Margaret Mitchell
How do people describe the world?
Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.
How does the world work?
What should we recognize?
5. What’s in a description?
What’s in this image?
man, baby, sling, shirt, glasses, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, …
What do people describe?
“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”
6. What’s in a description?
1) Given an image, predict what people will describe (e.g. Spain & Perona, 2010)
“two women sitting on bench reading magazine”
women ✔, bench ✔, brunette, magazine ✔, blonde, grass ✖, skirt ✖, …
2) Given a caption, predict what’s in the image
“looking for castles in the clouds out my car window”
clouds ✔, car ✖, window ✖, castle ?
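The ✔/✖ marks above can be reproduced with a very small check: for each object visible in the image, test whether the caption mentions it. The sketch below is only an illustration under simple assumptions (exact lowercase word matching, no handling of plurals or synonyms); the function name `mentioned_objects` is hypothetical and is not taken from the cited work.

```python
def mentioned_objects(objects, caption):
    """Map each object name to True if it appears as a word in the caption."""
    words = set(caption.lower().split())
    return {obj: obj in words for obj in objects}

# Objects and caption as read from the slide.
objects = ["women", "bench", "magazine", "grass", "skirt"]
caption = "two women sitting on bench reading magazine"

marks = mentioned_objects(objects, caption)
for obj, hit in marks.items():
    print(obj, "✔" if hit else "✖")
```

A learned predictor of "what people will describe" would replace the lookup with a model trained on many image and caption pairs; the lookup only shows how the ground-truth marks are derived.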
7. Who’s in the picture?
T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth
President George W. Bush makes a statement in the Rose Garden while
Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said
the United States would release graphic photographs of the dead sons of
Saddam Hussein to prove they were killed by American troops. Photo by
Larry Downing/Reuters
Model                          Accuracy of labeling
Vision model, No Lang model    67%
Vision model + Lang model      78%
8. Visually Descriptive Text
“It was an arresting face, pointed of chin, square of jaw. Her eyes
were pale green without a touch of hazel, starred with bristly black
lashes and slightly tilted at the ends. Above them, her thick black
brows slanted upward, cutting a startling oblique line in her
magnolia-white skin–that skin so prized by Southern women and so
carefully guarded with bonnets, veils and mittens against hot
Georgia suns” – from Gone with the Wind by Margaret Mitchell
How do people describe the world?
Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.
How does the world work?
What should we recognize?
9. Vision is hard
Green sheep
World knowledge (from descriptive text)
can be used to smooth noisy vision
predictions!
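One simple way to picture this smoothing is to blend a noisy vision score with a prior estimated from descriptive text. The sketch below is an illustration only: the log-linear blend, the `weight` parameter, and all the numbers are assumptions made up for the example, not the model or values from the talk.

```python
import math

def smooth(vision_scores, text_prior, weight=0.5):
    """Blend vision scores with a text-derived prior (log-linear mix), then renormalize.

    Both inputs map labels to strictly positive probabilities.
    """
    combined = {
        label: math.exp(
            (1 - weight) * math.log(vision_scores[label])
            + weight * math.log(text_prior[label])
        )
        for label in vision_scores
    }
    total = sum(combined.values())
    return {label: value / total for label, value in combined.items()}

# Hypothetical scores for the color of a detected sheep.
vision = {"green": 0.5, "white": 0.4, "brown": 0.1}   # noisy classifier output
prior = {"green": 0.01, "white": 0.80, "brown": 0.19}  # how often text pairs each color with "sheep"

smoothed = smooth(vision, prior)
best_before = max(vision, key=vision.get)
best_after = max(smoothed, key=smoothed.get)
print(best_before, "->", best_after)
```

With these made-up numbers the vision-only choice is "green", while the smoothed choice is "white", because text almost never describes sheep as green.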
10. Learning World Knowledge
BabyTalk: Understanding and Generating Simple Image Descriptions
Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011
Attributes
“green green grass by the lake”
“a very shiny car in the car museum in my hometown of upstate NY.”
Relationships
“very little person in a big rocking chair”
“Our cat Tusik sleeping on the sofa near a hot radiator.”
11. System Flow
Input Image → Extract Objects/stuff → Predict attributes → Predict prepositions → Predict labeling (vision potentials smoothed with text potentials) → Generate natural language description
Detected objects: a) dog, b) person, c) sofa
Attribute candidates scored per object: brown, striped, furry, wooden, feathered, ...
Preposition candidates scored per object pair: near, against, beside, ...
Predicted labeling:
<<null,person_b>,against,<brown,sofa_c>>
<<null,dog_a>,near,<null,person_b>>
<<null,dog_a>,beside,<brown,sofa_c>>
Generated description: “This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.”
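The last stage of the flow, turning a predicted labeling into sentences, can be sketched with plain templates. This is a minimal illustration assuming the triple format shown on the slide, `<<attribute,object>,preposition,<attribute,object>>`, with `None` standing in for `null`; the helper names and templates are hypothetical and are not the generation method from the paper.

```python
def noun_phrase(attr, obj):
    """Render 'brown sofa' when an attribute is present, else just 'sofa'."""
    return f"{attr} {obj}" if attr else obj

def describe(triples):
    """Generate a short template-based description from labeling triples."""
    # Collect objects in first-seen order, keeping any attribute found for them.
    objects = {}
    for (a1, o1), _, (a2, o2) in triples:
        for attr, obj in ((a1, o1), (a2, o2)):
            if obj not in objects or (attr and not objects[obj]):
                objects[obj] = attr
    opening = "This is a photograph of " + " and ".join(
        f"one {noun_phrase(attr, obj)}" for obj, attr in objects.items()
    ) + "."
    # Group the relations by their subject so each subject gets one sentence.
    by_subject = {}
    for (_, o1), prep, (_, o2) in triples:
        by_subject.setdefault(o1, []).append(
            f"{prep} the {noun_phrase(objects[o2], o2)}"
        )
    sentences = [opening]
    for subj, rels in by_subject.items():
        sentences.append(
            f"The {noun_phrase(objects[subj], subj)} is " + ", and ".join(rels) + "."
        )
    return " ".join(sentences)

triples = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]
description = describe(triples)
print(description)
```

Fixed templates keep the output faithful to the predicted labeling, at the cost of the repetitive phrasing visible in the generated captions.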
12. BabyTalk results
Objects, Attributes, Prepositions
This is a picture of one sky, one road and one sheep. The gray sky is
over the gray road. The gray sheep is by the gray road.
Here we see one road, one sky and one bicycle. The road is near
the blue sky, and near the colorful bicycle. The colorful bicycle is within
the blue sky.
This is a picture of two
dogs. The first dog is
near the second furry
13. Visually Descriptive Text
“It was an arresting face, pointed of chin, square of jaw. Her eyes
were pale green without a touch of hazel, starred with bristly black
lashes and slightly tilted at the ends. Above them, her thick black
brows slanted upward, cutting a startling oblique line in her
magnolia-white skin–that skin so prized by Southern women and so
carefully guarded with bonnets, veils and mittens against hot
Georgia suns” – from Gone with the Wind by Margaret Mitchell
How do people describe the world?
Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.
How does the world work?
What should we recognize?
14. What should we recognize?
• Recognition is beginning to work
• Open question – what should we recognize?
• Maybe objects aren’t (always) the right base-level entities
15. Object Recognition
Parts, Poselets, Attributes
For example:
[Fergus, Perona, Zisserman 2003],
[Bourdev, Malik 2009], …
Slide Credit: Ali Farhadi
16. Automatically Discovering Attributes from Noisy Web Data
T.L. Berg, A.C. Berg, J. Shih ECCV 2010
Fully beaded with megawatt
crystals, this Christian Louboutin suede
pump matches the gleam in your eye.
Pump's linear heel plays up the alluring
curves of its dipped sides.
Round toe frames low-cut vamp.
Tonally topstitched collar.
4" straight, covered heel shows off
signature red sole.
Creamy leather lining with padded
insole.
"Fifi" is made in Italy.
Learn which attributes in descriptions are depictable
terms
17. Given Web Images + Noisy Text Descriptions:
1) Discover visual attribute terms in text descriptions - likely domain dependent
2) Learn appearance models for attributes without labeled data
3) Characterize attributes by: type, localizability
18. Object Recognition
Scenes
For example:
[Oliva, Torralba 2001],
[SUN 2010], …
Slide Credit: Ali Farhadi
19. What are the right quanta of
Recognition?
Farhadi & Sadeghi
Recognition using Visual Phrases, CVPR 2011
20. Participating in phrases profoundly affects the
appearance of objects
Farhadi & Sadeghi
Recognition using Visual Phrases, CVPR 2011
21. What should we recognize?
“a sleeping dog in NTHU” “the dog is sleeping”
“A dog is sleeping in” “sleeping dog in delhi”
Maybe descriptive text can inform entity hypotheses!
22. What should we recognize?
“the cat is in the bag” “cat in a bag”
“cat in bag” “cat in the bag”
Maybe descriptive text can inform entity hypotheses!
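A rough sketch of how captions might suggest entity hypotheses: strip function words from each caption and count how often the same content phrase recurs. The stopword list and the function name `content_phrase` are assumptions made for this illustration, not a method described in the talk.

```python
from collections import Counter

# Small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "an", "is"}

def content_phrase(caption):
    """Lowercase a caption and drop stopwords, keeping word order."""
    return " ".join(w for w in caption.lower().split() if w not in STOPWORDS)

# Captions from the slide above.
captions = [
    "the cat is in the bag",
    "cat in a bag",
    "cat in bag",
    "cat in the bag",
]

counts = Counter(content_phrase(c) for c in captions)
print(counts.most_common(1))
```

A phrase that many independent captions converge on, such as "cat in bag", is a candidate for being recognized as a single entity rather than as separate objects.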
23. Conclusion
Use large pools of descriptive text to:
Learn how people describe the visual world
Learn how the world works
Guide future efforts in recognition
Apply this knowledge to multi-modal
collections & applications
24. Acknowledgements
• Collaborators: Alex Berg, David Forsyth, Jaety
Edwards, Jonathan Shih, Girish Kulkarni, Visruth
Premraj, Sagnik Dhar, Vicente Ordonez, Siming
Li, Yejin Choi, Kota Yamaguchi
• Funded by NSF Faculty Early Career
Development (CAREER) Program: Award
#1054133