The document discusses generating image descriptions using data-driven techniques. It describes collecting hundreds of millions of images and captions from Flickr and other sources on the web. Key objects, scenes and stuff are detected in images using computer vision techniques. Captions are parsed to retrieve noun phrases referring to detected objects and prepositional phrases containing spatial relationships. An integer linear program is used to compose new image descriptions by selecting relevant phrases while enforcing linguistic and discourse constraints. Evaluation shows the generated descriptions often match human captions and are preferred to descriptions from other baseline methods. While progress is made, challenges remain in object detection accuracy and avoiding nonsensical or irrelevant descriptions.
Data-driven Generation of Image Descriptions
1. Data-driven Generation of Image Descriptions
Vicente Ordonez-Roman
Advisor: Tamara Berg
Previously: The State University of New York
2. What most Computer Vision systems aim to say about a picture
Computer Vision: sky, trees, water, building, bridge, river, tree
3. What we are able to say about a picture (Our Goal)
An old bridge over dirty green water.
One of the many stone bridges in town that carry the gravel carriage roads.
A stone bridge over a peaceful river.
4. Let’s just borrow captions from similar images!
Im2Text: Describing Images Using 1 Million Captioned Photographs.
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
5. Harness the Web!
Images + captions from the Web:
“Smallest house in paris between red (on right) and beige (on left).”
“Bridge to temple in Hoan Kiem lake.”
“A walk around the lake near our house with Abby.”
“Hangzhou bridge in West lake.”
“The daintree river by boat.”
...
Matching using global image features (GIST + Color), then transfer caption(s), e.g. “The water is clear enough to see fish swimming around in it.”
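To make the retrieval step concrete, here is a minimal sketch of global-feature caption transfer. It is not the authors' exact pipeline: a downsampled "tiny image" plus a global color histogram stands in for GIST + Color, and `db_paths`/`db_captions` are hypothetical stand-ins for the captioned photo collection.

```python
# Minimal sketch of global-feature caption transfer, NOT the authors' exact
# pipeline: a downsampled "tiny image" plus a global color histogram stands
# in for GIST + Color; db_paths/db_captions are hypothetical stand-ins.
import numpy as np
from PIL import Image
from sklearn.neighbors import NearestNeighbors

def global_descriptor(path, size=32, bins=8):
    img = Image.open(path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    tiny = arr.flatten()                              # crude GIST stand-in
    hist, _ = np.histogramdd(arr.reshape(-1, 3),      # global color histogram
                             bins=(bins,) * 3, range=((0, 1),) * 3)
    return np.concatenate([tiny, hist.flatten() / hist.sum()])

def transfer_captions(query_path, db_paths, db_captions, k=4):
    feats = np.stack([global_descriptor(p) for p in db_paths])
    nn = NearestNeighbors(n_neighbors=k).fit(feats)
    _, idx = nn.kneighbors(global_descriptor(query_path)[None, :])
    return [db_captions[i] for i in idx[0]]           # borrowed captions
```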
6. Use the web to collect images + captions
Facebook: 90,000,000,000 pictures! (**) A lot of them with captions (a lot of them not publicly available)
Flickr: 6,000,000,000 photographs! (*) A lot of them with captions (lots of them publicly available)
(*) http://blog.flickr.net/en/2011/08/04/6000000000/
(**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day
7. Flickr images + captions
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
13. Solution:
Collect hundreds of millions of captions
Filter them out
We found “good captions” have visual concepts and relation words: “by”, “in”, “over”, “beside”, “on top of”
~1 “good caption” for every 1000 “bad captions” (filtering sketched below)
Im2Text: Describing Images Using 1 Million Captioned Photographs.
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
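A minimal sketch of the filtering idea on this slide, under loud assumptions: `VISUAL_WORDS` is a tiny hypothetical stand-in for the real vocabulary of detectable objects, scenes, and stuff, and the actual filter used to build the dataset is more involved.

```python
# Sketch of the caption filter; VISUAL_WORDS is a hypothetical stand-in
# for the real visual-concept vocabulary.
RELATION_WORDS = {"by", "in", "over", "beside", "on top of", "under", "near"}
VISUAL_WORDS = {"dog", "cat", "bridge", "river", "sky", "tree", "house", "boat"}

def is_good_caption(caption, min_words=3, max_words=25):
    text = caption.lower()
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    has_visual = any(w in VISUAL_WORDS for w in words)
    has_relation = any(r in text for r in RELATION_WORDS)
    return has_visual and has_relation   # keeps roughly 1 in 1000 raw captions

assert is_good_caption("a stone bridge over a peaceful river")
assert not is_good_caption("me and the gang, good times!")
```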
14. SBU Captioned Photo Dataset
The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon
Man sits in a rusted car buried in the sand on Waitarere beach
Little girl and her dog in northern Thailand. They both seemed interested in what we were doing
Our dog Zoe in her bed
Interior design of modern white and brown living room furniture against white wall with a lamp hanging.
Emma in her hat looking super cute
15. Results
(1) while walking by the water
(2) plane flying over the sun
(3) shot this in a moving car at the nkve highway
(4) sunset over creve coeur lake and the page bridge
(5) sunset on 12th sep 2009 as seen from the field polder near my house
(6) window over yellow door
(7) sunset over capitol hill as seen from the roof of my building
(8) an orange sky over the irish sea
(9) beautiful golden sunset reflected in the waves of the ocean
(10) red sky probably caused by volcanic ash from iceland
(11) a view of sunset over river brahmaputa from koliyabhumura bridge
(12) red sky in the morning
16. Results
(1) burnt wooden door in derelict building portugal
(2) peterborough cathedral norman door in south wall
(3) amazing wooden door with wider light above
(4) door in wall
(5) girl looking in a classroom window
(6) a interesting cross in a window of an ancient city
(7) this mirror decorated with fruit painting was left behind by the previous owners
(8) unusual exterior wall postbox at st albans post office in st peters street al1
(9) door in oxford uk in black and white
(10) 19 plate behind glass in brass mat and preserver
(11) this is some of the window decoration external on the house just over the porch 0364
(12) cat in a window
17. Results
(1) img8783 ginger in the red chair
(2) red sky in the morning
(3) the cat is in the bag and the bag is in the river
(4) the light in the kitchen made everythin glow my little girl is growing up
(5) my cat in a box that is far too small for her
(6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata
(7) baby in her later years turned from green to red but she never went fully red all over
(8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed
(9) glazed ceramic poop form in orange wooden box
(10) rock garden in library
(11) it s funny to capture the preciousest cat in the house at his most devillicious
(12) the pink will get replaced by orange and blue in the fall
18. Results
(1) starfish from the book toys to knit dashing dachs superwash sock yarn in goldfish backing is orange fabric stuffing is pillow stuffing
(2) mural of birds and trees in the crypt of wat ratburana ayutthaya
(3) carvings in the rock wall
(4) acrylic on paper scarlet macaws communicate in the color red with yellow and blue as visual grammar
(5) epsom and table salt crystals growing in concentrated green tea solution
(6) the hops dried to a golden green in a matter of a few days almost too pretty to bag up
(7) after staring at the gorgeous colors of the leaves claes discovered that there were about 100 birds sleeping in the
(8) you know you re in wisconsin when the beach has pine needles in the sand
(9) i was walking down the sidewalk and i saw this glove craft dropped in the dirt it seemed really unusual
(10) made by fusing plastic bags
(11) bark pattern from a ponderosa pine tree in grand canyon national park
(12) the peasant that found a statue of the black virgin on a rock in a river
20. Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)
The bridge over the lake on Suzhou Street.
Iron bridge over the Duck river.
The Daintree river by boat.
Bridge over Cacapon river.
...
Transfer caption(s), e.g. “The bridge over the lake on Suzhou Street.”
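A hedged sketch of content-based reranking: combine global similarity with agreement between detected content labels. The paper learns this combination (linear regression or a linear SVM, as in the BLEU table later); the fixed weights and the simple overlap score below are placeholder assumptions.

```python
# Placeholder reranker: combine global similarity with label agreement.
def content_score(query_labels, candidate_labels):
    q, c = set(query_labels), set(candidate_labels)
    return len(q & c) / (len(q | c) or 1)     # Jaccard agreement of labels

def rerank(candidates, query_labels, w_global=0.5, w_content=0.5):
    # candidates: list of (caption, global_similarity, labels)
    scored = sorted(
        ((w_global * sim + w_content * content_score(query_labels, labels), cap)
         for cap, sim, labels in candidates),
        reverse=True)
    return [cap for _, cap in scored]

print(rerank([("Iron bridge over the Duck river.", 0.8, ["bridge", "water"]),
              ("The daintree river by boat.", 0.9, ["boat"])],
             query_labels=["bridge", "water"]))
```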
21. Some success…
Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.
A female mallard duck in the lake at Luukki Espoo
Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there.
The sun was coming through the trees while I was sitting in my chair by the river
Fresh fruit and vegetables at the market in Port Louis Mauritius.
Tree with red leaves in the field in autumn.
Under the sky of burning clouds.
Stained glass window in Eusebius church.
22. Still far from perfect
Incorrect objects
Kentucky cows in a field.
The cat in the window.
23. Still far from perfect
Incorrect context
The sky is blue over the Gherkin.
Tree beside the river.
Completely wrong
The boat ended up a kilometre from
the water in the middle of the airstrip.
Water over the road.
24. How to Evaluate?
• “Ground truth”: The car is parked next to the train station besides a building.
• Candidates:
“There is car parked in front of an office building”
“This is the building that hosted the ceremony”
“A vehicle stopped next to my house”
Similar to evaluation on Machine Translation
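For concreteness, here is how a BLEU score between a candidate and a reference caption can be computed with NLTK; the exact BLEU variant and smoothing used in the paper are not specified here, so the unigram weighting and method1 smoothing below are illustrative assumptions.

```python
# Illustrative MT-style BLEU evaluation of a candidate caption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the car is parked next to the train station besides a building".split()
candidate = "there is car parked in front of an office building".split()

score = sentence_bleu([reference], candidate,
                      weights=(1.0,),  # unigram BLEU (assumption)
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-1: {score:.4f}")
```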
25. BLEU score evaluation against Human Captions

Method                                          BLEU score
Global matching (1k)                            0.0774
Global matching (10k)                           0.0909
Global matching (100k)                          0.0917
Global matching (1 million)                     0.1177
Global + Content matching (linear regression)   0.1215
Global + Content matching (linear SVM)          0.1259
26. Human Visual Verification
Please choose the image that better corresponds to the given caption:
“View overlooking Kuala Lumpur from my office building”
27. Human Visual Verification
Please choose the image that better corresponds to the given caption (caption from Flickr vs. random image):
“View overlooking Kuala Lumpur from my office building”
28. Human Visual Verification
Please choose the image that better corresponds to the given caption (caption from Flickr vs. random image):
“View overlooking Kuala Lumpur from my office building”

Caption used                   Success rate
Original human caption         96.0%
Top caption                    66.7%
Best from our top 4 captions   92.7%
29. Human Visual Evaluation
Please choose the image that better corresponds to the given caption (caption produced by our system vs. random image):
“The view from the 13th floor of an apartment building in Nakano awesome.”

Caption used                   Success rate
Original human caption         96.0%
Top caption                    66.7%
Best from our top 4 captions   92.7%
32. Let’s not borrow captions from other images, let’s just borrow short phrases!
Collective Generation of Natural Image Descriptions.
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi.
Association for Computational Linguistics. ACL 2012.
Large Scale Retrieval for Image Description Generation
Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell,
Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III,
Alexander C. Berg, Yejin Choi, Tamara L. Berg
Under submission to the IJCV special issue on Big Data.
34. Retrieving verb phrases from similar object detections
Detect: dog. Find matching dog detections by visual similarity.
Contented dog just laying on the edge of the road in front of a house..
Peruvian dog sleeping on city street in the city of Cusco, (Peru)
this dog was laying in the middle of the road on a back street in jaco
Closeup of my dog sleeping under my desk.
35. Retrieving prepositional phrases from region + detection matches
Object: car. Find matching region detections using appearance + arrangement.
Cordoba - lonely elephant under an orange tree...
Comfy chair under a tree.
I positioned the chairs around the lemon tree - it's like a shrine
Mini Nike soccer ball all alone in the grass
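A rough sketch of scoring such region + detection matches by appearance plus spatial arrangement. The descriptors, box encoding, normalization, and weights here are placeholder assumptions, not the paper's actual features.

```python
# Placeholder scoring of region + detection matches.
import numpy as np

def arrangement(obj_box, region_box):
    # relative offset and scale of a stuff region w.r.t. the object box,
    # with boxes given as (x, y, w, h) in normalized image coordinates
    (ox, oy, ow, oh), (rx, ry, rw, rh) = obj_box, region_box
    return np.array([rx - ox, ry - oy, rw / ow, rh / oh])

def match_score(query, candidate, w_app=0.5, w_arr=0.5):
    # query/candidate: {"feat": appearance vector, "boxes": (obj_box, region_box)}
    qf, cf = query["feat"], candidate["feat"]
    appearance = qf @ cf / (np.linalg.norm(qf) * np.linalg.norm(cf))
    spatial = -np.linalg.norm(arrangement(*query["boxes"]) -
                              arrangement(*candidate["boxes"]))
    return w_app * appearance + w_arr * spatial
```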
36. Retrieving prepositional phrases from scene matches
Extract scene descriptor; find matching images by scene similarity.
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
View from our B&B in this photo
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore
37. Data Processing
1 million images:
– Run object detectors
– Run region-based stuff detectors (e.g. grass, sky, etc.)
– Run global scene classifiers
– Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene); a parsing sketch follows below.
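A small sketch of the caption-parsing step. The paper uses the Berkeley parser; here NLTK's POS tagger and a cascaded chunk grammar stand in to pull out NPs, VPs, and PPs (the `punkt` and tagger models must be downloaded first).

```python
# Hypothetical stand-in for the Berkeley-parser step: chunk captions into
# NPs, VPs and PPs. Requires nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger").
import nltk

GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # noun phrases referring to objects
  PP: {<IN><NP>}            # prepositional phrases: spatial relations, scene context
  VP: {<VB.*><PP|NP>*}      # simple verb phrases
"""
CHUNKER = nltk.RegexpParser(GRAMMAR)

def extract_phrases(caption):
    tree = CHUNKER.parse(nltk.pos_tag(nltk.word_tokenize(caption)))
    return {label: [" ".join(w for w, _ in st.leaves())
                    for st in tree.subtrees(lambda t: t.label() == label)]
            for label in ("NP", "VP", "PP")}

print(extract_phrases("dog with a ball running around on the green grass"))
```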
39. Sometimes you can make it (a little) better
Detecting “mentioned” objects:
Look in the mountain for a lion face
Ecuador, amazon basin, near coca, rain forest, passion fruit flower
The background is a vintage paint by number painting I have and the fabulous forest dress is by candyjunky!
Kevin’s mom, so punxrawk in Kev’s black flag hat
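A minimal sketch of that trick: run a detector only when its object is mentioned in the caption. `detectors` is a hypothetical mapping from object name to a detector callable; real matching would also need to handle plurals and synonyms.

```python
# Caption-guided detection: only run detectors for mentioned objects.
def detect_mentioned(image, caption, detectors):
    mentioned = set(caption.lower().split())
    return {name: detector(image)          # run only the relevant detectors
            for name, detector in detectors.items()
            if name in mentioned}          # text prior from the caption
```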
42. Binary Integer Linear Programming
Objective (reconstructed from the slide diagram; notation introduced here): maximize
$\sum_{i,j,k} F(s_{ij})\,x_{ijk} + \sum_{i,j,p,q,k} \Phi(s_{ij}, s_{pq})\,y_{ijpqk}$
where $x_{ijk} \in \{0,1\}$ selects phrase $s_{ij}$ for position $k$, $F(s_{ij})$ is the phrase vision confidence, and $\Phi(s_{ij}, s_{pq})$ is the pairwise phrase cohesion between the phrase at position $k$ and the phrase at position $k+1$, built from n-gram co-occurrence and co-occurrence statistics between head words.
43. Composing Descriptions
Compose descriptions from phrases with an ILP approach (a minimal sketch follows below)
• Linguistic constraints
– Allow only one phrase of each type
– Enforce plural/singular agreement between NP and VP
• Discourse constraints
– Prevent inclusion of repeated phrasing
• Phrase cohesion constraints
– n-gram statistics between phrases
– Co-occurrence statistics between head words of phrases (last word or main verb) to encourage longer range cohesion
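A minimal sketch of phrase selection as a binary ILP using PuLP. It keeps only the vision-confidence objective and the one-phrase-per-type constraint; the pairwise cohesion terms, agreement, and discourse constraints from the slide are omitted, and the candidate phrases and their scores are hypothetical.

```python
# Toy phrase-selection ILP; candidates and scores are hypothetical.
import pulp

phrases = {
    "NP": [("a stone bridge", 0.9), ("the cat", 0.2)],
    "VP": [("spanning", 0.6), ("sleeping", 0.1)],
    "PP": [("over a peaceful river", 0.8), ("in a sink", 0.1)],
}

prob = pulp.LpProblem("compose_description", pulp.LpMaximize)
x = {(t, i): pulp.LpVariable(f"x_{t}_{i}", cat="Binary")
     for t, cands in phrases.items() for i in range(len(cands))}

# objective: total vision confidence of the selected phrases
prob += pulp.lpSum(phrases[t][i][1] * x[t, i] for (t, i) in x)

# linguistic constraint: allow at most one phrase of each type
for t, cands in phrases.items():
    prob += pulp.lpSum(x[t, i] for i in range(len(cands))) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(" ".join(phrases[t][i][0] for (t, i) in x if x[t, i].value() == 1))
```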
44. Good Results
This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings.
Taken in front of my cat sitting in a shoe box. Cat likes hanging around in my recliner.
This is a brass viking boat moored on beach in Tobago by the ocean.
45. Bad Results
Grammatically incorrect / cognitive absurdity:
One of the most shirt in the wall of the house.
Here you can see a cross by the frog in the sky.
Not relevant:
This is a shoulder bag with a blended rainbow effect
47. Human Forced Choice Evaluation

Caption used                                        ILP Selection
ILP vs. HMM (no images, no cognitive phrases)       67.2%
ILP vs. HMM (no images, with cognitive phrases)     66.3%
ILP vs. HMM (with images, no cognitive phrases)     53.17%
ILP vs. HMM (with images, with cognitive phrases)   54.5%
ILP vs. NIPS 2011 (Global matching 1M)              71.8%
ILP vs. HUMAN                                       16%
48. Visual Turing Test: Us vs. Original Human Written Caption
In some cases (16%), ILP-generated captions were preferred over human-written ones!
50. To be presented at ICCV 2013: Meaning from large-scale computer vision
Images with the word “house” vs. images recognized as more likely to produce the word “house”
51. To be presented at ICCV 2013: Meaning from large-scale computer vision
Images with the word “girl” vs. images recognized as more likely to produce the word “girl”
52. To be presented at ICCV 2013: Meaning from large-scale computer vision
Weights learned to recognize images with “desk” in caption, learned over the outputs of ~8k classifiers. [Figure: top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other.]
53. To be presented at ICCV 2013: Meaning from large-scale computer vision
Weights learned to recognize images with “tree” in caption, learned over the outputs of ~8k classifiers. [Figure: top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other.]
Most computer vision methods deal with the problem of identifying individual pieces of information, but do not produce the kind of output you would expect from a human. From this picture a good computer vision system would identify sky, trees, water, building, perhaps even bridge; a person, on the other hand, would say something like “a stone bridge over a peaceful river”. So our goal in this paper is to generate image descriptions, as opposed to the individual pieces of information that computer vision methods would usually output.
We approach this task in a data-driven manner by first building a dataset of 1 million images with visually relevant captions. We construct this dataset by collecting an enormous number of captions assigned to images by web users and filtering them so that we end up with captions that are more likely to refer to visual content. We use standard global image feature descriptors such as GIST and Tiny Images to retrieve similar images from which we can directly transfer captions.
Again we make use of the million-image SBU Captioned Photo Dataset.
Additionally, we incorporate high-level information to rerank the retrieved images used by the previous baseline method, by running object detectors, scene classification, stuff detection, people and action detection, and computing text statistics. So in this example we have bridge and water detections; we use those to match against similar detections in the retrieved set of images. As you can see, we run object detectors on our retrieved images only if a relevant keyword is mentioned. Text statistics are also relevant: if many images in the retrieved set agree that there is a bridge, then those images are rewarded in the final ranking as well. And then, again, we can transfer captions from this reranked set of images.
Finally, here are some good and bad results obtained using our full approach. The first picture says “Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind”. The captions are very human-like because they were written by actual humans, and this works surprisingly well for some types of images. On the other hand, even with 1 million images we can’t generalize to all possible observable images, and our image matching methods can fail, leading to bad results. If you would like to check our quantitative results in more detail, please come to our poster. Thanks.
We can retrieve noun phrases referring to an object in a query image using visual similarity between the query detection and detections from the database.
Similarly, we can retrieve verb phrases based on similar matching poses, for example giving us “laying on the edge of the road in front of a house”.
For relationships between objects and stuff detections we use a combination of matching appearance and similarity in spatial arrangement. So here, for these car, tree, and grass detections, we can retrieve phrases like “under a tree”, “in the grass”, and so on.
Finally, we can use our scene detectors to find matching images by scene similarity. For this we use the output of all of our scene classifiers as a descriptor of the image scene, and then find similar scenes according to similarity between scene descriptors. This sometimes, but not always, produces quite pleasing results; here we generally get similar European street scenes matching our query image. These phrases provide a sort of general scene context for a description.
First we do some processing on the data, including running about 100 object detectors, region-based stuff detectors, and global scene classifiers; finally we parse the captions using the Berkeley parser to get phrases referring to objects, spatial arrangements with background elements, and general scene descriptions.
But one issue with running lots of detectors is that it produces really noisy results. If, for example, you try to run 100 object and pose detectors on even these fairly simple images, you get a big mess of detections: here’s a bicycle in the mountain, a chair down here… The correct detections may be in there somewhere, but you can’t really see them amongst all the noisy false detections. So obviously we had to make these results better if we were going to be able to use them.
So we decided to play some simple tricks to make our recognition problem a little easier. For example, if you have some prior on what you expect to be in the image, then you can guide recognition in the right direction. In our case, with our giant captioned dataset, we have really good evidence for what might be in an image: we have some text telling us the likely objects. So for an image with a caption, we can just run the detectors for the objects mentioned in the caption. Woohoo! That produces still imperfect, but considerably better, recognition results. Now we can use these for captioning.
We compose descriptions from retrieved phrases using an ILP approach with a number of constraints: vision confidences, linguistic constraints, discourse constraints, and phrase cohesion constraints.
The captions we produce are often quite reasonable, sometimes even preferred over the original human written ones!