Data-driven Generation of Image Descriptions

  1. Data-driven Generation of Image Descriptions. Vicente Ordonez-Roman. Advisor: Tamara Berg. Previously: The State University of New York.
  2. What most computer vision systems aim to say about a picture: labels such as "sky", "trees", "water", "building", "bridge", "river".
  3. What we are able to say about a picture (our goal): "An old bridge over dirty green water." "One of the many stone bridges in town that carry the gravel carriage roads." "A stone bridge over a peaceful river."
  4. Let’s just borrow captions from similar images! Im2Text: Describing Images Using 1 Million Captioned Photographs. Vicente Ordonez, Girish Kulkarni, Tamara L. Berg. Advances in Neural Information Processing Systems (NIPS), 2011.
  5. Harness the Web! Images + captions from the web, e.g. "Smallest house in Paris between red (on right) and beige (on left)." "Bridge to temple in Hoan Kiem lake." "A walk around the lake near our house with Abby." "Hangzhou bridge in West lake." "The Daintree river by boat." Match the query against these using global image features (GIST + color), then transfer the caption(s) of the nearest neighbors, e.g. "The water is clear enough to see fish swimming around in it."
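As a rough illustration of this retrieval step, here is a minimal Python sketch of caption transfer by global-feature nearest neighbor. A tiny downsampled image plus a color histogram stands in for the GIST + color descriptor used in the talk; the database format and all names are illustrative assumptions, not the paper's code.

```python
# Caption transfer by global nearest neighbor -- a sketch, not the paper's code.
# The descriptor below (tiny image + color histogram) is a cheap stand-in for
# the GIST + color features used in the talk.
import numpy as np
from PIL import Image

def compute_descriptor(path, size=8, bins=8):
    img = np.asarray(Image.open(path).convert("RGB").resize((size, size)),
                     dtype=np.float32) / 255.0
    tiny = img.ravel()                                  # coarse spatial layout
    hist = np.concatenate([np.histogram(img[..., c], bins=bins, range=(0, 1))[0]
                           for c in range(3)]).astype(np.float32)
    d = np.concatenate([tiny, hist])
    return d / (np.linalg.norm(d) + 1e-8)

def transfer_caption(query_path, database):
    """database: list of (image_path, caption) pairs harvested from the web."""
    q = compute_descriptor(query_path)
    feats = np.stack([compute_descriptor(p) for p, _ in database])
    best = int(np.argmax(feats @ q))                    # cosine similarity
    return database[best][1]                            # borrow its caption
```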
  6. Use the web to collect images + captions. Facebook: ~90,000,000,000 pictures (**), a lot of them with captions, but most not publicly available. Flickr: ~6,000,000,000 photographs (*), a lot of them with captions, and lots of them publicly available. (*) http://blog.flickr.net/en/2011/08/04/6000000000/ (**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day
  7. Flickr images + captions: "Dog with a ball in its mouth running around like crazy on the green grass." "Cat in a sink." "A 10-kg cat called Hercules... got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said."
  13. Solution: Collect hundreds of millions of captions and filter them. We found that "good captions" contain visual concepts and relation words such as "by", "in", "over", "beside", "on top of"; roughly 1 "good caption" for every 1,000 "bad captions" (a filter sketch follows below). Im2Text: Describing Images Using 1 Million Captioned Photographs. Vicente Ordonez, Girish Kulkarni, Tamara L. Berg. Advances in Neural Information Processing Systems (NIPS), 2011.
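A minimal sketch of this kind of caption filter. The word lists and length bounds here are illustrative stand-ins; the vocabularies actually used in the paper are far larger.

```python
# Keep only captions that mention at least one visual concept and at least
# one spatial relation word -- a sketch of the filtering idea, with toy lists.
VISUAL_CONCEPTS = {"dog", "cat", "bridge", "river", "tree", "house",
                   "water", "sky", "car", "grass", "beach", "window"}
RELATION_WORDS = {"by", "in", "over", "beside", "on", "under", "near"}

def is_good_caption(caption, min_len=3, max_len=20):
    tokens = caption.lower().split()
    if not (min_len <= len(tokens) <= max_len):
        return False
    return (any(t in VISUAL_CONCEPTS for t in tokens)
            and any(t in RELATION_WORDS for t in tokens))

assert is_good_caption("dog with a ball running on the green grass")
assert not is_good_caption("'nuff said")
```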
  14. SBU Captioned Photo Dataset. Example captions: "The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon." "Man sits in a rusted car buried in the sand on Waitarere beach." "Little girl and her dog in northern Thailand. They both seemed interested in what we were doing." "Our dog Zoe in her bed." "Interior design of modern white and brown living room furniture against white wall with a lamp hanging." "Emma in her hat looking super cute."
  15. Results: (1) while walking by the water (2) plane flying over the sun (3) shot this in a moving car at the nkve highway (4) sunset over creve coeur lake and the page bridge (5) sunset on 12th sep 2009 as seen from the field polder near my house (6) window over yellow door (7) sunset over capitol hill as seen from the roof of my building (8) an orange sky over the irish sea (9) beautiful golden sunset reflected in the waves of the ocean (10) red sky probably caused by volcanic ash from iceland (11) a view of sunset over river brahmaputa from koliyabhumura bridge (12) red sky in the morning
  16. Results: (1) burnt wooden door in derelict building portugal (2) peterborough cathedral norman door in south wall (3) amazing wooden door with wider light above (4) door in wall (5) girl looking in a classroom window (6) a interesting cross in a window of an ancient city (7) this mirror decorated with fruit painting was left behind by the previous owners (8) unusual exterior wall postbox at st albans post office in st peters street al1 (9) door in oxford uk in black and white (10) 19 plate behind glass in brass mat and preserver (11) this is some of the window decoration external on the house just over the porch 0364 (12) cat in a window
  17. Results: (1) img8783 ginger in the red chair (2) red sky in the morning (3) the cat is in the bag and the bag is in the river (4) the light in the kitchen made everythin glow my little girl is growing up (5) my cat in a box that is far too small for her (6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata (7) baby in her later years turned from green to red but she never went fully red all over (8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed (9) glazed ceramic poop form in orange wooden box (10) rock garden in library (11) it s funny to capture the preciousest cat in the house at his most devillicious (12) the pink will get replaced by orange and blue in the fall
  18. Results: (1) starfish from the book toys to knit dashing dachs superwash sock yarn in goldfish backing is orange fabric stuffing is pillow stuffing (2) mural of birds and trees in the crypt of wat ratburana ayutthaya (3) carvings in the rock wall (4) acrylic on paper scarlet macaws communicate in the color red with yellow and blue as visual grammar (5) epsom and table salt crystals growing in concentrated green tea solution (6) the hops dried to a golden green in a matter of a few days almost too pretty to bag up (7) after staring at the gorgeous colors of the leaves claes discovered that there were about 100 birds sleeping in the (8) you know you re in wisconsin when the beach has pine needles in the sand (9) i was walking down the sidewalk and i saw this glove craft dropped in the dirt it seemed really unusual (10) made by fusing plastic bags (11) bark pattern from a ponderosa pine tree in grand canyon national park (12) the peasant that found a statue of the black virgin on a rock in a river
  19. What to do next?
  20. Use high-level content to rerank (objects, stuff, people, scenes, captions). Retrieved captions: "The bridge over the lake on Suzhou Street." "Iron bridge over the Duck river." "The Daintree river by boat." "Bridge over Cacapon river." Transfer caption(s), e.g. "The bridge over the lake on Suzhou Street." (A reranking sketch follows below.)
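A minimal sketch of content-based reranking. The detection labels, scores, and the fixed mixing weight alpha are hypothetical; the paper instead learns the combination (linear regression / linear SVM, per the BLEU table on slide 25).

```python
# Rerank globally-matched candidates by overlap between detected content in
# the query and in each candidate -- a sketch with made-up detections.
def rerank(query_detections, candidates, alpha=0.5):
    """candidates: (global_similarity, detections, caption) tuples, where
    detections maps object names to detector confidences."""
    def content_score(dets):
        shared = set(query_detections) & set(dets)
        return sum(query_detections[o] * dets[o] for o in shared)
    scored = [(alpha * g + (1 - alpha) * content_score(d), cap)
              for g, d, cap in candidates]
    return [cap for _, cap in sorted(scored, reverse=True)]

query = {"bridge": 0.9, "water": 0.8}
cands = [(0.7, {"bridge": 0.8, "water": 0.6}, "The bridge over the lake on Suzhou Street."),
         (0.9, {"car": 0.9}, "My car parked outside.")]
print(rerank(query, cands)[0])   # -> the bridge caption wins
```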
  21. Some success… "Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind." "A female mallard duck in the lake at Luukki Espoo." "Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there." "The sun was coming through the trees while I was sitting in my chair by the river." "Fresh fruit and vegetables at the market in Port Louis Mauritius." "Tree with red leaves in the field in autumn." "Under the sky of burning clouds." "Stained glass window in Eusebius church."
  22. Still far from perfect. Incorrect objects: "Kentucky cows in a field." "The cat in the window."
  23. Still far from perfect. Incorrect context: "The sky is blue over the Gherkin." "Tree beside the river." Completely wrong: "The boat ended up a kilometre from the water in the middle of the airstrip." "Water over the road."
  24. How to evaluate? "Ground truth": "The car is parked next to the train station besides a building." Candidates: "There is a car parked in front of an office building." "This is the building that hosted the ceremony." "A vehicle stopped next to my house." This is similar to evaluation in machine translation (a BLEU sketch follows below).
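A minimal sketch of this style of evaluation using NLTK's sentence-level BLEU. The exact BLEU configuration (n-gram order, smoothing) behind the numbers on the next slide is not specified here, so treat this as illustrative only.

```python
# Score candidate captions against a human "ground truth" caption with BLEU,
# the same family of metric used in machine translation evaluation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the car is parked next to the train station besides a building".split()
candidates = [
    "there is a car parked in front of an office building",
    "this is the building that hosted the ceremony",
    "a vehicle stopped next to my house",
]
smooth = SmoothingFunction().method1
for cand in candidates:
    score = sentence_bleu([reference], cand.split(),
                          weights=(0.5, 0.5),          # unigram + bigram BLEU
                          smoothing_function=smooth)
    print(f"{score:.3f}  {cand}")
```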
  25. BLEU score evaluation against human captions:
      Method                                           BLEU score
      Global matching (1k)                             0.0774
      Global matching (10k)                            0.0909
      Global matching (100k)                           0.0917
      Global matching (1 million)                      0.1177
      Global + content matching (linear regression)    0.1215
      Global + content matching (linear SVM)           0.1259
  26. Human visual verification. Please choose the image that better corresponds to the given caption: "View overlooking Kuala Lumpur from my office building."
  27. Human visual verification. The two images shown are the caption's original Flickr image and a random image.
  28. Human visual verification results:
      Caption used                      Success rate
      Original human caption            96.0%
      Top caption                       66.7%
      Best of our top 4 captions        92.7%
  29. Human visual evaluation. Please choose the image that better corresponds to the given caption (caption produced by our system vs. a random image): "The view from the 13th floor of an apartment building in Nakano awesome." Success rates as in the table above.
  31. What to do next?
  32. Let’s not borrow captions from other images; let’s just borrow short phrases! Collective Generation of Natural Image Descriptions. Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi. Association for Computational Linguistics (ACL), 2012. Large Scale Retrieval for Image Description Generation. Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell, Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III, Alexander C. Berg, Yejin Choi, Tamara L. Berg. In submission to the IJCV special issue on Big Data.
  33. Retrieving noun phrases from similar object detections.
  34. Retrieving verb phrases from similar object detections. Detect "dog", then find matching dog detections by visual similarity. Example captions: "Contented dog just laying on the edge of the road in front of a house." "Peruvian dog sleeping on city street in the city of Cusco, Peru." "This dog was laying in the middle of the road on a back street in Jaco." "Closeup of my dog sleeping under my desk."
  35. Retrieving prepositional phrases from region + detection matches. Find matching region detections using appearance + spatial arrangement (object: car). Example captions: "Cordoba - lonely elephant under an orange tree..." "Comfy chair under a tree." "I positioned the chairs around the lemon tree - it's like a shrine." "Mini Nike soccer ball all alone in the grass."
  36. Retrieving prepositional phrases from scene matches. Extract a scene descriptor and find matching images by scene similarity. Example captions: "Pedestrian street in the Old Lyon with stairs to climb up the hill of Fourviere." "View from our B&B in this photo." "I'm about to blow the building across the street over with my massive lung power." "Only in Paris will you find a bottle of wine on a table outside a bookstore." (A retrieval sketch shared by slides 33-36 follows below.)
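All three phrase-retrieval steps (slides 33-36) follow the same pattern: find database items nearest to a query descriptor and collect the phrases parsed from their captions. A minimal sketch, with hypothetical descriptors and phrase pairs:

```python
# Nearest-neighbor phrase retrieval over precomputed descriptors -- a sketch.
# For NPs/VPs the descriptors would come from object detections; for scene
# PPs they would be whole-image scene-classifier outputs.
import numpy as np

def retrieve_phrases(query_desc, index, k=5):
    """index: list of (descriptor, phrase) pairs, e.g. a dog-detection
    descriptor paired with the VP 'laying on the edge of the road'."""
    descs = np.stack([d for d, _ in index])
    q = query_desc / (np.linalg.norm(query_desc) + 1e-8)
    sims = descs @ q / (np.linalg.norm(descs, axis=1) + 1e-8)
    top = np.argsort(-sims)[:k]                 # k most similar items
    return [index[i][1] for i in top]
```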
  37. Data processing. For 1 million images:
      – Run object detectors
      – Run region-based stuff detectors (e.g. grass, sky)
      – Run global scene classifiers
      – Parse the captions associated with the images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene); a parsing sketch follows below.
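For the caption-parsing step, a minimal sketch using spaCy's noun chunks and dependency labels as a rough stand-in for the Berkeley parser mentioned in the speaker notes (assumes the en_core_web_sm model is installed):

```python
# Extract candidate NPs, VPs, and PPs from a caption -- illustrative only;
# the paper used the Berkeley parser rather than spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_phrases(caption):
    doc = nlp(caption)
    nps = [chunk.text for chunk in doc.noun_chunks]
    vps = [" ".join(t.text for t in tok.subtree)
           for tok in doc if tok.pos_ == "VERB"]
    pps = [" ".join(t.text for t in tok.subtree)
           for tok in doc if tok.dep_ == "prep"]
    return {"NP": nps, "VP": vps, "PP": pps}

print(extract_phrases("Dog with a ball in its mouth running on the green grass."))
```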
  38. Recognition (a.k.a. vision) is hard: detecting one hundred objects yields a big mess of noisy detections.
  39. Sometimes you can make it (a little) better: detect only the "mentioned" objects. Example captions: "Look in the mountain for a lion face." "Ecuador, amazon basin, near coca, rain forest, passion fruit flower." "The background is a vintage paint by number painting I have and the fabulous forest dress is by candyjunky!" "Kevin’s mom, so punxrawk in Kev’s black flag hat."
  40. Everything together: scene, objects, actions, stuff. Example phrase pieces: "bird" / "looking for food" / "in water" / "in Lincoln City Oregon coast".
  41. Everything together. Retrieved phrases: "bird", "looking for food", "in Atlantic City", "in water", "on the beach", "in Lincoln City Oregon coast". Composed: "bird looking for food in water in Lincoln City Oregon coast".
  42. Binary integer linear programming. A binary variable x_{ijk} places phrase s_{ij} at position k. The objective combines each selected phrase's vision confidence with a pairwise phrase-cohesion term between adjacent positions k and k+1, where cohesion is estimated from n-gram co-occurrence and head-word co-occurrence statistics:
      maximize \sum_{i,j,k} \mathrm{conf}(s_{ij}) \, x_{ijk} + \sum \mathrm{coh}(s_{ij}, s_{pq}) \, x_{ijk} \, x_{pq(k+1)}
  43. Composing descriptions. Compose descriptions from phrases with an ILP approach (a sketch follows after this list).
      • Linguistic constraints
      – Allow only one phrase of each type
      – Enforce plural/singular agreement between NP and VP
      • Discourse constraints
      – Prevent inclusion of repeated phrasing
      • Phrase cohesion constraints
      – n-gram statistics between phrases
      – Co-occurrence statistics between the head words of phrases (last word or main verb) to encourage longer-range cohesion
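A minimal sketch of the composition ILP in Python with the PuLP library. The phrases, confidences, and stub cohesion function are placeholders, and only the "one phrase of each type" constraint plus the pairwise cohesion objective are implemented; the agreement and discourse constraints are omitted for brevity.

```python
# Compose a caption by choosing one phrase per slot, maximizing vision
# confidence plus adjacent-phrase cohesion -- a toy version of the ILP.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

slots = ["NP", "VP", "PP_stuff", "PP_scene"]    # one phrase of each type
phrases = {                                     # (phrase, vision confidence)
    "NP": [("a dog", 0.9), ("a cat", 0.4)],
    "VP": [("laying on the road", 0.7), ("sleeping", 0.6)],
    "PP_stuff": [("in the grass", 0.8)],
    "PP_scene": [("in Lincoln City", 0.5)],
}

def cohesion(a, b):
    return 0.1  # placeholder for n-gram / head-word co-occurrence statistics

prob = LpProblem("caption", LpMaximize)
x = {(s, i): LpVariable(f"x_{s}_{i}", cat=LpBinary)
     for s in slots for i in range(len(phrases[s]))}
obj = [conf * x[s, i] for s in slots for i, (_, conf) in enumerate(phrases[s])]
# Linearize the quadratic cohesion term with auxiliary AND-variables y.
for s1, s2 in zip(slots, slots[1:]):
    for i, (p1, _) in enumerate(phrases[s1]):
        for j, (p2, _) in enumerate(phrases[s2]):
            y = LpVariable(f"y_{s1}_{i}_{s2}_{j}", cat=LpBinary)
            prob += y <= x[s1, i]
            prob += y <= x[s2, j]
            prob += y >= x[s1, i] + x[s2, j] - 1
            obj.append(cohesion(p1, p2) * y)
prob += lpSum(obj)                      # objective: confidence + cohesion
for s in slots:                         # exactly one phrase of each type
    prob += lpSum(x[s, i] for i in range(len(phrases[s]))) == 1
prob.solve()
caption = " ".join(p for s in slots
                   for i, (p, _) in enumerate(phrases[s]) if x[s, i].value() == 1)
print(caption)   # e.g. "a dog laying on the road in the grass in Lincoln City"
```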
  44. Good results: "This is a sporty little red convertible made for a great day in Key West FL." "This car was in the 4th parade of the apartment buildings." "Taken in front of my cat sitting in a shoe box." "Cat likes hanging around in my recliner." "This is a brass viking boat moored on beach in Tobago by the ocean."
  45. Bad results. Grammatically incorrect / cognitive absurdity: "One of the most shirt in the wall of the house." "Here you can see a cross by the frog in the sky." Not relevant: "This is a shoulder bag with a blended rainbow effect."
  46. BLEU score evaluation:
      Method                                  BLEU score
      HMM (using cognitive phrases)           0.111
      HMM (without cognitive phrases)         0.114
      ILP (using cognitive phrases)           0.114
      ILP (without cognitive phrases)         0.116
  47. Human forced-choice evaluation (how often the ILP output was preferred):
      Comparison                                            ILP preferred
      ILP vs. HMM (no images, no cognitive phrases)         67.2%
      ILP vs. HMM (no images, with cognitive phrases)       66.3%
      ILP vs. HMM (with images, no cognitive phrases)       53.17%
      ILP vs. HMM (with images, with cognitive phrases)     54.5%
      ILP vs. NIPS 2011 (global matching, 1M)               71.8%
      ILP vs. HUMAN                                         16%
  48. Visual Turing test: us vs. the original human-written caption. In some cases (16%), ILP-generated captions were preferred over the human-written ones!
  49. What’s next?
  50. To be presented at ICCV 2013: Meaning from large-scale computer vision. Images with the word "house" vs. images recognized as more likely to produce the word "house".
  51. To be presented at ICCV 2013: Meaning from large-scale computer vision. Images with the word "girl" vs. images recognized as more likely to produce the word "girl".
  52. To be presented at ICCV 2013: Meaning from large-scale computer vision. Weights learned over the outputs of ~8k classifiers to recognize images with "desk" in the caption; top-weighted classifier outputs span mammals, birds, instruments, structures, plants, and other categories.
  53. To be presented at ICCV 2013: Meaning from large-scale computer vision. Weights learned over the outputs of ~8k classifiers to recognize images with "tree" in the caption; top-weighted classifier outputs span mammals, birds, instruments, structures, plants, and other categories.
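A minimal sketch of the idea on these last slides: learn linear weights over the outputs of many visual classifiers to predict whether a given word (e.g. "tree") appears in an image's caption. The data here is a random placeholder; the real system learns over ~8k classifier outputs.

```python
# Predict caption words from visual classifier outputs -- a sketch with
# synthetic data standing in for real per-image classifier scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_images, n_classifiers = 1000, 8000
X = rng.random((n_images, n_classifiers))       # per-image classifier scores
y = rng.integers(0, 2, n_images)                # 1 if caption contains "tree"

model = LogisticRegression(max_iter=1000).fit(X, y)
top = np.argsort(-model.coef_[0])[:10]          # top-weighted classifiers
print("most predictive classifier indices:", top)
```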
  55. Questions?

Editor's Notes

  • Add previous affiliations
  • Most computer vision methods identify individual pieces of information but do not produce the kind of output you would expect from a human. For this picture, a good computer vision system would identify sky, trees, water, building, perhaps even bridge; a person, on the other hand, would say something like "a stone bridge over a peaceful river". So our goal in this paper is to generate image descriptions, as opposed to the individual pieces of information that computer vision methods usually output.
  • We approach this task in a data-driven manner, first building a dataset of 1 million images with visually relevant captions. We construct this dataset by collecting an enormous number of captions assigned to images by web users and filtering them so that we end up with captions that are more likely to refer to visual content. We use standard global image feature descriptors such as GIST and TinyImages to retrieve similar images from which we can directly transfer captions.
  • Again we make use of the million-image SBU Captioned Photo dataset.
  • Additionally, we incorporate high-level information to rerank the images retrieved by the previous baseline method: we run object detectors, scene classification, stuff detection, and people and action detection, and compute text statistics. In this example we have bridge and water detections, which we match against similar detections in the retrieved set of images. We run object detectors on the retrieved images only if a relevant keyword is mentioned. Text statistics are also relevant: if many images in the retrieved set agree that there is a bridge, those images are rewarded in the final ranking. We can then transfer captions from this reranked set of images.
  • Finally, here are some good and bad results obtained using our full approach. The first picture says "Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind." The captions are very human-like because they were written by actual humans, and this works surprisingly well for some types of images. On the other hand, even with 1 million images we cannot generalize to all possible observable images, and our image matching methods can fail, leading to bad results. If you would like to see our quantitative results in more detail, please come to our poster. Thanks.
  • We can retrieve noun phrases referring to an object in a query image using visual similarity between the query detection and detections from the database.
  • Similarly, we can retrieve verb phrases based on matching similar poses, for example giving us "laying on the edge of the road in front of a house".
  • For relationships between objects and stuff detections we use a combination of appearance matching and similarity in spatial arrangement. So here, for these car, tree, and grass detections, we can retrieve phrases like "under a tree", "in the grass", and so on.
  • Finally, we can use our scene classifiers to find matching images by scene similarity. We use the output of all of our scene classifiers as a descriptor of the image's scene and then find similar scenes by comparing scene descriptors. This sometimes, but not always, produces quite pleasing results; here we generally get similar European street scenes matching our query image. These phrases provide a kind of general scene context for a description.
  • First we do some processing on the data: we run about 100 object detectors, region-based stuff detectors, and global scene classifiers, and finally we parse the captions with the Berkeley parser to get phrases referring to objects, spatial arrangements with background elements, and general scene descriptions.
  • One issue with running lots of detectors is that the results are really noisy. If you run 100 object and pose detectors on even these fairly simple images, you get a big mess of detections: here is a bicycle in the mountain, a chair down here... The correct detections may be in there somewhere, but you cannot really see them amongst all the noisy false detections. So we had to make these results better if we were going to use them.
  • So we played some simple tricks to make our recognition problem a little easier. If you have a prior on what you expect to be in the image, you can guide recognition in the right direction. In our case, our giant captioned dataset gives us really good evidence for what might be in an image: the text tells us the likely objects. So for an image with a caption, we run detectors only for the objects mentioned in the caption. That produces still-not-perfect but considerably better recognition results, which we can now use for captioning.
  • We compose descriptions from retrieved phrases using an ILP approach with constraints derived from the vision predictions, plus linguistic, discourse, and phrase cohesion constraints.
  • The captions we produce are often quite reasonable, and are sometimes even preferred over the original human-written ones!