The document discusses generating image descriptions using data-driven techniques. It describes collecting hundreds of millions of images and captions from Flickr and other sources on the web. Key objects, scenes and stuff are detected in images using computer vision techniques. Captions are parsed to retrieve noun phrases referring to detected objects and prepositional phrases containing spatial relationships. An integer linear program is used to compose new image descriptions by selecting relevant phrases while enforcing linguistic and discourse constraints. Evaluation shows the generated descriptions often match human captions and are preferred to descriptions from other baseline methods. While progress is made, challenges remain in object detection accuracy and avoiding nonsensical or irrelevant descriptions.
Data-driven Generation of Image Descriptions
1. Data-driven Generation of Image Descriptions
Vicente Ordonez-Roman
Advisor: Tamara Berg
Previously: The State University of New York
2. What most Computer Vision systems aim to say about a picture
Computer Vision: sky, trees, water, building, bridge, river, tree
3. What we are able to say about a picture (Our Goal)
An old bridge over dirty green water.
One of the many stone bridges in town that carry the gravel carriage roads.
A stone bridge over a peaceful river.
4. Let’s just borrow captions from similar images!
Im2Text: Describing Images Using 1 Million Captioned Photographs.
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
5. Harness the Web!
Images + captions from the Web:
“Smallest house in paris between red (on right) and beige (on left).”
“Bridge to temple in Hoan Kiem lake.”
“A walk around the lake near our house with Abby.”
“Hangzhou bridge in West lake.”
“The daintree river by boat.”
...
Matching using global image features (GIST + Color), then transfer caption(s), e.g. “The water is clear enough to see fish swimming around in it.”
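To make the retrieval step concrete, here is a minimal sketch of global-feature caption transfer. It is not the authors' exact pipeline: a downsampled "tiny image" plus a global color histogram stands in for GIST + Color, and `db_paths`/`db_captions` are hypothetical stand-ins for the captioned photo collection.

```python
# Minimal sketch of global-feature caption transfer, NOT the authors' exact
# pipeline: a downsampled "tiny image" plus a global color histogram stands
# in for GIST + Color; db_paths/db_captions are hypothetical stand-ins.
import numpy as np
from PIL import Image
from sklearn.neighbors import NearestNeighbors

def global_descriptor(path, size=32, bins=8):
    img = Image.open(path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    tiny = arr.flatten()                              # crude GIST stand-in
    hist, _ = np.histogramdd(arr.reshape(-1, 3),      # global color histogram
                             bins=(bins,) * 3, range=((0, 1),) * 3)
    return np.concatenate([tiny, hist.flatten() / hist.sum()])

def transfer_captions(query_path, db_paths, db_captions, k=4):
    feats = np.stack([global_descriptor(p) for p in db_paths])
    nn = NearestNeighbors(n_neighbors=k).fit(feats)
    _, idx = nn.kneighbors(global_descriptor(query_path)[None, :])
    return [db_captions[i] for i in idx[0]]           # borrowed captions
```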
6. Use the web to collect images + captions
Facebook: 90,000,000,000 pictures! (**) A lot of them with captions (a lot of them not publicly available)
Flickr: 6,000,000,000 photographs! (*) A lot of them with captions (lots of them publicly available)
(*) http://blog.flickr.net/en/2011/08/04/6000000000/
(**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day
7. Flickr images + captions
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
13. Solution:
Collect hundreds of millions of captions
Filter them out
We found “good captions” have visual concepts and relation words: “by”, “in”, “over”, “beside”, “on top of”
~1 “good caption” for every 1000 “bad captions” (filtering sketched below)
Im2Text: Describing Images Using 1 Million Captioned Photographs.
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
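A minimal sketch of the filtering idea on this slide, under loud assumptions: `VISUAL_WORDS` is a tiny hypothetical stand-in for the real vocabulary of detectable objects, scenes, and stuff, and the actual filter used to build the dataset is more involved.

```python
# Sketch of the caption filter; VISUAL_WORDS is a hypothetical stand-in
# for the real visual-concept vocabulary.
RELATION_WORDS = {"by", "in", "over", "beside", "on top of", "under", "near"}
VISUAL_WORDS = {"dog", "cat", "bridge", "river", "sky", "tree", "house", "boat"}

def is_good_caption(caption, min_words=3, max_words=25):
    text = caption.lower()
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    has_visual = any(w in VISUAL_WORDS for w in words)
    has_relation = any(r in text for r in RELATION_WORDS)
    return has_visual and has_relation   # keeps roughly 1 in 1000 raw captions

assert is_good_caption("a stone bridge over a peaceful river")
assert not is_good_caption("me and the gang, good times!")
```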
14. SBU Captioned Photo Dataset
The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon
Man sits in a rusted car buried in the sand on Waitarere beach
Little girl and her dog in northern Thailand. They both seemed interested in what we were doing
Our dog Zoe in her bed
Interior design of modern white and brown living room furniture against white wall with a lamp hanging.
Emma in her hat looking super cute
15. Results
(1) while walking by the water
(2) plane flying over the sun
(3) shot this in a moving car at the nkve highway
(4) sunset over creve coeur lake and the page bridge
(5) sunset on 12th sep 2009 as seen from the field polder near my house
(6) window over yellow door
(7) sunset over capitol hill as seen from the roof of my building
(8) an orange sky over the irish sea
(9) beautiful golden sunset reflected in the waves of the ocean
(10) red sky probably caused by volcanic ash from iceland
(11) a view of sunset over river brahmaputa from koliyabhumura bridge
(12) red sky in the morning
16. Results
(1) burnt wooden door in derelict building portugal
(2) peterborough cathedral norman door in south wall
(3) amazing wooden door with wider light above
(4) door in wall
(5) girl looking in a classroom window
(6) a interesting cross in a window of an ancient city
(7) this mirror decorated with fruit painting was left behind by the previous owners
(8) unusual exterior wall postbox at st albans post office in st peters street al1
(9) door in oxford uk in black and white
(10) 19 plate behind glass in brass mat and preserver
(11) this is some of the window decoration external on the house just over the porch 0364
(12) cat in a window
17. Results
(1) img8783 ginger in the red chair
(2) red sky in the morning
(3) the cat is in the bag and the bag is in the river
(4) the light in the kitchen made everythin glow my little girl is growing up
(5) my cat in a box that is far too small for her
(6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata
(7) baby in her later years turned from green to red but she never went fully red all over
(8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed
(9) glazed ceramic poop form in orange wooden box
(10) rock garden in library
(11) it s funny to capture the preciousest cat in the house at his most devillicious
(12) the pink will get replaced by orange and blue in the fall
18. Results
(1) starfish from the book toys to knit dashing dachs superwash sock yarn in goldfish backing is orange fabric stuffing is pillow stuffing
(2) mural of birds and trees in the crypt of wat ratburana ayutthaya
(3) carvings in the rock wall
(4) acrylic on paper scarlet macaws communicate in the color red with yellow and blue as visual grammar
(5) epsom and table salt crystals growing in concentrated green tea solution
(6) the hops dried to a golden green in a matter of a few days almost too pretty to bag up
(7) after staring at the gorgeous colors of the leaves claes discovered that there were about 100 birds sleeping in the
(8) you know you re in wisconsin when the beach has pine needles in the sand
(9) i was walking down the sidewalk and i saw this glove craft dropped in the dirt it seemed really unusual
(10) made by fusing plastic bags
(11) bark pattern from a ponderosa pine tree in grand canyon national park
(12) the peasant that found a statue of the black virgin on a rock in a river
20. Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)
The bridge over the lake on Suzhou Street.
Iron bridge over the Duck river.
The Daintree river by boat.
Bridge over Cacapon river.
...
Transfer caption(s), e.g. “The bridge over the lake on Suzhou Street.”
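A hedged sketch of content-based reranking: combine global similarity with agreement between detected content labels. The paper learns this combination (linear regression or a linear SVM, as in the BLEU table later); the fixed weights and the simple overlap score below are placeholder assumptions.

```python
# Placeholder reranker: combine global similarity with label agreement.
def content_score(query_labels, candidate_labels):
    q, c = set(query_labels), set(candidate_labels)
    return len(q & c) / (len(q | c) or 1)     # Jaccard agreement of labels

def rerank(candidates, query_labels, w_global=0.5, w_content=0.5):
    # candidates: list of (caption, global_similarity, labels)
    scored = sorted(
        ((w_global * sim + w_content * content_score(query_labels, labels), cap)
         for cap, sim, labels in candidates),
        reverse=True)
    return [cap for _, cap in scored]

print(rerank([("Iron bridge over the Duck river.", 0.8, ["bridge", "water"]),
              ("The daintree river by boat.", 0.9, ["boat"])],
             query_labels=["bridge", "water"]))
```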
21. Some success…
Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.
A female mallard duck in the lake at Luukki Espoo
Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there.
The sun was coming through the trees while I was sitting in my chair by the river
Fresh fruit and vegetables at the market in Port Louis Mauritius.
Tree with red leaves in the field in autumn.
Under the sky of burning clouds.
Stained glass window in Eusebius church.
22. Still far from perfect
Incorrect objects
Kentucky cows in a field.
The cat in the window.
23. Still far from perfect
Incorrect context
The sky is blue over the Gherkin.
Tree beside the river.
Completely wrong
The boat ended up a kilometre from
the water in the middle of the airstrip.
Water over the road.
24. How to Evaluate?
• “Ground truth”: The car is parked next to the train station besides a building.
• Candidates:
“There is car parked in front of an office building”
“This is the building that hosted the ceremony”
“A vehicle stopped next to my house”
Similar to evaluation on Machine Translation
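For concreteness, here is how a BLEU score between a candidate and a reference caption can be computed with NLTK; the exact BLEU variant and smoothing used in the paper are not specified here, so the unigram weighting and method1 smoothing below are illustrative assumptions.

```python
# Illustrative MT-style BLEU evaluation of a candidate caption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the car is parked next to the train station besides a building".split()
candidate = "there is car parked in front of an office building".split()

score = sentence_bleu([reference], candidate,
                      weights=(1.0,),  # unigram BLEU (assumption)
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-1: {score:.4f}")
```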
25. BLEU score evaluation against Human Captions

Method                                          BLEU score
Global matching (1k)                            0.0774
Global matching (10k)                           0.0909
Global matching (100k)                          0.0917
Global matching (1 million)                     0.1177
Global + Content matching (linear regression)   0.1215
Global + Content matching (linear SVM)          0.1259
26. Human Visual Verification
Please choose the image that better corresponds to the given caption:
“View overlooking Kuala Lumpur from my office building”
27. Human Visual Verification
Please choose the image that better corresponds to the given caption (caption from Flickr vs. random image):
“View overlooking Kuala Lumpur from my office building”
28. Human Visual Verification
Please choose the image that better corresponds to the given caption (caption from Flickr vs. random image):
“View overlooking Kuala Lumpur from my office building”

Caption used                   Success rate
Original human caption         96.0%
Top caption                    66.7%
Best from our top 4 captions   92.7%
29. Human Visual Evaluation
Please choose the image that better corresponds to the given caption (caption produced by our system vs. random image):
“The view from the 13th floor of an apartment building in Nakano awesome.”

Caption used                   Success rate
Original human caption         96.0%
Top caption                    66.7%
Best from our top 4 captions   92.7%
32. Let’s not borrow captions from other images, let’s just borrow short phrases!
Collective Generation of Natural Image Descriptions.
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi.
Association for Computational Linguistics. ACL 2012.
Large Scale Retrieval for Image Description Generation
Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell,
Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III,
Alexander C. Berg, Yejin Choi, Tamara L. Berg
Under submission to the IJCV special issue on Big Data.
34. Retrieving verb phrases from similar object detections
Detect: dog. Find matching dog detections by visual similarity.
Contented dog just laying on the edge of the road in front of a house..
Peruvian dog sleeping on city street in the city of Cusco, (Peru)
this dog was laying in the middle of the road on a back street in jaco
Closeup of my dog sleeping under my desk.
35. Retrieving prepositional phrases from region + detection matches
Object: car. Find matching region detections using appearance + arrangement.
Cordoba - lonely elephant under an orange tree...
Comfy chair under a tree.
I positioned the chairs around the lemon tree - it's like a shrine
Mini Nike soccer ball all alone in the grass
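A rough sketch of scoring such region + detection matches by appearance plus spatial arrangement. The descriptors, box encoding, normalization, and weights here are placeholder assumptions, not the paper's actual features.

```python
# Placeholder scoring of region + detection matches.
import numpy as np

def arrangement(obj_box, region_box):
    # relative offset and scale of a stuff region w.r.t. the object box,
    # with boxes given as (x, y, w, h) in normalized image coordinates
    (ox, oy, ow, oh), (rx, ry, rw, rh) = obj_box, region_box
    return np.array([rx - ox, ry - oy, rw / ow, rh / oh])

def match_score(query, candidate, w_app=0.5, w_arr=0.5):
    # query/candidate: {"feat": appearance vector, "boxes": (obj_box, region_box)}
    qf, cf = query["feat"], candidate["feat"]
    appearance = qf @ cf / (np.linalg.norm(qf) * np.linalg.norm(cf))
    spatial = -np.linalg.norm(arrangement(*query["boxes"]) -
                              arrangement(*candidate["boxes"]))
    return w_app * appearance + w_arr * spatial
```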
36. Retrieving prepositional phrases from scene matches
Extract scene descriptor; find matching images by scene similarity.
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
View from our B&B in this photo
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore
37. Data Processing
1 million images:
– Run object detectors
– Run region-based stuff detectors (e.g. grass, sky, etc.)
– Run global scene classifiers
– Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene); a parsing sketch follows below.
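A small sketch of the caption-parsing step. The paper uses the Berkeley parser; here NLTK's POS tagger and a cascaded chunk grammar stand in to pull out NPs, VPs, and PPs (the `punkt` and tagger models must be downloaded first).

```python
# Hypothetical stand-in for the Berkeley-parser step: chunk captions into
# NPs, VPs and PPs. Requires nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger").
import nltk

GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # noun phrases referring to objects
  PP: {<IN><NP>}            # prepositional phrases: spatial relations, scene context
  VP: {<VB.*><PP|NP>*}      # simple verb phrases
"""
CHUNKER = nltk.RegexpParser(GRAMMAR)

def extract_phrases(caption):
    tree = CHUNKER.parse(nltk.pos_tag(nltk.word_tokenize(caption)))
    return {label: [" ".join(w for w, _ in st.leaves())
                    for st in tree.subtrees(lambda t: t.label() == label)]
            for label in ("NP", "VP", "PP")}

print(extract_phrases("dog with a ball running around on the green grass"))
```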
39. Sometimes you can make it (a little) better
Detecting “mentioned” objects:
Look in the mountain for a lion face
Ecuador, amazon basin, near coca, rain forest, passion fruit flower
The background is a vintage paint by number painting I have and the fabulous forest dress is by candyjunky!
Kevin’s mom, so punxrawk in Kev’s black flag hat
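A minimal sketch of that trick: run a detector only when its object is mentioned in the caption. `detectors` is a hypothetical mapping from object name to a detector callable; real matching would also need to handle plurals and synonyms.

```python
# Caption-guided detection: only run detectors for mentioned objects.
def detect_mentioned(image, caption, detectors):
    mentioned = set(caption.lower().split())
    return {name: detector(image)          # run only the relevant detectors
            for name, detector in detectors.items()
            if name in mentioned}          # text prior from the caption
```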
42. Binary Integer Linear Programming
Objective (reconstructed from the slide diagram; notation introduced here): maximize
$\sum_{i,j,k} F(s_{ij})\,x_{ijk} + \sum_{i,j,p,q,k} \Phi(s_{ij}, s_{pq})\,y_{ijpqk}$
where $x_{ijk} \in \{0,1\}$ selects phrase $s_{ij}$ for position $k$, $F(s_{ij})$ is the phrase vision confidence, and $\Phi(s_{ij}, s_{pq})$ is the pairwise phrase cohesion between the phrase at position $k$ and the phrase at position $k+1$, built from n-gram co-occurrence and co-occurrence statistics between head words.
43. Composing Descriptions
Compose descriptions from phrases with an ILP approach (a minimal sketch follows below)
• Linguistic constraints
– Allow only one phrase of each type
– Enforce plural/singular agreement between NP and VP
• Discourse constraints
– Prevent inclusion of repeated phrasing
• Phrase cohesion constraints
– n-gram statistics between phrases
– Co-occurrence statistics between head words of phrases (last word or main verb) to encourage longer range cohesion
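A minimal sketch of phrase selection as a binary ILP using PuLP. It keeps only the vision-confidence objective and the one-phrase-per-type constraint; the pairwise cohesion terms, agreement, and discourse constraints from the slide are omitted, and the candidate phrases and their scores are hypothetical.

```python
# Toy phrase-selection ILP; candidates and scores are hypothetical.
import pulp

phrases = {
    "NP": [("a stone bridge", 0.9), ("the cat", 0.2)],
    "VP": [("spanning", 0.6), ("sleeping", 0.1)],
    "PP": [("over a peaceful river", 0.8), ("in a sink", 0.1)],
}

prob = pulp.LpProblem("compose_description", pulp.LpMaximize)
x = {(t, i): pulp.LpVariable(f"x_{t}_{i}", cat="Binary")
     for t, cands in phrases.items() for i in range(len(cands))}

# objective: total vision confidence of the selected phrases
prob += pulp.lpSum(phrases[t][i][1] * x[t, i] for (t, i) in x)

# linguistic constraint: allow at most one phrase of each type
for t, cands in phrases.items():
    prob += pulp.lpSum(x[t, i] for i in range(len(cands))) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(" ".join(phrases[t][i][0] for (t, i) in x if x[t, i].value() == 1))
```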
44. Good Results
This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings.
Taken in front of my cat sitting in a shoe box. Cat likes hanging around in my recliner.
This is a brass viking boat moored on beach in Tobago by the ocean.
45. Bad Results
Grammatically incorrect / cognitive absurdity:
One of the most shirt in the wall of the house.
Here you can see a cross by the frog in the sky.
Not relevant:
This is a shoulder bag with a blended rainbow effect
47. Human Forced Choice Evaluation

Caption used                                        ILP Selection
ILP vs. HMM (no images, no cognitive phrases)       67.2%
ILP vs. HMM (no images, with cognitive phrases)     66.3%
ILP vs. HMM (with images, no cognitive phrases)     53.17%
ILP vs. HMM (with images, with cognitive phrases)   54.5%
ILP vs. NIPS 2011 (Global matching 1M)              71.8%
ILP vs. HUMAN                                       16%
48. Visual Turing Test: Us vs. Original Human Written Caption
In some cases (16%), ILP-generated captions were preferred over human-written ones!
50. To be presented at ICCV 2013: Meaning from large-scale computer vision
Images with the word “house” vs. images recognized as more likely to produce the word “house”
51. To be presented at ICCV 2013: Meaning from large-scale computer vision
Images with the word “girl” vs. images recognized as more likely to produce the word “girl”
52. To be presented at ICCV 2013: Meaning from large-scale computer vision
Weights learned to recognize images with “desk” in caption, learned over the outputs of ~8k classifiers. [Figure: top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other.]
53. To be presented at ICCV 2013: Meaning from large-scale computer vision
Weights learned to recognize images with “tree” in caption, learned over the outputs of ~8k classifiers. [Figure: top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other.]
Most computer vision methods deal with the problem of identifying individual pieces of information, but do not produce the kind of output you would expect from a human. From this picture a good computer vision system would identify sky, trees, water, building, perhaps even bridge; a person, on the other hand, would say something like “a stone bridge over a peaceful river”. So our goal in this paper is to generate image descriptions, as opposed to the individual pieces of information that computer vision methods would usually output.
We approach this task in a data-driven manner by first building a dataset of 1 million images with visually relevant captions. We construct this dataset by collecting an enormous number of captions assigned to images by web users and filtering them so that we end up with captions that are more likely to refer to visual content. We use standard global image feature descriptors such as GIST and Tiny Images to retrieve similar images from which we can directly transfer captions.
Again we make use of the million-image SBU Captioned Photo Dataset.
Additionally, we incorporate high-level information to rerank the retrieved images used by the previous baseline method, by running object detectors, scene classification, stuff detection, people and action detection, and computing text statistics. So in this example we have bridge and water detections; we use those to match against similar detections in the retrieved set of images. As you can see, we run object detectors on our retrieved images only if a relevant keyword is mentioned. Text statistics are also relevant: if many images in the retrieved set agree that there is a bridge, then those images are rewarded in the final ranking as well. And then, again, we can transfer captions from this reranked set of images.
Finally, here are some good and bad results obtained using our full approach. The first picture says “Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind”. The captions are very human-like because they were written by actual humans, and this works surprisingly well for some types of images. On the other hand, even with 1 million images we can’t generalize to all possible observable images, and our image matching methods can fail, leading to bad results. If you would like to check our quantitative results in more detail, please come to our poster. Thanks.
We can retrieve noun phrases referring to an object in a query image using visual similarity between the query detection and detections from the database.
Similarly, we can retrieve verb phrases based on similar matching poses, for example giving us “laying on the edge of the road in front of a house”.
For relationships between objects and stuff detections we use a combination of matching appearance and similarity in spatial arrangement. So here, for these car, tree, and grass detections, we can retrieve phrases like “under a tree”, “in the grass”, and so on.
Finally, we can use our scene detectors to find matching images by scene similarity. For this we use the output of all of our scene classifiers as a descriptor of the image scene, and then find similar scenes according to similarity between scene descriptors. This sometimes, but not always, produces quite pleasing results; here we generally get similar European street scenes matching our query image. These phrases provide a sort of general scene context for a description.
First we do some processing on the data, including running about 100 object detectors, region-based stuff detectors, and global scene classifiers; finally we parse the captions using the Berkeley parser to get phrases referring to objects, spatial arrangements with background elements, and general scene descriptions.
But one issue with running lots of detectors is that it produces really noisy results. If, for example, you try to run 100 object and pose detectors on even these fairly simple images, you get a big mess of detections: here’s a bicycle in the mountain, a chair down here… The correct detections may be in there somewhere, but you can’t really see them amongst all the noisy false detections. So obviously we had to make these results better if we were going to be able to use them.
So we decided to play some simple tricks to make our recognition problem a little easier. For example, if you have some prior on what you expect to be in the image, then you can guide recognition in the right direction. In our case, with our giant captioned dataset, we have really good evidence for what might be in an image: we have some text telling us the likely objects. So for an image with a caption, we can just run the detectors for the objects mentioned in the caption. Woohoo! That produces still imperfect, but considerably better, recognition results. Now we can use these for captioning.
We compose descriptions from retrieved phrases using an ILP approach with a number of constraints: vision confidences, linguistic constraints, discourse constraints, and phrase cohesion constraints.
The captions we produce are often quite reasonable, sometimes even preferred over the original human written ones!