Learning from
Descriptive Text

     Tamara L Berg
 Stony Brook University
Tags:
                                               Vision
canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye,
feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry,
favorite, nice.

                                              Language




                                              Humans

                                                          It's the perfect party dress. With
                                                         distinctly feminine details such as a wide
                                                         sash bow around an empire waist and a
                                                         deep scoopneck, this linen dress will
                                                         keep you comfortable and feeling
                                                         elegant all evening long.
Visually Descriptive Text
                      “It was an arresting face, pointed of chin, square of jaw. Her eyes
                      were pale green without a touch of hazel, starred with bristly black
                      lashes and slightly tilted at the ends. Above them, her thick black
                      brows slanted upward, cutting a startling oblique line in her
                      magnolia-white skin–that skin so prized by Southern women and so
                      carefully guarded with bonnets, veils and mittens against hot
                      Georgia suns” –       Gone with the Wind


Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.

How do people describe the world?
How does the world work?
What should we recognize?
What’s in a description?
What’s in this image?
man, baby, sling, shirt, glasses, ladder, fridge, table, watermelon, chair,
boxes, cups, water bottle, wall, pacifier, beard, …

What do people describe?
“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”
What’s in a description?
1) Given an image, predict what people will describe (e.g. Spain & Perona, 2010)
   “two women sitting brunette blonde on bench reading magazine”
   women ✔   bench ✔   magazine ✔   grass ✖   skirt ✖   …

2) Given a caption, predict what’s in the image
   “looking for castles in the clouds out my car window”
   clouds ✔   car ✖   window ✖   castle ?
Who’s in the picture?
                           T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth




President George W. Bush makes a
statement in the Rose Garden while
Secretary of Defense Donald Rumsfeld
looks on, July 23, 2003. Rumsfeld said
the United States would release graphic
photographs of the dead sons of
Saddam Hussein to prove they were
killed by American troops. Photo by
Larry Downing/Reuters

Model                          Accuracy of labeling
Vision model, No Lang model    67%
Vision model + Lang model      78%
Visually Descriptive Text
                     “It was an arresting face, pointed of chin, square of jaw. Her eyes
                     were pale green without a touch of hazel, starred with bristly black
                     lashes and slightly tilted at the ends. Above them, her thick black
                     brows slanted upward, cutting a startling oblique line in her
                     magnolia-white skin–that skin so prized by Southern women and so
                     carefully guarded with bonnets, veils and mittens against hot
                     Georgia suns” –       from Gone with the Wind by Margaret Mitchell


Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.

How do people describe the world?
How does the world work?
What should we recognize?
Vision is hard




                                          Green sheep




World knowledge (from descriptive text)
can be used to smooth noisy vision
predictions!
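
One way to picture this smoothing, as a minimal sketch and not the model used in the talk: multiply a detector's attribute scores by a prior estimated from how often each attribute word accompanies the object word in descriptive text, then renormalize. The function name, the add-alpha prior, and all scores and counts below are hypothetical.

```python
def smooth_with_text_prior(vision_scores, text_counts, alpha=1.0):
    """Rescore attribute hypotheses for a single object.

    vision_scores: dict attribute -> score from a (noisy) vision model
    text_counts:   dict attribute -> co-occurrence count of the attribute
                   word with the object word in text (hypothetical data)
    alpha:         add-alpha smoothing so unseen attributes keep some mass
    """
    total = sum(text_counts.get(a, 0) + alpha for a in vision_scores)
    combined = {}
    for attr, score in vision_scores.items():
        prior = (text_counts.get(attr, 0) + alpha) / total
        combined[attr] = score * prior
    norm = sum(combined.values())
    return {attr: value / norm for attr, value in combined.items()}


# Hypothetical detector output for a sheep: "green" wins on vision alone.
vision = {"green": 0.5, "white": 0.4, "gray": 0.1}
# Hypothetical counts of "<attribute> sheep" in a text collection.
counts = {"white": 80, "gray": 15, "green": 1}

smoothed = smooth_with_text_prior(vision, counts)
best = max(smoothed, key=smoothed.get)
print(best)
```

With these made-up numbers the text prior overrides the "green sheep" prediction in favor of "white"; a real system would need to balance the two sources so that genuinely unusual images are not always corrected away.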
Learning World Knowledge
               BabyTalk: Understanding and Generating Simple Image Descriptions
               Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011




Attributes
green green grass by the lake
a very shiny car in the car museum in my hometown of upstate NY.

Relationships
very little person in a big rocking chair
Our cat Tusik sleeping on the sofa near a hot radiator.
System Flow
1) Input Image
2) Extract Objects/stuff: a) dog, b) person, c) sofa
3) Predict attributes: each object is scored for brown, striped, furry, wooden, feathered
4) Predict prepositions: each ordered pair of objects is scored for near, against, beside
   (near(a,b), near(b,a), against(a,b), against(b,a), beside(a,b), beside(b,a), ...)
5) Predict labeling – vision potentials smoothed with text potentials:
   <<null,person_b>,against,<brown,sofa_c>>
   <<null,dog_a>,near,<null,person_b>>
   <<null,dog_a>,beside,<brown,sofa_c>>
6) Generate natural language description:
   This is a photograph of one person and one brown sofa and one dog. The
   person is against the brown sofa. And the dog is near the person, and
   beside the brown sofa.
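
The last step of the system flow, turning labeled triples into sentences, can be illustrated with a simple template filler. This is only a sketch under assumed conventions, not the BabyTalk generation procedure: the tuple encoding of the triples, the function names, and the sentence templates are all hypothetical.

```python
def noun_phrase(attr, obj, article="the"):
    """Render an <attribute, object> pair, e.g. ('brown', 'sofa') -> 'the brown sofa'."""
    words = [article] + ([attr] if attr else []) + [obj]
    return " ".join(words)


def describe(objects, triples):
    """Fill fixed templates from detected objects and relation triples.

    objects: list of (attribute or None, object name)
    triples: list of ((attr, obj), preposition, (attr, obj))
    """
    listing = " and ".join(noun_phrase(a, o, article="one") for a, o in objects)
    sentences = [f"This is a photograph of {listing}."]

    # Group relations by subject so each object gets a single sentence.
    by_subject = {}
    for subject, prep, target in triples:
        by_subject.setdefault(subject, []).append(f"{prep} {noun_phrase(*target)}")

    for subject, relations in by_subject.items():
        phrase = noun_phrase(*subject)
        sentences.append(f"{phrase[0].upper()}{phrase[1:]} is {', and '.join(relations)}.")
    return " ".join(sentences)


objects = [(None, "person"), ("brown", "sofa"), (None, "dog")]
triples = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]
text = describe(objects, triples)
print(text)
```

Fixed templates like these explain the repetitive, literal tone of the generated descriptions: every detected object and every predicted relation is verbalized, whether or not a person would mention it.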
BabyTalk results

Objects, Attributes, Prepositions

This is a picture of one sky, one road and one sheep. The gray sky is
over the gray road. The gray sheep is by the gray road.

Here we see one road, one sky and one bicycle. The road is near the blue
sky, and near the colorful bicycle. The colorful bicycle is within the
blue sky.

This is a picture of two dogs. The first dog is near the second furry
Visually Descriptive Text
                      “It was an arresting face, pointed of chin, square of jaw. Her eyes
                      were pale green without a touch of hazel, starred with bristly black
                      lashes and slightly tilted at the ends. Above them, her thick black
                      brows slanted upward, cutting a startling oblique line in her
                      magnolia-white skin–that skin so prized by Southern women and so
                      carefully guarded with bonnets, veils and mittens against hot
                      Georgia suns” –       from Gone with the Wind by Margaret Mitchell


Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.

How do people describe the world?
How does the world work?
What should we recognize?
What should we recognize?

• Recognition is beginning to work

• Open question – what should we recognize?

• Maybe objects aren’t (always) the right base
  level entities
Object Recognition




Parts, Poselets, Attributes
  For example:
  [Fergus, Perona, Zisserman 2003],
  [Bourdev, Malik 2009], …


                                        Slide Credit: Ali Farhadi
Automatically Discovering Attributes from Noisy Web Data
                  T.L. Berg, A.C. Berg, J. Shih ECCV 2010

                                    Fully beaded with megawatt
                                    crystals, this Christian Louboutin suede
                                    pump matches the gleam in your eye.

                                    Pump's linear heel plays up the alluring
                                    curves of its dipped sides.

                                    Round toe frames low-cut vamp.
                                    Tonally topstitched collar.

                                    4" straight, covered heel shows off
                                    signature red sole.

                                    Creamy leather lining with padded
                                    insole.
                                    "Fifi" is made in Italy.

Learn which attributes in descriptions are depictable
             terms
Given Web Images + Noisy Text Descriptions:
 1) Discover visual attribute terms in text descriptions - likely domain dependent
 2) Learn appearance models for attributes without labeled data
 3) Characterize attributes by: type, localizability
Object Recognition




                Scenes
                For example:
                [Oliva, Torralba 2001],
                [SUN 2010], …



                      Slide Credit: Ali Farhadi
What are the right quanta of
      Recognition?




              Farhadi & Sadeghi
              Recognition using Visual Phrases, CVPR 2011
Participating in Phrases Profoundly affects the appearance of objects




                       Farhadi & Sadeghi
                       Recognition using Visual Phrases, CVPR 2011
What should we recognize?




  “a sleeping dog in NTHU”     “the dog is sleeping”




     “A dog is sleeping in”    “sleeping dog in delhi”

Maybe descriptive text can inform entity hypotheses!
What should we recognize?




     “the cat is in the bag”    “cat in a bag”




            “cat in bag”       “cat in the bag”


Maybe descriptive text can inform entity hypotheses!
Conclusion

   Use large pools of descriptive text to:

       Learn how people describe the visual world

       Learn how the world works

       Guide future efforts in recognition


   Apply this knowledge to multi-modal
    collections & applications
Acknowledgements

• Collaborators: Alex Berg, David Forsyth, Jaety
  Edwards, Jonathan Shih, Girish Kulkarni, Visruth
  Premraj, Sagnik Dhar, Vicente Ordonez, Siming
  Li, Yejin Choi, Kota Yamaguchi

• Funded by NSF Faculty Early Career
  Development (CAREER) Program: Award
  #1054133

More Related Content

More from zukun

ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Informationzukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statisticszukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionzukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluationzukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-softwarezukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptorszukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectorszukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-introzukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video searchzukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video searchzukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learningzukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionzukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick startzukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structureszukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities zukun
 
Icml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featuresIcml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featureszukun
 
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...zukun
 

More from zukun (20)

ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 
Icml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featuresIcml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant features
 
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
 

Recently uploaded

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Fcv hum mach_t_berg

  • 1. Learning from Descriptive Text Tamara L Berg Stony Brook University
  • 2. Vision / Language / Humans. Tags: canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice. It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long.
  • 3. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Gone with the Wind How do people describe the world? Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 4. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell How do people describe the world? Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 5. What’s in a description? What’s in this image? man, baby, sling, shirt, glasses, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, … What do people describe? “A bearded man is holding a child in a sling.” “A bearded man stands while holding a small child in a green sheet.” “A bearded man with a baby in a sling poses.” “Man standing in kitchen with little girl in green sack.” “Man with beard and baby”
  • 6. What’s in a description? 1) Given an image, predict what people will describe (e.g. Spain & Perona, 2010): “two women sitting on bench reading magazine”; women ✔, bench ✔, magazine ✔, brunette, blonde, grass ✖, skirt ✖, … 2) Given a caption, predict what’s in the image: “looking for castles in the clouds out my car window”; clouds ✔, car ✖, window ✖, castle ?
  • 7. Who’s in the picture? T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth. Example caption: “President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters” Accuracy of labeling by model: Vision model, No Lang model: 67%; Vision model + Lang model: 78%
  • 8. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell How do people describe the world? Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 9. Vision is hard Green sheep World knowledge (from descriptive text) can be used to smooth noisy vision predictions!
  • 10. Learning World Knowledge BabyTalk: Understanding and Generating Simple Image Descriptions, Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011. Attributes: “green green grass by the lake”; “a very shiny car in the car museum in my hometown of upstate NY.” Relationships: “very little person in a big rocking chair”; “Our cat Tusik sleeping on the sofa near a hot radiator.”
  • 11. System Flow: Input Image → Extract Objects/stuff → Predict attributes → Predict prepositions → Predict labeling (vision potentials smoothed with text potentials) → Generate natural language description. Example: detected objects a) dog, b) person, c) sofa, each with scored candidate attributes (brown, striped, furry, wooden, feathered) and scored pairwise prepositions (near, against, beside). Predicted labeling: <<null,person_b>,against,<brown,sofa_c>>, <<null,dog_a>,near,<null,person_b>>, <<null,dog_a>,beside,<brown,sofa_c>>. Generated description: “This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.”
  • 12. BabyTalk results: Objects, Attributes, Prepositions. “This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.” “Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.” “This is a picture of two dogs. The first dog is near the second furry
  • 13. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell How do people describe the world? Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 14. What should we recognize? • Recognition is beginning to work • Open question – what should we recognize? • Maybe objects aren’t (always) the right base level entities
  • 15. Object Recognition: Parts, Poselets, Attributes. For example: [Fergus, Perona, Zisserman 2003], [Bourdev, Malik 2009], … Slide Credit: Ali Farhadi
  • 16. Automatically Discovering Attributes from Noisy Web Data T.L. Berg, A.C. Berg, J. Shih ECCV 2010 Fully beaded with megawatt crystals, this Christian Louboutin suede pump matches the gleam in your eye. Pump's linear heel plays up the alluring curves of its dipped sides. Round toe frames low-cut vamp. Tonally topstitched collar. 4" straight, covered heel shows off signature red sole. Creamy leather lining with padded insole. "Fifi" is made in Italy. Learn which attributes in descriptions are depictable terms
  • 17. Given Web Images + Noisy Text Descriptions: 1) Discover visual attribute terms in text descriptions - likely domain dependent 2) Learn appearance models for attributes without labeled data 3) Characterize attributes by: type, localizability
  • 18. Object Recognition Scenes For example: [Oliva, Torralba 2001], [SUN 2010], … Slide Credit: Ali Farhadi
  • 19. What are the right quanta of Recognition? Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011
  • 20. Participating in Phrases Profoundly affects the appearance of objects. Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011
  • 21. What should we recognize? “a sleeping dog in NTHU” “the dog is sleeping” “A dog is sleeping in” “sleeping dog in delhi” Maybe descriptive text can inform entity hypotheses!
  • 22. What should we recognize? “the cat is in the bag” “cat in a bag” “cat in bag” “cat in the bag” Maybe descriptive text can inform entity hypotheses!
  • 23. Conclusion • Use large pools of descriptive text to: • Learn how people describe the visual world • Learn how the world works • Guide future efforts in recognition • Apply this knowledge to multi-modal collections & applications
  • 24. Acknowledgements • Collaborators: Alex Berg, David Forsyth, Jaety Edwards, Jonathan Shih, Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Vicente Ordonez, Siming Li, Yejin Choi, Kota Yamaguchi • Funded by NSF Faculty Early Career Development (CAREER) Program: Award #1054133
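
Two ideas from slides 9 and 11 can be illustrated with a small Python sketch: smoothing noisy vision scores with knowledge from text, and turning predicted <<attribute,object>,preposition,<attribute,object>> triples into sentences. This is not the BabyTalk implementation. The linear blend below stands in for the system's joint labeling step, and the function names, the weight, and all scores are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

Noun = Tuple[Optional[str], str]   # (attribute or None, object)
Triple = Tuple[Noun, str, Noun]    # (subject, preposition, object)


def smooth(vision: Dict[str, float],
           text_prior: Dict[str, float],
           weight: float = 0.5) -> Dict[str, float]:
    """Blend per-label vision scores with a text-derived prior, then renormalise.

    Assumption: a simple linear mix, used here only as a stand-in for the
    joint labeling described on the slide.
    """
    mixed = {label: (1 - weight) * score + weight * text_prior.get(label, 0.0)
             for label, score in vision.items()}
    total = sum(mixed.values())
    return {label: score / total for label, score in mixed.items()}


def phrase(noun: Noun) -> str:
    """Render a noun as 'the <attribute> <object>' or 'the <object>'."""
    attribute, obj = noun
    return f"the {attribute} {obj}" if attribute else f"the {obj}"


def describe(triples: List[Triple]) -> str:
    """Emit one template sentence per triple."""
    sentences = []
    for subject, preposition, target in triples:
        subject_text = phrase(subject)
        subject_text = subject_text[0].upper() + subject_text[1:]
        sentences.append(f"{subject_text} is {preposition} {phrase(target)}.")
    return " ".join(sentences)


# Hypothetical scores for the colour of a detected sheep (the "green sheep" case).
vision_scores = {"green": 0.5, "white": 0.4, "brown": 0.1}
text_scores = {"green": 0.02, "white": 0.8, "brown": 0.18}
smoothed = smooth(vision_scores, text_scores)
best_colour = max(smoothed, key=smoothed.get)

# The three triples shown on slide 11.
triples = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]
description = describe(triples)
```

With these made-up numbers the text prior overrides the vision model's preference for "green", and the template yields one plain sentence per triple. The slide's own output merges sentences that share a subject, which this sketch does not attempt.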