4. WHAT IS IT?
“Interpreting images and generating sentences.”
“Scene interpretation means understanding everyday occurrences or recognizing rare events.”
“Scene interpretations are Controlled Hallucinations.”
5. WHY DO WE NEED SUCH SYSTEMS?
Isn't a picture enough to depict things clearly?
6. Ever imagined a cricket match without commentary? Or a movie without dialogues?
7. It has been estimated that more than
80% of the activities we do online are
text-based.
8. Medical Reports
A layperson cannot understand medical reports unless a doctor explains them or they are presented in written form.
16. Assists Visually Impaired People
Screen Reader
Screen readers are software programs that convert text into synthesized speech, enabling blind users to listen to web content.
LIMITATIONS:
• Screen readers cannot describe images.
• Screen readers cannot survey the entirety of a web page as a sighted user might; they cannot always intelligently skip over extraneous content such as advertisements or navigation bars.
18. Criminal Act Recognition
Images are captured and unusual activities are recorded. Image features are then extracted, which helps in crime investigation.
19. Human Computer Interaction
Efficient and consistent scene interpretation is a prerequisite for self-aware cognitive robots to work.
(Diagram: object recognition and scene interpretation; spatial relation extraction)
22. LITERATURE SURVEY (Sr. No. | Paper | Proposed | Dataset | Conclusion)

1. “Midge: Generating image descriptions from computer vision detection”
U. of Aberdeen, Oregon Health and Science University, Stony Brook University, U. of Maryland, Columbia University, U. of Washington, MIT.
Proposed: a novel generation system that composes human-like descriptions of images from computer vision detections.
Dataset: for training, 700,000 Flickr (2011) images with associated descriptions from the dataset in Ordonez et al. (2011); for evaluation, 840 PASCAL images.
Conclusion: Midge generates a well-formed description of an image by filtering unlikely attribute detections and placing objects into an ordered syntactic structure.

2. “Every picture tells a story: Generating sentences from images”
Farhadi, A., Hejrati, S. M. M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. A. (2010). Springer.
Proposed: attempts to generate sentences by first learning from a set of human-annotated examples, producing a sentence when both image and sentence share common properties in terms of their triplets: (Nouns-Verbs-Scenes).
Dataset: PASCAL 2008 images with human annotation.
Conclusion: Sentences are rich, compact, and subtle representations of information; even so, good sentences can be predicted for images. The intermediate meaning representation is a key component of the model, as it allows benefiting from distributional semantics.
23. LITERATURE SURVEY (Sr. No. | Paper | Proposed | Dataset | Conclusion)

3. “Babytalk: Understanding and generating simple image descriptions”
G. Kulkarni, V. Premraj, V. Ordonez. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, December 2013.
Proposed: uses detectors for object and scene detection and forms quadruplets: (Nouns-Verbs-Scenes-Prepositions).
Dataset: PASCAL 2008 images.
Conclusion: Human forced-choice experiments demonstrate the quality of the generated sentences over previous approaches. One key to the system's success was automatically mining and parsing large text collections to obtain statistical models for visually descriptive language.

4. “Choosing Linguistics over Vision to Describe Images”
Ankush Gupta, Yashaswi Verma, C. V. Jawahar. International Institute of Information Technology, Hyderabad, India – 500032.
Proposed: addresses the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions.
Dataset: PASCAL dataset.
Conclusion: They proposed a novel approach for generating relevant, fluent, and human-like descriptions for images without relying on any object detectors, classifiers, hand-written rules, or heuristics.
25. 1.) Choosing Linguistics over Vision to Describe Images
i. Given an unseen image,
ii. find the K images most similar to it among the training images, and, using the phrases extracted from their descriptions,
iii. generate a ranked list of triples, which is then used to compose a description for the new image.
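The retrieve-then-rank idea above can be sketched in pure Python. This is a minimal sketch, not the paper's implementation: cosine similarity over generic feature vectors stands in for whatever image features the real system uses, and all names here are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def k_nearest_images(query_vec, training_set, k):
    """Return the k training entries most similar to the query image.

    training_set: list of (image_id, feature_vector, triples) tuples,
    where triples are (noun, verb, scene) phrases extracted from the
    image's human-written description.
    """
    ranked = sorted(training_set,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return ranked[:k]

def rank_triples(neighbours):
    """Pool the neighbours' triples and rank them by frequency; the
    top-ranked triples would then be composed into a description."""
    counts = {}
    for _, _, triples in neighbours:
        for t in triples:
            counts[t] = counts.get(t, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)
```

For example, with toy vectors, a query close to two "dog running in a park" images would rank the triple (dog, run, park) first.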
26. Pipeline overview: i.) input image; ii.) neighbouring images with extracted phrases; iii.) triple selection and sentence generation.
27. FAILURE SCENARIO
A motor racer is speeding through a splash mud.
A water cow is grazing along a roadside.
An orange fixture is hanging in a messy kitchen.
30. OPENCV
• Open source computer vision and machine learning
software library.
• More than 2500 optimized algorithms.
• C++, C, Python, Java and MATLAB interfaces
• Supports Windows, Linux, Android and Mac OS
31. NLP (Natural Language Processing)
A field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
32. SVM (Support Vector Machines)
A discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
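A minimal sketch of the idea in plain Python: sub-gradient descent on the hinge loss for a linear SVM. The learning rate, regularization constant, and epoch count are illustrative assumptions, not values from any slide; a real system would use an optimized library implementation.

```python
def train_linear_svm(samples, labels, lr=0.01, lam=0.01, epochs=200):
    """Train a linear SVM by sub-gradient descent on the hinge loss.

    samples: list of feature vectors; labels: +1 or -1.
    Returns (weights, bias) defining the separating hyperplane
    w . x + b = 0.
    """
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # point violates the margin: hinge loss is active
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # correctly classified: only apply regularization
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Classify a new point by which side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

On a small linearly separable set, the learned hyperplane separates the two classes and classifies nearby unseen points correctly.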
34. Proposed pipeline:
• Take the query image as input.
• Detect objects in the query image.
• From corpus data, extract the shortest sentences.
• Parse them with an RDF (Resource Description Framework) parser into triples: <object1, predicate1, object2>.
• Use the Google Image API to retrieve the top 10 images.
• Match the query image against each retrieved image and compute a score.
• The highest-scoring images give the matching triplet.
36. DATASET
PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning): provides standardized image data sets for object class recognition.
Technology: JAVA/PYTHON
5. CONCLUSION
37. Thus we saw the fundamentals of scene description, its applications, previous work in this field, and our approach to designing this system.