The document discusses the "bag of words" representation for images. Local image features are extracted from images and clustered to form a visual vocabulary of "words", and each image is then represented as a histogram of these visual words. This allows images to be searched and classified by their visual content in much the same way that text documents are searched using bag-of-words representations. The bag of words model provides a compact summary of image content but ignores spatial relationships between features.
3. Analogy to documents
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
(Highlighted "words": sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel)
China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with a 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value.
(Highlighted "words": China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value)
ICCV 2005 short course, L. Fei-Fei
4. Visual words: main idea
Extract some local features from a number of images …
e.g., SIFT descriptor space: each point is 128-dimensional
Slide credit: D. Nister
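As a concrete illustration, here is a minimal sketch of this feature-extraction step, assuming OpenCV with SIFT support and a hypothetical input image frame.jpg:

```python
# Minimal sketch: extract local features so that each keypoint is described
# by a 128-dimensional SIFT vector ("frame.jpg" is a hypothetical file).
import cv2

img = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors has shape (num_keypoints, 128): one point per local feature
# in the 128-dimensional descriptor space mentioned above.
print(descriptors.shape)
```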
10. Visual words
Example: each group of patches belongs to the same visual word
Figure from Sivic & Zisserman, ICCV 2003
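A minimal sketch of how such groups can be formed, assuming scikit-learn and using random descriptors as a stand-in for SIFT features pooled from many training images; each k-means centre plays the role of one visual word:

```python
# Minimal sketch: cluster pooled 128-dimensional descriptors so that each
# cluster centre becomes one visual word of the vocabulary.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for descriptors pooled from many training images
# (in practice, stack the (n_i, 128) SIFT arrays extracted earlier).
all_descriptors = np.random.rand(50000, 128).astype(np.float32)

num_words = 1000  # vocabulary size is a design choice; retrieval often uses far more
kmeans = MiniBatchKMeans(n_clusters=num_words, batch_size=2048, random_state=0)
kmeans.fit(all_descriptors)
vocabulary = kmeans.cluster_centers_  # (num_words, 128): one centre per visual word
```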
11. Visual words
• More recently used for describing scenes and objects for the sake of indexing or classification.
Sivic & Zisserman, 2003; Csurka, Bray, Dance & Fan, 2004; many others.
Source credit: K. Grauman, B. Leibe
14. Bags of visual words
Summarize entire image based on its distribution (histogram) of word occurrences.
Analogous to bag of words representation commonly used for documents.
Image credit: Fei-Fei Li
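A minimal sketch of this summarization step, assuming NumPy; descriptors and vocabulary are the arrays produced in the earlier sketches:

```python
# Minimal sketch: summarize one image by a histogram of visual-word
# occurrences, assigning each descriptor to its nearest word centre.
import numpy as np

def bow_histogram(descriptors, vocabulary):
    # squared Euclidean distance from every descriptor to every word centre
    d2 = ((descriptors ** 2).sum(1)[:, None]
          + (vocabulary ** 2).sum(1)[None, :]
          - 2.0 * descriptors @ vocabulary.T)
    words = d2.argmin(axis=1)                                  # nearest word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)                         # normalized word frequencies

# usage: hist = bow_histogram(descriptors, vocabulary)
```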
15. Similarly, bags of textons for texture representation
[Figure: texture image summarized as a histogram over a universal texton dictionary]
Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002
Source: Lana Lazebnik
16. Comparing bags of words
Rank frames by normalized scalar product between their (possibly weighted) occurrence counts --- nearest neighbor search for similar images.
sim(d_j, q) = (d_j · q) / (||d_j|| ||q||), e.g., for word-count vectors d_j = [1 8 1 4] and q = [5 1 1 0].
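A minimal sketch of this ranking, assuming NumPy; the example at the end reuses the word-count vectors quoted above:

```python
# Minimal sketch: rank database histograms by normalized scalar product
# (cosine similarity) against a query histogram.
import numpy as np

def rank_by_similarity(query_hist, db_hists):
    q = query_hist / (np.linalg.norm(query_hist) + 1e-12)
    d = db_hists / (np.linalg.norm(db_hists, axis=1, keepdims=True) + 1e-12)
    scores = d @ q                       # one cosine score per database image
    return np.argsort(-scores), scores   # best matches first

# example with the vectors from the slide: q = [5 1 1 0], d_j = [1 8 1 4]
order, scores = rank_by_similarity(np.array([5., 1., 1., 0.]),
                                   np.array([[1., 8., 1., 4.]]))
```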
17. Inverted file index for images comprised of visual words
[Figure: table mapping word number to a list of image numbers]
Image credit: A. Zisserman
When will this give us a significant gain in efficiency?
18. Indexing local features: inverted file index
For text documents, an efficient way to find all pages on which a word occurs is to use an index…
We want to find all images in which a feature occurs.
We need to index each feature by the image in which it appears, and also keep its number of occurrences.
Source credit : K. Grauman, B. Leibe
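A minimal sketch of such an inverted file, using a plain Python dictionary; image_word_lists is a hypothetical mapping from image ids to the visual words found in each image:

```python
# Minimal sketch: inverted file index mapping each visual word to the images
# (and occurrence counts) in which it appears.
from collections import defaultdict

def build_inverted_index(image_word_lists):
    # image_word_lists: {image_id: [word_id, word_id, ...]}
    index = defaultdict(dict)            # word_id -> {image_id: count}
    for image_id, words in image_word_lists.items():
        for w in words:
            index[w][image_id] = index[w].get(image_id, 0) + 1
    return index

index = build_inverted_index({0: [3, 3, 7], 1: [7, 9]})
# index[7] -> {0: 1, 1: 1}: a query only touches the lists of words it contains
```

The gain is significant when each image contains only a small fraction of the vocabulary, so a query visits a few short lists rather than every image in the database.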
19. tf-idf weighting
Term frequency – inverse document frequency
Describe a frame by the frequency of each word within it, and downweight words that appear often in the database (the standard weighting for text retrieval).
The weight of word i in document (frame) d is t_i = (n_id / n_d) * log(N / n_i), where
n_id = number of occurrences of word i in document d,
n_d = number of words in document d,
n_i = number of occurrences of word i in the whole database,
N = total number of documents in the database.
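A minimal sketch of this weighting, assuming NumPy; the argument names mirror the quantities defined above:

```python
# Minimal sketch: reweight an image's word histogram so that words common
# across the whole database count less (tf-idf).
import numpy as np

def tfidf_weights(counts, db_word_counts, num_docs):
    # counts:         (num_words,) n_id, occurrences of each word in this document
    # db_word_counts: (num_words,) n_i, occurrences of each word across the database
    # num_docs:       N, number of documents (frames) in the database
    tf = counts / max(counts.sum(), 1)                      # n_id / n_d
    idf = np.log(num_docs / np.maximum(db_word_counts, 1))  # log(N / n_i)
    return tf * idf
```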
20. Bags of words for content-based image retrieval
What if query of interest is a portion of a frame?
Slide from Andrew Zisserman
Sivic & Zisserman, ICCV 2003
21. Video Google System
1. Collect all words within query region
2. Inverted file index to find relevant frames
3. Compare word counts
4. Spatial verification
Sivic & Zisserman, ICCV 2003
Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
[Figure: query region and retrieved frames]
22. Collecting words within a query region
Query region: pull out only the SIFT descriptors whose positions are within the polygon.
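A minimal sketch of this filtering step, assuming OpenCV keypoints and descriptors from the earlier extraction sketch and a hypothetical query polygon:

```python
# Minimal sketch: keep only the descriptors whose keypoint positions fall
# inside the user-drawn query polygon.
import numpy as np
import cv2

def descriptors_in_region(keypoints, descriptors, polygon):
    # polygon: (P, 2) array of vertex coordinates
    poly = polygon.reshape(-1, 1, 2).astype(np.float32)
    keep = [i for i, kp in enumerate(keypoints)
            if cv2.pointPolygonTest(poly, kp.pt, False) >= 0]
    return descriptors[keep]
```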
26. Bag of words representation: spatial info
A bag of words is an orderless representation: throwing out spatial relationships between features.
Middle ground:
• Visual "phrases": frequently co-occurring words
• Semi-local features: describe configuration, neighborhood
• Let position be part of each feature
• Count bags of words only within sub-grids of an image (see the sketch below)
• After matching, verify spatial consistency (e.g., look at neighbors – are they the same too?)
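A minimal sketch of the sub-grid idea referenced above, assuming NumPy; positions are keypoint coordinates and word_ids their assigned visual words:

```python
# Minimal sketch: one bag-of-words histogram per cell of a coarse grid,
# concatenated so that rough spatial layout is preserved (a two-level
# spatial pyramid would also append the whole-image histogram).
import numpy as np

def gridded_bow(positions, word_ids, image_size, num_words, grid=2):
    # positions: (N, 2) keypoint (x, y) coordinates; word_ids: (N,) ints
    h, w = image_size
    hists = []
    for gy in range(grid):
        for gx in range(grid):
            in_cell = ((positions[:, 0] >= gx * w / grid) & (positions[:, 0] < (gx + 1) * w / grid) &
                       (positions[:, 1] >= gy * h / grid) & (positions[:, 1] < (gy + 1) * h / grid))
            hists.append(np.bincount(word_ids[in_cell], minlength=num_words))
    return np.concatenate(hists)
```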
27. Visual vocabulary formation
Issues:
• Sampling strategy: where to extract features?
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words
K. Grauman, B. Leibe
28. Sampling strategies
[Figure panels: sparse, at interest points; dense, uniformly; randomly; multiple interest operators. Image credits: F-F. Li, E. Nowak, J. Sivic]
• To find specific, textured objects, sparse sampling from interest points often more reliable.
• Multiple complementary interest operators offer more image coverage.
• For object categorization, dense sampling offers better coverage (see the sketch below).
[See Nowak, Jurie & Triggs, ECCV 2006]
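A minimal sketch of dense sampling, assuming OpenCV; a uniform grid of keypoints is described with SIFT instead of relying on an interest-point detector:

```python
# Minimal sketch: build a uniform grid of keypoints so every image region
# contributes descriptors, regardless of whether a detector fires there.
import cv2

def dense_keypoints(image, step=8, size=8):
    h, w = image.shape[:2]
    return [cv2.KeyPoint(float(x), float(y), float(size))
            for y in range(step // 2, h, step)
            for x in range(step // 2, w, step)]

# usage: keypoints, descriptors = sift.compute(img, dense_keypoints(img))
```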
29. Example: Recognition with Vocabulary Tree
Tree construction:
[Nister & Stewenius, CVPR’06]
Slide credit: David Nister
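A minimal sketch of hierarchical k-means in the spirit of this construction, assuming scikit-learn; this is an illustrative simplification, not the authors' implementation:

```python
# Minimal sketch: each node splits its descriptors into `branch` children,
# and a new descriptor is quantized by greedily descending the tree.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branch=10, depth=3):
    if depth == 0 or len(descriptors) < branch:
        return None
    km = KMeans(n_clusters=branch, n_init=4, random_state=0).fit(descriptors)
    children = [build_tree(descriptors[km.labels_ == c], branch, depth - 1)
                for c in range(branch)]
    return {"centers": km.cluster_centers_, "children": children}

def quantize(tree, descriptor, path=()):
    if tree is None:
        return path                      # leaf reached: the path identifies the word
    c = int(np.argmin(np.linalg.norm(tree["centers"] - descriptor, axis=1)))
    return quantize(tree["children"][c], descriptor, path + (c,))
```

Quantizing a descriptor then costs only about branch × depth distance comparisons instead of one comparison per leaf word, which is what makes very large vocabularies practical.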
30.–34. Vocabulary Tree
Training: Filling the tree (illustrated step by step across five slides)
[Nister & Stewenius, CVPR’06]
Slide credit: David Nister
35. Vocabulary Tree
Recognition: RANSAC verification
[Nister & Stewenius, CVPR’06]
Slide credit: David Nister
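A minimal sketch of such a spatial verification step, assuming OpenCV and matched keypoint positions between the query and a candidate database image:

```python
# Minimal sketch: after the bag-of-words ranking, verify a candidate match by
# fitting a homography to corresponding keypoints with RANSAC and counting inliers.
import numpy as np
import cv2

def ransac_inliers(query_pts, db_pts):
    # query_pts, db_pts: (N, 2) float32 arrays of matched keypoint positions
    if len(query_pts) < 4:
        return 0
    H, mask = cv2.findHomography(query_pts, db_pts, cv2.RANSAC, 5.0)
    return 0 if mask is None else int(mask.sum())
```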
36. Vocabulary Tree: Performance
• Evaluated on large databases: indexing with up to 1M images
• Online recognition for a database of 50,000 CD covers; retrieval in ~1s
• Find experimentally that large vocabularies can be beneficial for recognition
[Nister & Stewenius, CVPR’06]
38. Bags of features for object recognition
face, flowers, building
Works pretty well for image-level classification
Source credit: K. Grauman, B. Leibe; Source: Lana Lazebnik
39. Bags of features for object recognition
Caltech6 dataset
[Table: results for two bag-of-features methods and a parts-and-shape model]
Source: Lana Lazebnik
40. Bags of words: pros and cons
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides vector representation for sets
+ has yielded good recognition results in practice
- basic model ignores geometry – must verify afterwards, or encode via features
- background and foreground mixed when bag covers whole image
- interest points or sampling: no guarantee to capture object-level parts
- optimal vocabulary formation remains unclear
Source credit : K. Grauman, B. Leibe
41. Summary
• Local invariant features: distinctive matches possible in spite of significant view change; useful not only to provide matches for multi-view geometry, but also to find objects and scenes.
• To find correspondences among detected features, measure distance between descriptors, and look for most similar patches.
• Bag of words representation: quantize feature space to make a discrete set of visual words; summarize image by its distribution of words; index individual words.
• Inverted index: pre-compute index to enable faster search at query time.
Source credit : K. Grauman, B. Leibe