Broad introduction to information retrieval and web search, used for teaching at the Yahoo Bangalore Summer School 2013. The slides are a mash-up of my own and other people's presentations.
This 2-hour lecture was held at Amsterdam University of Applied Sciences (HvA) on October 16th, 2013. It presents a basic overview of core technologies used by ICT companies such as Google, Twitter, or Facebook. The lecture does not require a strong technical background and stays at a conceptual level.
INTRODUCTION TO INFORMATION RETRIEVAL
This lecture will introduce the information retrieval problem and its terminology, and provide a history of IR. In particular, the history of the web and its impact on IR will be discussed. Special attention will be given to the concept of relevance in IR and the critical role it has played in the development of the field. The lecture will end with a conceptual explanation of the IR process, its relationships with other domains, and current research developments.
INFORMATION RETRIEVAL MODELS
This lecture will present the models that have been used to rank documents according to their estimated relevance to user-given queries, where the most relevant documents are shown ahead of those less relevant. These models form the basis for many of the ranking algorithms used in past and present search applications. The lecture will describe IR models such as Boolean retrieval, vector space, probabilistic retrieval, language models, and logical models. Relevance feedback, a technique that implicitly or explicitly modifies user queries in light of the user's interaction with retrieval results, will also be discussed, as it is particularly relevant to web search and personalization.
The vector space model, or term vector model, is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking. Its first use was in the SMART Information Retrieval System.
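As a toy illustration of the vector space model (not the SMART system itself; the documents and weighting choices below are invented for the example), the following Python sketch builds TF-IDF term vectors and compares documents by cosine similarity:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["web", "search", "engine"],
        ["web", "information", "retrieval"],
        ["cooking", "recipes"]]
vecs = tf_idf_vectors(docs)
```

Documents sharing weighted terms ("web" above) score higher against each other than documents with no overlap, which is the basis of relevancy ranking in this model.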
Slides for the iDB summer school (Sapporo, Japan) http://db-event.jpn.org/idb2013/
Typically, web mining approaches have focused on enhancing or learning about user seeking behavior, from query-log analysis and click-through usage, to employing the web graph structure for ranking, to detecting spam or web page duplicates. Lately, there is a trend toward mining web content semantics and dynamics in order to enhance search capabilities, by either providing direct answers to users or allowing for advanced interfaces or capabilities. In this tutorial we will look into different ways of mining textual information from web archives, with a particular focus on how to extract and disambiguate entities, and how to put them to use in various search scenarios. Further, we will discuss how web dynamics affect information access and how to exploit them in a search context.
Introduction to Enterprise Search. A two-hour class to introduce Enterprise Search. It covers:
The problems enterprise search can solve
History of (web) search
How do we search and find?
Current state of Enterprise Search + stats
Technical concept
Information quality
Feedback cycle
Five dimensions of Findability
This presentation was provided by Marydee Ojala of Information Today during the NISO event "The Impact of the Interface: Traditional and Non Traditional Content," held on November 20, 2019.
Designing Structure Part II: Information Architecture - Christina Wodtke
Part two of Designing Structure for my General Assembly class on User Experience is about Information Architecture. We cover why classification is important, types of classification, and trends in IA.
Search & Recommendation: Birds of a Feather? - Toine Bogers
In just a little over half a century, the field of information retrieval has experienced spectacular growth and success, with IR applications such as search engines becoming a billion-dollar industry in the past decades. Recommender systems have seen an even more meteoric rise to success with wide-scale application by companies like Amazon, Facebook, and Netflix. But are search and recommendation really two different fields of research that address different problems with different sets of algorithms in papers published at distinct conferences?
In my talk, I want to argue that search and recommendation are more similar than they have been treated in the past decade. By looking more closely at the tasks and problems that search and recommendation try to solve, at the algorithms used to solve these problems and at the way their performance is evaluated, I want to show that there is no clear black and white division between the two. Instead, search and recommendation are part of a much more fluid continuum of methods and techniques for information access.
(Keynote at "Mind The Gap '14" workshop at the iConference 2014 in Berlin, Germany)
This presentation has been given at many SharePoint conferences around the world. It focuses on preparing us for the new Managed Metadata Services in SharePoint 2010, and on putting together good practices to understand our metadata and deliver the most effective strategy.
Information Discovery and Search Strategies for Evidence-Based Research - David Nzoputa Ofili
This event was on May 2, 2017 at Wesley University, Ondo State, Nigeria. I trained the university's staff (academic and non-academic) on "Information Discovery and Search Strategies for Evidence-Based Research" in an information/digital literacy session.
Slides from Enterprise Search & Analytics Meetup @ Cisco Systems - http://www.meetup.com/Enterprise-Search-and-Analytics-Meetup/events/220742081/
Relevancy and Search Quality Analysis - By Mark David and Avi Rappoport
The Manifold Path to Search Quality
To achieve accurate search results, we must come to an understanding of the three pillars involved.
1. Understand your data
2. Understand your customers’ intent
3. Understand your search engine
The first path passes through Data Analysis and Text Processing.
The second passes through Query Processing, Log Analysis, and Result Presentation.
Everything learned from those explorations feeds into the final path of Relevancy Ranking.
Search quality is focused on end users finding what they want -- technical relevance is sometimes irrelevant! Working with the short head (very frequent queries) has the highest return on investment for improving the search experience: tuning the results, for example, to emphasize recent documents or de-emphasize archived documents, detecting near-duplicates, exposing diverse results for ambiguous queries, using synonyms, and guiding search via best bets and auto-suggest. Long-tail analysis can reveal user intent by detecting patterns, discovering related terms, and identifying the most fruitful results via aggregated behavior. All of this feeds back into regression testing, which provides reliable metrics to evaluate the changes.
By merging these insights, you can improve the quality of the search overall, in a scalable and maintainable fashion.
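The short-head idea above can be sketched in a few lines of Python; the query log and the coverage threshold here are invented for illustration:

```python
from collections import Counter

def short_head(query_log, coverage=0.5):
    """Return the most frequent queries that together account for at least
    the given fraction of total query volume (the 'short head')."""
    counts = Counter(query_log)
    total = sum(counts.values())
    head, covered = [], 0
    for query, freq in counts.most_common():
        if covered / total >= coverage:
            break
        head.append(query)
        covered += freq
    return head

log = ["weather", "weather", "weather", "news", "news",
       "obscure error 0x80070005", "rare query"]
```

Tuning effort spent on the `short_head(log)` queries reaches the most users, while the remaining long tail is better mined in aggregate for patterns and related terms.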
Information Retrieval 1: Introduction to IR - Vaibhav Khanna
Information retrieval (IR) is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing.
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender or other prediction-based systems. Typically, most such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limit at query time, as well as the presence of inefficiencies in the system due to the decoupling of retrieval and ranking.
To address these issues, the authors created ML-Scoring, an open-source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information retrieval ranking function with a custom supervised model, trained through Spark, Weka, or R, that is loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but will also walk through practical examples, from loading a dataset into Elasticsearch, to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
Similar to Introduction to Information Retrieval
Entity Linking via Graph-Distance Minimization - Roi Blanco
Entity-linking is a natural-language-processing task that consists in identifying strings of text that refer to a particular item in some reference knowledge base.
One instance of entity-linking can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between chosen items.
Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; nonetheless, it turns out to be solvable in linear time under some more restrictive assumptions. For the general case, we propose several heuristics: one of these tries to enforce the above assumptions while the others try to optimize similar easier objective functions; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset.
Slides used for the keynote at the event Big Data & Data Science http://eventos.citius.usc.es/bigdata/
Some slides are borrowed from random Hadoop/big data presentations.
Influence of Timeline and Named-entity Components on User Engagement - Roi Blanco
Nowadays, successful applications are those containing features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This is in contrast with previous studies, where the importance of these components was studied from a retrieval-effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were installed. Further, we investigate whether we can predict user-centred metrics through users' interaction with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system or not. These findings could guide systems toward a more personalised user experience, tailored to users' preferences.
Beyond document retrieval using semantic annotations - Roi Blanco
Traditional information retrieval approaches deal with retrieving full-text documents in response to a user's query. However, applications that go beyond the "ten blue links" and make use of additional information to display and interact with search results are becoming increasingly popular and are being adopted by all major search engines. In addition, recent advances in text extraction allow for inferring semantic information about particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
Large knowledge bases consisting of entities and relationships between them have become vital sources of information for many applications. Most of these knowledge bases adopt the Semantic-Web data model RDF as a representation model. Querying these knowledge bases is typically done using structured queries utilizing graph-pattern languages such as SPARQL. However, such structured queries require some expertise from users, which limits the accessibility of such data sources. To overcome this, keyword search must be supported. In this paper, we propose a retrieval model for keyword queries over RDF graphs. Our model retrieves a set of subgraphs that match the query keywords, and ranks them based on statistical language models. We show that our retrieval model outperforms state-of-the-art IR and DB models for keyword search over structured data, using experiments over two real-world datasets.
Extending BM25 with multiple query operators - Roi Blanco
Traditional probabilistic relevance frameworks for information retrieval refrain from taking positional information into account, due to the hurdles of developing a sound model while avoiding an explosion in the number of parameters. Nonetheless, the well-known BM25F extension of the successful Okapi ranking function can be seen as an embryonic attempt in that direction. In this paper, we proceed along the same line, defining the notion of a virtual region: a virtual region is a part of the document that, like a BM25F field, can provide (larger or smaller, depending on a tunable weighting parameter) evidence of the relevance of the document; differently from BM25F fields, though, virtual regions are generated implicitly by applying suitable (usually, but not necessarily, position-aware) operators to the query. This technique fits nicely in the eliteness model behind BM25 and provides a principled explanation of BM25F; it specializes to BM25(F) for some trivial operators, but has a much more general appeal. Our experiments (both on standard collections, such as TREC, and on web-like repertoires) show that the use of virtual regions is beneficial for retrieval effectiveness.
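For context, the classic BM25 ranking function that BM25F and this work build on can be sketched as follows; this is a standard textbook formulation (using one common IDF variant), not the paper's virtual-region extension, and the toy collection is invented:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized document against a query using classic BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue
        idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
        # Term-frequency saturation (k1) and document-length normalization (b)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

docs = [["okapi", "bm25", "ranking"],
        ["probabilistic", "relevance", "model"],
        ["okapi", "retrieval"]]
```

BM25F extends this by combining length-normalized, field-weighted term frequencies (title, body, anchor text) before saturation; the paper's virtual regions generalize those static fields to regions generated by query operators.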
Energy-Price-Driven Query Processing in Multi-center Web Search Engines - Roi Blanco
Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes the challenge to reduce the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.
Effective and Efficient Entity Search in RDF data - Roi Blanco
Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors combined have led to an increased interest in semantic search, i.e. access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other state-of-the-art methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the open-source MG4J framework.
Caching Search Engine Results over Incremental Indices - Roi Blanco
A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index.
To obtain this property, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents using a query log from the Yahoo! search engine shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to a baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results compared to a TTL-only scheme for similar freshness of results.
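The TTL baseline compared against above can be sketched as a minimal result cache in which any entry older than its time-to-live counts as a miss; this is a generic illustration, not the paper's predictor framework:

```python
class TTLResultCache:
    """Minimal search-result cache with time-to-live (TTL) invalidation."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # query -> (results, insertion time)

    def put(self, query, results, now):
        self.store[query] = (results, now)

    def get(self, query, now):
        """Return cached results, or None if the entry is absent or expired."""
        entry = self.store.get(query)
        if entry is None:
            return None
        results, inserted = entry
        if now - inserted > self.ttl:
            del self.store[query]  # expired: evict and treat as a miss
            return None
        return results
```

The weakness the paper targets is visible here: an entry either expires (possibly while still fresh, forcing an unnecessary re-evaluation) or survives (possibly stale after an index update), regardless of whether its results actually changed.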
We study the problem of finding sentences that explain the relationship between a named entity and an ad-hoc query, which we refer to as entity support sentences. This is an important sub-problem of entity ranking which, to the best of our knowledge, has not been addressed before. In this paper we give the first formalization of the problem, how it can be evaluated, and present a full evaluation dataset. We propose several methods to rank these sentences, namely retrieval-based, entity-ranking based and position-based. We found that traditional bag-of-words models perform relatively well when there is a match between an entity and a query in a given sentence, but they fail to find a support sentence for a substantial portion of entities. This can be improved by incorporating small windows of context sentences and ranking them appropriately.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
2. Acknowledgements
• Many of these slides were taken from other presentations
– P. Raghavan, C. Manning, H. Schutze IR lectures
– Mounia Lalmas’s personal stash
– Other random slide decks
• Textbooks
– Ricardo Baeza-Yates, Berthier Ribeiro Neto
– Raghavan, Manning, Schutze
– … among other good books
• Many online tutorials, many online tools available (full toolkits)
3. Big Plan
• What is Information Retrieval?
– Search engine history
– Examples of IR systems (you might not have known!)
• Is IR hard?
– Users and human cognition
– What is it like to be a search engine?
• Web Search
– Architecture
– Differences between Web search and IR
– Crawling
6. Information Retrieval
Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
Introduction to Information Retrieval
7. Information Retrieval (II)
• What do we understand by documents? How do
we decide what is a document and what is not?
• What is an information need? What types of
information needs can we satisfy automatically?
• What is a large collection? Which environments
are suitable for IR?
8. Basic assumptions of Information Retrieval
• Collection: A set of documents
– Assume it is a static collection
• Goal: Retrieve documents with information that is
relevant to the user’s information need and helps
the user complete a task
9. Key issues
• How to describe information resources or information-bearing
objects in ways that they can be effectively used
by those who need to use them?
– Organizing/Indexing/Storing
• How to find the appropriate information resources or
information-bearing objects for someone’s (or your own)
needs?
– Retrieving / Accessing / Filtering
10. Unstructured data
Unstructured data?
• A structured query against a table:
SELECT * from HOTELS
where city = Bangalore and $$$ < 2
CITY      $$$  name
Bangalore 1.5  Cheapo one
Barcelona 1    EvenCheapoer
• An unstructured query: “Cheap hotels in Bangalore”
41. IR issues
• Find out what the user needs
… and do it quickly
• Challenges: user intention, accessibility, volatility,
redundancy, lack of structure, low quality, different data
sources, volume, scale
• The main bottleneck is human cognition, not
computational power
42. IR is mostly about relevance
• Relevance is the core concept in IR, but nobody has a good
definition
• Relevance = useful
• Relevance = topically related
• Relevance = new
• Relevance = interesting
• Relevance = ???
• However we still want relevant information
43. • Information needs must be expressed as a query
– But users don’t often know what they want
• Problems
– Verbalizing information needs
– Understanding query syntax
– Understanding search engines
44. Understanding(?) the user
• Real need: I am a hungry tourist in Barcelona, and I want to
find a place to eat; however I don’t want to spend a lot of money
• Verbalized need: I want information on places with cheap food
in Barcelona
• Query: Info about bars in Barcelona
• Typed query: [ Bar celona ]
• At each step things can go wrong: misconception,
mistranslation, misformulation
45. Why is this hard?
• Documents/images/video/speech/etc. are complex. We
need some representation
• Semantics
– What do words mean?
• Natural language
– How do we say things?
• Computers cannot deal with these easily
46. … and even harder
• Context
• Opinion
Funny? Talented? Honest?
48. What is it like to be a search engine?
• How can we figure out what you’re trying to do?
• The signal can sometimes be rather weak!
[ jaguar ]
[ iraq ]
[ latest release Thinkpad drivers touchpad ]
[ ebay ]
[ first ]
[ google ]
[ brittttteny spirs ]
49. Search is a multi-step process
• Session search
– Verbalize your query
– Look for a document
– Find your information there
– Refine
• Teleporting
– Go directly to the site you like
– Formulating the query is too hard, you trust
the final site more, etc.
50. • Someone told me that in the mid-1800’s, people often would carry
around a special kind of notebook. They would use the notebook to
write down quotations that they heard, or copy passages from books
they’d read. The notebook was an important part of their education,
and it had a particular name.
– What was the name of the notebook?
Examples from Dan Russell
52. More tasks …
• Going beyond a search engine
– Using images / multimedia content
– Using maps
– Using other sources
• Think of how to express things differently (synonyms)
– A friend told me that there is an abandoned city in the waters of San Francisco
Bay. Is that true? If it IS true, what was the name of the supposed city?
• Exploring a topic further in depth
• Refining a question
– Suppose you want to buy a unicycle for your Mom or Dad. How would you find
it?
• Looking for lists of information
– Can you find a list of all the groups that inhabited California at the time of the
missions?
53. IR tasks
• Known-item finding
– You want to retrieve some data that you know exists
– What year was Peter Mika born?
• Exploratory seeking
– You want to find some information through an iterative process
– Not a single answer to your query
• Exhaustive search
– You want to find all the information possible about a particular issue
– Issuing several queries to cover the user information need
• Re-finding
– You want to find an item you have found already
54. Scale
• >300TB of print data produced per year
– +Video, speech, domain-specific information (>600PB per year)
• IR has to be fast + scalable
• Information is dynamic
– News, web pages, maps, …
– Queries are dynamic (you might even change your information needs while
searching)
• Cope with data and searcher change
– This introduces tensions in every component of a search engine
55. Methodology
• Experimentation in IR
• Three fundamental types of IR research:
– Systems (efficiency)
– Methods (effectiveness)
– Applications (user utility)
• Empirical evaluation plays a critical role across all three types
of research
56. Methodology (II)
• Information retrieval (IR) is a highly applied scientific
discipline
• Experimentation is a critical component of the scientific
method
• Poor experimental methodologies are not scientifically
sound and should be avoided
58. [Diagram: the search process – a task gives rise to an info
need, which is verbalized and expressed as a query to a search
engine over a corpus; the engine returns results, which feed
back into query refinement]
59. [Diagram: search engine architecture – a user interface sends
queries through query interpretation to matching and ranking
against an index and metadata; the index is built from a document
collection by crawling, text processing, document interpretation,
indexing, and some general voodoo]
64. Web Search
• Basic search technology shared with IR systems
– Representation
– Indexing
– Ranking
• Scale (in terms of data and users) changes the game
– Efficiency/architectural design decisions
• Link structure
– For data acquisition (crawling)
– For ranking (PageRank, HITS)
– For spam detection
– For extending document representations (anchor text)
• Adversarial IR
• Monetization
65. User Needs
• Need
– Informational – want to learn about something (~40% / 65%)
– Navigational – want to go to that page (~25% / 15%)
– Transactional – want to do something (web-mediated) (~35% / 20%)
• Access a service
• Downloads
• Shop
– Gray areas
• Find a good hub
• Exploratory search “see what’s there”
Example queries: [Low hemoglobin], [United Airlines],
[Seattle weather], [Mars surface images], [Canon S410],
[Car rental Brasil]
66. How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
67. Users’ empirical evaluation of results
• Quality of pages varies widely
– Relevance is not enough
– Other desirable qualities (non IR!!)
• Content: Trustworthy, diverse, non-duplicated, well maintained
• Web readability: display correctly & fast
• No annoyances: pop-ups, etc.
• Precision vs. recall
– On the web, recall seldom matters
• What matters
– Precision at 1? Precision above the fold?
– Comprehensiveness – must be able to deal with obscure queries
• Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant
over a large aggregate
68. Users’ empirical evaluation of engines
• Relevance and validity of results
• UI – Simple, no clutter, error tolerant
• Trust – Results are objective
• Coverage of topics for ambiguous queries
• Pre/Post process tools provided
– Mitigate user errors (auto spell check, search assist,…)
– Explicit: Search within results, more like this, refine ...
– Anticipative: related searches
• Deal with idiosyncrasies
– Web specific vocabulary
• Impact on stemming, spell-check, etc.
– Web addresses typed in the search box
• “The first, the last, the best and the worst …”
69. The Web document collection
• No design/co-ordination
• Distributed content creation, linking,
democratization of publishing
• Content includes truth, lies, obsolete
information, contradictions …
• Unstructured (text, html, …), semi-structured
(XML, annotated photos), structured
(Databases)…
• Scale much larger than previous text collections
… but corporate records are catching up
• Growth – slowed down from initial “volume
doubling every few months” but still expanding
• Content can be dynamically generated
70. Basic crawler operation
• Begin with known “seed” URLs
• Fetch and parse them
– Extract URLs they point to
– Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
71. Crawling picture
[Diagram: crawling starts from seed pages; URLs crawled and
parsed expand into the unseen Web, while the URL frontier holds
URLs that have been discovered but not yet fetched]
72. Simple picture – complications
• Web crawling isn’t feasible with one machine
– All of the above steps distributed
• Malicious pages
– Spam pages
– Spider traps – including dynamically generated
• Even non-malicious pages pose challenges
– Latency/bandwidth to remote servers vary
– Webmasters’ stipulations
• How “deep” should you crawl a site’s URL hierarchy?
– Site mirrors and duplicate pages
• Politeness – don’t hit a server too often
73. What any crawler must do
• Be Polite: Respect implicit and explicit
politeness considerations
– Only crawl allowed pages
– Respect robots.txt
• Be Robust: Be immune to spider traps
and other malicious behavior from
web servers
– Be efficient
74. What any crawler should do
• Be capable of distributed operation: designed to
run on multiple distributed machines
• Be scalable: designed to increase the crawl rate
by adding more machines
• Performance/efficiency: permit full use of
available processing and network resources
75. What any crawler should do
• Fetch pages of “higher quality” first
• Continuous operation: Continue fetching
fresh copies of a previously fetched page
• Extensible: Adapt to new data formats,
protocols
76. Updated crawling picture
[Diagram: as before – seed pages, URLs crawled and parsed, URL
frontier, unseen Web – but now multiple crawling threads consume
the URL frontier in parallel]
78. Document views
[Figure: one document, “Sailing in Greece” by B. Smith, shown
under four views – content view (index terms: sailing, greece,
mediterranean, fish, sunset), data view (Author = “B. Smith”,
Crdate = “14.12.96”, Ladate = “11.07.02”), structure view (head,
title, author, chapters, sections), and layout view]
79. What is a document: document views
• Content view is concerned with representing the content
of the document; that is, what is the document about.
• Data view is concerned with factual data associated with
the document (e.g. author names, publishing date)
• Layout view is concerned with how documents are
displayed to the users; this view is related to user interface
and visualization issues.
• Structure view is concerned with the logical structure of
the document (e.g. a book being composed of chapters,
themselves composed of sections, etc.)
80. Indexing language
• An indexing language:
– Is the language used to describe the content of
documents (and queries)
– And it usually consists of index terms that are derived
from the text (automatic indexing), or arrived at
independently (manual indexing), using a controlled
or uncontrolled vocabulary
– Basic operation: is this query term present in this
document?
81. Generating document representations
• The building of the indexing language, that is generating
the document representation, is done in several steps:
– Character encoding
– Language recognition
– Page segmentation (boilerplate detection)
– Tokenization (identification of words)
– Term normalization
– Stopword removal
– Stemming
– Others (doc. expansion, etc.)
82. Generating document representations: overview
documents → tokens (tokenization) → tokens without stop-words
(remove noisy words) → stems (reduce to stems) → terms (index terms)
+ others: e.g. thesaurus, more complex processing
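The pipeline can be sketched in Python. Everything here is a simplified assumption for illustration: the stop-word list is a tiny sample, and `stem` is a crude suffix chopper, not a real stemmer like Porter’s.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "to", "be", "in", "of"}  # tiny sample list

def stem(token):
    # Crude suffix chopping in the spirit of Porter's algorithm;
    # a real stemmer has many more rules and conditions.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(document):
    tokens = re.findall(r"[a-z0-9]+", document.lower())   # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stop-word removal
    return [stem(t) for t in tokens]                      # stemming

print(index_terms("The sailors were sailing to Greece"))
```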
83. Parsing a document
• What format is it in?
– pdf/word/excel/html?
• What language is it in?
• What character set is in use?
– (ISO-8818, UTF-8, …)
But these tasks are often done heuristically …
84. Complications: Format/language
• Documents being indexed can include docs from many
different languages
– A single index may contain terms from many languages.
• Sometimes a document or its components can contain
multiple languages/formats
– French email with a German pdf attachment.
– A French email quoting clauses from an English-language
contract
• There are commercial and open source libraries that can
handle a lot of this stuff
85. Complications: What is a document?
We return from our query “documents” but there are often
interesting questions of grain size:
What is a unit document?
– A file?
– An email? (Perhaps one of many in a single mbox file)
• What about an email with 5 attachments?
– A group of files (e.g., PPT or LaTeX split over HTML pages)
86. Tokenization
• Input: “Friends, Romans and Countrymen”
• Output: Tokens
– Friends
– Romans
– Countrymen
• A token is an instance of a sequence of characters
• Each such token is now a candidate for an index entry, after
further processing
• But what are valid tokens to emit?
87. Tokenization
• Issues in tokenization:
– Finland’s capital → Finland AND s? Finlands? Finland’s?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence?
• co-education
• lowercase, lower-case, lower case?
• It can be effective to get the user to put in possible hyphens
– San Francisco: one token or two?
• How do you decide it is one token?
88. Numbers
• 3/20/91 Mar. 12, 1991 20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
• Often have embedded spaces
• Older IR systems may not index numbers
But often very useful: think about things like looking up error
codes/stacktraces on the web
• Will often index “meta-data” separately
Creation date, format, etc.
89. Tokenization: language issues
• French
– L'ensemble one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
– Until at least 2003, it didn’t on Google
» Internationalization!
• German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter
– ‘life insurance company employee’
– German retrieval systems benefit greatly from a compound splitter
module
– Can give a 15% performance boost for German
90. Tokenization: language issues
• Chinese and Japanese have no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets
intermingled
– Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana Hiragana Kanji Romaji
End-user can express query entirely in hiragana!
91. Tokenization: language issues
• Arabic (or Hebrew) is basically written right to left, but with certain items
like numbers written left to right
• Words are separated, but letter forms within a word form complex
ligatures
[Example: an Arabic sentence in which the text runs right to left
while the embedded numbers run left to right]
‘Algeria achieved its independence in 1962 after 132 years of
French occupation.’
• With Unicode, the surface presentation is complex, but the stored
form is straightforward
92. Stop words
• With a stop list, you exclude from the dictionary entirely the commonest
words. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for top 30 words
• But the trend is away from doing this:
– Good compression techniques means the space for including stop words in a system
can be small
– Good query optimization techniques mean you pay little at query time for including
stop words.
– You need them for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
93. Normalization to terms
• Want: matches to occur despite superficial differences in the
character sequences of the tokens
• We may need to “normalize” words in indexed text as well as query words
into the same form
– We want to match U.S.A. and USA
• Result is terms: a term is a (normalized) word type, which is an entry in
our IR system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
– deleting periods to form a term
• U.S.A., USA → USA
– deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
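This kind of equivalence classing can be sketched as one small normalization function (illustrative only; a real normalizer handles many more cases, e.g. accents and case exceptions from the next slides):

```python
def normalize(token):
    """Equivalence-class a token by lowercasing and deleting periods/hyphens."""
    return token.lower().replace(".", "").replace("-", "")

print(normalize("U.S.A."))                      # -> usa
print(normalize("anti-discriminatory"))         # -> antidiscriminatory
print(normalize("U.S.A.") == normalize("USA"))  # -> True
```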
94. Normalization: other languages
• Accents: e.g., French résumé vs. resume.
• Umlauts: e.g., German: Tuebingen vs. Tübingen
– Should be equivalent
• Most important criterion:
– How are your users likely to write their queries for these words?
• Even in languages that standardly have accents, users often may not type
them
– Often best to normalize to a de-accented term
• Tuebingen, Tübingen, Tubingen → Tubingen
95. Case folding
• Reduce all letters to lower case
– exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
– Often best to lower case everything, since users will use lowercase
regardless of ‘correct’ capitalization…
• Longstanding Google example: [fixed in 2011…]
– Query C.A.T.
– #1 result is for “cats” (well, Lolcats) not Caterpillar Inc.
96. Normalization to terms
• An alternative to equivalence classing is to do asymmetric
expansion
• An example of where this may be useful
– Enter: window Search: window, windows
– Enter: windows Search: Windows, windows, window
– Enter: Windows Search: Windows
• Potentially more powerful, but less efficient
97. Thesauri and soundex
• Do we handle synonyms and homonyms?
– E.g., by hand-constructed equivalence classes
• car = automobile color = colour
– We can rewrite to form equivalence-class terms
• When the document contains automobile, index it under
car-automobile (and vice-versa)
– Or we can expand a query
• When the query contains automobile, look under car as
well
• What about spelling mistakes?
– One approach is Soundex, which forms equivalence classes of
words based on phonetic heuristics
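A simplified Soundex can be sketched as follows (illustrative; the full algorithm has extra rules, e.g. special treatment of h/w between letters with the same code):

```python
# Letter-to-digit codes used by Soundex; vowels and h/w/y get no code.
CODES = {c: d for d, letters in
         {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
          "4": "l", "5": "mn", "6": "r"}.items()
         for c in letters}

def soundex(word):
    """Simplified Soundex: first letter + up to three digit codes."""
    word = word.lower()
    digits = [CODES.get(c, "") for c in word]
    result = word[0].upper()
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:       # skip uncoded letters and adjacent repeats
            result += d
        prev = d
    return (result + "000")[:4]   # pad with zeros, keep 4 characters

print(soundex("Robert"))  # -> R163
print(soundex("Rupert"))  # -> R163, same class despite the spelling
```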
98. Lemmatization
• Reduce inflectional/variant forms to base form
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be
different color
• Lemmatization implies doing “proper” reduction to
dictionary headword form
99. Stemming
• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
Before stemming: “for example compressed and compression are
both accepted as equivalent to compress.”
After stemming: “for exampl compress and compress ar both
accept as equival to compress”
100. – Affix removal
• remove the longest affix: {sailing, sailor} => sail
• simple and effective stemming
• a widely used such stemmer is Porter’s algorithm
– Dictionary-based using a look-up table
• look for stem of a word in table: play + ing => play
• space is required to store the (large) table, so often not practical
101. Stemming: some issues
• Detect equivalent stems:
– {organize, organise}: e as the longest affix leads to {organiz,
organis}, which should lead to one stem: organis
– Heuristics are therefore used to deal with such cases.
• Over-stemming:
– {organisation, organ} reduced into org, which is incorrect
– Again heuristics are used to deal with such cases.
102. Porter’s algorithm
• Commonest algorithm for stemming English
– Results suggest it’s at least as good as other stemming options
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: Of the rules in a compound command, select
the one that applies to the longest suffix.
103. Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
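Applying these sample rules with the longest-suffix convention can be sketched as below (a toy with just the four rules above, not the full five-phase Porter algorithm):

```python
# Sample Porter-style rewrite rules (suffix -> replacement). Within a
# compound command, the rule matching the longest suffix wins.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    best = None
    for suffix, repl in RULES:
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, repl)
    if best:
        suffix, repl = best
        return word[: -len(suffix)] + repl
    return word

print(apply_rules("caresses"))    # -> caress
print(apply_rules("ponies"))      # -> poni
print(apply_rules("relational"))  # -> relate  ("ational" beats "tional")
print(apply_rules("conditional")) # -> condition
```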
104. Language-specificity
• The above methods embody transformations that are
– Language-specific, and often
– Application-specific
• These are “plug-in” addenda to the indexing process
• Both open source and commercial plug-ins are
available for handling these
105. Does stemming help?
• English: very mixed results. Helps recall for some queries but
harms precision on others
– E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
106. Others: Using a thesaurus
• A thesaurus provides a standard vocabulary for indexing
(and searching)
• More precisely, a thesaurus provides a classified
hierarchy for broadening and narrowing terms
bank: 1. Finance institute
2. River edge
– if a document is indexed with bank, then index it with
“finance institute” or “river edge”
– need to disambiguate the sense of bank in the text: e.g. if
money appears in the document, then choose “finance
institute”
• A widely used online thesaurus: WordNet
107. Information storage
• Whole topic on its own
• How do we keep fresh copies of the web manageable by a cluster of
computers, and answer millions of queries in milliseconds?
– Inverted indexes
– Compression
– Caching
– Distributed architectures
– … and a lot of tricks
• Inverted indexes: cornerstone data structure of IR systems
– For each term t, we must store a list of all documents that contain t.
– Identify each doc by a docID, a document serial number
– Index construction is tricky (can’t hold all the information needed in memory)
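A minimal in-memory inverted index can be sketched as below (a toy illustration: real systems build the index out of core, compress postings, and assign docIDs themselves):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted postings list of (docID, term frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():     # naive tokenization for the sketch
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {t: sorted(p.items()) for t, p in index.items()}

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in new sales"}
index = build_index(docs)
print(index["sales"])  # -> [(1, 1), (2, 1), (3, 1)]
print(index["new"])    # -> [(1, 1), (3, 1)]
```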
109. • Most basic form:
– Document frequency
– Term frequency
– Document identifiers
term  term id  df  postings as (docID, tf)
a     1        4   (1,2), (2,5), (10,1), (11,1)
as    2        3   (1,3), (3,4), (20,1)
110. • Indexes contain more information
– Position in the document
• Useful for “phrase queries” or “proximity queries”
– Fields in which the term appears in the document
– Metadata …
– All that can be used for ranking
Example posting: (1, 2, [1,1], [2,10]), … – docID 1, tf 2, with
an occurrence in field 1 (the title) at position 1 and one in
field 2 at position 10
111. Queries
• How do we process a query?
• Several kinds of queries
– Boolean
• Chicken AND salt
• Gnome OR KDE
• Salt AND NOT pepper
– Phrase queries
– Ranked
112. List Merging
• “Exact match” queries
– Chicken AND curry
– Locate Chicken in the dictionary
– Fetch its postings
– Locate curry in the dictionary
– Fetch its postings
– Merge both postings
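The merge step can be sketched in Python; because both postings lists are kept sorted by docID, the intersection takes a single linear pass (the postings values below are made-up docIDs):

```python
def intersect(p1, p2):
    """Merge two sorted docID lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer on the smaller docID
        else:
            j += 1
    return answer

chicken = [1, 3, 5, 8, 13]   # postings for "chicken" (toy docIDs)
curry = [2, 3, 8, 9]         # postings for "curry"
print(intersect(chicken, curry))  # -> [3, 8]
```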
116. Models of information retrieval
• A model:
– abstracts away from the real world
– uses a branch of mathematics
– possibly: uses a metaphor for searching
117. Short history of IR modelling
• Boolean model (±1950)
• Document similarity (±1957)
• Vector space model (±1970)
• Probabilistic retrieval (±1976)
• Language models (±1998)
• Linkage-based models (±1998)
• Positional models (±2004)
• Fielded models (±2005)
118. The Boolean model (±1950)
• Exact matching: data retrieval (instead of
information retrieval)
– A term specifies a set of documents
– Boolean logic to combine terms / document sets
– AND, OR and NOT: intersection, union, and
difference
119. Statistical similarity between documents (±1957)
• The principle of similarity
"The more two representations agree in given elements and their
distribution, the higher would be the probability of their representing
similar information”
(Luhn 1957)
“It is here proposed that the frequency of word [term] occurrence in an
article [document] furnishes a useful measurement of word [term]
significance”
121. Zipf’s law
• Relative frequencies of terms.
• In natural language, there are a few very frequent terms and very many
very rare terms.
• Zipf’s law: The ith most frequent term has frequency proportional to 1/i .
• cfi ∝ 1/i, i.e. cfi = K/i where K is a normalizing constant
• cfi is collection frequency: the number of occurrences of the term ti in the
collection.
• Zipf’s law holds for different languages
122. Zipf consequences
• If the most frequent term (the) occurs cf1 times
– then the second most frequent term (of) occurs cf1/2 times
– the third most frequent term (and) occurs cf1/3 times …
• Equivalent: cfi = K/i where K is a normalizing factor, so
– log cfi = log K - log i
– Linear relationship between log cfi and log i
• Another power law relationship
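This log-log linearity is easy to check numerically (K below is an assumed frequency for the most common term, purely for illustration):

```python
import math

K = 1_000_000  # assumed collection frequency of the most frequent term

def zipf_cf(i, K=K):
    """Collection frequency of the i-th most frequent term under Zipf's law."""
    return K / i

# log cf_i = log K - log i: a straight line with slope -1 in log-log space.
pts = [(math.log10(i), math.log10(zipf_cf(i))) for i in (1, 10, 100, 1000)]
slopes = [round((y2 - y1) / (x2 - x1), 6)
          for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print(slopes)  # -> [-1.0, -1.0, -1.0]
```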
124. Luhn’s analysis - Observation
[Figure: frequency of terms f plotted against terms by rank order
r, with an upper and a lower cut-off – common terms above the
upper cut-off, rare terms below the lower cut-off, and
significant terms in between]
Resolving power of significant terms:
ability of terms to discriminate document content;
peaks at the rank order position half way between the two cut-offs
125. Luhn’s analysis - Implications
• Common terms are not good at representing document
content
– partly implemented through the removal of stop words
• Rare words are also not good at representing document
content
– usually nothing is done
– Not true for every “document”
• Need a means to quantify the resolving power of a term:
– associate weights to index terms
– tf×idf approach
126. Ranked retrieval
• Boolean queries are good for expert users with precise
understanding of their needs and the collection.
– Also good for applications: Applications can easily consume
1000s of results.
• Not good for the majority of users.
– Most users incapable of writing Boolean queries (or they are,
but they think it’s too much work).
– Most users don’t want to wade through 1000s of results.
• This is particularly true of web search.
127. Feast or Famine
• Boolean queries often result in either too few (=0) or too
many (1000s) results.
• Query 1: “standard user dlink 650” → 200,000 hits
• Query 2: “standard user dlink 650 no card found”: 0 hits
• It takes a lot of skill to come up with a query that produces
a manageable number of hits.
– AND gives too few; OR gives too many
128. Ranked retrieval models
• Rather than a set of documents satisfying a query expression,
in ranked retrieval, the system returns an ordering over the
(top) documents in the collection for a query
• Free text queries: Rather than a query language of operators
and expressions, the user’s query is just one or more words in
a human language
• In principle, there are two separate choices here, but in
practice, ranked retrieval has normally been associated with
free text queries and vice versa
129. Feast or famine: not a problem in ranked retrieval
• When a system produces a ranked result set, large result sets
are not an issue
– Indeed, the size of the result set is not an issue
– We just show the top k ( ≈ 10) results
– We do not overwhelm the user
– Premise: the ranking algorithm works
130. Scoring as the basis of ranked retrieval
• We wish to return in order the documents most likely to
be useful to the searcher
• How can we rank-order the documents in the collection
with respect to a query?
• Assign a score – say in [0, 1] – to each document
• This score measures how well document and query
“match”.
131. Query-document matching scores
• We need a way of assigning a score to a query/document
pair
• Let’s start with a one-term query
• If the query term does not occur in the document: score
should be 0
• The more frequent the query term in the document, the
higher the score (should be)
• We will look at a number of alternatives for this.
132. Bag of words model
• Vector representation does not consider the ordering of
words in a document
• John is quicker than Mary and Mary is quicker than John
have the same vectors
• This is called the bag of words model.
133. Term frequency tf
• The term frequency tf(t,d) of term t in document d is defined
as the number of times that t occurs in d.
• We want to use tf when computing query-document match
scores. But how?
• Raw term frequency is not what we want:
– A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
– But not 10 times more relevant.
• Relevance does not increase proportionally with term
frequency.
134. Log-frequency weighting
• The log frequency weight of term t in d is:
w(t,d) = 1 + log10 tf(t,d), if tf(t,d) > 0; 0 otherwise
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:
score(q,d) = Σ t∈q∩d (1 + log10 tf(t,d))
• The score is 0 if none of the query terms is present in the document.
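In Python, this weighting and score are a direct transcription (the term counts in `doc_tf` are made-up illustrations):

```python
import math

def log_tf_weight(tf):
    """w(t,d) = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def score(query_terms, doc_tf):
    """Sum log-tf weights over the terms shared by query and document."""
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

doc_tf = {"best": 1, "car": 10, "insurance": 1000}  # made-up term counts
print(log_tf_weight(10))                                        # -> 2.0
print(round(score(["car", "insurance", "cheap"], doc_tf), 3))   # -> 6.0
```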
135. Document frequency
• Rare terms are more informative than frequent terms
– Recall stop words
• Consider a term in the query that is rare in the collection (e.g.,
arachnocentric)
• A document containing this term is very likely to be relevant to
the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
136. Document frequency, continued
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high,
increase, line)
• A document containing such a term is more likely to be relevant than a
document that does not
• But it’s not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high,
increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.
137. idf weight
• dft is the document frequency of t: the number of documents that contain t
– dft is an inverse measure of the informativeness of t
– dft ≤ N
• We define the idf (inverse document frequency) of t by:
idft = log10 (N / dft)
– We use log10 (N/dft) instead of N/dft to “dampen” the effect of idf.
138. Effect of idf on ranking
• Does idf have an effect on ranking for one-term queries, like
– iPhone
• idf has no effect on ranking one term queries
– idf affects the ranking of documents for queries with at least
two terms
– For the query capricious person, idf weighting makes
occurrences of capricious count for much more in the final
document ranking than occurrences of person.
139. tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its
idf weight:
w(t,d) = (1 + log10 tf(t,d)) × log10 (N / dft)
• Best known weighting scheme in information retrieval
– Note: the “-” in tf-idf is a hyphen, not a minus sign!
– Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
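A direct transcription in Python (N and the tf/df values below are made-up illustrations):

```python
import math

def tf_idf(tf, df, N):
    """w(t,d) = (1 + log10 tf(t,d)) * log10(N / df(t)); 0 for an absent term."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

N = 1_000_000  # assumed number of documents in the collection
# At equal tf, a rare term contributes far more weight than a frequent one:
print(round(tf_idf(tf=10, df=100, N=N), 3))      # rare term     -> 8.0
print(round(tf_idf(tf=10, df=100_000, N=N), 3))  # frequent term -> 2.0
```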
140. Score for a document given a query
Score(q,d) = Σ t∈q∩d tf-idf(t,d)
• There are many variants
– How “tf” is computed (with/without logs)
– Whether the terms in the query are also weighted
– …
141. Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when
you apply this to a web search engine
• These are very sparse vectors - most entries are zero.
142. Statistical similarity between documents (±1957)
• Vector product
– If the vector has binary components, then the product
measures the number of shared terms
– Vector components might be "weights"
score(q,d) = Σ k∈matching terms qk · dk
143. Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the
distribution of terms in the query q and the distribution of terms
in the document d2 are very similar.
144. Vector space model (±1970)
• Documents and
queries are vectors in
a high-dimensional
space
• Geometric measures
(distances, angles)
145. Vector space model (±1970)
• Cosine of an angle:
– close to 1 if angle is small
– 0 if vectors are orthogonal
cos(d,q) = (d · q) / (‖d‖ ‖q‖) = Σ_{k=1}^{m} d_k q_k / ( √(Σ_{k=1}^{m} d_k²) × √(Σ_{k=1}^{m} q_k²) )
• With length-normalized vectors n(v) = v / ‖v‖ this is simply:
cos(d,q) = n(d) · n(q)
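A sketch of the cosine measure, representing the very sparse document and query vectors as term→weight dicts so zero entries are never stored; the weights are hypothetical:

```python
import math

def cosine(d, q):
    """cos(d,q) = (d . q) / (|d| |q|), with vectors given as
    {term: weight} dicts -- only nonzero entries are kept."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    if nd == 0 or nq == 0:
        return 0.0
    return dot / (nd * nq)

# Same direction -> cosine 1; no shared terms -> cosine 0
assert abs(cosine({"cat": 2.0}, {"cat": 5.0}) - 1.0) < 1e-9
assert cosine({"cat": 1.0}, {"dog": 1.0}) == 0.0
```

This is why the angle works where Euclidean distance fails: a document and a query with the same term distribution but very different lengths still have cosine close to 1.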
146. Vector space model (±1970)
• PRO: Nice metaphor, easily explained;
Mathematically sound: geometry;
Great for relevance feedback
• CON: Need term weighting (tf-idf);
Hard to model structured queries
147. Probabilistic IR
• An IR system has an uncertain understanding of user’s queries and
makes uncertain guesses on whether a document satisfies a query
or not.
• Probability theory provides a principled foundation for reasoning
under uncertainty.
• Probabilistic models build upon this foundation to estimate how
likely it is that a document is relevant for a query.
147
148. Event Space
• Query representation
• Document representation
• Relevance
• Event space
• Conceptually there might be pairs with same q and d,
but different r
• Sometimes we also include user u, context c, etc.
148
149. Probability Ranking Principle
• Robertson (1977)
– “If a reference retrieval system’s response to each
request is a ranking of the documents in the collection
in order of decreasing probability of relevance to the
user who submitted the request, where the
probabilities are estimated as accurately as possible
on the basis of whatever data have been made
available to the system for this purpose, the overall
effectiveness of the system to its user will be the best
that is obtainable on the basis of those data.”
• Basis for probabilistic approaches for IR
149
150. Dissecting PRP
• Probability of relevance
• Estimated accurately
• Based on whatever data available
• Best possible accuracy
– The perfect IR system!
– Assumes relevance is independent of the other
documents in the collection
150
151. Relevance?
• What is ?
– Isn’t it decided by the user? her opinion?
• User doesn’t mean a human being!
– We are working with representations
– ... or parts of the reality available to us
• 2/3 keywords, no profile, no context ...
– relevance is uncertain
• depends on what the system sees
• may be marginalized over all the
unseen context/profiles
151
152. Retrieval as binary classification
• For every (q,d), r takes two values
– Relevant and non-relevant documents
– can be extended to multiple values
• Retrieve using Bayes’ decision
– PRP is related to the Bayes error rate (lowest
possible error rate for a class)
– How do we estimate this probability?
152
153. PRP ranking
• How to represent the random variables?
• How to estimate the model’s parameters?
153
154. • d is a binary vector
• Multiple Bernoulli variables
• Under MB, we can decompose into a
product of probabilities, with likelihoods:
154
155. If the terms are not in the query:
Otherwise we need estimates for them!
155
156. Estimates
• Assign new weights for query terms based on relevant/non-relevant
documents
• Give higher weights to important terms:
                      Relevant     Non-relevant     Total
Documents with t         r            n − r           n
Documents without t    R − r      N − n − R + r     N − n
Total                    R            N − R            N
156
157. Robertson-Sparck Jones weight
157
Relevant docs with t
Relevant docs without t
Non-relevant docs with t
Non-relevant docs without t
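Under the standard formulation with 0.5 smoothing (to avoid zero counts), the Robertson-Sparck Jones weight can be sketched from the contingency table; the counts below are hypothetical:

```python
import math

def rsj_weight(r, n, R, N):
    """Robertson-Sparck Jones weight from the contingency table:
    r: relevant docs containing t,  n: docs containing t,
    R: relevant docs,               N: docs in the collection.
    The +0.5 terms smooth away zero counts."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term concentrated in the relevant set gets a large positive weight
w = rsj_weight(r=8, n=20, R=10, N=1000)
assert w > 0
```

A term that appears in most relevant documents but few non-relevant ones gets a high weight, which is exactly the "give higher weights to important terms" idea from the previous slide.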
158. Estimates without relevance info
• If we pick a relevant document, words are equally likely to be
present or absent
• Non-relevant can be approximated with the collection as a
whole
158
160. Modeling TF
• Naïve estimation: separate probability for every
outcome
• BIR had only two parameters, now we have plenty
(~many outcomes)
• We can plug in a parametric estimate for the term
frequencies
• For instance, a Poisson mixture
160
161. Okapi BM25
• Same ranking function as before but with new
estimates. Models term frequencies and
document length.
• Words are generated by a mixture of two
Poissons
• Assumes an eliteness variable (elite ~ word
occurs unusually frequently, non-elite ~ word
occurs as expected by chance).
161
163. BM25
• In order to approximate the formula, Robertson and Walker came up
with:
• Two model parameters
• Very effective
• The more words in common with the query the better
• Repetitions less important than different query words
– But more important if the document is relatively long
163
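A minimal sketch of the BM25 ranking function under a common parameterization (k1 for tf saturation, b for length normalization); the counts, collection sizes, and the exact idf variant used here are assumptions for illustration:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N,
               k1=1.2, b=0.75):
    """Okapi BM25: saturating tf contribution, softened by
    document length relative to the collection average."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        tf = doc_tf[t]
        denom = tf + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / denom
    return score

# Hypothetical counts: repetitions help, but with diminishing returns
s1 = bm25_score(["pizza"], {"pizza": 1}, 100, 100, {"pizza": 50}, 10000)
s2 = bm25_score(["pizza"], {"pizza": 10}, 100, 100, {"pizza": 50}, 10000)
assert 0 < s1 < s2 < 10 * s1  # saturation: far less than 10x
```

The assertion illustrates the slide's point: repetitions matter less than matching different query words, because the tf contribution saturates.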
164. Generative Probabilistic Language Models
• The generative approach – A generator which produces
events/tokens with some probability
– Probability distribution over strings of text
– URN Metaphor – a bucket of different colour balls (10 red, 5
blue, 3 yellow, 2 white)
• What is the probability of drawing a yellow ball? 3/20
• what is the probability of drawing (with replacement) a red ball and a
white ball? ½ × 1/10 = 1/20
– IR Metaphor: Documents are urns, full of tokens (balls) of
different terms (colors)
165. What is a language model?
• How likely is a string of words in a “language”?
– P1(“the cat sat on the mat”)
– P2(“the mat sat on the cat”)
– P3(“the cat sat en la alfombra”)
– P4(“el gato se sentó en la alfombra”)
• Given a model M and an observation s we want
– Probability of getting s through random sampling from M
– A mechanism to produce observations (strings) legal in M
• User thinks of a relevant document and then picks some keywords
to use as a query
165
166. Generative Probabilistic Models
• What is the probability of producing the query from a document? p(q|d)
• Referred to as query-likelihood
• Assumptions:
• The probability of a document being relevant is strongly correlated with
the probability of a query given a document, i.e. p(d|r) is correlated
with p(q|d)
• User has a reasonable idea of the terms that are likely to appear in the
“ideal” document
• User’s query terms can distinguish the “ideal” document from the rest
of the corpus
• The query is generated as a representative of the “ideal” document
• System’s task is to estimate for each of the documents in the collection,
which is most likely to be the “ideal” document
167. Language Models (1998/2001)
• Let’s assume we point blindly, one at a time, at 3 words
in a document
– What is the probability that I, by accident, pointed at the words
“Master”, “computer” and “Science”?
– Compute the probability, and use it to rank the documents.
• Words are “sampled” independently of each other
– Joint probability decomposed into a product of marginals
– Estimation of probabilities just by counting
• Higher-order models or unigrams?
– Parameter estimation can be very expensive
168. Standard LM Approach
• Assume that query terms are drawn identically and
independently from a document
169. Estimating language models
• Usually we don’t know M
• Maximum Likelihood Estimate of
– Simply use the number of times the query term occurs in
the document divided by the total number of term
occurrences.
• Zero Probability (frequency) problem
169
170. Document Models
• Solution: Infer a language model for each document,
where
• Then we can estimate
• Standard approach is to use the probability of a term to
smooth the document model.
• Interpolate the ML estimator with general language
expectations
171. Estimating Document Models
• Basic Components
– Probability of a term given a document (maximum likelihood estimate)
– Probability of a term given the collection
– tf(t,d) is the number of times term t occurs in document d (term frequency)
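A sketch of the smoothed query-likelihood score built from the two components above; this uses linear (Jelinek-Mercer) interpolation with a hypothetical weight lam, and toy counts:

```python
import math

def query_loglik(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log p(q|d) with linear interpolation smoothing:
    p(t|d) = lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|C|.
    The collection model keeps unseen query terms from zeroing the score."""
    logp = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_coll = coll_tf.get(t, 0) / coll_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return float("-inf")  # term absent from the whole collection
        logp += math.log(p)
    return logp

# Toy collection: smoothing rescues the query term "science",
# which is absent from the document but present in the collection
doc = {"master": 2, "computer": 1}
coll = {"master": 10, "computer": 20, "science": 5}
score = query_loglik(["master", "computer", "science"], doc, 3, coll, 1000)
assert score > float("-inf")
```

Without smoothing, the single missing term "science" would give the document a probability of zero, which is exactly the zero-frequency problem from the previous slide.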
173. Implementation as vector product
p(t|D) = tf(t,D) / Σ_{t'} tf(t',D)        p(t) = df(t) / Σ_{t'} df(t')
Recall: score(q,d) = q · d, with q_k = tf(k,q) and
d_k = log( 1 + ( tf(k,d) / df(k) ) × ( Σ_t df(t) / Σ_t tf(t,d) ) )
– tf(k,d) / df(k): tf.idf of term k in document d
– 1 / Σ_t tf(t,d): inverse length of d
– Σ_t df(t): term importance
– the log(1 + …) expresses the odds of the probability of matching text
174. Document length normalization
• Probabilistic models assume causes for documents differing in
length
– Scope
– Verbosity
• In practice, document length softens the term frequency
contribution to the final score
– We’ve seen it in BM25 and LMs
– Usually with a tunable parameter that regulates the
amount of softening
– Can be a function of the deviation of the average
document length
– Can be incorporated into vanilla tf-idf
174
175. Other models
• Modeling term dependencies (positions) in the language
modeling framework
– Markov Random Fields
• Modeling matches (occurrences of words) in different
parts of a document -> fielded models
– BM25F
– Markov Random Fields can account for this as well
175
176. More involved signals for ranking
• From document understanding to query
understanding
• Query rewrites (gazetteers, spell correction),
named entity recognition, query suggestions,
query categories, query segmentation ...
• Detecting query intent, triggering verticals
– direct target towards answers
– richer interfaces
176
177. Signals for Ranking
• Signals for ranking: matches of query terms in
documents, query-independent quality measures,
CTR, among others
• Probabilistic IR models are all about counting
– occurrences of terms in documents, in sets of
documents, etc.
• How to aggregate efficiently a large number of
“different” counts
– coming from the same terms
– no double counts!
177
178. Searching for food
• New York’s greatest pizza
‣ New OR York’s OR greatest OR pizza
‣ New AND York’s AND greatest AND pizza
‣ New OR York OR great OR pizza
‣ “New York” OR “great pizza”
‣ “New York” AND “great pizza”
‣ York < New AND great OR pizza
• among many more.
178
179. “Refined” matching
• Extract a number of virtual regions in the document
that match some version of the query (operators)
– Each region provides a different evidence of
relevance (i.e. signal)
• Aggregate the scores over the different regions
• Ex. :“at least any two words in the query appear
either consecutively or with an extra word between
them”
179
181. Remember BM25
• Term (tf) independence
• Vague Prior over terms not
appearing in the query
• Eliteness - topical model that
perturbs the word distribution
• 2-poisson distribution of term
frequencies over relevant and non-relevant
documents
181
182. Feature dependencies
• Class-linearly dependent (or affine) features
– add no extra evidence/signal
– model overfitting (vs capacity)
• Still, it is desirable to enrich the model with more
involved features
• Some features are surprisingly correlated
• Positional information requires a large number of
parameters to estimate
• Potentially up to
182
183. Query concept segmentation
• Queries are made up of basic conceptual units,
comprising many words
– “Indian summer victor herbert”
• Spurious matches: “san jose airport” -> “san jose
city airport”
• Model to detect segments based on generative
language models and Wikipedia
• Relax matches using factors of the max ratio
between span length and segment length
183
184. Virtual regions
• Different parts of the document
provide different evidence of
relevance
• Create a (finite) set of (latent)
artificial regions and re-weight
184
185. Implementation
• An operator maps a query to a set of queries,
which could match a document
• Each operator has a weight
• The average term frequency in a document is
185
186. Remarks
• Different saturation (eliteness) function?
– learn the real functional shape!
– log-logistic is good if the class-conditional
distributions are drawn from an exp. family
• Positions as variables?
– kernel-like method or exp. #parameters
• Apply operators on a per query or per query class
basis?
186
187. Operator examples
• BOW: maps a raw query to the set of queries
whose elements are the single terms
• p-grams: set of all p-gram of consecutive terms
• p-and: all conjunctions of p arbitrary terms
• segments: match only the “concepts”
• Enlargement: some words might sneak in
between the phrases/segments
187
189. ... not that far away
term frequency
link information
query intent information
editorial information
click-through information
geographical information
language information
user preferences
document length
document fields
other gazillion sources of information
189
190. Dictionaries
• Fast look-up
– Might need specific structures to scale up
• Hash tables
• Trees
– Tolerant retrieval (prefixes)
– Spell checking
• Document correction (OCR)
• Query misspellings (did you mean … ?)
• (Weighted) edit distance – dynamic programming
• Jaccard overlap (index character k-grams)
• Context sensitive
• http://norvig.com/spell-correct.html
– Wild-card queries
• Permuterm index
• K-gram indexes
190
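The edit-distance computation mentioned above can be sketched with the classic dynamic program (here the unweighted Levenshtein variant, keeping only one previous row):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming: minimum number
    of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "did you mean ...?" style check: a transposition costs two edits here
assert edit_distance("retreival", "retrieval") == 2
assert edit_distance("google", "google") == 0
```

A spell checker would compare the query term against candidate dictionary terms (pre-filtered, e.g. with character k-grams and Jaccard overlap) and suggest the closest one.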
191. Hardware basics
• Access to data in memory is much faster than access to data on disk.
• Disk seeks: No data is transferred from disk while the disk head is being
positioned.
• Therefore: Transferring one large chunk of data from disk to memory is
faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of entire blocks (as opposed
to smaller chunks).
• Block sizes: 8KB to 256 KB.
191
192. Hardware basics
• Many design decisions in information retrieval are based on the
characteristics of hardware
• Servers used in IR systems now typically have several GB of main memory,
sometimes tens of GB.
• Available disk space is several (2-3) orders of magnitude larger.
• Fault tolerance is very expensive: It is much cheaper to use many regular
machines rather than one fault tolerant machine.
192
194. MapReduce
• The index construction algorithm we just described is an instance of
MapReduce.
• MapReduce (Dean and Ghemawat 2004) is a robust and conceptually
simple framework for distributed computing …
• … without having to write code for the distribution part.
• They describe the Google indexing system (ca. 2002) as consisting of a
number of phases, each implemented in MapReduce.
• Open source implementation Hadoop
– Widely used throughout industry
194
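The map/reduce split can be sketched in-process for index construction; the function names and the single-machine simulation of the shuffle/sort step are illustrative, not the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # emit (term, doc_id) pairs, one per token, as in index construction
    for term in text.lower().split():
        yield (term, doc_id)

def reduce_phase(pairs):
    # group pairs by term and collect sorted postings lists
    pairs = sorted(pairs)  # stands in for the framework's shuffle/sort
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=itemgetter(0))}

docs = {1: "new york pizza", 2: "great pizza"}
pairs = [p for doc_id, text in docs.items()
         for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
assert index["pizza"] == [1, 2]
```

In a real deployment the map and reduce tasks run on many machines, and the framework handles partitioning, sorting, and fault tolerance.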
195. MapReduce
• Index construction was just one phase.
• Another phase: transforming a term-partitioned index
into a document-partitioned index.
– Term-partitioned: one machine handles a subrange of
terms
– Document-partitioned: one machine handles a
subrange of documents
• Most search engines use a document-partitioned index for
better load balancing, etc.
195
196. Distributed IR
• Basic process
– All queries sent to a director machine
– Director then sends messages to many index servers
• Each index server does some portion of the query processing
– Director organizes the results and returns them to the user
• Two main approaches
– Document distribution
• by far the most popular
– Term distribution
196
197. Distributed IR (II)
• Document distribution
– each index server acts as a search engine for a small fraction of
the total collection
– director sends a copy of the query to each of the index servers,
each of which returns the top k results
– results are merged into a single ranked list by the director
• Collection statistics should be shared for effective ranking
197
198. Caching
• Query distributions similar to Zipf
• About half of each day’s queries are unique, but some are very popular
– Caching can significantly improve effectiveness
• Cache popular query results
• Cache common inverted lists
– Inverted list caching can help with unique queries
– Cache must be refreshed to prevent stale data
198
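A minimal sketch of a query-result cache with least-recently-used eviction; the class name and capacity are hypothetical:

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache for popular query results; with a Zipf-like query
    distribution a small cache absorbs a large share of the traffic."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, query):
        if query not in self.store:
            return None
        self.store.move_to_end(query)  # mark as recently used
        return self.store[query]

    def put(self, query, results):
        self.store[query] = results
        self.store.move_to_end(query)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = QueryCache(capacity=2)
cache.put("pizza", ["d1", "d2"])
cache.put("new york", ["d3"])
cache.get("pizza")             # touch: "pizza" is now most recent
cache.put("weather", ["d4"])   # evicts "new york"
assert cache.get("new york") is None
assert cache.get("pizza") == ["d1", "d2"]
```

The same structure works for caching inverted lists; in either case entries must be invalidated when the index is refreshed to avoid serving stale data.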
199. Others
• Efficiency (compression, storage, caching,
distribution)
• Novelty and diversity
• Evaluation
• Relevance feedback
• Learning to rank
• User models
– Context, personalization
• Sponsored Search
• Temporal aspects
• Social aspects
199
Not only is the data different, but also the queries, and the results we get from it!
To the surprise of many, the search box has become the preferred method of information access.
Customers ask: Why can’t I search my database in the same way?
Archie is a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine.
In the summer of 1993, no search engine existed for the web, just catalogs.
One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.
In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.[7][8]
Google adopted the idea of selling search terms in 1998, from a small search engine company named goto.com. This move had a significant effect on the SE business, which went from struggling to one of the most profitable businesses in the internet.[6]
Aardvark was a social search service that connected users live with friends or friends-of-friends who were able to answer their questions, also known as a knowledge market. It was bought by Google in 2010.
Kaltix Corp., commonly known as Kaltix is a personalized search engine company founded at Stanford University in June 2003 by Sep Kamvar, Taher Haveliwala and Glen Jeh.[1][2] It was acquired by Google in September 2003.
How do we communicate with search engines
Information needs must be expressed as a query
– But users don’t often know what they want
ASK
Hypothesis Belkin et al (1982)
Proposed a model called Anomalous State of Knowledge
ASK
hypothesis:
– difficult for people to define exactly what their information need is, because that information is a gap in their knowledge
- Search Engines should look for information that fills those gaps
Interesting ideas, little practical impact (yet)
Under specified
Ambiguous
Context sensitive
represent different types of search
– E.g. decision making
– background search
– fact search
Need to have fairly deep knowledge...
– What sites are possible
– What’s in a given site (what’s likely to be there)
– Authority of source / site
– Index structure (time, place, person, ...) what kinds of searches?
– How to read a SERP critically
Commonplace book
Start with the simplest search you can think of:
[ upper lip indentation ]
If it’s not right, you can always modify it.
• When I did this, I clicked on the first result, which took me to Yahoo Answers. There’s a nice article there about something called the philtrum.
Ghost town vs abandoned
1750
Search for images with creative commons attributions
The need is verbalized mentally
Queries and documents must share a (at least comparable if not the same) representation
SCC – strongly connected component
IN – pages not discovered yet
OUT – sites that contain only in-host link
Tendrils – can’t reach or be reached from the SCC
creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow.
pages filled with a large number of characters, crashing the lexical analyzer parsing the page.
pages with session-id's based on required cookies.
Data: this type of data is conventionally dealt with by a database management system.
Structure: With this view, documents are not treated as flat entities, so a document and its components (e.g. sections) can be retrieved
How do we arrive to the content representation of a document?
Nontrivial issues. Requires some design decisions.
Nontrivial issues. Requires some design decisions.
Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find.
The problems with large document units can be alleviated by use of explicit or implicit proximity search
A simple strategy is to just split on all non-alphanumeric characters – bad
you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenize
Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait)
Index numbers -> (One answer is using n-grams: IIR ch. 3)
Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words
No unique tokenization + completely different interpretation of a sequence depending on where you split
Nevertheless: “Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results.” (Though you can explicitly ask for them to remain.)
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.
The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
An alternative to creating equivalence classes is to maintain relations between not normalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in
Too much equivalence classing
Why not the reverse?
Also stemmers based on n-grams
For example trigrams: information => {inf, nfo, for, etc}
caresses
parties
separational -> separate
factional -> faction
Compression
Cache pressure
The distribution of term frequencies is similar for different texts of significantly large size.
Heaps’ law gives the vocabulary size in collections.
Positional indexes are helpful, but we’ll ignore them for now
(Salton & McGill 1983)
The classifier that assigns a vector x to the class with the highest posterior is called the Bayes classifier.
The error associated with this classifier is called the Bayes error. This is the lowest possible error rate for any classifier over the distribution of all examples and for a chosen hypothesis space
A complete probability distribution over documents
− defines a likelihood for any possible document d (observation)
− P(relevant) via P(document): P(R|d) ∝ P(d|R)·P(R)
− can “generate” synthetic documents that will share some properties of the original collection
Not all IR models do this – it is possible to estimate P(R|d) directly, e.g. with logistic regression
Assumptions: one relevance value for every word w
Words are conditionally independent given R – false, but it lowers the number of parameters
All words absent are equally likely to be observed in relevant and not relevant classes
One relevance status value per word
empty document (all words absent) is equally likely
to be observed in relevant and non-relevant classes (provides a natural zero) - practical reason, only score terms that appear in the query (TAT)
Doesn’t model word dependence. Doesn’t account for document length. Doesn’t model word frequencies
Now D_t = d_t account for the number of times we observe the term in the document (we have a vector of frequencies)
Can be seen as a probabilistic automaton
They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980's (see e.g. Rabiner 1990). Automatic speech recognition systems combine probabilities of two distinct models: the acoustic model and the language model. The acoustic model might for instance produce the following candidate texts in decreasing order of probability: “food born thing”, “good corn sing”, “mood morning”, and “good morning”. Now, the language model would determine that the phrase “good morning” is much more probable, i.e., it occurs more frequently in English than the other phrases. When combined with the acoustic model, the system is able to decide that “good morning” was the most likely utterance, thereby increasing the system's performance.
For information retrieval, language models are built for each document. By following this approach, the language model of the book you are reading now would assign an exceptionally high probability to the word “retrieval”, indicating that this book would be a good candidate for retrieval if the query contains this word.
For some applications we want strings like P3 to be highly probable as well; in IR, a unigram model gives P1 = P2.
Veto terms
Originally multiple Bernoulli; the multinomial is widely used now
accounts for multiple word occurrences in the query (primitive) – well understood: lots of research in related fields (and now in IR) – possibility for integration with ASR/MT/NLP (same event space)
Discounting methods
Problem with all discounting methods:
– discounting treats unseen words equally (add or subtract ε) – some words are more frequent than others
Essentially, the data model and retrieval function are one and the same
Different ways of smoothing; Dirichlet prior smoothing is particularly popular