Lecture 1: Introduction and the Boolean Model
Information Retrieval
Computer Science Tripos Part II
Simone Teufel
Natural Language and Information Processing (NLIP) Group
Simone.Teufel@cl.cam.ac.uk
Overview
1 Motivation
Definition of “Information Retrieval”
IR: beginnings to now
2 First Boolean Example
Term-Document Incidence matrix
The inverted index
Processing Boolean Queries
Practicalities of Boolean Search
What is Information Retrieval?
Manning et al, 2008:
Information retrieval (IR) is finding material . . . of an unstructured
nature . . . that satisfies an information need from within large
collections . . . .
Document Collections
IR in the 17th century: Samuel Pepys, the famous English diarist,
subject-indexed his treasured library of 1000+ books with key words.
What we mean here by document collections
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents)
of an unstructured nature . . . that satisfies an information need
from within large collections (usually stored on computers).
Document Collection: text units we have built an IR system
over.
Usually documents
But could be
memos
book chapters
paragraphs
scenes of a movie
turns in a conversation...
Lots of them
IR Basics
[schematic] A query is posed to the IR system, which searches a
document collection and returns the set of relevant documents:
Query → IR System (Document Collection) → Set of relevant documents
In web search, the collection consists of web pages:
Query → IR System (web pages) → Set of relevant web pages
What is Information Retrieval?
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents)
of an unstructured nature . . . that satisfies an information need
from within large collections (usually stored on computers).
Structured vs Unstructured Data
Unstructured data means that a formal, semantically overt,
easy-for-computer structure is missing.
In contrast to the rigidly structured data used in DB style
searching (e.g. product inventories, personnel records)
SELECT *
FROM business_catalogue
WHERE category = ’florist’
AND city_zip = ’cb1’
This does not mean that there is no structure in the data
Document structure (headings, paragraphs, lists. . . )
Explicit markup formatting (e.g. in HTML, XML. . . )
Linguistic structure (latent, hidden)
Information Needs and Relevance
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
An information need is the topic about which the user desires
to know more.
A query is what the user conveys to the computer in an
attempt to communicate the information need.
A document is relevant if the user perceives that it contains
information of value with respect to their personal information
need.
Types of information needs
Manning et al, 2008:
Information retrieval (IR) is finding material . . . of an unstructured
nature . . . that satisfies an information need from within large
collections . . . .
Known-item search
Precise information seeking search
Open-ended search (“topical search”)
Information scarcity vs. information abundance
Information scarcity problem (or needle-in-haystack problem):
hard to find rare information
Lord Byron’s first words, at age 3: a long sentence to his
nurse in perfect English?
. . . when a servant had spilled an urn of hot coffee over his legs, he replied to
the distressed inquiries of the lady of the house, ’Thank you, madam, the
agony is somewhat abated.’ [not Lord Byron, but Lord Macaulay]
Information abundance problem (for more clear-cut
information needs): redundancy of obvious information
What is toxoplasmosis?
Relevance
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
Are the retrieved documents
about the target subject?
up-to-date?
from a trusted source?
satisfying the user’s needs?
How should we rank documents in terms of these factors?
More on this in a lecture soon
How well has the system performed?
The effectiveness of an IR system (i.e., the quality of its search
results) is determined by two key statistics about the system’s
returned results for a query:
Precision: What fraction of the returned results are relevant to
the information need?
Recall: What fraction of the relevant documents in the
collection were returned by the system?
What is the best balance between the two?
Easy to get perfect recall: just retrieve everything
Easy to get good precision: retrieve only the most relevant
There is much more to say about this – lecture 5
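A worked toy example (the document IDs and relevance judgments below
are invented purely for illustration):

returned = {1, 3, 5, 7}       # documents the system returned
relevant = {3, 7, 8, 9, 10}   # documents that satisfy the information need
hits = returned & relevant    # relevant documents among those returned
precision = len(hits) / len(returned)   # 2/4 = 0.50
recall    = len(hits) / len(relevant)   # 2/5 = 0.40
print(precision, recall)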
IR today
Web search
The search ground is billions of documents on millions of
computers
Issues: spidering; efficient indexing and search; malicious
manipulation to boost search engine rankings
Link analysis covered in Lecture 8
Enterprise and institutional search
e.g., a company’s documentation, patents, research articles
Often domain-specific
Centralised storage; dedicated machines for search
Most prevalent IR evaluation scenario: US intelligence analyst’s
searches
Personal information retrieval (email, personal documents)
e.g., Mac OS X Spotlight; Windows’ Instant Search
Issues: different file types; maintenance-free, lightweight enough to
run in the background
A short history of IR
1945: memex (Vannevar Bush)
1950s: term “IR” coined by Calvin Mooers; literature searching
systems; evaluation by precision and recall (Alan Kent)
1960s: Cranfield experiments; Boolean IR; SMART
1970s-1980s: Salton; the vector space model (VSM)
[timeline also sketches the precision/recall trade-off as the number
of items retrieved grows]
1990s: TREC; PageRank
2000s: multimedia; multilingual retrieval (CLEF); recommendation
systems
IR for non-textual media
Similarity searches
[image-only slides]
Areas of IR
“Ad hoc” retrieval (lectures 1-5)
web retrieval (lecture 8)
Support for browsing and filtering document collections:
Clustering (lecture 6)
Classification: using fixed labels (common information needs,
age groups, topics; lecture 7)
Further processing a set of retrieved documents, e.g., by using
natural language processing
Information extraction
Summarisation
Question answering
Overview
1 Motivation
Definition of “Information Retrieval”
IR: beginnings to now
2 First Boolean Example
Term-Document Incidence matrix
The inverted index
Processing Boolean Queries
Practicalities of Boolean Search
Boolean Retrieval
In the Boolean retrieval model we can pose any query in the
form of a Boolean expression of terms
i.e., one in which terms are combined with the operators and,
or, and not.
Shakespeare example
Brutus AND Caesar AND NOT Calpurnia
Which plays of Shakespeare contain the words Brutus and
Caesar, but not Calpurnia?
Naive solution: linear scan through all text – “grepping”
In this case, grepping works OK (Shakespeare’s collected works have
fewer than 1M words).
But in the general case, with much larger text collections, we
need to index.
Indexing is an offline operation that collects data about which
words occur in a text, so that at search time you only have to
access the precompiled index.
The term-document incidence matrix
Main idea: record for each document whether it contains each
word out of all the different words Shakespeare used (about 32K).
              Antony and  Julius   The
              Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony             1          1       0        0       0        1
Brutus             1          1       0        1       0        0
Caesar             1          1       0        1       1        1
Calpurnia          0          1       0        0       0        0
Cleopatra          1          0       0        0       0        0
mercy              1          0       1        1       1        1
worser             1          0       1        1       1        0
. . .
Matrix element (t, d) is 1 if the play in column d contains the
word in row t, 0 otherwise.
Query “Brutus AND Caesar AND NOT Calpurnia”
We compute the results for our query as the bitwise AND between the
vectors for Brutus, Caesar and the complement of Calpurnia:
              Antony and  Julius   The
              Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony             1          1       0        0       0        1
Brutus             1          1       0        1       0        0
Caesar             1          1       0        1       1        1
¬Calpurnia         1          0       1        1       1        1
Cleopatra          1          0       0        0       0        0
mercy              1          0       1        1       1        1
worser             1          0       1        1       1        0
AND                1          0       0        1       0        0
Bitwise AND returns two documents, “Antony and Cleopatra” and
“Hamlet”.
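A minimal sketch of this computation in Python (the integer-bitmask
encoding is one illustrative choice, not part of the lecture; the
incidence rows are read off the table above):

# One bit per play; leftmost = Antony and Cleopatra, rightmost = Macbeth.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask = 0b111111                                   # six plays in total
result = brutus & caesar & (~calpurnia & mask)    # AND with the complement
print(format(result, "06b"))                      # 100100: Antony and Cleopatra, Hamlet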
The results: two documents
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring, and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
Bigger collections
Consider N = 10^6 documents, each with about 1000 tokens
⇒ 10^9 tokens at an average of 6 bytes per token ⇒ 6 GB
Assume there are M = 500,000 distinct terms in the collection
Size of the incidence matrix is then 500,000 × 10^6
Half a trillion 0s and 1s
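A quick sanity check of the arithmetic (plain Python, nothing
IR-specific):

N = 10**6                 # documents
tokens = N * 1000         # 10^9 tokens in the collection
print(tokens * 6)         # 6,000,000,000 bytes = 6 GB of raw text
M = 500_000               # distinct terms
print(M * N)              # 500,000,000,000 matrix cells: half a trillion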
Can’t build the Term-Document incidence matrix
Observation: the term-document matrix is very sparse
Contains no more than one billion 1s.
Better representation: only represent the things that do occur
Term-document matrix has other disadvantages, such as lack
of support for more complex query operators (e.g., proximity
search)
We will move towards richer representations, beginning with
the inverted index.
The inverted index
The inverted index consists of
a dictionary of terms (also: lexicon, vocabulary)
and a postings list for each term, i.e., a list that records which
documents the term occurs in.
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132, 179
Calpurnia → 2, 31, 54, 101
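A minimal sketch of index construction in Python (the toy corpus and
the whitespace tokenisation are invented for illustration; a real
indexer would also normalise terms and compress the postings):

from collections import defaultdict

def build_inverted_index(docs):
    """docs: docID -> text. Returns term -> sorted postings list."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():    # crude tokenisation
            postings[term].add(doc_id)
    # sort each list: merge-based intersection relies on sorted postings
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Brutus killed Caesar",
        2: "Caesar married Calpurnia",
        4: "Brutus and Caesar"}
index = build_inverted_index(docs)
print(index["caesar"])   # [1, 2, 4]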
Processing Boolean Queries: conjunctive queries
Our Boolean Query
Brutus AND Calpurnia
Locate the postings lists of both query terms and intersect them.
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Calpurnia → 2, 31, 54, 101
Intersection → 2, 31
Note: this only works if postings lists are sorted
Algorithm for intersection of two postings
INTERSECT(p1, p2)
  answer ← ⟨⟩
  while p1 ≠ NIL and p2 ≠ NIL
    do if docID(p1) = docID(p2)
         then ADD(answer, docID(p1))
              p1 ← next(p1)
              p2 ← next(p2)
         else if docID(p1) < docID(p2)
                then p1 ← next(p1)
                else p2 ← next(p2)
  return answer

Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Calpurnia → 2, 31, 54, 101
Intersection → 2, 31
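The same merge in runnable Python (a sketch; list indices stand in
for the pointer/next(p) interface of the pseudocode):

def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus    = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))   # [2, 31]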
Complexity of the Intersection Algorithm
Bounded by worst-case length of postings lists
Thus “officially” O(N), with N the number of documents in
the document collection
But in practice much, much better than linear scanning,
which is asymptotically also O(N)
Query Optimisation: conjunctive terms
Organise order in which the postings lists are accessed so that least
work needs to be done
Brutus AND Caesar AND Calpurnia
Process terms in increasing document frequency: execute as
(Calpurnia AND Brutus) AND Caesar
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174    (8 postings)
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132, 179  (9 postings)
Calpurnia → 2, 31, 54, 101                   (4 postings)
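A sketch of this heuristic, reusing intersect and the postings lists
from the sketch above (document frequency is simply the length of a
postings list here):

def intersect_many(postings_lists):
    """Intersect several postings lists, shortest (rarest term) first."""
    ordered = sorted(postings_lists, key=len)   # increasing document frequency
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect(result, plist)       # intermediate results only shrink
        if not result:
            break                               # early exit: intersection is empty
    return result

caesar = [1, 2, 4, 5, 6, 16, 57, 132, 179]
print(intersect_many([brutus, caesar, calpurnia]))   # [2]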
Query Optimisation: disjunctive terms
(maddening OR crowd) AND (ignoble OR strife) AND (killed OR slain)
Process the query in increasing order of the size of each
disjunctive term
Estimate this in turn (conservatively) by the sum of
frequencies of its disjuncts
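A sketch of the conservative estimate for ordering OR-clauses (the
document frequencies below are invented; in practice they are read
off the dictionary):

# Estimate each OR-clause by the sum of its disjuncts' document
# frequencies (an upper bound on the union), then process clauses
# from the smallest estimate upward.
df = {"maddening": 10, "crowd": 80, "ignoble": 5, "strife": 40,
      "killed": 300, "slain": 60}
clauses = [("maddening", "crowd"), ("ignoble", "strife"), ("killed", "slain")]
order = sorted(clauses, key=lambda clause: sum(df[t] for t in clause))
print(order)   # [('ignoble', 'strife'), ('maddening', 'crowd'), ('killed', 'slain')]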
Practical Boolean Search
Provided by large commercial information providers, 1960s-1990s
Complex query language; complex and long queries
Extended Boolean retrieval models with additional operators –
proximity operators
Proximity operator: two terms must occur close together in a
document (in terms of certain number of words, or within
sentence or paragraph)
Unordered results...
Example: Westlaw
Largest commercial legal search service – 500K subscribers
Boolean Search and ranked retrieval both offered
Document ranking only with respect to chronological order
Expert queries are carefully defined and incrementally
developed
Westlaw Queries/Information Needs
“trade secret” /s disclos! /s prevent /s employe!
Information need: Information on the legal theories involved in
preventing the disclosure of trade secrets by employees formerly
employed by a competing company.
disab! /p access! /s work-site work-place (employment /3 place)
Information need: Requirements for disabled people to be able to
access a workplace.
host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
Information need: Cases about a host’s responsibility for drunk
guests.
Comments on Westlaw
Proximity operators: /3 = within 3 words, /s = within the same
sentence, /p = within the same paragraph
Space is disjunction, not conjunction (This was standard in
search pre-Google.)
Long, precise queries: incrementally developed, unlike web
search
Why professional searchers like Boolean queries: precision,
transparency, control.
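A sketch of a within-k-words proximity check (a simplification: real
engines evaluate proximity on positional postings lists rather than
raw text; startswith crudely mimics the ! truncation wildcard):

def within_k(text, t1, t2, k):
    """True if some occurrence of t1 is within k words of some t2."""
    tokens = text.lower().split()
    pos1 = [i for i, tok in enumerate(tokens) if tok.startswith(t1)]
    pos2 = [i for i, tok in enumerate(tokens) if tok.startswith(t2)]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

doc = "the employer may prevent disclosure of any trade secret by employees"
print(within_k(doc, "disclos", "prevent", 3))   # True (positions 4 and 3)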
Does Google use the Boolean Model?
On Google, the default interpretation of a query [w1 w2 ... wn] is
w1 AND w2 AND ... AND wn
Cases where you get hits which don’t contain one of the wi:
Page contains variant of wi (morphology, misspelling,
synonym)
long query (n is large)
Boolean expression generates very few hits
wi was in the anchor text
Google also ranks the result set
Simple Boolean Retrieval returns matching documents in no
particular order.
Google (and most well-designed Boolean engines) rank hits
according to some estimator of relevance
Reading
Manning, Raghavan, Schütze: Introduction to Information
Retrieval (MRS), chapter 1