Lecture 02
Information Retrieval
[Figure: the retrieval process. Labels: Information Need, User Task, Query Formulation, Search Engine, Collection, Results, Query Re-Formulation.]
[Figure: the same process annotated with where it can go wrong. Labels: Misconception, Misformulation, Mis-Reformulation.]
Boolean Retrieval Model
An Example of an IR Problem
• Task:
• Suppose you want to determine which plays of Shakespeare
contain the words Brutus AND Caesar AND NOT Calpurnia.
• The brute-force way: read through all the text, noting for each
play whether it contains Brutus and Caesar, and excluding it from
consideration if it contains Calpurnia.
• This sort of linear scan through documents is the simplest
form of document retrieval for a computer to do.
• This process is commonly referred to as grepping through text,
after the Unix command grep, which performs this process.
• This can be a very efficient process for wildcard pattern matching.
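The linear scan described above can be sketched in a few lines of Python. This is only an illustration: the toy collection `plays` and the helper names `words` and `linear_scan` are assumptions made for the example, not part of the lecture.

```python
import re

# Hypothetical toy collection; titles and texts are made up for illustration.
plays = {
    "Play A": "Brutus spoke and Caesar answered",
    "Play B": "Caesar and Brutus met Calpurnia",
    "Play C": "Caesar stood alone",
}

def words(text):
    """Lowercase a text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def linear_scan(collection):
    """Scan every document for: brutus AND caesar AND NOT calpurnia."""
    hits = []
    for title, text in collection.items():
        w = words(text)
        if "brutus" in w and "caesar" in w and "calpurnia" not in w:
            hits.append(title)
    return hits

print(linear_scan(plays))  # ['Play A']
```

Every query re-reads every document, so the cost grows with the total size of the collection.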
• In other scenarios, we need more than “grepping”:
1. To process large document collections quickly. The amount
of online data has grown at least as quickly as the speed of
computers, and we would now like to be able to search
collections that total on the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it
is impractical to perform the query Romans NEAR countrymen
with grep, where NEAR might be defined as “within 5 words” or
“within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best
answer to an information need among many documents that
contain certain words.
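To make the NEAR operator from item 2 concrete, here is a minimal sketch. It assumes that “within k words” means the two terms occur at token positions at most k apart; the function name `near` and the sample sentence are illustrative only.

```python
def near(tokens, a, b, k=5):
    """True if some occurrence of a and some occurrence of b
    are at most k token positions apart (assumed reading of NEAR)."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

tokens = "friends romans countrymen lend me your ears".split()
print(near(tokens, "romans", "countrymen"))  # True
print(near(tokens, "friends", "ears", k=5))  # False: 6 positions apart
```

A check like this needs word positions, which a plain line-oriented pattern match does not provide.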
• The idea behind building indexes:
• To avoid linearly scanning the texts for each query, we build
an index for the documents in advance.
• The result for our initial task would be a binary term-document
incidence matrix: one row per term, one column per document,
with a 1 wherever the term occurs in the document.
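As a rough sketch, a binary term-document incidence matrix can be built as follows. The three short documents are hypothetical stand-ins for the plays, and whitespace splitting is assumed to be an adequate tokenizer for them.

```python
# Hypothetical toy collection (not the lecture's Shakespeare data).
docs = [
    "brutus killed caesar",
    "caesar loved calpurnia",
    "brutus and calpurnia",
]

# Vocabulary: every distinct term, in sorted order.
terms = sorted({t for d in docs for t in d.split()})

# matrix[term][j] is 1 if the term occurs in document j, else 0.
matrix = {t: [1 if t in d.split() else 0 for d in docs] for t in terms}

for t in terms:
    print(f"{t:10s}", matrix[t])
```

Each row is the incidence vector of one term, which is the form the Boolean query evaluation works on.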
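• To answer the query “Brutus AND Caesar AND NOT
Calpurnia”, we take the vectors for Brutus, Caesar
and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND (NOT 010000)
= 110100 AND 110111 AND 101111 = 100100
Answer: the plays in the first and fourth columns of the matrix,
where the result vector has its 1 bits.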
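The same computation can be checked with Python integers used as bit vectors. The variable names are illustrative. The six-bit mask is needed because Python's `~` on an unbounded integer would otherwise produce a negative number.

```python
N = 6                  # number of documents (bits per vector)
MASK = (1 << N) - 1    # keeps the complement to N bits

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000   # its complement is 101111, as on the slide

result = brutus & caesar & (~calpurnia & MASK)
bits = format(result, "06b")
print(bits)  # 100100

# Columns (1-based, left to right) whose bit is set.
matching = [i + 1 for i, bit in enumerate(bits) if bit == "1"]
print(matching)  # [1, 4]
```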
• The Boolean retrieval model is a model for information
retrieval in which we can pose any query that is in the form of a
Boolean expression of terms, that is, in which terms are
combined with the operators AND, OR, and NOT.
• Among the limitations of this model is that it views each
document as just a set of words: word order and the number of
occurrences are ignored.
Boolean Retrieval Model
Further Terminology & Notations
• Documents: whatever units we have decided to build a retrieval system
over.
• Collection: the group of documents over which we perform retrieval.
Sometimes it is also referred to as a corpus.
• In our previous example the collection is “Shakespeare’s Collected Works”,
and each play is a document.
• Ad hoc retrieval: the most standard IR task. In it, a system aims to
provide documents from within the collection that are relevant to an
arbitrary user information need, communicated to the system by means of
a one-off, user-initiated query (a temporary information need). This model is
also known as the pull model of text access.
• Recommender systems: when the user has a stable information need
(e.g. a research topic or an interest in sports news), the system takes the
initiative and recommends topics related to that information need. This
model is also known as the push model of text access.
• Information need: the topic about which the user desires to know more.
• Query: what the user conveys to the computer in an attempt to
communicate the information need.
• Relevance: a document is relevant if the user perceives it as
containing information of value with respect to their
personal information need.
• We would like to find relevant documents regardless of whether they
use the exact words of the query or express the concept
we are looking for in other words.
• Effectiveness: the quality of an IR system's search results. It
is measured by how relevant the results returned for a query
are to the user's information need.
• Two key statistics about the system’s returned results are used
to measure effectiveness:
• Precision: what fraction of the returned results are relevant to
the information need?
$$P = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}$$
• Recall: what fraction of the relevant documents in the collection
were returned by the system?
$$R = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in collection}}$$
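A small worked example of the two formulas, as a sketch. The docIDs are made up, and returning 0.0 when a denominator is empty is a convention assumed here rather than something the definitions specify.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 5 relevant in the collection, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```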
Boolean Retrieval Model
Building a term-document matrix
• In practice, we cannot build the full term-document incidence
matrix for a large collection.
Let d = 1 million documents, each containing about 1000
words.
* Assume an average of 6 bytes per word, including spaces and
punctuation; then the document collection is about 6 GB in
size.
* Assume there are about M = 500,000 distinct terms in these
documents.
A 500K × 1M matrix has half a trillion 0’s and 1’s, too many to fit
in a computer’s memory.
But the crucial observation is that the matrix is extremely sparse,
that is, it has few non-zero entries.
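The arithmetic behind these figures can be checked directly. The bound on the number of 1's is a supporting step not spelled out above: a document of 1000 words contains at most 1000 distinct terms, so each column has at most 1000 ones. Variable names are illustrative.

```python
num_docs = 1_000_000     # d
words_per_doc = 1_000
bytes_per_word = 6
num_terms = 500_000      # M

collection_bytes = num_docs * words_per_doc * bytes_per_word
cells = num_terms * num_docs

# At most words_per_doc distinct terms per document, hence per column.
max_ones = num_docs * words_per_doc
min_zero_fraction = 1 - max_ones / cells

print(collection_bytes / 1e9)      # 6.0 (GB)
print(cells)                       # 500000000000
print(max_ones)                    # 1000000000
print(f"{min_zero_fraction:.1%}")  # 99.8%
```

Under these assumptions at least 99.8% of the cells are zero.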
• So, a better representation is to record only the things that
do occur, that is, the 1 positions.
• This leads to a central idea in IR: the inverted index.
• An inverted index (or inverted file) is an index that
maps back from terms to the parts of a document where
they occur.
  • We keep a dictionary of terms (sometimes also referred
to as a vocabulary or lexicon).
  • Then for each term, we have a list that records which
documents the term occurs in.
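A minimal sketch of “recording only the 1 positions”: starting from a small, made-up incidence matrix, keep for each term just the docIDs whose entry is 1. The dictionary `incidence` and the docID numbering from 1 are assumptions for the example.

```python
# Hypothetical incidence matrix: rows are terms, columns are docIDs 1..4.
incidence = {
    "brutus":    [1, 1, 0, 1],
    "caesar":    [1, 1, 0, 0],
    "calpurnia": [0, 1, 0, 0],
}

# Keep only the 1 positions: term -> sorted list of docIDs.
inverted_index = {
    term: [doc_id for doc_id, bit in enumerate(row, start=1) if bit]
    for term, row in incidence.items()
}

print(inverted_index)
# {'brutus': [1, 2, 4], 'caesar': [1, 2], 'calpurnia': [2]}
```

The dictionary keys play the role of the vocabulary, and each value is the list of documents for that term.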
Boolean Retrieval Model
Inverted Index
• Each item in the list, which records that a term appeared in a
document (and, later, often, the positions in the document), is
conventionally called a posting.
• The list is then called a postings list (or inverted list), and all
the postings lists taken together are referred to as the postings.
Boolean Retrieval Model
Building an inverted index
• The major steps are:
1. Collect the documents to be indexed:
Doc1 = {Friends, Romans, countrymen}
Doc2 = {So let it be with Caesar}
Doc3 = {. . .}
2. Tokenize the text, turning each document into a list of
tokens:
Friends Romans countrymen So . . .
3. Do linguistic preprocessing, producing a list of normalized
tokens, which are the indexing terms:
friends romans countrymen so . . .
4. Index the documents that each term occurs in by creating
an inverted index, consisting of a dictionary and postings.
• Within a document collection, we assume that each document
has a unique serial number, known as the document identifier
(docID).
• The input to indexing is a list of normalized tokens for each
document, which we can equally think of as a list of pairs of term
and docID.
• The core indexing step is sorting this list so that the terms are
alphabetical.
• The postings are secondarily sorted by docID. This provides the
basis for efficient query processing.
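The sort-based construction described above can be sketched as follows. The first two documents echo the slide's examples and the third is hypothetical. The regular-expression tokenizer with lowercasing is only a stand-in for real linguistic preprocessing, and dropping repeated (term, docID) pairs before grouping is an assumption of this sketch.

```python
import re
from itertools import groupby

# Toy documents keyed by docID.
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
    3: "Romans and friends of Caesar",  # hypothetical third document
}

def normalized_tokens(text):
    """Tokenize on letter runs and lowercase the result."""
    return re.findall(r"[a-z]+", text.lower())

# One (term, docID) pair per token.
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in normalized_tokens(text)]

# Core step: sort by term, then by docID; set() drops repeated pairs.
pairs = sorted(set(pairs))

# Group the sorted pairs into dictionary -> postings list.
index = {term: [doc_id for _, doc_id in group]
         for term, group in groupby(pairs, key=lambda p: p[0])}

print(index["romans"])   # [1, 3]
print(index["caesar"])   # [2, 3]
print(index["friends"])  # [1, 3]
```

Because tuples sort by their first element and then their second, a single sort yields both the alphabetical term order and the docID order within each postings list.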
Boolean Retrieval Model
Storage Requirements
• In the resulting index, we pay for storage of both the
dictionary and the postings lists.
• The dictionary is commonly kept in memory.
• Postings lists are normally kept on disk.
• So, the size of each is important.
Boolean Retrieval Model
What data structure should be used for a postings
list?
• A fixed-length array would be wasteful, as some words occur
in many documents and others in very few.
• For an in-memory postings list, two good alternatives are:
  • Singly linked lists: allow cheap insertion of documents into
postings lists (following updates, such as when recrawling the
web for updated documents). They also naturally extend to more
advanced indexing strategies such as skip lists, which require
additional pointers.
  • Variable-length arrays: win in space requirements by avoiding
the overhead for pointers, and in time requirements because their
use of contiguous memory increases speed on modern
processors with memory caches.
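As an illustration of the variable-length array option, here is a small sketch using Python's standard `array` module, which stores unsigned integers contiguously without per-item pointers. The class name `ArrayPostings` and its `add` method are hypothetical.

```python
from array import array
from bisect import bisect_left

class ArrayPostings:
    """Postings list stored as a sorted, contiguous array of docIDs."""

    def __init__(self):
        self.doc_ids = array("I")  # unsigned ints, stored contiguously

    def add(self, doc_id):
        """Insert doc_id at its sorted position, skipping duplicates."""
        i = bisect_left(self.doc_ids, doc_id)
        if i == len(self.doc_ids) or self.doc_ids[i] != doc_id:
            self.doc_ids.insert(i, doc_id)

postings = ArrayPostings()
for doc_id in [4, 1, 2, 4]:
    postings.add(doc_id)
print(list(postings.doc_ids))  # [1, 2, 4]
```

Inserting into the middle of the array shifts every later element. That is the update cost a linked list avoids, in exchange for pointer overhead and less cache-friendly traversal.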
Boolean Retrieval Model
Exercise
• Draw the inverted index that would be built for the
following document collection:
Doc 1: new home sales top forecasts
Doc 2: home sales rise in July
Doc 3: increase in home sales in July
Doc 4: July new home sales rise
• Consider these documents:
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
A. Draw the term-document incidence matrix for this
document collection.
B. What are the returned results for these queries?
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)
