Master Degree in Engineering in Computer Science
Web Information Retrieval
Homework 1
Author:
Biagio Botticelli 1212666
April 18, 2016
Contents

1 Introduction
  1.1 Inverted Index
  1.2 Stemmers
  1.3 Ranking and Scorers
    1.3.1 Count Scorer
    1.3.2 Tf/Idf Scorer
    1.3.3 BM25 Scorer
  1.4 Evaluation metric
    1.4.1 Relevance
    1.4.2 Precision
    1.4.3 Recall
2 Solution of the Problem
  2.1 General Script: wir_hw1.py
  2.2 Creation of Collections: create_collections.py
  2.3 Creation of Inverted Indexes: create_indexes.py
  2.4 Obtaining the Results: get_results.py
  2.5 Evaluation of Results: evaluate_results.py
3 Analysis of Plots
  3.1 Cranfield Dataset 1:1
  3.2 Cranfield Dataset 1:2
  3.3 Time Dataset 1:1
4 Statistics
  4.1 Cranfield Dataset
  4.2 Time Dataset
5 Instructions for the Execution
1 Introduction
The first homework of the Web Information Retrieval course aims to analyze the performance of a search engine built on two particular datasets: Cranfield and Time.
Each of these datasets consists of:
1. a set of HTML documents;
2. a set of queries;
3. a set of relevant document IDs for each query in the dataset: the ground-truth.
To build the search engine, I will use MG4J (Managing Gigabytes for Java), a free, highly customizable and high-performance search engine for large document collections, developed by the Department of Computer Science of the University of Milan.
MG4J takes as input a set of documents with the same number and type of fields, and it outputs an inverted index.
Once the inverted index is obtained, I can use the ad-hoc Java program homework.RunAllQueries_HW to create the .tsv output files containing the 20 highest-ranked documents according to the specified stemmer-scorer combination.
The last step is to evaluate the retrieved .tsv results by analyzing the precision-at-k metric.
At the end, I will know the best configuration, in terms of stemming method and scorer function, for the search engine of each collection.
1.1 Inverted Index
An Inverted Index is a data structure that efficiently represents only the information that actually occurs in the document collection.
The basic idea of an inverted index can be summarized by two components:
• a Dictionary of Terms (or Vocabulary);
• a Posting List (or Inverted List).
For each term of the vocabulary, there is a list that stores the docIDs of the documents in which the term occurs. Each item in this list is called a posting, and the list itself is called a postings list.
Typically, the vocabulary is kept in main memory together with the pointers to each postings list, which is stored on disk.
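As a toy illustration (my own sketch, not MG4J's implementation), an inverted index can be built as a dictionary that maps each term to its postings list:

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: dictionary docID -> text of the document
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        # keep each postings list sorted by docID
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {1: "laminar boundary layer flow", 2: "heat transfer in laminar flow"}
    print(build_inverted_index(docs))
    # {'laminar': [1, 2], 'boundary': [1], 'layer': [1], 'flow': [1, 2], ...}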
1.2 Stemmers
The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms
of a word to a common base form.
For instance:
am, are, is → be
car, cars, car's, cars' → car
Usually a stemmer is a simple heuristic process that cuts off the ends of words in the hope of achieving this goal correctly most of the time; it often includes the removal of derivational affixes.
An English stemmer, for example, should identify the strings stems, stemmer, stemming, stemmed as based on the root stem.
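For illustration only (NLTK's Porter stemmer is not the stemmer used in this homework, but it behaves analogously):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["stems", "stemming", "stemmed", "cars"]:
        print(word, "->", stemmer.stem(word))
    # stems -> stem, stemming -> stem, stemmed -> stem, cars -> car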
For this homework, I will use 3 different stemmers: Default, English and English Stopwords.
For each of these stemmers, I will create a collection and a related inverted index.
1.3 Ranking and Scorers
For large document collections, the number of matching documents can be very large. Thus, it is important for a search engine to return to users only the most relevant documents of the collection, i.e. those that are top-ranked among the large number of potentially relevant documents.
How do we rank the importance of the query-document pairs in the collection?
Ranking of query results is one of the fundamental problems in Information Retrieval (IR).
Given a query q and a collection of documents D that match the query, the problem is to rank
the documents according to some criterion, so that the best results appear early in the list of
results displayed to the user.
Ranking is done by computing numeric scores on query-document pairs through the use of Scorer functions. In particular, I will use 3 different scorers: Count, Tf/Idf and BM25.
1.3.1 Count Scorer
The Count Scorer is the simplest of the three scorer functions.
It computes the score of a document by summing the numbers of occurrences of the query terms within it.
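A minimal sketch of this idea (my own illustration, not MG4J's internal implementation):

    def count_score(query_terms, doc_terms):
        # sum of the occurrences of each query term in the document
        return sum(doc_terms.count(t) for t in query_terms)

    doc = "flow of a gas in a laminar boundary layer flow".split()
    print(count_score(["laminar", "flow"], doc))  # 1 + 2 = 3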
1.3.2 Tf/Idf Scorer
In contrast to the Count scorer, which only registers the presence of terms in documents, it's possible to use the Tf/Idf (Term Frequency/Inverse Document Frequency) scorer, which assigns to a term a weight that also expresses its importance in the document.
The Tf/Idf function assigns a high weight to a term if it occurs frequently in the document but rarely in the whole collection.
Conversely, a term that occurs in nearly all documents is assigned a low weight.
So, the Tf/Idf value increases proportionally to the number of times a word appears in the document, but it is offset by the frequency of the word in the collection, which compensates for the fact that some words are simply more common overall.
Tf/Idf is the product of term frequency and inverse document frequency:
• Term Frequency: the number of times a term occurs in a document;
• Inverse Document Frequency: the number of documents of the collection that contain the term gives the Document Frequency (DF). Since we need a factor that diminishes the weight of terms occurring very frequently in the collection and increases the weight of terms occurring rarely, we can use the Inverse Document Frequency, obtained as:

$$\mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)$$

There is one idf value for each term t in a collection.
The Tf/Idf weight of a term i in a document d is given by:

$$w_{i,d} = \mathrm{tf}_{i,d} \times \log_{10}\left(\frac{N}{\mathrm{df}_i}\right)$$
where:
• tfi,d = the frequency of term i in document d;
• N = total number of documents;
• dfi = the number of documents that contain term i.
This value increases with the number of occurrences within a document and with the rarity of
the term across the whole collection.
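As a sketch, the formula translates directly into Python (my own illustration, assuming documents are given as lists of terms):

    import math

    def tf_idf_weight(term, doc, docs):
        # w_{i,d} = tf_{i,d} * log10(N / df_i)
        tf = doc.count(term)
        df = sum(1 for d in docs if term in d)
        return tf * math.log10(len(docs) / df) if tf > 0 else 0.0

    docs = [["laminar", "flow"], ["turbulent", "flow"], ["heat", "transfer"]]
    print(tf_idf_weight("laminar", docs[0], docs))  # 1 * log10(3/1) = 0.477...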
1.3.3 BM25 Scorer
The BM25 scorer is an improvement over the Tf/Idf scorer: it assigns to each term that appears in a document a weight that depends on the count (the number of occurrences of the term in the document), on the frequency (the number of documents in which the term appears) and on the document length.
It is not a single function, but actually a whole family of scoring functions.
One of the most important instantiations of the BM25 function is defined as follows: given a query Q containing the keywords q1, q2, ..., qn, the BM25 score of a document D is:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
where:
• f(qi, D) = term frequency of qi in the document D;
• |D| = length of the document D (in words);
• avgdl = average document length of the collection;
• k1 and b = free parameters, usually chosen as k1 ∈ [1.2, 2.0] and b = 0.75;
• IDF(qi) = Inverse Document Frequency weight of the query term qi, computed as:

$$\mathrm{IDF}(q_i) = \log\left(\frac{N - n(q_i) + \frac{1}{2}}{n(q_i) + \frac{1}{2}}\right)$$
where:
– n(qi) = number of documents containing qi;
– N = total number of documents in the collection;
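Transcribing the formula into Python gives the following sketch (my own illustration; the implementation actually used in the homework is MG4J's BM25Scorer):

    import math

    def bm25(query, doc, docs, k1=1.2, b=0.75):
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N            # average document length
        score = 0.0
        for q in query:
            n_q = sum(1 for d in docs if q in d)         # documents containing q
            idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
            f = doc.count(q)                             # term frequency f(q, D)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return score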
1.4 Evaluation metric
Ranking functions are evaluated by a variety of means.
One of the simplest is obtained by determining the Precision of the first k top-ranked results
for some fixed k. This metric is also called Precision-at-k (p@k).
1.4.1 Relevance
Relevance is the concept of one topic being connected to another topic in a way that makes it
useful to consider the first topic when considering the second one.
In the context of search engines, it could be defined as: a document d is relevant to a query q if
it addresses the information need behind the query.
A binary assessment of either Relevant or Irrelevant is given by a human expert for each query-document pair.
1.4.2 Precision
The Precision (P) is defined as the number of relevant retrieved documents divided by the
total number of documents retrieved by that search.
$$\mathrm{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved}) \qquad P = \frac{tp}{tp + fp}$$
1.4.3 Recall
The Recall (R) is defined as the number of relevant retrieved documents divided by the total
number of existing relevant documents.
$$\mathrm{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant}) \qquad R = \frac{tp}{tp + fn}$$
                 Relevant                 Irrelevant
Retrieved        true positives (tp)      false positives (fp)
Not Retrieved    false negatives (fn)     true negatives (tn)
2 Solution of the Problem
2.1 General Script: wir_hw1.py
My solution for the homework consists of one Python script that automatically executes four other scripts, each solving one of the requested tasks:
1. create_collections.py: it creates the collections for the Cranfield and Time datasets;
2. create_indexes.py: it creates the inverted indexes built on the obtained collections;
3. get_results.py: it obtains the results by applying three different scoring functions and outputs one .tsv file for each stemmer-scorer combination and for each collection;
4. evaluate_results.py: starting from the obtained results, it evaluates the outputs by comparing them with the ground-truth of each collection, producing the Average Precision-at-k (P@k) for each stemmer-scorer combination.
At the end, it produces three plots: one for the Cranfield collection with weight combination 1:1, one for the Cranfield collection with weight combination 1:2, and one for the Time collection (with weight combination 1:1 by default).
2.2 Creation of Collections: create_collections.py
In order to build the search engine, I first need to create a collection for each of the two datasets.
This can be done with MG4J by executing a command like the following.
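The original report shows the command as an image; the reconstruction below is consistent with the options described next (the MG4J package prefix is an assumption and may differ between MG4J versions):

    find Cranfield_DATASET -iname "*.html" -type f | \
    java it.unimi.di.big.mg4j.document.FileSetDocumentCollection \
         -f HtmlDocumentFactory -p encoding=UTF-8 cranfield_default.collection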
where:
• Cranfield_DATASET: the name of the directory containing the dataset on which I'm creating the collection;
• FileSetDocumentCollection: the main Java class of the MG4J package, which takes as input the list of files (in this project, all .html files) returned by the find command;
• -f HtmlDocumentFactory: the document factory to use, chosen among the available HtmlDocumentFactory, CompositeDocumentFactory, IdentityDocumentFactory, MailDocumentFactory, PdfDocumentFactory, ReplicatedDocumentFactory, PropertyBasedDocumentFactory, TRECHeaderDocumentFactory, ZipDocumentCollection.ZipFactory;
• -p encoding=UTF-8: the chosen encoding property;
• cranfield_default.collection: the name of the output collection.
To keep the execution of this command (and of the following ones) orderly, I decided to create a specific directory structure.
The execution of the script automatically creates the directories for each dataset-stemmer combination, in which it stores the 6 .collection files.
Observation: the collection does not contain the files themselves, but only their names.
Thus, deleting or modifying files in the source directory (e.g. Cranfield_DATASET) may cause inconsistencies in the collection.
2.3 Creation of Inverted Indexes: create_indexes.py
Once the collections are available, it is possible to build the inverted indexes on top of them.
MG4J does all the work for us, but in this case we need three different commands in order to apply the three different stemmers (Default, English and English Stopwords). The commands shown below are reconstructions: in the original report they appear as images, so the MG4J package prefix and the index basenames are assumptions based on the option descriptions that follow.
To apply the Default stemmer, the command is:
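    java it.unimi.di.big.mg4j.tool.IndexBuilder --downcase \
         -s cranfield_default.collection cranfield_default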
To apply the English stemmer, the command is:
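    java it.unimi.di.big.mg4j.tool.IndexBuilder -t EnglishStemmer \
         -s cranfield_english.collection cranfield_english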
To apply the English Stopwords stemmer, the command is:
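    java it.unimi.di.big.mg4j.tool.IndexBuilder -t EnglishStemmerStopwords \
         -s cranfield_stopword.collection cranfield_stopword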
where:
• IndexBuilder: the MG4J tool that builds an inverted index, taking a collection as input;
• --downcase: option that forces all terms to be downcased (used only for the Default stemmer);
• -t EnglishStemmerStopwords: option to use a stemmer different from the Default one.
It's possible to use either the stemmers predefined in the MG4J package or user-defined ones (e.g. EnglishStemmerStopwords);
• -s cranfield_stopword.collection: option that specifies the collection for which I am producing the index.
Warning! Don't forget this option: if it is omitted, IndexBuilder expects to index a document sequence read from standard input!
• cranfield_stopword: the name of the output index.
The create_indexes.py script uses the directory structure created by create_collections.py and automatically executes the commands above in the specific directory for each dataset-stemmer combination.
This means that at the end of the execution I will have 6 inverted indexes in total, each stored in a directory whose path encodes the dataset-stemmer combination. If the user wants information about the indexes that have been created, it can be found in the files produced by MG4J:
• cranfield_default-{text,title}.terms: contain the terms of the dictionary;
• cranfield_default-{text,title}.stats: contain statistics;
• cranfield_default-{text,title}.properties: contain global information;
• cranfield_default-{text,title}.frequencies: for each term, the number of documents containing the term (γ-coded);
• cranfield_default-{text,title}.globcounts: for each term, the number of occurrences of the term (γ-coded);
• cranfield_default-{text,title}.offsets: for each term, the offset (γ-coded).
2.4 Obtaining the Results: get_results.py
Once the inverted indexes are created, I can obtain, for each query, the list of documents retrieved by the search engine.
To accomplish this task, I designed get_results.py: the script automatically executes the ad-hoc Java program homework.RunAllQueries_HW, which retrieves the 20 most relevant document IDs for each query ID and stores them in a .tsv (tab-separated values) file.
To keep the output .tsv files separate from the files of the inverted indexes, I designed a mirror tree of directories with a different root, called results.
The command executed at each iteration has roughly the following shape (the original is shown as an image, so the exact argument order is an assumption inferred from the descriptions below):
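    java homework.RunAllQueries_HW time_stopword time_all_queries.tsv \
         BM25Scorer 1:1 time_stopword_1to1_BM25Scorer.tsv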
• homework.RunAllQueries_HW: the Java program that retrieves the 20 top-ranked documents (for each query in the query set) according to a specific Dataset-Stemmer-Weight-Scorer combination;
• time_stopword: the name of the collection;
• time_all_queries.tsv: the file containing the query set of the dataset;
• BM25Scorer: the scorer to use;
• 1:1: the combination of weights.
1:1 = the title field is considered as important as the text field for the final score;
1:2 = the title field is considered twice as important as the text field for the final score;
• time_stopword_1to1_BM25Scorer.tsv: the name of the .tsv output file.
The execution of this script creates 27 .tsv files in total: 9 for the stemmer-scorer combinations of the Cranfield dataset with weights 1:1, 9 for the Cranfield dataset with weights 1:2, and 9 for the Time dataset with weights 1:1.
2.5 Evaluation of Results: evaluate_results.py
The last step is to evaluate the performance of my search engine.
One of the most widely used evaluation metrics is the average Precision-at-k (p@k) over the whole query set.
The evaluate_results.py script that I developed computes the average Precision-at-k (p@k) for 4 values of k: 1, 3, 5 and 10.
But what does this mean exactly?
First of all, I have to introduce some concepts that I will use in the script.
• Ground-Truth: a .tsv file that contains, for each query in the query set, the IDs of the relevant documents.
To work with it, I defined a Dictionary <key, value> where the keys are the queryIDs and the values are the lists of docIDs that are relevant for each query.
• Relevant: the list of IDs of the documents that are relevant for a single query.
• tsv: starting from the .tsv files produced by the get_results.py script, I built a Dictionary <key, value> where the keys are the queryIDs and the values are the lists of docIDs retrieved for each query.
• Retrieved: the list of the 20 IDs of the documents retrieved for a single query.
• TopKResults: to evaluate the p@k of a single query, I do not need all 20 docIDs of the retrieved list but only a subset of size k.
Thus, TopKResults is the list of the k top-ranked documents retrieved for that query.
The definition of Precision that I used in the script is given by:

$$p@k_{\mathrm{extended}} = \frac{|\mathrm{TopKResults} \cap \mathrm{GroundTruth}|}{\min(k, |\mathrm{GroundTruth}|)}$$
Observation: analyzing the Cranfield ground-truth, I noticed that some queries are missing. Let us call them "Noisy Queries".
Why is it important to distinguish these "Noisy Queries"?
The point is that there are two different notions of "zero" that p@k_extended can return:
• Real Zero: a zero that is genuinely the size of the intersection between the TopKResults and the GroundTruth under consideration.
This is the notion of zero that I want to keep in my computation.
• Noisy Zero: a zero resulting from the intersection between the TopKResults and the empty set produced by a query that is missing from the GroundTruth.
This is the notion of zero that I want to discard in my computation.
Thus, by construction, all the Dictionaries of the Cranfield dataset skip these Noisy Queries, and the total number of queries becomes 222. Since the Time dataset has no Noisy Queries, I do not need to face this problem there, and the total number of queries is 83.
The script uses the notion of p@k_extended to output a Dictionary in which the key is the query and the value is the list of its p@k values, as sketched below.
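A minimal sketch of the computation (my own illustration of the definitions above, with hypothetical variable names):

    def p_at_k_extended(retrieved, ground_truth, k):
        # |TopKResults ∩ GroundTruth| / min(k, |GroundTruth|)
        top_k_results = set(retrieved[:k])
        return len(top_k_results & set(ground_truth)) / min(k, len(ground_truth))

    def average_p_at_k(tsv, ground_truth, k):
        # Noisy Queries are skipped: only queries with a non-empty ground-truth count
        queries = [q for q in tsv if ground_truth.get(q)]
        return sum(p_at_k_extended(tsv[q], ground_truth[q], k)
                   for q in queries) / len(queries)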
Finally, the average of the p@k values is computed for each specific Dataset-Stemmer-Weight-Scorer combination and plotted on a graph.
Three graphs are created in total: one for the Cranfield collection with weight combination 1:1, one for the Cranfield collection with weight combination 1:2, and one for the Time collection (with weight combination 1:1 by default).
3 Analysis of Plots
3.1 Cranfield Dataset 1:1
Analyzing the graph, I can say that:
1. Best combination Stemmer-Scorer = EnglishStopwords-BM25;
2. Best Stemmer = EnglishStopwords;
3. Best Scorer = BM25;
3.2 Cranfield Dataset 1:2
Analyzing the graph, I can say that the results are very similar to those of Cranfield 1:1:
1. Best combination Stemmer-Scorer = EnglishStopwords-BM25;
2. Best Stemmer = EnglishStopwords;
3. Best Scorer = BM25;
Which is the best weight combination between '1:1' and '1:2'?
Even though the results are very similar, on average I assert that the best weighting is '1:2'.
3.3 Time Dataset 1:1
Analyzing the graph, I can say that:
1. Best combination Stemmer-Scorer = on average, EnglishStopwords-BM25;
2. Best Stemmer = here we can see a different behavior in two intervals:
• for k ∈ [1, ~3.2] the best stemmer is English;
• for k ∈ (~3.2, 10] the best stemmer is EnglishStopwords;
3. Best Scorer = BM25;
4 Statistics
4.1 Cranfield Dataset
4.2 Time Dataset
5 Instructions for the Execution
The Python script has been tested on macOS. To run, it requires Java, Python and the Plotly library to be installed.
Assuming the system is ready to execute the script, simply unzip the .zip archive into an arbitrary directory of your Operating System.
Open the directory in a terminal and type the command:
source set-my-classpath.sh
This command sets up the system to find the directory containing the MG4J and Homework packages. Now, simply type the command:
python wir_hw1.py
This command starts the execution of the script.
At the end of the execution, the 3 graphs described above are shown in your default browser.
• Observation: since the graphs are made with the Plotly library for Python, in order to have the offline plots displayed it is mandatory to have Plotly installed together with a Plotly account.
The installation is very simple and can be done from the command line with:
pip install plotly
Then the user should create an account on the Plotly webpage.
After the registration, the Plotly environment can be set up with a command along the following lines (the original command is shown as an image; plotly.tools.set_credentials_file is the standard call in the Plotly 1.x API, and the credentials below are placeholders):
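    python -c "import plotly; plotly.tools.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')"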
If everything is set up correctly, the plots will be displayed automatically.