Master Degree in Engineering in Computer Science
Web Information Retrieval
Homework 1
Author:
Biagio Botticelli 1212666
April 18, 2016
Contents

1 Introduction
  1.1 Inverted Index
  1.2 Stemmers
  1.3 Ranking and Scorers
    1.3.1 Count Scorer
    1.3.2 Tf/Idf Scorer
    1.3.3 BM25 Scorer
  1.4 Evaluation metric
    1.4.1 Relevance
    1.4.2 Precision
    1.4.3 Recall
2 Solution of the Problem
  2.1 General Script: wir_hw1.py
  2.2 Creation of Collections: create_collections.py
  2.3 Creation of Inverted Indexes: create_indexes.py
  2.4 Obtaining the Results: get_results.py
  2.5 Evaluation of Results: evaluate_results.py
3 Analysis of Plots
  3.1 Cranfield Dataset 1:1
  3.2 Cranfield Dataset 1:2
  3.3 Time Dataset 1:1
4 Statistics
  4.1 Cranfield Dataset
  4.2 Time Dataset
5 Instructions for the Execution
1 Introduction
The first homework of the Web Information Retrieval course aims to analyze the performance of a search engine built on two particular datasets: Cranfield and Time.
Each of these datasets consists of:
1. a set of HTML documents;
2. a set of queries;
3. a set of relevant document IDs for each query in the dataset: the ground-truth.
To build the search engine, I will use MG4J (Managing Gigabytes for Java), a free, highly customizable and high-performance search engine for large document collections, developed by the Department of Computer Science of the University of Milan.
MG4J takes as input a set of documents with the same number and type of fields, and it outputs an inverted index.
Once the inverted index is obtained, I can use the ad-hoc Java program homework.RunAllQueries_HW to create the .tsv output files containing the 20 highest-ranked documents according to the specified stemmer-scorer combination.
The last step is to evaluate the retrieved .tsv results by analyzing the precision-at-k metric.
At the end, I will know the best configuration, in terms of stemming method and scorer function, for the search engine of each collection.
1.1 Inverted Index
An Inverted Index is a data structure that efficiently represents only the information that actually occurs in the document collection.
The basic idea of an inverted index can be summarized by two components:
• a Dictionary of Terms (or Vocabulary);
• a Posting List (or Inverted List).
For each term of the vocabulary, there is a list that stores the docIDs of the documents in which the term occurs. Each item in this list is called a posting, and the list itself is called a postings list.
Typically, the vocabulary is kept in main memory together with the pointers to each postings list, which is stored on disk.
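As a toy illustration (my own sketch, not MG4J's implementation), an inverted index can be built as a dictionary that maps each term to its postings list:

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: dictionary docID -> text of the document
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        # keep each postings list sorted by docID
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {1: "laminar boundary layer flow", 2: "heat transfer in laminar flow"}
    print(build_inverted_index(docs))
    # {'laminar': [1, 2], 'boundary': [1], 'layer': [1], 'flow': [1, 2], ...}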
1.2 Stemmers
The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms
of a word to a common base form.
For instance:
am, are, is → be
car, cars, car's, cars' → car
Usually a stemmer is a simple heuristic process that cuts off the ends of words in the hope of achieving this goal correctly most of the time; it often includes the removal of derivational affixes.
An English stemmer, for example, should identify the strings stems, stemmer, stemming, stemmed as based on the root stem.
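For illustration only (NLTK's Porter stemmer is not the stemmer used in this homework, but it behaves analogously):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["stems", "stemming", "stemmed", "cars"]:
        print(word, "->", stemmer.stem(word))
    # stems -> stem, stemming -> stem, stemmed -> stem, cars -> car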
For this homework, I will use 3 different stemmers: Default, English and English Stopwords.
For each of these stemmers, I will create a collection and a related inverted index.
1.3 Ranking and Scorers
For large document collections, the number of matching documents can be very large. Thus, it is important for a search engine to return to users only the most relevant documents of the collection, i.e. those that are top-ranked among the large number of potentially relevant documents.
How do we rank the importance of the query-document pairs in the collection?
Ranking of query results is one of the fundamental problems in Information Retrieval (IR).
Given a query q and a collection of documents D that match the query, the problem is to rank
the documents according to some criterion, so that the best results appear early in the list of
results displayed to the user.
Ranking is done by computing numeric scores on query-document pairs through the use of Scorer functions. In particular, I will use 3 different scorers: Count, Tf/Idf and BM25.
1.3.1 Count Scorer
The Count Scorer is the simplest of the three scorer functions.
It computes the score of a document by summing the numbers of occurrences of the query terms within it.
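A minimal sketch of this idea (my own illustration, not MG4J's internal implementation):

    def count_score(query_terms, doc_terms):
        # sum of the occurrences of each query term in the document
        return sum(doc_terms.count(t) for t in query_terms)

    doc = "flow of a gas in a laminar boundary layer flow".split()
    print(count_score(["laminar", "flow"], doc))  # 1 + 2 = 3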
1.3.2 Tf/Idf Scorer
In contrast to the Count scorer, which only registers the presence of terms in documents, it's possible to use the Tf/Idf (Term Frequency/Inverse Document Frequency) scorer, which assigns to a term a weight that also expresses its importance in the document.
The Tf/Idf function assigns a high weight to a term if it occurs frequently in the document but rarely in the whole collection.
Conversely, a term that occurs in nearly all documents is assigned a low weight.
So, the Tf/Idf value increases proportionally to the number of times a word appears in the document, but it is offset by the frequency of the word in the collection, which compensates for the fact that some words are simply more common overall.
Tf/Idf is the product of term frequency and inverse document frequency:
• Term Frequency: the number of times a term occurs in a document;
• Inverse Document Frequency: the number of documents of the collection that contain the term gives the Document Frequency (DF). Since we need a factor that diminishes the weight of terms occurring very frequently in the collection and increases the weight of terms occurring rarely, we can use the Inverse Document Frequency, obtained as:

$$\mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)$$

There is one idf value for each term t in a collection.
The Tf/Idf weight of a term i in a document d is given by:

$$w_{i,d} = \mathrm{tf}_{i,d} \times \log_{10}\left(\frac{N}{\mathrm{df}_i}\right)$$
where:
• tfi,d = the frequency of term i in document d;
• N = total number of documents;
• dfi = the number of documents that contain term i.
This value increases with the number of occurrences within a document and with the rarity of
the term across the whole collection.
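As a sketch, the formula translates directly into Python (my own illustration, assuming documents are given as lists of terms):

    import math

    def tf_idf_weight(term, doc, docs):
        # w_{i,d} = tf_{i,d} * log10(N / df_i)
        tf = doc.count(term)
        df = sum(1 for d in docs if term in d)
        return tf * math.log10(len(docs) / df) if tf > 0 else 0.0

    docs = [["laminar", "flow"], ["turbulent", "flow"], ["heat", "transfer"]]
    print(tf_idf_weight("laminar", docs[0], docs))  # 1 * log10(3/1) = 0.477...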
1.3.3 BM25 Scorer
The BM25 scorer is an improvement over the Tf/Idf scorer: it assigns to each term that appears in a document a weight that depends on the count (the number of occurrences of the term in the document), on the frequency (the number of documents in which the term appears) and on the document length.
It is not a single function, but actually a whole family of scoring functions.
One of the most important instantiations of the BM25 function is defined as follows: given a query Q containing the keywords q1, q2, ..., qn, the BM25 score of a document D is:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
where:
• f(qi, D) = term frequency of qi in the document D;
• |D| = length of the document D (in words);
• avgdl = average document length of the collection;
• k1 and b = free parameters, usually chosen as k1 ∈ [1.2, 2.0] and b = 0.75;
• IDF(qi) = Inverse Document Frequency weight of the query term qi, computed as:

$$\mathrm{IDF}(q_i) = \log\left(\frac{N - n(q_i) + \frac{1}{2}}{n(q_i) + \frac{1}{2}}\right)$$
where:
– n(qi) = number of documents containing qi;
– N = total number of documents in the collection;
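Transcribing the formula into Python gives the following sketch (my own illustration; the implementation actually used in the homework is MG4J's BM25Scorer):

    import math

    def bm25(query, doc, docs, k1=1.2, b=0.75):
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N            # average document length
        score = 0.0
        for q in query:
            n_q = sum(1 for d in docs if q in d)         # documents containing q
            idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
            f = doc.count(q)                             # term frequency f(q, D)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return score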
1.4 Evaluation metric
Ranking functions are evaluated by a variety of means.
One of the simplest is obtained by determining the Precision of the first k top-ranked results
for some fixed k. This metric is also called Precision-at-k (p@k).
1.4.1 Relevance
Relevance is the concept of one topic being connected to another topic in a way that makes it
useful to consider the first topic when considering the second one.
In the context of search engines, it could be defined as: a document d is relevant to a query q if
it addresses the information need behind the query.
A binary assessment of either Relevant or Irrelevant is given by a human expert for each query-document pair.
1.4.2 Precision
The Precision (P) is defined as the number of relevant retrieved documents divided by the
total number of documents retrieved by that search.
$$\mathrm{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved}) \qquad P = \frac{tp}{tp + fp}$$
1.4.3 Recall
The Recall (R) is defined as the number of relevant retrieved documents divided by the total
number of existing relevant documents.
$$\mathrm{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant}) \qquad R = \frac{tp}{tp + fn}$$
                 Relevant                 Irrelevant
Retrieved        true positives (tp)      false positives (fp)
Not Retrieved    false negatives (fn)     true negatives (tn)
2 Solution of the Problem
2.1 General Script: wir_hw1.py
My solution for the homework consists of one Python script that automatically executes four other scripts, each solving one of the requested tasks:
1. create_collections.py: it creates the collections for the Cranfield and Time datasets;
2. create_indexes.py: it creates the inverted indexes built on the obtained collections;
3. get_results.py: it obtains the results by applying three different scoring functions and outputs one .tsv file for each stemmer-scorer combination and for each collection;
4. evaluate_results.py: starting from the obtained results, it evaluates the outputs by comparing them with the ground-truth of each collection, producing the Average Precision-at-k (P@k) for each stemmer-scorer combination.
At the end, it produces three plots: one for the Cranfield collection with weight combination 1:1, one for the Cranfield collection with weight combination 1:2, and one for the Time collection (with weight combination 1:1 by default).
2.2 Creation of Collections: create_collections.py
In order to build the search engine, I first need to create a collection for each of the two datasets.
This can be done with MG4J by executing a command like the following.
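The original report shows the command as an image; the reconstruction below is consistent with the options described next (the MG4J package prefix is an assumption and may differ between MG4J versions):

    find Cranfield_DATASET -iname "*.html" -type f | \
    java it.unimi.di.big.mg4j.document.FileSetDocumentCollection \
         -f HtmlDocumentFactory -p encoding=UTF-8 cranfield_default.collection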
where:
• Cranfield_DATASET: the name of the directory containing the dataset on which I'm creating the collection;
• FileSetDocumentCollection: the main Java class of the MG4J package, which takes as input the list of files (in this project, all .html files) returned by the find command;
• -f HtmlDocumentFactory: the document factory to use, chosen among the available HtmlDocumentFactory, CompositeDocumentFactory, IdentityDocumentFactory, MailDocumentFactory, PdfDocumentFactory, ReplicatedDocumentFactory, PropertyBasedDocumentFactory, TRECHeaderDocumentFactory, ZipDocumentCollection.ZipFactory;
• -p encoding=UTF-8: the chosen encoding property;
• cranfield_default.collection: the name of the output collection.
To keep the execution of this command (and of the following ones) orderly, I decided to create a specific directory structure.
The execution of the script automatically creates the directories for each dataset-stemmer combination, in which it stores the 6 .collection files.
Observation: the collection does not contain the files themselves, but only their names.
Thus, deleting or modifying files in the source directory (e.g. Cranfield_DATASET) may cause inconsistencies in the collection.
2.3 Creation of Inverted Indexes: create_indexes.py
Once the collections are available, it is possible to build the inverted indexes on top of them.
MG4J does all the work for us, but in this case we need three different commands in order to apply the three different stemmers (Default, English and English Stopwords). The commands shown below are reconstructions: in the original report they appear as images, so the MG4J package prefix and the index basenames are assumptions based on the option descriptions that follow.
To apply the Default stemmer, the command is:
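    java it.unimi.di.big.mg4j.tool.IndexBuilder --downcase \
         -s cranfield_default.collection cranfield_default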
To apply the English stemmer, the command is:
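    java it.unimi.di.big.mg4j.tool.IndexBuilder -t EnglishStemmer \
         -s cranfield_english.collection cranfield_english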
To apply the English Stopwords stemmer, the command is:
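    java it.unimi.di.big.mg4j.tool.IndexBuilder -t EnglishStemmerStopwords \
         -s cranfield_stopword.collection cranfield_stopword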
where:
• IndexBuilder: the MG4J tool that builds an inverted index, taking a collection as input;
• --downcase: option that forces all terms to be downcased (used only for the Default stemmer);
• -t EnglishStemmerStopwords: option to use a stemmer different from the Default one.
It's possible to use either the stemmers predefined in the MG4J package or user-defined ones (e.g. EnglishStemmerStopwords);
• -s cranfield_stopword.collection: option that specifies the collection for which I am producing the index.
Warning! Don't forget this option: if it is omitted, IndexBuilder expects to index a document sequence read from standard input!
• cranfield_stopword: the name of the output index.
The create_indexes.py script uses the directory structure created by create_collections.py and automatically executes the commands above in the specific directory for each dataset-stemmer combination.
This means that at the end of the execution I will have 6 inverted indexes in total, each stored in a directory whose path encodes the dataset-stemmer combination. If the user wants information about the indexes that have been created, it can be found in the files produced by MG4J:
• cranfield_default-{text,title}.terms: contain the terms of the dictionary;
• cranfield_default-{text,title}.stats: contain statistics;
• cranfield_default-{text,title}.properties: contain global information;
• cranfield_default-{text,title}.frequencies: for each term, the number of documents containing the term (γ-coded);
• cranfield_default-{text,title}.globcounts: for each term, the number of occurrences of the term (γ-coded);
• cranfield_default-{text,title}.offsets: for each term, the offset (γ-coded).
2.4 Obtaining the Results: get_results.py
Once the inverted indexes are created, I can obtain, for each query, the list of documents retrieved by the search engine.
To accomplish this task, I designed get_results.py: the script automatically executes the ad-hoc Java program homework.RunAllQueries_HW, which retrieves the 20 most relevant document IDs for each query ID and stores them in a .tsv (tab-separated values) file.
To keep the output .tsv files separate from the files of the inverted indexes, I designed a mirror tree of directories with a different root, called results.
The command executed at each iteration has roughly the following shape (the original is shown as an image, so the exact argument order is an assumption inferred from the descriptions below):
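    java homework.RunAllQueries_HW time_stopword time_all_queries.tsv \
         BM25Scorer 1:1 time_stopword_1to1_BM25Scorer.tsv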
• homework.RunAllQueries_HW: the Java program that retrieves the 20 top-ranked documents (for each query in the query set) according to a specific Dataset-Stemmer-Weight-Scorer combination;
• time_stopword: the name of the collection;
• time_all_queries.tsv: the file containing the query set of the dataset;
• BM25Scorer: the scorer to use;
• 1:1: the combination of weights.
1:1 = the title field is considered as important as the text field for the final score;
1:2 = the title field is considered twice as important as the text field for the final score;
• time_stopword_1to1_BM25Scorer.tsv: the name of the .tsv output file.
The execution of this script creates 27 .tsv files in total: 9 for the stemmer-scorer combinations of the Cranfield dataset with weights 1:1, 9 for the Cranfield dataset with weights 1:2, and 9 for the Time dataset with weights 1:1.
2.5 Evaluation of Results: evaluate_results.py
The last step is to evaluate the performance of my search engine.
One of the most widely used evaluation metrics is the average Precision-at-k (p@k) over the whole query set.
The evaluate_results.py script that I developed computes the average Precision-at-k (p@k) for 4 values of k: 1, 3, 5 and 10.
But what does this mean exactly?
First of all, I have to introduce some concepts that I will use in the script.
• Ground-Truth: a .tsv file that contains, for each query in the query set, the IDs of the relevant documents.
To work with it, I defined a Dictionary <key, value> where the keys are the queryIDs and the values are the lists of docIDs that are relevant for each query.
• Relevant: the list of IDs of the documents that are relevant for a single query.
• tsv: starting from the .tsv files produced by the get_results.py script, I built a Dictionary <key, value> where the keys are the queryIDs and the values are the lists of docIDs retrieved for each query.
• Retrieved: the list of the 20 IDs of the documents retrieved for a single query.
• TopKResults: to evaluate the p@k of a single query, I do not need all 20 docIDs of the retrieved list but only a subset of size k.
Thus, TopKResults is the list of the k top-ranked documents retrieved for that query.
The definition of Precision that I used in the script is given by:

$$p@k_{\mathrm{extended}} = \frac{|\mathrm{TopKResults} \cap \mathrm{GroundTruth}|}{\min(k, |\mathrm{GroundTruth}|)}$$
Observation: analyzing the Cranfield ground-truth, I noticed that some queries are missing. Let us call them "Noisy Queries".
Why is it important to distinguish these "Noisy Queries"?
The point is that there are two different notions of "zero" that p@k_extended can return:
• Real Zero: a zero that is genuinely the size of the intersection between the TopKResults and the GroundTruth under consideration.
This is the notion of zero that I want to keep in my computation.
• Noisy Zero: a zero resulting from the intersection between the TopKResults and the empty set produced by a query that is missing from the GroundTruth.
This is the notion of zero that I want to discard in my computation.
Thus, by construction, all the Dictionaries of the Cranfield dataset skip these Noisy Queries, and the total number of queries becomes 222. Since the Time dataset has no Noisy Queries, I do not need to face this problem there, and the total number of queries is 83.
The script uses the notion of p@k_extended to output a Dictionary in which the key is the query and the value is the list of its p@k values, as sketched below.
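A minimal sketch of the computation (my own illustration of the definitions above, with hypothetical variable names):

    def p_at_k_extended(retrieved, ground_truth, k):
        # |TopKResults ∩ GroundTruth| / min(k, |GroundTruth|)
        top_k_results = set(retrieved[:k])
        return len(top_k_results & set(ground_truth)) / min(k, len(ground_truth))

    def average_p_at_k(tsv, ground_truth, k):
        # Noisy Queries are skipped: only queries with a non-empty ground-truth count
        queries = [q for q in tsv if ground_truth.get(q)]
        return sum(p_at_k_extended(tsv[q], ground_truth[q], k)
                   for q in queries) / len(queries)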
Finally, the average of the p@k values is computed for each specific Dataset-Stemmer-Weight-Scorer combination and plotted on a graph.
Three graphs are created in total: one for the Cranfield collection with weight combination 1:1, one for the Cranfield collection with weight combination 1:2, and one for the Time collection (with weight combination 1:1 by default).
3 Analysis of Plots
3.1 Cranfield Dataset 1:1
Analyzing the graph, I can say that:
1. Best combination Stemmer-Scorer = EnglishStopwords-BM25;
2. Best Stemmer = EnglishStopwords;
3. Best Scorer = BM25;
3.2 Cranfield Dataset 1:2
Analyzing the graph, I can say that the results are very similar to those of Cranfield 1:1:
1. Best combination Stemmer-Scorer = EnglishStopwords-BM25;
2. Best Stemmer = EnglishStopwords;
3. Best Scorer = BM25;
Which is the best weight combination between '1:1' and '1:2'?
Even though the results are very similar, on average I assert that the best weighting is '1:2'.
3.3 Time Dataset 1:1
Analyzing the graph, I can say that:
1. Best combination Stemmer-Scorer = on average, EnglishStopwords-BM25;
2. Best Stemmer = here we can see a different behavior in two intervals:
• for k ∈ [1, ~3.2] the best stemmer is English;
• for k ∈ (~3.2, 10] the best stemmer is EnglishStopwords;
3. Best Scorer = BM25;
4 Statistics
4.1 Cranfield Dataset
4.2 Time Dataset
5 Instructions for the Execution
The Python script has been tested on macOS. To run, it requires Java, Python and the Plotly library to be installed.
Assuming the system is ready to execute the script, simply unzip the .zip archive into an arbitrary directory of your Operating System.
Open the directory in a terminal and type the command:
source set-my-classpath.sh
This command sets up the system to find the directory containing the MG4J and Homework packages. Now, simply type the command:
python wir_hw1.py
This command starts the execution of the script.
At the end of the execution, the 3 graphs described above are shown in your default browser.
• Observation: since the graphs are made with the Plotly library for Python, in order to have the offline plots displayed it is mandatory to have Plotly installed together with a Plotly account.
The installation is very simple and can be done from the command line with:
pip install plotly
Then the user should create an account on the Plotly webpage.
After the registration, the Plotly environment can be set up with a command along the following lines (the original command is shown as an image; plotly.tools.set_credentials_file is the standard call in the Plotly 1.x API, and the credentials below are placeholders):
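    python -c "import plotly; plotly.tools.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')"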
If everything is set up correctly, the plots will be displayed automatically.