Hw1

IR Homework #1

By J. H. Wang
Mar. 15, 2012

Programming Exercise #1:
Indexing
• Goal: build an inverted index for a text
collection
• Input: a set of text documents
– (to be described later)
• Output: inverted index files
– (exact format to be described later)

Input: the Test Collection
• ClueWeb09 dataset
– http://lemurproject.org/clueweb09.php/
– 1,040,809,705 Web pages, in 10 languages
– 5TB, compressed (25TB, uncompressed)
– File format: WARC (Web ARChive file format)
• http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
• Each file contains about 40,000 Web pages and is about 1GB in size
• Each team will be randomly allocated different
files!
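
A minimal sketch of how WARC records might be read, for orientation only. It assumes each record is a `WARC/` version line, `Name: value` header lines, a blank line, and then a body of `Content-Length` bytes, and that the allocated files are gzip-compressed. The function names `read_warc_records` and `iter_warc_file` are hypothetical, and real collection files may contain irregular records that this sketch does not handle.

```python
import gzip
import io

def read_warc_records(stream):
    """Yield (headers, body) pairs from a binary stream of WARC records."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip blank separators and anything before a version line
        headers = {}
        while True:
            line = stream.readline()
            if not line or not line.strip():
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Assumption: Content-Length is present and is a valid integer.
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)

def iter_warc_file(path):
    """Open a gzip-compressed WARC file (assumed) and yield its records."""
    with gzip.open(path, "rb") as stream:
        yield from read_warc_records(stream)

# Tiny in-memory example with two synthetic records.
sample = (
    b"WARC/0.18\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/0.18\r\nWARC-Type: response\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
records = list(read_warc_records(io.BytesIO(sample)))
```

Reading the body by its declared length, rather than scanning for the next version line, avoids misparsing page content that happens to contain the text `WARC/`.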

Other Test Collections
• Reuters-RCV1: (in the textbook)
http://trec.nist.gov/data/reuters/reuters.html
– About 810,000 English news stories from 1996/08/20 to
1997/08/19 (2.5GB uncompressed)
– Requires signing a usage agreement
• Reuters-21578:
http://www.daviddlewis.com/resources/testcollections/reuters21578/
– 21,578 news articles in 1987 (28.0MB uncompressed)
• Test collections held at University of Glasgow:
http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
– LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
– Ex: The Time Collection: 423 documents (1.5MB)

Output: Inverted Index
• Using the standard positional index (Chap. 1 &
2)
• Output format:
– Dictionary file: a sorted list of vocabulary terms
(one per line)
– Postings list: for each term, a list of occurrences in the
original text
• term_i, df_i: <doc_1, tf_i1: <pos_1, pos_2, …>; doc_2, tf_i2: <pos_1, pos_2, …>; …>
(as in Fig. 2.11, Sec. 2.4, p.38)
– df_i: document frequency of term_i
– tf_ij: term frequency of term_i in doc_j
• to, 993427:
<1, 6: <7, 18, 33, 72, 86, 231>;
2, 5: <1, 17, 74, 222, 255>; … >
• …

Implementation Issues
• Note: pos is the position of a token within
the body of a document
– Storing positions simplifies later steps after
indexing, for example, proximity search
• Document preprocessing should be
handled with care
– Different formats for different collections
– Digits, hyphens, punctuation marks, …

Optional Functionality
• Efficiency issues
– A separate data structure (e.g. trie) can be used to
store the vocabulary and postings in your indexer,
but the output should be in the designated format
– Skip pointers
• Tokenization
– Case folding
– Stopword removal
– Stemming
– Each option should be switchable on/off by a parameter
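
One possible way to make the tokenization options switchable is sketched below. Everything here is an assumption for illustration: the token pattern (letters and digits, with internal hyphens kept), the tiny stopword list, and the `crude_stem` placeholder, which is not a real stemming algorithm such as Porter's. The parameters could equally be wired to command-line flags.

```python
import re

# Tiny illustrative stopword list (assumption, not a standard list).
STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})

# Assumed token pattern: alphanumeric runs, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def crude_stem(token):
    """Strip a few common English suffixes; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text, case_folding=True, remove_stopwords=False, stemming=False):
    """Return (position, token) pairs; positions count every token, kept or not."""
    result = []
    for pos, tok in enumerate(TOKEN_RE.findall(text), start=1):
        if remove_stopwords and tok.lower() in STOPWORDS:
            continue  # dropped, but later positions are unchanged
        if case_folding:
            tok = tok.lower()
        if stemming:
            tok = crude_stem(tok)
        result.append((pos, tok))
    return result

text = "The state-of-the-art Indexing"
default_tokens = tokenize(text)
all_on = tokenize(text, remove_stopwords=True, stemming=True)
no_folding = tokenize(text, case_folding=False)
```

Positions are assigned before stopwords are dropped, so the recorded positions still reflect the original token order in the document body.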

Submission
• Your submission *should* include
– The source code (and optionally your executable file)
– A one-page description that includes the following
• Major features in your work (ex: high efficiency, low storage,
multiple input formats, huge corpus, …)
• Major difficulties encountered
• Special requirements for execution environments (ex: Java
Runtime Environment, special compilers, …)
• Team member list: clearly identify each member's name and
the parts he or she was responsible for
• Due: two weeks (Mar. 29, 2012)

Submission Instructions
• Programs or homework in electronic files must
be submitted directly on the submission site:
– Submission site: http://140.124.183.39/IR/
• Username: your student ID
• Password: (Please change your default password at your
first login)
– Prepare your submission as one single
compressed file
• Remember to specify the names and student IDs of your
team members in the files and documentation
– If you cannot successfully submit your work, please
contact the TA (@ R1424, Technology Building)

Evaluation
• Minimum requirement: correctness
– Part of the ClueWeb09 test collection will be
used as input, and the inverted index generated
by your program will be checked
– Optional features will be considered as bonus
• You may be required to give a demo if the
TA is unable to compile or run the
submitted program
