2. Similarity of Documents
•Simple inner product
•Cosine similarity
•Term weights
–Standard problem in IR
–tf-idf, BM25, etc.
•A document d is represented as a vector Wd of term weights wt,d, which indicate the importance of each term t in the document
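The two similarity measures named above can be sketched in a few lines of Python. This is a minimal illustration, assuming each document vector is stored sparsely as a dict from term to weight; the function names and the example weights are hypothetical and not part of the course code.

```python
import math

def dot(wi, wj):
    """Inner product of two sparse term-weight vectors (dict: term -> weight)."""
    # Iterate over the smaller vector; only terms present in both contribute.
    if len(wi) > len(wj):
        wi, wj = wj, wi
    return sum(w * wj[t] for t, w in wi.items() if t in wj)

def cosine(wi, wj):
    """Cosine similarity: the inner product divided by the product of the norms."""
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    norm_j = math.sqrt(sum(w * w for w in wj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot(wi, wj) / (norm_i * norm_j)

# Hypothetical term weights for two documents.
d_i = {"map": 1.0, "reduce": 2.0}
d_j = {"reduce": 3.0, "index": 4.0}

inner = dot(d_i, d_j)      # only "reduce" is shared
cos = cosine(d_i, d_j)
```

If the vectors are normalized to unit length beforehand, the inner product and the cosine similarity coincide, so the rest of the pipeline only needs inner products.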
3. Trivial Solution
•A trivial solution is to take each vector and compute its similarity (dot product) with every other vector in the collection, which means that we load each vector O(n) times and load each term O(df_t^2) times, where df_t is the number of documents that contain term t
•This works for a small collection, but our goal is a scalable and efficient solution for large collections
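For reference, the trivial approach might look like the sketch below. It assumes the whole collection fits in memory as a dict from docid to a sparse weight vector; the function name and the sample data are hypothetical.

```python
from itertools import combinations

def all_pairs_bruteforce(docs):
    """docs: dict docid -> {term: weight}. Returns {(di, dj): score} with di < dj."""
    scores = {}
    # Every document vector is paired with every other one: O(n^2) comparisons.
    for (i, wi), (j, wj) in combinations(sorted(docs.items()), 2):
        s = sum(w * wj[t] for t, w in wi.items() if t in wj)
        if s:
            scores[(i, j)] = s  # pairs with no shared term are omitted
    return scores

# Hypothetical three-document collection.
docs = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"b": 3.0, "c": 1.0},
    "d3": {"c": 2.0},
}
brute = all_pairs_bruteforce(docs)
```

Note that the pair (d1, d3) is still compared even though the two documents share no term, which is the wasted work the next slide avoids.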
4. Better Solution
•Load weights for each term once
•Each term contributes O(df_t^2) partial scores
•Allows efficiency tricks
•Each term contributes only if it appears in both di and dj
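A single-machine sketch of this term-at-a-time idea follows. It assumes the postings are already available as a dict from term to a list of (docid, weight) pairs; the names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs_from_postings(postings):
    """postings: dict term -> [(docid, weight), ...]. Returns {(di, dj): score}."""
    scores = defaultdict(float)
    for term, plist in postings.items():
        # The weights of a term are read once; a postings list of length df_t
        # yields df_t * (df_t - 1) / 2 partial scores, i.e. O(df_t^2).
        for (i, wi), (j, wj) in combinations(sorted(plist), 2):
            scores[(i, j)] += wi * wj
    return dict(scores)

# Hypothetical postings for three documents.
postings = {
    "a": [("d1", 1.0)],
    "b": [("d1", 2.0), ("d2", 3.0)],
    "c": [("d2", 1.0), ("d3", 2.0)],
}
accumulated = all_pairs_from_postings(postings)
```

Document pairs that share no term never appear in any postings list together, so no work is spent on them.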
6. Solution – parts we have done
•An efficient solution to the pairwise document similarity problem, expressed as two separate MapReduce jobs
•Indexing:
–We build a standard inverted index (Frakes and Baeza-Yates, 1992), where each term is associated with a list of docids for the documents that contain it, along with the associated term weights.
–Use tf-idf to compute the weights.
–Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tuple consisting of the docid and term weight as the value.
–The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.
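The indexing job can be simulated in memory as sketched below. This is not the course's tf-idf or indexing code: the weighting tf * log(N / df) is one common tf-idf variant chosen here as an assumption, the grouping dict stands in for the MapReduce shuffle, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_weights(docs):
    """docs: dict docid -> list of tokens. Returns docid -> {term: tf * log(N / df)}."""
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    return {d: {t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
            for d, tokens in docs.items()}

def index_map(docid, weights):
    # Mapper: emit the term as key and (docid, weight) as value.
    for term, w in weights.items():
        yield term, (docid, w)

def index_reduce(term, values):
    # Reducer: the grouped values for a term form its postings list.
    return term, sorted(values)

def run_indexing(docs):
    groups = defaultdict(list)  # stands in for the runtime's grouping by key
    for docid, weights in tfidf_weights(docs).items():
        for term, value in index_map(docid, weights):
            groups[term].append(value)
    return dict(index_reduce(t, v) for t, v in groups.items())

# Hypothetical tokenized documents.
index = run_indexing({"d1": ["a", "b", "b"], "d2": ["b", "c"]})
```

With this weighting, a term that occurs in every document (here "b") gets weight 0, so it contributes nothing to any similarity score.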
7. Exercise – tasks
•Task 1
–Write MapReduce pseudo-code to compute sim(di,dj) based on the idea described on the previous page.
–Given the tf-idf code
–Given the inverted index computation code
–Write the similarity code to compute sim(di,dj) based on the pseudo-code designed above.
–Execute code on the provided data set, and test with the provided testing code.
•Group / individual presentation (60 min)
8. Solution
•An efficient solution to the pairwise document similarity problem, expressed as two separate MapReduce jobs
•Indexing:
–We build a standard inverted index (Frakes and Baeza-Yates, 1992), where each term is associated with a list of docids for the documents that contain it, along with the associated term weights.
–Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tuple consisting of the docid and term weight as the value.
–The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.
•PairwiseSimilarity:
–Mapping over each posting, the mapper generates key tuples corresponding to pairs of docids in the postings.
–These key tuples are associated with the product of the corresponding term weights.
–They represent the individual term contributions to the final inner product.
–The MapReduce runtime sorts the tuples, and then the reducer sums all the individual score contributions for a pair to generate the final similarity score.
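The PairwiseSimilarity job can be sketched in the same in-memory style. The mapper and reducer below follow the description above, while the driver that groups and sorts keys is only a stand-in for the MapReduce runtime; all names and the sample postings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pairs_map(term, plist):
    # Mapper: for each pair of docids in a postings list, emit the pair as key
    # and the product of the two term weights as value.
    for (i, wi), (j, wj) in combinations(sorted(plist), 2):
        yield (i, j), wi * wj

def pairs_reduce(pair, contributions):
    # Reducer: sum the per-term contributions to get the inner product.
    return pair, sum(contributions)

def run_pairwise(postings):
    groups = defaultdict(list)  # stands in for the runtime's sort and group
    for term, plist in postings.items():
        for key, value in pairs_map(term, plist):
            groups[key].append(value)
    return dict(pairs_reduce(k, v) for k, v in sorted(groups.items()))

# Hypothetical postings: d1 and d2 share both terms.
postings = {
    "a": [("d1", 1.0), ("d2", 2.0)],
    "b": [("d1", 3.0), ("d2", 1.0), ("d3", 2.0)],
}
similarity = run_pairwise(postings)
```

Sorting each postings list before pairing keeps every key in (smaller docid, larger docid) order, so the contributions for a pair always meet at the same reducer key.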