# IR-ranking

### IR-ranking

1. 1. Surajit Chaudhuri (Microsoft Research), Gautam Das (Microsoft Research), Vagelis Hristidis (Florida International University), Gerhard Weikum (MPI Informatik). 30th VLDB Conference, Toronto, Canada, 2004. Presented by Abhishek Jamloki.
2. 2. Realtor DB:
    - Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)
    - SQL query: `SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'`
3. 3. Consider a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}, and a query Q: `SELECT * FROM D WHERE X1 = x1 AND … AND Xs = xs`, where each Xi is an attribute from A and xi is a value in its domain.
    - Specified attributes: X = {X1, …, Xs}
    - Unspecified attributes: Y = A − X
    - Let S be the answer set of Q.
    - How should the tuples in S be ranked so that the top-k tuples can be returned to the user?
4. 4. IR Treatment
    - Query reformulation
    - Automatic ranking
    - Correlations are ignored in the high-dimensional spaces of IR
    - The proposed automated ranking function is based on:
        - A global score of the unspecified attributes
        - A conditional score (the strength of correlation between specified and unspecified attributes)
    - Both scores are estimated automatically using workload and data analysis
5. 6. Bayes' Rule and the Product Rule, with the following notation:
    - Document t, query Q
    - R: the set of relevant documents
    - R̄ = D − R: the set of irrelevant documents
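The formulas for the two rules are not present in this transcript. For reference, their standard statements for generic events a, b, and c (symbols chosen here for illustration) are:

$$
p(a \mid b) = \frac{p(b \mid a)\,p(a)}{p(b)} \qquad \text{(Bayes' rule)}
$$

$$
p(a, b \mid c) = p(a \mid c)\,p(b \mid a, c) \qquad \text{(product rule)}
$$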
6. 7. Each tuple t is treated as a document.
    - Partition t into two parts:
        - t(X): contains the specified attributes
        - t(Y): contains the unspecified attributes
    - Replace t with X and Y
    - Replace R̄ with D (the irrelevant set is approximated by the whole database)
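A sketch of how these substitutions might play out, assuming the usual probabilistic IR convention of ranking a document by its odds of relevance (that convention is an assumption here, since the slide's formulas are missing from the transcript). Bayes' rule is applied to the numerator and denominator, factors that do not depend on t are dropped, and the irrelevant set is approximated by D:

$$
\mathrm{Score}(t) \;\propto\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
= \frac{p(t \mid R)\,p(R)}{p(t \mid \bar{R})\,p(\bar{R})}
\;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
\;\approx\; \frac{p(t \mid R)}{p(t \mid D)}
= \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
$$

Here X and Y are shorthand for the values t(X) and t(Y).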
7. 9. Comprehensive dependency models have unacceptable preprocessing and query-processing costs, so a middle ground is chosen.
    - Given a query Q and a tuple t, the X values are assumed to be independent of one another, and likewise the Y values, but dependencies between X and Y values are allowed.
8. 11. Workload W: a collection of ranking queries that have been executed on the system in the past.
    - W is represented as a set of "tuples", where each tuple represents a query and is a vector containing the values of that query's specified attributes.
    - R is approximated as all query "tuples" in W that also request X (this approximation is novel to the paper).
    - Properties of the set of relevant tuples R can therefore be obtained by examining only the subset of the workload whose queries also request X.
    - Substitute p(y | X, W) for p(y | R).
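One way the pieces might combine is sketched below. It is a reconstruction under stated assumptions, not a copy of the slides' formulas. It starts from the ratio p(X, Y | R) / p(X, Y | D) obtained by treating the tuple as a document and approximating the irrelevant set by D. It then assumes that every relevant tuple satisfies the query, so p(X | Y, R) = 1, substitutes the workload estimate, drops factors that are identical for every tuple in the answer set, and applies the limited independence assumption:

$$
\begin{aligned}
\mathrm{Score}(t) &\propto \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
= \frac{p(Y \mid R)\,p(X \mid Y, R)}{p(Y \mid D)\,p(X \mid Y, D)}
= \frac{p(Y \mid R)}{p(Y \mid D)\,p(X \mid Y, D)} \\
&\approx \frac{p(Y \mid X, W)}{p(Y \mid D)\,p(X \mid Y, D)}
\;\propto\; \frac{p(Y \mid W)\,p(X \mid Y, W)}{p(Y \mid D)\,p(X \mid Y, D)} \\
&\propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)} \;
\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
\end{aligned}
$$

Read this way, the first product plays the role of the global score and the double product the role of the conditional score.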
9. 13. Quantities estimated from the workload and the data:
    - p(y | W): the relative frequency of each distinct value y in the workload
    - p(y | D): the relative frequency of each distinct value y in the database (similar to the IDF concept in IR)
    - p(x | y, W): the confidences of pair-wise association rules in the workload, that is, (# of tuples in W that contain both x and y) / (# of tuples in W that contain y)
    - p(x | y, D): (# of tuples in D that contain both x and y) / (# of tuples in D that contain y)
    - These quantities are stored as auxiliary tables in the intermediate knowledge representation layer
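The estimates above could be computed with simple counting. The following is a minimal sketch, assuming each workload or database row is a Python dict of attribute to value; the function name `atomic_probabilities` and the toy workload are hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Estimate p(y) and p(x | y) from rows given as {attribute: value} dicts."""
    n = len(rows)
    single = Counter()  # number of rows containing a given (attribute, value)
    pair = Counter()    # number of rows containing an ordered pair of values
    for row in rows:
        items = list(row.items())
        single.update(items)
        pair.update(permutations(items, 2))
    p_y = {y: count / n for y, count in single.items()}
    # Conditional probability: count(x and y) / count(y).
    p_x_given_y = {(x, y): count / single[y] for (x, y), count in pair.items()}
    return p_y, p_x_given_y


# Toy workload (made up): each dict holds the attributes one past query specified.
workload = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle"},
    {"City": "Kirkland", "View": "Greenbelt"},
]
p_y_W, p_x_given_y_W = atomic_probabilities(workload)
print(p_y_W[("City", "Seattle")])  # 0.75
```

The same function can be run over the database rows to obtain the p(y | D) and p(x | y, D) tables.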
10. 14. Auxiliary tables:
    - p(y | W): {AttName, AttVal, Prob}, with a B+ tree index on (AttName, AttVal)
    - p(y | D): {AttName, AttVal, Prob}, with a B+ tree index on (AttName, AttVal)
    - p(x | y, W): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, with a B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
    - p(x | y, D): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, with a B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
11. 15. Preprocessing and execution:
    - Preprocessing (Atomic Probabilities Module)
        - Computes and indexes the quantities p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values x and y
    - Execution
        - Select the tuples that satisfy the query
        - Scan and compute the score of each result tuple
        - Return the top-k tuples
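The execution steps can be sketched as follows. This is a minimal illustration, assuming the score is a product of workload-to-database probability ratios over the unspecified values (an assumed form, since the transcript does not contain the formula). The names `score` and `scan_top_k`, the `eps` fallback for missing statistics, and the toy data are all hypothetical.

```python
import heapq


def score(t, X, pW, pD, pxyW, pxyD, eps=1e-6):
    """Assumed score: prod_y p(y|W)/p(y|D) * prod_y prod_x p(x|y,W)/p(x|y,D)."""
    s = 1.0
    for y in t.items():
        if y[0] in X:
            continue  # only unspecified attribute values contribute
        s *= pW.get(y, eps) / pD.get(y, eps)
        for x in X.items():
            # Missing statistics fall back to eps/eps = 1 (hypothetical choice).
            s *= pxyW.get((x, y), eps) / pxyD.get((x, y), eps)
    return s


def scan_top_k(D, X, k, pW, pD, pxyW, pxyD):
    """D maps TID -> {attribute: value}; returns the TIDs of the k best answers."""
    # Step 1: select the tuples that satisfy the query.
    answers = [tid for tid, t in D.items()
               if all(t.get(a) == v for a, v in X.items())]
    # Steps 2 and 3: score every answer tuple and keep the k highest.
    return heapq.nlargest(
        k, answers, key=lambda tid: score(D[tid], X, pW, pD, pxyW, pxyD))


# Made-up data: two Seattle waterfront homes that differ only in Pool.
D = {
    1: {"City": "Seattle", "View": "Waterfront", "Pool": "Yes"},
    2: {"City": "Seattle", "View": "Waterfront", "Pool": "No"},
    3: {"City": "Kirkland", "View": "Waterfront", "Pool": "Yes"},
}
X = {"City": "Seattle", "View": "Waterfront"}
pW = {("Pool", "Yes"): 0.6, ("Pool", "No"): 0.1}
pD = {("Pool", "Yes"): 0.5, ("Pool", "No"): 0.5}
top = scan_top_k(D, X, 1, pW, pD, {}, {})
```

In this toy run, pools are requested in the workload more often than they occur in the data, so tuple 1 outranks tuple 2; tuple 3 is excluded because it does not satisfy the query.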
12. 16. There is a trade-off between preprocessing and query processing.
    - Pre-compute ranked lists of the tuples for all possible "atomic" queries. At query time, given an actual query that specifies a set of values X, "merge" the ranked lists corresponding to each x in X to compute the final top-k tuples.
    - The merge should be possible without scanning the entire ranked lists.
    - The Threshold Algorithm (TA) can be used for this purpose.
    - A feasible adaptation of TA should keep the number of sorted streams small.
    - In a naive adaptation, the number of sorted streams would depend on the total number of attributes in the database.
13. 17. At query time, a TA-like merge of several ranked lists (i.e. sorted streams) is performed.
    - The required number of sorted streams depends only on the number of attribute values specified in the query, not on the total number of attributes in the database.
    - Such a merge operation is possible only because of the specific functional form of the ranking function, which results from the limited independence assumptions.
14. 18. Index Module: takes the association rules and the database as inputs and, for every distinct value x, creates two lists, Cx and Gx, each containing the tuple-ids of all data tuples that contain x, ordered in specific ways.
    - Conditional list Cx: consists of pairs of the form `<TID, CondScore>`, ordered by descending CondScore, where TID is the tuple-id of a tuple t that contains x
    - Global list Gx: consists of pairs of the form `<TID, GlobScore>`, ordered by descending GlobScore, where TID is the tuple-id of a tuple t that contains x
15. 19. At query time, the scores of t are retrieved from the lists Cx1, …, Cxs and from one of Gx1, …, Gxs and multiplied together. This requires only s + 1 multiplications and yields a score that is proportional to the actual score. Two kinds of efficient access operations are needed:
    - Sorted access: given a value x, it should be possible to perform a GetNextTID operation on the lists Cx and Gx in constant time, so that tuple-ids are retrieved one by one in order of decreasing score. This corresponds to the sorted stream access of TA.
    - Random access: given a TID, the corresponding score (CondScore or GlobScore) should be retrievable in constant time.
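A TA-style merge over such lists might look like the sketch below. It is a simplified in-memory illustration, not the presented system's implementation: each stream is a Python list of (TID, score) pairs sorted by descending score, random access is emulated with dicts, and scores are assumed to be non-negative so that the product is monotone. The function name `ta_top_k` and the example scores are hypothetical.

```python
import heapq
import math


def ta_top_k(lists, k):
    """Threshold-Algorithm-style top-k merge with product aggregation."""
    if any(not lst for lst in lists):
        return []
    lookup = [dict(lst) for lst in lists]  # random access: TID -> score
    seen, top = set(), []                  # top is a min-heap of (score, TID)
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 1.0
        for lst in lists:
            tid, s = lst[min(depth, len(lst) - 1)]  # sorted access
            threshold *= s
            if tid in seen:
                continue
            seen.add(tid)
            scores = [table.get(tid) for table in lookup]
            if None in scores:
                continue  # the tuple lacks some specified value: not an answer
            heapq.heappush(top, (math.prod(scores), tid))
            if len(top) > k:
                heapq.heappop(top)
        # No unseen tuple can exceed the product of the last scores read.
        if len(top) == k and top[0][0] >= threshold:
            break
    return [tid for _, tid in sorted(top, reverse=True)]


# Made-up lists for a query with two specified values plus one global list.
C_seattle = [(1, 0.9), (2, 0.5), (3, 0.2)]
C_waterfront = [(2, 0.8), (1, 0.7), (3, 0.1)]
G_seattle = [(3, 0.6), (1, 0.5), (2, 0.4)]
best_two = ta_top_k([C_seattle, C_waterfront, G_seattle], 2)
```

The early exit is what allows the merge to stop before reading the lists to the end: once the k-th best score found so far is at least the threshold, deeper entries cannot change the result.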
16. 20. These lists are stored as database tables:
    - CondList Cx: {AttName, AttVal, TID, CondScore}, with a B+ tree index on (AttName, AttVal, CondScore)
    - GlobList Gx: {AttName, AttVal, TID, GlobScore}, with a B+ tree index on (AttName, AttVal, GlobScore)
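To make the table layout concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for the RDBMS. SQLite, the index names, the second index on TID, and the helper functions `sorted_stream` and `random_access` are assumptions made for illustration; only the table columns and the (AttName, AttVal, CondScore) index come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CondList (AttName TEXT, AttVal TEXT, TID INTEGER, CondScore REAL)")
# Composite index from the slide: supports ordered scans of one value's list.
conn.execute(
    "CREATE INDEX idx_cond_sorted ON CondList (AttName, AttVal, CondScore)")
# Hypothetical extra index to make lookups by TID cheap.
conn.execute("CREATE INDEX idx_cond_tid ON CondList (AttName, AttVal, TID)")
conn.executemany("INSERT INTO CondList VALUES (?, ?, ?, ?)", [
    ("City", "Seattle", 1, 0.9),
    ("City", "Seattle", 2, 0.5),
    ("View", "Waterfront", 2, 0.8),
])


def sorted_stream(att, val):
    """Cursor over (TID, CondScore) by descending score; iterate for GetNextTID."""
    return conn.execute(
        "SELECT TID, CondScore FROM CondList "
        "WHERE AttName = ? AND AttVal = ? ORDER BY CondScore DESC", (att, val))


def random_access(att, val, tid):
    """Score of one tuple in the list for (att, val), or None if absent."""
    row = conn.execute(
        "SELECT CondScore FROM CondList "
        "WHERE AttName = ? AND AttVal = ? AND TID = ?", (att, val, tid)).fetchone()
    return row[0] if row else None


seattle_list = list(sorted_stream("City", "Seattle"))
```

A GlobList table would follow the same pattern with GlobScore in place of CondScore.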
17. 23. The space consumed by the lists is O(mn) bytes, where m is the number of attributes and n is the number of tuples in the database table.
    - Only a subset of the lists can be stored at preprocessing time, at the expense of an increase in query processing time.
    - Which lists to retain or omit at preprocessing time is determined by analyzing the workload.
    - Store the conditional lists Cx and the corresponding global lists Gx only for those attribute values x that occur most frequently in the workload.
    - Probe the intermediate knowledge representation layer at query time to compute the missing information.
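The selection of lists to retain could be sketched as a frequency count over the workload. This is an illustrative sketch only: the function name `values_to_materialize`, the `budget` parameter, and the toy workload are hypothetical, and how the budget is chosen is not stated in the slides.

```python
from collections import Counter


def values_to_materialize(workload, budget):
    """Pick the `budget` attribute values that occur most often in the workload."""
    counts = Counter(item for query in workload for item in query.items())
    return [value for value, _ in counts.most_common(budget)]


# Made-up workload: each dict holds the attributes one past query specified.
past_queries = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle"},
    {"City": "Kirkland", "View": "Greenbelt"},
]
keep = values_to_materialize(past_queries, 2)
```

Lists Cx and Gx would then be built only for the values in `keep`; a query that specifies any other value would fall back to probing the atomic probability tables.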
18. 24. The following datasets were used:
    - MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
    - Internet Movie Database (http://www.imdb.com)

    Software and hardware:
    - Microsoft SQL Server 2000 RDBMS
    - P4 2.8-GHz PC, 1 GB RAM
    - C#, connected to the RDBMS through DAO
19. 25. Two ranking methods were evaluated: (1) Conditional and (2) Global.
    - Several hundred workload queries were collected for each dataset, and the ranking algorithm was trained on this workload.
20. 26. For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples.
    - Let each user mark 10 tuples in Hi as most relevant to Qi.
    - Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm.
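One simple way to quantify how closely the two sets match is the fraction of returned tuples that the user also marked. The sketch below assumes that measure; the slides do not say exactly which formula was used, and the function name `overlap_precision` and the TIDs are made up.

```python
def overlap_precision(user_marked, algorithm_top):
    """Fraction of the algorithm's returned tuples that the user also marked."""
    returned = set(algorithm_top)
    if not returned:
        return 0.0
    return len(returned & set(user_marked)) / len(returned)


# Made-up TIDs: the user marked tuples 1..10; the algorithm returned 7 of them.
user_marked = list(range(1, 11))
algorithm_top = [1, 2, 3, 4, 5, 6, 7, 21, 22, 23]
agreement = overlap_precision(user_marked, algorithm_top)
```

Because both sets contain 10 tuples in this survey design, the same number can be read as either precision or recall.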
21. 27. Users were given the top-5 results of the two ranking methods for 5 queries (different from those in the previous survey) and were asked which rankings they preferred.
22. 28. The performance of the various implementations of the Conditional algorithm was compared: List Merge, its space-saving variant, and Scan.
    - Datasets used:
23. 33. A completely automated approach to the Many-Answers Problem that leverages data and workload statistics and correlations.
    - Probabilistic IR models were adapted for structured data.
    - Experiments demonstrate the efficiency as well as the quality of the ranking system.
24. 34. Many relational databases contain text columns in addition to numeric and categorical columns. Can correlations between text and non-text data be leveraged in a meaningful way for ranking?
    - Comprehensive quality benchmarks for database ranking need to be established.