Probabilistic Ranking of Database Query Results (slide transcript)

  1. Probabilistic Ranking of Database Query Results
     Surajit Chaudhuri, Microsoft Research
     Gautam Das, Microsoft Research
     Vagelis Hristidis, Florida International University
     Gerhard Weikum, MPI Informatik
  2. Roadmap
     • Problem Definition
     • Architecture
     • Probabilistic Information Retrieval
     • Performance
     • Experiments
     • Related Work
     • Conclusion
  3. Motivation
     • SQL returns unordered sets of results.
     • This overwhelms users of information-discovery applications.
     • How can ranking be introduced, given that ALL results satisfy the query?
  4. Example – Realtor Database
     • House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
     • Query: City = 'Seattle' AND Waterfront = TRUE
     • Too many results!
     • Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable.
  5. Rank According to Unspecified Attributes
     • The score of a result tuple t depends on:
     • Global score: the global importance of unspecified attribute values [CIDR2003]
       – e.g., newer houses are generally preferred
     • Conditional score: correlations between specified and unspecified attribute values
       – e.g., Waterfront → BoatDock; many Bedrooms → good SchoolDistrict
  6. Key Problems
     • Given a query Q, how can the global and conditional scores be combined into a ranking function? Answer: use Probabilistic Information Retrieval (PIR).
     • How can the global and conditional scores be calculated? Answer: use the query workload and the data.
  7. Roadmap (repeated; next section: Architecture)
  8. Architecture (diagram-only slide; the architecture figure is not included in the transcript)
  9. Roadmap (repeated; next section: Probabilistic Information Retrieval)
  10. PIR Review
      • Bayes' rule and the product rule (the slide's formulas are images; a reconstruction follows)
      • Document (tuple) t, query Q
      • R: the set of relevant documents; R̄ = D − R: the irrelevant documents
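The formulas on this slide are images in the transcript. A standard reconstruction of the PIR odds-ratio score from Bayes' rule, consistent with the definitions above, is:

```latex
\mathrm{Score}(t)
  = \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  = \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
```

since p(R) and p(R̄) are constants for a fixed query and do not affect the ranking.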
  11. Ranking Function – Adapt PIR
      • Query Q: X1 = x1 AND … AND Xs = xs, with X = {X1, …, Xs}
      • Result tuple t(X, Y), where X are the specified attributes and Y the unspecified attributes
      • R ⊆ D; X, R, and D are as defined earlier, and every tuple in R satisfies X (a derivation sketch follows)
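The derivation shown on this slide is also an image. As a hedged reconstruction: applying the product rule to t = (X, Y) and dropping factors that are identical for all result tuples (they all satisfy X) gives

```latex
\mathrm{Score}(t)
  \;\propto\; \frac{p(X, Y \mid R)}{p(X, Y \mid \bar{R})}
  = \frac{p(Y \mid X, R)\, p(X \mid R)}{p(Y \mid X, \bar{R})\, p(X \mid \bar{R})}
  \;\propto\; \frac{p(Y \mid X, R)}{p(Y \mid X, \bar{R})}
```

so only the unspecified attributes Y, conditioned on the specified ones, differentiate the results.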
  12. Ranking Function – Limited Conditional Independence
      • Given a query Q and a tuple t, the X values (and likewise the Y values) are assumed independent among themselves, while dependencies between the X and Y values are allowed.
      • In the resulting factorization (below), the factors involving Y, R, and D are estimated from the data, and the factors involving X, R, and D are estimated from the workload.
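The factored formula on this slide is an image in the transcript. Under limited conditional independence, and with the workload W standing in for R and the data D for R̄ (the slide's "Use Workload" / "Use Data" annotations), a reconstruction consistent with the global/conditional split named on the next slide is:

```latex
\mathrm{Score}(t)
  \;\propto\;
  \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}}
  \;\cdot\;
  \underbrace{\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}}
```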
  13. Atomic Probability Estimation Using the Workload W
      • If many queries specify a set X of conditions, there is a preference correlation between the attributes in X.
      • Global: e.g., if many queries ask for Waterfront, then p(Waterfront = TRUE | W) is high.
      • Conditional: e.g., if many queries ask for 4-bedroom houses in good school districts, then p(Bedrooms = 4 | SchoolDistrict = 'good', W) and p(SchoolDistrict = 'good' | Bedrooms = 4, W) are high.
      • Using limited conditional independence, these atomic probabilities supply the global and conditional parts of the score.
      • The probabilities p(x | y, W) and p(x | y, D) are calculated using standard association-rule mining techniques on W and D, respectively (see the counting sketch below).
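The slide attributes these estimates to standard association-rule mining. A minimal counting sketch under that reading (hypothetical function and variable names; a workload is modeled as a list of condition-sets, one per query):

```python
from collections import Counter
from itertools import combinations

def atomic_probabilities(workload):
    """Estimate p(y | W) and p(x | y, W) by co-occurrence counting.

    `workload` is a list of past queries, each modeled as a set of
    atomic conditions such as ("Waterfront", True) or ("Bedrooms", 4).
    Applying the same counting to the database D (one condition-set
    per tuple) yields p(y | D) and p(x | y, D).
    """
    n = len(workload)
    single = Counter()  # how many queries specify each condition
    pair = Counter()    # how many queries specify both of a pair
    for query in workload:
        single.update(query)
        for a, b in combinations(query, 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1

    def p_global(y):
        # p(y | W): fraction of workload queries that specify y
        return single[y] / n if n else 0.0

    def p_conditional(x, y):
        # p(x | y, W): among queries specifying y, fraction also specifying x
        return pair[(x, y)] / single[y] if single[y] else 0.0

    return p_global, p_conditional

p_g, p_c = atomic_probabilities([
    {("Waterfront", True), ("BoatDock", True)},
    {("Waterfront", True)},
    {("Bedrooms", 4), ("SchoolDistrict", "good")},
])
print(p_g(("Waterfront", True)))                      # 0.666...
print(p_c(("BoatDock", True), ("Waterfront", True)))  # 0.5
```

In practice these counts would be smoothed so that no estimated probability is exactly zero, since the ranking formula divides by them.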
  14. Roadmap (repeated; next section: Performance)
  15. Performance
  16. Scan Algorithm
      • Preprocessing (atomic probabilities module)
        – Computes and indexes the quantities p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values x and y.
      • Execution
        – Select the tuples that satisfy the query.
        – Scan and compute the score for each result tuple.
        – Return the top-K tuples (see the sketch below).
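A minimal sketch of the Scan algorithm as described above, scoring with the factored formula from slide 12 (hypothetical interface: the p_* callables stand for lookups into the precomputed, smoothed atomic probabilities; tuples are modeled as sets of attribute-value pairs):

```python
import heapq

def scan_rank(result_tuples, specified, k, p_y_w, p_y_d, p_x_y_w, p_x_y_d):
    """Scan algorithm: score each query result and keep the top-k.

    `result_tuples`: iterable of tuples already satisfying the query,
    each a set of (attribute, value) pairs.
    `specified`: the query's specified (attribute, value) pairs (the x's).
    The four p_* callables look up the precomputed atomic probabilities
    and are assumed smoothed so that no denominator is zero.
    """
    top = []  # min-heap of (score, tiebreak, tuple)
    for i, t in enumerate(result_tuples):
        score = 1.0
        for y in t - specified:                 # unspecified values (the y's)
            score *= p_y_w(y) / p_y_d(y)        # global part
            for x in specified:                 # conditional part
                score *= p_x_y_w(x, y) / p_x_y_d(x, y)
        heapq.heappush(top, (score, i, t))
        if len(top) > k:
            heapq.heappop(top)                  # evict the current minimum
    return [t for _, _, t in sorted(top, reverse=True)]
```

The bounded heap keeps memory at O(k) while still scanning every qualifying tuple once, which is the algorithm's stated cost profile.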
  17. List Merge Algorithm
      • Preprocessing
        – For each distinct value x in the database, calculate and store a conditional list Cx and a global list Gx: for each tuple t containing x, compute its conditional and global score components (the per-tuple formulas are images in the transcript), add t to Cx and Gx respectively, and sort both lists by decreasing score.
      • Execution (query Q: X1 = x1 AND … AND Xs = xs)
        – Execute the Threshold Algorithm [Fag01] on the lists Cx1, …, Cxs, and Gxb, where Gxb is the shortest list among Gx1, …, Gxs (a sketch follows).
      • Final formula: (image in transcript)
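A hedged sketch of the Threshold Algorithm [Fag01] invoked in the execution step, under simplifying assumptions (non-negative scores aggregated by product, which is monotone, and random access available for every tuple in every list; the interface is hypothetical):

```python
import heapq

def threshold_algorithm(lists, fetch_score, k):
    """Fagin's Threshold Algorithm over score-sorted lists.

    `lists`: lists of (tuple_id, score) pairs, each sorted by decreasing
    score, with non-negative scores; a tuple's aggregate score is the
    product of its per-list scores.
    `fetch_score(i, tid)`: random access to tuple tid's score in list i.
    """
    seen = set()
    top = []  # min-heap of (aggregate_score, tuple_id)
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 1.0
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                continue                 # this list is exhausted
            tid, score = lst[depth]      # sorted access
            threshold *= score
            if tid not in seen:
                seen.add(tid)
                # random access: complete this tuple's aggregate score
                agg = 1.0
                for j in range(len(lists)):
                    agg *= fetch_score(j, tid)
                heapq.heappush(top, (agg, tid))
                if len(top) > k:
                    heapq.heappop(top)
        # stop once k tuples are found whose scores beat the threshold,
        # an upper bound on the score of any tuple not yet seen
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)
```

Because the threshold shrinks as the scan descends the sorted lists, the algorithm typically terminates after reading only a prefix of each list, which is why List Merge can beat a full Scan.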
  18. Roadmap (repeated; next section: Experiments)
  19. Quality Experiments
      • Compare the conditional ranking method with the global method [CIDR03].
      • Surveyed 14 MSR employees.
      • Datasets:
        – MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
        – Internet Movie Database (http://www.imdb.com)
      • Each user behaved according to one of various profiles, e.g.:
        – singles, middle-class family, rich retirees, …
        – teenage males, people interested in comedies of the '80s, …
      • First collect workloads, then compare the results of the two methods for a set of queries.
  20. Quality Experiments – Average Precision
      • For 5 queries, users were asked to mark 10 out of a set of 30 likely results containing the top-10 results of both the conditional and the global method plus a few randomly selected tuples.
      • Since exactly 10 results are marked relevant and each method is judged on its top-10, precision equals recall. (Results chart not included in the transcript.)
  21. Quality Experiments – Fraction of Users Preferring Each Algorithm
      • Seattle Homes and Movies datasets
      • 5 new queries
      • Top-5 result lists
      • (Results chart not included in the transcript.)
  22. Performance Experiments
      • Microsoft SQL Server 2000 RDBMS
      • P4 2.8-GHz PC, 1 GB RAM
      • Implemented in C#, connected to the RDBMS through DAO
      • Datasets (table shown on slide; not included in the transcript)
      • Compared algorithms:
        – LM: List Merge
        – Scan
  23. Performance Experiments – Precomputation
      • Time and space consumed by the index module (results chart not included in the transcript)
  24. Performance Experiments – Execution, Varying the Number of Tuples Satisfying the Selection Condition
      • US Homes database
      • 2-attribute queries
      • (Results chart not included in the transcript.)
  25. Performance Experiments – Execution
      • US Homes database
      • (Results chart not included in the transcript.)
  26. Roadmap (repeated; next section: Related Work)
  27. Related Work
      • [CIDR2003]
        – Uses the workload; focuses on the empty-answer problem.
        – Drawback: a global ranking regardless of the query. E.g., proximity to a tram line is desirable for cheap houses but undesirable for expensive ones, so a single query-independent score cannot capture the preference.
      • Collaborative filtering
        – Requires training data of queries and their ranked results.
      • Relevance-feedback techniques for learning similarity in multimedia and relational databases.
  28. Roadmap (repeated; next section: Conclusion)
  29. Conclusions – Future Work
      • Conclusions
        – A completely automated approach to the many-answers problem that leverages data and workload statistics and correlations.
        – Based on PIR.
      • Future work
        – The empty-answer problem.
        – Handling plain-text attributes.
  30. Questions?
  31. Performance Experiments – Execution
      • LMM: List Merge variant in which the lists for one of the two specified attributes are missing, halving the space.
      • Seattle Homes database
      • (Results chart not included in the transcript.)
