Probabilistic Ranking of Database Query Results

Transcript

  • 1. Probabilistic Ranking of Database Query Results. Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik
  • 2. Roadmap
    • Problem Definition
    • Architecture
    • Probabilistic Information Retrieval
    • Performance
    • Experiments
    • Related Work
    • Conclusion
  • 3. Motivation
    • SQL Returns Unordered Sets of Results
    • Overwhelms Users of Information Discovery Applications
    • How Can Ranking be Introduced, Given that ALL Results Satisfy Query?
  • 4. Example – Realtor Database
    • House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
    • Query: City = 'Seattle' AND Waterfront = TRUE
    • Too Many Results!
    • Intuitively, Houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable
  • 5. Rank According to Unspecified Attributes
    • Score of a Result Tuple t depends on
    • Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]
      • E.g., Newer Houses are generally preferred
    • Conditional Score: Correlations between Specified and Unspecified Attribute Values
      • E.g., Waterfront → BoatDock; Many Bedrooms → Good School District
  • 6. Key Problems
    • Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function? Use Probabilistic Information Retrieval (PIR).
    • How to Calculate the Global and Conditional Scores? Use the Query Workload and the Data.
  • 7. Roadmap
    • Problem Definition
    • Architecture
    • Probabilistic Information Retrieval
    • Performance
    • Experiments
    • Related Work
    • Conclusion
  • 8. Architecture
  • 9. Roadmap
    • Problem Definition
    • Architecture
    • Probabilistic Information Retrieval
    • Performance
    • Experiments
    • Related Work
    • Conclusion
  • 10. PIR Review
    • Bayes’ Rule
    • Product Rule
    • Document (Tuple) t, Query Q; R: Relevant Documents; R̄ = D − R: Irrelevant Documents
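    The slide's formula images did not survive in the transcript; the following is a standard reconstruction of the two rules named above and of the odds-ratio form they support (a sketch, not the slide verbatim):

        p(R \mid t) = \frac{p(t \mid R)\, p(R)}{p(t)}                      % Bayes' Rule
        p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)                      % Product Rule
        \mathrm{Score}(t) \propto \frac{p(R \mid t)}{p(\bar{R} \mid t)}    % rank tuples by their odds of relevance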
  • 11. Ranking Function – Adapt PIR
    • Query Q: X1 = x1 AND … AND Xs = xs, with X = {X1, …, Xs}
    • Result Tuple t(X, Y), where X: Specified Attributes, Y: Unspecified Attributes
    R ⊆ D; the terms that involve only X, R, and D are common to all result tuples, since every tuple returned for Q satisfies X.
  • 12. Ranking Function – Limited Conditional Independence
    • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
    The terms that involve R are estimated using the Workload, and the terms that involve D using the Data, for both the X- and the Y-factors (see the reconstruction below and the atomic probabilities on the next slide).
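    The factored form of the score is an image in the original slides. A hedged reconstruction, consistent with the four atomic quantities indexed on slide 16 (approximating the relevant set R by the workload W and the irrelevant set by the whole database D), is:

        \mathrm{Score}(t) \propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)} \cdot \prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}

    The first product is the Global Part and the second the Conditional Part referred to on the next slide.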
  • 13. Atomic Probabilities Estimation Using Workload W
    • If Many Queries Specify a Set X of Conditions, then there is a Preference Correlation between the Attributes in X.
    • Global: E.g., If Many Queries ask for Waterfront, then p(Waterfront=TRUE) is high.
    • Conditional: E.g., If Many Queries ask for 4-Bedroom Houses in Good School Districts, then p(Bedrooms=4 | SchoolDistrict='good') and p(SchoolDistrict='good' | Bedrooms=4) are high.
    Using Limited Conditional Independence, the Score Splits into a Global Part and a Conditional Part (see the reconstruction above)
    • Probabilities p(x | y, W) (resp. p(x | y, D)) are Calculated Using Standard Association Rule Mining Techniques on W (resp. D)
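    As a minimal sketch of this estimation step (all names are illustrative; the authors' code is not shown on the slides), the atomic probabilities can be obtained from simple (co-)occurrence counts over the workload, which is exactly what association-rule support and confidence measure:

        from collections import Counter
        from itertools import permutations

        def atomic_probabilities(workload):
            """Estimate p(x | W) and p(x | y, W) from a workload, where each
            query is represented as the set of (attribute, value) pairs it
            specifies. Co-occurrence counts play the role of association-rule
            support/confidence. Illustrative sketch, not the authors' code."""
            value_count = Counter()   # number of queries containing value x
            pair_count = Counter()    # number of queries containing both x and y
            for query in workload:
                for x in query:
                    value_count[x] += 1
                for x, y in permutations(query, 2):
                    pair_count[(x, y)] += 1
            n = len(workload)
            p_global = {x: c / n for x, c in value_count.items()}   # p(x | W)
            p_cond = {(x, y): c / value_count[y]                    # p(x | y, W)
                      for (x, y), c in pair_count.items()}
            return p_global, p_cond

        # Toy workload of three queries:
        W = [{("Waterfront", True), ("City", "Seattle")},
             {("Waterfront", True), ("BoatDock", True)},
             {("Bedrooms", 4), ("SchoolDistrict", "good")}]
        p_g, p_c = atomic_probabilities(W)   # the same counting applies to the data D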
  • 14. Roadmap
    • Problem Definition
    • Architecture
    • Probabilistic Information Retrieval
    • Performance
    • Experiments
    • Related Work
    • Conclusion
  • 15. Performance
  • 16. Scan Algorithm
    • Preprocessing - Atomic Probabilities Module
    • Computes and Indexes the Quantities p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for All Distinct Values x and y
    • Execution
    • Select Tuples that Satisfy the Query
    • Scan and Compute Score for Each Result-Tuple
    • Return Top- K Tuples
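    A minimal sketch of this execution phase, assuming the four atomic-probability tables from preprocessing are available as dictionaries (all names are illustrative; log-scores replace products of ratios for numerical stability, and smoothing of missing probabilities is omitted):

        import heapq
        from math import log

        def scan_topk(tuples, query, k, pW, pD, pW_cond, pD_cond):
            """Scan algorithm sketch: select tuples satisfying the query,
            score each with the factored PIR score, and keep the top-k on a
            min-heap. tuples: dicts attribute -> value; query: dict of the
            specified conditions. Illustrative, not the authors' code."""
            top = []
            for t in tuples:
                if not all(t.get(a) == v for a, v in query.items()):
                    continue                        # t does not satisfy Q
                Y = [(a, v) for a, v in t.items() if a not in query]
                score = 0.0
                for y in Y:
                    score += log(pW[y] / pD[y])     # global part
                    for x in query.items():         # conditional part
                        score += log(pW_cond[(x, y)] / pD_cond[(x, y)])
                heapq.heappush(top, (score, id(t), t))
                if len(top) > k:
                    heapq.heappop(top)              # drop the current minimum
            return sorted(top, reverse=True)        # top-k, best first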
  • 17. List Merge Algorithm
    • Preprocessing
    • For Each Distinct Value x of the Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows:
      • For Each Tuple t Containing x, Calculate its Conditional and Global Scores and Add them to Cx and Gx respectively
      • Sort Cx and Gx by Decreasing Score
    • Execution (Query Q: X1 = x1 AND … AND Xs = xs)
    • Execute the Threshold Algorithm [Fag01] on the Lists Cx1, …, Cxs and Gxb, where Gxb is the Shortest List among Gx1, …, Gxs (a sketch follows this slide)
    Final Formula: the Overall Score of t Combines its Entries in the Merged Lists
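    A minimal sketch of the merge phase, assuming each precomputed list holds (log-score, tuple-id) pairs sorted by decreasing score, that the aggregate score of a tuple is the sum of its entries across the merged lists, and that every candidate tuple appears in every list (all assumptions and names are illustrative):

        import heapq

        def threshold_merge(lists, k):
            """Threshold Algorithm [Fag01] sketch over sorted lists, e.g.
            Cx1, ..., Cxs and Gxb. Each list: [(score, tid), ...] sorted by
            decreasing score. Illustrative, not the authors' code."""
            random_access = [{tid: s for s, tid in lst} for lst in lists]
            seen = set()
            topk = []                                  # min-heap of (total, tid)
            for depth in range(min(len(lst) for lst in lists)):
                threshold = 0.0                        # best score any unseen tuple can reach
                for lst, ra in zip(lists, random_access):
                    s, tid = lst[depth]                # sorted access
                    threshold += s
                    if tid not in seen:
                        seen.add(tid)
                        total = sum(r[tid] for r in random_access)  # random accesses
                        heapq.heappush(topk, (total, tid))
                        if len(topk) > k:
                            heapq.heappop(topk)
                if len(topk) == k and topk[0][0] >= threshold:
                    break                              # top-k can no longer change; stop early
            return sorted(topk, reverse=True)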
  • 18. Roadmap
    • Problem Definition
    • Architecture
    • Probabilistic Information Retrieval
    • Performance
    • Experiments
    • Related Work
    • Conclusion
  • 19. Quality Experiments
    • Compare our Conditional Ranking Method with the Global Method [CIDR2003]
    • Surveyed 14 MSR employees
    • Datasets:
      • MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
      • Internet Movie Database (http://www.imdb.com)
    • Each User Behaved According to Various Profiles. E.g.:
      • singles, middle-class family, rich retirees…
      • teenage males, people interested in comedies of the 80s…
    • First Collect Workloads, Then Compare Results of 2 Methods for a Set of Queries
  • 20. Quality Experiments – Average Precision
    • For 5 Queries, Users were Asked to Mark 10 Results out of a Set of 30 Likely Results containing the Top-10 Results of both the Conditional and the Global Method plus a Few Randomly Selected Tuples
    • Precision = Recall, since Each User Marks Exactly 10 Results and the Top-10 Lists are Evaluated
  • 21. Quality Experiments - Fraction of Users Preferring Each Algorithm
    • Seattle Homes and Movies Datasets
    • 5 new queries
    • Top-5 Result-lists
  • 22. Performance Experiments
    • Microsoft SQL Server 2000 RDBMS
    • P4 2.8-GHz PC, 1 GB RAM
    • C#, Connected to RDBMS through DAO
    • Datasets
    • Compared Algorithms:
      • LM: List Merge
      • Scan
  • 23. Performance Experiments - Precomputation
    • Time and Space Consumed by Index Module
  • 24. Performance Experiments - Execution, Varying the Number of Tuples Satisfying the Selection Conditions
    • US Homes Database
    • 2-Attribute Queries
  • 25. Performance Experiments - Execution
    • US Homes Database
  • 26. Roadmap
    • Problem Definition
    • Architecture
    • Probabilistic Information Retrieval
    • Performance
    • Experiments
    • Related Work
    • Conclusion
  • 27. Related Work
    • [CIDR2003]
      • Use Workload, Focus on Empty-Answer Problem.
      • Drawback: Global Ranking Regardless of the Query. E.g., Being Close to a Tram Stop is Undesirable for Expensive Houses but Desirable for Cheap Ones.
    • Collaborative Filtering
      • Require Training Data of Queries and their Ranked Results
    • Relevance-Feedback Techniques for Learning Similarity in Multimedia and Relational Databases
  • 28. Roadmap
    • Problem Definition
    • Architecture
    • Probabilistic Information Retrieval
    • Performance
    • Experiments
    • Related Work
    • Conclusion
  • 29. Conclusions – Future Work
    • Conclusions
    • Completely Automated Approach for the Many-Answers Problem that Leverages Data and Workload Statistics and Correlations
    • Based on PIR
    • Future Work
    • Empty-Answer Problem
    • Handle Plain Text Attributes
  • 30. Questions?
  • 31. Performance Experiments - Execution
    • LMM: List Merge Variant where the Lists for One of the Two Specified Attributes are Missing, Halving the Space Requirement
    • Seattle Homes Database