My cool new Slideshow!



  1. Automated Ranking of Database Query Results
     Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis
     Presented by: Upa Gupta
  2. Contents
     • Introduction
     • IDF Similarity
     • QF Similarity
     • Breaking Ties
     • Implementation (ITA Algorithm)
     • Conclusion
  3. Introduction
     Databases use a Boolean query model: a tuple either satisfies the query or it does not.
     E.g. SELECT * WHERE MFR_Country = "Germany" AND Type = "Sports" AND MFR = "Volkswagen"
     Problems with this model:
     • Empty answers: a too-selective query returns an empty result set.
     • Many answers: a too-general query returns too many results.
  4. Introduction
     Ranking of database query results using IR techniques:
     • Apply the TF-IDF concept to databases, based on the frequency of attribute values.
       - TF-IDF needs to be extended to numerical domains.
       - The paper discusses this as IDF Similarity.
     • Collect a WORKLOAD (past queries) and use it for ranking.
       - QF Similarity leverages this workload information.
  5. Introduction
     The Many Answers problem is solved using top-K query processing: an Index-based Threshold Algorithm (ITA) is developed that exploits IDF/QF Similarity.
  6. IDF Similarity
     What is the TF-IDF technique?
     • Given a set of documents and a query, documents are ranked based on the TF (term frequency) and IDF (inverse document frequency) of the words they contain.
     Adapting the IDF concept to a database containing only categorical attributes:
     • t = <t1, ..., tm>: attribute values of a tuple, one per attribute A1, ..., Am
     • n: number of tuples in the database
  7. IDF Similarity
     For every value t in the domain of attribute A:
     • Frequency F(t) is the number of tuples with A = t.
     • IDF(t) = log(n / F(t))
     • For a pair of values u and v in the domain of A: S(u, v) = IDF(u) if u = v, otherwise 0.
     • For a tuple T = <t1, ..., tm> and a query Q = <q1, ..., qm> over the attributes A1, ..., Am:
       $SIM(T, Q) = \sum_{k=1}^{m} S_k(t_k, q_k)$
  8. IDF Similarity
     Example:
     CAR_ID  MODEL     MFR          MFR_Country  Type
     1       SLR       Mercedes     Germany      Sports
     2       A6        Audi         Germany      Executive
     3       R8        Audi         Germany      Sports
     4       Gallardo  Lamborghini  Italy        Sports
     Query Q: SELECT * WHERE MFR_Country = "Germany" AND Type = "Sports" AND MFR = "Volkswagen"
  9. IDF Similarity
     n = 4
     F(MFR_Country = Germany) = 3
     IDF(MFR_Country = Germany) = log(n / F(MFR_Country = Germany)) = log(4/3) ≈ 0.288 (natural logarithm)
     Similarly:
     • IDF(MFR_Country = Italy) ≈ 1.386
     • IDF(MFR = Audi) ≈ 0.693
     • IDF(MFR = Lamborghini) ≈ 1.386
     • IDF(MFR = Mercedes) ≈ 1.386
     • IDF(Type = Sports) ≈ 0.288
     • IDF(Type = Executive) ≈ 1.386
     Similarity of the 1st tuple with Q:
     SIM(T, Q) = S(Germany, Germany) + S(Sports, Sports) + S(Mercedes, Volkswagen)
               = IDF(MFR_Country = Germany) + IDF(Type = Sports) + 0
               = 2 log(4/3) ≈ 0.575
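The arithmetic on this slide can be checked with a short script. The following is a minimal sketch, assuming natural logarithms (which match the slide's values) and an in-memory list of dictionaries standing in for the table; the names `CARS`, `idf_table`, and `idf_similarity` are illustrative and do not come from the slides.

```python
import math
from collections import Counter

# The four-row car table from the example slide.
CARS = [
    {"MODEL": "SLR", "MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "A6", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Executive"},
    {"MODEL": "R8", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "Gallardo", "MFR": "Lamborghini", "MFR_Country": "Italy", "Type": "Sports"},
]


def idf_table(rows, attribute):
    """IDF(t) = ln(n / F(t)) for every value t that occurs in one attribute."""
    n = len(rows)
    freq = Counter(row[attribute] for row in rows)
    return {value: math.log(n / count) for value, count in freq.items()}


def idf_similarity(row, query, rows):
    """SIM(T, Q): sum of IDF(q_k) over the attributes where t_k equals q_k."""
    total = 0.0
    for attribute, wanted in query.items():
        if row[attribute] == wanted:
            total += idf_table(rows, attribute)[wanted]
    return total


QUERY = {"MFR_Country": "Germany", "Type": "Sports", "MFR": "Volkswagen"}
scores = [idf_similarity(row, QUERY, CARS) for row in CARS]
```

Tuples 1 and 3 both match on country and type, so they receive the same score of 2 ln(4/3); no tuple matches MFR = "Volkswagen", so that condition contributes 0 everywhere.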
  10. IDF Similarity
      Consider a numeric attribute in the DB, e.g. PRICE.
      Simple solution: discretize the data into ranges.
      • With the two ranges (0, 50) and (51, 100), the values 49 and 52 are considered completely dissimilar.
      Instead, the frequency of a numeric value t of an attribute is defined as the sum of contributions to t from every value t_i in the database:
      $F(t) = \sum_{i=1}^{n} e^{-\frac{1}{2}\left(\frac{t_i - t}{h}\right)^2}$
      $IDF(t) = \log(n / F(t))$
      where h is the bandwidth parameter.
      $S(t, q) = e^{-\frac{1}{2}\left(\frac{t - q}{h}\right)^2} \, IDF(q)$
      i.e. the density at t of a Gaussian distribution centered at q, scaled by IDF(q).
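A minimal sketch of the numeric case, assuming the Gaussian-kernel definitions F(t) = Σ exp(-½((t_i - t)/h)²), IDF(t) = log(n / F(t)), and S(t, q) = exp(-½((t - q)/h)²) · IDF(q). The price list and the bandwidth `H` are hypothetical values chosen only for illustration.

```python
import math


def numeric_frequency(t, values, h):
    """F(t): sum of Gaussian contributions to t from every value in the column."""
    return sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)


def numeric_idf(t, values, h):
    """IDF(t) = ln(n / F(t)); non-negative because F(t) never exceeds n."""
    return math.log(len(values) / numeric_frequency(t, values, h))


def numeric_similarity(t, q, values, h):
    """S(t, q): Gaussian closeness of t to q, scaled by IDF(q)."""
    return math.exp(-0.5 * ((t - q) / h) ** 2) * numeric_idf(q, values, h)


PRICES = [20, 49, 52, 95]  # hypothetical PRICE column
H = 10.0                   # hypothetical bandwidth parameter
```

With a query value of 50, the prices 49 and 52 now receive nearly equal, high similarity, whereas the discretized ranges would have treated them as completely dissimilar. A query value far from every stored price drives F(t) toward zero, so a production version would need to guard the logarithm.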
  11. IDF Similarity
      Consider the following query: SELECT * WHERE MFR_Country IN ("Germany", "Italy", "Japan")
      With Q_k the set of values the query allows for attribute A_k:
      $SIM(T, Q) = \sum_{k=1}^{m} \max_{q \in Q_k} S_k(t_k, q)$
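A small sketch of the IN-clause rule, assuming SIM(T, Q) sums, over the attributes, the best per-attribute similarity against any value in that attribute's IN-list. The IDF values are those of the four-row car table; treating a value that never occurs in the table (here "Japan") as contributing 0 is an assumption made for this sketch, and all names are illustrative.

```python
import math

# IDF of MFR_Country values in the four-row car table (3 German cars, 1 Italian).
COUNTRY_IDF = {"Germany": math.log(4 / 3), "Italy": math.log(4 / 1)}


def s_country(t, q):
    """Categorical IDF similarity: IDF(q) on an exact match, otherwise 0.

    Assumption: a query value absent from the table contributes 0.
    """
    return COUNTRY_IDF.get(q, 0.0) if t == q else 0.0


def in_clause_similarity(row, query, similarity_fns):
    """Sum over attributes of the best match within each attribute's IN-list."""
    return sum(
        max(similarity_fns[attr](row[attr], q) for q in allowed)
        for attr, allowed in query.items()
    )


QUERY = {"MFR_Country": ["Germany", "Italy", "Japan"]}
FNS = {"MFR_Country": s_country}
german = in_clause_similarity({"MFR_Country": "Germany"}, QUERY, FNS)
italian = in_clause_similarity({"MFR_Country": "Italy"}, QUERY, FNS)
```

Under plain IDF the Italian car scores higher than the German ones, because Italy is the rarer value in the table.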
  12. QF Similarity
      Problems with IDF:
      • In a realtor database, more homes were built in recent years such as 2007 and 2008 than in 1980 and 1981, so recent years have a small IDF. Yet newer homes are in higher demand.
      • In a bookstore DB, demand for an author is driven by factors other than the number of books the author has written.
  13. QF Similarity
      WORKLOAD: past queries.
      The importance of an attribute value is determined by how frequently it occurs in the workload.
      In the realtor example, queries requesting homes built in 2010 are more frequent than queries requesting homes built in 1981.
  14. QF Similarity
      For categorical data:
      • RQF(q) = raw frequency of occurrence of value q of attribute A in the query strings of the workload
      • RQFMax = raw frequency of the most frequently occurring value in the workload
      • Query frequency QF(q) = RQF(q) / RQFMax
      • S(t, q) = QF(q) if q = t, otherwise 0
      QF resembles TF.
  15. QF Similarity
      Consider a workload containing the following values of attribute TYPE: {Sports, Executive, Luxury, Sports, Sports, Executive}
      RQF(Executive) = 2 and RQFMax = RQF(Sports) = 3, so QF(Executive) = RQF(Executive) / RQFMax = 2/3
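The QF computation for this workload takes only a few lines. This is a minimal sketch, assuming the workload for one attribute has already been extracted into a flat list of requested values; `query_frequency` is an illustrative name.

```python
from collections import Counter


def query_frequency(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value q seen in the workload."""
    counts = Counter(workload_values)   # RQF for each value
    rqf_max = max(counts.values())      # RQFMax; raises ValueError on an empty workload
    return {value: count / rqf_max for value, count in counts.items()}


TYPE_WORKLOAD = ["Sports", "Executive", "Luxury", "Sports", "Sports", "Executive"]
QF_TYPE = query_frequency(TYPE_WORKLOAD)
```

Here `QF_TYPE["Sports"]` is 1.0 because Sports is the most requested value, and `QF_TYPE["Executive"]` is 2/3 as on the slide.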
  16. QF Similarity
      Similarity between pairs of different categorical attribute values can also be derived from the workload, e.g. to find S(Audi, Mercedes).
      The similarity coefficient between t and q is then defined as the Jaccard coefficient scaled by the QF factor:
      $S(t, q) = J(W(t), W(q)) \cdot QF(q)$, where $J(W(t), W(q)) = \frac{|W(t) \cap W(q)|}{|W(t) \cup W(q)|}$
      • W(t) = subset of queries in workload W in which the categorical value t occurs in an IN clause
      For t = q the Jaccard coefficient is 1, so this reduces to S(t, q) = QF(q).
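A sketch of the workload-derived similarity, under the assumption that the Jaccard coefficient is multiplied by QF(q), so that the t = q case reduces to S(t, q) = QF(q). The workload below, including the value "BMW", is hypothetical, and RQF is approximated by the number of queries whose IN-list mentions the value.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical workload: each entry is the IN-list of one past query on MFR.
WORKLOAD = [
    {"Audi", "Mercedes"},
    {"Audi", "Mercedes", "BMW"},
    {"Audi"},
    {"Lamborghini"},
]


def queries_mentioning(value, workload):
    """W(value): positions of the workload queries whose IN clause contains value."""
    return {i for i, in_list in enumerate(workload) if value in in_list}


def workload_similarity(t, q, workload):
    """S(t, q) = J(W(t), W(q)) * QF(q), with RQF(q) taken as |W(q)|."""
    all_values = set().union(*workload)
    rqf_max = max(len(queries_mentioning(v, workload)) for v in all_values)
    w_t = queries_mentioning(t, workload)
    w_q = queries_mentioning(q, workload)
    return jaccard(w_t, w_q) * len(w_q) / rqf_max
```

In this workload Audi and Mercedes are often requested together, so `workload_similarity("Audi", "Mercedes", WORKLOAD)` is positive, while Audi and Lamborghini never co-occur and score 0.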
  17. QF-IDF
      For QF-IDF Similarity: S(t, q) = QF(q) * IDF(q) when t = q, otherwise 0.
  18. Breaking Ties
      If SIM(T1, Q) = SIM(T2, Q), which tuple should be ranked higher?
      • QF and IDF partition the database into classes of equally ranked tuples.
      CAR_ID  MODEL     MFR          MFR_Country  Type
      1       SLR       Mercedes     Germany      Sports
      2       A6        Audi         Germany      Executive
      3       R8        Audi         Germany      Sports
      4       Gallardo  Lamborghini  Italy        Sports
      Q: SELECT * WHERE Type = "Sports" AND MFR_Country = "Germany"
      Tuples 1 and 3 match Q on both attributes and therefore tie.
  19. Breaking Ties with QF
      Determine weights for the missing attribute values (the attributes not specified in the query) that reflect their "global importance", using the workload:
      $\text{Global Imp} = \sum_{k} \log(QF(t_k))$, where the t_k are the tuple's values for the missing attributes.
      Missing attributes for Q: MFR and Model
  20. Breaking Ties with QF
      Consider a workload with the following values of MFR and Model:
      • MFR: {Audi, Audi, Lamborghini, Mercedes, Lamborghini, Audi}
      • Model: {R8, A6, Gallardo, SLR, Gallardo, A6}
      For tuple 1 (SLR, Mercedes, Germany, Sports):
      • QF(SLR) = 1/2 = 0.5
      • QF(Mercedes) = 1/3 ≈ 0.33
      • Global Imp = log(0.5) + log(0.33)
      Negative values of Global Imp? Since QF ≤ 1, every log term is at most 0, so Global Imp is never positive.
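The tie between tuples 1 and 3 can be worked through with this workload. The sketch below assumes Global Imp is the sum of log(QF(t_k)) over the missing attributes and that the tuple with the larger (less negative) value is ranked first; the function and variable names are illustrative.

```python
import math
from collections import Counter


def query_frequency(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value q seen in the workload."""
    counts = Counter(workload_values)
    rqf_max = max(counts.values())
    return {value: count / rqf_max for value, count in counts.items()}


QF = {
    "MFR": query_frequency(
        ["Audi", "Audi", "Lamborghini", "Mercedes", "Lamborghini", "Audi"]
    ),
    "MODEL": query_frequency(["R8", "A6", "Gallardo", "SLR", "Gallardo", "A6"]),
}


def global_importance(row, missing_attributes):
    """Sum of log(QF(t_k)) over attributes the query leaves unspecified.

    Assumption: every value appears in the workload; an unseen value
    would raise KeyError here and needs separate handling.
    """
    return sum(math.log(QF[attr][row[attr]]) for attr in missing_attributes)


# The two tuples that tie on Type = "Sports" AND MFR_Country = "Germany".
TIED = [
    {"CAR_ID": 1, "MODEL": "SLR", "MFR": "Mercedes"},
    {"CAR_ID": 3, "MODEL": "R8", "MFR": "Audi"},
]
ranked = sorted(
    TIED, key=lambda row: global_importance(row, ["MFR", "MODEL"]), reverse=True
)
```

Both values are negative or zero, but they still order the tuples: the Audi R8 (CAR_ID 3) comes first because Audi is the most requested manufacturer in the workload.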
  21. Breaking Ties with IDF
      Two options, each with a drawback:
      • Rank tuples with large IDF of missing attributes (infrequently occurring values) higher: cars that are not popular are ranked higher.
      • Rank tuples with small IDF of missing attributes higher: cars with a moonroof are ranked lower, even though a moonroof is a desirable feature.
  22. Implementation
      • Pre-processing component
      • Query-processing component
  23. Implementation
      Pre-processing component:
      • Compute and store a representation of the similarity function (QF-IDF, QF, or IDF) in auxiliary database tables.
  24. Implementation
      Query-processing component:
      • Job: retrieve the top-K results from the database.
      • ITA algorithm: combines Fagin's Threshold Algorithm with the similarity function.
      • Sorted access: along any attribute Ak, TIDs of tuples are retrieved in order.
      • Random access: the entire tuple corresponding to a TID is retrieved.
  25. ITA Algorithm
      Initialize Top-K buffer to empty
      Repeat
        For each k = 1 to p
          TID = index of the next tuple retrieved from ordered list Lk
          T = complete tuple retrieved for TID
          Compute the value of the ranking function for T
          If T ranks higher than the lowest-ranking tuple in the Top-K buffer, then update the Top-K buffer
          If the stopping condition has been reached, then exit
        End For
      Until the indexes of all tuples have been seen
  26. ITA Algorithm: Stopping Condition
      • Hypothetical tuple: takes the current values a1, ..., ap for A1, ..., Ap, corresponding to the index seeks on L1, ..., Lp, and the values qp+1, ..., qm for the remaining columns directly from the query.
      • Termination: the similarity of the hypothetical tuple to the query is less than that of the tuple with the least similarity in the Top-K buffer.
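The loop and its stopping condition can be sketched in memory as follows. This is a simplified illustration, not the paper's implementation: it assumes categorical IDF similarity, an ordered list for every query attribute (p = m), a stopping test evaluated once per round rather than after every access, and Fagin's usual "less than or equal" comparison so that ties terminate. All names are hypothetical.

```python
import heapq
import math
from collections import Counter


def ita_top_k(rows, query, k):
    """rows: {tid: {attr: value}}, query: {attr: value}. Returns [(score, tid)], best first."""
    n = len(rows)
    attrs = list(query)

    # Pre-processing stand-in: IDF of each query value (0 if the value never occurs).
    idf = {}
    for a in attrs:
        freq = Counter(row[a] for row in rows.values())
        idf[a] = math.log(n / freq[query[a]]) if freq[query[a]] else 0.0

    def s(a, value):
        return idf[a] if value == query[a] else 0.0

    def sim(tid):
        return sum(s(a, rows[tid][a]) for a in attrs)

    # Ordered lists L_k: TIDs by decreasing per-attribute similarity (sorted access).
    lists = {a: sorted(rows, key=lambda tid, a=a: -s(a, rows[tid][a])) for a in attrs}

    top = []      # min-heap of (score, tid): the Top-K buffer
    seen = set()
    for depth in range(n):
        last = {}
        for a in attrs:
            tid = lists[a][depth]
            last[a] = s(a, rows[tid][a])
            if tid in seen:
                continue
            seen.add(tid)
            score = sim(tid)  # random access: fetch and score the whole tuple
            if len(top) < k:
                heapq.heappush(top, (score, tid))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, tid))
        # Hypothetical tuple: best score any unseen tuple could still achieve.
        threshold = sum(last.values())
        if len(top) == k and threshold <= top[0][0]:
            break
    return sorted(top, reverse=True)


CARS = {
    1: {"MFR_Country": "Germany", "Type": "Sports"},
    2: {"MFR_Country": "Germany", "Type": "Executive"},
    3: {"MFR_Country": "Germany", "Type": "Sports"},
    4: {"MFR_Country": "Italy", "Type": "Sports"},
}
best = ita_top_k(CARS, {"MFR_Country": "Germany", "Type": "Sports"}, k=2)
```

On the car table the loop stops after two rounds: by then tuples 1 and 3 fill the buffer, and the hypothetical tuple built from the last values read from each list cannot beat them, so tuple 4 is never fetched.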
  27. ITA for Numeric Columns
      Consider a query with the condition Ak = qk for a numeric column Ak. Two index scans are performed on Ak:
      • The first retrieves the TIDs of tuples with values greater than qk, in increasing order of value.
      • The second retrieves the TIDs of tuples with values less than qk, in decreasing order of value.
      TIDs are then picked from the merged stream.
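The two scans and their merge might look like this in memory. The sketch assumes the index is a list of (value, TID) pairs sorted by value and that the merge always picks the entry closer to qk, which yields TIDs in order of decreasing Gaussian similarity. Sending values equal to qk to the upward scan is an assumption so that exact matches are not lost; `nearest_first` is an illustrative name.

```python
import bisect


def nearest_first(index, q):
    """Yield TIDs from a value-sorted index in order of increasing |value - q|.

    index: list of (value, tid) pairs sorted by value.
    """
    values = [value for value, _ in index]
    right = bisect.bisect_left(values, q)  # upward scan starts at the first value >= q
    left = right - 1                       # downward scan starts at the last value < q
    while left >= 0 or right < len(index):
        take_right = left < 0 or (
            right < len(index) and index[right][0] - q <= q - index[left][0]
        )
        if take_right:
            yield index[right][1]
            right += 1
        else:
            yield index[left][1]
            left -= 1


PRICE_INDEX = [(20, "a"), (49, "b"), (52, "c"), (95, "d")]  # hypothetical index
order = list(nearest_first(PRICE_INDEX, 50))
```

Because the generator is lazy, a top-K loop can stop consuming it as soon as its stopping condition holds, without scanning the rest of the index.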
  28. Conclusion
      • Automated ranking infrastructure for SQL databases.
      • Extended TF-IDF based techniques from information retrieval to numeric and mixed data.
      • Implemented a ranking function that exploits Fagin's Threshold Algorithm (TA).
  29. THANK YOU