Julia Stoyanovich - Making interval-based clustering rank-aware

  1. Making Interval-Based Clustering Rank-Aware
     Julia Stoyanovich (University of Pennsylvania)
     Joint work with Sihem Amer-Yahia (Qatar Foundation) and Tova Milo (Tel Aviv University)
     Яндекс, 23.08.2011
  2. Research Directions
     • Representation of Large Complex Datasets
       – Symmetric relationships [VLDB 2004]
       – Faceted databases [VLDB 2005, Internet Archaeology 2007]
       – Schema polynomials [EDBT 2008]
       – Probabilistic databases [ICDE 2011]
       – Scientific workflows with provenance [CIDR 2011, ICDT 2011]
     • Information Discovery in Large Complex Datasets
       – Search and ranking in social context [VLDB 2008, AAAI-SIP 2008, SIGMOD 2008]
       – Ranked data exploration in semantic context [ICDE 2010, SIGMOD 2011]
       – Rank-aware clustering [CIKM 2009, EDBT 2011]
       – Exploring repositories of scientific workflows [WANDS 2010, AMW 2011]
       – Exploring repositories of functional genomics experiments [submitted]
       – Estimating susceptibility to genetic disorders [Bioinformatics 2007]
  3. Applications and Prototypes
     • The Faceted Query Engine applied to archaeology
     • Biological data management
       – MutaGeneSys – estimating individual genetic disease susceptibility
       – AnnotCompute – exploring repositories of microarray experiments
       – SkylineSearch – semantic ranking and result visualization for PubMed
       – myExperiment topics – exploring repositories of scientific workflows
     • “Shopping and dating”
       – Yahoo! Garçon – a collaborative tagging recommender system
       – Yahoo! FindLove – rank-aware clustering for dating data
  4. Ranked Exploration of Structured Datasets
     Dating service user Mike
     • Find matches
       – age: [18,40]
       – education: at least some college
       – income: > $50,000 / year
     • Rank by income from higher to lower
     • Problems
       – too many results
       – results are homogeneous at top ranks, due to correlations among attributes!
       – correlations may be complex, depend on the selection criteria and on the ranking function
     [Figure: ranked result list. The top results all read "MBA, 40 years old, makes $150K"; after … 999 matches: "PhD, 36 years old, makes $100K"; after … 9999 matches: "BS, 27 years old, makes $80K"]
  5. An Example from Yahoo! Personals
     [Figure legend: income > $50K; edu > BS]
     Observe that
       1. the % of women with income > $50K increases with age
       2. the % of women with post-graduate education increases until age 29, then plateaus
     There is a clear positive correlation between
       1. age and income, for all ages
       2. education and income, at least until age 29
     Correlations are local
  6. Goal: Find Clusters that Correlate with Ranking
     [Figure: example clusters]
     • age: 26-37, edu: PhD, income: 100-130K
     • age: 18-25, edu: BS, MS, income: 50-75K
     • age: 33-40, income: 125-150K
     • edu: MS, income: 50-75K
     • age: 26-30, income: 75-110K
  7. Roadmap
     • Introduction
     ➞ Rank-aware clustering
       – The formalism
       – The BARAC algorithm
     • Experimental evaluation
       – Effectiveness
       – Efficiency
     • Conclusion
  8. What Is Subspace Clustering?
     [Figure from Parsons et al., SIGKDD Explorations 6(1), 2006]
  9. Why Do We Need Subspace Clustering?
     [Figure from Parsons et al., SIGKDD Explorations 6(1), 2006]
  10. How Do We Find Subspace Clusters?
      • Subspace clustering finds clusters in multiple, possibly overlapping, subspaces
        – Dimensionality reduction per cluster
        – Lower-dimensional clusters are easier to identify and their descriptions are more palatable to the users
        – Example: “age 20-25” and “edu = BS” and “income 25K-50K”
      • Two main approaches
        – Top-down: start with full dimensionality and refine
        – Bottom-up: start with dense units in 1D, combine to find higher-dimensional clusters
      • Issues
        – What is a cluster? – need a measure of quality
        – How do we find clusters? – need a search strategy
  11. Problem Statement
      • User specifies a conjunction of filtering conditions, e.g.,
        Q : age ∈ [20,40] ∧ edu ≥ Bachelors
      • User specifies a ranking function, e.g., a linear combination
        R : [income, …], [age, …]
      We do not restrict the set of ranking functions, but assume that ranking is derived from, or correlates with, attribute values.
      Given a query Q and a ranking function R, find rank-aware clusters in subspaces of the dataset. Clusters are subspaces that:
      • have sufficient rank-aware quality
      • are tight
      • are maximal
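
The filtering step of the problem statement can be sketched in a few lines. This is only an illustration: the `Profile` type, the `EDU_LEVELS` ordering, and the sample profiles are hypothetical, and the ranking simply sorts by income from higher to lower (as in the Mike example) rather than using a two-attribute combination.

```python
from dataclasses import dataclass

# Hypothetical ordering of education levels; the slide only names "Bachelors".
EDU_LEVELS = {"HS": 0, "Bachelors": 1, "Masters": 2, "PhD": 3}

@dataclass(frozen=True)
class Profile:
    name: str
    age: int
    edu: str
    income: int  # in $K per year

def matches_q(p: Profile) -> bool:
    """Q: age in [20, 40] AND edu >= Bachelors."""
    return 20 <= p.age <= 40 and EDU_LEVELS[p.edu] >= EDU_LEVELS["Bachelors"]

def rank(profiles):
    """R (simplified assumption): income from higher to lower."""
    return sorted(profiles, key=lambda p: p.income, reverse=True)

profiles = [
    Profile("a", 40, "Masters", 150),
    Profile("b", 36, "PhD", 100),
    Profile("c", 27, "Bachelors", 80),
    Profile("d", 45, "PhD", 200),  # fails the age condition
    Profile("e", 25, "HS", 90),    # fails the education condition
]

ranked = rank([p for p in profiles if matches_q(p)])
print([p.name for p in ranked])  # ['a', 'b', 'c']
```

Rank-aware clustering then operates on `ranked`, the filtered and ordered result set.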
  12. BARAC: Bottom-up Algorithm for Rank-Aware Clustering
      • BuildGrid
        – split each dimension into intervals
        – compute top-N for each interval
      • Merge
        – merge neighboring intervals using rank-aware locality (interval dominance); this ensures tightness
      • Join
        – build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality; this ensures maximality and rank-aware quality
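
As a rough sketch of the BuildGrid phase, the function below splits one numeric attribute into intervals and keeps the top-N items of each. The equal-width split, the `build_grid` name, and the sample ages are assumptions made for illustration; the slides do not say how interval boundaries are chosen.

```python
from collections import defaultdict

def build_grid(items, attr, width, score, n):
    """Split one numeric attribute into equal-width intervals (an assumed
    strategy) and keep the top-n items of each interval by score."""
    buckets = defaultdict(list)
    for item in items:
        lo = (item[attr] // width) * width
        buckets[(lo, lo + width - 1)].append(item)
    return {
        bounds: sorted(members, key=score, reverse=True)[:n]
        for bounds, members in sorted(buckets.items())
    }

# Hypothetical items; higher income ranks first.
items = [
    {"id": "m1", "age": 26, "income": 99},
    {"id": "m3", "age": 28, "income": 90},
    {"id": "m7", "age": 25, "income": 75},
    {"id": "m6", "age": 31, "income": 125},
    {"id": "m8", "age": 33, "income": 110},
]

grid = build_grid(items, "age", width=5, score=lambda it: it["income"], n=2)
for bounds, top in grid.items():
    print(bounds, [it["id"] for it in top])
# (25, 29) ['m1', 'm3']
# (30, 34) ['m6', 'm8']
```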
  13. Avoiding Match Homogeneity at Top Ranks
      Cluster descriptions must accurately describe the top-N items
      [Figure: ranked result list (top results "MBA, 40 years old, makes $150K"; after … 999 matches: "PhD, 36 years old, makes $100K"; after … 9999 matches: "BS, 27 years old, makes $80K") with two candidate descriptions: "age: 25-40, income: 75-150K" and "age: 40, income: 150K"]
      Tightness will give us this property
  14. Ranked Intervals and Interval Dominance
      • Ranked intervals: description, contents (items), top-N
        – I1 : age ∈ [25,30], I2 : edu = MBA
      • Interval dominance is a rank-aware measure of locality, defined
        – over 2 consecutive intervals on the same attribute
        – for a ranking function R, integer N, and dominance threshold θ_dom ∈ (0.5, 1]
      • I1 dominates I2 if [formula not recovered]
      • Example (top-10): I1 : age ∈ [20,24], I2 : age ∈ [25,29], I1 + I2 : age ∈ [20,29]
        – R1 : age (asc): I2 <_{10,1} I1
        – R2 : 0.3 inc + 0.7 edu (desc): I1 <_{10,0.8} I2
        – R3 : rel serv (asc): I1 <>_{10,0.5} I2
  15. Property 1: Tightness
      R : [income, …]
      [Figure: matches aged 38 and 36, described by "age: 35-39, edu: PhD" versus "age: 30-39, edu: PhD"]
      I1 : age ∈ [30,34], I2 : age ∈ [35,39], I1 + I2 : age ∈ [30,39]
      if I1 dominates I2, then add I1 and I2 to the search space
      else add I1, I2, and I1 + I2 to the search space
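
The merge rule above can be sketched as follows. The dominance test used here is an assumption, since the defining formula is not in the transcript: I1 is taken to dominate I2 when at least θ_dom of the top-N items of the merged interval come from I1. The function names and the sample items are hypothetical.

```python
def top_n(items, score, n):
    """The n highest-scoring items."""
    return sorted(items, key=score, reverse=True)[:n]

def dominates(i1, i2, score, n, theta_dom):
    """Assumed definition: I1 dominates I2 if at least theta_dom of the
    top-n items of the merged interval I1 + I2 come from I1."""
    merged_top = top_n(i1 + i2, score, n)
    if not merged_top:
        return False
    from_i1 = sum(1 for item in merged_top if item in i1)
    return from_i1 / len(merged_top) >= theta_dom

def merge_step(i1, i2, score, n, theta_dom):
    """Intervals to add to the search space, following the slide's rule:
    the merged interval is added only when I1 does not dominate I2."""
    if dominates(i1, i2, score, n, theta_dom):
        return [i1, i2]
    return [i1, i2, i1 + i2]

by_income = lambda item: item[1]  # items are (id, income_in_K) pairs

# All of the merged top-3 comes from the first interval: no merged interval.
older = [("x", 150), ("y", 140), ("z", 130)]
younger = [("u", 90), ("v", 80)]
print(len(merge_step(older, younger, by_income, n=3, theta_dom=0.8)))  # 2

# Top ranks are shared between the intervals: the merged interval is added.
left = [("a", 100), ("c", 80)]
right = [("b", 90), ("d", 70)]
print(len(merge_step(left, right, by_income, n=2, theta_dom=0.8)))  # 3
```

Skipping the merged interval in the dominated case is what keeps descriptions tight: a description such as "age: 30-39" would be misleading if every top-ranked item were 35 or older.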
  16. Choose Best from Among Comparable
      R : [income, …]
      • {age: 33-40, income: 126-150K} > ? {age: 33-40, income: 70-100K}
      • {age: 33-40, income: 125-150K} ≠ ? {age: 26-30, income: 75-110K}
      Rank-aware clustering quality will give us this property
  17. Ranked Subspaces and Clusters
      A ranked subspace S : {I1, …, Im} is a set of ranked intervals over distinct attributes, e.g., S : { age ∈ [25,30], edu = MBA }
      • interpreted as a conjunction of predicates over dataset D
      • dimensionality = number of intervals
      Goal: find subspaces that have sufficient rank-aware clustering quality
      All rank-aware clustering quality measures
        – compare the top-N list of a ranked subspace to the top-N lists of its constituent ranked intervals
        – are defined for a ranking function R, an integer N, and a quality threshold θ_Q ∈ (0.5, 1]
  18. Property 2: Rank-Aware Clustering Quality
      R : income, N = 3, θ_Q = 2/3
      Ranked intervals (items in rank order):
        – age: 25-29: m1 99K, m3 90K, m7 75K, m9 65K
        – edu: BS: m1 99K, m2 95K, m3 90K, m4 85K, m5 85K
        – age: 30-34: m6 125K, m8 110K, m10 100K, m2 95K, m4 85K
      Ranked subspaces:
        – age: 25-29, edu: BS: m1 99K, m3 90K
        – age: 30-34, edu: BS: m2 95K, m4 85K
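
The example on this slide can be checked with a short computation. The quality function below is an assumed reading of QtopN (the smallest fraction, over the constituent intervals, of an interval's top-N that also appears in the subspace's top-N), and the item lists are transcribed from the slide's flattened figure.

```python
from fractions import Fraction

def top_ids(items, n):
    """IDs of the n highest-income items; items are (id, income_in_K) pairs."""
    ranked = sorted(items, key=lambda it: it[1], reverse=True)
    return {item_id for item_id, _ in ranked[:n]}

def q_top_n(subspace_items, intervals, n):
    """Assumed QtopN: minimum, over the constituent intervals, of the fraction
    of the interval's top-n that also appears in the subspace's top-n."""
    sub_top = top_ids(subspace_items, n)
    return min(Fraction(len(sub_top & top_ids(iv, n)), n) for iv in intervals)

# Interval contents as read from the slide.
age_25_29 = [("m1", 99), ("m3", 90), ("m7", 75), ("m9", 65)]
edu_bs = [("m1", 99), ("m2", 95), ("m3", 90), ("m4", 85), ("m5", 85)]
age_30_34 = [("m6", 125), ("m8", 110), ("m10", 100), ("m2", 95), ("m4", 85)]

# Subspace contents as read from the slide.
young_bs = [("m1", 99), ("m3", 90)]
older_bs = [("m2", 95), ("m4", 85)]

q_young = q_top_n(young_bs, [age_25_29, edu_bs], n=3)
q_older = q_top_n(older_bs, [age_30_34, edu_bs], n=3)
print(q_young, q_older)  # 2/3 0
```

Under this reading, {age: 25-29, edu: BS} reaches exactly θ_Q = 2/3, while {age: 30-34, edu: BS} shares nothing with the top-3 of age: 30-34. Whether the threshold comparison is strict is not settled by the transcript.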
  19. Rank-Aware Clustering Quality Measures
      • QtopN : subspace contains > θ_Q items from the top-N of its intervals
        – Considers top-N lists as sets
      • QSCORE : subspace contains > θ_Q high-scoring items from the top-N of its intervals
        – Based on the sums of scores of top-N items
      • QSCORE & RANK : subspace contains > θ_Q high-scoring, high-ranking items from the top-N of its intervals
        – Based on NDCG, incorporates both scores and ranks
      • Clustering quality measures must exhibit downward closure
        – Quality of a subspace is no higher than the quality of its included subspaces
        – Holds trivially for density-based measures, due to set properties
        – Also holds for our measures, details omitted here
  20. Property 3: Maximality
      Avoid producing redundant clusters
      [Figure: the 2-dimensional clusters {age: 25-40, edu: PhD}, {edu: PhD, income: 100-130K}, and {age: 25-40, income: 100-130K}, and the 3-dimensional cluster {age: 25-40, edu: PhD, income: 100-130K}]
      Maximality will give us this property; it comes for free with bottom-up subspace clustering
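
A minimal sketch of the maximality property, using the clusters from this slide: a cluster is redundant if its intervals are strictly contained in those of another cluster. The `maximal_only` helper is hypothetical and written as a post-filter for clarity, whereas the slide notes that a bottom-up algorithm gets this property for free.

```python
def maximal_only(clusters):
    """Keep clusters whose interval set is not strictly contained in another's."""
    sets = [frozenset(c) for c in clusters]
    return [sorted(c) for c in sets if not any(c < other for other in sets)]

clusters = [
    {"age: 25-40", "edu: PhD"},
    {"edu: PhD", "income: 100-130K"},
    {"age: 25-40", "income: 100-130K"},
    {"age: 25-40", "edu: PhD", "income: 100-130K"},
]

print(maximal_only(clusters))
# [['age: 25-40', 'edu: PhD', 'income: 100-130K']]
```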
  21. BARAC Recap
      • BuildGrid
        – split each dimension into intervals
        – compute top-N for each interval
      • Merge
        – merge neighboring intervals using rank-aware locality (interval dominance); this ensures tightness
      • Join
        – build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality; this ensures maximality and rank-aware quality
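
The Join phase can be sketched as Apriori-style candidate generation, which is one common way to exploit downward closure; the slides do not spell out the actual procedure, so the `join` function, its notion of "compatible", and the `quality_ok` predicate are assumptions.

```python
from itertools import combinations

def join(prev_clusters, quality_ok):
    """Build K-dimensional clusters from (K-1)-dimensional ones.

    prev_clusters: set of frozensets of (attribute, interval) pairs, all of
    the same size K-1. quality_ok: hypothetical predicate standing in for a
    rank-aware clustering quality check."""
    candidates = set()
    for a, b in combinations(prev_clusters, 2):
        cand = a | b
        if len(cand) != len(a) + 1:
            continue  # the two clusters must differ in exactly one interval
        if len({attr for attr, _ in cand}) != len(cand):
            continue  # intervals must be over distinct attributes
        # Downward closure: every (K-1)-dimensional subset must be a cluster.
        if all(frozenset(s) in prev_clusters for s in combinations(cand, len(a))):
            candidates.add(cand)
    return {c for c in candidates if quality_ok(c)}

one_d = {
    frozenset([("age", "25-40")]),
    frozenset([("age", "18-25")]),
    frozenset([("edu", "PhD")]),
    frozenset([("income", "100-130K")]),
}

# Stand-in quality check: reject anything involving the younger age interval.
ok = lambda c: ("age", "18-25") not in c

two_d = join(one_d, ok)
three_d = join(two_d, ok)
print(len(two_d), len(three_d))  # 3 1
```

The subset check prunes candidates before the more expensive quality test runs, which is where downward closure pays off.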
  22. Complexity of BARAC
      • Polynomial in input size, exponential in the number of attributes
      • Exponential dependency is unavoidable!
        – Even counting distinct maximal frequent itemsets is #P-complete
      • Example
        – 1 item for each combination of attribute values
        – each item has an arbitrary distinct score
        – find rank-aware clusters with QtopN, N = 1
        – there is 1 cluster per item, so an exponential number of clusters!
      • But lower in practice
        – correlations are local
        – clustering quality requires 50% overlap at top-N
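
To make the worst-case example concrete, the count below assumes d attributes where attribute i takes k_i distinct values; the symbols d, k_i, and k are introduced here for illustration. The numeric instance uses d = 17 only because that is the number of clustering attributes reported for the experimental dataset, with a hypothetical k = 2.

```latex
\#\text{items} \;=\; \prod_{i=1}^{d} k_i ,
\qquad
k_i = k \ \text{for all } i \;\Longrightarrow\; \#\text{clusters} \;=\; k^{d},
\qquad
\text{e.g. } k = 2,\ d = 17:\quad 2^{17} = 131{,}072 .
```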
  23. Roadmap
      • Introduction
      • Rank-aware clustering
        – The formalism
        – The BARAC algorithm
      ➞ Experimental evaluation
        – Effectiveness
        – Efficiency
      • Conclusion
  24. Experimental Dataset: Yahoo! Personals
      • Data and users
        – 5 weeks, 454 users, 861 searches
        – 19 filtering attributes, 17 clustering attributes, 6 ranking attributes
        – Filtering on attributes, user-specified
        – Filtering on geo location (only for effectiveness evaluation)
        – QtopN clustering quality metric
      • Ranking function: weighted sum
        – sum of normalized per-attribute distances from best attribute value from among matches
        – attributes: age, height, body type, education, income, religious services
        – personalized by user: choice of attributes, sort order, normalization
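
One possible reading of this ranking function is sketched below. The equal weights, the range-based normalization, and the `rank_scores` name are all assumptions; the slide only says that the score is a weighted sum of normalized per-attribute distances from the best value among the matches, personalized per user.

```python
def rank_scores(matches, attrs):
    """Sum of normalized per-attribute distances from the best value among
    the matches; a lower total ranks higher.

    attrs maps attribute name to "asc" (smallest value is best) or "desc"
    (largest value is best). Weights are all 1 here, which is an assumption."""
    best_and_span = {}
    for attr, order in attrs.items():
        values = [m[attr] for m in matches]
        best = min(values) if order == "asc" else max(values)
        span = (max(values) - min(values)) or 1  # avoid division by zero
        best_and_span[attr] = (best, span)
    return {
        m["id"]: sum(abs(m[a] - best) / span for a, (best, span) in best_and_span.items())
        for m in matches
    }

# Hypothetical matches and a user preference for high income and low age.
matches = [
    {"id": "p1", "age": 25, "income": 80},
    {"id": "p2", "age": 30, "income": 150},
    {"id": "p3", "age": 40, "income": 100},
]
scores = rank_scores(matches, {"income": "desc", "age": "asc"})
ranking = sorted(scores, key=scores.get)
print(ranking)  # ['p2', 'p1', 'p3']
```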
  25. Evaluation of Effectiveness: User Study

      content \ presentation    list          groups
      top-100                   top list      top groups
      BARAC                     BARAC list    BARAC groups
  28. Effectiveness Metrics and Results
      • Users may fave matches and / or groups
        – When a group is faved, all matches in that group are faved
      • A productive search has at least 1 faved match/group

      treatment      % prod. searches    num. faves per search    num. faves per prod. search
      top list       17                  0.84                     5.05
      top group      14                  0.87                     7.33 / 1.17 groups
      BARAC list     15                  0.74                     4.93
      BARAC group    20                  1.55                     12.38 / 1.91 groups
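
The three reported metrics can be computed from a per-search count of faves, as in the sketch below. The `effectiveness` function and the sample log are hypothetical and are not the study's data.

```python
def effectiveness(faves_per_search):
    """Compute the three metrics from a list with one fave count per search.
    A search is productive if it has at least 1 fave."""
    total = len(faves_per_search)
    productive = [f for f in faves_per_search if f >= 1]
    return {
        "pct_productive": 100 * len(productive) / total,
        "faves_per_search": sum(faves_per_search) / total,
        "faves_per_productive_search": (
            sum(productive) / len(productive) if productive else 0.0
        ),
    }

# Hypothetical log: 10 searches, 3 of them productive.
log = [0, 0, 3, 0, 2, 0, 0, 0, 5, 0]
metrics = effectiveness(log)
print(metrics["pct_productive"], metrics["faves_per_search"])  # 30.0 1.0
```

Note that faves per productive search equals faves per search divided by the fraction of productive searches, which is a useful sanity check when reading the table.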
  29. Evaluation of Efficiency
      • Summary of results: BARAC is scalable
        – runtimes of BuildGrid and Join dominate performance
        – runtime of Merge is negligible
      • All reported results are over the complete set of female profiles in Yahoo! Personals, without any location-based filtering!
  30. Evaluation of Efficiency
      [Figure: runtime of BuildGrid (ms, 0 to 8000) vs. # items (0 to 500,000)]
  31. Evaluation of Efficiency
      [Figure: runtime of Join (ms, 0 to 3500) vs. # clustering dimensions (2 to 17)]
  32. Performance of Join
      [Figure: runtime of Join (ms, 0 to 600) vs. quality threshold (0.5 to 1), one series per dimensionality, 3D to 9D]
      * results for 100 Yahoo! Personals users on the full Y!P dataset.
  33. Performance of Join
      [Figure: runtime of Join (ms, 0 to 1000) vs. dominance threshold (0.5 to 1), one series per dimensionality, 3D to 9D]
      * results for 100 Yahoo! Personals users on the full Y!P dataset.
  34. Roadmap
      • Introduction
      • Rank-aware clustering
        – The formalism
        – The BARAC algorithm
      • Experimental evaluation
        – Effectiveness
        – Efficiency
      ➞ Conclusion
  35. Rank-Aware Clustering: Recap
      • Formalized rank-aware clustering, a novel data exploration paradigm
      • Developed a rank-aware measure of locality and a family of rank-aware clustering quality measures
      • Proposed BARAC: a bottom-up algorithm for rank-aware clustering
      • Presented an experimental evaluation on Yahoo! Personals (also restaurants in Yahoo! Local)
        • Effectiveness
        • Efficiency
      [Figures: example clusters (age: 18-25, edu: BS, MS, inc: 50-75K; age: 33-40, inc: 126-150K; age: 26-30, inc: 75-110K) and runtime of BuildGrid (ms) vs. # items]
  36. Related Work
      • Subspace clustering
        – CLIQUE [Agrawal et al., 1998], ENCLUS [Cheng et al., 1999]
        – Improvements [Nagesh, 1999], [Liu et al., 2000], [Chang and Jin, 2002]
      • Ranking of structured data
        – Many answers, empty answer problems [Chaudhuri et al., 2004], [Agrawal et al., 2003]
        – Rank-aware attribute selection [Das et al., 2006]
      • Integrating ranking with clustering
        – Mixture model, mutual reinforcement between ranking and clustering, for heterogeneous information networks, e.g., DBLP [Sun et al., 2009]
      • Diversification
        – Web search [Agichtein et al., 2007], [Anagnostopoulos et al., 2005], [Kummamuru et al., 2004], …
        – Database queries [Chen and Li, 2007], [Vee et al., 2008]
        – Recommendation [Boim et al., 2011], [Yu et al., 2009]
  37. Future Work: Choosing a Clustering Quality Measure
      [Figure: score (0 to 12) vs. rank (0 to 100); series: attribute-rank, geo-rank]
  38. Thank you!
  39. Take 1: Density-Based Clustering
      [Figure: grid over age intervals 18-25, 26-30, 31-35, 36-40 and income intervals 50-75K, 76-100K, 101-125K, 126-150K; min density = 2]
  40. Take 1: Density-Based Clustering
      [Figure: merged grid over age intervals 18-30, 31-35, 36-40 and income intervals 50-75K, 76-100K, 101-150K; min density = 2; clusters: {age: 18-30, income: 50-75K} and {age: 36-40, income: 101-150K}]
  41. Take 2: A Lower Threshold?
      [Figure: grid over age intervals 18-25, 26-30, 31-35, 36-40 and income intervals 50-75K, 76-100K, 101-125K, 126-150K; min density = 1]
  42. Take 2: A Lower Threshold?
      [Figure: with density > 0, the intervals merge into age: 18-40 and income: 50-150K, giving the single cluster {age: 18-40; income: 50-150K}]
  43. Performance of BARAC
      [Figure: percentage (0% to 100%) for BuildGrid, Join, and Total at runtime thresholds <30 sec, <20 sec, <15 sec, <10 sec, <5 sec, <1 sec]
      * results for 100 Yahoo! Personals users on the full Y!P dataset.
