The document proposes a framework for optimum document clustering based on the cluster hypothesis. It defines a cluster metric called pairwise precision that evaluates how well a clustering groups together documents that are relevant to the same queries. The metric considers the number of document pairs that are both relevant or both irrelevant to a query within each cluster. The framework aims to find the clustering that maximizes this metric to optimally satisfy the cluster hypothesis. The document outlines experiments to test the framework and examine whether it leads to improved clustering over traditional methods.
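To make the metric concrete, here is a minimal Python sketch of a pairwise cluster-quality score of the kind described: over all within-cluster document pairs and all queries, it counts how often a pair is judged homogeneously (both relevant or both irrelevant). The function name and the exact aggregation are illustrative assumptions, not the paper's precise definition.

```python
from itertools import combinations

def pairwise_precision(clusters, relevant):
    """Fraction of (within-cluster pair, query) combinations where the pair is
    homogeneous: both documents relevant, or both irrelevant, to the query.
    clusters: list of sets of doc ids; relevant: dict query -> set of doc ids.
    """
    good = total = 0
    for docs in clusters:
        for a, b in combinations(docs, 2):
            for rel in relevant.values():
                total += 1
                if (a in rel) == (b in rel):  # both relevant or both irrelevant
                    good += 1
    return good / total if total else 0.0

# Example: the pair (1, 2) is homogeneous for q1, so the score is 1.0
print(pairwise_precision([{1, 2}, {3}], {"q1": {1, 2}}))
```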
This paper introduces nominal schemas as a way to integrate rules and description logics. Nominal schemas allow variables to be treated like nominals in description logics, avoiding a hybrid logic. The paper shows that reasoning in SROIQ extended with nominal schemas (SROIQV) remains N2ExpTime-complete. It also identifies a tractable fragment, SROELVn, by limiting the occurrences of "problematic" nominal schemas. The paper defines what makes a nominal schema occurrence "safe" and uses this to prove tractability.
Processing Reachability Queries with Realistic Constraints on Massive Network... (BigMine)
Massive graphs are ubiquitous in various application domains, such as social networks, road networks, communication networks, biological networks, RDF graphs, and so on. Such graphs are massive (for example, with hundreds of millions of nodes and edges or even more) and contain rich information (for example, node/edge weights, labels and textual contents). In such massive graphs, an important class of problems is to process various graph structure related queries. Graph reachability, as an example, asks whether a node can reach another in a graph. However, the large graph scale presents new challenges for efficient query processing.
In this talk, I will introduce two new yet important types of graph reachability queries: weight-constraint reachability, which imposes an edge-weight constraint on the answer path, and k-hop reachability, which imposes a length constraint on the answer path. With such realistic constraints, we can find more meaningful and practically feasible answers. These two reachability queries have wide applications in many real-world problems, such as QoS routing and trip planning.
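As a concrete illustration of the query semantics (not of the index structures such a talk would propose for massive graphs), a k-hop reachability check is just a depth-limited BFS; for weight-constraint reachability one would analogously search only over edges satisfying the weight constraint. A minimal sketch, assuming an adjacency-list dict:

```python
from collections import deque

def k_hop_reachable(adj, src, dst, k):
    """Can src reach dst in at most k hops? adj maps a node to its
    out-neighbors. BFS visits nodes in nondecreasing hop order, so marking
    a node as seen on first visit is safe."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return True
        if hops < k:
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, hops + 1))
    return False

adj = {"a": ["b"], "b": ["c"], "c": []}
print(k_hop_reachable(adj, "a", "c", 2))  # True
print(k_hop_reachable(adj, "a", "c", 1))  # False
```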
This document provides a reference card for data mining functions and packages in R. It lists popular R packages and functions for tasks such as association rule mining, classification/prediction, clustering, outlier detection, time series analysis, text mining, and social network analysis. Recommended packages and functions are shown in bold.
The document presents research on access strategies for network caching. It introduces the data store selection problem of determining which data stores to access based on indicators to minimize miss costs and access costs. The paper proposes modeling this as a knapsack problem and provides three approximation algorithms - DSKnap, DSPot, and DSPP. An evaluation on a real Wikipedia trace and CDN topology shows the DSKnap algorithm outperforms existing heuristics in total access costs across different miss rates and number of accessed locations.
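As a rough illustration of the knapsack framing only (not the paper's DSKnap, DSPot, or DSPP algorithms), one can greedily pick data stores by estimated value per unit of access cost under a cost budget; the field names and cost model below are assumptions for the sketch:

```python
def select_stores(stores, budget):
    """Greedy knapsack-style selection: each candidate data store has an
    access cost and an indicator-based estimated hit value; pick stores
    maximizing value within an access-cost budget.
    stores: list of (name, access_cost, expected_hit_value)."""
    chosen, spent = [], 0.0
    # classic density heuristic: best value per unit of access cost first
    for name, cost, value in sorted(stores, key=lambda s: s[2] / s[1], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

print(select_stores([("near", 1.0, 0.6), ("far", 3.0, 0.9)], budget=2.0))
```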
Efficient steganography techniques are needed to secure digital information on the Internet and to communicate secret data, and many techniques have therefore been proposed for steganography. One of the intelligent techniques is the Particle Swarm Optimization (PSO) algorithm. Recently, many modifications of Standard PSO (SPSO) have been proposed, such as Human-Based Particle Swarm Optimization (HPSO). This paper therefore presents image steganography using HPSO to find the best locations in the cover image for hiding a secret text message, and then compares image steganography using SPSO with using HPSO. Experimental results on six 256×256 cover images and secret messages of different sizes show that the proposed image steganography using HPSO performs better than using SPSO.
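For context, the particle update at the heart of SPSO, which variants such as HPSO modify, looks roughly as follows; in the steganography setting a particle's position would encode candidate embedding locations in the cover image. The `fitness` argument is a stand-in for whatever imperceptibility measure the paper optimizes, and the constants are conventional defaults, not the paper's settings:

```python
import random

def pso_step(particles, fitness, w=0.7, c1=1.5, c2=1.5):
    """One standard PSO iteration. Each particle is a dict with position x,
    velocity v, and personal best pbest; gbest is the swarm-wide best."""
    gbest = min((p["pbest"] for p in particles), key=fitness)
    for p in particles:
        r1, r2 = random.random(), random.random()
        # velocity: inertia + pull toward personal best + pull toward global best
        p["v"] = [w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
                  for v, x, pb, gb in zip(p["v"], p["x"], p["pbest"], gbest)]
        p["x"] = [x + v for x, v in zip(p["x"], p["v"])]
        if fitness(p["x"]) < fitness(p["pbest"]):
            p["pbest"] = list(p["x"])
```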
This document provides a summary of data mining and text mining packages and functions available in R. It lists popular packages and functions for tasks such as association rule mining, classification/prediction using decision trees and random forests, clustering, outlier detection, time series analysis, text cleaning/preparation, topic modeling, and social network analysis. It also includes packages and functions for evaluating model performance and visualizing results.
A Signature Scheme as Secure as the Diffie-Hellman Problem (vsubhashini)
This document summarizes a theory seminar on cryptography that covered digital signature schemes. It began with an introduction to hard assumptions like the discrete log problem and computational Diffie-Hellman problem. It then described the ElGamal digital signature scheme, including its key generation, signing, and verification algorithms. It discussed the security of signature schemes in the chosen message attack model and how the ElGamal scheme's unforgeability relies on the hardness of computing discrete logs. It analyzed the probability of an adversary using oracle queries to forge a signature or solve the computational Diffie-Hellman problem. References for the original ElGamal and related signature scheme papers were also provided.
The document outlines the PAC-Bayesian bound for deep learning. It discusses how the PAC-Bayesian bound provides a generalization guarantee that depends on the KL divergence between the prior and posterior distributions over hypotheses. This allows the bound to account for factors like model complexity and noise in the training data, avoiding some limitations of other generalization bounds. The document also explains how the PAC-Bayesian bound can be applied to stochastic neural networks by placing distributions over the network weights.
1) The document outlines PAC-Bayesian bounds, which provide probabilistic guarantees on the generalization error of a learning algorithm.
2) PAC-Bayesian bounds relate the expected generalization error of the output distribution Q to the training error, the number of samples, and the KL divergence between the prior P and posterior Q distributions over hypotheses (one standard form is written out after this list).
3) The bounds show that better generalization requires a smaller divergence between P and Q, meaning the training process should not alter the distribution of hypotheses too much. This provides insights into reducing overfitting in deep learning models.
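For concreteness, one standard form of such a bound (a McAllester-style statement; the slides may use a different variant) is:

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size m,
% simultaneously for all posteriors Q over hypotheses:
R(Q) \;\le\; \widehat{R}(Q) + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}
% where R(Q) is the expected true risk and \widehat{R}(Q) the empirical
% (training) risk under the posterior Q.
```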
"Scalable Link Discovery for Modern Data-Driven Applications" as presented in the 15th International Semantic Web Conference ISWC, Doctoral Consortium, October 18th, 2016, held in Kobe, Japan
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
This document summarizes locality sensitive hashing (LSH) for approximate near neighbor search in high dimensional spaces. LSH works by using hash functions that map similar points to the same buckets with high probability, allowing efficient retrieval of approximate near neighbors. The document outlines how LSH can solve the (c,R)-approximate near neighbor problem in sublinear time, discusses analysis of success probability and query time, and gives an example with preprocessing in O(N√N log N) time and queries in O(√N log N) time.
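A minimal sketch of one classic LSH family (random hyperplanes for cosine similarity, chosen here for brevity; the slides' sublinear-time analysis applies to LSH families in general): similar points agree on most sign bits, and points whose signatures match on some band of bits become candidate neighbors to be checked exactly.

```python
import numpy as np

def lsh_buckets(points, n_bits=32, n_bands=4, seed=0):
    """Build per-band hash tables mapping a band of signature bits to the
    ids of points sharing that band. Parameters are illustrative; the
    (c,R)-guarantees come from tuning bits/bands per the analysis."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, points.shape[1]))
    sigs = points @ planes.T > 0          # one sign bit per hyperplane
    width = n_bits // n_bands
    tables = [dict() for _ in range(n_bands)]
    for i, s in enumerate(sigs):
        for b, table in enumerate(tables):
            key = tuple(s[b * width:(b + 1) * width])
            table.setdefault(key, []).append(i)   # bucket of candidates
    return tables
```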
Mining Frequent Closed Graphs on Evolving Data Streams (Albert Bifet)
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME (HONGJOO LEE)
A 45-minute talk about collecting home network performance measurements, analyzing and forecasting time series data, and building an anomaly detection system.
In this talk, we will go through the whole process of data mining and knowledge discovery. First we write a script to run a speed test periodically and log the metric. Then we parse the log data, convert it into a time series, and visualize the data for a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep learning techniques are used for the analysis, including ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
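A minimal sketch of the ARIMA part of such a pipeline using statsmodels; the file name, the (2, 1, 2) order, and the 3-sigma residual rule are illustrative assumptions, not choices from the talk:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assume `speedtest.log.csv` (hypothetical) holds periodic speed-test
# measurements indexed by timestamp.
series = pd.read_csv("speedtest.log.csv", index_col=0, parse_dates=True).squeeze()

model = ARIMA(series, order=(2, 1, 2))   # (p, d, q) chosen for illustration
fitted = model.fit()
forecast = fitted.forecast(steps=24)     # next 24 measurement intervals

# Flag anomalies as points far outside the model's in-sample predictions.
resid = fitted.resid
anomalies = series[abs(resid) > 3 * resid.std()]
print(forecast.head(), anomalies)
```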
This document discusses computing commonalities between SPARQL conjunctive queries. It defines the concept of a least general generalization (lgg) of queries: a most general query that entails each of the input queries. The document presents definitions for the lgg of basic graph pattern queries in SPARQL with respect to a set of RDF entailment rules and RDFS constraints. It focuses on computing the lgg of a set of queries by iteratively taking the lgg of query pairs. The goal is to study computing lggs in the conjunctive fragment of SPARQL, with applications such as query optimization and recommendation.
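The core of the lgg construction is anti-unification of terms: keep what the two patterns share and introduce a shared variable where they differ. A minimal sketch on single triple patterns, ignoring the entailment rules and RDFS constraints the paper additionally handles:

```python
def lgg_triple(t1, t2, fresh):
    """Least general generalization of two triple patterns: keep terms where
    the patterns agree; replace each disagreeing pair of terms with a
    variable (the same variable for the same pair, so joins are preserved)."""
    out = []
    for a, b in zip(t1, t2):
        if a == b:
            out.append(a)
        else:
            if (a, b) not in fresh:
                fresh[(a, b)] = f"?v{len(fresh)}"
            out.append(fresh[(a, b)])
    return tuple(out)

fresh = {}
print(lgg_triple((":alice", ":knows", ":bob"), (":carol", ":knows", ":bob"), fresh))
# -> ('?v0', ':knows', ':bob')
```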
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice (Frederic Desprez)
This document summarizes a joint workshop on workflow allocations and scheduling on Infrastructure as a Service (IaaS) platforms. It discusses using on-demand resources to more efficiently allocate workflows compared to static allocations. It proposes two algorithms, Eager and Deferred, to determine allocations within a given budget limit. Simulations using synthetic workflows showed Deferred guarantees budget constraints while Eager is faster. For small applications and budgets, Deferred is preferred. For larger applications and budgets approaching task parallelism saturation, Eager performs better. The document also discusses a prototype system using Nimbus, Phantom and DIET to deploy workflows on IaaS resources.
The document discusses information theory concepts like entropy, joint entropy, conditional entropy, and mutual information. It then discusses how these concepts relate to generalization in deep learning models. Specifically, it explains that the PAC-Bayesian bound is data-dependent, so models with high VC dimension can still generalize if the data is clean, resulting in low KL divergence between the prior and posterior distributions.
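For reference, the quantities mentioned are the standard ones for discrete random variables X and Y:

```latex
H(X)       = -\sum_{x} p(x)\,\log p(x)                      % entropy
H(X,Y)     = -\sum_{x,y} p(x,y)\,\log p(x,y)                % joint entropy
H(X \mid Y) = H(X,Y) - H(Y)                                 % conditional entropy
I(X;Y)     = H(X) + H(Y) - H(X,Y) = H(X) - H(X \mid Y)      % mutual information
```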
This document discusses speaker diarization, which is the process of segmenting an audio stream into homogeneous segments according to speaker identity. It covers feature extraction methods like MFCCs, segmentation using Bayesian Information Criteria to compare Gaussian mixture models, and clustering algorithms like k-means and hierarchical agglomerative clustering. Dendrogram visualizations are used to identify natural speaker clusters. The overall goal is to partition audio recordings of discussions or debates into homogeneous segments to attribute speech segments to individual speakers.
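For the segmentation step, one common BIC-based change-detection criterion (a standard formulation; the slides may parameterize it differently) compares modeling a window Z = X ∪ Y of N frames with a single Gaussian against modeling X and Y separately, hypothesizing a speaker change where ΔBIC > 0:

```latex
\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma_Z|
  - \frac{N_X}{2}\log|\Sigma_X| - \frac{N_Y}{2}\log|\Sigma_Y|
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

where d is the feature dimension (e.g., the number of MFCCs), the Σ are sample covariances, and λ is a penalty weight.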
The document describes a data structure called a Compact Dynamic Rewritable Array (CDRW) that compactly stores arrays where each entry can be dynamically rewritten. It supports creating an array of size N where each entry is initially empty (0 bits), setting an entry to a value of at most k bits, and getting an entry's value. The goal is to use space close to the minimum possible, the sum of the entries' lengths, while supporting these operations in O(1) time. The document presents solutions using compact hashing that achieve O(1) time for get and set using (1+ε) times the minimum space plus O(N) bits, for any constant ε > 0. Experimental results show these perform well in terms of both time and space.
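To fix the interface (only the interface; a plain dict makes no attempt at the compact-hashing space bound that is the point of the structure), a sketch of the three operations:

```python
class RewritableArraySketch:
    """Interface sketch of the CDRW operations described above, backed by an
    ordinary dict. The actual data structure meets the same semantics in O(1)
    time within roughly the sum of the entries' bit lengths."""
    def __init__(self, n):
        self.n = n
        self.entries = {}            # index -> value; absent means empty (0 bits)

    def set(self, i, value, k):
        assert 0 <= i < self.n and value.bit_length() <= k
        self.entries[i] = value      # the real structure stores only ~k bits here

    def get(self, i):
        return self.entries.get(i, 0)
```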
This document provides information about options for a 401(k) account when leaving a job or retiring. The main options are leaving the money in the current 401(k), rolling it over to an IRA, transferring to a new employer's 401(k), or withdrawing the funds. Rolling over to a Homestead Funds IRA is presented as one choice that provides investment options and control over access to funds. Key details are provided about rolling over to a Roth IRA and the tax implications. Overall the document aims to help readers understand their choices for managing 401(k) savings after leaving a job.
Julia Stoyanovich - Making interval-based clustering rank-aware (yaevents)
This document discusses rank-aware clustering of interval-based data. It introduces the problem of finding clusters in datasets where attributes are correlated in complex ways and where the goal is to discover clusters that correlate with a specified ranking function. It presents the BARAC algorithm, a bottom-up approach for discovering such rank-aware clusters. BARAC builds ranked intervals, merges neighboring intervals based on a rank-aware locality measure, and joins intervals to form maximal clusters that meet a rank-aware clustering quality threshold. The document evaluates BARAC on a real-world dating preferences dataset, finding that it effectively discovers meaningful clusters and scales to large datasets.
Tips for Communicating With a Worldwide Association (StarChapter)
This document provides tips for communicating with a worldwide association membership using email, forums, and web-based conferences. It recommends using email for newsletters and events, forums to allow two-way communication and feedback, and web conferences for interactive presentations and discussions. The document stresses making use of available technology to keep geographically dispersed members connected and utilizing their diverse perspectives.
Association Events: Don’t Make These Top 5 Tech Mistakes (StarChapter)
This document outlines 5 common technology mistakes to avoid when planning association events. The mistakes include: 1) using mail/fax for registration instead of online registration software, 2) having poor quality Wi-Fi that frustrates attendees, 3) equipment failures or presenters who are not familiar with the equipment, 4) creating hashtags for social media promotion at the last minute, and 5) not providing presentation slides to attendees after the event. The document provides tips and reasons to avoid each mistake to ensure successful use of technology and a positive experience for attendees.
Document ranking using QPRP with concept of multi-dimensional subspace (Prakash Dubey)
This presentation discusses a project titled "Document Ranking Using QPRP with Concept of Multi-Dimensional Subspace". It was presented by Prakash Kumar Dubey and guided by Mr. Sourish Dhar and Mr. Bhagaban Swain of the Department of IT. The presentation provides an overview of the project, including an introduction to information retrieval, classical IR models such as Boolean, vector space, and probabilistic models. It then discusses quantum probability and how it can be applied to document ranking. The presentation outlines the proposed solution, data collection and implementation, and concludes with future work.
Text clustering involves grouping text documents into clusters such that documents within a cluster are similar to each other and dissimilar to documents in other clusters. Common text clustering methods include bisecting k-means clustering, which recursively partitions clusters, and agglomerative hierarchical clustering, which iteratively merges clusters. Text clustering is used to automatically organize large document collections and improve search by returning related groups of documents.
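A minimal sketch of bisecting k-means on TF-IDF vectors, using scikit-learn's KMeans for the 2-way splits; splitting the largest cluster is a simplifying assumption (implementations often split by criteria such as highest SSE instead):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def bisecting_kmeans(X, k, seed=0):
    """Start with one cluster; repeatedly split the largest cluster with
    2-means until k clusters remain. Returns lists of row indices."""
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, random_state=seed, n_init=10).fit_predict(X[members])
        clusters += [members[labels == 0], members[labels == 1]]
    return clusters

docs = ["grouping similar documents", "documents in clusters", "unrelated text"]
X = TfidfVectorizer().fit_transform(docs).toarray()
print(bisecting_kmeans(X, 2))
```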
The document discusses various techniques for information retrieval and language modeling approaches to IR, including:
- Clustering documents into similar groups to aid in retrieval
- Using term frequency-inverse document frequency (TF-IDF) to measure word importance in documents
- Language models that represent documents and queries as probability distributions over words
- Smoothing language models to address data sparsity issues (see the sketch after this list)
- Cluster-based scoring methods that incorporate information from query-relevant document clusters
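A minimal sketch of the language-modeling score with smoothing, here Jelinek-Mercer interpolation with the collection model (one of several smoothing schemes such slides typically cover); without the collection term, any unseen query word would zero out the document's score:

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score a document by the smoothed probability its language model
    assigns to the query: P(w|d) mixed with the collection model P(w|C)."""
    d, c = Counter(doc), Counter(collection)
    score = 1.0
    for w in query:
        p_d = d[w] / len(doc)
        p_c = c[w] / len(collection)
        score *= lam * p_d + (1 - lam) * p_c
    return score

doc = "cluster based retrieval methods".split()
collection = "cluster based retrieval methods language models for retrieval".split()
# "models" never occurs in doc, yet the smoothed score stays nonzero:
print(query_likelihood("retrieval models".split(), doc, collection))
```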
This document provides an introduction and overview of document clustering techniques in information retrieval. It discusses motivations for clustering documents, such as improving search recall and organizing search results. It covers common clustering algorithms like K-means and hierarchical clustering, how they work, and considerations like choosing the number of clusters. The document uses examples and diagrams to illustrate clustering concepts and algorithms.
This document provides an introduction and overview of document clustering techniques in information retrieval. It discusses motivations for clustering documents, different document representations, evaluation criteria, and clustering algorithms including partitional algorithms like K-means and hierarchical algorithms. It provides examples and discusses issues like determining the optimal number of clusters to generate. The overall summary is that document clustering groups similar documents together to help with tasks like document navigation, improving search recall, and organizing search results.
The document proposes a novel approach for document and feature reduction in text categorization using prototypes and rough sets. It introduces a prototype-based algorithm to reduce documents while preserving classification accuracy. A rough set-based method is also presented to select a subset of relevant features. The methods are evaluated on benchmark datasets and are shown to improve both classification performance and computational efficiency compared to baseline methods.
Very useful for cluster analysis; supportive for engineering students as well as IT students. It also provides examples for every topic, which help with numerical problems. Good reading material.
This document provides an introduction to document clustering and clustering algorithms. It discusses how clustering can be used in information retrieval applications like organizing search results and improving search recall. It also covers different types of clustering algorithms like partitioning algorithms (such as K-means) and hierarchical algorithms. Key steps of the K-means and hierarchical agglomerative clustering algorithms are described.
The document discusses probabilistic retrieval models in information retrieval. It provides an overview of older models like Boolean retrieval and vector space models. The main focus is on probabilistic models like BM25 and language models. It explains key concepts in probabilistic IR like the probability ranking principle, using Bayes' rule to estimate the probability that a document is relevant given features of the document, and estimating probabilities based on the frequencies of terms in relevant documents. The goal is to rank documents based on the probability of relevance to the query.
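As a concrete instance of this family, the BM25 scoring function (a standard formulation; k_1 and b are tuning constants, |D| the document length, avgdl the collection's average document length):

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```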
IWSM2014 - An analogy-based approach to estimation of software development ef... (Nesma)
The document discusses fuzzy analogy, a technique for software effort estimation that can handle categorical data. It introduces fuzzy analogy and fuzzy k-modes clustering. Fuzzy k-modes is used to cluster similar software projects from a repository based on categorical attributes into homogeneous groups. Fuzzy analogy then assesses the similarity between projects based on their membership to clusters and estimates the effort of a new project as a weighted average of similar past projects' efforts. The document evaluates fuzzy analogy on 194 projects from the ISBSG repository selected based on data quality and attributes criteria.
Clustering is the process of grouping similar objects together. Hierarchical agglomerative clustering builds a hierarchy by iteratively merging the closest pairs of clusters. It starts with each document in its own cluster and successively merges the closest pairs of clusters until all documents are in one cluster, forming a dendrogram. Different linkage methods, such as single, complete, and average linkage, define how the distance between clusters is calculated during merging. Hierarchical clustering provides a multilevel clustering structure but has computational complexity of O(n³) in general.
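A minimal sketch of the procedure using SciPy, with toy 2-D points standing in for document vectors; the method argument selects single, complete, or average linkage as described:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]], dtype=float)

Z = linkage(X, method="average")                  # also: "single", "complete"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)

# dendrogram(Z) would draw the merge tree described above (needs matplotlib).
```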
(1) The document describes probabilistic methods for structured document classification that were submitted to the INEX'07 Document Mining track.
(2) It evaluates five runs using naive Bayes and OR gate Bayesian network classifiers with different document representations and feature selection techniques.
(3) The best run used an OR gate classifier with a better weight approximation on only text, achieving a microaverage of 0.78998 and macroaverage of 0.76054.
This document outlines clustering algorithms for large datasets. It discusses k-means clustering and extensions like k-means++ that improve initialization. It also covers spectral relaxation methods that reformulate k-means as a trace maximization problem to address local minima. Additionally, it proposes landmark-based clustering algorithms for biological sequences that select landmarks in one pass and assign sequences to the nearest landmark using hashing to search for neighbors. The document provides analysis of the time and space complexity of these algorithms as well as assumptions about separability and cluster size.
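The k-means++ seeding step mentioned above is short enough to sketch directly: the first center is chosen uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """k-means++ seeding; Lloyd's iterations then run as usual from these."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```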
This document provides an overview of text classification and the Naive Bayes algorithm for text classification. It begins by defining text classification and giving examples like spam filtering and document classification. It then explains supervised classification and the goal of learning a classifier from labeled training data. The document spends several slides explaining the Naive Bayes algorithm for text classification, including the Naive Bayes assumption of conditional independence between features. It discusses parameter estimation and smoothing techniques to avoid overfitting. Finally, it compares the multivariate Bernoulli and multinomial Naive Bayes models for text classification.
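A minimal multinomial Naive Bayes sketch with scikit-learn; the tiny corpus is invented for illustration, and alpha=1.0 is the Laplace (add-one) smoothing that corresponds to the smoothing discussion above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["cheap pills buy now", "meeting agenda attached",
         "win money now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(train)                      # word-count features
clf = MultinomialNB(alpha=1.0).fit(X, labels)     # add-one smoothing

print(clf.predict(vec.transform(["buy cheap now"])))  # -> ['spam']
```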
The document discusses K-means clustering, an unsupervised machine learning algorithm that partitions observations into k clusters where each observation belongs to the cluster with the nearest mean. It describes how K-means aims to minimize intra-cluster similarity while maximizing inter-cluster similarity. The algorithm works by first selecting k random cluster centroids, then iteratively reassigning observations to the closest centroid and recalculating the centroids until convergence is reached. It also addresses computational complexity, extensions, tools for implementing K-means, and examples of applications like image compression, recommendation systems, and yield management.
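A minimal NumPy sketch of the iteration just described (random seeding for brevity; see the k-means++ sketch earlier for a better initialization):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    """Lloyd's algorithm: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```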
The document discusses various model-based clustering techniques for handling high-dimensional data, including expectation-maximization, conceptual clustering using COBWEB, self-organizing maps, subspace clustering with CLIQUE and PROCLUS, and frequent pattern-based clustering. It provides details on the methodology and assumptions of each technique.
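As one concrete example of the model-based family, expectation-maximization for a Gaussian mixture, sketched with scikit-learn on invented 2-D data standing in for the high-dimensional case:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)              # hard assignments from soft posteriors
posteriors = gmm.predict_proba(X)    # the E-step responsibilities
```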
This document provides an overview of probabilistic approaches to information retrieval. It discusses why probabilities are useful for IR given the inherent uncertainty. It covers the Probability Ranking Principle, which aims to rank documents by estimated probability of relevance. Other probabilistic techniques discussed include probabilistic indexing, probabilistic inference using logic representations, and using Bayesian networks for IR. The document notes open issues with some of these approaches and concludes by surveying existing survey papers on probabilistic IR.
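For reference, the Probability Ranking Principle says to return documents in decreasing order of the estimated probability of relevance, with Bayes' rule turning that into quantities one can estimate from document features:

```latex
P(R=1 \mid d, q) = \frac{P(d \mid R=1, q)\,P(R=1 \mid q)}{P(d \mid q)}
% In practice one ranks by the rank-equivalent odds
% P(d \mid R=1, q) \,/\, P(d \mid R=0, q), estimating the class-conditional
% terms from term frequencies in known relevant and non-relevant documents.
```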
Similar to The Optimum Clustering Framework: Implementing the Cluster Hypothesis (20)
How to Teach Robots to Test Web Interfaces. Artem Eroshenko, Ilya Katsev, Ya... (yaevents)
Artem Eroshenko, Yandex
Graduated from the Faculty of Mathematics and Mechanics of Saint Petersburg State University; currently in the third year of a PhD program in control theory. Since 2008 he has worked at Yandex on test automation for search results and search-related services. Since 2011 he has coordinated the test tools development group.
Ilya Katsev, Yandex
Graduated from the Faculty of Mathematics and Mechanics of Saint Petersburg State University and defended a PhD thesis in game theory at VU University Amsterdam (Netherlands). At Yandex he works on test automation (simulating user actions and analyzing the results).
Presentation topic
How to Teach Robots to Test Web Interfaces.
Key points
The talk presents a tool that checks web interfaces for errors on its own. Its key quality is the ability to automatically discover related elements on a page and build models that can then be tested automatically. We will not only propose ideas for using and extending this system, but also show its prototype.
Building Compound Blocks in the bemhtml Template Engine. Sergey Berezhnoy, Ya... (yaevents)
Sergey Berezhnoy, Yandex
Has worked as a web developer at Yandex since 2005. In that time he has taken part in developing a whole range of services, such as Blog Search, Ya.ru, Yandex.Mail, Search, Images, and Video. Besides public projects, he actively works on various internal tools for the full site development cycle. More than anything in the world he loves his wife and programming.
Presentation topic
Building Compound Blocks in the bemhtml Template Engine.
Key points
The domain-specific template engine bemhtml lets you create block templates following the BEM methodology. Compilation produces fast plain JavaScript templates that can run both on the server and on the client. This technology is used in the bem-bl block library and on several Yandex services. The workshop demonstrates one of bemhtml's advantages: building compound blocks. You will learn about the idea and syntax of the template engine, get ready-made recipes for typical tasks, and see an analysis of bemhtml's capabilities.
i-bem.js: JavaScript in BEM Terms. Elena Glukhova, Varvara Stepanova, Yandex (yaevents)
Elena Glukhova, Yandex
Front-end developer of web interfaces. Has worked at Yandex since 2008.
Varvara Stepanova, Yandex
Graduated from Petrozavodsk State University. Has worked at Yandex as an interface developer since 2008, building Yandex.Answers and Yandex.Fotki. For the last year and a half Elena Glukhova and Varvara Stepanova have been working together on an internal interface framework that helps build Yandex services uniformly, and recently also on a similar open source interface framework.
Presentation topic
i-bem.js: JavaScript in BEM Terms.
Key points
When building sites with the BEM methodology, we use a single domain model across all technologies: CSS, templates, and JavaScript. To make this possible, the bem-bl block library implements the core of a client-side JS framework that lets you work with a page in BEM terms, at a level of abstraction above the DOM representation. This workshop shows the key points of using this approach to write client-side JS. We build a compound block that uses the JS functionality of the small blocks it contains. As a result, everything works, with no copy-paste.
A House from Ready-Made Bricks. A Block Library, Tuning, Tools. Elena Glukhov... (yaevents)
Elena Glukhova, Yandex
Front-end developer of web interfaces. Has worked at Yandex since 2008.
Varvara Stepanova, Yandex
Graduated from Petrozavodsk State University. Has worked at Yandex as an interface developer since 2008, building Yandex.Answers and Yandex.Fotki. For the last year and a half Elena Glukhova and Varvara Stepanova have been working together on an internal interface framework that helps build Yandex services uniformly, and recently also on a similar open source interface framework.
Presentation topic
A House from Ready-Made Bricks. A Block Library, Tuning, Tools.
Key points
All sites resemble one another a little. If you do web development for many years, you accumulate practices and standard solutions to common problems. The result of our accumulation is the open source block library bem-bl, which we develop on GitHub. The library follows the BEM methodology and lets you build web pages from blocks that already have template, CSS, and JS implementations. The workshop will demonstrate how to use ready-made blocks from this library and how to modify them for the needs of your own site. The bem-tools console utilities are used to work with the library's files.
Models in Professional Software Engineering and Testing. Alexander Petren... (yaevents)
Alexander Petrenko, ISP RAS
Professor, Doctor of Physical and Mathematical Sciences, head of the software engineering technology department at the Institute for System Programming of the Russian Academy of Sciences (ISP RAS), and professor at the CMC faculty of Moscow State University. His main work is in requirements formalization and test generation from formalized requirements and formal models (model-based testing, MBT), with applications to testing operating systems and distributed systems, compiler testing, microprocessor design verification, and formalizing standards for operating system APIs and telecommunication protocols. Co-chair of the organizing committees of the International MBT workshop (http://www.mbrworkshop.org/), the Spring Young Researcher Colloquium on Software Engineering, SYRCoSE (http://syrocose.ispras.ru), and the city seminar on program development and analysis technologies TRAP/SDAT (http://sdat.ispras.ru/).
Presentation topic
Models in Professional Software Engineering and Testing.
Key points
Model Based Software Engineering (MBSE) is an extension of the model-driven approach to software development. Unlike, for example, MDA (Model Driven Architecture), MBSE pays substantial attention not only to design and coding proper, but also to the other phases of the life cycle: requirements analysis, verification and validation, and requirements management across all phases. Model Based Testing (MBT) arose chronologically much earlier than MBSE and MDA, but its place in software development was fully revealed only as MBSE developed, so MBT and MBSE should be considered in close connection. The talk will cover the MBSE-MDA-MBT concepts, the main sources and kinds of models used in these approaches, methods for generating tests from models, and well-known tools for them.
Administering Small Services, or One for All and 100 Against One. Roman ... (yaevents)
Roman Andriadi, Yandex
Has worked in Yandex's operations department since 2005. Since 2010 he has led the administration group for communication, content, and internal services.
Presentation topic
Administering Small Services, or One for All and 100 Against One.
Key points
Administration of the communication services began in 2004 with maintaining a dozen servers and the dozen services hosted on them. Over time there were more and more services, the number of tasks around them grew, and the dozen servers turned into a fleet of hundreds of machines split into many heterogeneous clusters. The talk will describe how administration practices evolved as the cluster grew, which tools were used along the way, how we wrote our own management tool, and how it has learned to help us over the years.
Stories About Site Development. Sergey Berezhnoy, Yandex (yaevents)
Sergey Berezhnoy, Yandex
Has worked as a web developer at Yandex since 2005. In that time he has taken part in developing a whole range of services, such as Blog Search, Ya.ru, Yandex.Mail, Search, Images, and Video. Besides public projects, he actively works on various internal tools for the full site development cycle. More than anything in the world he loves his wife and programming.
Presentation topic
Stories About Site Development.
Key points
We will talk about the site development problems that have come up at Yandex at different times and how we solved them. The talk is intended as a dialogue with developers who face similar problems. The result will be a small collection of technological stories to reflect on.
Developing Android Applications in C++. Yuri Bereza, Shturmann (yaevents)
Yuri Bereza, Shturmann
Graduated from the instrument engineering faculty of the Moscow State Academy of Instrument Engineering and Informatics. In 2004 he joined the mobile development department at MacCentre and has developed for a huge number of mobile platforms: Windows Mobile, Symbian, Android, embedded Linux, and iOS. He currently works as a group lead at Content Master, where he develops the Shturmann car navigation system.
Presentation topic
Developing Android Applications in C++.
Key points
The Android platform grows more popular every year. Although the main language for Android application development is Java, programmers often have to use C or C++ to write cross-platform applications or to use third-party libraries. Unfortunately, C++ development for the Android platform is rather sparsely documented, and you often have to spend a lot of time searching for the information you need. The talk will answer the main questions across the whole development cycle: how to write C++ code that runs on Android, how to debug it and find errors when applications crash, whether it is possible to profile the code, and where to look for further information on these questions.
Cross-Platform Development for Mobile Devices. Dmitry Zhestilevsky... (yaevents)
Dmitry Zhestilevsky, Yandex
Graduated from the Faculty of Experimental and Theoretical Physics of the Moscow Engineering Physics Institute in 2011. Since 2006 he has developed applications (games, business applications) for mobile devices on the J2ME, BREW, Windows Mobile, Android, and iOS platforms. At Yandex since 2010, he works on the architecture of mobile mapping services. His interests include cross-platform development for mobile devices and 3D visualization.
Presentation topic
Cross-Platform Development for Mobile Devices.
Key points
Application development for embedded devices is heavily fragmented by the abundance of OSes (Android, iOS, WM, WP7, Symbian, Bada). Developing for each platform independently leads to proportional growth in the number of participants in the development process and in the size of the maintained code base. Introducing shared code that runs on all platforms, via a Platform Abstraction Layer with a unified interface, can cut these costs, while platform-specific parts, such as the UI, can still be used to give the application a native look and feel. The talk examines the process of introducing shared components into Yandex's mobile applications, using Street Panoramas as an example, along with the difficulties we ran into during development and the ways we resolved them.
The Most Sophisticated Techniques Used by Bootkits and Polymorphic Viruses. Vyacheslav Z... (yaevents)
Vyacheslav Zakorzhevsky, Kaspersky Lab
Joined Kaspersky Lab in mid-2007 as a virus analyst. In late 2008 he became a senior virus analyst in the heuristic detection group. His interests include the study of polymorphic viruses and heavily mutating malware. He also follows current trends in obfuscation, anti-emulation, and other methods used by malicious software.
Presentation topic
The Most Sophisticated Techniques Used by Bootkits and Polymorphic Viruses.
Key points
There is a common belief that modern malware is fairly simple and written by unskilled people. This talk aims to dispel that myth. The presentation will describe three malware samples that use non-trivial, sophisticated methods in the course of their operation. In particular, it will examine how modern bootkits, which are gaining more and more momentum, work. With the other two examples we will illustrate the ingenuity of virus writers who try to make life as hard as possible for researchers and antivirus companies: in one case they used their own virtual machine combined with an EPO infection technique, and in the other they mapped in null virtual addresses to store their data there.
Vulnerability Scanning with a Yandex Flavor. Taras Ivashchenko, Yandex (yaevents)
Taras Ivashchenko, Yandex
Information security administrator at Yandex. Information security specialist, free software advocate, author of Termite and xCobra, and a contributor to the W3AF project.
Presentation topic
Vulnerability Scanning with a Yandex Flavor.
Key points
The talk will describe how Yandex introduced vulnerability scanning of its services as one of the security controls within the SDLC (Secure Development Life Cycle). It covers scanning for vulnerabilities at the service testing stage as well as scanning services already in production. We will look at the problems we ran into and explain why we chose open source software (the w3af vulnerability scanner), adapted to our needs, as the core mechanism.
Hadoop Scalability at Facebook. Dmitriy Molkov, Facebook (yaevents)
Dmitriy Molkov, Facebook
Bachelor of Applied Mathematics from Taras Shevchenko National University of Kyiv (2007). Master of Computer Science from Stony Brook University (2009). Hadoop HDFS committer since 2011. Member of the Hadoop team at Facebook since 2009.
Presentation topic
Hadoop Scalability at Facebook.
Key points
Hadoop and Hive are an excellent toolkit for storing and analyzing petabytes of information at Facebook. Working with data at that scale, the Hadoop team at Facebook runs into Hadoop scalability and efficiency problems every day. The talk will go into some of the details of optimizations in different parts of Facebook's Hadoop infrastructure that make it possible to provide a high-quality service: for example, optimizing storage cost in multi-petabyte HDFS clusters, increasing system throughput, and reducing system downtime through High Availability work on HDFS.
Controlling the Beasts: Tools for Managing and Monitoring Distributed Syst... (yaevents)
Alexander Kozlov, Cloudera Inc.
Alexander Kozlov, a senior architect at Cloudera Inc., works with large companies, many of them in the Fortune 500, on projects building systems for analyzing large amounts of data. He completed graduate studies at the physics faculty of Moscow State University and then earned a Ph.D. at Stanford. After finishing his studies and before Cloudera, he worked on statistical data analysis and the related computer technologies at SGI, Hewlett-Packard, and the startup Turn.
Presentation topic
Controlling the Beasts: Cloudera's Tools for Managing and Monitoring Distributed Systems.
Key points
Maintaining distributed systems consisting of thousands of computers is a hard problem. Cloudera, which specializes in building distributed technologies, has developed a set of tools for centralized management of distributed Hadoop/HBase clusters. Hadoop and HBase are Apache Software Foundation projects, and their use for analyzing semi-structured data is accelerating worldwide. This talk will cover SCM, a system for configuring, tuning, and managing Hadoop/HBase, and Activity Monitor, a system for monitoring a range of OS and Hadoop/HBase metrics, as well as how Cloudera's approach differs from existing monitoring solutions (Tivoli, xCat, Ganglia, Nagios, etc.).
Unit Testing and Google Mock. Vlad Losev, Google (yaevents)
Vladimir Losev, Google
Graduated from the Faculty of Mathematics and Mechanics of Saint Petersburg State University in 1995. He has worked at Motorola, Fair Isaac, and Yahoo. Since 2008 he has worked at Google, in a group dealing with engineering productivity.
Presentation topic
Unit Testing and Google Mock.
Key points
In unit tests, each element of a program is tested separately, in isolation from the others. Such tests run very fast, so they can be launched at any time, which makes it possible to catch defects at the earliest stages of development. However, testing an object in isolation from the others requires imitating the behavior of the objects connected to it, which in C++ is a rather tedious exercise. Google Mock, a library developed at Google for creating and using mock objects, makes it possible to greatly simplify this process and speed up test writing. The talk will cover the library's principles and capabilities, examples of its use, and its internal design.
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abrahams, BoostPro Computing
Dave Abrahams, BoostPro Computing
He is a founding member of Boost.org and an active participant in the ISO C++ standards committee. His broad range of experience in the computer industry includes shrink-wrap software development, embedded systems design and natural language processing. He has authored eight Boost libraries and has made contributions to numerous others. Dave made his mark on C++ standardization by developing a conceptual framework for understanding exception-safety and applying it to the C++ standard library. He created the first exception-safe standard library implementation and, with Greg Colvin, drafted the proposals that eventually became the standard library’s exception safety guarantees.
Presentation topic:
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abrahams, BoostPro Computing.
Key points:
The ISO C++ standardization committee has just unanimously approved its final draft international standard, and it's chock full of new features. Though a few of the features have been available for years, some are brand new, and nobody really knows what it's like to program in this new C++ language. As with C++03, Boost.org is expected to take a leading role in exploiting C++11. In this talk, I'll give an overview of the most important new developments.
Why an Ordinary Programmer Should Know Languages Almost Nobody Writes In. Alexey Voinov, Yandex
Alexey Voinov, Yandex
Graduated from Bauman Moscow State Technical University in 1998. Has devoted part of his life to free software. Known for his love of languages, algorithmic and human alike, natural as well as constructed. Has worked at Yandex since 2009, developing Yandex.Mail.
Presentation topic:
Why an ordinary programmer should know languages almost nobody writes in.
Key points:
There is a category of programming languages that most programmers consider strange at best: languages such as Haskell, *ML, Lisp, and Q. These 'strange' languages do not take root in industrial software development because they do not allow writing standard 'industrial' code. They can, however, be very good for inventing techniques that improve industrial code, and many of those techniques later become industry standards. Knowing 'strange' languages is particularly useful when external circumstances make it impossible to improve industrial code radically, but it can still be improved in small steps.
In Search of Mathematics. Mikhail Denisenko, Nigma
Mikhail Denisenko, Nigma
Graduated from the Faculty of Computational Mathematics and Cybernetics of Moscow State University. Is finishing a dissertation on the mathematical aspects of information security. Did research on video sequence processing and computer security at Intel. Since 2009 he has been a senior developer of the mathematics service at Nigma.ru; since 2011, a system architect of the ITim.vn search engine.
Presentation topic:
In Search of Mathematics.
Key points:
Nigma-Mathematics is a service that lets users solve various mathematical problems (simplify expressions, solve equations and systems of equations, etc.) by typing them directly into the search box as plain text. The system recognizes more than a thousand physical and mathematical constants and units of measurement, which allows users to operate on quantities (including solving equations) and receive the answer in the requested units. Besides equations, the system handles all the tasks typical of search-engine calculators and currency converters. The talk describes the overall architecture of the service, the basic and new algorithms of the symbolic computation engine (algorithms for solving equations and inequalities, for tracking the domain of admissible values, for analyzing functions, and so on), as well as speeding up the service, distributing the load on the system, recognizing whether a query is mathematical, currency conversion, and metric quantities.
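To give a flavor of the symbolic layer such a service needs, here is a toy sketch (my own illustration, not Nigma's engine) that parses a plain-text equation with SymPy and solves it:

```python
import sympy as sp

# Toy illustration: turn the text a user might type into a symbolic
# equation and solve it, the kind of task the service performs.
x = sp.symbols("x")
lhs, rhs = "2*x + 3", "7"
equation = sp.Eq(sp.sympify(lhs), sp.sympify(rhs))
print(sp.solve(equation, x))  # -> [2]
```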
Using classifiers to compute similarities between face images. Prof. Lior Wolf, Tel-Aviv University
Prof. Lior Wolf, Tel-Aviv University
He is a faculty member at the School of Computer Science at Tel-Aviv University. Previously, he was a post-doctoral associate in Prof. Poggio's lab at MIT. He graduated from the Hebrew University, Jerusalem, where he worked under the supervision of Prof. Shashua. He was awarded the 2008 Sackler Career Development Chair, the Colton Excellence Fellowship for new faculty (2006-2008), the Max Shlumiuk award for 2004, and the Rothschild fellowship for 2004. His joint work with Prof. Shashua in ECCV 2000 received the best paper award, and their work in ICCV 2001 received the Marr Prize honorable mention. He was also awarded the best paper award at the post-ICCV workshop on eHeritage in 2009. In addition, Lior has held several development, consulting and advisory positions in computer vision companies, including face.com and Superfish, and is a co-founder of FDNA.
Presentation topic:
Using classifiers to compute similarities between images of faces.
Key points:
The One-Shot-Similarity (OSS) is a framework for classifier-based similarity functions. It is based on the use of background samples and was shown to excel in tasks ranging from face recognition to document analysis. In this talk we will present the framework as well as the following results: (1) when using a version of LDA as the underlying classifier, this score is a Conditionally Positive Definite kernel and may be used within kernel-methods (e.g., SVM), (2) OSS can be efficiently computed, and (3) a metric learning technique that is geared toward improved OSS performance.
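To make the classifier-based similarity idea concrete, here is a minimal sketch (my own, not Prof. Wolf's exact formulation): with an LDA-like linear classifier, the score of a one-positive-sample-versus-background classifier has a closed form in terms of the background mean and covariance, and the OSS is the symmetrized average of the two such scores. The function names and the ridge term are illustrative assumptions.

```python
import numpy as np

def one_shot_similarity(x1, x2, background):
    """Symmetric One-Shot-Similarity score (illustrative sketch).

    Trains an LDA-like linear classifier separating {x1} from the
    background set, scores x2 with it, then swaps the roles and
    averages the two scores.
    """
    mu = background.mean(axis=0)
    # Within-class covariance estimated from the background set only
    # (a single positive sample contributes no scatter); a small ridge
    # term keeps the matrix invertible.
    cov = np.cov(background, rowvar=False) + 1e-6 * np.eye(background.shape[1])
    cov_inv = np.linalg.inv(cov)

    def score(pos, probe):
        w = cov_inv @ (pos - mu)     # LDA direction for {pos} vs background
        b = w @ (pos + mu) / 2.0     # threshold at the midpoint
        return w @ probe - b         # signed score of the probe

    return 0.5 * (score(x1, x2) + score(x2, x1))

# Toy usage: two face descriptors and a background sample matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 16))       # background samples
x1, x2 = rng.normal(size=16), rng.normal(size=16)
print(one_shot_similarity(x1, x2, A))
```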
The Optimum Clustering Framework: Implementing the Cluster Hypothesis
1. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis
Norbert Fuhr
University of Duisburg-Essen
March 30, 2011
2. Outline
1 Introduction
2 Cluster Metric
3 Optimum clustering
4 Towards Optimum Clustering
5 Experiments
6 Conclusion and Outlook
3. Introduction
4. Motivation
Ad-hoc retrieval:
heuristic models: define a retrieval function, then evaluate to test whether it yields good quality
Probability Ranking Principle (PRP): a theoretic foundation for optimum retrieval; numerous probabilistic models are based on the PRP
Document clustering:
classic approach: define a similarity function and a fusion principle, then evaluate to test whether they yield good quality
Optimum Clustering Principle?
7. Cluster Hypothesis
Original formulation: "closely associated documents tend to be relevant to the same requests" (van Rijsbergen 1979)
Idea of optimum clustering: cluster documents in such a way that, for any request, the relevant documents occur together in one cluster
Redefined document similarity: documents are similar if they are relevant to the same queries
10. The Optimum Clustering Framework
[Figure: schematic overview of the Optimum Clustering Framework]
14. Cluster Metric
15. Defining a Metric Based on the Cluster Hypothesis
General idea: evaluate a clustering with respect to a set of queries.
For each query and each cluster, regard the pairs of documents co-occurring in the cluster:
relevant-relevant: good
relevant-irrelevant: bad
irrelevant-irrelevant: don't care
16. Pairwise Precision
Q: set of queries
D: document collection
R: relevance judgments, R ⊂ Q × D
C: clustering, C = \{C_1, \ldots, C_n\} such that \bigcup_{i=1}^{n} C_i = D and \forall i, j: i \neq j \rightarrow C_i \cap C_j = \emptyset
c_i = |C_i| (size of cluster C_i)
r_{ik} = |\{d_m \in C_i \mid (q_k, d_m) \in R\}| (number of relevant documents in C_i w.r.t. q_k)
Pairwise precision (a weighted average over all clusters):
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}
21. Pairwise Precision: Example
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}
Query set: a disjoint classification with two classes a and b; three clusters: (aab|bb|aa)
P_p = \frac{1}{7}\bigl(3(\tfrac{1}{3} + 0) + 2(0 + 1) + 2(1 + 0)\bigr) = \frac{5}{7}
A perfect clustering for a disjoint classification would yield P_p = 1; for arbitrary query sets, values > 1 are possible.
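To make the computation concrete, here is a minimal sketch (mine, not from the talk) that evaluates pairwise precision for the (aab|bb|aa) example; encoding each query as the set of indices of its relevant documents is my own choice:

```python
def pairwise_precision(clusters, queries):
    """Pairwise precision P_p: for each cluster with c > 1, sum over queries
    the fraction of ordered document pairs that are both relevant, weight by
    the cluster size, and normalize by the collection size."""
    n_docs = sum(len(c) for c in clusters)
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        if c <= 1:
            continue  # singleton clusters contribute nothing
        for q in queries:
            r = sum(1 for d in cluster if d in q)  # relevant docs in cluster
            total += c * (r * (r - 1)) / (c * (c - 1))
    return total / n_docs

# (aab|bb|aa): documents 0..6; query "a" is relevant to {0,1,5,6},
# query "b" to {2,3,4}.
clusters = [[0, 1, 2], [3, 4], [5, 6]]
queries = [{0, 1, 5, 6}, {2, 3, 4}]
print(pairwise_precision(clusters, queries))  # 0.714... = 5/7
```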
23. Pairwise Recall
r_{ik} = |\{d_m \in C_i \mid (q_k, d_m) \in R\}| (number of relevant documents in C_i w.r.t. q_k)
g_k = |\{d \in D \mid (q_k, d) \in R\}| (number of relevant documents for q_k)
Pairwise recall (micro recall):
R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{q_k \in Q,\, g_k > 1} g_k(g_k - 1)}
Example (aab|bb|aa): 2 a pairs (out of 6) and 1 b pair (out of 3) are clustered together, so R_p = (2 + 1)/(6 + 3) = 1/3.
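A matching sketch for pairwise recall, reusing the same encoding (again my own illustrative code):

```python
def pairwise_recall(clusters, queries):
    """Pairwise recall R_p: relevant document pairs kept together in some
    cluster, divided by all relevant pairs in the collection (micro recall)."""
    docs = [d for cluster in clusters for d in cluster]
    numerator = sum(r * (r - 1)
                    for q in queries
                    for cluster in clusters
                    for r in [sum(1 for d in cluster if d in q)])
    denominator = sum(g * (g - 1)
                      for q in queries
                      for g in [sum(1 for d in docs if d in q)]
                      if g > 1)
    return numerator / denominator

clusters = [[0, 1, 2], [3, 4], [5, 6]]
queries = [{0, 1, 5, 6}, {2, 3, 4}]
print(pairwise_recall(clusters, queries))  # 6/18 = 1/3, matching the slide
```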
25. Perfect Clustering
C is a perfect clustering iff there exists no clustering C' such that
P_p(D, Q, R, C) < P_p(D, Q, R, C') \wedge R_p(D, Q, R, C) < R_p(D, Q, R, C')
This is a strong Pareto optimum; more than one perfect clustering is possible.
Example:
P_p(\{d_1, d_2, d_3\}, \{d_4, d_5\}) = P_p(\{d_1, d_2\}, \{d_3, d_4, d_5\}) = 1, with R_p = 2/3
P_p(\{d_1, d_2, d_3, d_4, d_5\}) = 0.6, with R_p = 1
31. Do perfect clusterings form a hierarchy?
[Figure: the clusterings below plotted as points in the (R_p, P_p) plane]
C = \{\{d_1, d_2, d_3, d_4\}\}
C' = \{\{d_1, d_2\}, \{d_3, d_4\}\}
C'' = \{\{d_1, d_2, d_3\}, \{d_4\}\}
C'' is neither a refinement nor a coarsening of C', so these perfect clusterings do not nest into a hierarchy.
32. Optimum clustering
33. Optimum Clustering
Usually the clustering process has no knowledge of relevance judgments, so we switch from external to internal cluster measures:
replace relevance judgments by estimates of the probability of relevance;
this requires a probabilistic retrieval method yielding P(rel|q, d);
then compute the expected cluster quality.
38. Expected Cluster Quality
Pairwise precision:
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}
Expected precision:
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i(c_i - 1)} \sum_{q_k \in Q} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} P(rel|q_k, d_l)\, P(rel|q_k, d_m)
40. Expected Precision
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i(c_i - 1)} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \sum_{q_k \in Q} P(rel|q_k, d_l)\, P(rel|q_k, d_m)
Here \sum_{q_k \in Q} P(rel|q_k, d_l)\, P(rel|q_k, d_m) gives the expected number of queries for which both d_l and d_m are relevant.
Transform a document into a vector of relevance probabilities:
\tau^T(d_m) = (P(rel|q_1, d_m), P(rel|q_2, d_m), \ldots, P(rel|q_{|Q|}, d_m))
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{1}{c_i - 1} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
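In matrix form this is a sum of within-cluster dot products; a small sketch (my own encoding, with T holding one τ vector per row):

```python
import numpy as np

def expected_precision(T, clusters):
    """Expected precision π.

    T        : (n_docs, n_queries) array; row m is τ(d_m), the relevance
               probabilities P(rel|q_k, d_m) for document d_m.
    clusters : list of lists of document indices (a disjoint clustering).
    """
    n_docs = T.shape[0]
    total = 0.0
    for idx in clusters:
        c = len(idx)
        if c <= 1:
            continue
        S = T[idx] @ T[idx].T                        # τ(d_l)·τ(d_m) for all pairs
        total += (S.sum() - np.trace(S)) / (c - 1)   # drop the d_l = d_m terms
    return total / n_docs
```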
42. Expected Recall
R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{q_k \in Q,\, g_k > 1} g_k(g_k - 1)}
Direct estimation would require estimating the denominator, yielding biased estimates. But the denominator is constant for a given query set, so it can be ignored; compute an estimate for the numerator only:
\rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
(The scalar product \tau^T(d_l) \cdot \tau(d_m) gives the expected number of queries for which both d_l and d_m are relevant.)
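The recall surrogate ρ simply drops the normalization; continuing the sketch above:

```python
import numpy as np

def expected_recall_numerator(T, clusters):
    """Recall surrogate ρ: the unnormalized sum, over all clusters, of the
    within-cluster dot products τ(d_l)·τ(d_m) with d_l ≠ d_m."""
    total = 0.0
    for idx in clusters:
        S = T[idx] @ T[idx].T
        total += S.sum() - np.trace(S)   # exclude the d_l = d_m diagonal
    return total
```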
46. Optimum Clustering
C is an optimum clustering iff there exists no clustering C' such that
\pi(D, Q, C) < \pi(D, Q, C') \wedge \rho(D, Q, C) < \rho(D, Q, C')
These are Pareto optima. The set of perfect (and optimum) clusterings does not even form a cluster hierarchy, so no hierarchic clustering method will find all optima!
50. Towards Optimum Clustering
51. Towards Optimum Clustering
Developing an (optimum) clustering method requires choosing:
1 a set of queries,
2 a probabilistic retrieval method,
3 a document similarity metric, and
4 a fusion principle.
52. A Simple Application
1 Set of queries: all possible one-term queries
2 Probabilistic retrieval method: tf*idf
3 Document similarity metric: \tau^T(d_l) \cdot \tau(d_m)
4 Fusion principle: group-average clustering, i.e. merge by the criterion
\pi(D, Q, C) = \frac{1}{c(c - 1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
This instantiation yields a standard clustering method.
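A minimal end-to-end sketch of this instantiation (my own code: scikit-learn's TfidfVectorizer stands in for the tf*idf retrieval scores, and SciPy's average-linkage routine for the group-average fusion principle):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cluster hypothesis retrieval", "probabilistic retrieval model",
        "document clustering evaluation", "graph reachability queries"]

# τ(d): one-term queries as dimensions, tf*idf scores standing in for the
# relevance probabilities P(rel|q, d).
T = TfidfVectorizer().fit_transform(docs).toarray()

# Pairwise dot-product similarities, turned into dissimilarities for linkage.
S = T @ T.T
D = S.max() - S
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D, checks=False), method="average")  # group average
print(fcluster(Z, t=2, criterion="maxclust"))               # e.g. two clusters
```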
59. Query Set
Real collections have too few queries, so use an artificial query set.
For collection clustering: the set of all possible one-term queries.
Probability distribution over the query set: uniform, or proportional to document frequency.
Document representation: the original terms, or transformations of the term space.
Semantic dimensions: focus on certain aspects only (e.g., for images: color, contour, texture).
For result clustering: the set of all query expansions.
60. Probabilistic Retrieval Method
Model: in principle, any retrieval model is suitable.
Transformation to probabilities: direct estimation, or transforming the retrieval score into a probability of relevance.
61. Document Similarity Metric
Fixed as \tau^T(d_l) \cdot \tau(d_m).
62. Fusion Principles
The OCF only gives guidelines for good fusion principles: consider the metrics \pi and/or \rho during fusion.
63. Group Average Clustering
\sigma(C) = \frac{1}{c(c - 1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
This uses expected precision as the fusion criterion!
The method starts with singleton clusters (minimum recall) and builds larger clusters for increasing recall; each step forms the cluster with the highest precision (which may be lower than that of the current clusters).
67. Fusion Principles: Min Cut
Starts with a single cluster (maximum recall) and searches for the cut with the minimum loss in recall:
\rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
Consider expected precision for breaking ties!
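One way to realize a single min-cut split (a sketch under my own assumptions, using NetworkX's Stoer-Wagner global minimum cut on the τ-dot-product similarity graph):

```python
import networkx as nx
import numpy as np

def min_cut_split(T, doc_ids):
    """Split one cluster in two, minimizing the loss in ρ.

    Edge weights are the pairwise similarities τ(d_l)·τ(d_m); removing the
    global minimum cut removes the least total similarity, i.e. loses the
    least expected recall.
    """
    S = T[doc_ids] @ T[doc_ids].T
    G = nx.Graph()
    n = len(doc_ids)
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(doc_ids[i], doc_ids[j], weight=float(S[i, j]))
    cut_value, (part_a, part_b) = nx.stoer_wagner(G)
    return list(part_a), list(part_b)
```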
70. Finding Optimum Clusterings
Min cut (assuming a cohesive similarity graph):
starts with the optimum clustering for maximum recall;
the min cut finds the split with the minimum loss in recall, with precision for tie-breaking;
this yields the optimum clustering for two clusters in O(n^3) time (vs. O(2^n) for the general case);
subsequent splits will not necessarily reach optima.
Group average:
in general, multiple fusion steps are needed to reach the first optimum;
a greedy strategy does not necessarily find this optimum!
79. Experiments
80. Experiments with a Query Set
ADI collection: 35 queries, 70 documents (each relevant to 2.4 queries on average)
Experiments:
Q35opt: using the actual relevance judgments in τ(d)
Q35: BM25 estimates for the 35 queries
1Tuni: one-term queries, uniform distribution
1Tdf: one-term queries, weighted according to document frequency
82. Using Keyphrases as Query Set
Compare clustering results based on different query sets:
1 'bag of words': single words as queries
2 keyphrases automatically extracted as head-noun phrases; a single query = all keyphrases of a document
Test collections: four collections assembled from the RCV1 (Reuters) news corpus:
# documents: 600 vs. 6000
# categories: 6 vs. 12
frequency distribution of classes: [U]niform vs. [R]andom
84. Using Keyphrases as Query Set: Results
[Figure: average precision and (external) F-measure for the different query sets]
85. Evaluation of the Expected F-Measure
Correlation between the expected F-measure (an internal measure) and the standard F-measure (comparison with a reference classification).
Test collections as before; for each setting, regard the quality of 40 different clustering methods (and find the optimum clustering among these 40).
86. Correlation Results
[Figure: Pearson correlation between the internal measures and the external F-measure]
87. Conclusion and Outlook
88. Summary
The Optimum Clustering Framework:
makes the Cluster Hypothesis a requirement,
forms a theoretical basis for developing better clustering methods, and
yields positive experimental evidence.
89. Further Research
Theoretical:
compatibility of existing clustering methods with the OCF
extension of the OCF to soft clustering
extension of the OCF to hierarchical clustering
Experimental:
variation of query sets
user experiments