10. Introduction Clustering Alignment
Hierarchical Co-clustering
Hierarchical co-clustering:
1. Co-cluster documents and words.
2. For each cluster: if it contains too many documents, build the
sub-matrix of its documents and words.
3. Repeat step 1 on each sub-matrix.
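The three steps above can be sketched as a short recursion. This is a minimal illustration, not the authors' implementation: the function name `hierarchical_cocluster`, the `cocluster` callback (any routine that returns one label per document and one per word) and the `max_docs` threshold are all assumptions introduced here.

```python
import numpy as np

def hierarchical_cocluster(A, cocluster, max_docs, docs=None, words=None):
    """Recursively co-cluster the document-by-word matrix A.

    cocluster(sub) is a hypothetical callback returning (doc_labels, word_labels)
    for a sub-matrix. Returns a list of cluster nodes, each a dict with the
    original document indices, word indices and a list of child clusters.
    """
    A = np.asarray(A)
    docs = np.arange(A.shape[0]) if docs is None else np.asarray(docs)
    words = np.arange(A.shape[1]) if words is None else np.asarray(words)

    # Step 1: co-cluster documents and words of the current (sub-)matrix.
    sub = A[np.ix_(docs, words)]
    doc_labels, word_labels = (np.asarray(x) for x in cocluster(sub))

    clusters = []
    labels = np.unique(doc_labels)
    for c in labels:
        child_docs = docs[doc_labels == c]
        child_words = words[word_labels == c]
        node = {"docs": child_docs.tolist(),
                "words": child_words.tolist(),
                "children": []}
        # Steps 2 and 3: a cluster with too many documents is co-clustered again.
        # Requiring at least two clusters at this level guarantees that the
        # child is strictly smaller than its parent, so the recursion ends.
        if len(child_docs) > max_docs and len(labels) > 1 and len(child_words) > 0:
            node["children"] = hierarchical_cocluster(
                A, cocluster, max_docs, child_docs, child_words)
        clusters.append(node)
    return clusters
```

Passing the co-clustering routine as a callback keeps the recursion independent of the particular method used at each level.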
Bipartite Spectral Graph Partitioning: algorithm
1. Given the m × n document-by-word matrix A, calculate the
diagonal helper matrices $D_1$ and $D_2$ such that:
$\forall\, 1 \le i \le m:\; D_1(i,i) = \sum_j A_{i,j}$
$\forall\, 1 \le j \le n:\; D_2(j,j) = \sum_i A_{i,j}$
2. Compute $A_n = D_1^{-1/2} A D_2^{-1/2}$
3. Take the SVD of $A_n$: $A_n = U \Lambda V^{*}$
4. Determine k, the number of clusters, by the eigengap:
$k = \arg\max_{m \ge i > 1} (\lambda_{i-1} - \lambda_i)/\lambda_{i-1}$, where
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$ are the singular values of $A_n$
Bipartite Spectral Graph Partitioning: algorithm (cont.)
5. From U and V, form $U_{[2,\dots,l+1]}$ and $V_{[2,\dots,l+1]}$
respectively, by taking columns 2 to l + 1,
where $l = \lceil \log_2 k \rceil$
6. Compute $Z = \begin{bmatrix} D_1^{-1/2} U_{[2,\dots,l+1]} \\ D_2^{-1/2} V_{[2,\dots,l+1]} \end{bmatrix}$ and normalize the rows
of Z
7. Apply k-means to cluster the rows of Z into k clusters
8. For each cluster, check the number of documents. If it is
higher than a given threshold, construct a new
document-by-word matrix from the documents and
words in the cluster, and return to step 1
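Steps 1 to 7 for a single level can be sketched in NumPy as follows. This is a hedged illustration under stated assumptions: the names `eigengap_k`, `kmeans` and `spectral_cocluster` are introduced here, the k-means routine is a plain Lloyd iteration with a deterministic farthest-point start used as a stand-in for whatever k-means the authors used, `eigengap_k` transcribes the step 4 formula literally, and every document and word is assumed to have a nonzero total count.

```python
import numpy as np

def eigengap_k(s):
    """Step 4 as stated: k = argmax over i > 1 of (s[i-1] - s[i]) / s[i-1] (1-based i)."""
    s = np.asarray(s, dtype=float)
    s = s[s > 1e-12]                      # drop zero singular values (avoids 0/0)
    if len(s) < 2:
        return 1
    gaps = (s[:-1] - s[1:]) / s[:-1]      # gaps[j] belongs to the 1-based index i = j + 2
    return int(np.argmax(gaps)) + 2

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with farthest-point initialisation (illustrative stand-in)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):       # keep the old center if a cluster is empty
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_cocluster(A, k=None):
    """Return (doc_labels, word_labels) for the document-by-word matrix A."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    d1 = 1.0 / np.sqrt(A.sum(axis=1))     # diagonal of D1^(-1/2), from row sums
    d2 = 1.0 / np.sqrt(A.sum(axis=0))     # diagonal of D2^(-1/2), from column sums
    An = d1[:, None] * A * d2[None, :]    # step 2
    U, s, Vt = np.linalg.svd(An, full_matrices=False)   # step 3
    if k is None:
        k = eigengap_k(s)                 # step 4
    if k <= 1:
        return np.zeros(m, dtype=int), np.zeros(n, dtype=int)
    l = int(np.ceil(np.log2(k)))          # step 5: columns 2 .. l+1 (1-based)
    Z = np.vstack([d1[:, None] * U[:, 1:l + 1],
                   d2[:, None] * Vt.T[:, 1:l + 1]])     # step 6
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Z = Z / np.where(norms > 0, norms, 1.0)             # row normalisation
    labels = kmeans(Z, k)                 # step 7
    return labels[:m], labels[m:]
```

Documents and words are clustered jointly because their rows are stacked in Z, so each word ends up with the label of the document cluster it is most associated with. Step 8 would call `spectral_cocluster` again on the sub-matrix of any cluster that exceeds the document threshold.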
Uses of a hierarchical co-clustering
• Documents are clustered according to topic hierarchy
• Words associated with cluster describe topic
• Words can be used for offline clustering
Results
Precision of clustering 367 news stories from ABC and CNN.
k defined by the eigengap
Salience: 3743 words / TF-IDF: 7242 words
Co-clustering
Test set    Precision   Recall    F1
Salience    74.6 %      41 %      52.9 %
TF-IDF      50.4 %      40.7 %    45.1 %
k-means
Test set    Precision   Recall    F1
Salience    69.5 %      37.1 %    48.4 %
TF-IDF      38.3 %      41.8 %    40 %
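The F1 column can be recomputed from the precision and recall columns, assuming F1 is the usual harmonic mean of the two. The helper name `f1` is introduced here for illustration, and because the tabulated precision and recall are already rounded, a recomputed value may differ from the table by about a tenth of a percentage point.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the tables above (percent).
rows = {
    "co-clustering / Salience": (74.6, 41.0),
    "co-clustering / TF-IDF": (50.4, 40.7),
    "k-means / Salience": (69.5, 37.1),
    "k-means / TF-IDF": (38.3, 41.8),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1(p, r):.1f} %")
```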
Results
Precision of clustering 367 news stories from ABC and CNN.
k defined by the eigengap
Co-clustering
Test set    Precision   Recall    F1
Salience    64.3 %      48.3 %    55.2 %
k-means
Test set    Precision   Recall    F1
Salience    58.3 %      41.7 %    48.8 %
Goals
1. Find aligning segments in
1.1 text-text pairs
1.2 text-video pairs
2. Expand to multiple documents (text and video)
Goals
Using aligned segments:
• Create elaborated story from several sources
• Create links between video and text
• Summarize video and text
• Select the appropriate media form for the information
Segments
Segments can be defined at different resolutions
• in text:
• word
• sentence
• paragraph
• in video:
• image
• shot
• Expand to multiple documents (text and video)
Problems
• Degrees of comparability:
• Parallel pairs
• Near-parallel pairs
• Comparable pairs
• Representation of segments in different media: how to
compare them?
Techniques
• Micro-macro alignment
• Top-down
• Bottom-up
• Make use of several assumptions:
• Linearity
• Low variance of slope
• Injectivity
• Annealing and Context
Multiple documents
Two possible directions
1. Dimension reduction
2. Expand dimensions of search algorithms