Introduction                      Clustering                 Alignment




               Doctoral Seminar: Multi-document clustering
                             and alignment

                               Wim De Smet


                              March 23, 2007



                                Current goals




       CLASS, WP7
          1. Cluster documents according to topics.
          2. Align text and video



                                     Goal




       Given news stories about different events, from several sources,
       cluster the stories that report the same event.



                                  Clustering



       Typical clustering algorithms: bag-of-words approach.
       Document-by-word matrix:

                0.5   0.5   0.5    0     0     0
                0.4   0.6   0.5    0     0     0
                0.5   0.4   0.6    0     0     0
          A =    0     0     0    0.5   0.5   0.5
                 0     0     0    0.5   0.5   0.5
                0.4   0.4    0    0.4   0.4   0.4
                0.4   0.4   0.4    0    0.4   0.4



                                    Clustering
       Document clustering according to word-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4



                                    Clustering
       Word clustering according to document-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4
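
The two views above (documents compared by the words they contain, words compared by the documents they occur in) can be illustrated on the same matrix A. This is a minimal sketch, assuming NumPy and cosine similarity as the measure; the slides do not name a similarity measure, and the helper `cosine_similarity_rows` is a hypothetical name.

```python
import numpy as np

# Document-by-word matrix A from the slides (7 documents x 6 words).
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

def cosine_similarity_rows(M):
    """Pairwise cosine similarity between the rows of M (assumed measure)."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

# Documents are rows of A: compare them through their word weights.
doc_sim = cosine_similarity_rows(A)      # shape (7, 7)

# Words are columns of A: transpose and compare them through the documents.
word_sim = cosine_similarity_rows(A.T)   # shape (6, 6)
```

Documents 1 to 3 share no words with documents 4 and 5, so their similarity is zero, while documents 6 and 7 overlap with both groups.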



                              Co-clustering
       Purpose: simultaneously clustering words and documents,
       preserving information found in both clusterings.
                0.5   0.5   0.5    0     0     0
                0.4   0.6   0.5    0     0     0
                0.5   0.4   0.6    0     0     0
          A =    0     0     0    0.5   0.5   0.5
                 0     0     0    0.5   0.5   0.5
                0.4   0.4    0    0.4   0.4   0.4
                0.4   0.4   0.4    0    0.4   0.4



                               Co-clustering
       Result: a smaller matrix of document clusters by word clusters
       (here 3 document clusters and 2 word clusters):

                0.5    0
                 0    0.5
                0.4   0.4
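
The slide does not say how the small 3 × 2 matrix is derived from the 7 × 6 matrix A. As a sketch under stated assumptions, the code below groups documents 1-3, 4-5 and 6-7 and words 1-3 and 4-6 (assumed groupings), and summarizes each block by its median (an assumed choice). With these assumptions it reproduces the values shown on the slide; `block_summary` is a hypothetical helper name.

```python
import numpy as np

# Document-by-word matrix A from the slides (7 documents x 6 words).
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

# Assumed cluster memberships (0-based row and column indices).
doc_clusters = [[0, 1, 2], [3, 4], [5, 6]]
word_clusters = [[0, 1, 2], [3, 4, 5]]

def block_summary(M, row_groups, col_groups):
    """Summarize each (document cluster, word cluster) block by its median."""
    out = np.zeros((len(row_groups), len(col_groups)))
    for r, rows in enumerate(row_groups):
        for c, cols in enumerate(col_groups):
            out[r, c] = np.median(M[np.ix_(rows, cols)])
    return out

summary = block_summary(A, doc_clusters, word_clusters)
```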



                         Hierarchical Co-clustering




       Hierarchical co-clustering:
          1. Co-cluster documents and words.
          2. For each cluster: if it contains too many documents, extract
             the sub-matrix of its documents and words.
          3. Repeat from step 1 on each sub-matrix.
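
The recursion in these steps can be sketched independently of the co-clustering method used. In the sketch below, `hierarchical_cocluster`, `split_in_half` and `max_docs` are hypothetical names, and `split_in_half` is only a toy stand-in for a real co-clustering step; it is not the method used in the slides.

```python
import numpy as np

def split_in_half(A):
    """Toy stand-in for a co-clustering step: first half vs second half."""
    rows = (np.arange(A.shape[0]) >= A.shape[0] // 2).astype(int)
    cols = (np.arange(A.shape[1]) >= A.shape[1] // 2).astype(int)
    return rows, cols

def hierarchical_cocluster(A, docs, words, cocluster, max_docs):
    """Return the leaf clusters as (document ids, word ids) pairs."""
    row_labels, col_labels = cocluster(A)            # step 1
    leaves = []
    for c in np.unique(row_labels):
        rows = np.where(row_labels == c)[0]
        cols = np.where(col_labels == c)[0]
        sub_docs = [docs[i] for i in rows]
        sub_words = [words[j] for j in cols]
        too_big = len(rows) > max_docs               # step 2: size check
        # Guard against endless recursion when a split makes no progress.
        can_split = len(rows) < len(docs) and len(cols) > 0
        if too_big and can_split:
            sub = A[np.ix_(rows, cols)]              # step 2: sub-matrix
            leaves += hierarchical_cocluster(        # step 3: repeat
                sub, sub_docs, sub_words, cocluster, max_docs)
        else:
            leaves.append((sub_docs, sub_words))
    return leaves

A = np.ones((4, 4))
leaves = hierarchical_cocluster(
    A, ["d0", "d1", "d2", "d3"], ["w0", "w1", "w2", "w3"],
    split_in_half, max_docs=1)
```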



          Bipartite Spectral Graph Partitioning: motivation

       View the document-by-word matrix as a bipartite graph:

                         word1   word2   word3
          document1       a1,1     0       0
       A= document2        0      a2,2    a2,3
          document3       a3,2    a3,3     0
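
To make the graph view concrete: documents and words are the two vertex sets, and every non-zero entry of A is a weighted edge between a document and a word. The sketch below uses hypothetical numeric weights and an illustrative placement of the non-zero entries; it builds the edge list and the symmetric adjacency matrix of the bipartite graph with NumPy.

```python
import numpy as np

# Hypothetical weights for a 3 x 3 document-by-word matrix.
A = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 3.0],
    [4.0, 5.0, 0.0],
])
m, n = A.shape

# One weighted edge (document, word, weight) per non-zero entry of A.
edges = [(f"document{i + 1}", f"word{j + 1}", A[i, j])
         for i in range(m) for j in range(n) if A[i, j] > 0]

# Adjacency matrix of the bipartite graph: documents first, then words.
# There are no document-document or word-word edges, hence the zero blocks.
M = np.block([
    [np.zeros((m, m)), A],
    [A.T, np.zeros((n, n))],
])
```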



          Bipartite Spectral Graph Partitioning: motivation
       How can the graph be divided into document clusters Dm and
       associated word clusters Wm?



          Bipartite Spectral Graph Partitioning: motivation
                                                                          
                                                                          
       $$W_m = \Big\{\, w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij},\ \forall l = 1, \dots, k \,\Big\}$$



          Bipartite Spectral Graph Partitioning: motivation
                                                                           
                                                                           
       $$D_m = \Big\{\, d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij},\ \forall l = 1, \dots, k \,\Big\}$$
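
The two definitions are mutually dependent: word clusters follow from document clusters and vice versa. A minimal NumPy sketch of both assignments is given below. The function names are hypothetical, and ties (which the definitions allow through the ≥ sign) are broken in favor of the lowest cluster index, which is an assumption of this sketch.

```python
import numpy as np

def words_for_doc_clusters(A, doc_labels, k):
    """Assign word j to the cluster m maximizing the sum of A[i, j], i in D_m."""
    sums = np.stack([A[doc_labels == m].sum(axis=0) for m in range(k)])
    return sums.argmax(axis=0)   # ties go to the lowest cluster index

def docs_for_word_clusters(A, word_labels, k):
    """Assign document i to the cluster m maximizing the sum of A[i, j], j in W_m."""
    sums = np.stack([A[:, word_labels == m].sum(axis=1) for m in range(k)])
    return sums.argmax(axis=0)   # ties go to the lowest cluster index

# First five documents of the example matrix: two clean blocks.
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
])
doc_labels = np.array([0, 0, 0, 1, 1])
word_labels = words_for_doc_clusters(A, doc_labels, k=2)
doc_labels_again = docs_for_word_clusters(A, word_labels, k=2)
```

On this block-diagonal example the two assignments agree with each other, so the pair of clusterings is a fixed point of the definitions.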



           Bipartite Spectral Graph Partitioning: algorithm

          1. Given the m × n document-by-word matrix A, calculate the
             diagonal helper matrices $D_1$ and $D_2$ such that:

             $$\forall\, 1 \le i \le m:\quad D_1(i,i) = \sum_j A_{i,j}$$

             $$\forall\, 1 \le j \le n:\quad D_2(j,j) = \sum_i A_{i,j}$$

          2. Compute $A_n = D_1^{-1/2}\, A\, D_2^{-1/2}$
          3. Take the SVD of $A_n$: $\mathrm{SVD}(A_n) = U \Lambda V^*$
          4. Determine k, the number of clusters, by the eigengap:
             $$k = \arg\max_{m \ge i > 1} \frac{\lambda_{i-1} - \lambda_i}{\lambda_{i-1}}$$
             where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ are the singular values of $A_n$
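
Steps 1 to 4 can be sketched with NumPy as follows. The function names `normalize` and `eigengap_k` are hypothetical, the example spectrum passed to `eigengap_k` is made up for illustration, and the eigengap rule follows the indexing of the slide literally (k is the index i at which the relative gap is largest). The sketch assumes every document and every word has a non-zero total weight.

```python
import numpy as np

def normalize(A):
    """Steps 1-2: scale A by the inverse square roots of row and column sums."""
    d1 = A.sum(axis=1)                 # diagonal of D1 (row sums)
    d2 = A.sum(axis=0)                 # diagonal of D2 (column sums)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    return An, d1, d2

def eigengap_k(s, eps=1e-12):
    """Step 4: index i (1-based, i > 1) with the largest relative gap."""
    s = np.asarray(s, dtype=float)
    prev, cur = s[:-1], s[1:]
    rel = np.zeros_like(prev)
    ok = prev > eps                    # avoid dividing by (near) zero
    rel[ok] = (prev[ok] - cur[ok]) / prev[ok]
    return int(np.argmax(rel)) + 2     # rel[0] corresponds to i = 2

# Document-by-word matrix A from the slides (7 documents x 6 words).
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])
An, d1, d2 = normalize(A)
U, s, Vt = np.linalg.svd(An, full_matrices=False)   # step 3

# Eigengap on a made-up spectrum: the largest relative drop is before index 4.
k_example = eigengap_k([1.0, 0.95, 0.9, 0.3, 0.25])
```

The largest singular value of the normalized matrix is always 1, which is why the later steps skip the first singular vectors.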



   Bipartite Spectral Graph Partitioning: algorithm (cont.)


          5. From U and V, form $U_{[2,\dots,l+1]}$ and $V_{[2,\dots,l+1]}$
             respectively, by taking columns 2 to l + 1,
             where $l = \lceil \log_2 k \rceil$
          6. Compute
             $$Z = \begin{bmatrix} D_1^{-1/2}\, U_{[2,\dots,l+1]} \\ D_2^{-1/2}\, V_{[2,\dots,l+1]} \end{bmatrix}$$
             and normalize the rows of Z
          7. Apply k-means to cluster the rows of Z into k clusters
          8. Check the number of documents in each cluster. If it is
             higher than a given threshold, construct a new
             document-by-word matrix from the documents and
             words in the cluster, and proceed to step 1
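
Steps 5 to 7 can be sketched as below, again with NumPy only. This is an illustration under assumptions, not the implementation behind the slides: k is passed in by hand instead of coming from the eigengap, the k-means routine is a small deterministic Lloyd iteration written for this example, and the names `spectral_embedding` and `kmeans` are hypothetical. The recursion of step 8 is omitted.

```python
import math
import numpy as np

def spectral_embedding(A, k):
    """Steps 1-3 and 5-6: stacked, row-normalized singular vectors Z."""
    d1, d2 = A.sum(axis=1), A.sum(axis=0)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    l = math.ceil(math.log2(k))
    U_sel = U[:, 1:l + 1]              # columns 2 .. l+1 of U
    V_sel = Vt.T[:, 1:l + 1]           # columns 2 .. l+1 of V
    Z = np.vstack([U_sel / np.sqrt(d1)[:, None],
                   V_sel / np.sqrt(d2)[:, None]])
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)

def kmeans(X, k, n_iter=50):
    """Step 7: plain Lloyd iterations with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Document-by-word matrix A from the slides (7 documents x 6 words).
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])
Z = spectral_embedding(A, k=2)         # 7 document rows, then 6 word rows
labels = kmeans(Z, k=2)
doc_labels, word_labels = labels[:7], labels[7:]
```

Because documents and words are embedded in the same space, a single k-means run yields both the document clusters and the word clusters associated with them.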



                   Uses of a hierarchical co-clustering




           • Documents are clustered according to topic hierarchy
           • Words associated with cluster describe topic
           • Words can be used for offline clustering



                  Entries of document-by-word matrix




          1. TF-IDF
          2. WP 2’s Salience
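
For the first option, a document-by-word matrix with TF-IDF entries can be built as sketched below. The slides do not state which TF-IDF variant was used; this sketch assumes relative term frequency multiplied by log(N / document frequency), one common variant. The toy documents and the name `tfidf_matrix` are hypothetical. The Salience measure is not defined in the slides and is therefore not sketched.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF of word j in doc i."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for d in docs for w in set(d))
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[w] / len(d) * math.log(n_docs / df[w])
                     for w in vocab])
    return vocab, rows

# Toy tokenized documents (made up for illustration).
docs = [
    ["election", "vote", "news"],
    ["storm", "flood", "news"],
    ["election", "poll", "news"],
]
vocab, M = tfidf_matrix(docs)
```

A word that occurs in every document, such as "news" here, gets weight 0 and so does not influence the clustering.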



                                   Results

       Precision of clustering 367 news stories from ABC and CNN.
       k: determined by the eigengap
       Salience: 3743 words / TF-IDF: 7242 words

       Co-clustering
        Entries     Precision    Recall      F1
        Salience     74.6 %      41 %      52.9 %
        TF-IDF       50.4 %      40.7 %    45.1 %

       k-means
        Entries     Precision    Recall      F1
        Salience     69.5 %      37.1 %    48.4 %
        TF-IDF       38.3 %      41.8 %    40 %
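
The F1 column is consistent with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The check below recomputes it from the rounded precision and recall values of this slide; the recomputed values agree with the reported ones to within 0.1 percentage points. How precision and recall of a clustering were measured is not stated in the slides, and the dictionary layout is a choice made for this sketch.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as given on the slide, in percent.
reported = {
    ("co-clustering", "Salience"): (74.6, 41.0, 52.9),
    ("co-clustering", "TF-IDF"): (50.4, 40.7, 45.1),
    ("k-means", "Salience"): (69.5, 37.1, 48.4),
    ("k-means", "TF-IDF"): (38.3, 41.8, 40.0),
}

# Absolute difference between recomputed and reported F1, per row.
differences = {key: abs(f1(p, r) - f)
               for key, (p, r, f) in reported.items()}
```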



                                   Results


       Precision of clustering 367 news stories from ABC and CNN.
       k: determined by the eigengap

       Co-clustering
        Entries     Precision    Recall      F1
        Salience     64.3 %      48.3 %    55.2 %

       k-means
        Entries     Precision    Recall      F1
        Salience     58.3 %      41.7 %    48.8 %



                                      Goals
          1. Find aligning segments in
               1.1 text-text pairs
               1.2 text-video pairs
          2. Expand to multiple documents (text and video)



                                      Goals




       Using aligned segments:
           • Create elaborated story from several sources
           • Create links between video and text
           • Summarize video and text
           • Select the appropriate medium for the information



                                  Segments


       Segments can be defined at different resolutions
           • in text:
                • word
                • sentence
                • paragraph
           • in video:
                • image
                • shot
           • Expand to multiple documents (text and video)



                                   Problems




           • Degrees of comparability:
               • Parallel pairs
               • Near-parallel pairs
               • Comparable pairs
           • Representation of segments in different media: how to
               compare



                                         Techniques



           • Micro-macro alignment
                • Top-down
                • Bottom-up
           • Make use of several assumptions:
                • Linearity
                • Low variance of slope
                • Injectivity
           • Annealing and Context
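
The slides do not specify an alignment algorithm. Purely as an illustration of two of the listed assumptions, the sketch below aligns the segments of two documents so that the alignment keeps the segment order (in the spirit of linearity) and uses each segment at most once (injectivity). It is a standard dynamic program over a similarity matrix; the similarity values and the name `monotone_alignment` are hypothetical.

```python
import numpy as np

def monotone_alignment(S):
    """Order-preserving one-to-one alignment maximizing total similarity.

    S[i, j] is the similarity between segment i of the first document
    and segment j of the second document.
    """
    n, m = S.shape
    score = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(score[i - 1, j - 1] + S[i - 1, j - 1],  # align i with j
                              score[i - 1, j],                        # skip segment i
                              score[i, j - 1])                        # skip segment j
    # Backtrack to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i, j] == score[i - 1, j]:
            i -= 1
        elif score[i, j] == score[i, j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    return pairs[::-1]

# Made-up similarities between 2 segments of one text and 3 of another.
S = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.8],
])
pairs = monotone_alignment(S)
```

Here the first segments are aligned with each other and the second segment of the first text with the third segment of the second text; the middle segment of the second text stays unaligned.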



                            Multiple documents




       Two possible directions
          1. Dimension reduction
          2. Expand dimensions of search algorithms
