Introduction                      Clustering                 Alignment




               Doctoral Seminar: Multi-document clustering
                             and alignment

                               Wim De Smet


                              March 23, 2007



                                Current goals




       CLASS, WP7
          1. Cluster documents according to topics.
          2. Align text and video



                                     Goal




       Given news stories about different events, from several sources,
       cluster the stories that cover the same event.



                                  Clustering



       Typical clustering algorithms: bag-of-words approach.
       Document-by-words matrix:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4



                                    Clustering
       Document clustering according to word-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4



                                    Clustering
       Word clustering according to document-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4
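
A minimal sketch of the two views above, using the example matrix A. The slides do not name a similarity measure; cosine similarity is assumed here purely for illustration. Documents are compared as rows of A, words as columns.

```python
import numpy as np

# Example document-by-word matrix from the slides (7 documents x 6 words).
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])


def cosine_similarity(M):
    """Pairwise cosine similarity between the rows of M (rows must be non-zero)."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T


# Document clustering compares rows; word clustering compares columns.
doc_sim = cosine_similarity(A)      # shape (7, 7)
word_sim = cosine_similarity(A.T)   # shape (6, 6)
```

Documents 1 to 3 and documents 4 to 5 share no words, so their similarity is zero, while documents 6 and 7 overlap with both groups.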



                               Co-clustering
       Purpose: cluster words and documents simultaneously,
       preserving the information found in both clusterings.

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4

       Resulting co-cluster matrix (document clusters × word clusters):

          0.5      0
           0      0.5
          0.4     0.4



                         Hierarchical Co-clustering




       Hierarchical co-clustering:
          1. Co-cluster documents and words.
          2. For each cluster that contains too many documents, compute
             the sub-matrix of its documents and words.
          3. Repeat step 1 on each sub-matrix.
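
The three steps above can be sketched as a recursion. This is only an illustration: `cocluster` is a hypothetical callback standing in for any co-clustering routine, `max_docs` is an assumed name for the "too many documents" limit, and `split_in_half` is a toy stand-in used to make the example runnable.

```python
def hierarchical_cocluster(docs, words, cocluster, max_docs):
    """Return a tree (list of nodes) of co-clusters.

    cocluster(docs, words) is a hypothetical callback that returns a list of
    (doc_subset, word_subset) pairs; max_docs is the assumed size limit.
    """
    children = []
    for sub_docs, sub_words in cocluster(docs, words):          # step 1
        node = {"docs": sub_docs, "words": sub_words, "children": []}
        # Step 2: the cluster is too large. The second comparison stops the
        # recursion when a split made no progress.
        if max_docs < len(sub_docs) < len(docs):
            # Step 3: repeat on the sub-matrix of this cluster.
            node["children"] = hierarchical_cocluster(
                sub_docs, sub_words, cocluster, max_docs)
        children.append(node)
    return children


def split_in_half(docs, words):
    """Toy stand-in for a real co-clustering step: cut both lists in two."""
    d, w = len(docs) // 2, len(words) // 2
    return [(docs[:d], words[:w]), (docs[d:], words[w:])]


tree = hierarchical_cocluster(list(range(8)), list("abcdefgh"),
                              split_in_half, max_docs=2)
```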



          Bipartite Spectral Graph Partitioning: motivation

       View document-by-word matrix as bipartite graph

                         word1   word2       word3
          document1       a1,1     0           0
       A= document2        0      a2,2        a2,3
          document3       a3,2    a3,3         0

       How can the graph be divided into document clusters \(D_m\) and
       associated word clusters \(W_m\)?

       \[ W_m = \Big\{ w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij},\ \forall l = 1, \dots, k \Big\} \]

       \[ D_m = \Big\{ d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij},\ \forall l = 1, \dots, k \Big\} \]
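
A small sketch of how the two definitions can be evaluated once one side of the partition is known. The function names and the label-array representation are assumptions made for illustration. Ties are broken by taking the lowest cluster index, whereas the definitions with ≥ would allow a tied word or document to belong to several clusters.

```python
import numpy as np


def assign_words(A, doc_labels, k):
    """W_m: give each word j to the document cluster with the largest column sum."""
    # weight[m, j] = sum of A[i, j] over the documents i in D_m
    weight = np.stack([A[doc_labels == m].sum(axis=0) for m in range(k)])
    return weight.argmax(axis=0)


def assign_docs(A, word_labels, k):
    """D_m: give each document i to the word cluster with the largest row sum."""
    # weight[i, m] = sum of A[i, j] over the words j in W_m
    weight = np.stack([A[:, word_labels == m].sum(axis=1) for m in range(k)],
                      axis=1)
    return weight.argmax(axis=1)


# Toy matrix: documents 0 and 1 use words 0 and 1, document 2 uses words 2 and 3.
A_toy = np.array([[1.0, 1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
word_labels = assign_words(A_toy, np.array([0, 0, 1]), k=2)
doc_labels = assign_docs(A_toy, word_labels, k=2)
```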



           Bipartite Spectral Graph Partitioning: algorithm

          1. Given the m × n document-by-word matrix A, calculate the
             diagonal helper matrices \(D_1\) and \(D_2\) such that:

             \[ \forall\, 1 \le i \le m : D_1(i,i) = \sum_{j} A_{i,j} \]

             \[ \forall\, 1 \le j \le n : D_2(j,j) = \sum_{i} A_{i,j} \]

          2. Compute \(A_n = D_1^{-1/2}\, A\, D_2^{-1/2}\)
          3. Take the SVD of \(A_n\): \(\mathrm{SVD}(A_n) = U \Lambda V^{*}\)
          4. Determine k, the number of clusters, by the eigengap:
             \[ k = \arg\max_{m \ge i > 1} \frac{\lambda_{i-1} - \lambda_i}{\lambda_{i-1}} \]
             where \(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m\) are the singular values of \(A_n\)
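
Steps 1 to 4 can be sketched with NumPy as follows. The function names are illustrative, and every row and column of A is assumed to have a positive sum. The eigengap rule is implemented literally as written in step 4 (it returns the 1-based index i that maximizes the relative gap); other formulations index the gap differently.

```python
import numpy as np


def normalized_svd(A):
    """Steps 1-3: scale A by its row and column sums, then take the SVD."""
    d1 = A.sum(axis=1)                      # diagonal of D1 (row sums)
    d2 = A.sum(axis=0)                      # diagonal of D2 (column sums)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    return An, U, s, Vt                     # s is sorted in descending order


def eigengap_k(s):
    """Step 4, read literally: the 1-based i > 1 with the largest relative gap."""
    prev, cur = s[:-1], s[1:]
    gaps = np.zeros_like(prev)
    # Skip zero denominators so trailing zero singular values are ignored.
    np.divide(prev - cur, prev, out=gaps, where=prev > 0)
    return int(np.argmax(gaps)) + 2         # gaps[0] corresponds to i = 2


A_small = np.array([[1.0, 1.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
An, U, s, Vt = normalized_svd(A_small)
```

With this normalization the largest singular value of the scaled matrix is 1, which is why the later steps skip the first singular vectors.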



   Bipartite Spectral Graph Partitioning: algorithm (cont.)


          5. From U and V, form \(U_{[2,\dots,l+1]}\) and \(V_{[2,\dots,l+1]}\)
             respectively, by taking columns 2 to l + 1,
             where \(l = \lceil \log_2 k \rceil\)
          6. Compute
             \[ Z = \begin{bmatrix} D_1^{-1/2}\, U_{[2,\dots,l+1]} \\ D_2^{-1/2}\, V_{[2,\dots,l+1]} \end{bmatrix} \]
             and normalize the rows of Z
          7. Apply k-means to cluster the rows of Z into k clusters
          8. For each cluster, check the number of documents. If it is
             higher than a given threshold, construct a new
             document-by-word matrix formed by the documents and
             words in the cluster, and proceed to step 1
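
Steps 5 to 7 can be sketched as below, with steps 1 to 3 repeated so the example is self-contained. This is an illustration under assumptions, not the implementation used for the results: the function names are invented, k ≥ 2 and positive row and column sums are assumed, and the small k-means uses a deterministic farthest-point initialization chosen only to keep the example reproducible.

```python
import math
import numpy as np


def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations with farthest-point initialization (an assumption)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels


def spectral_cocluster(A, k):
    """Return (document labels, word labels) for k >= 2 co-clusters."""
    m, _ = A.shape
    d1, d2 = A.sum(axis=1), A.sum(axis=0)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    l = math.ceil(math.log2(k))
    # Step 5: columns 2..l+1 (1-based) are indices 1..l (0-based).
    Zu = U[:, 1:l + 1] / np.sqrt(d1)[:, None]
    Zv = Vt.T[:, 1:l + 1] / np.sqrt(d2)[:, None]
    # Step 6: stack document rows on top of word rows, then normalize rows.
    Z = np.vstack([Zu, Zv])
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    Z = Z / norms
    # Step 7: cluster documents and words together.
    labels = kmeans(Z, k)
    return labels[:m], labels[m:]


# Two nearly separate blocks, weakly connected by a small constant.
A_blocks = np.array([[1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1],
                     [0, 0, 0, 1, 1, 1]], dtype=float) + 0.01
doc_labels, word_labels = spectral_cocluster(A_blocks, k=2)
```

Because documents and words are rows of the same matrix Z, each k-means cluster directly yields a document cluster together with its associated word cluster.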



                   Uses of a hierarchical co-clustering




           • Documents are clustered according to a topic hierarchy
           • The words associated with a cluster describe its topic
           • Words can be used for offline clustering



                  Entries of document-by-word matrix




          1. TF-IDF
          2. WP 2’s Salience
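
For the first option, a minimal TF-IDF sketch. TF-IDF has several variants and the slides do not say which one was used; this version assumes relative term frequency multiplied by log(N / document frequency), and assumes every document is a non-empty list of tokens. The Salience weighting is not sketched because the slides do not define it.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] weights word j in document i."""
    vocab = sorted({w for doc in docs for w in doc})
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    n_docs = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([(counts[w] / len(doc)) * math.log(n_docs / df[w])
                     for w in vocab])
    return vocab, rows


example_docs = [["election", "vote", "vote"], ["election", "storm"]]
vocab, rows = tfidf_matrix(example_docs)
```

A word that occurs in every document, such as "election" here, gets weight zero, so it does not influence the clustering.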



                                   Results

       Precision of clustering 367 news stories from ABC and CNN.
       k is defined by the eigengap.
       Salience: 3743 words / TF-IDF: 7242 words

       Co-clustering
        Test set   Precision   Recall        F1
        Salience    74.6 %      41 %       52.9 %
        TF-IDF      50.4 %     40.7 %      45.1 %

       k-means
        Test set   Precision   Recall        F1
        Salience    69.5 %     37.1 %      48.4 %
        TF-IDF      38.3 %     41.8 %       40 %
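
The F1 column is conventionally the harmonic mean of precision and recall. The slides do not state the formula, so the helper below is an assumption shown for reference; applied to the first Salience row of the co-clustering table it gives roughly 52.9.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Co-clustering with Salience weights: precision 74.6 %, recall 41 %.
salience_f1 = f1_score(74.6, 41.0)
```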



                                   Results


       Precision of clustering 367 news stories from ABC and CNN.
       k is defined by the eigengap.

       Co-clustering
        Test set   Precision   Recall        F1
        Salience    64.3 %     48.3 %      55.2 %

       k-means
        Test set   Precision   Recall        F1
        Salience    58.3 %     41.7 %      48.8 %



                                      Goals
          1. Find aligning segments in
               1.1 text-text pairs
               1.2 text-video pairs
          2. Expand to multiple documents (text and video)



                                      Goals




       Using aligned segments:
           • Create elaborated story from several sources
           • Create links between video and text
           • Summarize video and text
           • Select the appropriate media form for the information



                                  Segments


       Segments can be defined at different resolutions
           • in text:
                • word
                • sentence
                • paragraph
           • in video:
                • image
                • shot
           • Expand to multiple documents (text and video)



                                   Problems




           • Degrees of comparability:
               • Parallel pairs
               • Near-parallel pairs
               • Comparable pairs
           • Representation of segments in different media: how can
               they be compared?



                                         Techniques



    • Micro-macro alignment
        • Top-down
        • Bottom-up
    • Make use of several assumptions:
        • Linearity
        • Low variance of slope
        • Injectivity
    • Annealing and Context



                            Multiple documents




       Two possible directions
          1. Dimension reduction
          2. Expand dimensions of search algorithms
