Introduction                      Clustering                 Alignment




               Doctoral Seminar: Multi-document clustering
                             and alignment

                               Wim De Smet


                              March 23, 2007



                                Current goals




       CLASS, WP7
          1. Cluster documents according to topics.
          2. Align text and video



                                     Goal




       Given news stories about different events, from several sources,
       cluster the stories that cover the same event.



                                  Clustering



       Typical clustering algorithms: bag-of-words approach.
       Document-by-word matrix:

            0.5   0.5   0.5   0     0     0
            0.4   0.6   0.5   0     0     0
            0.5   0.4   0.6   0     0     0
       A =  0     0     0     0.5   0.5   0.5
            0     0     0     0.5   0.5   0.5
            0.4   0.4   0     0.4   0.4   0.4
            0.4   0.4   0.4   0     0.4   0.4



                                    Clustering
       Document clustering according to word-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4



                                    Clustering
       Word clustering according to document-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4
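
The two views of A (documents compared by their words, words compared by their documents) can be sketched with one similarity function. Cosine similarity is an assumption here; the slides do not name a similarity measure.

```python
import numpy as np

# Toy document-by-word matrix from the slides (7 documents x 6 words).
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

def cosine_similarity(M):
    """Pairwise cosine similarity between the rows of M (rows must be non-zero)."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

doc_sim = cosine_similarity(A)      # 7 x 7: documents compared by their words
word_sim = cosine_similarity(A.T)   # 6 x 6: words compared by their documents
```

Clustering the rows of A uses only `doc_sim`, clustering the columns uses only `word_sim`; neither view sees the structure of the other.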



                               Co-clustering
       Purpose: simultaneously clustering words and documents,
       preserving information found in both clusterings.

            0.5   0.5   0.5   0     0     0
            0.4   0.6   0.5   0     0     0
            0.5   0.4   0.6   0     0     0          0.5   0
       A =  0     0     0     0.5   0.5   0.5        0     0.5
            0     0     0     0.5   0.5   0.5        0.4   0.4
            0.4   0.4   0     0.4   0.4   0.4
            0.4   0.4   0.4   0     0.4   0.4
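
The slide does not say how the small 3 x 2 matrix is derived from A. As a sketch under stated assumptions: take three document clusters and two word clusters (the labels below are assumed, read off the block pattern of A), and summarise each block by the mean of its non-zero entries. That aggregation rule is a guess that happens to reproduce the numbers on the slide; it is not confirmed as the author's method.

```python
import numpy as np

A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])
doc_labels = np.array([0, 0, 0, 1, 1, 2, 2])   # assumed document clusters
word_labels = np.array([0, 0, 0, 1, 1, 1])     # assumed word clusters

def block_summary(A, doc_labels, word_labels):
    """One value per (document cluster, word cluster) block of A."""
    n_doc, n_word = doc_labels.max() + 1, word_labels.max() + 1
    out = np.zeros((n_doc, n_word))
    for r in range(n_doc):
        for c in range(n_word):
            block = A[doc_labels == r][:, word_labels == c]
            nonzero = block[block > 0]
            if nonzero.size:                 # all-zero blocks stay at 0
                out[r, c] = nonzero.mean()   # assumed aggregation rule
    return out

summary = block_summary(A, doc_labels, word_labels)
```

With these assumed labels, `summary` has rows (0.5, 0), (0, 0.5) and (0.4, 0.4), up to floating-point rounding.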



                         Hierarchical Co-clustering




       Hierarchical co-clustering:
          1. Co-cluster documents and words.
          2. For each cluster: if it contains too many documents, build the
             sub-matrix of its documents and words.
          3. Repeat step 1 on each sub-matrix.
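
The recursion in these three steps can be sketched independently of the co-clustering method itself. In the sketch below, `cocluster` is a hypothetical callback that returns a list of (documents, words) pairs, and `split_in_half` is only a stand-in so the example runs; `max_docs` plays the role of "too many documents".

```python
def hierarchical_cocluster(docs, words, cocluster, max_docs):
    """Build a tree of co-clusters, splitting any node with more than max_docs documents."""
    node = {"docs": list(docs), "words": list(words), "children": []}
    if len(node["docs"]) <= max_docs:
        return node
    parts = cocluster(node["docs"], node["words"])
    # Stop if the step did not really split the documents (avoids endless recursion).
    if len(parts) < 2 or any(len(d) >= len(node["docs"]) for d, _ in parts):
        return node
    node["children"] = [hierarchical_cocluster(d, w, cocluster, max_docs)
                        for d, w in parts]
    return node

def leaves(node):
    """Document lists of the leaf clusters, left to right."""
    if not node["children"]:
        return [node["docs"]]
    return [docs for child in node["children"] for docs in leaves(child)]

def split_in_half(docs, words):
    """Stand-in for a real co-clustering step: cut both lists in the middle."""
    d, w = len(docs) // 2, len(words) // 2
    return [(docs[:d], words[:w]), (docs[d:], words[w:])]

tree = hierarchical_cocluster(range(8), range(4), split_in_half, max_docs=2)
```

Each node keeps its own word list, so the sub-matrix for the next level is simply the rows and columns named by that node.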



          Bipartite Spectral Graph Partitioning: motivation

       View the document-by-word matrix as a bipartite graph:

                         word1   word2   word3
            document1    a1,1    0       0
       A =  document2    0       a2,2    a2,3
            document3    a3,1    a3,2    0
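
A minimal sketch of the bipartite view: documents and words become the two node sets, and A supplies the edge weights between them, so the graph's adjacency matrix has A and its transpose on the off-diagonal blocks. The numeric weights below are placeholders, not values from the slides.

```python
import numpy as np

def bipartite_adjacency(A):
    """Adjacency matrix with document nodes first and word nodes second."""
    m, n = A.shape
    return np.block([[np.zeros((m, m)), A],
                     [A.T, np.zeros((n, n))]])

# Placeholder weights with the same zero pattern as the matrix above.
A = np.array([[0.7, 0.0, 0.0],
              [0.0, 0.2, 0.5],
              [0.3, 0.4, 0.0]])
M = bipartite_adjacency(A)
```

The zero diagonal blocks express that there are no document-document or word-word edges.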



          Bipartite Spectral Graph Partitioning: motivation
       How can the graph be divided into document clusters D_m and
       associated word clusters W_m?



          Bipartite Spectral Graph Partitioning: motivation

       \[ W_m = \Bigl\{ w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij},\ \forall l = 1, \dots, k \Bigr\} \]



          Bipartite Spectral Graph Partitioning: motivation

       \[ D_m = \Bigl\{ d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij},\ \forall l = 1, \dots, k \Bigr\} \]
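
The two definitions are mirror images: a word belongs to the document cluster in which it carries the most weight, and a document belongs to the word cluster in which it carries the most weight. A minimal sketch, using the toy matrix from the earlier slides and cluster index lists chosen by hand (an assumption for illustration):

```python
import numpy as np

A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

def associated_words(A, doc_clusters):
    """W_m: words whose weight summed over D_m is >= their weight summed over every D_l."""
    mass = np.array([A[docs].sum(axis=0) for docs in doc_clusters])  # k x n_words
    best = mass.max(axis=0)
    return [set(np.flatnonzero(row >= best).tolist()) for row in mass]

def associated_documents(A, word_clusters):
    """D_m: the same rule applied to the transposed matrix."""
    return associated_words(A.T, word_clusters)

W = associated_words(A, [[0, 1, 2], [3, 4], [5, 6]])
D = associated_documents(A, [[0, 1, 2], [3, 4, 5]])
```

Because the definition uses "greater than or equal", a word or document can land in several clusters when sums tie; the sketch keeps that behaviour rather than breaking ties.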



           Bipartite Spectral Graph Partitioning: algorithm

          1. Given the m × n document-by-word matrix A, calculate the
             diagonal helper matrices $D_1$ and $D_2$, so that:

             \[ \forall\, 1 \le i \le m : \quad D_1(i, i) = \sum_{j} A_{i,j} \]

             \[ \forall\, 1 \le j \le n : \quad D_2(j, j) = \sum_{i} A_{i,j} \]

          2. Compute $A_n = D_1^{-1/2} A D_2^{-1/2}$
          3. Take the SVD of $A_n$: $A_n = U \Lambda V^*$
          4. Determine k, the number of clusters, by the eigengap:
             $k = \arg\max_{m \ge i > 1} (\lambda_{i-1} - \lambda_i) / \lambda_{i-1}$, where
             $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ are the singular values of $A_n$
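
Steps 1 to 4 can be sketched with NumPy as follows. The function names and the `tol` cut-off are hypothetical choices, the input is the toy matrix from the earlier slides, and the eigengap is implemented literally as written in step 4.

```python
import numpy as np

def normalize(A):
    """Steps 1-2: scale A by the inverse square roots of its row and column sums."""
    d1 = A.sum(axis=1)   # diagonal of D1: one row sum per document
    d2 = A.sum(axis=0)   # diagonal of D2: one column sum per word
    # Assumes every row sum and column sum is strictly positive.
    return A / np.sqrt(np.outer(d1, d2))

def eigengap_k(singular_values, tol=1e-12):
    """Step 4: the 1-based index i > 1 maximising (s[i-1] - s[i]) / s[i-1]."""
    s = np.asarray(singular_values, dtype=float)
    prev, nxt = s[:-1], s[1:]
    gaps = np.zeros(len(prev))
    keep = prev > tol                    # skip gaps that start at a numerically zero value
    gaps[keep] = (prev[keep] - nxt[keep]) / prev[keep]
    return int(np.argmax(gaps)) + 2      # gaps[0] belongs to i = 2

A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])
An = normalize(A)
sigma = np.linalg.svd(An, compute_uv=False)   # step 3, singular values only
k = eigengap_k(sigma)
```

For a non-negative matrix with positive row and column sums, the largest singular value of the normalised matrix is 1, which is a quick sanity check on the normalisation.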



   Bipartite Spectral Graph Partitioning: algorithm (cont.)


          5. From U and V, form $U_{[2,\dots,l+1]}$ and $V_{[2,\dots,l+1]}$
             respectively, by taking columns 2 to l + 1,
             where $l = \lceil \log_2 k \rceil$
          6. Compute
             \[ Z = \begin{bmatrix} D_1^{-1/2} U_{[2,\dots,l+1]} \\ D_2^{-1/2} V_{[2,\dots,l+1]} \end{bmatrix} \]
             and normalize the rows of Z
          7. Apply k-means to cluster the rows of Z into k clusters
          8. For each cluster, check the number of documents. If it is
             higher than a given threshold, construct a new
             document-by-word matrix formed by the documents and
             words in the cluster, and return to step 1
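
Steps 5 to 7 can be sketched as below. Several things are assumptions: the ceiling in l, the small hand-written k-means with a seeded random start (any k-means implementation would do), and the toy matrix, which is not from the slides but two blocks joined by weak links. Step 8 is the recursion and is left out here.

```python
import math
import numpy as np

def spectral_rows(A, k):
    """Steps 5-6: stack the scaled singular vectors and normalise each row."""
    d1, d2 = A.sum(axis=1), A.sum(axis=0)        # assumed strictly positive
    An = A / np.sqrt(np.outer(d1, d2))
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    l = math.ceil(math.log2(k))                  # assumes k >= 2 and a ceiling
    Z = np.vstack([U[:, 1:l + 1] / np.sqrt(d1)[:, None],      # document rows
                   Vt.T[:, 1:l + 1] / np.sqrt(d2)[:, None]])  # word rows
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.where(norms > 0, norms, 1.0)

def kmeans(X, k, n_iter=50, seed=0):
    """Step 7: plain Lloyd iterations started from k randomly chosen rows."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):              # leave empty clusters where they are
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Toy matrix (not from the slides): 5 documents x 6 words, two weakly linked blocks.
A = np.full((5, 6), 0.01)
A[:3, :3] = 0.5
A[3:, 3:] = 0.5
labels = kmeans(spectral_rows(A, k=2), k=2)
doc_labels, word_labels = labels[:5], labels[5:]
```

Because documents and words are rows of the same matrix Z, one k-means run labels both, which is what makes this a co-clustering.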



                   Uses of a hierarchical co-clustering




           • Documents are clustered according to a topic hierarchy
           • The words associated with a cluster describe its topic
           • The words can be used for offline clustering



                  Entries of document-by-word matrix




          1. TF-IDF
          2. WP 2’s Salience
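
For the first option, a minimal sketch of one common TF-IDF variant (relative term frequency times log(N / document frequency)) is shown below. The slides do not say which variant was used, and the whitespace tokenisation and example sentences are assumptions. WP 2's salience measure is not described here, so no sketch is attempted for it.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Rows are documents, columns are vocabulary words, entries are tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]   # assumes non-empty documents
    vocab = sorted({w for toks in tokenized for w in toks})
    df = Counter(w for toks in tokenized for w in set(toks))
    n_docs = len(docs)
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[w] / len(toks) * math.log(n_docs / df[w])
                     for w in vocab])
    return vocab, rows

vocab, rows = tfidf_matrix([
    "storm hits coast",
    "storm warning issued",
    "election results announced",
])
```

A word that occurs in fewer documents gets a larger weight: in the first document, "hits" scores higher than "storm", which also appears in the second document.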



                                   Results

       Results of clustering 367 news stories from ABC and CNN.
       k: determined by the eigengap
       Salience: 3743 words / TF-IDF: 7242 words

       Co-clustering
        Test set   Precision   Recall   F1
        Salience   74.6 %      41 %     52.9 %
        TF-IDF     50.4 %      40.7 %   45.1 %

       k-means
        Test set   Precision   Recall   F1
        Salience   69.5 %      37.1 %   48.4 %
        TF-IDF     38.3 %      41.8 %   40 %
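
Assuming the F1 column is the usual harmonic mean of precision and recall (the slides do not define it), it can be recomputed from the other two columns:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Salience rows of the two tables above, in percent.
coclustering_f1 = f1_score(74.6, 41.0)
kmeans_f1 = f1_score(69.5, 37.1)
```

For these two rows the recomputed values agree with the reported 52.9 % and 48.4 % after rounding to one decimal.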



                                   Results


       Results of clustering 367 news stories from ABC and CNN.
       k: determined by the eigengap

       Co-clustering
        Test set   Precision   Recall   F1
        Salience   64.3 %      48.3 %   55.2 %

       k-means
        Test set   Precision   Recall   F1
        Salience   58.3 %      41.7 %   48.8 %



                                      Goals
          1. Find aligning segments in
               1.1 text-text pairs
               1.2 text-video pairs
          2. Expand to multiple documents (text and video)



                                      Goals




       Using aligned segments:
           • Create an elaborated story from several sources
           • Create links between video and text
           • Summarize video and text
           • Select the appropriate medium for the information



                                  Segments


       Segments can be defined at different resolutions
           • in text:
                • word
                • sentence
                • paragraph
           • in video:
                • image
                • shot
           • Expand to multiple documents (text and video)



                                   Problems




           • Degrees of comparability:
               • Parallel pairs
               • Near-parallel pairs
               • Comparable pairs
           • Representation of segments in different media: how can
               they be compared?



                                         Techniques



           • Micro-macro alignment
               • Top-down
               • Bottom-up
           • Make use of several assumptions:
               • Linearity
               • Low variance of slope
               • Injectivity
           • Annealing and Context



                            Multiple documents




       Two possible directions
          1. Dimension reduction
          2. Expand dimensions of search algorithms
