Introduction                      Clustering                 Alignment




               Doctoral Seminar: Multi-document clustering
                             and alignment

                               Wim De Smet


                              March 23, 2007



                                Current goals




       CLASS, WP7
          1. Cluster documents according to topics.
          2. Align text and video



                                     Goal




       Given news stories about different events, drawn from several
       sources, cluster the stories that report the same event.



                                  Clustering



       Typical clustering algorithms: bag-of-words approach.
       Document-by-word matrix:

            0.5  0.5  0.5  0    0    0
            0.4  0.6  0.5  0    0    0
            0.5  0.4  0.6  0    0    0
       A =  0    0    0    0.5  0.5  0.5
            0    0    0    0.5  0.5  0.5
            0.4  0.4  0    0.4  0.4  0.4
            0.4  0.4  0.4  0    0.4  0.4
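
A minimal sketch of how a document-by-word matrix of this kind could be built. The toy corpus, the whitespace tokenisation, and the relative-frequency weighting are all assumptions made for illustration; the slides do not say how the example entries were obtained.

```python
from collections import Counter

def document_by_word_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = []
    for c in counts:
        total = sum(c.values())
        # Relative frequency of each vocabulary word in this document
        # (hypothetical weighting, chosen only to keep the sketch short).
        matrix.append([c[w] / total for w in vocab])
    return vocab, matrix

# Hypothetical toy corpus.
docs = ["storm hits coast", "storm floods coast", "election results announced"]
vocab, A = document_by_word_matrix(docs)
for row in A:
    print(["%.2f" % x for x in row])
```

Word order is discarded entirely, which is what makes this a bag-of-words representation.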



                                    Clustering
       Document clustering according to word-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4
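
One common way to measure the word-similarity of two documents is the cosine between their rows of A. The sketch below uses the example matrix from the slide; the choice of cosine similarity is an assumption, since the slide does not name a measure.

```python
import math

A = [
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# sim[i][j] compares document i with document j (rows of A).
sim = [[cosine(r, s) for s in A] for r in A]
for row in sim:
    print(" ".join("%.2f" % x for x in row))
```

Documents 1 to 3 are close to each other and have zero similarity to documents 4 and 5, while documents 6 and 7 overlap with both groups.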



                                    Clustering
       Word clustering according to document-similarity:

          0.5     0.5   0.5    0      0       0
          0.4     0.6   0.5    0      0       0
          0.5     0.4   0.6    0      0       0
       A= 0        0     0    0.5    0.5     0.5
           0       0     0    0.5    0.5     0.5
          0.4     0.4    0    0.4    0.4     0.4
          0.4     0.4   0.4    0     0.4     0.4



                              Co-clustering
       Purpose: simultaneously clustering words and documents,
       preserving information found in both clusterings.
            0.5  0.5  0.5  0    0    0
            0.4  0.6  0.5  0    0    0
            0.5  0.4  0.6  0    0    0
       A =  0    0    0    0.5  0.5  0.5
            0    0    0    0.5  0.5  0.5
            0.4  0.4  0    0.4  0.4  0.4
            0.4  0.4  0.4  0    0.4  0.4



                               Co-clustering
       Purpose: simultaneously clustering words and documents,
       preserving information found in both clusterings.
            0.5  0.5  0.5  0    0    0
            0.4  0.6  0.5  0    0    0
            0.5  0.4  0.6  0    0    0          0.5  0
       A =  0    0    0    0.5  0.5  0.5        0    0.5
            0    0    0    0.5  0.5  0.5        0.4  0.4
            0.4  0.4  0    0.4  0.4  0.4
            0.4  0.4  0.4  0    0.4  0.4
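
A small sketch of how the 3 × 2 summary matrix could be derived from A once documents and words have been grouped. Both the grouping (documents 1-3, 4-5, 6-7; words 1-3, 4-6) and the use of the block median are assumptions read off the example. The slide does not state how the small matrix was computed, but the block median does reproduce its values.

```python
from statistics import median

A = [
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
]

# Assumed co-clusters (0-based indices), read off the block structure of A.
doc_clusters = [[0, 1, 2], [3, 4], [5, 6]]
word_clusters = [[0, 1, 2], [3, 4, 5]]

def summarise(A, row_groups, col_groups):
    """One value per (document cluster, word cluster) block: the block median."""
    return [
        [median(A[i][j] for i in rows for j in cols) for cols in col_groups]
        for rows in row_groups
    ]

B = summarise(A, doc_clusters, word_clusters)
for row in B:
    print(row)
```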



                         Hierarchical Co-clustering




       Hierarchical co-clustering:
          1. Co-cluster documents and words.
          2. For each cluster: if it contains too many documents,
             extract the corresponding sub-matrix.
          3. Repeat from step 1 on the sub-matrix.
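
The recursion can be sketched as follows. The co-clustering step is passed in as a function; `halves_cocluster` is a deliberately crude, hypothetical stand-in (not the method of the talk) so that the sketch runs on its own. The names `max_docs` and the dictionary layout of the tree are likewise assumptions.

```python
def hierarchical_cocluster(A, docs, words, cocluster, max_docs):
    """Return a tree of co-clusters; docs/words are index lists into A."""
    node = {"docs": docs, "words": words, "children": []}
    if len(docs) <= max_docs:
        return node                      # small enough: stop splitting
    sub = [[A[i][j] for j in words] for i in docs]   # step 2: sub-matrix
    parts = cocluster(sub)                           # step 1 on the sub-matrix
    if any(len(d) == len(docs) for d, _ in parts):
        return node                      # no real split: avoid endless recursion
    for d_idx, w_idx in parts:
        node["children"].append(hierarchical_cocluster(
            A, [docs[i] for i in d_idx], [words[j] for j in w_idx],
            cocluster, max_docs))
    return node

def halves_cocluster(sub):
    """Hypothetical stand-in: split the columns in two halves and give each
    row to the half that carries more of its weight."""
    n_cols = len(sub[0]) if sub else 0
    left = list(range(n_cols // 2))
    right = list(range(n_cols // 2, n_cols))
    rows_left = [i for i, r in enumerate(sub)
                 if sum(r[j] for j in left) >= sum(r[j] for j in right)]
    taken = set(rows_left)
    rows_right = [i for i in range(len(sub)) if i not in taken]
    return [(rows_left, left), (rows_right, right)]

A = [
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
]
tree = hierarchical_cocluster(A, list(range(7)), list(range(6)),
                              halves_cocluster, max_docs=3)
print([child["docs"] for child in tree["children"]])
```

Any real co-clustering routine with the same return shape, a list of (document indices, word indices) pairs relative to the sub-matrix, could be plugged in instead of the stand-in.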



          Bipartite Spectral Graph Partitioning: motivation

       View the document-by-word matrix as a bipartite graph:

                        word1   word2   word3
            document1   a1,1    0       0
       A =  document2   0       a2,2    a2,3
            document3   a3,1    a3,2    0
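
In graph terms, documents and words are the two vertex sets and every non-zero entry of A is a weighted edge between a document and a word. A minimal sketch of the corresponding adjacency matrix, which has the block form [[0, A], [Aᵀ, 0]], is given below. The numeric weights are hypothetical placeholders for the symbolic entries; only the zero pattern is taken from the slide.

```python
def bipartite_adjacency(A):
    """Adjacency matrix of the bipartite graph: documents first, then words."""
    m, n = len(A), len(A[0])
    size = m + n
    M = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            M[i][m + j] = A[i][j]   # document i -> word j
            M[m + j][i] = A[i][j]   # word j -> document i (undirected edge)
    return M

# Hypothetical weights with the same zero pattern as the 3 x 3 example.
A = [
    [0.7, 0.0, 0.0],
    [0.0, 0.2, 0.5],
    [0.3, 0.6, 0.0],
]
M = bipartite_adjacency(A)
for row in M:
    print(row)
```

There are no document-document or word-word edges, so the two diagonal blocks stay zero.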



          Bipartite Spectral Graph Partitioning: motivation
       How can the graph be divided into document clusters D_m and
       associated word clusters W_m?



          Bipartite Spectral Graph Partitioning: motivation

               \[
                 W_m = \Bigl\{\, w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij},\ \forall\, l = 1, \dots, k \,\Bigr\}
               \]



          Bipartite Spectral Graph Partitioning: motivation

               \[
                 D_m = \Bigl\{\, d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij},\ \forall\, l = 1, \dots, k \,\Bigr\}
               \]
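
The two definitions are mutually dependent: word clusters follow from document clusters and vice versa. A minimal sketch of both assignment rules is shown below, using the 7 × 6 example matrix; the starting document labels and the tie-breaking (lowest cluster index wins) are assumptions.

```python
def assign_words(A, doc_labels, k):
    """Word j goes to the cluster m that maximises the sum of A[i][j] over i in D_m."""
    labels = []
    for j in range(len(A[0])):
        weight = [0.0] * k
        for i, row in enumerate(A):
            weight[doc_labels[i]] += row[j]
        labels.append(max(range(k), key=weight.__getitem__))
    return labels

def assign_docs(A, word_labels, k):
    """Document i goes to the cluster m that maximises the sum of A[i][j] over j in W_m."""
    labels = []
    for row in A:
        weight = [0.0] * k
        for j, a in enumerate(row):
            weight[word_labels[j]] += a
        labels.append(max(range(k), key=weight.__getitem__))
    return labels

A = [
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
]
doc_labels = [0, 0, 0, 1, 1, 1, 0]            # assumed starting document clusters
word_labels = assign_words(A, doc_labels, k=2)
print(word_labels)
print(assign_docs(A, word_labels, k=2))
```

In this example the two rules agree with each other: the word clusters derived from the document clusters lead back to the same document clusters.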



           Bipartite Spectral Graph Partitioning: algorithm

          1. Given the m × n document-by-word matrix A, calculate the
             diagonal helper matrices $D_1$ and $D_2$ such that:

             \[ \forall\, 1 \le i \le m : D_1(i,i) = \sum_j A_{ij} \]

             \[ \forall\, 1 \le j \le n : D_2(j,j) = \sum_i A_{ij} \]

          2. Compute $A_n = D_1^{-1/2} A D_2^{-1/2}$
          3. Take the SVD of $A_n$: $A_n = U \Lambda V^*$
          4. Determine k, the number of clusters, from the eigengap:
             \[ k = \arg\max_{m \ge i > 1} \frac{\lambda_{i-1} - \lambda_i}{\lambda_{i-1}} \]
             where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ are the singular values of $A_n$
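
Steps 1 to 4 can be sketched with NumPy as below. The function names are hypothetical, and the tolerance that discards numerically zero singular values before the relative gap is computed is an added assumption (without it the division can blow up on rank-deficient matrices). `eigengap_k` follows the slide's indexing literally: it returns the 1-based index i that maximises the relative gap between the (i-1)-th and i-th singular value.

```python
import numpy as np

def normalised_svd(A):
    """Steps 1-3: scale A by its row and column sums, then take the SVD."""
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)                       # diagonal of D1 (row sums)
    d2 = A.sum(axis=0)                       # diagonal of D2 (column sums)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    return d1, d2, U, s, Vt

def eigengap_k(s, tol=1e-10):
    """Step 4: k = argmax over i > 1 of (s[i-1] - s[i]) / s[i-1], 1-based."""
    s = np.asarray(s, dtype=float)
    s = s[s > tol]                           # assumption: ignore ~zero values
    if len(s) < 2:
        return 1
    rel_gap = (s[:-1] - s[1:]) / s[:-1]      # rel_gap[0] belongs to i = 2
    return int(np.argmax(rel_gap)) + 2

A = [
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
]
d1, d2, U, s, Vt = normalised_svd(A)
print(np.round(s, 3))
print(eigengap_k([1.0, 0.9, 0.2, 0.1]))     # illustrative singular values
```

For a non-negative matrix with positive row and column sums, the largest singular value of the normalised matrix is 1, which is why the later steps skip the first singular vectors.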



   Bipartite Spectral Graph Partitioning: algorithm (cont.)


          5. From U and V, form $U_{[2,\dots,l+1]}$ and $V_{[2,\dots,l+1]}$
             respectively, by taking columns 2 to l + 1,
             where $l = \lceil \log_2 k \rceil$
          6. Compute
             \[ Z = \begin{bmatrix} D_1^{-1/2}\, U_{[2,\dots,l+1]} \\ D_2^{-1/2}\, V_{[2,\dots,l+1]} \end{bmatrix} \]
             and normalize the rows of Z
          7. Apply k-means to cluster the rows of Z into k clusters
          8. For each cluster, check the number of documents. If it
             exceeds a given threshold, construct a new
             document-by-word matrix from the documents and
             words in the cluster, and return to step 1
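
A compact NumPy sketch of steps 5 to 7 for a fixed k follows (step 8, the recursion, is left out). The function name, the plain Lloyd-style k-means with random initialisation, the `seed` and `n_iter` parameters, and the rounding up of log2 k are assumptions for illustration, not necessarily what was used for the reported results.

```python
import math
import numpy as np

def spectral_cocluster(A, k, n_iter=50, seed=0):
    """Return (document labels, word labels) for k co-clusters."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    d1, d2 = A.sum(axis=1), A.sum(axis=0)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)

    # Step 5: columns 2..l+1 (1-based), i.e. indices 1..l.
    l = max(1, math.ceil(math.log2(k)))
    Ul, Vl = U[:, 1:l + 1], Vt.T[:, 1:l + 1]

    # Step 6: stack the scaled singular vectors and normalise each row.
    Z = np.vstack([Ul / np.sqrt(d1)[:, None], Vl / np.sqrt(d2)[:, None]])
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Z = Z / np.where(norms > 0, norms, 1.0)

    # Step 7: simple k-means on the rows of Z (assumed variant).
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels[:m], labels[m:]

A = [
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
]
doc_labels, word_labels = spectral_cocluster(A, k=2)
print(doc_labels, word_labels)
```

Because documents and words are embedded in the same space Z, a single k-means run labels both at once, which is what yields the co-clusters.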



                   Uses of a hierarchical co-clustering




           • Documents are clustered according to topic hierarchy
           • Words associated with cluster describe topic
           • Words can be used for offline clustering



                  Entries of document-by-word matrix




          1. TF-IDF
          2. WP 2’s Salience
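
For the first option, a minimal TF-IDF sketch is given below. The variant used here (raw term count times log(N / document frequency)), the whitespace tokenisation, and the toy corpus are assumptions; the slides do not say which TF-IDF formula was used. The Salience measure is not sketched because it is not described here.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with TF-IDF weights as matrix entries."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenised for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenised for w in set(toks))
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        rows.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, rows

# Hypothetical toy corpus.
docs = ["storm hits coast", "storm floods coast", "election results announced"]
vocab, W = tfidf_matrix(docs)
for row in W:
    print(["%.2f" % x for x in row])
```

Words that occur in many documents receive a low weight, so the entries emphasise words that distinguish one story from another.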



                                   Results

       Precision of clustering 367 news stories from ABC and CNN.
       k = determined by the eigengap
       Salience: 3743 words / TF-IDF: 7242 words

       Co-clustering
        Test set   Precision   Recall   F1
        Salience   74.6 %      41.0 %   52.9 %
        TF-IDF     50.4 %      40.7 %   45.1 %

       k-means
        Test set   Precision   Recall   F1
        Salience   69.5 %      37.1 %   48.4 %
        TF-IDF     38.3 %      41.8 %   40.0 %
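
The F1 column is the harmonic mean of precision and recall. A short sketch of the arithmetic, applied to the co-clustering/Salience row; since the table values are already rounded, recomputing other rows this way may differ in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    return 2 * precision * recall / (precision + recall)

score = f1(74.6, 41.0)
print(round(score, 1))  # 52.9
```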



                                   Results


       Precision of clustering 367 news stories from ABC and CNN.
       k = determined by the eigengap

       Co-clustering
        Test set   Precision   Recall   F1
        Salience   64.3 %      48.3 %   55.2 %

       k-means
        Test set   Precision   Recall   F1
        Salience   58.3 %      41.7 %   48.8 %



                                      Goals
          1. Find aligning segments in
               1.1 text-text pairs
               1.2 text-video pairs
          2. Expand to multiple documents (text and video)



                                      Goals




       Using aligned segments:
           • Create elaborated story from several sources
           • Create links between video and text
           • Summarize video and text
           • Select the appropriate medium for the information



                                  Segments


       Segments can be defined at different resolutions
           • in text:
                • word
                • sentence
                • paragraph
           • in video:
                • image
                • shot
           • Expand to multiple documents (text and video)



                                   Problems




           • Degrees of comparability:
               • Parallel pairs
               • Near-parallel pairs
               • Comparable pairs
           • Representation of segments in different media: how can
               they be compared?



                                         Techniques



           • Micro-macro alignment
               • Top-down
               • Bottom-up
           • Make use of several assumptions:
               • Linearity
               • Low variance of slope
               • Injectivity
           • Annealing and context



                            Multiple documents




       Two possible directions
          1. Dimension reduction
          2. Expand dimensions of search algorithms

More Related Content

Recently uploaded

Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 

Recently uploaded (20)

Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 

Featured

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
Pixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
marketingartwork
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
Skeleton Technologies
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
GetSmarter
 

Featured (20)

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 

Presentatie

  • 1. Introduction Clustering Alignment Doctoral Seminar: Multi-document clustering and alignment Wim De Smet March 23, 2007
  • 2. Introduction Clustering Alignment Current goals CLASS, WP7 1. Cluster documents according to topics. 2. Align text and video
  • 3. Introduction Clustering Alignment Goal Given news stories about different events, from several sources, cluster same stories.
  • 4. Introduction Clustering Alignment Clustering Typical clustering algorithms: bag of words approach. Document-by-words matrix: 0.5 0.5 0.5 0 0 0 0.4 0.6 0.5 0 0 0 0.5 0.4 0.6 0 0 0 A= 0 0 0 0.5 0.5 0.5 0 0 0 0.5 0.5 0.5 0.4 0.4 0 0.4 0.4 0.4 0.4 0.4 0.4 0 0.4 0.4
  • 5. Introduction Clustering Alignment Clustering Document clustering according to word-similarity: 0.5 0.5 0.5 0 0 0 0.4 0.6 0.5 0 0 0 0.5 0.4 0.6 0 0 0 A= 0 0 0 0.5 0.5 0.5 0 0 0 0.5 0.5 0.5 0.4 0.4 0 0.4 0.4 0.4 0.4 0.4 0.4 0 0.4 0.4
  • 6. Introduction Clustering Alignment Clustering Word clustering according to document-similarity: 0.5 0.5 0.5 0 0 0 0.4 0.6 0.5 0 0 0 0.5 0.4 0.6 0 0 0 A= 0 0 0 0.5 0.5 0.5 0 0 0 0.5 0.5 0.5 0.4 0.4 0 0.4 0.4 0.4 0.4 0.4 0.4 0 0.4 0.4
  • 7. Introduction Clustering Alignment Co-clustering Purpose: simultaneously clustering words and documents, preserving information found in both clusterings. 0.5 0.5 0.5 0 0 0 0.4 0.6 0.5 0 0 0 0.5 0.4 0.6 0 0 0 A= 0 0 0 0.5 0.5 0.5 0 0 0 0.5 0.5 0.5 0.4 0.4 0 0.4 0.4 0.4 0.4 0.4 0.4 0 0.4 0.4
  • 8. Introduction Clustering Alignment Co-clustering Purpose: simultaneously clustering words and documents, preserving information found in both clusterings. 0.5 0.5 0.5 0 0 0 0.4 0.6 0.5 0 0 0 0.5 0.4 0.6 0 0 0 A= 0 0 0 0.5 0.5 0.5 0 0 0 0.5 0.5 0.5 0.4 0.4 0 0.4 0.4 0.4 0.4 0.4 0.4 0 0.4 0.4
  • 9. Co-clustering. Purpose: simultaneously clustering words and documents, preserving the information found in both clusterings. The matrix A of slide 4 is shown next to a 3 × 2 matrix:
             0.5   0
             0     0.5
             0.4   0.4
  • 10. Hierarchical co-clustering: 1. Co-cluster documents and words. 2. For each cluster: if it contains too many documents, calculate its sub-matrix. 3. Repeat step 1 on the sub-matrix.
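The recursion on slide 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `cocluster` is a hypothetical callable standing in for any flat co-clustering step, the sub-matrix is represented only by its lists of document and word indices, and `max_docs` is the (unspecified) "too many documents" threshold.

```python
def hierarchical_cocluster(doc_ids, word_ids, cocluster, max_docs):
    """Recursively split a co-cluster until no cluster holds more than max_docs documents.

    cocluster(doc_ids, word_ids) is a hypothetical callable returning a list of
    (sub_doc_ids, sub_word_ids) pairs. The index lists identify the sub-matrix.
    """
    node = {"docs": list(doc_ids), "words": list(word_ids), "children": []}
    if len(doc_ids) <= max_docs:
        return node  # small enough: this cluster is a leaf

    parts = cocluster(doc_ids, word_ids)
    if any(len(sub_docs) >= len(doc_ids) for sub_docs, _ in parts):
        return node  # the step did not split the cluster; stop to avoid endless recursion

    for sub_docs, sub_words in parts:
        if sub_docs:  # skip empty clusters
            node["children"].append(
                hierarchical_cocluster(sub_docs, sub_words, cocluster, max_docs)
            )
    return node


def leaves(node):
    """Collect the document lists of all leaf clusters, left to right."""
    if not node["children"]:
        return [node["docs"]]
    return [docs for child in node["children"] for docs in leaves(child)]


def split_in_half(doc_ids, word_ids):
    """Toy stand-in for a real co-clustering step: cut both index lists in the middle."""
    d, w = len(doc_ids) // 2, len(word_ids) // 2
    return [(doc_ids[:d], word_ids[:w]), (doc_ids[d:], word_ids[w:])]


tree = hierarchical_cocluster(list(range(4)), list(range(6)), split_in_half, max_docs=2)
```

The guard against a non-splitting step is a design choice of this sketch; the slides do not say what happens when a cluster cannot be divided further.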
  • 11. Bipartite Spectral Graph Partitioning: motivation. View the document-by-word matrix as a bipartite graph:
                         word1    word2    word3
             document1   a1,1     0        0
        A =  document2   0        a2,2     a2,3
             document3   a3,2     a3,3     0
  • 12. Bipartite Spectral Graph Partitioning: motivation. How to divide the graph into document clusters $D_m$ and associated word clusters $W_m$?
  • 13. Bipartite Spectral Graph Partitioning: motivation. $W_m = \{\, w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij},\ \forall l = 1, \ldots, k \,\}$
  • 14. Bipartite Spectral Graph Partitioning: motivation. $D_m = \{\, d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij},\ \forall l = 1, \ldots, k \,\}$
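The two definitions on slides 13 and 14 are symmetric: a word goes to the document cluster in which it carries the most weight, and a document goes to the word cluster in which it carries the most weight. The sketch below applies them once in each direction on the toy matrix A. The function names, the starting document clusters, and the tie-breaking rule (lowest cluster index wins) are assumptions made for illustration.

```python
def assign_by_mass(A, row_clusters):
    """For every column j, pick the row cluster m with the largest sum of A[i][j], i in m.

    Ties go to the lowest cluster index (an assumption; the ">=" in the
    definition leaves ties open). Returns one list of column indices per row cluster.
    """
    col_clusters = [[] for _ in row_clusters]
    for j in range(len(A[0])):
        mass = [sum(A[i][j] for i in rows) for rows in row_clusters]
        col_clusters[mass.index(max(mass))].append(j)
    return col_clusters


def transpose(A):
    """Swap rows and columns so the same function can assign documents to word clusters."""
    return [list(col) for col in zip(*A)]


# Toy document-by-word matrix A from the slides.
A = [
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
]

# Hypothetical starting point: three document clusters (0-based row indices).
doc_clusters = [[0, 1, 2], [3, 4], [5, 6]]

# W_m: words assigned to the document cluster where they have the most weight.
word_clusters = assign_by_mass(A, doc_clusters)

# D_m: documents re-assigned to the word cluster where they have the most weight.
new_doc_clusters = assign_by_mass(transpose(A), word_clusters)
```

On this toy data the third document cluster attracts no words, and its two mixed documents are then absorbed by the two remaining clusters.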
  • 15. Bipartite Spectral Graph Partitioning: algorithm
        1. Given the $m \times n$ document-by-word matrix $A$, calculate the diagonal helper matrices $D_1$ and $D_2$ such that $\forall\, 1 \le i \le m : D_1(i,i) = \sum_j A_{i,j}$ and $\forall\, 1 \le j \le n : D_2(j,j) = \sum_i A_{i,j}$.
        2. Compute $A_n = D_1^{-1/2} A D_2^{-1/2}$.
        3. Take the SVD of $A_n$: $A_n = U \Lambda V^*$.
        4. Determine $k$, the number of clusters, by the eigengap: $k = \arg\max_{m \ge i > 1} (\lambda_{i-1} - \lambda_i)/\lambda_{i-1}$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$ are the singular values of $A_n$.
  • 16. Bipartite Spectral Graph Partitioning: algorithm (cont.)
        5. From $U$ and $V$, form $U_{[2,\ldots,l+1]}$ and $V_{[2,\ldots,l+1]}$ respectively, by taking columns 2 to $l+1$, where $l = \lceil \log_2 k \rceil$.
        6. Compute $Z = \begin{bmatrix} D_1^{-1/2} U_{[2,\ldots,l+1]} \\ D_2^{-1/2} V_{[2,\ldots,l+1]} \end{bmatrix}$ and normalize the rows of $Z$.
        7. Apply k-means to cluster the rows of $Z$ into $k$ clusters.
        8. Check the number of documents in each cluster. If it is higher than a given threshold, construct a new document-by-word matrix from the documents and words in that cluster, and proceed to step 1.
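The arithmetic in steps 1, 2, 4 and 5 above can be sketched with the standard library alone. This is a partial illustration under assumptions: the SVD (step 3) and k-means (step 7) are left to a linear-algebra or clustering library of your choice, so `eigengap_k` simply takes a list of singular values as input. The function names are hypothetical, and the matrix is assumed to have non-negative entries.

```python
import math


def normalize(A):
    """Steps 1-2: A_n = D1^(-1/2) * A * D2^(-1/2).

    D1 holds the row sums of A and D2 the column sums, so entry (i, j)
    becomes A[i][j] / sqrt(row_sum[i] * col_sum[j]). Assumes A >= 0.
    """
    row_sums = [sum(row) for row in A]
    col_sums = [sum(col) for col in zip(*A)]
    return [
        [a / math.sqrt(row_sums[i] * col_sums[j]) if a else 0.0
         for j, a in enumerate(row)]
        for i, row in enumerate(A)
    ]


def eigengap_k(singular_values):
    """Step 4, as written on the slide: the 1-based index i > 1 that maximizes
    (lambda_{i-1} - lambda_i) / lambda_{i-1}. Returns None if fewer than two values."""
    s = sorted(singular_values, reverse=True)
    best_i, best_gap = None, -1.0
    for i in range(2, len(s) + 1):        # i is 1-based, as on the slide
        prev, cur = s[i - 2], s[i - 1]
        if prev == 0:
            break                          # remaining values are all zero
        gap = (prev - cur) / prev
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def num_singular_vectors(k):
    """Step 5: l = ceil(log2(k)) singular vectors are kept (columns 2 to l + 1)."""
    return math.ceil(math.log2(k))


A_n = normalize([[1.0, 1.0, 0.0],
                 [0.0, 0.0, 2.0]])
k = eigengap_k([1.0, 0.9, 0.2, 0.1])   # made-up singular values, for illustration only
l = num_singular_vectors(k)
```

The singular values passed to `eigengap_k` here are invented for the example and are not taken from the experiments in the slides.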
  • 17. Uses of a hierarchical co-clustering
        • Documents are clustered according to a topic hierarchy
        • Words associated with a cluster describe its topic
        • Words can be used for offline clustering
  • 18. Entries of the document-by-word matrix: 1. TF-IDF 2. WP 2’s Salience
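As an illustration of the first option on slide 18, the sketch below fills a document-by-word matrix with TF-IDF weights. The slides do not say which TF-IDF variant was used, so the formula here (term frequency relative to document length, times log(N / document frequency)) is an assumption, as are the function name and the example documents. WP 2's Salience measure is not described in the slides and is not sketched.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a document-by-word matrix of TF-IDF weights.

    docs is a list of token lists. Assumed variant:
    tf = count / document length, idf = log(N / document frequency).
    Returns (vocabulary, rows), with one row per document.
    """
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))  # documents containing each word
    n = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        if not doc:
            rows.append([0.0] * len(vocab))  # empty document: all-zero row
            continue
        rows.append([(counts[w] / len(doc)) * math.log(n / df[w]) for w in vocab])
    return vocab, rows


# Made-up example documents, already tokenized.
docs = [
    ["storm", "hits", "coast"],
    ["storm", "warning"],
    ["election", "results"],
]
vocab, rows = tfidf_matrix(docs)
```

A word that occurs in every document gets weight 0 under this variant, which is why frequent but uninformative words drop out of the matrix.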
  • 19. Results. Precision of clustering 367 news stories from ABC and CNN. k = defined by eigengap. Salience: 3743 words / TF-IDF: 7242 words
        Co-clustering
          Test set    Precision   Recall    F1
          Salience    74.6 %      41 %      52.9 %
          TF-IDF      50.4 %      40.7 %    45.1 %
        k-means
          Test set    Precision   Recall    F1
          Salience    69.5 %      37.1 %    48.4 %
          TF-IDF      38.3 %      41.8 %    40 %
  • 20. Results. Precision of clustering 367 news stories from ABC and CNN. k = defined by eigengap
        Co-clustering
          Test set    Precision   Recall    F1
          Salience    64.3 %      48.3 %    55.2 %
        k-means
          Test set    Precision   Recall    F1
          Salience    58.3 %      41.7 %    48.8 %
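The F1 columns in the two result slides can be checked from the precision and recall columns. The slides do not state how F1 was computed; assuming the usual harmonic mean of precision and recall, the rounded values in the tables reproduce the reported F1 figures to within about 0.2 percentage points. The function name `f1` is chosen here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision / recall pairs as reported on slides 19 and 20 (percent).
coclustering_salience = f1(74.6, 41.0)
kmeans_salience = f1(69.5, 37.1)
coclustering_salience_slide20 = f1(64.3, 48.3)
```

Small differences on the other rows are consistent with the published precision and recall having been rounded before F1 is recomputed here.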
  • 21. Goals
        1. Find aligning segments in
           1.1 text-text pairs
           1.2 text-video pairs
        2. Expand to multiple documents (text and video)
  • 22. Goals. Using aligned segments:
        • Create an elaborated story from several sources
        • Create links between video and text
        • Summarize video and text
        • Select the appropriate media form for the information
  • 23. Segments. Segments can be defined at different resolutions
        • in text:
           • word
           • sentence
           • paragraph
        • in video:
           • image
           • shot
        • Expand to multiple documents (text and video)
  • 24. Problems
        • Degrees of comparability:
           • Parallel pairs
           • Near-parallel pairs
           • Comparable pairs
        • Representation of segments in different media: how to compare them
  • 25. Techniques
        • Micro-macro alignment
           • Top-down
           • Bottom-up
        • Make use of several assumptions:
           • Linearity
           • Low variance of slope
           • Injectivity
        • Annealing and Context
  • 26. Multiple documents. Two possible directions: 1. Dimension reduction 2. Expand dimensions of search algorithms