1.
Computer-Assisted Clustering and Conceptualization
Gary King
Institute for Quantitative Social Science, Harvard University
Parthemos Lecture at University of Georgia, 3/4/2011
Based on joint work with Justin Grimmer (Harvard Stanford)
Gary King (Harvard IQSS), Quantitative Discovery
5.
A Method for Computer Assisted Conceptualization
Conceptualization through Classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or, for that matter, social science research.” (Bailey, 1994).
Cluster Analysis: simultaneously (1) invents categories and (2) assigns documents to categories
We focus on unstructured text; methods apply more broadly.
Main goal: Switch from Fully Automated to Computer Assisted
16.
What’s Hard about Clustering? (aka Why Johnny Can’t Classify)
Clustering seems easy; it’s not!
Bell(n) = number of ways of partitioning n objects
  Bell(2) = 2 (AB, A B)
  Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
  Bell(5) = 52
  Bell(100) ≈ 10^28 × Number of elementary particles in the universe
Now imagine choosing the optimal classification scheme by hand!
Fully automated algorithms can help, but which ones?
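The Bell numbers quoted on this slide can be checked with a few lines of Python. This is a minimal sketch using the Bell triangle recurrence; the function name `bell_numbers` is illustrative and not from the talk.

```python
def bell_numbers(n_max):
    """Return [Bell(0), ..., Bell(n_max)] computed with the Bell triangle."""
    bells = [1]   # Bell(0) = 1: the empty set has exactly one partition
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every later entry adds the entry to its left and the entry above-left.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


bells = bell_numbers(100)
print(bells[2], bells[3], bells[5])   # the small cases listed on the slide
print(len(str(bells[100])))           # number of decimal digits in Bell(100)
```

Python integers have arbitrary precision, so Bell(100) is computed exactly rather than approximated.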
27.
The Problem with Fully Automated Clustering
The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis
No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications
Existing methods:
  Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .
  Well-defined statistical, data analytic, or machine learning foundations
  How to add substantive knowledge: With few exceptions, unclear
  The literature: little guidance on when methods apply
  Deriving such guidance: difficult or impossible
Deep problem: full automation requires more information
No surprise: everyone’s tried cluster analysis; very few are satisfied
36.
Switch from Fully Automated to Computer Assisted
Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies
An alternative: Computer-Assisted Clustering
  Easy in theory: list all clusterings; choose the best
  Impossible in practice: Too hard for us mere humans!
An organized list will make the search possible
Insight: Many clusterings are perceptually identical
  E.g.: consider two clusterings that differ only because one document (of 10,000) moves from category 5 to 6
Question: How to organize clusterings so humans can understand?
40.
Our Idea: Meaning Through Geography
Set of clusterings ≈ A list of unconnected addresses
We develop a (conceptual) geography of clusterings
48.
A New Strategy
Make it easy to choose best clustering from millions of choices
  1. Code text as numbers (in one or more of several ways)
  2. Apply all clustering methods we can find to the data, each representing different (unstated) substantive assumptions (<15 mins)
  3. (Too much for a person to understand, but organization will help)
  4. Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection
  5. “Local cluster ensemble” creates a new clustering at any point, based on weighted average of nearby clusterings
  6. A new animated visualization to explore the space of clusterings (smoothly morphing from one into others)
  7. Millions of clusterings, easily comprehended
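To make the idea of a distance between clusterings concrete, here is a minimal sketch. The slide does not name the metric, so this uses variation of information, one well-known application-independent metric on partitions, purely as an assumption for illustration. The function name and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Distance between two clusterings of the same documents.

    VI = H(A|B) + H(B|A); it is 0 when the two partitions are identical
    up to a relabeling of the clusters. (Assumed metric, not from the slide.)
    """
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi


# Toy data: 10,000 documents in 10 equal categories.
base = [i % 10 for i in range(10_000)]
moved = list(base)
moved[5] = 6                                # one document moves from category 5 to 6
blocks = [i // 1000 for i in range(10_000)]  # an unrelated 10-category clustering

print(variation_of_information(base, moved))   # tiny: nearly the same clustering
print(variation_of_information(base, blocks))  # large: very different clusterings
```

Under this metric the one-document move sits almost on top of the original clustering, which is the sense in which many clusterings are perceptually identical.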
49.
Many Thousands of Clusterings, Sorted & Organized
You choose one (or more), based on insight, discovery, useful information, . . .
[Figure: “Space of Clusterings”, a 2-D map in which each point is labeled with the clustering method and distance measure that produced it (hclust, kmeans, kmedoids, affprop, divisive, spec_*, mspec_*, som, rock, mixvmf, mult_dirproc, . . . ), alongside two example clusterings, “Clustering 1” and “Clustering 2”, that group U.S. presidents from Roosevelt to Obama.]
50.
Software Screenshot
59.
Evaluating Performance
Goals:
  Validate Claim: computer-assisted conceptualization outperforms human conceptualization
  Demonstrate: new experimental designs for cluster evaluation
  Inject human judgement: relying on insights from survey research
We now present three evaluations
  Cluster Quality ⇒ RA coders
  Informative discoveries ⇒ Experienced scholars analyzing texts
  Discovery ⇒ You’re the judge
71.
Evaluation 1: Cluster Quality
What Are Humans Good For?
  They can’t: keep many documents & clusters in their head
  They can: compare two documents at a time
  ⇒ Cluster quality evaluation: human judgement of document pairs
Experimental Design to Assess Cluster Quality
  automated visualization to choose one clustering
  many pairs of documents
  for coders: (1) unrelated, (2) loosely related, (3) closely related
  Quality = mean(within cluster) - mean(between clusters)
  Bias results against ourselves by not letting evaluators choose clustering
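The quality score above can be sketched in a few lines. The data layout (a dict of pair ratings and a dict of cluster assignments), the function name, and the toy ratings are assumptions made for illustration; only the formula Quality = mean(within cluster) - mean(between clusters) and the 1 to 3 rating scale come from the slide.

```python
from statistics import mean


def cluster_quality(pair_ratings, assignment):
    """Mean coder rating of same-cluster pairs minus that of cross-cluster pairs.

    pair_ratings: {(doc_i, doc_j): rating}, with rating 1 = unrelated,
                  2 = loosely related, 3 = closely related.
    assignment:   {doc: cluster label} for the clustering being evaluated.
    """
    within, between = [], []
    for (doc_i, doc_j), rating in pair_ratings.items():
        if assignment[doc_i] == assignment[doc_j]:
            within.append(rating)
        else:
            between.append(rating)
    if not within or not between:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return mean(within) - mean(between)


# Hypothetical ratings for four documents split into two clusters.
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
pair_ratings = {
    ("a", "b"): 3, ("c", "d"): 3,            # same-cluster pairs
    ("a", "c"): 1, ("a", "d"): 2,            # cross-cluster pairs
    ("b", "c"): 1, ("b", "d"): 1,
}
quality = cluster_quality(pair_ratings, assignment)
print(quality)
```

A larger score means coders judged same-cluster pairs to be more closely related than cross-cluster pairs.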
75.
Evaluation 1: Cluster Quality
[Figure: dot plot of (Our Method) − (Human Coders), horizontal axis from −0.3 to 0.3, with one point each for Lautenberg Press Releases, Policy Agendas Project, and Reuters Gold Standard.]
Lautenberg: 200 Senate Press Releases (appropriations, economy, education, tax, veterans, . . . )
Policy Agendas: 213 quasi-sentences from Bush’s State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, . . . )
Reuters: financial news (trade, earnings, copper, gold, coffee, . . . ); “gold standard” for supervised learning studies
83.
Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons User chooses ⇒ only care about the one clustering that wins Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
84.
Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons User chooses ⇒ only care about the one clustering that wins Both cases a Condorcet winner: Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
85.
Evaluation 2: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) 6 Asked for 2 =15 pairwise comparisons User chooses ⇒ only care about the one clustering that wins Both cases a Condorcet winner:“Immigration”:Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2 Parthemos Lecture at University of Gary King (Harvard IQSS) Quantitative Discovery / 20
86.
Evaluation 2: More Informative Discoveries
- Found 2 scholars analyzing lots of textual data for their work
- Created 6 clusterings:
  - 2 clusterings selected with our method (biased against us)
  - 2 clusterings from each of 2 other methods (varying tuning parameters)
- Created info packet on each clustering (for each cluster: exemplar document, automated content summary)
- Asked for (6 choose 2) = 15 pairwise comparisons
- User chooses ⇒ only care about the one clustering that wins
- Both cases a Condorcet winner:
  - "Immigration": Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
  - "Genetic testing": Our Method 1 → {Our Method 2, K-Means 1, K-Means 2} → Dir Proc. 1 → Dir Proc. 2
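A Condorcet winner is the option that beats every other option head to head. The sketch below checks for one given a full set of pairwise judgements. It is an illustration under stated assumptions: the function `condorcet_winner` is hypothetical, the arrows on the slide are read as "preferred to", and the 15 individual judgements are reconstructed from the strict "Immigration" ordering rather than taken from the evaluators' actual responses, which the slides do not report.

```python
from itertools import combinations


def condorcet_winner(candidates, prefers):
    """Return the candidate that wins every pairwise comparison, or None.

    prefers: dict mapping frozenset({a, b}) -> whichever of a, b was preferred
             (any other value, such as None, is treated as a tie).
    """
    for c in candidates:
        if all(prefers[frozenset((c, other))] == c
               for other in candidates if other != c):
            return c
    return None


# Clusterings in the order given for "Immigration" on the slide.
clusterings = ["Our Method 1", "vMF 1", "vMF 2",
               "Our Method 2", "K-Means 1", "K-Means 2"]

# Assumption: rebuild pairwise judgements consistent with that ordering,
# so the earlier-listed clustering wins each pair.
rank = {name: i for i, name in enumerate(clusterings)}
pairs = list(combinations(clusterings, 2))  # all unordered pairs of 6 items
prefers = {frozenset(p): min(p, key=rank.get) for p in pairs}

winner = condorcet_winner(clusterings, prefers)
print(len(pairs), winner)
```

With tied judgements, as in the braces of the "Genetic testing" ordering, a Condorcet winner can still exist as long as one clustering wins all five of its own comparisons. If preferences cycle, the function returns None.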
93.
Evaluation 3: What Do Members of Congress Do?
- David Mayhew's (1974) famous typology
  - Advertising
  - Credit Claiming
  - Position Taking
- Data: 200 press releases from Frank Lautenberg's office (D-NJ)
- Apply our method
96.
Example Discovery
[Figure: plot of many clusterings, each point labeled by the method and distance metric that produced it (for example kmeans, hclust, affprop, mixvmf, spectral, kmedoids, som, rock)]
- Red point: a clustering by Affinity Propagation-Cosine (Dueck and Frey 2007)
- Close to: Mixture of von Mises-Fisher distributions (Banerjee et al. 2005)