    Discov uga Presentation Transcript

    • Computer-Assisted Clustering and Conceptualization
        Gary King, Institute for Quantitative Social Science, Harvard University
        Parthemos Lecture at University of Georgia, 3/4/2011
        Based on joint work with Justin Grimmer (Harvard Stanford)
        Gary King (Harvard IQSS), Quantitative Discovery
    • A Method for Computer Assisted Conceptualization
        • Conceptualization through Classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or, for that matter, social science research.” (Bailey, 1994).
        • Cluster Analysis: simultaneously (1) invents categories and (2) assigns documents to categories
        • We focus on unstructured text; methods apply more broadly.
        • Main goal: Switch from Fully Automated to Computer Assisted
    • What’s Hard about Clustering? (aka Why Johnny Can’t Classify)
        • Clustering seems easy; it’s not!
        • Bell(n) = number of ways of partitioning n objects
        • Bell(2) = 2 (AB, A B)
        • Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
        • Bell(5) = 52
        • Bell(100) ≈ 10^28 × Number of elementary particles in the universe
        • Now imagine choosing the optimal classification scheme by hand!
        • Fully automated algorithms can help, but which ones?
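To make the growth of Bell(n) concrete, the following is a minimal Python sketch that computes Bell numbers with the Bell triangle recurrence. The function name `bell` is illustrative and is not taken from the lecture.

```python
def bell(n: int) -> int:
    """Number of ways to partition n labelled objects (the n-th Bell number)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    row = [1]  # row 0 of the Bell triangle; its first entry is Bell(0)
    for _ in range(n):
        # A new row starts with the last entry of the previous row; each
        # further entry adds the entry to its left and the one above-left.
        new_row = [row[-1]]
        for above in row:
            new_row.append(new_row[-1] + above)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, bell(n))
    # Bell(100) cannot be enumerated in practice; report only its size.
    print("digits in Bell(100):", len(str(bell(100))))
```

The small cases reproduce the counts on the slide (2, 5, and 52), and the same function shows how quickly exhaustive search over partitions becomes hopeless.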
    • The Problem with Fully Automated Clustering
        • The (Impossible) Goal: optimal, fully automated, application-independent cluster analysis
        • No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications
        • Existing methods:
            • Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .
            • Well-defined statistical, data analytic, or machine learning foundations
            • How to add substantive knowledge: With few exceptions, unclear
            • The literature: little guidance on when methods apply
            • Deriving such guidance: difficult or impossible
        • Deep problem: full automation requires more information
        • No surprise: everyone’s tried cluster analysis; very few are satisfied
    • Switch from Fully Automated to Computer Assisted
        • Fully Automated Clustering may succeed sometimes, but fails in general: too hard to understand when each model applies
        • An alternative: Computer-Assisted Clustering
        • Easy in theory: list all clusterings; choose the best
        • Impossible in practice: Too hard for us mere humans!
        • An organized list will make the search possible
        • Insight: Many clusterings are perceptually identical
        • E.g., consider two clusterings that differ only because one document (of 10,000) moves from category 5 to 6
        • Question: How to organize clusterings so humans can understand?
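The "perceptually identical" example can be checked numerically. The sketch below is an illustration only: it uses the fraction of document pairs that two clusterings treat differently (a Rand-style disagreement), which is an assumption made here and not necessarily the measure used in the lecture. The labels are synthetic and the function name `pair_disagreement` is hypothetical.

```python
from collections import Counter
from math import comb


def pair_disagreement(a, b):
    """Fraction of document pairs that clusterings a and b treat differently.

    a and b are equal-length sequences of cluster labels, one per document.
    A pair disagrees when one clustering puts the two documents together
    and the other separates them.
    """
    if len(a) != len(b):
        raise ValueError("clusterings must label the same documents")
    n = len(a)
    if n < 2:
        return 0.0
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())
    together_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    disagreeing = together_a + together_b - 2 * together_both
    return disagreeing / comb(n, 2)


# Synthetic data: 10,000 documents spread evenly over 10 categories.
base = [i % 10 for i in range(10_000)]
moved = list(base)
moved[5] = 6  # one document moves from category 5 to category 6

print(pair_disagreement(base, base))
print(pair_disagreement(base, moved))
```

Under this measure the two clusterings disagree on only a tiny share of all document pairs, which is the sense in which a human would see them as the same clustering.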
    • Our Idea: Meaning Through Geography
        • Set of clusterings ≈ A list of unconnected addresses
        • We develop a (conceptual) geography of clusterings
    • A New Strategy
        Make it easy to choose the best clustering from millions of choices
        1. Code text as numbers (in one or more of several ways)
        2. Apply all clustering methods we can find to the data, each representing different (unstated) substantive assumptions (<15 mins)
        3. (Too much for a person to understand, but organization will help)
        4. Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection
        5. “Local cluster ensemble” creates a new clustering at any point, based on weighted average of nearby clusterings
        6. A new animated visualization to explore the space of clusterings (smoothly morphing from one into others)
        7. Millions of clusterings, easily comprehended
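Steps 4 and 5 above can be sketched in a few lines of Python. The slides do not name the distance metric, so this sketch assumes variation of information, one known metric on partitions; the weighted co-membership average is likewise only one plausible reading of "weighted average of nearby clusterings". All function names, labels, and weights below are hypothetical, and the 2-D projection is left out.

```python
from collections import Counter
from math import log


def variation_of_information(a, b):
    """Variation of information between two clusterings of the same documents.

    a and b are equal-length sequences of cluster labels. The result is 0 when
    the clusterings are identical up to relabelling and grows as they diverge.
    """
    if len(a) != len(b):
        raise ValueError("clusterings must label the same documents")
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    vi = 0.0
    for (label_a, label_b), n_ab in joint.items():
        # VI = H(A|B) + H(B|A), accumulated over the non-empty joint cells.
        vi -= (n_ab / n) * (log(n_ab / count_a[label_a]) + log(n_ab / count_b[label_b]))
    return vi


def distance_matrix(clusterings):
    """Pairwise distances between clusterings (input to any 2-D projection)."""
    return [[variation_of_information(x, y) for y in clusterings] for x in clusterings]


def co_membership_average(clusterings, weights):
    """Weighted share of clusterings that put documents i and j together.

    An assumed stand-in for the local cluster ensemble: the resulting
    similarity matrix could be clustered again to obtain a new clustering.
    """
    total = sum(weights)
    n = len(clusterings[0])
    sim = [[0.0] * n for _ in range(n)]
    for labels, weight in zip(clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += weight / total
    return sim


# Hypothetical toy input: three clusterings of four documents.
c1, c2, c3 = [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 0, 1]
for row in distance_matrix([c1, c2, c3]):
    print([round(d, 3) for d in row])
print(co_membership_average([c1, c2, c3], weights=[2, 1, 1]))
```

In this reading, the weights would come from how close each existing clustering is to the chosen point in the projected space, so that nearby clusterings dominate the average.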
    • Many Thousands of Clusterings, Sorted & Organized
        • You choose one (or more), based on insight, discovery, useful information, . . .
        • [Figure: “Space of Clusterings”, a 2-D map in which each point is labelled with the method and distance that produced it (hclust, kmeans, kmedoids, affprop, divisive, som, rock, biclust_spectral, spec_*, mspec_*, mixvmf, mult_dirproc, and others). Two example clusterings of presidents, “Clustering 1” and “Clustering 2”, are shown alongside, with cluster labels including “Reagan Republicans”, “Roosevelt To Carter”, and “Other Presidents”.]
    • Software Screenshot
    • Evaluating Performance
        • Goals:
            • Validate Claim: computer-assisted conceptualization outperforms human conceptualization
            • Demonstrate: new experimental designs for cluster evaluation
            • Inject human judgement: relying on insights from survey research
        • We now present three evaluations
            • Cluster Quality ⇒ RA coders
            • Informative discoveries ⇒ Experienced scholars analyzing texts
            • Discovery ⇒ You’re the judge
    • Evaluation 1: Cluster Quality
        • What Are Humans Good For?
            • They can’t: keep many documents & clusters in their head
            • They can: compare two documents at a time
            • ⇒ Cluster quality evaluation: human judgement of document pairs
        • Experimental Design to Assess Cluster Quality
            • automated visualization to choose one clustering
            • many pairs of documents
            • for coders: (1) unrelated, (2) loosely related, (3) closely related
            • Quality = mean(within cluster) - mean(between clusters)
            • Bias results against ourselves by not letting evaluators choose clustering
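The quality score on this slide is simple enough to write out. The following is a minimal sketch, assuming coder ratings are stored per document pair on the 1 to 3 scale above; the data structure, the function name `cluster_quality`, and the toy ratings are hypothetical and are not data from the lecture.

```python
from statistics import mean


def cluster_quality(labels, ratings):
    """Quality = mean(within cluster) - mean(between clusters).

    labels:  dict mapping document id -> cluster label
    ratings: dict mapping a pair (doc_i, doc_j) -> coder score, where
             1 = unrelated, 2 = loosely related, 3 = closely related
    """
    within, between = [], []
    for (doc_i, doc_j), score in ratings.items():
        if labels[doc_i] == labels[doc_j]:
            within.append(score)
        else:
            between.append(score)
    if not within or not between:
        raise ValueError("need rated pairs both within and between clusters")
    return mean(within) - mean(between)


# Hypothetical toy data: four documents, two clusters, six rated pairs.
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
ratings = {
    ("d1", "d2"): 3, ("d3", "d4"): 2,  # pairs inside one cluster
    ("d1", "d3"): 1, ("d1", "d4"): 1,  # pairs that span clusters
    ("d2", "d3"): 2, ("d2", "d4"): 1,
}
print(cluster_quality(labels, ratings))
```

A larger value means that coders judged same-cluster pairs to be more closely related than cross-cluster pairs, so two clusterings of the same documents can be compared by the difference in their scores.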
    • Evaluation 1: Cluster Quality
        • [Figure: plot of (Our Method) − (Human Coders), horizontal axis from −0.3 to 0.3, with one point for each of three data sets.]
        • Lautenberg Press Releases: 200 Senate press releases (appropriations, economy, education, tax, veterans, . . . )
        • Policy Agendas Project: 213 quasi-sentences from Bush’s State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, . . . )
        • Reuters Gold Standard: financial news (trade, earnings, copper, gold, coffee, . . . ); “gold standard” for supervised learning studies
    • Evaluation 2: More Informative Discoveries. Found 2 scholars analyzing lots of textual data for their work. Created 6 clusterings: 2 clusterings selected with our method (biased against us); 2 clusterings from each of 2 other methods (varying tuning parameters). Created info packet on each clustering (for each cluster: exemplar document, automated content summary). Asked for $\binom{6}{2} = 15$ pairwise comparisons. User chooses ⇒ only care about the one clustering that wins. Both cases have a Condorcet winner: “Immigration”: Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2. “Genetic testing”: Our Method 1 → {Our Method 2, K-Means 1, K-Means 2} → Dir Proc. 1 → Dir Proc. 2
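To make the pairwise-comparison logic concrete, here is a small sketch of how a Condorcet winner (the option that beats every other option head to head) can be found. The function name `condorcet_winner` and the `prefers` callback are hypothetical. The example feeds in the "Immigration" ordering reported on the slide as if it were a strict ranking, which is a simplification of the scholars' actual 15 judgements.

```python
from itertools import combinations
from math import comb


def condorcet_winner(candidates, prefers):
    """Return the candidate that wins every pairwise comparison, or None.

    prefers(a, b) must return True when a is preferred to b.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[a if prefers(a, b) else b] += 1
    for c, n_wins in wins.items():
        if n_wins == len(candidates) - 1:  # beat all the others
            return c
    return None  # preference cycle: no Condorcet winner


# Ordering taken from the "Immigration" result on the slide.
ranking = ["Our Method 1", "vMF 1", "vMF 2",
           "Our Method 2", "K-Means 1", "K-Means 2"]
position = {name: k for k, name in enumerate(ranking)}

n_pairs = comb(len(ranking), 2)  # number of pairwise comparisons requested
winner = condorcet_winner(ranking, lambda a, b: position[a] < position[b])

# A cyclic preference (invented) has no Condorcet winner.
beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
no_winner = condorcet_winner(["rock", "paper", "scissors"],
                             lambda a, b: (a, b) in beats)
```

With six clusterings, `n_pairs` is 15, matching the slide. A winner is guaranteed only when the pairwise judgements contain no cycle at the top, which is why the slide notes that both cases happened to produce one.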
    • Evaluation 3: What Do Members of Congress Do? - David Mayhew’s (1974) famous typology - Advertising - Credit Claiming - Position Taking - Data: 200 press releases from Frank Lautenberg’s office (D-NJ) - Apply our method
    • Example Discovery [Figure: two-dimensional map of the space of clusterings; each point is one clustering, labeled by the method that produced it (e.g., mult_dirproc, mixvmf, affprop cosine, kmeans correlation, hclust canberra ward, spec_cos, som, rock, kmedoids euclidean)]
    • Example Discovery Red point: a clustering by Affinity Propagation-Cosine (Dueck and Frey 2007). Close to: Mixture of von Mises-Fisher distributions (Banerjee et al. 2005)
    • Example Discovery Space between methods: local cluster ensemble
    • Example Discovery Found a region with particularly insightful clusterings
    • Example Discovery Mixture: 0.39 Hclust-Canberra-McQuitty; 0.30 Spectral clustering Random Walk (Metrics 1-6); 0.13 Hclust-Correlation-Ward
    • Example Discovery [Figure: plot whose points are labeled with clustering methods (hclust, kmeans, kmedoids, spectral, affprop, mixvmf, and other variants).] Mixture: 0.39 Hclust-Canberra-McQuitty; 0.30 Spectral clustering Random Walk (Metrics 1-6); 0.13 Hclust-Correlation-Ward; 0.09 Hclust-Pearson-Ward
    • Example Discovery [Figure: plot whose points are labeled with clustering methods (hclust, kmeans, kmedoids, spectral, affprop, mixvmf, and other variants).] Mixture: 0.39 Hclust-Canberra-McQuitty; 0.30 Spectral clustering Random Walk (Metrics 1-6); 0.13 Hclust-Correlation-Ward; 0.09 Hclust-Pearson-Ward; 0.05 Kmedoids-Cosine
    • Example Discovery [Figure: plot whose points are labeled with clustering methods (hclust, kmeans, kmedoids, spectral, affprop, mixvmf, and other variants).] Mixture: 0.39 Hclust-Canberra-McQuitty; 0.30 Spectral clustering Random Walk (Metrics 1-6); 0.13 Hclust-Correlation-Ward; 0.09 Hclust-Pearson-Ward; 0.05 Kmedoids-Cosine; 0.04 Spectral clustering Symmetric (Metrics 1-6)
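The six mixture weights listed on this slide sum to 1.00. The slide does not say how the mixture is formed, so the following is only a sketch under one assumed reading: each method's clustering is turned into a document-by-document co-assignment matrix, and the matrices are averaged with the slide's weights. The function names `coassignment` and `mix` and the four-document toy labels are hypothetical illustrations, not part of the talk or of any released software.

```python
# Mixture weights as listed on the slide (method name -> weight).
weights = {
    "Hclust-Canberra-McQuitty": 0.39,
    "Spectral clustering Random Walk (Metrics 1-6)": 0.30,
    "Hclust-Correlation-Ward": 0.13,
    "Hclust-Pearson-Ward": 0.09,
    "Kmedoids-Cosine": 0.05,
    "Spectral clustering Symmetric (Metrics 1-6)": 0.04,
}


def coassignment(labels):
    """Return an n x n matrix with 1.0 where two documents share a cluster."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mix(matrices, mix_weights):
    """Weighted average of equally sized square matrices (ASSUMED mixing rule)."""
    total = sum(mix_weights)
    n = len(matrices[0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, mix_weights)) / total
             for j in range(n)]
            for i in range(n)]


# HYPOTHETICAL cluster labels for four documents, one row per method above.
toy_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]

matrices = [coassignment(labels) for labels in toy_labels]
mixed = mix(matrices, list(weights.values()))

# Each entry of `mixed` is the weight-share of methods that put two documents together.
for row in mixed:
    print(["%.2f" % value for value in row])
```

Entries near 1 mark document pairs that the heavily weighted methods agree on; a new clustering could then be read off this matrix with any standard method, though which one the talk uses is not stated here.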
    • Example Discovery [Figure: plot whose points are labeled with clustering methods.] Clusters in this Clustering (figure label: Mayhew)
    • Example Discovery [Figure: plot whose points are labeled with clustering methods.] Credit Claiming, Pork: “Sens. Frank R. Lautenberg (D-NJ) and Robert Menendez (D-NJ) announced that the U.S. Department of Commerce has awarded a $100,000 grant to the South Jersey Economic Development District” Clusters in this Clustering (figure labels: Credit Claiming, Pork; Mayhew)
    • Example Discovery [Figure: plot whose points are labeled with clustering methods.] Credit Claiming, Legislation: “As the Senate begins its recess, Senator Frank Lautenberg today pointed to a string of victories in Congress on his legislative agenda during this work period” Clusters in this Clustering (figure labels: Credit Claiming, Pork; Credit Claiming, Legislation; Mayhew)
    • Example Discovery [Figure: plot whose points are labeled with clustering methods.] Advertising: “Senate Adopts Lautenberg/Menendez Resolution Honoring Spelling Bee Champion from New Jersey” Clusters in this Clustering (figure labels: Credit Claiming, Pork; Advertising; Credit Claiming, Legislation; Mayhew)
    • Example Discovery: Partisan Taunting [Figure: plot whose points are labeled with clustering methods.] Partisan Taunting: “Republicans Selling Out Nation on Chemical Plant Security” Clusters in this Clustering (figure labels: Credit Claiming, Pork; Advertising; Partisan Taunting; Credit Claiming, Legislation; Mayhew)
    • Example Discovery: Partisan Taunting [Figure: plot whose points are labeled with clustering methods.] Partisan Taunting: “Senator Lautenberg’s amendment would change the name of. . . the Republican bill. . . to ‘More Tax Breaks for the Rich and More Debt for Our Grandchildren Deficit Expansion Reconciliation Act of 2006’” Clusters in this Clustering (figure labels: Credit Claiming, Pork; Advertising; Partisan Taunting; Credit Claiming, Legislation; Mayhew)
    • Example Discovery: Partisan Taunting [Figure: plot whose points are labeled with clustering methods.] Definition: Explicit, public, and negative attacks on another political party or its members. Clusters in this Clustering (figure labels: Credit Claiming, Pork; Advertising; Partisan Taunting; Credit Claiming, Legislation; Mayhew)
    • Example Discovery: Partisan Taunting [Figure: plot whose points are labeled with clustering methods.] Definition: Explicit, public, and negative attacks on another political party or its members. Taunting ruins deliberation. Clusters in this Clustering (figure labels: Credit Claiming, Pork; Advertising; Partisan Taunting; Credit Claiming, Legislation; Mayhew)
    • In Sample Illustration of Partisan Taunting. Taunting ruins deliberation: - “Senator Lautenberg Blasts Republicans as ‘Chicken Hawks’” [Government Oversight] [Photo: Sen. Lautenberg on Senate Floor, 4/29/04]
    • In Sample Illustration of Partisan Taunting. Taunting ruins deliberation: - “Senator Lautenberg Blasts Republicans as ‘Chicken Hawks’” [Government Oversight] - “The Scopes trial took place in 1925. Sadly, President Bush’s veto today shows that we haven’t progressed much since then” [Healthcare] [Photo: Sen. Lautenberg on Senate Floor, 4/29/04]
    • In Sample Illustration of Partisan Taunting. Taunting ruins deliberation: - “Senator Lautenberg Blasts Republicans as ‘Chicken Hawks’” [Government Oversight] - “The Scopes trial took place in 1925. Sadly, President Bush’s veto today shows that we haven’t progressed much since then” [Healthcare] - “Every day the House Republicans dragged this out was a day that made our communities less safe.” [Homeland Security] [Photo: Sen. Lautenberg on Senate Floor, 4/29/04]
    • Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator.
    • Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. - Confirmed using 64,033 press releases; 301 senator-years.
    • Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. - Confirmed using 64,033 press releases; 301 senator-years. - Apply supervised learning method: measure the proportion of each senator's press releases that taunt the other party
    • Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. - Confirmed using 64,033 press releases; 301 senator-years. - Apply supervised learning method: measure the proportion of each senator's press releases that taunt the other party [Figure: histogram; x-axis: Prop. of Press Releases Taunting; y-axis: Frequency]
    • Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. - Confirmed using 64,033 press releases; 301 senator-years. - Apply supervised learning method: measure the proportion of each senator's press releases that taunt the other party. On Avg., Senators Taunt in 27% of Press Releases [Figure: histogram; x-axis: Prop. of Press Releases Taunting; y-axis: Frequency]
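The measurement step on this slide, the proportion of each senator's press releases classified as taunting, can be sketched as below. This is a minimal illustration, assuming some supervised classifier has already labeled each press release; the slide does not name the classifier. The function `taunting_proportions`, the senator names, and the toy labels are hypothetical and are not the talk's data.

```python
from collections import defaultdict


def taunting_proportions(classified):
    """Map each senator to the share of their press releases labeled as taunting.

    `classified` is an iterable of (senator, is_taunting) pairs, where
    is_taunting is the (assumed) output of a supervised classifier.
    """
    counts = defaultdict(lambda: [0, 0])  # senator -> [taunting count, total count]
    for senator, is_taunting in classified:
        counts[senator][0] += int(is_taunting)
        counts[senator][1] += 1
    return {senator: taunting / total
            for senator, (taunting, total) in counts.items()}


# HYPOTHETICAL classifier output for two made-up senators.
classified = [
    ("Senator A", True),
    ("Senator A", False),
    ("Senator B", False),
    ("Senator B", False),
    ("Senator B", True),
    ("Senator B", False),
]

proportions = taunting_proportions(classified)
# Unweighted average across senators, analogous to the slide's "On Avg." figure.
average = sum(proportions.values()) / len(proportions)
print(proportions, average)
```

Averaging per-senator proportions weights every senator equally regardless of how many press releases each issued; whether the talk's 27% figure is computed this way or per senator-year is not stated on the slide.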
    • Quantitative Methods for Qualitative Conceptualization 1) Conceptualization: Qualitative Methods (reading!) 2) Measurement: Quantitative Methods 3) Validation. Quantitative methods for conceptualization and discovery
    • Quantitative Methods for Qualitative Conceptualization 1) Conceptualization: Qualitative Methods (reading!) 2) Measurement: Quantitative Methods 3) Validation. Quantitative methods for conceptualization and discovery: - Few formal methods designed explicitly for conceptualization
    • Quantitative Methods for Qualitative Conceptualization 1) Conceptualization: Qualitative Methods (reading!) 2) Measurement: Quantitative Methods 3) Validation. Quantitative methods for conceptualization and discovery: - Few formal methods designed explicitly for conceptualization - Belittled: “Tom Swift and His Electric Factor Analysis Machine” (Armstrong 1967)
    • Quantitative Methods for Qualitative Conceptualization 1) Conceptualization: Qualitative Methods (reading!) 2) Measurement: Quantitative Methods 3) Validation. Quantitative methods for conceptualization and discovery: - Few formal methods designed explicitly for conceptualization - Belittled: “Tom Swift and His Electric Factor Analysis Machine” (Armstrong 1967) - Evaluation methods measure progress in discovery
    • For more information: http://GKing.Harvard.edu