Using Clustering as a Tool:
Mixed Methods in Qualitative Data Analysis
Laura Macia, PhD
Behavioral and Community Health Sciences
Graduate School of Public Health
University of Pittsburgh
Types of data
Mixed Methods
• Type of Data / Data Collection
• Data Analysis
Mixed Methods in Data Analysis
Cluster Analysis
• Method for grouping data by their similarity
– Appropriate data
– Defining similarity
– Clustering
Data Preparation
• Types of data:
– Nominal
– Ordinal
– Interval / Ratio
Qualitative Data
(an example)
Latino Grievances Project
Summary Table: Nodes and Attributes (after thematic analysis using NVivo)
Select variables and their values [description]
Part 1:
– Gender: (0) Male; (1) Female
– Strata: (0) Blue-collar; (1) Spouse of American citizen; (2) White-collar
– Legal status: (0) US citizen; (1) Legal permanent resident; (2) Immigrant visa; (3) Non-immigrant visa; (4) Visa overstay; (5) Undocumented
– Income: (0) Under $20k; (1) $20k to $40k; (2) $40k to $60k; (3) $60k to $80k; (4) $80k to $100k; (5) Over $100k
– Education: (0) Primary; (1) Some secondary; (2) High-school diploma; (3) College degree; (4) Graduate degree; (5) Other degree
Part 2:
– Type: (0) Male; (1) Female; (2) Individual [when gender unknown]; (3) Institution; (4) Government; (5) Other
– Nationality: (0) American; (1) Latino; (2) Other; (3) Unknown
Grievance: (0) Debt; (1) Discrimination; (2) Domestic; (3) With the law
Procedural mode:
(1) None
(2) Adjudication [third party with authority to intervene, e.g. courts]
(3) Arbitration [third party agreed to by principals]
(4) Mediation [third party aiding principals to reach an agreement]
(5) Negotiation [two principals decide on settlement]
(6) Coercion [imposition of outcome by unilateral threat or use of force]
(7) Avoidance [terminate relationship / withdraw from situation]
(8) Lumping it [“letting go” of grievance]
(9) Assumed fault* [structure grievance as occurring due to own situation/fault]
(10) Talk back* [making the grievance known without expecting further action]
(11) Other
* Data-driven codes, not included in predefined coding scheme
Data Preparation
• Types of data:
– Nominal
– Ordinal
– Interval / Ratio
Qualitative Data
Gender: (0) Male, (1) Female, …
Type of Grievance: (0) Debt, (1) Discrimination, …
Chosen Procedure: (2) Adjudication, … (6) Coercion, …
Income: (0) <$20k, (1) $20k-$40k, …
Education: (0) Primary, … (2) High-school diploma, …
Units of analysis: Cases
ID | Strata | Part2       | Part2Natlity | Type           | ProcMode1     | ProcMode2     | ProcMode3 | Support1 | Support2
1  | WC     | Individual  | Unknown      | Debt           | Other         | None          | None      | None     | None
2  | WC     | Institution | American     | Debt           | Negotiation   | Avoidance     | None      | None     | None
3  | WC     | Female      | American     | Discrimination | Assumed fault | Lumping it    | Talk back | None     | None
4  | WC     | Individual  | American     | Discrimination | Other         | None          | None      | None     | None
5  | WC     | Male        | Latino       | Domestic       | Other         | Negotiation   | None      | Other    | None
6  | WC     | Female      | Latino       | Domestic       | Negotiation   | Other         | None      | Family   | None
7  | WC     | Male        | Latino       | Domestic       | Negotiation   | None          | None      | Family   | None
8  | WC     | Government  | American     | Law            | Negotiation   | Assumed fault | Other     | Family   | None
9  | WC     | Male        | Latino       | Debt           | Negotiation   | Lumping it    | Other     | Family   | Friend
10 | WC     | Female      | Other        | Debt           | Talk back     | Avoidance     | Other     | Family   | None
11 | WC     | Institution | American     | Debt           | Avoidance     | Other         | None      | Friend   | None
12 | WC     | Institution | American     | Debt           | Other         | None          | None      | None     | None
13 | WC     | Male        | Unknown      | Debt           | Assumed fault | Negotiation   | None      | Friend   | None
14 | WC     | Male        | American     | Discrimination | Lumping it    | None          | None      | None     | None
15 | WC     | Institution | American     | Discrimination | Other         | None          | None      | Church   | None
16 | WC     | Male        | Other        | Discrimination | Lumping it    | Other         | None      | Family   | None
17 | WC     | Other       | Latino       | Domestic       | Negotiation   | None          | None      | None     | None
18 | WC     | Female      | Other        | Domestic       | Negotiation   | None          | None      | Other    | None
19 | WC     | Female      | Other        | Domestic       | Negotiation   | Other         | None      | None     | None
20 | WC     | Government  | American     | Law            | Assumed fault | None          | None      | None     | None
12 variables
Cluster Analysis: Data Reduction
• Transform qualitative data into binary data
ID 1-Fem 1-Male 2-Fem 2-Male 2-Indiv 2-Govmnt 2-Instit 2-Other 2N-American
WC-F-De-11-1 1 0 0 0 1 0 0 0 0
WC-F-De-11-2 1 0 0 0 0 0 1 0 1
WC-F-Di-11-3 1 0 1 0 1 0 0 0 1
WC-F-Di-11-4 1 0 0 0 1 0 0 0 1
WC-F-Do-11-6 1 0 1 0 1 0 0 0 0
WC-F-L-11-8 1 0 0 0 0 1 0 0 1
WC-M-De-45-9 0 1 0 1 1 0 0 0 0
WC-M-De-45-10 0 1 1 0 1 0 0 0 0
WC-M-De-45-11 0 1 0 0 0 0 1 0 1
WC-M-De-45-12 0 1 0 0 0 0 1 0 1
WC-M-De-45-13 0 1 0 1 1 0 0 0 0
WC-M-Di-45-14 0 1 0 1 1 0 0 0 1
WC-M-Di-45-15 0 1 0 0 0 0 1 0 1
WC-M-Do-45-18 0 1 1 0 1 0 0 0 0
WC-M-Do-45-19 0 1 1 0 1 0 0 0 0
WC-M-L-45-20 0 1 0 0 0 1 0 0 1
WC-M-O-45-21 0 1 0 0 0 0 1 0 1
BC-M-Do-29-22 0 1 0 0 1 0 0 0 0
BC-M-De-32-23 0 1 0 0 0 0 1 0 1
BC-M-De-32-24 0 1 0 1 1 0 0 0 0
59 binary variables
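The transformation from nominal codes to binary columns can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the project: the function name `to_binary`, the column naming scheme, and the three sample rows (loosely modeled on the case table) are assumptions made for the example.

```python
# Hypothetical rows modeled on the case table (ID, Part2, Type only).
cases = [
    {"ID": 1, "Part2": "Individual", "Type": "Debt"},
    {"ID": 2, "Part2": "Institution", "Type": "Debt"},
    {"ID": 3, "Part2": "Female", "Type": "Discrimination"},
]


def to_binary(rows, variables):
    """Expand each nominal variable into one 0/1 column per observed category."""
    # Collect the categories seen for each variable, in sorted order.
    columns = []
    for var in variables:
        for category in sorted({row[var] for row in rows}):
            columns.append((var, category))
    # Build one binary row per case: 1 if the case has that category, else 0.
    table = [
        [1 if row[var] == category else 0 for var, category in columns]
        for row in rows
    ]
    names = [f"{var}-{category}" for var, category in columns]
    return names, table


names, binary = to_binary(cases, ["Part2", "Type"])
print(names)
for case, values in zip(cases, binary):
    print(case["ID"], values)
```

Each category becomes its own column, which is why a handful of nominal variables expands into many binary ones.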
Clustering decisions: variables
• Variables to include
– All relevant variables
what is your question?
• Variables to exclude
– irrelevant variables that bias towards certain
cluster solutions
Clustering decisions: similarity
• For binary data: Contingency Tables
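• Pay attention to the a, b, c and d counts in your data (a = attribute present in both cases; b and c = present in only one case; d = absent in both):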
– Which are more common?
– More meaningful?
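As a small illustration, the four cells of the contingency table can be counted directly for any pair of cases. The sketch below assumes the usual 2×2 convention (a = present in both, b = present in the first case only, c = present in the second case only, d = absent in both); the helper name `contingency` is made up for this example, and the two vectors are the first two rows of the binary table shown earlier.

```python
def contingency(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # present in both
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # present in x only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # present in y only
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # absent in both
    return a, b, c, d


# Rows WC-F-De-11-1 and WC-F-De-11-2 from the binary table (9 of the 59 columns).
case_1 = [1, 0, 0, 0, 1, 0, 0, 0, 0]
case_2 = [1, 0, 0, 0, 0, 0, 1, 0, 1]
counts = contingency(case_1, case_2)
print(counts)  # (1, 1, 2, 5)
```

With dummy-coded nominal data, shared absences (d) tend to be the most common cell, which is worth keeping in mind when deciding whether they are meaningful.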
Example similarity measures
$RR(x, y) = \frac{a}{a+b+c+d}$ [Russell and Rao]
$SM(x, y) = \frac{a+d}{a+b+c+d}$ [Simple Matching]
$JACCARD(x, y) = \frac{a}{a+b+c}$ [Jaccard]
$DICE(x, y) = \frac{2a}{2a+b+c}$ [Dice]
$SS1(x, y) = \frac{2(a+d)}{2(a+d)+b+c}$ [Sokal and Sneath 1]
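To see how the choice of measure changes the result, the five formulas can be evaluated on the same counts. The sketch below is a plain-Python illustration; the function name and the counts (a=1, b=1, c=2, d=5) are hypothetical, and the function does not guard against a zero denominator (for example, a+b+c = 0 for Jaccard).

```python
def similarity_measures(a, b, c, d):
    """Compute five binary similarity measures from contingency counts."""
    n = a + b + c + d
    return {
        "russell_rao": a / n,
        "simple_matching": (a + d) / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "sokal_sneath_1": 2 * (a + d) / (2 * (a + d) + b + c),
    }


# Hypothetical counts: 1 shared presence, 3 mismatches, 5 shared absences.
measures = similarity_measures(a=1, b=1, c=2, d=5)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

For these counts, the measures that credit shared absences (Simple Matching, Sokal and Sneath 1) come out at 0.667 and 0.800, while Jaccard and Dice give 0.250 and 0.400. The gap is driven entirely by d, so the choice depends on whether joint absence is meaningful in the data.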
Clustering decisions: linkage
• Classification strategy
– Hierarchical clustering (agglomerative or divisive)
• Good for “smaller” sample sizes (cases in the hundreds)
• Allows choosing from many similarity measures
• Randomize the order of cases, repeat the analysis, and compare the solutions
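Clustering decisions: method
• Linkage method:
– NOT: centroid, median, or Ward (these are intended for squared Euclidean distances)
– Between-groups linkage: merge the two clusters with the smallest average distance across all cross-cluster pairs
– Within-groups linkage: merge the two clusters that give the smallest average distance among all pairs in the resulting cluster
– Nearest neighbor (single linkage): d = smallest distance between two points, one from each cluster
– Furthest neighbor (complete linkage): d = largest distance between two points, one from each cluster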
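The four linkage rules differ only in how they score a candidate merge of two clusters. A minimal sketch, assuming a precomputed symmetric distance matrix; the function names and the four-case matrix are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations


def between_groups(A, B, dist):
    """Average distance over all cross-cluster pairs."""
    pairs = [(i, j) for i in A for j in B]
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def within_groups(A, B, dist):
    """Average distance over all pairs inside the cluster that a merge would create."""
    pairs = list(combinations(A + B, 2))
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def nearest_neighbor(A, B, dist):
    """Single linkage: smallest distance between one point from each cluster."""
    return min(dist[i][j] for i in A for j in B)


def furthest_neighbor(A, B, dist):
    """Complete linkage: largest distance between one point from each cluster."""
    return max(dist[i][j] for i in A for j in B)


# Hypothetical symmetric distance matrix for four cases (indices 0..3).
dist = [
    [0.0, 0.2, 0.6, 0.8],
    [0.2, 0.0, 0.4, 0.6],
    [0.6, 0.4, 0.0, 0.2],
    [0.8, 0.6, 0.2, 0.0],
]
A, B = [0, 1], [2, 3]
for rule in (between_groups, within_groups, nearest_neighbor, furthest_neighbor):
    print(rule.__name__, round(rule(A, B, dist), 3))
```

At each agglomeration step, the pair of clusters with the smallest score under the chosen rule is merged. The same two clusters can look close under one rule (0.4 for nearest neighbor here) and far apart under another (0.8 for furthest neighbor).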
How This Looks in SPSS
Select “Hierarchical Cluster…”
Select variables to include
Methods Menu: Measure (BINARY), Cluster Method
Statistics Menu: Cluster Membership (CHOOSE)
Plots Menu: Select Dendrogram / Icicle Plots [Optional]
Results - Output: Agglomeration Schedule
Results - Output: Dendrogram
Results: Cluster Membership (as new variables)
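For readers without SPSS, the whole pipeline (binary similarity, agglomeration, cluster membership) can be approximated in plain Python. This is a rough sketch, not SPSS's implementation: it assumes Jaccard distance (1 minus Jaccard similarity) and between-groups linkage, the function names are made up, and it uses only four rows and nine of the 59 binary variables from the table shown earlier.

```python
def jaccard_distance(x, y):
    """1 - a / (a + b + c) for two 0/1 vectors; 0.0 if neither vector has any 1s."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    mismatches = sum(1 for xi, yi in zip(x, y) if xi != yi)  # b + c
    total = a + mismatches
    return mismatches / total if total else 0.0


def cluster(rows, k):
    """Agglomerative clustering with between-groups (average) linkage.

    Returns one cluster number (1..k) per row, like a saved membership variable.
    """
    n = len(rows)
    dist = [[jaccard_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with every case in its own cluster
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[p] for j in clusters[q]]
                avg = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    membership = [0] * n
    for label, members in enumerate(clusters, start=1):
        for i in members:
            membership[i] = label
    return membership


# Four rows taken from the binary table (first nine columns only).
ids = ["WC-F-De-11-1", "WC-F-De-11-2", "WC-M-De-45-11", "WC-M-De-45-9"]
rows = [
    [1, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0, 0, 0, 0],
]
membership = cluster(rows, k=2)
for case_id, label in zip(ids, membership):
    print(case_id, label)
```

With these four rows the two-cluster solution groups the cases by the type and nationality of the second party rather than by the participant's gender. A full analysis would use all 59 variables and, as suggested on the linkage slide, be repeated with the cases in a different order.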
Laura Macia: lam60@pitt.edu
THANK YOU!


Editor's Notes

  • #12 Clarify that some discussion can be found. However, it allows using distance measures that are non-euclidean.