Slides for a talk delivered at the Python Pune meetup on 31st Jan 2014.
Categorical data is a problem many data scientists face. This talk is about how to tame it.
1. Categorical Data Analysis in Python
By Jaidev Deshpande
Data Scientist, DataCulture Analytics
twitter.com/jaidevd
2. Problem: Who's likely to attend the next meetup?
● Who comes often?
● Men / Women?
● Where do you live? How far from the venue?
● Proficiency with Python (Beginner / Intermediate / Advanced)?
● Area of interest?
3. Something like..

Attendees    Attendance (%)  Gender  Pincode  Proficiency in Python  Interest            ...
attendee_1   80              M       411013   Intermediate           Web                 ...
attendee_2   30              F       411040   Advanced               Test / Automation   ...
attendee_3   55              M       411001   Beginner               Scientific          ...
...          ...             ...     ...      ...                    ...                 ...

● 1. Numerical features – continuous and quantitative
● 2. Categorical features – discrete and qualitative
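The numerical/categorical split maps directly onto pandas dtypes. A minimal sketch using made-up rows that mirror the example table (column names here are illustrative, not from the talk):

```python
import pandas as pd

# Illustrative attendee data, mirroring the example table above.
df = pd.DataFrame({
    "attendance_pct": [80, 30, 55],                # numerical: continuous
    "gender": ["M", "F", "M"],                     # categorical: nominal
    "pincode": ["411013", "411040", "411001"],     # categorical, despite the digits
    "proficiency": ["Intermediate", "Advanced", "Beginner"],
    "interest": ["Web", "Test / Automation", "Scientific"],
})

# Tell pandas which columns are categorical; proficiency has a natural order.
df["gender"] = df["gender"].astype("category")
df["proficiency"] = pd.Categorical(
    df["proficiency"],
    categories=["Beginner", "Intermediate", "Advanced"],
    ordered=True,
)
```

Note that pincode is stored as strings on purpose: averaging or subtracting pincodes is meaningless, so treating them as numbers invites exactly the mistakes the next slides warn about.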
4. Common Numerical Operations on Data
● Obviously – add, subtract, multiply, divide
● Statistical moments
● Operations in vector spaces
  – Distance measures
  – Slicing
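All of these numerical operations are one-liners in NumPy. A minimal sketch (the array values are made up for illustration):

```python
import numpy as np

attendance = np.array([80.0, 30.0, 55.0])  # attendance % from the example table

# Arithmetic and statistical moments
mean = attendance.mean()
var = attendance.var()

# Vector-space operations: Euclidean distance between two feature vectors
a = np.array([80.0, 1.0])
b = np.array([30.0, 0.0])
dist = np.linalg.norm(a - b)

# Slicing
top_two = attendance[:2]
```

None of these have an obvious counterpart for strings like "M" or "Intermediate", which is exactly the gap the next slide makes vivid.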
5. Comparison of Operations

Numerical Data:
● Add, subtract, multiply, divide
● Mean, variance, standard deviation
● Vector spaces – the very idea of 'measuring'

Categorical Data (Strings, etc.):
● What's the product of two strings?
● The average pincode of two areas?
● &%%#&$$*&!!!!
● At least get some numbers!
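One standard way to "at least get some numbers" out of a categorical column is one-hot encoding, where each category becomes a 0/1 indicator column. A minimal sketch with pandas (data and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"proficiency": ["Beginner", "Advanced", "Beginner"]})

# Each category becomes an indicator column, which arithmetic,
# distances and most estimators can then operate on.
dummies = pd.get_dummies(df["proficiency"], prefix="prof")
print(dummies.columns.tolist())  # ['prof_Advanced', 'prof_Beginner']
```

This buys back arithmetic, but it ignores any structure between categories; correspondence analysis, coming up next, tries to recover that structure.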
10. Correspondence Analysis
● How are proficiencies related w.r.t. gender? (Row profiles)
● How are genders related w.r.t. proficiency? (Column profiles)
  – Cosine similarity
  – Correlation / Covariance
● How are they interrelated?
  – Weighted chi-squared distance
● Can the dimensionality be reduced?
  – Singular value decomposition / PCA
  – sklearn.decomposition.PCA
  – sklearn.decomposition.TruncatedSVD
11. Sample Problem
● Consider the proficiency and interest features from the original problem
● Fake data with 100 observations
● Contingency matrix:

              automation  scientific  web
advanced      8           1           7
beginner      13          9           35
intermediate  7           1           19
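A contingency matrix like the one above falls straight out of pd.crosstab. A sketch with a handful of made-up observations (the talk's actual dataset had 100 fake rows):

```python
import pandas as pd

# A few made-up (proficiency, interest) observations
obs = pd.DataFrame({
    "proficiency": ["advanced", "beginner", "beginner", "intermediate", "advanced"],
    "interest":    ["web",      "web",      "scientific", "web",        "automation"],
})

# Cross-tabulate: rows are proficiencies, columns are interests,
# cells count how many attendees fall in each combination.
contingency = pd.crosstab(obs["proficiency"], obs["interest"])
print(contingency)
```

The resulting table is exactly the input that the correspondence-analysis steps on the previous slide operate on.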