Regression Methods in
Machine Learning
Categorical Variable Conversion
Portland Data Science Group
Andrew Ferlitsch
Community Outreach Officer
July, 2017
Linear Regression
• All the features (independent variables) need to be real numbers.
• A feature CANNOT be a categorical value, i.e., a named or
enumerated value.
• Examples:
Male vs. Female
Red, Blue, Green
Apple, Banana, Pear, Orange
Categorical Variables

Age   Gender   Income
25    Male     25000
26    Female   22000
30    Male     45000
24    Female   26000

Age and Gender are the independent variables (features); Income is the
dependent variable (label), i.e., the value to predict. Age holds real
values, while Gender holds categorical values.
Dummy Variable Conversion
Known in Python (scikit-learn) as OneHotEncoder.
For each categorical feature:
1. Scan the dataset and determine all of its unique values.
2. Create a new feature (i.e., dummy variable) in the dataset, one
per unique value.
3. Remove the original categorical feature from the dataset.
4. For each sample (row), set a 1 in the dummy variable that
corresponds to that row's categorical value.
5. Set a 0 in the remaining dummy variables for that categorical field.
6. Remove one of the dummy variable fields (to avoid the dummy
variable trap, explained below); see the sketch after this list.
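A minimal sketch of these steps with pandas (the DataFrame and its
column names are illustrative, not from a real dataset):

    import pandas as pd

    # Illustrative data: one real feature, one categorical feature,
    # and the label.
    df = pd.DataFrame({
        "Age":    [25, 26, 30, 24],
        "Gender": ["Male", "Female", "Male", "Female"],
        "Income": [25000, 22000, 45000, 26000],
    })

    # Steps 1-5: find the unique values, add one 0/1 dummy column per
    # value, and drop the original categorical column.
    # Step 6: drop_first=True removes one dummy per categorical feature.
    df = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)

    print(df)   # columns: Age, Income, Gender_Male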
Dummy Variable Trap

x1: Gender   x2: Male   x3: Female
Male             1          0
Female           0          1
Male             1          0
Female           0          1

Need to drop one dummy variable!

Multicollinearity occurs when one independent variable can be predicted
from another; here, x2 = (1 - x3). As a result, a regression analysis
cannot distinguish between the contributions of x2 and x3.
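A quick numerical check of the trap, as a sketch (the arrays mirror
the table above):

    import numpy as np

    # One-hot columns from the table above: x2 = Male, x3 = Female.
    x2 = np.array([1, 0, 1, 0])
    x3 = np.array([0, 1, 0, 1])

    # x3 fully determines x2: x2 = 1 - x3.
    print(np.array_equal(x2, 1 - x3))        # True

    # With an intercept column, keeping both dummies makes the design
    # matrix rank-deficient (its columns are linearly dependent).
    X_both = np.column_stack([np.ones(4), x2, x3])
    X_drop = np.column_stack([np.ones(4), x2])
    print(np.linalg.matrix_rank(X_both))     # 2: 3 columns, rank 2 (trap)
    print(np.linalg.matrix_rank(X_drop))     # 2: full column rank (OK)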
Drop One of the Dummy Variables

Before:

Age   Gender   Income
25    Male     25000
26    Female   22000
30    Male     45000
24    Female   26000

After (Gender is replaced with Male):

Age   Male   Income
25    1      25000
26    0      22000
30    1      45000
24    0      26000
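For a two-valued category like Gender, the same result can be produced
by hand (a sketch; the df below holds the "Before" table above):

    import pandas as pd

    df = pd.DataFrame({
        "Age":    [25, 26, 30, 24],
        "Gender": ["Male", "Female", "Male", "Female"],
        "Income": [25000, 22000, 45000, 26000],
    })

    # A two-valued category needs only one dummy: replace Gender with
    # a 0/1 Male flag and drop the original column.
    df["Male"] = (df["Gender"] == "Male").astype(int)
    df = df.drop(columns=["Gender"])[["Age", "Male", "Income"]]

    print(df)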
The same conversion with a three-valued category (Race):

Before:

Age   Race       Income
20    White      25000
26    Hispanic   22000
30    Asian      45000
24    Asian      26000

After:

Age   White   Asian   Income
20    1       0       25000
26    0       0       22000
30    0       1       45000
24    0       1       26000

Dropped Hispanic: a row is Hispanic when both White = 0 and Asian = 0.
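The multi-category case with scikit-learn's OneHotEncoder, as a sketch
(assumes scikit-learn >= 1.2 for the sparse_output argument; note that
drop="first" removes the first category alphabetically, Asian, whereas
the slide drops Hispanic; either choice avoids the trap):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    race = np.array([["White"], ["Hispanic"], ["Asian"], ["Asian"]])

    # drop="first" removes one dummy per feature. scikit-learn drops
    # the first category alphabetically (Asian here), so a row of all
    # zeros now means Asian rather than Hispanic.
    enc = OneHotEncoder(drop="first", sparse_output=False)
    dummies = enc.fit_transform(race)

    print(enc.get_feature_names_out())   # ['x0_Hispanic' 'x0_White']
    print(dummies)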
