How to handle Categorical Data
By
Srinivas Rao PrithviNag Kolla,
Masters in Data Science,
University of North Texas,
Email: prithvikolla8@gmail.com
Categorical Variable:
Generally Data falling into a fixed set of categories is called a categorical data.
Ex:
Survey of what type of phone brand people own comes under categorical data.
Id Name Phone Brand
1 Alex Apple
2 George Nokia
3 Chen Apple
4 prithvi Samsung
Dropping Categorical Variables:
If the columns in the data set have categorical values , which are not useful for
modeling , we can drop them.
Label Encoding:
Giving a unique integer value for the labels in the categorical column.
Ex:
Phone Brand
Apple
Nokia
Jio
Apple
Phone Brand
0
1
2
0
Label Encoding
Decision Trees and Random Forests work well with Label Encoding.
‘Apple’(0) < ‘Nokia’(1) < ‘Jio’(2)
One – Hot – Encoding
• As previously in Label Encoding, we gave an order based unique values to the
labels.
• It doesn’t work well, so we use the method of separating the categorical values
into columns and give ‘1’ as they are present in that row, if not ‘0’.
Ex:
Phone Brand
Apple
Nokia
Apple
Jio
Nokia
Apple Nokia Jio
1 0 0
0 1 0
1 0 0
0 0 1
0 1 0
One-Hot-Encoding

How to Handle Categorical Data

  • 1.
    How to handleCategorical Data By Srinivas Rao PrithviNag Kolla, Masters in Data Science, University of North Texas, Email: prithvikolla8@gmail.com
  • 2.
    Categorical Variable: Generally Datafalling into a fixed set of categories is called a categorical data. Ex: Survey of what type of phone brand people own comes under categorical data. Id Name Phone Brand 1 Alex Apple 2 George Nokia 3 Chen Apple 4 prithvi Samsung Dropping Categorical Variables: If the columns in the data set have categorical values , which are not useful for modeling , we can drop them.
  • 3.
    Label Encoding: Giving aunique integer value for the labels in the categorical column. Ex: Phone Brand Apple Nokia Jio Apple Phone Brand 0 1 2 0 Label Encoding Decision Trees and Random Forests work well with Label Encoding. ‘Apple’(0) < ‘Nokia’(1) < ‘Jio’(2)
  • 4.
    One – Hot– Encoding • As previously in Label Encoding, we gave an order based unique values to the labels. • It doesn’t work well, so we use the method of separating the categorical values into columns and give ‘1’ as they are present in that row, if not ‘0’. Ex: Phone Brand Apple Nokia Apple Jio Nokia Apple Nokia Jio 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 One-Hot-Encoding