Assignment 8
One Hot Encoding
Introduction
Categorical features are generally divided into 3 types:
A. Binary: Either/or
Examples:
– Yes, No
– True, False
B. Ordinal: Groups with a specific order.
Examples:
– low, medium, high
– cold, hot, lava hot
C. Nominal: Unordered groups.
Examples:
– cat, dog, tiger
– pizza, burger, coke
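The three types can be sketched in pandas as follows. This is a minimal illustration, not part of the original slides; the DataFrame and its column names (`subscribed`, `spice_level`, `pet`) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, one column per categorical type
df = pd.DataFrame({
    "subscribed": ["Yes", "No", "Yes"],        # binary
    "spice_level": ["low", "high", "medium"],  # ordinal
    "pet": ["cat", "dog", "tiger"],            # nominal
})

# Binary: map the two values to 0/1
df["subscribed_flag"] = df["subscribed"].map({"No": 0, "Yes": 1})

# Ordinal: state the order explicitly so the integer codes respect it
df["spice_level"] = pd.Categorical(
    df["spice_level"], categories=["low", "medium", "high"], ordered=True
)

# Nominal: a plain (unordered) category
df["pet"] = df["pet"].astype("category")

spice_codes = df["spice_level"].cat.codes.tolist()
print(spice_codes)                 # codes follow low < medium < high
print(df["pet"].cat.ordered)       # False: no order is implied
```

Declaring the order for the ordinal column is a design choice: without it, pandas would sort the categories alphabetically and the codes would not reflect low < medium < high.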
Label Encoding
• Label encoding is a technique that converts a
categorical column into a numerical one by assigning
each category an integer, so that it can be used by
machine learning models which only take numerical data.
• It is an important pre-processing step in a machine
learning project.
Height    Height (encoded)
Tall      0
Medium    1
Short     2
Code
# Import pandas and the label encoder
import pandas as pd
from sklearn import preprocessing
df = pd.read_csv('car_details.csv')
# The label_encoder object learns the distinct labels
# and maps each one to an integer.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Manufacturer'.
df['Manu'] = label_encoder.fit_transform(df['Manufacturer'])
df['Manu'].unique()
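A point worth checking when applying this to the Height example: scikit-learn's LabelEncoder assigns codes in sorted order of the labels, so it would not reproduce the Tall 0, Medium 1, Short 2 table by itself. The sketch below contrasts the two; the `heights` series and the `height_map` dictionary are hypothetical names introduced only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample column
heights = pd.Series(["Tall", "Medium", "Short", "Tall"])

# LabelEncoder sorts the distinct labels, then numbers them from 0
encoder = LabelEncoder()
auto_codes = encoder.fit_transform(heights)
classes = list(encoder.classes_)

# An explicit mapping gives full control over which label gets which code
height_map = {"Tall": 0, "Medium": 1, "Short": 2}
manual_codes = heights.map(height_map)

print(classes)
print(list(auto_codes))
print(manual_codes.tolist())
```

The scikit-learn documentation describes LabelEncoder as intended for target labels; for input features with a meaningful order, an explicit mapping like the one above (or OrdinalEncoder with a stated category order) is one possible alternative.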
One Hot Encoding
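• One hot encoding is a technique that we use
to represent a categorical variable as numerical
values in a machine learning model: each category
becomes its own column, holding 1 where the row
belongs to that category and 0 otherwise.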
Example
Consider the data where fruits, their corresponding categorical values, and prices are given.
Fruit     Categorical value of fruit    Price
apple     1                             5
mango     2                             10
apple     1                             15
orange    3                             20
After one hot encoding, each fruit becomes its own 0/1 column:
apple    mango    orange    price
1        0        0         5
0        1        0         10
1        0        0         15
0        0        1         20
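The fruit example can be reproduced with pandas. This is a minimal sketch using `pd.get_dummies` rather than the scikit-learn encoder used later in the slides; the DataFrame name `fruits` is an assumption made for illustration.

```python
import pandas as pd

# The fruit and price columns from the example above
fruits = pd.DataFrame({
    "fruit": ["apple", "mango", "apple", "orange"],
    "price": [5, 10, 15, 20],
})

# One 0/1 column per distinct fruit (columns come out in sorted order)
one_hot = pd.get_dummies(fruits["fruit"], dtype=int)

# Put the price column back next to the encoded columns
result = one_hot.join(fruits["price"])
print(result)
```

`dtype=int` is passed so the columns hold 0/1 integers, matching the table; without it, recent pandas versions return boolean columns.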
Advantage
The advantages of using one hot encoding include:
• It allows the use of categorical variables in
models that require numerical input.
• It can improve model performance, because the
model can learn a separate effect for each category
instead of treating integer codes as magnitudes.
• It avoids implying a false ordering. Integer codes
(e.g. 1, 2, 3) suggest a ranking even when the
categories have no natural order (e.g. "cat", "dog",
"tiger"), whereas one hot columns treat every
category equally.
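To illustrate the ordering point in the last bullet, the sketch below compares distances between categories under integer codes and under one hot vectors, reusing the fruit codes from the earlier example. The helper names `label_distance` and `one_hot_distance` are hypothetical.

```python
import numpy as np

# Integer codes from the fruit example
label_codes = {"apple": 1, "mango": 2, "orange": 3}

# The corresponding one hot vectors
one_hot = {
    "apple": np.array([1, 0, 0]),
    "mango": np.array([0, 1, 0]),
    "orange": np.array([0, 0, 1]),
}

def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one hot vectors of two categories."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# Integer codes make apple look "closer" to mango than to orange
print(label_distance("apple", "mango"), label_distance("apple", "orange"))

# One hot vectors keep every pair of categories equally far apart
print(one_hot_distance("apple", "mango"), one_hot_distance("apple", "orange"))
```

With integer codes, a distance-based or linear model may read meaning into the gap between 1 and 3; with one hot vectors, no pair of fruits is closer than any other.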
Disadvantage
• It can lead to increased dimensionality, as a separate column is
created for each category in the variable. This can make the model
more complex and slow to train.
• It can lead to sparse data, as most observations will have a value of
0 in most of the one-hot encoded columns.
• It can lead to overfitting, especially if there are many categories in
the variable and the sample size is relatively small.
• One-hot-encoding is a powerful technique to treat categorical data,
but it can lead to increased dimensionality, sparsity, and overfitting.
It is important to use it cautiously and consider other methods such
as ordinal encoding or binary encoding.
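The dimensionality and sparsity points can be made concrete with a small experiment. This is a sketch under assumed data: the `city` column with 50 distinct values across 200 rows is invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column: 200 rows drawn from 50 distinct categories
cities = pd.DataFrame({"city": [f"city_{i % 50}" for i in range(200)]})

# By default OneHotEncoder returns a SciPy sparse matrix
encoder = OneHotEncoder()
encoded = encoder.fit_transform(cities)

# One input column has become one output column per category
n_rows, n_cols = encoded.shape

# Fraction of cells that are non-zero: exactly one 1 per row
density = encoded.nnz / (n_rows * n_cols)

print(n_rows, n_cols)
print(density)
```

Keeping the result in sparse form, rather than calling `.toarray()`, is one way to limit the memory cost when a variable has many categories.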
Python code
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Retrieve the data
data = pd.read_csv('car_details.csv')
# Convert the columns to the category type
data['Manu'] = data['Manufacturer'].astype('category')
data['Model'] = data['Model'].astype('category')
# Assign numerical codes and store them in new columns
data['Manu_new'] = data['Manu'].cat.codes
data['Model_new'] = data['Model'].cat.codes
# Create an instance of the one-hot encoder
enc = OneHotEncoder()
# Pass the coded columns and convert the sparse result to a dense array
encoded = enc.fit_transform(data[['Manu_new', 'Model_new']]).toarray()
# Name the output columns (get_feature_names_out needs scikit-learn 1.0+)
enc_data = pd.DataFrame(encoded, columns=enc.get_feature_names_out(),
                        index=data.index)
# Merge with main
# new_df = data.join(enc_data)
print(enc_data.head(30))

Data Preprocessing: One Hot Encoding Method
