Data Preprocessing
Mrs. Manisha Patil, Asst. Professor, MIT ACSC Alandi
Objectives
Understand the concept of data preprocessing
Discuss the various types of data and possible errors
Understand the need for and use of various data preprocessing operations
Introduction to data preprocessing
Data collected for analysis in data science is usually in a raw, unprocessed state.
Data preprocessing is the task of transforming raw data so that it is ready to be fed into an algorithm.
Data preparation for a data science project usually takes place in two phases:
● Data preprocessing
● Data wrangling
Data types and forms
Categorical data
Nominal data
Ordinal data
Numerical data
Interval data
Ratio data
Categorical data
This data is non-numeric and consists of text that can be coded as numbers.
It can be of two types:
Nominal data :
This data is used to label variables without providing any quantitative value. For example, gender can be assigned numbers.
Categories on a nominal scale are mutually exclusive.
Ordinal data :
This type of data is used to label variables that need to follow some order.
For example: a company takes feedback about the quality of its service.
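The difference between nominal and ordinal labels can be sketched with pandas categoricals. This is only an illustration: the sample values and the four-level feedback scale are assumptions, not taken from the slides.

```python
import pandas as pd

# Nominal: labels with no order; the integer codes are arbitrary identifiers.
gender = pd.Series(["male", "female", "female", "male"], dtype="category")
gender_codes = gender.cat.codes  # categories are sorted alphabetically

# Ordinal: labels with a meaningful order (hypothetical feedback scale).
levels = ["poor", "average", "good", "excellent"]
feedback = pd.Series(
    pd.Categorical(["good", "poor", "excellent", "average"],
                   categories=levels, ordered=True)
)
feedback_codes = feedback.cat.codes  # codes follow the declared order

# Order comparisons are only meaningful for the ordinal variable.
at_least_good = feedback >= "good"

print(gender_codes.tolist())
print(feedback_codes.tolist())
print(at_least_good.tolist())
```

Comparing the nominal codes with `<` or `>` would run, but the result would carry no meaning, which is the point of keeping the two types apart.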
Numerical data
This data is numeric and usually follows an order of values.
Types of numeric data
Interval data :
This type of data follows numeric scales in which both the order and the exact difference between values are considered.
The distance between adjacent values on an interval scale is always equal.
Ratio data :
Possible data error types
Missing data
● Missing Completely At Random
● Missing At Random
● Missing Not at Random
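A rough way to see how the three mechanisms differ is to simulate them. The sketch below uses synthetic data; the column names, probabilities and thresholds are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
people = pd.DataFrame({
    "age": rng.integers(20, 60, n),
    "salary": rng.normal(40000, 8000, n),
})

# Missing Completely At Random: every salary has the same 20% chance of being lost.
mcar_mask = pd.Series(rng.random(n) < 0.2)

# Missing At Random: missingness depends on another observed column (age),
# not on the salary value itself.
mar_mask = (people["age"] > 50) & (rng.random(n) < 0.5)

# Missing Not At Random: missingness depends on the unobserved value itself
# (here the top 20% of salaries are withheld).
mnar_mask = people["salary"] > people["salary"].quantile(0.8)

salary_mnar = people["salary"].mask(mnar_mask)  # masked entries become NaN
print(mcar_mask.mean(), mar_mask.mean(), mnar_mask.mean())
```

Under the third mechanism the remaining salaries are biased downward, which is why simply dropping or mean-filling such values can distort an analysis.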
Record  Cust Id  Salary  Date of Birth  Role          Spouse
1       A121     42000   1985-30-05     Manager       Anjali
2       A122             07-02-1982     CEO           Priya
3       A123     28530   11-09-1987     Asst Manager
4       A124     32000   12/24/1986                   Rahul
5       A125     37450   09/07/1988     Secy          Bina
6       A126     37450   07-09-1987     Secretary     Sumit
Manual Input
Data inconsistency
Regional Formats
Numerical units
Wrong data types
File Manipulation
Missing anonymization
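Several of these error sources show up in the sample customer table. The sketch below rebuilds that table in pandas and checks for them; which cells are blank is inferred from the slide layout, and the column names and the helper `date_pattern` are hypothetical.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": ["A121", "A122", "A123", "A124", "A125", "A126"],
    "salary": [42000, None, 28530, 32000, 37450, 37450],
    "date_of_birth": ["1985-30-05", "07-02-1982", "11-09-1987",
                      "12/24/1986", "09/07/1988", "07-09-1987"],
    "role": ["Manager", "CEO", "Asst Manager", None, "Secy", "Secretary"],
    "spouse": ["Anjali", "Priya", None, "Rahul", "Bina", "Sumit"],
})

# Missing data: count the empty cells in each column.
missing_per_column = customers.isna().sum()

# Data inconsistency: map an abbreviation onto one canonical label.
customers["role"] = customers["role"].replace({"Secy": "Secretary"})

# Regional formats: reduce each date to its digit/separator pattern
# to see how many layouts are mixed in one column.
def date_pattern(text):
    return "".join("d" if ch.isdigit() else ch for ch in text)

patterns = customers["date_of_birth"].map(date_pattern).value_counts()

print(missing_per_column)
print(patterns)
```

Three different date layouts in one column mean the values cannot be parsed reliably with a single format string, so each layout has to be handled explicitly or corrected at the source.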
Various Data Preprocessing Operations
Data Cleaning
Data cleaning handles irrelevant or missing data.
Data is cleaned by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies.
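The slides do not say how outliers are identified. One common rule of thumb, used here purely as an assumption, is the 1.5 × IQR rule; the sample values are made up.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])

# Interquartile range of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
cleaned = values[(values >= lower) & (values <= upper)]

print(cleaned.tolist())
```

The multiplier 1.5 is a convention, not a law; a different threshold, or a domain-specific check, may suit a given dataset better.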
Filling Missing values
Replace missing values with zeros
Drop rows with missing values
Replace missing values with the mean, mode or median
# Method 1 - Filling every missing value with 0
# (df is assumed to be an existing pandas DataFrame)
print("\n\nEvery Missing Value Replaced with '0':")
print("--------------------------------------------")
print(df.fillna(0))
# Method 2 - Dropping rows that have missing values
# (dropna() returns a new DataFrame; df itself is unchanged)
print("\n\nDropping Rows with Missing Values:")
print("----------------------------------------")
print(df.dropna())
# Method 3 - Replacing missing values with the median value
median = df['C01'].median()
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
df['C01'] = df['C01'].fillna(median)
print("\n\nMissing Values for Column 1 Replaced with Median Value:")
print("--------------------------------------------------")
print(df)
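The median snippet covers only one of the three statistics named earlier. A minimal self-contained sketch for the mean and the mode follows; the sample frame and the second column `C02` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "C01": [10.0, np.nan, 30.0, 20.0],   # numeric column with one gap
    "C02": ["a", "b", None, "b"],        # categorical column with one gap
})

# Mean suits numeric columns without strong outliers.
df["C01"] = df["C01"].fillna(df["C01"].mean())

# Mode (most frequent value) also works for categorical columns.
# mode() returns a Series because ties are possible; take the first entry.
df["C02"] = df["C02"].fillna(df["C02"].mode()[0])

print(df)
```

The median is usually preferred over the mean when the column is skewed, since a few extreme values can pull the mean far from the typical value.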
Smoothing noisy data
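The slide names the operation without describing a method. One widely taught option, assumed here, is smoothing by bin means: sort the values, split them into equal-frequency bins, and replace each value with the mean of its bin. The sample prices are illustrative.

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3  # equal-frequency bins of three values each

# Sort, then arrange the values as one row per bin.
bins = np.sort(prices).reshape(-1, bin_size)

# Replace every value by the mean of the bin it falls in.
bin_means = bins.mean(axis=1)
smoothed = np.repeat(bin_means, bin_size)

print(smoothed)
```

This sketch requires the number of values to be a multiple of `bin_size`; real data would need the last bin handled separately.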
One hot encoding
Label encoding assigns a numeric value to each categorical value. This is acceptable for ordinal labels, because their values have a natural order.
But nominal features do not have any order, and numeric labels would imply one.
E.g.: the values of a car's colour have no order among themselves.
To prevent this, one hot encoding is used for nominal attributes.
It splits the column containing the nominal categorical data into as many columns as there are categories in that column. Each new column contains 1 in the rows that belong to its category and 0 elsewhere.
E.g. Color_of_cars = ['white', 'red', 'black']
The one hot encoding matrix will be
white  red  black
1      0    0
0      1    0
0      0    1
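The same matrix can be produced with pandas. This is a sketch rather than the deck's own code: `pd.get_dummies` is one of several ways to one-hot encode, and the explicit category order is set only so the columns come out as white, red, black instead of alphabetically.

```python
import pandas as pd

colors = ["white", "red", "black"]

# Declaring the categories fixes the column order of the result.
cars = pd.Series(pd.Categorical(colors, categories=colors), name="color")

# dtype=int gives 0/1 instead of the boolean default of recent pandas versions.
one_hot = pd.get_dummies(cars, dtype=int)

print(one_hot)
```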
Data Reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most
important information.
It reduces the data by removing unimportant and unwanted features from the dataset.
Data cube Aggregation
● Data cubes are multidimensional sets of data that can be stored in a spreadsheet.
● A data cube can have two, three, or more dimensions.
● Each dimension represents an attribute of interest.
● Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
● Data cubes provide fast access to pre-computed, summarized data.
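As a small illustration of aggregating along cube dimensions, the sketch below rolls quarterly sales up to yearly totals with pandas. The table, its column names and its figures are invented for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022] * 4 + [2023] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "branch":  ["A", "A", "B", "B", "A", "A", "B", "B"],
    "amount":  [100, 150, 200, 250, 120, 180, 210, 240],
})

# Aggregate away the quarter and branch dimensions: 8 rows become 2.
by_year = sales.groupby("year")["amount"].sum()

# Keep two dimensions (year x branch) as a small summarized cube.
cube = sales.pivot_table(index="year", columns="branch",
                         values="amount", aggfunc="sum")

print(by_year)
print(cube)
```

The summarized tables are smaller than the original and answer yearly questions directly, at the cost of losing the quarterly detail.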
Numerosity Reduction
● Numerosity reduction is a technique used in data mining to reduce the number of data points in a dataset while still preserving the most important information.
● It is a data reduction technique that replaces the original data with a smaller form of data representation.
● There are two kinds of numerosity reduction: parametric and non-parametric methods.
● This can be beneficial when the dataset is too large to be processed efficiently, or when it contains a large number of irrelevant or redundant data points.
● For parametric methods, the data is represented using a model. The model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data. Regression and log-linear methods are used for creating such models.
● Non-parametric methods for storing reduced representations of the data include histograms, clustering, sampling and data cube aggregation.
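Both families can be sketched in a few lines of NumPy. The data here is synthetic and perfectly linear, which is an assumption made so that the regression reproduces it exactly; real data would only be approximated.

```python
import numpy as np

x = np.arange(100, dtype=float)
y = 2.0 * x + 1.0  # synthetic data: 100 points on a straight line

# Parametric: fit a linear regression and keep only two parameters.
slope, intercept = np.polyfit(x, y, deg=1)
y_est = slope * x + intercept  # the data is re-estimated from the model

# Non-parametric: keep a simple random sample of 10 points instead of all 100.
rng = np.random.default_rng(0)
sample = rng.choice(y, size=10, replace=False)

print(slope, intercept)
print(sample)
```

In the parametric case 100 stored values shrink to 2 parameters; in the sampling case they shrink to 10 values, with no model assumed.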
Data Discretization
Data discretization techniques can be used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Discretization is the process through which we can transform continuous variables, models or functions into discr
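Dividing a continuous attribute into intervals can be sketched with `pandas.cut`. The ages, bin edges and labels below are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# Hand-picked intervals: (0, 18], (18, 65], (65, 120]
labels = ["child", "adult", "senior"]
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=labels)

# Equal-width alternative: let pandas split the observed range into 3 intervals.
equal_width = pd.cut(ages, bins=3)

print(age_group.tolist())
print(equal_width.cat.codes.tolist())
```

Six distinct ages are reduced to three interval labels, which is the reduction in values the slide describes.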