2. Objectives
Understand the concept of data preprocessing
Discuss the various types of data and possible errors
Understand the need for and use of various data preprocessing techniques
3. Introduction to data preprocessing
Data collected for performing data analysis in data science is in a raw and unprocessed state.
Data preprocessing is the task of transforming raw data so that it is ready to be fed into an algorithm.
Data preparation usually takes place in two phases for any data science project:
Data Preprocessing
Data wrangling
4. Data types and forms
Categorical data
Nominal data
Ordinal data
Numerical data
Interval data
Ratio data
5. Categorical data
This data is non-numeric; it consists of text labels that can be coded as numbers.
It can be of two types:
Nominal data: This data is used to label variables without providing any quantitative value. For example,
gender can be assigned numbers.
Nominal scales are mutually exclusive.
Ordinal data:
This type of data is used to label variables that need to follow some order.
For example: a company takes feedback about the quality of its service on an ordered scale.
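Coding an ordinal scale as numbers that preserve its order can be sketched as follows; the feedback levels and their mapping are assumed here for illustration:

```python
import pandas as pd

# Hypothetical service-quality feedback on an ordered scale
feedback = pd.Series(["good", "poor", "excellent", "fair"])

# Map each level to a number that preserves the ordering
order = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
coded = feedback.map(order)

print(coded.tolist())
```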
6. Numerical data
This data is numeric and it usually follows an order of values.
Types of numeric data:
Interval data:
This type of data follows a numeric scale in which both the order and the exact difference between values are
meaningful.
The distance between adjacent values on an interval scale is always equal.
Ratio data:
This type of data is like interval data but also has a true zero point, so ratios between values are meaningful.
7. Possible data error types
Missing data
● Missing Completely At Random
● Missing at Random
● Missing Not at Random
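Whatever the missingness mechanism, the first practical step is to detect where values are missing; a minimal sketch with pandas, using an invented DataFrame for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical customer records with gaps
df = pd.DataFrame({
    "Salary": [42000, np.nan, 28530],
    "Role":   ["Manager", "CEO", np.nan],
})

# Count missing values per column
print(df.isna().sum())
```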
Record  Cust Id  Salary  Date of Birth  Role          Spouse
1       A121     42000   1985-30-05     Manager       Anjali
2       A122             07-02-1982     CEO           Priya
3       A123     28530   11-09-1987     Asst Manager
4       A124     32000   12/24/1986                   Rahul
5       A125     37450   09/07/1988     Secy          Bina
6       A126     37450   07-09-1987     Secretary     Sumit
8. Manual Input
Data inconsistency
Regional Formats
Numerical units
Wrong data types
File Manipulation
Missing anonymization
10. Data Cleaning
Data cleaning handles irrelevant or missing data.
Data is cleaned by filling in missing values, smoothing noisy data, and identifying and removing outliers.
It also resolves inconsistencies.
11. Filling Missing values
Replace missing values with Zeros
Dropping Rows with Missing Values
Replace missing value with Mean/Mode/Median
12. #Method 1 - Filling Every Missing Value with 0
print("\n\nEvery Missing Value Replaced with '0':")
print("--------------------------------------------")
print(df.fillna(0))
13. #Method 2 - Dropping Rows Having Missing Values
print("\n\nDropping Rows with Missing Values:")
print("----------------------------------------")
print(df.dropna())
14. #Method 3 - Replacing Missing Values with the Median Value
median = df['C01'].median()
df['C01'].fillna(median, inplace=True)
print("\n\nMissing Values for Column 1 Replaced with Median Value:")
print("--------------------------------------------------")
print(df)
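Slide 11 also lists mean and mode as replacement values; a similar sketch, assuming a small hypothetical DataFrame with a column named 'C01':

```python
import pandas as pd

df = pd.DataFrame({"C01": [10, None, 30, None, 10]})

# Replace missing values with the column mean
df["C01_mean"] = df["C01"].fillna(df["C01"].mean())

# Replace missing values with the column mode (most frequent value)
df["C01_mode"] = df["C01"].fillna(df["C01"].mode()[0])

print(df)
```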
16. One hot encoding
Label encoding assigns a numeric value to each categorical value. This is fine for ordinal
labels,
but nominal features do not have any order.
E.g.: the color values of cars have no order among themselves.
To avoid implying a false order, one hot encoding is used for nominal attributes.
It splits the column containing the nominal categorical data into as many columns as there are
categories present in that column. Each new column contains 1 or 0 according to
which category the row belongs to.
E.g. Color_of_cars = ['white', 'red', 'black']
The one hot encoding matrix will be
white red black
1 0 0
0 1 0
0 0 1
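This encoding can be sketched with pandas; the column name is assumed here for illustration:

```python
import pandas as pd

# Nominal attribute: car colors have no inherent order
df = pd.DataFrame({"Color_of_cars": ["white", "red", "black"]})

# One hot encoding: one 0/1 column per category
encoded = pd.get_dummies(df["Color_of_cars"])

print(encoded)
```

Note that `get_dummies` orders the new columns alphabetically by category name.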
17. Data Reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most
important information.
It reduces the data by removing unimportant and unwanted features from the dataset.
18. Data cube Aggregation
● Data cubes are multidimensional sets of data that can be stored in a spreadsheet.
● A data cube can have two, three, or more dimensions.
● Each dimension represents an attribute of interest.
● Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a
data cube to represent the original data set, thus achieving data reduction.
● Data cubes provide fast access to pre-computed, summarized data.
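Aggregating up one level of a cube can be sketched with a pandas groupby; the sales records and dimension names here are invented for illustration:

```python
import pandas as pd

# Hypothetical sales records: each column is a dimension of interest
sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023],
    "region": ["East", "West", "East", "West"],
    "amount": [100, 150, 120, 180],
})

# Aggregate away the region dimension: total sales per year
per_year = sales.groupby("year")["amount"].sum()
print(per_year)
```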
19. Numerosity Reduction
● Numerosity reduction is a technique used in data mining to reduce the number of data points in a
dataset while still preserving the most important information.
● It is a data reduction technique that replaces the original data with a smaller
form of data representation.
● This can be beneficial when the dataset is too large to be processed efficiently, or
when it contains a large number of irrelevant or redundant data points.
● There are two families of techniques for numerosity reduction: parametric and non-parametric methods.
● In parametric methods, the data is represented using a model. The model is used to estimate the
data, so that only the parameters of the model need to be stored instead of the actual data. Regression
and log-linear methods are used for creating such models.
● Non-parametric methods store reduced representations of the data, including histograms,
clustering, sampling and data cube aggregation.
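Sampling, one of the non-parametric methods listed above, can be sketched as follows; the dataset and sampling fraction are invented for illustration:

```python
import pandas as pd

# A large hypothetical dataset
df = pd.DataFrame({"value": range(10_000)})

# Keep a 1% random sample as the reduced representation
sample = df.sample(frac=0.01, random_state=42)

print(len(sample))  # 100 rows instead of 10,000
```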
20. Data Discretization
Data discretization techniques can be used to reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Discretization is the process through which continuous variables, models or functions are transformed into discrete ones.
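Dividing a continuous attribute's range into intervals can be sketched with `pandas.cut`; the ages, bin edges, and labels are assumed here for illustration:

```python
import pandas as pd

# Continuous attribute: ages
ages = pd.Series([3, 17, 25, 40, 67])

# Discretize into labelled intervals: (0, 18], (18, 65], (65, 100]
groups = pd.cut(ages, bins=[0, 18, 65, 100], labels=["child", "adult", "senior"])

print(groups.tolist())
```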