Data science engineering Preprocessing.pptx

What is Data Preprocessing?
• Data preprocessing is the process of transforming raw data into a
format that can be understood and analyzed. It is an important step in
data mining because raw data is rarely usable as-is. The quality of the
data should be checked before applying machine learning or data mining
algorithms.
• In machine learning, data preprocessing is critical for ensuring that
large datasets are formatted so that learning algorithms can parse and
interpret the data they contain.

• Preprocessing involves both data validation and data imputation.
• The goal of data validation is to assess whether the data in question is both
complete and accurate.
• The goal of data imputation is to correct errors and fill in missing values, either
manually or automatically through business process automation (BPA)
programming.

Is data Pre-processing important?

Why is Data pre-processing important?
• Preprocessing is mainly used to check data quality. Quality can be
assessed along the following dimensions:
• Accuracy: whether the data entered is correct.
• Completeness: whether all required data has been recorded and is
available.
• Consistency: whether copies of the same data kept in different places
match.
• Timeliness: whether the data is kept up to date.
• Believability: whether the data can be trusted.
• Interpretability: whether the data can be understood.
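
Some of the quality checks above can be automated. The following is a minimal sketch, assuming pandas is available; the `records` table, its column names, and the 0 to 120 age range are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical customer records; column names and valid ranges are assumptions.
records = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age":         [34, 29, 31, -5],
    "email":       ["a@example.com", None, "b@example.com", "c@example.com"],
})

# Completeness: share of non-missing values in each column.
completeness = records.notnull().mean()

# Accuracy: flag values outside a plausible range (0 to 120 is an assumed rule).
invalid_age = ~records["age"].between(0, 120)

# Consistency: the same customer_id should not carry conflicting ages.
ages_per_customer = records.groupby("customer_id")["age"].nunique()
inconsistent_ids = ages_per_customer[ages_per_customer > 1].index.tolist()

print(completeness)
print("Rows with implausible age:", invalid_age.sum())
print("Customer IDs with conflicting ages:", inconsistent_ids)
```

Timeliness, believability and interpretability usually need domain knowledge and are harder to reduce to a one-line check.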

• Data Cleaning: handling missing values and noisy data
• Data Integration: combining multiple datasets from different databases
• Data Transformation
• Data Reduction

Handling missing values:
1. Deleting the columns with missing data
2. Deleting the rows with missing data
3. Filling the missing data with a mean value (imputation)
4. Imputation with an additional column

If a column does not contribute enough to the model, we can remove it.
But this is an extreme measure and should only be used when the column
has many null values.
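
Deleting columns or rows with missing data might look like the following sketch. It assumes pandas; the toy dataframe, its column names, and the 0.5 threshold are illustrative choices, not values taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, 52],
    "income": [40000, 52000, np.nan, 61000, 58000],
    "notes":  [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Fraction of missing values in each column.
missing_share = data.isnull().mean()

# Drop columns where more than half of the values are missing
# (the 0.5 threshold is an arbitrary choice for this sketch).
threshold = 0.5
columns_to_drop = missing_share[missing_share > threshold].index
reduced = data.drop(columns=columns_to_drop)

# Drop any remaining rows that still contain a missing value.
complete_rows = reduced.dropna(axis=0)

print(reduced.columns.tolist())
print(len(complete_rows), "complete rows remain")
```

Dropping rows shrinks the dataset, so it tends to be reasonable only when few rows are affected.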

• The possible ways to fill in missing data are:
1. Filling the missing data with the mean or median value if it is a numerical variable.
2. Filling the missing data with the mode (the value that appears most often) if it is a categorical variable.
3. Filling the numerical value with 0, -999, or some other number that does not occur in the data.
This lets the model recognize that the value is not real or is different from the rest.
4. Filling the categorical value with a new category reserved for missing values.
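
The four filling strategies above can be sketched with pandas as follows. The dataframe and its column names are hypothetical, and the sentinel -999 and the label "Unknown" are example choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in numerical and categorical columns.
data = pd.DataFrame({
    "age":   [25.0, np.nan, 31.0, 47.0],
    "score": [3.5, 4.0, np.nan, 4.5],
    "city":  ["Pune", None, "Pune", "Delhi"],
    "plan":  ["basic", "pro", None, "basic"],
})

# 1. Numerical column: fill with the median (the mean works the same way).
age_filled = data["age"].fillna(data["age"].median())

# 2. Categorical column: fill with the mode, the most frequent value.
city_filled = data["city"].fillna(data["city"].mode()[0])

# 3. Numerical column: fill with a sentinel that does not occur in the data.
score_filled = data["score"].fillna(-999)

# 4. Categorical column: fill with a new category that marks missing values.
plan_filled = data["plan"].fillna("Unknown")

print(age_filled.tolist())
print(city_filled.tolist())
print(score_filled.tolist())
print(plan_filled.tolist())
```

A sentinel such as -999 distorts the mean and variance of a column, so it tends to suit tree-based models better than linear ones.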

Code:
• Checking for null values:
• Calculating mean, mode or median:
• Adding the corrected feature to dataframe named ‘data’ :
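
The three steps listed above could be sketched as follows. This is an illustration that assumes pandas, not the original slide code; the contents of `data` and the column names `age`, `gender` and `age_was_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe named `data`, matching the name used in the text.
data = pd.DataFrame({
    "age":    [22.0, np.nan, 35.0, 41.0, np.nan],
    "gender": ["F", "M", None, "F", "F"],
})

# Checking for null values: count of missing entries per column.
null_counts = data.isnull().sum()
print(null_counts)

# Calculating mean, median and mode.
age_mean = data["age"].mean()
age_median = data["age"].median()
gender_mode = data["gender"].mode()[0]

# Imputation with an additional column: record which rows were
# missing before they are filled in.
data["age_was_missing"] = data["age"].isnull()

# Adding the corrected features back to the dataframe `data`.
data["age"] = data["age"].fillna(age_mean)
data["gender"] = data["gender"].fillna(gender_mode)

print(data)
```

The indicator column keeps the information that a value was originally missing, which the filled-in column alone would hide.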
