The model interacts with the environment, seeking to maximize the reward, and receives feedback from the environment.
Data Preprocessing
Master’s Degree in Data Science - Advanced Methods in Machine Learning
Ángela Fernández Pascual
Escuela Politécnica Superior
Universidad Autónoma de Madrid
Academic Year 2024–25
Machine Learning Systems
1 Collecting data: measuring devices, acquisition sensors, databases,...
2 Preprocessing mechanisms
• Data cleaning
• Data transformation
• Dimensionality reduction
3 Learning algorithms ⇒ Models
• Supervised models
• Unsupervised models
• Semi-supervised models
4 Model evaluation
A. Fernández (EPS–UAM) Data Preprocessing Academic Year 2024–25 1 / 18
Data Preprocessing
It is a preliminary step in machine learning systems, in which raw data is transformed into information that our models can understand.
[Figure: garbage in, garbage out]
Perfect data + garbage model ⇒ garbage result.
Garbage data + perfect model ⇒ garbage result.
Steps in Data Preprocessing
1 Data auditing
The data are audited using statistical methods to detect anomalies and contradictions.
2 Data cleaning
The process of fixing or removing incorrect or incomplete data within a dataset.
3 Data integration
When data come from different sources, integration is needed to resolve conflicts:
• Different scales
• Conflicting names
• Duplicates or redundant information
4 Data transformation
The process of converting data from one format to another that the models can understand.
5 Dimensionality reduction
The process of finding a new representation of the data in a new space of lower dimension.
Data cleaning
Data Cleaning: Missing data (I)
Types of missing data
▶ Missing completely at random (MCAR)
▶ Missing at random (MAR)
▶ Missing not at random (MNAR)
How to deal with missing data?
▶ Removing information
▶ Filling the gaps
▶ Special techniques that perform a full analysis, e.g. the Expectation-Maximization algorithm
Missing data (II): removing information
▶ Delete the pattern
• When the target is missing
• When several attributes have missing values for that pattern
• When only a few patterns have missing data
▶ Delete the complete attribute
Missing data (III): filling the gaps
▶ Use a global constant
▶ Use the attribute mean
▶ Use the attribute mean over the samples of the same class or with a similar target
▶ Particular solutions for particular cases: e.g., interpolation is very common for time series
Be careful!
It can bias the data...
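Filling the gaps with the attribute mean can be sketched in a few lines of plain Python (an illustrative sketch; the function name and data are made up for the example):

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [25, None, 31, 28, None]
print(impute_mean(ages))  # gaps filled with (25+31+28)/3 = 28.0
```

Note that every imputed value is the same constant, which is exactly how this technique can bias the data: the attribute's variance shrinks.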
Data Cleaning: Denoising
Noise: random error or variance in a measured variable
▶ Noise from measurement tools
How to deal with noise?
▶ Binning
A way to group a number of more or less continuous
values into a smaller number of “bins”.
• Bin mean
• Bin median
• Bin boundary
▶ Autoencoders
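Binning with smoothing by bin means can be sketched as follows (equal-width bins; illustrative function name and data, not course material):

```python
def bin_means(values, n_bins):
    """Equal-width binning: each value is replaced by the mean of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Assign each value to a bin index in [0, n_bins - 1]
    idx = [min(int((v - lo) / width), n_bins - 1) for v in values]
    means = {}
    for i in set(idx):
        members = [v for v, j in zip(values, idx) if j == i]
        means[i] = sum(members) / len(members)
    return [means[i] for i in idx]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(bin_means(data, 3))
```

Smoothing by bin median or bin boundary works the same way, only the per-bin statistic changes.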
Data Cleaning: Outlier Detection
Outlier: a data point that differs significantly from other observations. It may indicate:
▶ data that have been coded incorrectly
▶ an experiment that was not run correctly
Methods
▶ Clustering-based
• Values falling outside the clusters may be considered as outliers
• Most outlier detection (OD) methods are based on this idea
• Examples: kNN-OD, one-class SVM, PCA-OD, LOF, . . .
▶ Regression-based
• Fit data to a function
• The new values given by the function are used instead of the original values
Python package: PyOD
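A minimal sketch of the kNN-OD idea: score each point by the distance to its k-th nearest neighbour, so points far from every cluster get high scores. This is illustrative only; PyOD provides production implementations:

```python
def knn_outlier_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest
    neighbour (kNN-OD idea): isolated points get large scores."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(pts)
print(max(range(len(pts)), key=lambda i: scores[i]))  # index 4 is the outlier
```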
Data Cleaning: Inconsistent Data
Inconsistency: discrepancies between attributes (two values in the data contradict each other)
▶ For correcting inconsistencies, domain knowledge or expert decisions are needed.
▶ Automatic routines can be developed to detect these cases.
Data transformation
Data transformation: Categorical variables (I)
Categorical variable: a variable which can take one of a limited set of possible values.
▶ In general, models cannot deal with this type of data
Ordinal encoding
▶ To convert categorical features (words) into integer codes.
One-hot encoding
▶ A categorical variable with n possible values is converted into n dichotomous (binary) variables.
▶ Each possible value of the categorical variable becomes a new attribute, which takes value 1 for the patterns in that category and 0 otherwise.
Data transformation: Categorical variables (II)
Exercise
Given a categorical attribute CarColor that can take the values red, black, or white.
1 Transform this attribute using an ordinal encoding.
2 Transform this attribute using one-hot encoding.
3 If we have a dataset where CarColor takes the values (red, red, white, red, black), what are the new transformed attributes in both cases?
Solution
1 CarColor now takes the values 1, 2, 3, where 1 means red, 2 means black and 3 means white.
2 We substitute the attribute CarColor by three new binary attributes named, for example, red, black and white.
3 The transformed attributes will be:
1 Ordinal encoding: CarColor = (1, 1, 3, 1, 2)
2 One-hot encoding: red = (1, 1, 0, 1, 0); black = (0, 0, 0, 0, 1); white = (0, 0, 1, 0, 0)
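The two encodings of the exercise can be reproduced in plain Python (illustrative helpers; in practice, scikit-learn's OrdinalEncoder and OneHotEncoder do this):

```python
def ordinal_encode(values, order):
    """Map each category to its (1-based) position in `order`."""
    code = {c: i + 1 for i, c in enumerate(order)}
    return [code[v] for v in values]

def one_hot_encode(values, categories):
    """One binary attribute per category, 1 where the sample has it."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}

cars = ["red", "red", "white", "red", "black"]
print(ordinal_encode(cars, ["red", "black", "white"]))  # [1, 1, 3, 1, 2]
print(one_hot_encode(cars, ["red", "black", "white"]))
```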
Data transformation: Normalization (I)
▶ Many machine learning algorithms require attributes with similar scale and variance.
Standardization
▶ Transform the variables into zero-mean and unitary-variance attributes.
▶ This only shifts and rescales the data; it does not change the shape of the distribution (if the original data were Gaussian, they become standard Gaussian).
In a set D, a data point x will be standardized as follows:
x̃ = (x − µD) / σD
Data transformation: Normalization (II)
Exercise
Given a dataset with 3 points and 2 attributes: S = {x1, x2, x3} where x1 = (1, 2), x2 = (3, 1), and x3 = (5, 3).
1 Compute the mean of the attributes.
2 Compute their standard deviation.
3 What are the transformed points in S̃?
Solution
1 Mean: µa1 = (1 + 3 + 5)/3 = 3; µa2 = (2 + 1 + 3)/3 = 2.
2 Deviation: σa1 = √(((1−3)² + (3−3)² + (5−3)²)/3) = 1.63; σa2 = √(((2−2)² + (1−2)² + (3−2)²)/3) = 0.82.
3 x̃1 = ((1−µa1)/σa1, (2−µa2)/σa2) = (−1.22, 0); x̃2 = ((3−µa1)/σa1, (1−µa2)/σa2) = (0, −1.22); x̃3 = ((5−µa1)/σa1, (3−µa2)/σa2) = (1.22, 1.22)
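The standardization in the exercise can be checked in plain Python (population standard deviation, as in the worked solution; the helper name is illustrative):

```python
def standardize(points):
    """Zero-mean, unit-variance transform, attribute by attribute
    (population standard deviation, as in the worked exercise)."""
    n, d = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    sigma = [(sum((p[j] - mu[j]) ** 2 for p in points) / n) ** 0.5
             for j in range(d)]
    return [tuple((p[j] - mu[j]) / sigma[j] for j in range(d)) for p in points]

S = [(1, 2), (3, 1), (5, 3)]
for x in standardize(S):
    print(tuple(round(v, 2) for v in x))
# prints (-1.22, 0.0), (0.0, -1.22), (1.22, 1.22)
```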
Data transformation: Normalization (III)
Scaling
▶ Transform the variables by scaling them to lie between a given minimum and maximum value.
▶ Typical intervals are [0, 1] or [−1, 1].
▶ In this case, the transformed data is more robust to very small standard deviations and preserves zero entries in sparse
data.
In a set D, a data point x will be scaled into a new interval [a, b] as follows:
x̃ = a + (b − a) · (x − minD) / (maxD − minD)
Data transformation: Normalization (IV)
Exercise
Given the same dataset with 3 points and 2 attributes: S = {x1, x2, x3} where x1 = (1, 2), x2 = (3, 1), and x3 = (5, 3).
1 Compute the min and max of the attributes.
2 What are the transformed points S̃ in this case, scaling into the interval [0, 1]?
Solution
1 min = (1, 1), max = (5, 3).
2 With a = 0 and b = 1, the formula reduces to x̃ = (x − minD)/(maxD − minD):
x̃1 = ((1−1)/(5−1), (2−1)/(3−1)) = (0, 1/2); x̃2 = ((3−1)/(5−1), (1−1)/(3−1)) = (1/2, 0); x̃3 = ((5−1)/(5−1), (3−1)/(3−1)) = (1, 1)
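The scaling exercise can likewise be checked in plain Python (illustrative helper; scikit-learn's MinMaxScaler implements the same transform):

```python
def minmax_scale(points, a=0.0, b=1.0):
    """Scale each attribute linearly into the interval [a, b]."""
    d = len(points[0])
    lo = [min(p[j] for p in points) for j in range(d)]
    hi = [max(p[j] for p in points) for j in range(d)]
    return [
        tuple(a + (b - a) * (p[j] - lo[j]) / (hi[j] - lo[j]) for j in range(d))
        for p in points
    ]

S = [(1, 2), (3, 1), (5, 3)]
print(minmax_scale(S))  # [(0.0, 0.5), (0.5, 0.0), (1.0, 1.0)]
```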
Data transformation: New features
▶ It can be useful to define synthetic variables.
▶ For example: combining attributes through some interesting expression.
▶ Expert knowledge is needed.
False predictors
A false predictor is a variable that is strongly correlated with the output class, but that is not available in a realistic
prediction scenario.
▶ It is necessary to eliminate them.
Data transformation: Class balancing
Imbalanced data: the class distribution is skewed, so that the number of samples in one class (the majority class) amply surpasses the number of samples in another (the minority class).
How to deal with class imbalance?
▶ Oversampling in the minority class.
• Generate more samples, e.g. by replicating existing ones or creating synthetic ones.
▶ Undersampling in the majority class.
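Random oversampling of the minority class can be sketched as follows (an illustrative sketch; libraries such as imbalanced-learn offer more elaborate methods, e.g. SMOTE):

```python
import random

def random_oversample(X, y):
    """Duplicate random minority-class samples until all classes
    have as many samples as the majority class."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())
    X_out, y_out = [], []
    for label, samples in by_class.items():
        extra = [random.choice(samples) for _ in range(target - len(samples))]
        for xi in samples + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

X = [[0], [1], [2], [3], [9]]
y = [0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y)
print(sorted(yb))  # [0, 0, 0, 0, 1, 1, 1, 1] — now balanced
```

Undersampling is the mirror image: randomly discard majority-class samples down to the minority-class size.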
Summary
Summing up...
▶ Preprocessing is an important step in a machine learning system.
▶ It is the first step after collecting data, and it can be very time-consuming.
▶ Several phases:
• Data cleaning: missing values, denoising, outlier detection, inconsistencies
• Data integration: conflicting scales, names, duplicate information
• Data transformation: categorical attributes, normalization, class balancing
• Dimensionality reduction