Modul Topik 4 - Kecerdasan Buatan.pdf

Topik 4
Konsep Transformasi Data, Ekstraksi Fitur,
dan Seleksi Fitur Dalam Machine Learning
Dr. Sunu Wibirama
Modul Kuliah Kecerdasan Buatan
Kode mata kuliah: UGMx 001001132012
June 13, 2022

June 13, 2022
1 Capaian Pembelajaran Mata Kuliah
Topik ini akan memenuhi CPMK 4, yakni mampu mendefinisikan konsep dasar trans-
formasi data dan seleksi fitur (feature selection) untuk machine learning.
Adapun indikator tercapainya CPMK tersebut adalah mampu memahami konsep data
preparation, data cleansing, dan feature selection serta teknik-teknik yang lazim digunakan
dalam machine learning.
2 Cakupan Materi
Cakupan materi dalam topik ini sebagai berikut:
a) Introduction to Data Preparation for Machine Learning: materi ini menjelaskan alasan-
alasan pentingnya melakukan persiapan awal sebelum menggunakan dataset dalam
machine learning. Pada materi ini juga dijelaskan langkah-langkah praktis untuk
mendapatkan data yang akan digunakan pada proses machine learning.
b) Overview of Data Preparation: materi ini menjelaskan teknik-teknik dasar yang akan
digunakan dalam mempersiapkan data, misalnya data cleaning, feature selection, data
transforms, feature engineering, dan dimensionality reduction.
c) Data Cleaning: materi ini menjelaskan konsep-konsep dasar data cleaning, yakni
mengidentifikasi dan mengoreksi kesalahan dalam data. Pada materi ini dijelaskan
konsep untuk mengidentifikasi kolom yang memiliki single value menggunakan pem-
rograman Python. Selain itu, materi ini juga menjelaskan cara-cara mengidentifikasi
outliers dalam data dengan menggunakan metode statistika seperti halnya standard
deviation atau interquartile range.
d) Feature Selection: materi ini menjelaskan teknik-teknik dasar pemilihan fitur. Hal
penting yang perlu diperhatikan dalam proses pemilihan fitur adalah melihat tipe data
pada masukan (input) dan luaran (output) algoritme machine learning. Pada materi
ini juga akan dijelaskan teknik Recursive Feature Elimination (RFE) dan Feature
Importance untuk memilih fitur pada proses machine learning.
e) Data Transforms: materi ini akan menjelaskan teknik-teknik dasar transformasi data,
diantaranya data normalization dan quantile transforms. Data normalization digu-
nakan untuk melakukan normalisasi data pada level individu atau elemen dataset.
Sementara itu, quantile transforms digunakan untuk mengubah distribusi data men-
jadi distribusi normal atau distribusi uniform.
f) Dimensionality Reduction: materi ini akan terbagi menjadi dua bagian, yakni penge-
nalan Principal Component Analysis (PCA) dan implementasi PCA. Pada bagian per-
tama, akan dijelaskan konsep dasar PCA, eigenvalues, dan eigenvector. Pada bagian
kedua, akan dijelaskan langkah-langkah praktis implementasi PCA dan aplikasinya
dengan pemrograman Python.
1

07/06/2022
sunu@ugm.ac.id
Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1
Sunu Wibirama
sunu@ugm.ac.id
Department of Electrical and Information Engineering
Faculty of Engineering
Universitas Gadjah Mada
INDONESIA
Introduction to Data Preparation for Machine Learning
Kecerdasan Buatan | Artificial Intelligence
Version: January 2022
sunu@ugm.ac.id
Why data preparation
• Data preparation / data preprocessing: the act of transforming raw
data into a form that is appropriate for modeling.
• Data preparation is the most important part and the most difficult
process in machine learning project.
• Most time consuming, but it is the least discussed topic.
• The challenge of data preparation is that each dataset is unique
and different for each project:
• The number of variables (tens, hundreds, thousands, or more)
• The types of the variables (numeric, nominal, ordinal, ratio)
• The scale of the variables
• The drift in the values overtime

07/06/2022
sunu@ugm.ac.id
From raw data to insights
“…. the right features can only be defined in the context of both the model and the data; since data and
models are so diverse, it's difficult to generalize the practice of feature engineering across projects”
(Page vii, Feature Engineering for Machine Learning, 2018.)
Courtesy: Sanvendra Singh (2019)
sunu@ugm.ac.id
Raw data can’t be used directly
• Machine learning algorithms require data
to be numbers.
• Some machine learning algorithms
impose requirements on the data.
• Statistical noise and errors in the data
may need to be corrected.
• Complex nonlinear relationships may be
teased out of the data.
Courtesy: Sanvendra Singh (2019)

07/06/2022
sunu@ugm.ac.id
Standard tasks during data preparation
• Data cleaning: identifying and correcting
mistakes or errors in the data.
• Feature selection: identifying those input
variables that are most relevant to the task.
• Data transforms: changing the scale or
distribution of variables.
• Feature engineering: deriving new
variables from available data.
• Dimensionality reduction: creating
compact projections of the data. Courtesy: Akira Takezawa (2019)
sunu@ugm.ac.id
Before preparing our data
• Gather data from the problem domain.
• Discuss the project with subject matter experts.
• Select those variables to be used as inputs and
outputs for a predictive model.
• Review the data that has been collected.
• Summarize the collected data using statistical
methods.
• Visualize the collected data using plots and
charts.

07/06/2022
sunu@ugm.ac.id
7
End of File

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Overview of Data Preparation (Part 01)
sunu@ugm.ac.id
Standard tasks during data preparation
• Data cleaning: identifying and correcting
mistakes or errors in the data.
• Feature selection: identifying those input
variables that are most relevant to the task.
• Data transforms: changing the scale or
distribution of variables.
• Feature engineering: deriving new
variables from available data.
• Dimensionality reduction: creating
compact projections of the data. Courtesy: Akira Takezawa (2019)

07/06/2022
sunu@ugm.ac.id
Data cleaning
• The most useful data cleaning involves
deep domain expertise and could involve
identifying and addressing specific
observations that may be incorrect.
• There are many reasons data may have
incorrect values, such as being mistyped,
corrupted, duplicated, and so on.
• Domain expertise may allow obviously
erroneous observations to be identified as
they are different from what is expected (a
person's height of 60 meters.
sunu@ugm.ac.id
General data cleaning operations
• Using statistics to define normal data
and identify outliers
• Identifying columns that have the same
value or no variance and removing
them
• Identifying duplicate rows of data and
removing them.
• Marking empty values as missing.
• Imputing missing values using statistics
or a learned model
Courtesy: Jason Brownlee (2020)

07/06/2022
sunu@ugm.ac.id
Feature selection
• Feature selection refers to techniques for
selecting a subset of input features that are most
relevant to the target variable that is being
predicted.
• This is important as irrelevant and redundant
input variables can distract or mislead learning
algorithms possibly resulting in lower predictive
performance.
• Additionally, it is desirable to develop models
only using the data that is required to make a
prediction, e.g. to favor the simplest possible well
performing model.
sunu@ugm.ac.id
Feature selection
• Feature selection techniques may generally
grouped into those that use the target variable
(supervised) and those that do not
(unsupervised).
• The supervised techniques can be further
divided into:
• models that automatically select features
as part of fitting the model (intrinsic)
• those that explicitly choose features that
result in the best performing model
(wrapper)
• those that score each input feature and
allow a subset to be selected (filter)
Courtesy: Jason Brownlee (2020)

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Overview of Data Preparation (Part 02)
sunu@ugm.ac.id
Data transforms
• Data transforms are used to change the type
or distribution of data variables.
• Numeric data type: number values.
• Integer: integers with no fractional part.
• Float: floating point values.
• Categorical data type: label values.
• Ordinal: labels with a rank ordering.
• Nominal: labels with no rank ordering.
• Boolean: values True and False.

07/06/2022
sunu@ugm.ac.id
Some techniques of data transforms
• Discretization transform: encode a numeric
variable as an ordinal variable
• Ordinal transform: encode a categorical
variable into an integer variable
• One hot transform: encode a categorical
variable into binary variables
• Normalization transform: scale a variable to
the range 0 and 1
• Standardization transform: scale a variable to
a standard Gaussian
• Power transform: change the distribution of a
variable to be more Gaussian
sunu@ugm.ac.id
Feature engineering
• Feature engineering refers to the
process of creating new input variables
from the available data.
• Engineering new features is highly
specific to your data and data types. As
such, it often requires the collaboration
of a subject matter expert to help identify
new features that could be constructed
from the data.
• This specialization makes it a
challenging topic to generalize to
general methods.

07/06/2022
sunu@ugm.ac.id
Some techniques of feature engineering
• There are some techniques that can be
used in feature engineering:
• Adding a Boolean flag variable for
some state.
• Adding a group or global summary
statistic, such as a mean.
• Adding new variables for each
component of a compound variable,
such as a date-time.
• Polynomial Transform: Create copies
of numerical input variables that are
raised to a power
sunu@ugm.ac.id
Dimensionality reduction
• The number of input features for a dataset may be
considered the dimensionality of the data.
• Two input variables together can define a two-
dimensional area where each row of data defines a
point in that space.
• The problem is, the more dimensions this space has
(e.g. the more input variables), the more likely it is
that the dataset represents a very sparse and likely
unrepresentative sampling of that space (curse of
dimensionality)
• An alternative to feature selection is to create a
projection of the data into a lower-dimensional space
that still preserves the most important properties of
the original data.

07/06/2022
sunu@ugm.ac.id
Some techniques of dimensionality reduction
• The most common approach to dimensionality reduction is
to use a matrix factorization technique:
• Principal Component Analysis (PCA).
• Singular Value Decomposition (SVD).
• Other approaches with model-based methods:
• linear discriminant analysis
• autoencoders.
• Sometimes manifold learning algorithms can also be used:
• Kohonen self-organizing maps (SOME)
• t-Distributed Stochastic Neighbor Embedding (t-SNE).
A d-dimensional manifold is a part of an n-dimensional space (d<n) that locally resembles a d-
dimensional hyperplane. Manifold Learning can be thought of as an attempt to generalize linear
frameworks like PCA to be sensitive to non-linear structure in data.
sunu@ugm.ac.id
8
End of File

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Data Cleaning (Part 01)
sunu@ugm.ac.id
Data cleaning in machine learning project
• Before jumping to the sophisticated methods,
there are some very basic data cleaning
operations that you probably should perform on
every single machine learning project.
• Although some techniques seem very basic,
they are so critical.
• If you skip this step, models may break or report
overly optimistic performance results.
• Our goal: identifying and correcting mistakes or
errors in the data.

07/06/2022
sunu@ugm.ac.id
Data cleaning
Our goal: identifying and correcting mistakes or errors in the data.
sunu@ugm.ac.id
Our dataset
• For short demonstration, we will use a dataset from Kubat et al.
(1998).
• The paper describes an application of machine learning to an
important environmental problem: detection of oil spills from
radar images of the sea surface.
• The task involves predicting whether the patch contains an oil
spill or not, e.g. from the illegal or accidental dumping of oil in
the ocean, given a vector that describes the contents of a patch
of a satellite image.
• There are 937 cases. Each case is comprised of 48 numerical
computer vision derived features, a patch number, and a class
label.
• The normal case is no oil spill assigned the class label of 0,
whereas an oil spill is indicated by a class label of 1. There are
896 cases for no oil spill and 41 cases of an oil spill.

07/06/2022
sunu@ugm.ac.id
Identifying columns that contain a single value
• Columns that have a single observation or value are probably
useless for modelling.
• These columns or features or predictors are referred to zero-
variance predictors as if we measured the variance (average
value from the mean), it would be zero.
• A single value means that each row for that column has the same
value.
• Columns that have a single value for all rows do not contain any
information for modelling.
• Depending on the choice of data preparation and modelling
algorithms, variables with a single value can also cause errors or
unexpected results.
sunu@ugm.ac.id
Python code
• We will use Python to demonstrate technical steps in
detecting single value columns
• You can detect rows that have this property using the
unique() NumPy function that will report the number
of unique values in each column.
• A simpler approach is to use the nunique() Pandas
function that does the hard work for you. Below is the
same example using the Pandas function.
• The example loads the oil-spill classification dataset
that contains 50 variables (48 numerical computer
vision derived features, a patch number, and a class
label).
• The the code summarizes the number of unique
values for each column.

07/06/2022
sunu@ugm.ac.id
We will see that column
index 22 only has a
single value and should
be removed.
sunu@ugm.ac.id
Deleting column with a single unique value
• Columns are relatively easy to remove from a NumPy
array or Pandas DataFrame.
• One approach is to record all columns that have a single
unique value, then delete them from the Pandas
DataFrame by calling the drop() function.
• Running the example first loads the dataset and reports
the number of rows and columns.
• The number of unique values for each column is
calculated, and those columns that have a single unique
value are identified. In this case, column index 22.
• The identified columns are then removed from the
DataFrame, and the number of rows and columns in the
DataFrame are reported to confirm the change.

07/06/2022
sunu@ugm.ac.id
9
End of File

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Data Cleaning (Part 02)
sunu@ugm.ac.id
Outlier identification and removal
• When modelling, it is important to clean the data sample to
ensure that the observations best represent the problem.
• Sometimes a dataset can contain extreme values that are
outside the range of what is expected and unlike the other
data.
• These are called outliers and often machine learning
modelling and model skill in general can be improved by
understanding and even removing these outlier values.
• Outliers can have many causes, such as:
• Measurement or input error.
• Data corruption.
• True outlier observation.
(Courtesy: Ou Zhang)

07/06/2022
sunu@ugm.ac.id
Detecting outliers
• There is no precise way to define and
identify outliers in general because of the
specifics of each dataset.
• Instead, you, or a domain expert, must
interpret the raw observations and
decide whether a value is an outlier or
not.
• We can use statistical methods to
identify observations that appear to be
rare or unlikely given the available data.
sunu@ugm.ac.id
Dataset
• We will generate a population 10,000 random numbers drawn from a
Gaussian distribution with a mean of 50 and a standard deviation of 5.
• Numbers drawn from a Gaussian distribution will have outliers. That is, by
virtue of the distribution itself, there will be a few values that will be a long
way from the mean, rare values that we can identify as outliers.
• We will use the randn() function to generate random Gaussian values
with a mean of 0 and a standard deviation of 1, then multiply the results
by our own standard deviation and add the mean to shift the values into
the preferred range.

07/06/2022
sunu@ugm.ac.id
Dataset
Running the example generates the sample and then prints the mean and
standard deviation. As expected, the values are very close to the expected values.
sunu@ugm.ac.id
Standard deviation of Gaussian distribution
• The Gaussian distribution has the property that the
standard deviation from the mean can be used to reliably
summarize the percentage of values in the sample.
• For example, within one standard deviation of the mean
will cover 68 percent of the data.
• So, if the mean is 50 and the standard deviation is 5, as in
the test dataset above, then all data in the sample
between 45 and 55 will account for about 68 percent of
the data sample.
• A value that falls outside of 3 standard deviations is part
of the distribution, but it is an unlikely or rare event at
approximately 1 in 370 samples.
• Three standard deviations from the mean is a common
cut-off in practice for identifying outliers in a Gaussian or
Gaussian-like distribution.

07/06/2022
sunu@ugm.ac.id
Removing outliers
• We can calculate the mean and standard
deviation of a given sample, then calculate
the cut-off for identifying outliers as more
than 3 standard deviations from the mean.
• We can then identify outliers as those
examples that fall outside of the defined
lower and upper limits.
• Running the example will first print the
number of identified outliers and then the
number of observations that are not
outliers, demonstrating how to identify and
filter out outliers respectively.
sunu@ugm.ac.id
Interquartile Range method
• Not all data is normal or normal enough to treat it as being
drawn from a Gaussian distribution.
• A good statistic for summarizing a non-Gaussian
distribution sample of data is the Interquartile Range
(IQR).
• The IQR is calculated as the difference between the 75th
and the 25th percentiles of the data.
• We refer to the percentiles as quartiles (quart meaning 4)
because the data is divided into four groups via the 25th,
50th and 75th values.
• The IQR can be used to identify outliers by defining limits
on the sample values that are a factor k of the IQR below
the 25th percentile or above the 75th percentile.
• The common value for the factor k is the value 1.5. A
factor k of 3 or more can be used to identify values that
are extreme outliers
Courtesy: Perez and Tah

07/06/2022
sunu@ugm.ac.id
Interquartile Range method
• We can calculate the percentiles of a dataset
using the percentile() NumPy function that
takes the dataset and specification of the
desired percentile.
• The IQR can then be calculated as the
difference between the 75th and 25th
percentiles.
• We can then calculate the cutoff for outliers as
1.5 times the IQR and subtract this cut-off from
the 25th percentile and add it to the 75th
percentile to give the actual limits on the data.
• We can then use these limits to identify the
outlier values.
sunu@ugm.ac.id
10
End of File

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Feature Selection (Part 01)
sunu@ugm.ac.id
Feature selection
• Feature selection is the process of reducing the
number of input variables when developing a
predictive model.
• This step is important to reduce the computational
cost of modelling and, in many cases, to improve
the performance of the model.
• Statistical-based feature selection methods
involve evaluating the relationship between each
input variable and the target variable using
statistics
• Then, we select those input variables that have
the strongest relationship with the target variable.

07/06/2022
sunu@ugm.ac.id
Feature selection
• One way to think about feature selection methods are in
terms of supervised and unsupervised methods:
• Unsupervised selection: do not use the target
variable (e.g. remove redundant variables).
• Supervised selection: use the target variable (e.g.
remove irrelevant variables).
• Supervised feature selection methods may further be
classified into three groups:
• Intrinsic: algorithms that perform automatic feature
selection during training.
• Filter: select subsets of features based on their
relationship with the target.
• Wrapper: search subsets of features that perform
according to a predictive model.
sunu@ugm.ac.id
Statistics for feature selection
• It is common to use correlation type statistical measures
between input and output variables as the basis for filter
feature selection.
• However, the choice of statistical measures is highly
dependent upon the variable data types.
• Common data types include numerical (such as height) and
categorical (such as a label).
• Input variable: variables used as input to a predictive model.
• Output variable: variables output or predicted by a model:
• Numerical output: regression predictive modelling
problem.
• Categorical output: classification predictive modelling
problem.

07/06/2022
sunu@ugm.ac.id
Numerical input
• Numerical input and numerical output:
This is a regression predictive modelling problem
with numerical input variables:
• Pearson's correlation coefficient (linear).
• Spearman's rank coefficient (nonlinear).
• Numerical input and categorical output:
This is a classification predictive modelling
problem with numerical input variables. This
might be the most common example of a
classification problem:
• ANOVA correlation coefficient (linear).
• Kendall's rank coefficient (nonlinear).
sunu@ugm.ac.id
Categorical input
• Categorical input, numerical output:
• This is a regression predictive
modelling problem with categorical
input variables.
• This is a strange example of a
regression problem (e.g. you would not
encounter it often).
• You can use the same numerical input,
categorical output methods (described
previously), but in reverse.

07/06/2022
sunu@ugm.ac.id
Categorical input
• Categorical input, categorical output:
• This is a classification predictive
modelling problem with categorical
input variables.
• The most common correlation
measure for categorical data is the chi-
squared test.
• You can also use mutual information
(information gain) from the field of
information theory.
sunu@ugm.ac.id
8
End of File

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
sunu@ugm.ac.id
Recursive Feature Elimination (RFE)
• RFE is a wrapper-type feature selection algorithm.
• This means that a different machine learning
algorithm is given and used in the core of the
method, is wrapped by RFE, and used to help
select features.
• RFE works by searching for a subset of features
by starting with all features in the training dataset
and successfully removing features until the
desired number remains.
• This is achieved by fitting the given machine
learning algorithm used in the core of the model,
ranking features by importance, discarding the
least important features, and re-fitting the model.
(Source: scikit-learn documentation)

07/06/2022
sunu@ugm.ac.id
Recursive Feature Elimination (RFE)
Source: Ravishankar, et al. (2016)
sunu@ugm.ac.id
RFE with scikit-learn library
• Scikit-learn is a machine learning library with the
following features:
• simple and efficient tools for predictive data
analysis
• accessible to everybody, and reusable in
various contexts
• built on NumPy, SciPy, and matplotlib
• open source, commercially usable - BSD
license
• funded by several companies and universities

07/06/2022
sunu@ugm.ac.id
• To use it, first the class is configured with the
chosen algorithm specified via the estimator
argument and the number of features to select via
the n_features_to_select argument.
• RFE requires a nested algorithm that is used to
provide the feature importance scores, such as a
Decision Tree.
• The nested algorithm used in RFE does not have
to be the algorithm that is fit on the selected
features; different algorithms can be used.
sunu@ugm.ac.id
Decision Tree
• Root node : is the first node in decision trees
• Splitting : is a process of dividing node into two or
more sub-nodes, starting from the root node
• Node : splitting results from the root node into sub-
nodes and splitting sub-nodes into further sub-
nodes
• Leaf or terminal node : end of a node, since node
cannot be split anymore
• Branch / Sub-Tree : A subsection of the entire tree
is called branch or sub-tree.
• Parent and Child Node: A node, which is divided
into sub-nodes is called parent node of sub-nodes
whereas sub-nodes are the child of parent node.
Source: https://medium.com/@arifromadhan19/the-basics-of-decision-trees-e5837cc2aba7

07/06/2022
sunu@ugm.ac.id
sunu@ugm.ac.id

07/06/2022
sunu@ugm.ac.id
sunu@ugm.ac.id
• A box and whisker plot is created for the
distribution of accuracy scores for each
configured number of features.
• We can see that performance improves
as the number of features increase
• The performance peaks around 4-to-7
features as we might expect, given that
only five features are relevant to the
target variable.
• We can see that using 7 features yields
most accurate accuracy score (0.885)
Accuracy
Number of features

07/06/2022
sunu@ugm.ac.id
13
End of File

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
sunu@ugm.ac.id
Feature importance
• Feature importance refers to techniques that
assign a score to input features based on how
useful they are at predicting a target variable.
• There are many types and sources of feature
importance scores.
• Popular examples:
• statistical correlation scores
• coefficients calculated as part of linear models
• decision trees
• permutation importance scores
Source: https://medium.com/analytics-vidhya/ranking-features-based-on-importance-predictive-power-with-respect-to-the-class-labels-of-the-25afaed71e90

07/06/2022
sunu@ugm.ac.id
Why feature importance score is useful?
Better understanding the data:
• The relative scores can highlight which
features may be most relevant to the
target, and the converse, which features
are the least relevant.
• This may be interpreted by a domain
expert and could be used as the basis for
gathering more or different data.
sunu@ugm.ac.id
Better understanding the model:
• Most importance scores are calculated by
a predictive model that has been fit on the
dataset.
• Inspecting the importance score provides
insight into that specific model and which
features are the most important and least
important to the model when making a
prediction.
• This is a type of model interpretation that
can be performed for those models that
support it.

07/06/2022
sunu@ugm.ac.id
Reducing the number of input features:
• This can be achieved by using the
importance scores to select those features
to delete (lowest scores) or those features
to keep (highest scores).
• This is a type of feature selection and can
simplify the problem that is being
modelled, speed up the modelling process
(deleting features is called dimensionality
reduction), and in some cases, improve
the performance of the model.
sunu@ugm.ac.id
Feature importance in Linear Regression

07/06/2022
sunu@ugm.ac.id
Feature importance in Linear Regression
• The scores suggest that the model found
the five important features and marked
all other features with a zero coefficient,
essentially removing them from the
model.
• A bar chart is then created for the feature
importance scores.
sunu@ugm.ac.id
Feature importance in Decision Tree

07/06/2022
sunu@ugm.ac.id
Feature importance in Decision Tree
• Running the example fits the model, then
reports the coefficient value for each
feature.
• The results suggest perhaps three of the 10
features as being important to prediction:
feature 4, 5, and 6.
• Note: result of each code execution may
vary given the stochastic nature of the
algorithm or evaluation procedure, or
differences in numerical precision.
• The code should be run a few times and we
can compare the average outcome.
sunu@ugm.ac.id
10
End of File

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Data Transforms (Part 01)
sunu@ugm.ac.id
The scale of your data is important
• Machine learning models learn a mapping from
input variables to an output variable.
• Unfortunately, the scale and distribution of the
data drawn from the domain may be different for
each variable.
• Input variables may have different units (e.g.
feet, kilometers, and hours) that, in turn, may
mean the variables have different scales.
• Differences in the scales across input variables
may increase the difficulty of the problem being
modelled.
Courtesy: Lagabrielle et al. (2018)

07/06/2022
sunu@ugm.ac.id
What ML algorithms affected by scale of data?
• Algorithms that fit a model that use a weighted sum of input
variables: :
• linear regression
• logistic regression
• artificial neural networks (deep learning).
• Algorithms that use distance measures between examples:
• k-nearest neighbors
• support vector machines.
• It can also be a good idea to scale the target variable for
regression predictive modelling problems to make the problem
easier to learn, most notably in the case of neural network
models.
• A target variable with a large spread of values, in turn, may result
in large error gradient values causing weight values to change
dramatically, making the learning process unstable.
sunu@ugm.ac.id
Data normalization
• Normalization is a rescaling of the data from the original range so that all
values are within the new range of 0 and 1.
• Normalization requires that you know or are able to accurately estimate the
minimum and maximum observable values.
• You may be able to estimate these values from your available data.
• A value is normalized as follows:
=
−
−
where the minimum and maximum values pertain to the value x being
normalized.

07/06/2022
sunu@ugm.ac.id
Data normalization
• For example, for a dataset, we could guesstimate the min and max observable values as
30 and -10. We can then normalize any value, like 18.8, as follows:
=
−
−
=
18.8 − (−10)
30 − (−1))
=
28.8
40
= 0.72
• You can see that if an value is provided that is outside the bounds of the minimum and
maximum values, the resulting value will not be in the range of 0 and 1.
sunu@ugm.ac.id
Data normalization in scikit-learn library
• You can normalize your dataset using the scikit-learn object
MinMaxScaler. Good practice usage with the MinMaxScaler and
other scaling techniques is as follows:
• Fit the scaler using available training data: for normalization, this
means the training data will be used to estimate the minimum and
maximum observable values. This is done by calling the fit() function.
• Apply the scale to training data: this means you can use the
normalized data to train your model. This is done by calling the
transform() function.
• Apply the scale to data going forward: this means you can prepare
new data in the future on which you want to make predictions.

07/06/2022
sunu@ugm.ac.id
sunu@ugm.ac.id
• Running the example first reports the raw
dataset, showing 2 columns with 5 rows.
• The values are in scientific notation which can
be hard to read if you’re not used to it.
• Next, the scaler is defined, fit on the whole
dataset and then used to create a transformed
version of the dataset with each column
normalized independently.
• We can see that the largest raw value for each
column now has the value 1.0 and the smallest
value for each column now has the value 0.0.
Results of
scaling
Results of
normalization

07/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Data Transforms (Part 02)
sunu@ugm.ac.id
What is a quantile?
• Wikipedia:
“In statistics and probability, quantiles are cut
points dividing the range of a probability
distribution into continuous intervals with equal
probabilities, or dividing the observations in a
sample in the same way”.
• 2 quantiles  median
• 4 quantiles  quartiles
• 100 quantiles  percentiles
Probability density of a normal distribution, with quartiles shown. The area below
the red curve is the same in the intervals (−∞,Q1), (Q1,Q2), (Q2,Q3), and (Q3,+∞).

07/06/2022
sunu@ugm.ac.id
Quantiles on a Cumulative Distribution Function (CDF)
Percent (y-axis) of data that is at or below a given value on x-axis
Source: https://www.youtube.com/watch?v=ByjPLoxQAZk
sunu@ugm.ac.id
Quantiles on Iris dataset

07/06/2022
sunu@ugm.ac.id
sunu@ugm.ac.id

07/06/2022
sunu@ugm.ac.id
Quantile function
Quantile function is an inverse of cumulative distribution function (CDF)
sunu@ugm.ac.id
Non-standard data distribution
• Numerical input variables may have a
highly skewed or non-standard
distribution.
• This could be caused by outliers in the
data, multi-modal distributions, highly
exponential distributions, and so on.
• Many machine learning algorithms
prefer or perform better when numerical
input variables.
• Even output variables in the case of
regression have a standard probability
distribution, such as a Gaussian
(normal) or a uniform distribution
Source: https://www.biologyforlife.com/skew.html

07/06/2022
sunu@ugm.ac.id
Quantile transforms
sunu@ugm.ac.id
Why quantile transforms?
• Improving accuracy of machine learning model
• Perform monotonic transformation of features  preserve rank of
values
• Robust  less susceptible to outliers
• Disadvantage: distorts correlation and distances within and across
features.

07/06/2022
sunu@ugm.ac.id
Quantile transforms in scikit-learn
sunu@ugm.ac.id
Results of quantile transforms
• Running the example first creates a sample of
1,000 random Gaussian values and adds a skew
to the dataset.
• A histogram is created from the skewed dataset
and clearly shows the distribution pushed to the
far left.
• Then a QuantileTransformer is used to map the
data to a Gaussian distribution and standardize
the result, centering the values on the mean value
of 0 and a standard deviation of 1.0.
• A histogram of the transform data is created
showing a Gaussian shaped data distribution.

07/06/2022
sunu@ugm.ac.id
13
End of File
sunu@ugm.ac.id
The scale of your data is important
• Machine learning models learn a mapping from
input variables to an output variable.
• Unfortunately, the scale and distribution of the
data drawn from the domain may be different for
each variable.
• Input variables may have different units (e.g.
feet, kilometers, and hours) that, in turn, may
mean the variables have different scales.
• Differences in the scales across input variables
may increase the difficulty of the problem being
modelled.
Courtesy: Lagabrielle et al. (2018)

08/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Introduction to PCA (Part 01)
sunu@ugm.ac.id
Eigendecomposition and Principal Component Analysis (PCA)
• Principal Component Analysis
(PCA) is used for dimensionality
reduction
• Example:
• We want to reduce data from
2D space to 1D.
• However, we don’t want to
loose important information
from our features.
• We transform the data to be
aligned with the most important
direction (red) and remove less
important direction (green).
Source: https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

08/06/2022
sunu@ugm.ac.id
Why using Principal Component Analysis?
• You find that features in your data are highly
correlated each other  multicollinearity
• Ideal case: features should be independent
from each other.
• Solving this issue is super important,
because working with highly correlated
features in multivariate regression models
can lead to inaccurate results.
• To understand PCA, we have to learn about
eigenvalues and eigenvectors
sunu@ugm.ac.id
Eigenvectors & Eigenvalues
• Eigen in German means distinctive, characteristic, particular
to person or place.
• Eigenvalue and eigenvector are important for matrix
decomposition aimed at reducing dimensionality without losing
much information and reducing computational cost of matrix
processing.
• An eigenvector of the matrix is a vector that is contracted or
elongated when transformed by the matrix
• The eigenvalue is the scaling factor by which the vector is
contracted or elongated:
(a) If the scaling factor is positive, the directions of the initial
and the transformed vectors are the same:
(b) If the scaling factor is negative, their directions are reversed. Matrix acts by stretching the vector , not changing its
direction, so is an eigenvector of .
x
x

08/06/2022
sunu@ugm.ac.id
Which one is the eigen vector, red or blue?
Answer: blue
Eigenvectors & Eigenvalues: step-by-step
sunu@ugm.ac.id
• Let be an eigenvector of the matrix . Then there must
exist an eigenvalue such that = or, equivalently,
− = 0 or
( − ) = 0
• If we define a new matrix = − , then
=
• If has an inverse, then = = . But an
eigenvector cannot be zero.
• Thus, it follows that will be an eigenvector of if and
only if does not have an inverse, or equivalently
det( )=0, or
det( − )=0
• This is called the characteristic equation of .
Its roots determine the eigenvalues of .
Mathematical approach (1)
If the determinant equals to zero,
the transformation is scaled into a line

08/06/2022
sunu@ugm.ac.id
• Suppose we have matrix =
2 1
1 2
• Applying eigenvectors and eigenvalues equation = :
= =
2 1
1 2
=
• To solve , and , we rearrange this equation:
2 + =
+ 2 =
which we can further rearrange as:
(2 − ) + = 0
+ (2− ) = 0
sunu@ugm.ac.id
• Since det( − )=0, we find the determinant accordingly:
Thus = 3 and = 1. Using these values, we can get and . We can find eigenvectors
which correspond to these eigenvalues by plugging back in to the equations above and
solving for and . To find an eigenvector corresponding to = 3, start with
Finding determinant of
2-dimensional matrix

08/06/2022
sunu@ugm.ac.id
• There are an infinite number of values for which satisfy this equation.
The only restriction is that not all the components in an eigenvector can
equal zero.
• So if = 1, then = 1 and an eigenvector corresponding to = 3 is [1, 1].
• Finding an eigenvector for = 1 works the same way.
• So an eigenvector for = 1 is [1, −1].
sunu@ugm.ac.id
10
End of File

08/06/2022
sunu@ugm.ac.id
Sunu Wibirama
sunu@ugm.ac.id
INDONESIA
Introduction to PCA (Part 02)
sunu@ugm.ac.id
Variance and covariance
• Variance : It is a measure of the
variability or it simply measures how
spread the data set is. Mathematically, it
is the average squared deviation from
the mean score.
• Covariance : It is a measure of the extent
to which corresponding elements from two
sets of ordered data move in the same
direction.

08/06/2022
sunu@ugm.ac.id
Variance and covariance
• Positive covariance means and are positively related i.e. as increases
also increases. Negative covariance depicts the exact opposite relation.
However zero covariance means and are not related.
sunu@ugm.ac.id
Step-by-step PCA
The whole process of mathematics in PCA can be divided into 5 parts:
1. Standardizing the data
2. Calculate the co-variance matrix
3. Calculating the eigenvectors and eigenvalues
4. Computing the principal components
5. Reducing the dimension of the datasets
Source: https://medium.com/analytics-vidhya/principal-component-analysis-pca-558969e63613

08/06/2022
sunu@ugm.ac.id
Bui H-B, Nguyen H, Choi Y, Bui X-N, Nguyen-Thoi T, Zandi Y. A Novel Artificial Intelligence
Technique to Estimate the Gross Calorific Value of Coal Based on Meta-Heuristic and
Support Vector Regression Algorithms. Applied Sciences. 2019; 9(22):4868.
https://doi.org/10.3390/app9224868
Reducing the dimension of the datasets
Z*
1. Find the mean vector 2. Standardizing data: subtracting mean
and dividing by standard deviation 3. Compute covariance matrix:
=
4. Compute eigenvalues and
eigenvectors of , then
decompose into
: matrix of eigenvectors
is the diagonal matrix with
eigenvalues on the diagonal and
values of zero everywhere else
5. Sort eigenvectors from
highest eigenvalues
6. Project original data to
eigenvectors
7. Obtain projected points in low
dimensions, choose most
important principal components
P* ZP*
sunu@ugm.ac.id
1. Standardizing the data
• Standardizing is the process of scaling the data in such a way that all the
variables and their values lie within a similar range. The formula for
standardization is shown below:
• where:
• : observation or sample
• : mean
• : standard deviation
• Save the standardized data in a matrix .

08/06/2022
sunu@ugm.ac.id
2. Calculate the covariance matrix
• Take the matrix , transpose it, and multiply the transposed matrix by matrix :
=
• The result is a co-variance matrix: expressing the correlation between the different
variables in the data set.
• It is essential to identify highly dependent variables because they contain biased and
redundant information which can hamper the overall performance of the model.
• If our dataset has more than 2 dimensions then it can have more than one covariance
measurement. For example, if we have a dataset with 3 dimensions x, y and z, then the
covariance matrix of this dataset will look like this
sunu@ugm.ac.id
3. Calculate eigenvectors and eigenvalues
• Next, calculate eigenvectors and eigenvalues
from the covariance matrix.
• the eigendecomposition of = is where
we decompose into
• where:
• : matrix of eigenvectors
• is the diagonal matrix with eigenvalues
on the diagonal and values of zero
everywhere else.
(Deisenroth; Faisal; Ong, 2020)

08/06/2022
sunu@ugm.ac.id
3. Calculate eigenvectors and eigenvalues
• Next, calculate eigenvectors and eigenvalues
from the covariance matrix.
• the eigendecomposition of = is where
we decompose into ,
• where:
• : matrix of eigenvectors
• is the diagonal matrix with eigenvalues
on the diagonal and values of zero
everywhere else.
(Deisenroth; Faisal; Ong, 2020)
Note: p1 and p2 are orthogonal because their dot product equals to zero. Why?
Because A is a symmetric matrix. Eigenvectors of a symmetric matrix will always be
orthogonal. What if these vectors are not orthogonal? Use The Gram–Schmidt
orthogonalization to construct orthogonal or orthonormal vectors.
sunu@ugm.ac.id
4. Computing the principal components
• Take the eigenvalues , ,…, and sort
them from largest to smallest.
• Then, sort the eigenvectors in accordingly (if
is the largest eigenvalue, then take the 2nd
column of and place it in the 1st column
position).
• The eigenvector with the highest eigenvalue is
the most significant and therefore forms the
1st principal component (PC 1).
• Call this sorted matrix of eigenvectors *
(the columns of * should be the same as the
columns of , but can be in a different order.)
• PC 1 is the most significant and stores the
maximum possible information
• PC 2 is the second most significant PC and stores
remaining maximum information

08/06/2022
sunu@ugm.ac.id
• Re-arrange the original dataset on the final
principal components which represent the
maximum and most significant information of the
dataset.
• Calculate ∗
= *
• This new matrix, ∗
, is a centered/standardized
version of original data.
• However, each observation in ∗
is a combination
of the original variables, where the weights are
determined by the eigenvector.
• Because our eigenvectors in * are independent
of one another, each column of ∗
is also
independent of one another.
Z*
The left graph is our original data, the right graph
would be our transformed data ∗
sunu@ugm.ac.id
• Finally, we need to determine how many principal component (PC) to
keep versus how many to drop. Normally we keep the most important
PC and drop the less PC.

08/06/2022
sunu@ugm.ac.id
PCA in Python
1. Data preparation—standardizing the data
sunu@ugm.ac.id
PCA in Python
2. Calculate covariance matrix, eigenvalues, eigenvectors,
and computing the principle components

08/06/2022
sunu@ugm.ac.id
PCA in Python
sunu@ugm.ac.id
16
End of File

Modul Topik 4 - Kecerdasan Buatan.pdf

Recommended

Recommended

More Related Content

Similar to Modul Topik 4 - Kecerdasan Buatan.pdf

Similar to Modul Topik 4 - Kecerdasan Buatan.pdf (20)

More from Sunu Wibirama

More from Sunu Wibirama (8)

Recently uploaded

Recently uploaded (20)

Modul Topik 4 - Kecerdasan Buatan.pdf