Week 12 Dimensionality Reduction Part 1
1. Informatics Engineering Study Program
Faculty of Engineering – Universitas Surabaya
Dimensionality Reduction:
Principal Component Analysis
Week 12
1604C055 - Machine Learning
2. Dimensionality reduction
• Dimensionality reduction is the process of transforming data from a
high-dimensional space into a low-dimensional space such that the new
data still retains meaningful properties of the original data.
• High-dimensional data in machine learning leads to:
– High computational demands
– Low generalization performance
– Poor error estimates
• Some techniques:
– Principal component analysis (PCA)
– Linear discriminant analysis (LDA)
– Deep Learning: Autoencoders
3. Principal component analysis (PCA)
• PCA is a statistical technique used to reduce the dimensionality of
data/variables/features without losing the intrinsic information
contained in the original data.
• PCA is categorized as unsupervised learning.
• PCA works by transforming the original variables into new variables,
called principal components.
• Principal components:
– Uncorrelated variables
– Ordered such that the first few principal components retain the most
variation in the original variables
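These two properties can be checked directly on a small synthetic dataset; the data below is made up purely for illustration, and scikit-learn's PCA is one common implementation:

```python
# Sketch: verifying the two properties of principal components
# on synthetic correlated 2-D data (assumed, for illustration only)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples of correlated 2-D data
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# Property 1: principal components are uncorrelated
# (off-diagonal of their covariance matrix is ~0)
cov = np.cov(Z, rowvar=False)
print(abs(cov[0, 1]) < 1e-8)  # True

# Property 2: components are ordered by retained variance
print(pca.explained_variance_[0] >= pca.explained_variance_[1])  # True
```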
5. Principal component analysis (PCA)
• Transformation from 2D to 1D:
– Green: without PCA
– Blue: with PCA
• Transformation without PCA maps the new data points close to
each other.
• Transformation with PCA keeps the data points farther apart,
preserving more of the original variation.
24. Scree plot
Find the "elbow" of the graph: the point where the eigenvalues seem
to level off. Components to the left of this point should be retained
as significant.
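As a sketch of how the scree-plot values are obtained (the 10-D data below is synthetic, chosen so that the variance is concentrated in the first three directions):

```python
# Sketch: computing scree-plot values (PCA eigenvalues) on synthetic data
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 10-D data whose spread is concentrated in the first 3 directions
X = rng.normal(size=(300, 10)) * np.array(
    [5, 4, 3, 0.5, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1]
)

pca = PCA().fit(X)
eigenvalues = pca.explained_variance_

# A scree plot is eigenvalue vs. component index; the curve levels off
# after the informative components (the "elbow").
for i, ev in enumerate(eigenvalues, start=1):
    print(f"PC{i}: {ev:.2f}")

# To draw the actual plot:
# import matplotlib.pyplot as plt
# plt.plot(range(1, 11), eigenvalues, marker="o")
# plt.xlabel("Component"); plt.ylabel("Eigenvalue"); plt.show()
```

Here the elbow falls after the third component, so PC1–PC3 would be retained.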
32. Assignment
• Download dataset here:
https://drive.google.com/drive/folders/1fXfv0VECkys55fnlqxPEuiL3C
-3KyheV?usp=sharing
• This is a digit MNIST dataset containing images of handwritten digits
(ranging from 0 to 4). The distribution of digit labels:
– digits 0–3: 100 images each
– digit 4: 200 images
• Code on the next slide is provided to read the dataset; its final
output is a matrix “original_data” (rows correspond to the images
being read, 600 in total, and columns to the image features, i.e.
the image pixels: 784 pixels = 28 pixels × 28 pixels).
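For orientation only, a minimal sketch of the expected matrix layout; the pixels here are random stand-ins, since the actual loading code is the one provided on the next slide:

```python
# Sketch of the expected "original_data" shape only: 600 images of
# 28x28 pixels flattened to 784-element rows (random stand-in pixels)
import numpy as np

n_images, height, width = 600, 28, 28
rng = np.random.default_rng(2)
images = rng.integers(0, 256, size=(n_images, height, width))

# Flatten each 28x28 image into one row of 784 features
original_data = images.reshape(n_images, height * width)
print(original_data.shape)  # (600, 784)
```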
33.
34. Assignment
• Perform PCA to reduce the dimensionality of the dataset from 784
dimensions to whatever number of dimensions gives the optimal result.
Save the result to a matrix “reduced_data”.
• Choose the classification algorithm that you think will give the best
result in predicting the digit label.
• Perform classification on both “original_data” and “reduced_data”
using the same classification algorithm chosen before, and compare
the results.
35. Assignment
• You may apply any data pre-processing techniques to the dataset
before training, so that the best model is obtained.
• Before feeding the data to the classifier, split the dataset into
training and testing sets. Use StratifiedShuffleSplit from
scikit-learn with n_splits=1 and a 70%:30% training:testing ratio.
• Evaluate the model using accuracy and F1 Score (weighted).
• State your conclusion.
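A sketch of the required workflow, using scikit-learn's built-in digits dataset as a stand-in for the shared MNIST subset; SVC is used here as just one possible classifier choice, not a prescribed one:

```python
# Sketch: stratified split, PCA, and comparison of original vs. reduced
# data with the same classifier (stand-in dataset; SVC is an assumption)
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 64-D stand-in for the 784-D data

# 70%:30% stratified split with a single split, as required
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Fit PCA on the training data only, then transform both sets
pca = PCA(n_components=20).fit(X_train)

for name, (Xtr, Xte) in {
    "original": (X_train, X_test),
    "reduced": (pca.transform(X_train), pca.transform(X_test)),
}.items():
    clf = SVC().fit(Xtr, y_train)
    y_pred = clf.predict(Xte)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="weighted")
    print(f"{name}: accuracy={acc:.3f}, weighted F1={f1:.3f}")
```

The same structure applies to the assignment data: swap in `original_data` and its labels, and tune `n_components` (e.g. via the scree plot) instead of the fixed 20 used here.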