Data Science
Data Preprocessing
(Data Transformation)
Data Preprocessing
• Data Integration
• Data Transformation: Scaling, Discretization, Categorical Encoding
• Data Reduction (or dimension reduction)
• Data Cleaning: Handling missing values, outliers, duplicates
Data Transformation
Data transformation is the process of converting data from one format or
structure into another for analysis. It is a fundamental part of most data
integration and data management tasks, such as data wrangling, data
warehousing and application integration.
The primary goal of data transformation is to make the data more suitable
for the analysis task at hand, such as predictive modeling or exploratory
data analysis.
Common techniques used in data transformation include normalization,
standardization, and discretization.
Transformation Techniques
1. Scaling
Scaling is the process of transforming the features of a dataset so that they
fall within a specific range or share a common scale. Scaling is useful when
we want to compare two different variables on an equal footing.
The goal of scaling is to ensure that all variables contribute comparably to
the analysis, particularly when using algorithms that are sensitive to the
scale of the input features.
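To illustrate why scale matters, here is a minimal sketch using a Euclidean distance, which is one common example of a scale-sensitive computation. The three (age, salary) records and the helper name `euclidean` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical (age, salary) records, invented for this illustration.
a = (25, 30000)
b = (45, 30500)   # 20 years older than a, salary differs by 500
c = (26, 32000)   # 1 year older than a, salary differs by 2000

def euclidean(p, q):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# On the raw data the salary column dominates: each distance is close to the
# salary difference alone, and the age difference barely registers.
print(raw_ab, raw_ac)
```

On unscaled data, `b` looks closer to `a` than `c` does, even though `b` differs far more in age. Bringing both features onto a comparable scale first lets age and salary both influence the result.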
Methods of Scaling
• Normalization (Min-Max Scaling)
• Standardization (Z-Score Scaling)
Normalization
Normalization (Min-Max Scaling): Normalization rescales the values of a
feature to fit within a specific range, usually between 0 and 1. The formula
for normalization is:
X_normalized = (X - X_min) / (X_max - X_min)
where X_min and X_max are the smallest and largest values of the feature.
Example: Age: 25, 35, 45; Salary: 30000, 50000, 70000
Age      Calculation        X normalized
25       (25-25)/(45-25)    0
35       (35-25)/(45-25)    0.5
45       (45-25)/(45-25)    1
Salary   Calculation        Y normalized
30000    0/40000            0
50000    20000/40000        0.5
70000    40000/40000        1
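The worked example above can be reproduced with a short sketch in plain Python. The function name `min_max_scale` and its `new_min`/`new_max` parameters are illustrative assumptions, not part of the slides, and the handling of a constant feature is one possible convention.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; mapping it to new_min is an
        # assumption made here to avoid dividing by zero.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

ages = [25, 35, 45]
salaries = [30000, 50000, 70000]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)

print(scaled_ages)      # [0.0, 0.5, 1.0]
print(scaled_salaries)  # [0.0, 0.5, 1.0]
```

After scaling, Age and Salary occupy the same 0 to 1 range even though their raw magnitudes differ by three orders of magnitude.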
Standardization
Standardization (Z-Score Scaling): Standardization rescales the values of a
feature so that they have a mean of 0 and a standard deviation of 1. The
formula for standardization is:
Z = (X - μ) / σ
where μ is the mean of the feature and σ is its standard deviation.
Example: Age: 25, 35, 45
μ (mean age) = 35, σ (population standard deviation of age) ≈ 8.16
Age   Calculation     X standardized
25    (25-35)/8.16    -1.22
35    (35-35)/8.16    0
45    (45-35)/8.16    1.22
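The Age example can be checked with a minimal sketch that uses only the Python standard library. The function name `standardize` is an illustrative assumption. The sketch uses the population standard deviation (`statistics.pstdev`), since that is what yields the 8.16 shown in the example; the sample standard deviation of these three ages would be 10.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD, matching the 8.16 above
    if sd == 0:
        # A constant feature has no spread; returning zeros is an assumption
        # made here to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

ages = [25, 35, 45]
z_ages = standardize(ages)

print([round(z, 2) for z in z_ages])  # [-1.22, 0.0, 1.22]
```

The standardized values have mean 0 and population standard deviation 1, as the definition requires.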
2. Discretization
• Discretization is the process of converting continuous data into a set of
intervals. Continuous attribute values are replaced by a small number of
interval labels, which makes the data easier to study and analyze. If a data
mining task handles a continuous attribute, its values can be replaced by
these discrete labels, which can improve the efficiency of the task.
• This method is also regarded as a data reduction mechanism, as it
transforms a large set of distinct values into a small set of categories.
• For example, given the ages 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, the
values of the age attribute can be replaced by interval labels such as
25-40: Young, 41-60: Adult, 61-70: Senior.
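The age example above can be sketched in plain Python as follows. The names `UPPER_BOUNDS`, `LABELS` and `discretize` are illustrative assumptions. Rejecting ages outside 25 to 70 is also an assumption, since the slides do not say how such values should be handled.

```python
from bisect import bisect_left

# Inclusive upper bound of each interval, paired with its label:
# 25-40 -> Young, 41-60 -> Adult, 61-70 -> Senior
UPPER_BOUNDS = [40, 60, 70]
LABELS = ["Young", "Adult", "Senior"]

def discretize(age):
    """Map an age in the range 25..70 to its interval label."""
    if not 25 <= age <= 70:
        raise ValueError(f"age {age} is outside the intervals defined above")
    # bisect_left finds the first interval whose upper bound is >= age.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
age_labels = [discretize(age) for age in ages]

print(age_labels)
# ['Young', 'Young', 'Young', 'Young', 'Adult', 'Adult', 'Adult', 'Adult', 'Senior', 'Senior']
```

Ten distinct numeric values collapse into three labels, which is the data reduction effect described above.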
3. Data Aggregation
• Data collection or aggregation is the method of storing and presenting
data in a summary format. The data may be obtained from multiple data
sources and combined into a single summary for analysis. This is a crucial
step since the accuracy of data analysis insights is highly dependent on the
quantity and quality of the data used.
• For example, sales data may be aggregated to compute monthly and
annual total amounts.
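As a minimal sketch of the sales example, the following plain Python code sums individual sales into monthly and annual totals. The sales records are hypothetical values invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records as (date, amount) pairs, invented for illustration.
sales = [
    (date(2023, 1, 5), 120.0),
    (date(2023, 1, 20), 80.0),
    (date(2023, 2, 3), 200.0),
    (date(2024, 1, 10), 50.0),
]

monthly_totals = defaultdict(float)  # keyed by (year, month)
annual_totals = defaultdict(float)   # keyed by year

for day, amount in sales:
    monthly_totals[(day.year, day.month)] += amount
    annual_totals[day.year] += amount

print(dict(monthly_totals))  # {(2023, 1): 200.0, (2023, 2): 200.0, (2024, 1): 50.0}
print(dict(annual_totals))   # {2023: 400.0, 2024: 50.0}
```

Four detailed records are reduced to three monthly figures and two annual figures, which is the summary format the analysis would then work with.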
4. Encoding categorical variables
Encoding transforms categorical variables into a numerical format
suitable for machine learning algorithms. There are several methods for
encoding categorical variables. Two common approaches are:
• One-hot encoding
• Label encoding
One-Hot Encoding
• One-hot encoding transforms each value of a categorical variable into a
binary vector with one bit per category: a 1 indicates the presence of the
category and a 0 indicates its absence.
• This method creates additional columns, one for each unique category,
which can lead to a high-dimensional dataset. It's suitable for categorical
variables with a relatively small number of unique categories.
• Example:
• Original categorical variable: { "Red", "Blue", "Green" }
• One-hot encoded variables:
• "Red" : [1, 0, 0]
• "Blue" : [0, 1, 0]
• "Green" : [0, 0, 1]
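The color example can be sketched in plain Python as follows. The function name `one_hot_encode` is an illustrative assumption, and the category order ("Red", "Blue", "Green") is taken from the example above. Raising an error for an unseen category is one possible design choice, not something the slides specify.

```python
def one_hot_encode(values, categories):
    """Return one binary vector per value, with one position per category."""
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # raises KeyError for an unseen category
        encoded.append(row)
    return encoded

categories = ["Red", "Blue", "Green"]
colors = ["Red", "Blue", "Green", "Blue"]

one_hot = one_hot_encode(colors, categories)
print(one_hot)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

Each vector has as many positions as there are categories, which is why a variable with many unique categories produces a high-dimensional dataset.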
Label Encoding
• Label encoding assigns a unique numerical label to each category. It
replaces each category with its corresponding numerical label.
• This method does not create additional columns but can introduce
ordinality among categories, which may not always be desirable.
• It's suitable for categorical variables with ordinal relationships between
categories.
• Example:
• Original categorical variable: { "Red", "Blue", "Green" }
• Label encoded variables:
• "Red" : 0
• "Blue" : 1
• "Green" : 2
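The same color example can be label encoded with a short sketch in plain Python. The function name `label_encode` is an illustrative assumption. The codes follow the order of the category list given in the example; library encoders may assign codes in a different order, for instance alphabetically.

```python
def label_encode(values, categories):
    """Replace each value with the integer position of its category."""
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping

categories = ["Red", "Blue", "Green"]
colors = ["Green", "Red", "Blue", "Red"]

codes, mapping = label_encode(colors, categories)
print(mapping)  # {'Red': 0, 'Blue': 1, 'Green': 2}
print(codes)    # [2, 0, 1, 0]
```

The result stays a single column, but the integers imply an order (Red < Blue < Green) that a model may treat as meaningful, which is the ordinality concern noted above.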
Thanks for Watching!
