Programming for Data Analysis
Week 8
Dr. Ferdin Joe John Joseph
Faculty of Information Technology
Thai – Nichi Institute of Technology, Bangkok
Today’s lesson
Faculty of Information Technology, Thai - Nichi Institute of
Technology, Bangkok
• Feature Engineering
• Feature Selection
• Feature Construction
• Laboratory
Importance
Features
Real Data features
Feature Engineering
• Feature engineering is the process of using domain knowledge to
extract features from raw data via data mining techniques.
• These features can be used to improve the performance of machine
learning algorithms.
Features
• A feature is an attribute or property shared by all of the independent
units on which analysis or prediction is to be done. Any attribute
could be a feature, as long as it is useful to the model.
• The purpose of a feature, beyond being an attribute, is easier to
understand in the context of a problem: a feature is a
characteristic that might help when solving that problem.
Process of Feature Engineering
• Brainstorming or testing features
• Deciding what features to create
• Creating features
• Checking how the features work with your model
• Improving your features if needed
• Going back to brainstorming/creating more features until the work is done
Techniques in Feature Engineering
• Imputation
• Handling Outliers
• Binning
• Log Transform
• One-Hot Encoding
• Grouping Operations
• Feature Split
• Scaling
• Extracting Date
Imputation
• Missing values are one of the most common problems you can
encounter when you try to prepare your data for machine learning.
• The reason for the missing values might be human errors,
interruptions in the data flow, privacy concerns, and so on.
• Missing values affect the performance of machine learning models.
Imputation
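• Dropping columns that contain missing values discards information and can reduce model performance.
• Instead, set a threshold: for example, require at least 70% of a column's values to be present.
• Remove the columns that have more than 30% missing values.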
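The threshold rule can be sketched with pandas. This is a minimal sketch, not the lecture's own code: the toy DataFrame, the column names and the variable max_missing are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 4.0, np.nan],
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],
})

max_missing = 0.30  # assumed cut-off: allow at most 30% missing values

# Fraction of missing values in each column.
missing_ratio = df.isnull().mean()

# Keep only the columns whose missing ratio is within the cut-off.
kept = df.loc[:, missing_ratio <= max_missing]
print(list(kept.columns))
```

Column "b" is 80% missing and is dropped, while "c" (20% missing) is kept and can be imputed afterwards.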
Numerical Imputation
• Fill missing values with a constant, such as 0
• Fill missing values with a statistic of the column, such as the mean or median
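Both options can be sketched with pandas fillna. The series below is hypothetical, and the median is only one possible choice of statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with two missing entries.
ages = pd.Series([22.0, np.nan, 35.0, 41.0, np.nan], name="age")

# Option 1: fill missing values with a constant.
filled_constant = ages.fillna(0)

# Option 2: fill missing values with a statistic of the column, here the median.
# median() ignores missing values by default.
filled_median = ages.fillna(ages.median())
print(filled_median.tolist())
```

The median is often preferred over the mean when the column contains outliers, because the mean is pulled toward extreme values.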
Categorical imputation
• Replace missing values with the most frequently occurring value (the mode) in that column
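A minimal sketch of this rule with pandas, using a hypothetical "color" column:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
colors = pd.Series(["red", "blue", np.nan, "red", np.nan], name="color")

# mode() returns the most frequent value(s); take the first one.
most_frequent = colors.mode().iloc[0]
filled = colors.fillna(most_frequent)
print(filled.tolist())
```

If no value clearly dominates, filling with a separate category such as "Other" may be a more sensible assumption than the mode.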
Handling Outliers
• The best way to detect outliers is to visualize the data
Statistical ways to handle outliers
• Standard Deviation
• Percentiles
Handling outliers – Standard Deviation
• If the distance between a value and the mean is greater than x * standard
deviation, the value can be treated as an outlier.
• x = 2 to 4 is practical. The z-score, which expresses this distance in units of standard deviation, can also be used.
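The rule can be sketched as a z-score filter. The data and the choice x = 2 are assumptions for illustration; note that pandas std() uses the sample standard deviation.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 100], name="value")

x = 2  # assumed multiplier, taken from the practical range of 2 to 4

# z-score: distance from the mean in units of standard deviation.
z = (values - values.mean()) / values.std()

# Flag values whose distance from the mean exceeds x standard deviations.
is_outlier = z.abs() > x
cleaned = values[~is_outlier]
print(values[is_outlier].tolist())
```

With a small sample, a single extreme value inflates the standard deviation itself, so a large x may fail to flag it.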
Handling Outliers - Percentile
• If your data ranges from 0 to 100, the top 5% is not necessarily the values
between 96 and 100.
• Here, the top 5% means the values that lie above the 95th percentile of
the data.
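A minimal sketch of the percentile approach, assuming a hypothetical skewed sample and a 95% upper cut-off:

```python
import pandas as pd

# Hypothetical skewed data within the 0 to 100 range.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 5,
                    6, 6, 7, 8, 9, 10, 12, 15, 40, 100])

# The 95th percentile is computed from the data, not from the range.
upper = values.quantile(0.95)

# Drop everything above the 95th percentile.
cleaned = values[values <= upper]
print(upper, cleaned.max())
```

Here the cut-off lands far below 96 because most of the values are small, which is the point made on the slide.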
Binning
• Binning can be applied to numerical data
• Categorical data can also be binned, for example by grouping categories into broader ones
Binning - Example
#Numerical Binning Example
Value Bin
0-30 -> Low
31-70 -> Mid
71-100 -> High
#Categorical Binning Example
Value Bin
Spain -> Europe
Italy -> Europe
Chile -> South America
Brazil -> South America
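The two mappings above can be sketched with pandas: pd.cut for the numerical bins and a dictionary lookup for the categorical ones. The sample values and variable names are assumptions.

```python
import pandas as pd

# Numerical binning: 0-30 -> Low, 31-70 -> Mid, 71-100 -> High.
scores = pd.Series([5, 30, 31, 70, 71, 100])
score_bin = pd.cut(
    scores,
    bins=[0, 30, 70, 100],
    labels=["Low", "Mid", "High"],
    include_lowest=True,  # make the first bin include 0
)

# Categorical binning: map each country to its continent.
countries = pd.Series(["Spain", "Italy", "Chile", "Brazil"])
continent_of = {
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
}
continent = countries.map(continent_of)
print(score_bin.astype(str).tolist(), continent.tolist())
```

A country missing from the dictionary would map to NaN, so in practice a fallback category such as "Other" may be needed.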
Motivation of binning
• Make the model robust
• Prevent overfitting
Log Transform
• Logarithmic Transformation
• The data you apply the log transform to must contain only positive values,
because the logarithm of zero or a negative number is undefined.
• You can also add 1 to your data before transforming it, that is, compute log(x + 1).
• For non-negative data, this ensures the output of the transformation is non-negative.
Example
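A minimal sketch of a log transform with the add-1 trick, using NumPy's log1p. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative, heavily skewed values.
values = pd.Series([0, 9, 99, 999], name="value")

# log1p computes log(1 + x), so a zero input maps to zero
# instead of being undefined.
log_values = np.log1p(values)
print(log_values.round(3).tolist())
```

For data that contains negative values, one option is to shift the column first, for example by subtracting its minimum, before applying log1p.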
One hot encoding
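One-hot encoding turns a categorical column into one 0/1 column per category. A minimal sketch with pandas get_dummies follows; the "user" and "city" columns are hypothetical.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "city": ["Roma", "Madrid", "Madrid", "Istanbul"],
})

# One indicator column per distinct city.
encoded = pd.get_dummies(df["city"], prefix="city")

# Replace the original column with the indicator columns.
result = df.drop(columns="city").join(encoded)
print(list(result.columns))
```

Each row has exactly one indicator set. Depending on the pandas version, the indicators are stored as booleans or as small integers.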
Grouping Operations
• Categorical Column Grouping
• Numerical Column Grouping
Categorical Column Grouping
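Two common ways to group a categorical column are to keep the most frequent category per group, or to count each category per group. The sketch below assumes a hypothetical visit log with "user" and "city" columns.

```python
import pandas as pd

# Hypothetical visit log: several rows per user.
visits = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "city": ["Roma", "Madrid", "Roma", "Istanbul", "Istanbul", "Madrid"],
})

# Option 1: for each user, keep the city that occurs most often.
top_city = visits.groupby("user")["city"].agg(lambda s: s.value_counts().index[0])

# Option 2: count how often each city occurs for each user.
city_counts = pd.crosstab(visits["user"], visits["city"])
print(top_city.to_dict())
```

Both results have one row per user, so they can be joined back to a user-level table.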
Numerical Column Grouping
• Numerical columns are grouped using sum and mean functions in
most of the cases.
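A minimal sketch of numerical grouping with sum and mean, assuming a hypothetical orders table:

```python
import pandas as pd

# Hypothetical orders: several rows per user.
orders = pd.DataFrame({
    "user": [1, 1, 2, 2, 2],
    "amount": [10.0, 20.0, 5.0, 5.0, 20.0],
})

# One row per user with the total and the average amount.
per_user = orders.groupby("user")["amount"].agg(["sum", "mean"])
print(per_user)
```

Sum suits quantities that accumulate, such as total spending, while mean suits ratios or averages per event.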
Feature Split
• Splitting features is a good way to make them useful in terms of
machine learning.
• Extracting the usable parts of a column into new features:
• Enables machine learning algorithms to comprehend them.
• Makes it possible to bin and group them.
• Improves model performance by uncovering potential information.
Example
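A minimal sketch of a feature split, assuming a hypothetical full-name column that is split into first and last names:

```python
import pandas as pd

# Hypothetical full names; the middle initial is optional.
names = pd.Series(
    ["Luther N. Gonzalez", "Charles M. Young", "Terry Lawson"], name="name"
)

# Split on spaces, then take the first and the last token.
parts = names.str.split(" ")
first_name = parts.str[0]
last_name = parts.str[-1]
print(first_name.tolist(), last_name.tolist())
```

There is no single way to split a feature. The right rule depends on how the column is formatted.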
Scaling
• In real life, it is unreasonable to expect age and income columns to have
the same range.
• Scaling solves this problem.
• Algorithms based on distance calculations, such as k-NN
or k-Means, need scaled continuous features as model input.
Scaling Methods
• Normalization
• Standardization
Normalization
• Normalization (or min-max normalization) scales all values into a fixed
range between 0 and 1.
• This transformation does not change the distribution of the feature,
and because the standard deviation decreases, the effect of
outliers increases.
• Therefore, it is recommended to handle outliers before
normalization.
Normalization - Example
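Min-max normalization computes (x - min) / (max - min) for every value. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical feature values, including negatives.
values = pd.Series([2, 45, -23, 85, 28, 2, 35, -12], name="value")

# Min-max normalization: the minimum maps to 0 and the maximum to 1.
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized.round(3).tolist())
```

Because the minimum and maximum define the scale, a single extreme value compresses all the other values into a narrow band.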
Standardization
• Also known as z-score normalization.
• Scales the values while taking the standard deviation into account.
• If the standard deviations of the features differ, their ranges also
differ from each other.
• This reduces the effect of outliers in the features.
Example
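Standardization computes (x - mean) / standard deviation for every value. A minimal sketch with hypothetical data; pandas std() uses the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical feature values, including negatives.
values = pd.Series([2, 45, -23, 85, 28, 2, 35, -12], name="value")

# z-score standardization: subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()
print(standardized.round(3).tolist())
```

The result has a mean of 0 and a standard deviation of 1, but unlike min-max normalization it is not confined to a fixed range.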
Extracting Date
• Extracting the parts of the date into different columns: Year, month,
day, etc.
• Extracting the time period between the current date and columns in
terms of years, months, days, etc.
• Extracting some specific features from the date: Name of the
weekday, Weekend or not, holiday or not, etc.
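The three kinds of date features can be sketched with the pandas .dt accessor. The dates, the column names and the fixed reference date are assumptions; in practice the reference would usually be the current date.

```python
import pandas as pd

# Hypothetical date column.
df = pd.DataFrame({"date": pd.to_datetime(["2017-01-01", "2008-12-04", "1988-06-23"])})

# 1. Parts of the date as separate columns.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

# 2. Time elapsed between a reference date and the column.
reference = pd.Timestamp("2020-01-01")  # fixed so the result is reproducible
df["days_since"] = (reference - df["date"]).dt.days

# 3. Specific features: weekday name and a weekend flag.
df["day_name"] = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5  # Monday is 0, Sunday is 6
print(df)
```

A holiday flag needs an external calendar, so it is left out of this sketch.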
DSA 207 – Feature Engineering

More Related Content

What's hot

Data wrangling week 10
Data wrangling week 10Data wrangling week 10
Data wrangling week 10
Ferdin Joe John Joseph PhD
 
Blockchain Technology - Week 6 - Role of Cryptography in Blockchain
Blockchain Technology - Week 6 - Role of Cryptography in BlockchainBlockchain Technology - Week 6 - Role of Cryptography in Blockchain
Blockchain Technology - Week 6 - Role of Cryptography in Blockchain
Ferdin Joe John Joseph PhD
 
Blockchain Technology - Week 2 - Blockchain Terminologies
Blockchain Technology - Week 2 - Blockchain TerminologiesBlockchain Technology - Week 2 - Blockchain Terminologies
Blockchain Technology - Week 2 - Blockchain Terminologies
Ferdin Joe John Joseph PhD
 
Blockchain Technology - Week 9 - Blockciphers
Blockchain Technology - Week 9 - BlockciphersBlockchain Technology - Week 9 - Blockciphers
Blockchain Technology - Week 9 - Blockciphers
Ferdin Joe John Joseph PhD
 
Blockchain Technology - Week 11 - Thai-Nichi Institute of Technology
Blockchain Technology - Week 11 - Thai-Nichi Institute of TechnologyBlockchain Technology - Week 11 - Thai-Nichi Institute of Technology
Blockchain Technology - Week 11 - Thai-Nichi Institute of Technology
Ferdin Joe John Joseph PhD
 
Blockchain Technology - Week 10 - CAP Teorem, Byzantines General Problem
Blockchain Technology - Week 10 - CAP Teorem, Byzantines General ProblemBlockchain Technology - Week 10 - CAP Teorem, Byzantines General Problem
Blockchain Technology - Week 10 - CAP Teorem, Byzantines General Problem
Ferdin Joe John Joseph PhD
 
Data wrangling week 6
Data wrangling week 6Data wrangling week 6
Data wrangling week 6
Ferdin Joe John Joseph PhD
 
Blockchain Technology - Week 4 - Hyperledger and Smart Contracts
Blockchain Technology - Week 4 - Hyperledger and Smart ContractsBlockchain Technology - Week 4 - Hyperledger and Smart Contracts
Blockchain Technology - Week 4 - Hyperledger and Smart Contracts
Ferdin Joe John Joseph PhD
 
Data wrangling week2
Data wrangling week2Data wrangling week2
Data wrangling week2
Ferdin Joe John Joseph PhD
 
Data wrangling week3
Data wrangling week3Data wrangling week3
Data wrangling week3
Ferdin Joe John Joseph PhD
 
Data wrangling week1
Data wrangling week1Data wrangling week1
Data wrangling week1
Ferdin Joe John Joseph PhD
 
Blockchain Technology - Week 3 - FinTech and Cryptocurrencies
Blockchain Technology - Week 3 - FinTech and CryptocurrenciesBlockchain Technology - Week 3 - FinTech and Cryptocurrencies
Blockchain Technology - Week 3 - FinTech and Cryptocurrencies
Ferdin Joe John Joseph PhD
 
Why Python?
Why Python?Why Python?
Why Python?
Adam Pah
 
実用Brainf*ckプログラミング
実用Brainf*ckプログラミング実用Brainf*ckプログラミング
実用Brainf*ckプログラミング
京大 マイコンクラブ
 
Introduction python
Introduction pythonIntroduction python
Introduction python
Jumbo Techno e_Learning
 
Palindromic tree
Palindromic treePalindromic tree
Palindromic tree
__math
 
Introduction To Python | Edureka
Introduction To Python | EdurekaIntroduction To Python | Edureka
Introduction To Python | Edureka
Edureka!
 
実用Brainf*ckプログラミング入門編
実用Brainf*ckプログラミング入門編実用Brainf*ckプログラミング入門編
実用Brainf*ckプログラミング入門編
京大 マイコンクラブ
 
Advanced Content Workflow Using GitHub and Markdown
Advanced Content Workflow Using GitHub and MarkdownAdvanced Content Workflow Using GitHub and Markdown
Advanced Content Workflow Using GitHub and Markdown
Ian Lurie
 

What's hot (20)

Data wrangling week 10
Data wrangling week 10Data wrangling week 10
Data wrangling week 10
 
Blockchain Technology - Week 6 - Role of Cryptography in Blockchain
Blockchain Technology - Week 6 - Role of Cryptography in BlockchainBlockchain Technology - Week 6 - Role of Cryptography in Blockchain
Blockchain Technology - Week 6 - Role of Cryptography in Blockchain
 
Blockchain Technology - Week 2 - Blockchain Terminologies
Blockchain Technology - Week 2 - Blockchain TerminologiesBlockchain Technology - Week 2 - Blockchain Terminologies
Blockchain Technology - Week 2 - Blockchain Terminologies
 
Blockchain Technology - Week 9 - Blockciphers
Blockchain Technology - Week 9 - BlockciphersBlockchain Technology - Week 9 - Blockciphers
Blockchain Technology - Week 9 - Blockciphers
 
Blockchain Technology - Week 11 - Thai-Nichi Institute of Technology
Blockchain Technology - Week 11 - Thai-Nichi Institute of TechnologyBlockchain Technology - Week 11 - Thai-Nichi Institute of Technology
Blockchain Technology - Week 11 - Thai-Nichi Institute of Technology
 
Blockchain Technology - Week 10 - CAP Teorem, Byzantines General Problem
Blockchain Technology - Week 10 - CAP Teorem, Byzantines General ProblemBlockchain Technology - Week 10 - CAP Teorem, Byzantines General Problem
Blockchain Technology - Week 10 - CAP Teorem, Byzantines General Problem
 
Data wrangling week 6
Data wrangling week 6Data wrangling week 6
Data wrangling week 6
 
Data Wrangling Week 4
Data Wrangling Week 4Data Wrangling Week 4
Data Wrangling Week 4
 
Blockchain Technology - Week 4 - Hyperledger and Smart Contracts
Blockchain Technology - Week 4 - Hyperledger and Smart ContractsBlockchain Technology - Week 4 - Hyperledger and Smart Contracts
Blockchain Technology - Week 4 - Hyperledger and Smart Contracts
 
Data wrangling week2
Data wrangling week2Data wrangling week2
Data wrangling week2
 
Data wrangling week3
Data wrangling week3Data wrangling week3
Data wrangling week3
 
Data wrangling week1
Data wrangling week1Data wrangling week1
Data wrangling week1
 
Blockchain Technology - Week 3 - FinTech and Cryptocurrencies
Blockchain Technology - Week 3 - FinTech and CryptocurrenciesBlockchain Technology - Week 3 - FinTech and Cryptocurrencies
Blockchain Technology - Week 3 - FinTech and Cryptocurrencies
 
Why Python?
Why Python?Why Python?
Why Python?
 
実用Brainf*ckプログラミング
実用Brainf*ckプログラミング実用Brainf*ckプログラミング
実用Brainf*ckプログラミング
 
Introduction python
Introduction pythonIntroduction python
Introduction python
 
Palindromic tree
Palindromic treePalindromic tree
Palindromic tree
 
Introduction To Python | Edureka
Introduction To Python | EdurekaIntroduction To Python | Edureka
Introduction To Python | Edureka
 
実用Brainf*ckプログラミング入門編
実用Brainf*ckプログラミング入門編実用Brainf*ckプログラミング入門編
実用Brainf*ckプログラミング入門編
 
Advanced Content Workflow Using GitHub and Markdown
Advanced Content Workflow Using GitHub and MarkdownAdvanced Content Workflow Using GitHub and Markdown
Advanced Content Workflow Using GitHub and Markdown
 

Similar to Week 8: Programming for Data Analysis

2019 DSA 105 Introduction to Data Science Week 3
2019 DSA 105 Introduction to Data Science Week 32019 DSA 105 Introduction to Data Science Week 3
2019 DSA 105 Introduction to Data Science Week 3
Ferdin Joe John Joseph PhD
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
Dr.Shweta
 
Coursework Assignment Design of a taxi meter .docx
Coursework Assignment   Design of a taxi meter .docxCoursework Assignment   Design of a taxi meter .docx
Coursework Assignment Design of a taxi meter .docx
vanesaburnand
 
Module-4_Part-II.pptx
Module-4_Part-II.pptxModule-4_Part-II.pptx
Module-4_Part-II.pptx
VaishaliBagewadikar
 
Machine learning specialist ver#4
Machine learning specialist ver#4Machine learning specialist ver#4
Machine learning specialist ver#4
EPSILON AI INSTITUTE
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science Lifecycle
SwapnilDahake2
 
romi-pm-08-quality-april2013.pptx
romi-pm-08-quality-april2013.pptxromi-pm-08-quality-april2013.pptx
romi-pm-08-quality-april2013.pptx
fauzi chayo
 
Introduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceIntroduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data Science
Ferdin Joe John Joseph PhD
 
Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization 
CS, NcState
 
Data Wrangling Week 7
Data Wrangling Week 7Data Wrangling Week 7
Data Wrangling Week 7
Ferdin Joe John Joseph PhD
 
TELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptx
TELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptxTELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptx
TELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptx
GaganaGowda31
 
Software metrics by Dr. B. J. Mohite
Software metrics by Dr. B. J. MohiteSoftware metrics by Dr. B. J. Mohite
Software metrics by Dr. B. J. Mohite
Zeal Education Society, Pune
 
Feature Engineering.pdf
Feature Engineering.pdfFeature Engineering.pdf
Feature Engineering.pdf
Rajoo Jha
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
Databricks
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
Ivo Andreev
 
Intelligent Career Guidance System.pptx
Intelligent Career Guidance System.pptxIntelligent Career Guidance System.pptx
Intelligent Career Guidance System.pptx
Anonymous366406
 
Testing of Object-Oriented Software
Testing of Object-Oriented SoftwareTesting of Object-Oriented Software
Testing of Object-Oriented Software
Praveen Penumathsa
 
Data Analysis and Synthesis & Techniques of System.pptx
Data Analysis and Synthesis & Techniques of System.pptxData Analysis and Synthesis & Techniques of System.pptx
Data Analysis and Synthesis & Techniques of System.pptx
Ts. Heshalini Rajagopal
 
Big Data - IBA.pptx
Big Data - IBA.pptxBig Data - IBA.pptx
Big Data - IBA.pptx
Muhammad Shamim
 
Algorithmic Software Cost Modeling
Algorithmic Software Cost ModelingAlgorithmic Software Cost Modeling
Algorithmic Software Cost Modeling
Kasun Ranga Wijeweera
 

Similar to Week 8: Programming for Data Analysis (20)

2019 DSA 105 Introduction to Data Science Week 3
2019 DSA 105 Introduction to Data Science Week 32019 DSA 105 Introduction to Data Science Week 3
2019 DSA 105 Introduction to Data Science Week 3
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
 
Coursework Assignment Design of a taxi meter .docx
Coursework Assignment   Design of a taxi meter .docxCoursework Assignment   Design of a taxi meter .docx
Coursework Assignment Design of a taxi meter .docx
 
Module-4_Part-II.pptx
Module-4_Part-II.pptxModule-4_Part-II.pptx
Module-4_Part-II.pptx
 
Machine learning specialist ver#4
Machine learning specialist ver#4Machine learning specialist ver#4
Machine learning specialist ver#4
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science Lifecycle
 
romi-pm-08-quality-april2013.pptx
romi-pm-08-quality-april2013.pptxromi-pm-08-quality-april2013.pptx
romi-pm-08-quality-april2013.pptx
 
Introduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceIntroduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data Science
 
Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization 
 
Data Wrangling Week 7
Data Wrangling Week 7Data Wrangling Week 7
Data Wrangling Week 7
 
TELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptx
TELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptxTELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptx
TELECOM_CHURN_PREDICTIAAAAAAAAAAAAAAAAAON[1].pptx
 
Software metrics by Dr. B. J. Mohite
Software metrics by Dr. B. J. MohiteSoftware metrics by Dr. B. J. Mohite
Software metrics by Dr. B. J. Mohite
 
Feature Engineering.pdf
Feature Engineering.pdfFeature Engineering.pdf
Feature Engineering.pdf
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
 
Intelligent Career Guidance System.pptx
Intelligent Career Guidance System.pptxIntelligent Career Guidance System.pptx
Intelligent Career Guidance System.pptx
 
Testing of Object-Oriented Software
Testing of Object-Oriented SoftwareTesting of Object-Oriented Software
Testing of Object-Oriented Software
 
Data Analysis and Synthesis & Techniques of System.pptx
Data Analysis and Synthesis & Techniques of System.pptxData Analysis and Synthesis & Techniques of System.pptx
Data Analysis and Synthesis & Techniques of System.pptx
 
Big Data - IBA.pptx
Big Data - IBA.pptxBig Data - IBA.pptx
Big Data - IBA.pptx
 
Algorithmic Software Cost Modeling
Algorithmic Software Cost ModelingAlgorithmic Software Cost Modeling
Algorithmic Software Cost Modeling
 

More from Ferdin Joe John Joseph PhD

Invited Talk DGTiCon 2022
Invited Talk DGTiCon 2022Invited Talk DGTiCon 2022
Invited Talk DGTiCon 2022
Ferdin Joe John Joseph PhD
 
Week 12: Cloud AI- DSA 441 Cloud Computing
Week 12: Cloud AI- DSA 441 Cloud ComputingWeek 12: Cloud AI- DSA 441 Cloud Computing
Week 12: Cloud AI- DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Week 11: Cloud Native- DSA 441 Cloud Computing
Week 11: Cloud Native- DSA 441 Cloud ComputingWeek 11: Cloud Native- DSA 441 Cloud Computing
Week 11: Cloud Native- DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Week 10: Cloud Security- DSA 441 Cloud Computing
Week 10: Cloud Security- DSA 441 Cloud ComputingWeek 10: Cloud Security- DSA 441 Cloud Computing
Week 10: Cloud Security- DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Week 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud Computing
Week 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud ComputingWeek 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud Computing
Week 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Week 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud Computing
Week 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud ComputingWeek 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud Computing
Week 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...
Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...
Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...
Ferdin Joe John Joseph PhD
 
Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...
Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...
Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...
Ferdin Joe John Joseph PhD
 
Week 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud Computing
Week 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud ComputingWeek 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud Computing
Week 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...
Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...
Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...
Ferdin Joe John Joseph PhD
 
Week 2: Virtualization and VM Ware - DSA 441 Cloud Computing
Week 2: Virtualization and VM Ware - DSA 441 Cloud ComputingWeek 2: Virtualization and VM Ware - DSA 441 Cloud Computing
Week 2: Virtualization and VM Ware - DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Week 1: Introduction to Cloud Computing - DSA 441 Cloud Computing
Week 1: Introduction to Cloud Computing - DSA 441 Cloud ComputingWeek 1: Introduction to Cloud Computing - DSA 441 Cloud Computing
Week 1: Introduction to Cloud Computing - DSA 441 Cloud Computing
Ferdin Joe John Joseph PhD
 
Sept 6 2021 BTech Artificial Intelligence and Data Science curriculum
Sept 6 2021 BTech Artificial Intelligence and Data Science curriculumSept 6 2021 BTech Artificial Intelligence and Data Science curriculum
Sept 6 2021 BTech Artificial Intelligence and Data Science curriculum
Ferdin Joe John Joseph PhD
 
Hadoop in Alibaba Cloud
Hadoop in Alibaba CloudHadoop in Alibaba Cloud
Hadoop in Alibaba Cloud
Ferdin Joe John Joseph PhD
 
Cloud Computing Essentials in Alibaba Cloud
Cloud Computing Essentials in Alibaba CloudCloud Computing Essentials in Alibaba Cloud
Cloud Computing Essentials in Alibaba Cloud
Ferdin Joe John Joseph PhD
 
Transforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approachTransforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approach
Ferdin Joe John Joseph PhD
 
Deep learning - Introduction
Deep learning - IntroductionDeep learning - Introduction
Deep learning - Introduction
Ferdin Joe John Joseph PhD
 
Data wrangling week 11
Data wrangling week 11Data wrangling week 11
Data wrangling week 11
Ferdin Joe John Joseph PhD
 
Data wrangling week 9
Data wrangling week 9Data wrangling week 9
Data wrangling week 9
Ferdin Joe John Joseph PhD
 
Deep Learning and CNN Architectures
Deep Learning and CNN ArchitecturesDeep Learning and CNN Architectures
Deep Learning and CNN Architectures
Ferdin Joe John Joseph PhD
 

More from Ferdin Joe John Joseph PhD (20)

Invited Talk DGTiCon 2022
Invited Talk DGTiCon 2022Invited Talk DGTiCon 2022
Invited Talk DGTiCon 2022
 
Week 12: Cloud AI- DSA 441 Cloud Computing
Week 12: Cloud AI- DSA 441 Cloud ComputingWeek 12: Cloud AI- DSA 441 Cloud Computing
Week 12: Cloud AI- DSA 441 Cloud Computing
 
Week 11: Cloud Native- DSA 441 Cloud Computing
Week 11: Cloud Native- DSA 441 Cloud ComputingWeek 11: Cloud Native- DSA 441 Cloud Computing
Week 11: Cloud Native- DSA 441 Cloud Computing
 
Week 10: Cloud Security- DSA 441 Cloud Computing
Week 10: Cloud Security- DSA 441 Cloud ComputingWeek 10: Cloud Security- DSA 441 Cloud Computing
Week 10: Cloud Security- DSA 441 Cloud Computing
 
Week 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud Computing
Week 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud ComputingWeek 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud Computing
Week 9: Relational Database Service Alibaba Cloud- DSA 441 Cloud Computing
 
Week 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud Computing
Week 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud ComputingWeek 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud Computing
Week 7: Object Storage Service Alibaba Cloud- DSA 441 Cloud Computing
 
Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...
Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...
Week 6: Server Load Balancer and Auto Scaling Alibaba Cloud- DSA 441 Cloud Co...
 
Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...
Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...
Week 5: Elastic Compute Service (ECS) with Alibaba Cloud- DSA 441 Cloud Compu...
 
Week 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud Computing
Week 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud ComputingWeek 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud Computing
Week 4: Big Data and Hadoop in Alibaba Cloud - DSA 441 Cloud Computing
 
Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...
Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...
Week 3: Virtual Private Cloud, On Premise, IaaS, PaaS, SaaS - DSA 441 Cloud C...
 
Week 2: Virtualization and VM Ware - DSA 441 Cloud Computing
Week 2: Virtualization and VM Ware - DSA 441 Cloud ComputingWeek 2: Virtualization and VM Ware - DSA 441 Cloud Computing
Week 2: Virtualization and VM Ware - DSA 441 Cloud Computing
 
Week 1: Introduction to Cloud Computing - DSA 441 Cloud Computing
Week 1: Introduction to Cloud Computing - DSA 441 Cloud ComputingWeek 1: Introduction to Cloud Computing - DSA 441 Cloud Computing
Week 1: Introduction to Cloud Computing - DSA 441 Cloud Computing
 
Sept 6 2021 BTech Artificial Intelligence and Data Science curriculum
Sept 6 2021 BTech Artificial Intelligence and Data Science curriculumSept 6 2021 BTech Artificial Intelligence and Data Science curriculum
Sept 6 2021 BTech Artificial Intelligence and Data Science curriculum
 
Hadoop in Alibaba Cloud
Hadoop in Alibaba CloudHadoop in Alibaba Cloud
Hadoop in Alibaba Cloud
 
Cloud Computing Essentials in Alibaba Cloud
Cloud Computing Essentials in Alibaba CloudCloud Computing Essentials in Alibaba Cloud
Cloud Computing Essentials in Alibaba Cloud
 
Transforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approachTransforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approach
 
Deep learning - Introduction
Deep learning - IntroductionDeep learning - Introduction
Deep learning - Introduction
 
Data wrangling week 11
Data wrangling week 11Data wrangling week 11
Data wrangling week 11
 
Data wrangling week 9
Data wrangling week 9Data wrangling week 9
Data wrangling week 9
 
Deep Learning and CNN Architectures
Deep Learning and CNN ArchitecturesDeep Learning and CNN Architectures
Deep Learning and CNN Architectures
 

Week 8: Programming for Data Analysis

  • 1. Programming for Data Analysis Week 8 Dr. Ferdin Joe John Joseph Faculty of Information Technology Thai – Nichi Institute of Technology, Bangkok
  • 2. Today’s lesson
      • Feature Engineering
      • Feature Selection
      • Feature Construction
      • Laboratory
  • 3. Importance
  • 4. Features
  • 5. Features
  • 6. Features
  • 7. Real Data features
  • 8. Feature Engineering
  • 9. Feature Engineering
      • Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques.
      • These features can be used to improve the performance of machine learning algorithms.
  • 10. Features
      • A feature is an attribute or property shared by all of the independent units on which analysis or prediction is to be done. Any attribute could be a feature, as long as it is useful to the model.
      • The purpose of a feature, beyond being an attribute, is much easier to understand in the context of a problem. A feature is a characteristic that might help when solving the problem.
  • 11. Process of Feature Engineering
      • Brainstorming or testing features
      • Deciding what features to create
      • Creating features
      • Checking how the features work with your model
      • Improving your features if needed
      • Going back to brainstorming/creating more features until the work is done
  • 12. Techniques in Feature Engineering
      • Imputation
      • Handling Outliers
      • Binning
      • Log Transform
      • One-Hot Encoding
      • Grouping Operations
      • Feature Split
      • Scaling
      • Extracting Date
  • 13. Imputation
      • Missing values are one of the most common problems you encounter when preparing data for machine learning.
      • Missing values may be caused by human error, interruptions in the data flow, privacy concerns, and so on.
      • Missing values affect the performance of machine learning models.
  • 14. Imputation
      • Dropping columns with missing values discards information and can reduce model performance.
      • Set a threshold instead, for example requiring at least 70% of a column's values to be present.
      • With that threshold, remove only the columns that have more than 30% missing values.
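The threshold rule on this slide can be sketched as follows. This is a minimal illustration using only the Python standard library; the toy table, the column names, and the reading of the threshold as "drop columns with more than 30% missing" are assumptions made for the example, not part of the original slides.

```python
# Hypothetical toy table: column name -> list of values (None marks a missing value).
table = {
    "age":    [25, None, 31, 40, 22, 35, 28, 51, 19, 44],
    "income": [None, None, None, None, 3200, None, 4100, None, None, 2800],
    "city":   ["Bangkok", "Osaka", None, "Bangkok", "Tokyo",
               "Osaka", "Bangkok", "Tokyo", "Osaka", "Tokyo"],
}

# Assumed reading of the slide: drop columns with more than 30% missing values.
MAX_MISSING_RATE = 0.3


def missing_rate(values):
    """Return the fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


# Keep only the columns whose missing rate does not exceed the threshold.
kept = {name: col for name, col in table.items()
        if missing_rate(col) <= MAX_MISSING_RATE}

print(sorted(kept))
```

Here "income" is missing in 7 of 10 rows, so it is the only column removed; "age" and "city" are each missing one value and are kept for imputation.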
  • 15. Numerical Imputation
      • Fill missing values with a constant.
      • Fill missing values with a statistic of the column, such as the mean or the median.
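Both options for numerical imputation can be sketched in a few lines. The column values and the choice of 0 as the constant are hypothetical; the median is used as the statistic here because it is less sensitive to extreme values than the mean, which is one reasonable choice rather than the only one.

```python
from statistics import median

# Hypothetical numerical column; None marks a missing value.
values = [12.0, None, 15.0, 11.0, None, 90.0]

observed = [v for v in values if v is not None]

# Option 1: fill with a constant (0.0 is an arbitrary choice for illustration).
filled_constant = [0.0 if v is None else v for v in values]

# Option 2: fill with a statistic computed from the observed values.
col_median = median(observed)
filled_median = [col_median if v is None else v for v in values]

print(filled_constant)
print(filled_median)
```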
  • 16. Categorical imputation
      • Replace missing values with the most frequent value (the mode) of that column.
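A minimal sketch of mode imputation for a categorical column, assuming a hypothetical list of color labels in which None marks a missing value:

```python
from collections import Counter

# Hypothetical categorical column; None marks a missing value.
colors = ["red", "blue", None, "red", "green", None, "red"]

observed = [c for c in colors if c is not None]

# most_common(1) returns a one-element list holding the (value, count) pair of the mode.
mode_value, mode_count = Counter(observed).most_common(1)[0]

# Replace every missing entry with the most frequent observed value.
filled = [mode_value if c is None else c for c in colors]

print(mode_value, filled)
```

If the values are spread evenly and no category clearly dominates, filling with the mode can be misleading; a separate placeholder category is a common alternative in that case.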
  • 17. Handling Outliers
      • The best way to detect outliers is to visualize the data.
  • 18. Statistical ways to handle outliers
      • Standard Deviation
      • Percentiles
  • 19. Handling outliers - Standard Deviation
      • If the distance between a value and the mean is greater than x * standard deviation, the value can be treated as an outlier.
      • In practice, x is usually between 2 and 4. The z-score can also be used.
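The standard deviation rule can be sketched as below. The data are hypothetical, and the factor x = 2 is picked from the range mentioned on the slide; the population standard deviation (pstdev) is used here as an assumption, since the slide does not say which variant is meant.

```python
from statistics import mean, pstdev

# Hypothetical measurements; 60 looks suspicious next to the rest.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 60]
factor = 2  # the "x" from the slide, chosen from the 2 to 4 range

mu = mean(values)
sigma = pstdev(values)  # population standard deviation (an assumption)

# Values farther than factor * sigma from the mean are treated as outliers.
lower, upper = mu - factor * sigma, mu + factor * sigma
cleaned = [v for v in values if lower <= v <= upper]
outliers = [v for v in values if not (lower <= v <= upper)]

print(outliers)
```

Note that a single extreme value inflates both the mean and the standard deviation, so on small samples a large factor may fail to flag it.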
  • 20. Handling Outliers - Percentile
      • If your data range from 0 to 100, the top 5% is not necessarily the values between 96 and 100.
      • Here, the top 5% means the values that lie above the 95th percentile of the data.
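The difference between a percentile cut and a cut on the value range can be seen in a small sketch. The data are hypothetical, and `statistics.quantiles` is used with its default interpolation method, which is one of several common percentile definitions.

```python
from statistics import quantiles

# Hypothetical skewed data: 100 values between 1 and 100, mostly small.
values = [1] * 50 + [2] * 30 + [3] * 15 + [50, 60, 70, 80, 100]

# quantiles(..., n=100) returns the 99 percentile cut points; index 94 is the 95th.
p95 = quantiles(values, n=100)[94]

# Percentile-based rule: drop everything above the 95th percentile.
kept = [v for v in values if v <= p95]
dropped = [v for v in values if v > p95]

# Naive rule based on the value range instead of the distribution.
range_cut = min(values) + 0.95 * (max(values) - min(values))
above_range_cut = [v for v in values if v > range_cut]

print(dropped)
print(above_range_cut)
```

The percentile rule removes the five largest observations, while the range-based cut removes only the single value above 95.05, which is the distinction the slide is drawing.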
  • 21. Binning
      • Binning is done for numerical data.
      • Categorical data can also be binned, either by grouping values into broader categories or after conversion to a numerical format.
  • 22. Binning - Example
      # Numerical binning example
      Value      Bin
      0-30    -> Low
      31-70   -> Mid
      71-100  -> High
      # Categorical binning example
      Value      Bin
      Spain   -> Europe
      Italy   -> Europe
      Chile   -> South America
      Brazil  -> South America
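The two mappings in this example can be written as a short sketch. The bin edges and the country-to-continent table come from the slide; the function name, the sample inputs, and the "Other" fallback for unknown countries are assumptions added for illustration.

```python
def bin_score(value):
    """Map a value in the 0-100 range to the Low/Mid/High bins from the example."""
    if value <= 30:
        return "Low"
    if value <= 70:
        return "Mid"
    return "High"


# Categorical binning: map each country to a broader group.
COUNTRY_TO_CONTINENT = {
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
}

scores = [12, 30, 31, 70, 71, 100]           # hypothetical inputs
countries = ["Spain", "Chile", "Italy", "Brazil", "Japan"]

score_bins = [bin_score(s) for s in scores]
# Countries missing from the table fall back to "Other" (an assumption).
continent_bins = [COUNTRY_TO_CONTINENT.get(c, "Other") for c in countries]

print(score_bins)
print(continent_bins)
```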
  • 23. Motivation of binning
      • Make the model robust
      • Prevent overfitting
  • 24. Log Transform
      • Logarithmic transformation
      • The data you apply a log transform to must contain only positive values; otherwise the logarithm is undefined and you get an error or invalid results.
      • You can also add 1 to the data before transforming it.
      • For non-negative data, this ensures that the output of the transformation is never negative.
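A minimal sketch of the "add 1, then take the log" variant, using `math.log1p`, which computes log(1 + x). The input values are hypothetical.

```python
import math

# Hypothetical right-skewed, non-negative values (note the zero).
values = [0, 1, 9, 99, 999]

# A plain logarithm is undefined at zero: math.log(0) raises ValueError.
try:
    math.log(0)
    log_raised = False
except ValueError:
    log_raised = True

# log1p(x) = log(1 + x) is defined for all x >= 0 and maps 0 to 0.
log_values = [math.log1p(v) for v in values]

print(log_raised)
print([round(v, 3) for v in log_values])
```

The transform compresses large values much more than small ones, so the gap between 99 and 999 shrinks to roughly the same size as the gap between 9 and 99.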
  • 25. Example
  • 26. One hot encoding
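The slide content for one-hot encoding was not captured in this transcript. As a general illustration of the technique (not a reconstruction of the slide), one-hot encoding replaces a categorical column with one 0/1 column per category. The city names below are hypothetical.

```python
# Hypothetical categorical column.
cities = ["Bangkok", "Tokyo", "Osaka", "Tokyo"]

# One output column per distinct category, in a fixed (sorted) order.
categories = sorted(set(cities))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if city == cat else 0 for cat in categories] for city in cities]

print(categories)
for row in one_hot:
    print(row)
```

Each row contains exactly one 1, so the encoding preserves the category without implying any order between the categories.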
  • 27. Grouping Operations
      • Categorical Column Grouping
      • Numerical Column Grouping
  • 28. Categorical Column Grouping
  • 29. Numerical Column Grouping
      • In most cases, numerical columns are aggregated within each group using the sum and mean functions.
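Grouping a numerical column by a key and aggregating with sum and mean can be sketched as follows. The customer names and amounts are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction rows: (customer, amount).
visits = [
    ("alice", 20.0),
    ("bob", 5.0),
    ("alice", 40.0),
    ("bob", 15.0),
    ("alice", 30.0),
]

# Collect the amounts belonging to each customer.
grouped = defaultdict(list)
for customer, amount in visits:
    grouped[customer].append(amount)

# One row per customer, with the sum and the mean as new features.
summary = {
    customer: {"sum": sum(amounts), "mean": mean(amounts)}
    for customer, amounts in grouped.items()
}

print(summary)
```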
  • 30. Feature Split
      • Splitting features is a good way to make them useful for machine learning.
      • By extracting the usable parts of a column into new features, we:
      • Enable machine learning algorithms to comprehend them.
      • Make it possible to bin and group them.
      • Improve model performance by uncovering potential information.
  • 31. Example
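The example on this slide was not captured in the transcript. As a stand-in, here is a minimal sketch of feature splitting on two hypothetical string columns: a full name split into first and last name, and a title with a trailing year in parentheses split into name and year. The data and the function name are assumptions.

```python
# Hypothetical full-name column.
full_names = ["Luther N. Gonzalez", "Charles M. Young", "Terry Lawson"]

# The first token becomes the first name, the last token the last name.
first_names = [name.split(" ")[0] for name in full_names]
last_names = [name.split(" ")[-1] for name in full_names]


def split_title(title):
    """Split 'Name (YYYY)' into the name and the year as an integer."""
    name, _, year = title.rpartition(" (")
    return name, int(year.rstrip(")"))


# Hypothetical title column with the release year embedded in the text.
titles = ["Toy Story (1995)", "Jumanji (1995)"]
title_parts = [split_title(t) for t in titles]

print(first_names, last_names)
print(title_parts)
```

Once the year is a separate numerical column, it can be binned or grouped like any other feature.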
  • 32. Scaling
      • In real data, it is unrealistic to expect columns such as age and income to have the same range.
      • Scaling solves this problem by bringing features to comparable ranges.
      • In particular, algorithms based on distance calculations, such as k-NN or k-Means, need scaled continuous features as model input.
  • 33. Scaling Methods
      • Normalization
      • Standardization
  • 34. Normalization
      • Normalization (or min-max normalization) scales all values to a fixed range between 0 and 1.
      • This transformation does not change the distribution of the feature, and because the standard deviation decreases, the effect of outliers increases.
      • Therefore, it is recommended to handle outliers before normalization.
  • 35. Normalization - Example
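Min-max normalization maps each value x to (x - min) / (max - min). A minimal sketch with hypothetical values follows; the guard for a constant column is an added assumption.

```python
# Hypothetical feature values.
values = [20, 30, 40, 60]

lo, hi = min(values), max(values)
span = hi - lo

# Guard against a constant column, where max - min would be zero (an assumption).
if span == 0:
    normalized = [0.0 for _ in values]
else:
    # x' = (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1.
    normalized = [(v - lo) / span for v in values]

print(normalized)
```

Because the minimum and maximum define the scale, a single extreme value squeezes all other values into a narrow part of the 0 to 1 range, which is why outliers should be handled first.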
  • 36. Standardization
      • Also known as z-score normalization.
      • Scales the values while taking the standard deviation into account.
      • If the standard deviations of the features differ, their ranges after scaling will also differ from each other.
      • This reduces the effect of outliers in the features.
  • 37. Example
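Standardization maps each value x to z = (x - mean) / standard deviation. A minimal sketch with hypothetical values, using the population standard deviation as an assumption:

```python
from statistics import mean, pstdev

# Hypothetical feature values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)
sigma = pstdev(values)  # population standard deviation (an assumption)

# z = (x - mean) / std; the result has mean 0 and standard deviation 1.
z_scores = [(v - mu) / sigma for v in values]

print(mu, sigma)
print(z_scores)
```

Unlike min-max normalization, the result is not confined to a fixed range: a value two standard deviations above the mean simply becomes 2.0.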
  • 38. Extracting Date
      • Extracting the parts of the date into different columns: year, month, day, etc.
      • Extracting the time period between the current date and the date columns in terms of years, months, days, etc.
      • Extracting some specific features from the date: name of the weekday, weekend or not, holiday or not, etc.
  • 39. Extracting Date
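The three kinds of date features listed on the previous slide can be sketched with the standard `datetime` module. The dates are hypothetical, and a fixed reference date stands in for "the current date" so that the example gives the same result on every run.

```python
from datetime import date

# Hypothetical date column.
dates = [date(2017, 1, 1), date(2019, 3, 8)]

# Fixed stand-in for the current date, so the example is reproducible.
reference = date(2020, 1, 1)

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

features = []
for d in dates:
    features.append({
        "year": d.year,                             # parts of the date
        "month": d.month,
        "day": d.day,
        "days_since": (reference - d).days,         # elapsed time to the reference
        "weekday_name": WEEKDAY_NAMES[d.weekday()], # specific date features
        "is_weekend": d.weekday() >= 5,
    })

for row in features:
    print(row)
```

A holiday flag would need an external calendar for the relevant country, so it is left out of this sketch.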
  • 40. DSA 207 – Feature Engineering