Data Preprocessing
Dr.M.Pyingkodi
AP/MCA
Kongu Engineering College
Erode, Tamilnadu
DATA PREPROCESSING
• Process of preparing the data for analysis.
• Technique of preparing (cleaning and organizing) the
raw data to make it suitable for building and
training Machine Learning models.
• Real-world data :
• Incomplete
• Inconsistent
• likely to contain many errors.
PREPROCESSING TECHNIQUES
• Data cleaning
• Noise, outliers, missing values, duplicate data
• Dealing with categorical data
• Data integration
• Data transformation
• Data reduction
• Sampling
• Imputation
• Discretization
• Feature extraction
• Splitting the dataset into training and testing sets
• Scaling the features
TYPES OF DATA
• Numerical data
• Discrete - Date, No. of students in a class
• Continuous - Cost of a house
• Categorical data
• Nominal - Gender
• Ordinal - Grades of the student
• Dichotomous - Cancerous, Non-cancerous
DATA CLEANING
• Process of detecting and correcting (or removing)
corrupt or inaccurate records from a record set
• Identifying incomplete, incorrect, inaccurate or
irrelevant parts of the data and then replacing,
modifying, or deleting the dirty or coarse data within
a dataset.
• Duplicate observations
• Irrelevant observations
• Fixing Structural errors
• Managing Unwanted outliers
OUTLIERS
Outliers are extreme values
that fall a long way outside
of the other observations.
For example, in a normal
distribution, outliers may be
values on the tails of the
distribution.
FINDING OUTLIERS
• Box plot
• Scatter plot
• Z-Score
• Expectation-maximization
• Linear correlations (principal component analysis)
• Cluster, density or nearest-neighbor analysis
• Interquartile range (IQR)
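As a minimal sketch, two of the methods listed above (Z-score and IQR) can be written with NumPy. The sample values and the cut-offs (|z| > 2 and 1.5 × IQR) are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Hypothetical sample: nine ordinary values and one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 95], dtype=float)

# Z-score rule: flag points that lie far from the mean,
# measured in standard deviations.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 2]

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers, iqr_outliers)
```

Both rules flag only the value 95 here. On very small samples the Z-score is bounded, so a stricter cut-off such as 3 may never trigger.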
HANDLING MISSING VALUES
TECHNIQUES OF DEALING WITH MISSING DATA
• Drop missing values/columns/rows
• Imputation
• A slightly better approach to handling missing data
is imputation: replacing or filling the
missing data with some value.
• There are many ways to impute the data:
• A constant value that belongs to the set of possible
values of that variable, such as 0, and is distinct from all other
values
• A mean, median or mode value for the column
• A value estimated by another predictive model
• Multiple Imputation
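The two basic options above, dropping and simple imputation, might look like this in pandas. The DataFrame and its column names are hypothetical, and mean/mode filling is only one of the listed choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Erode", "Salem", None, "Erode"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute. Mean for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(len(dropped), imputed.isna().sum().sum())
```

Dropping keeps only the two complete rows, while imputation keeps all four at the cost of inventing plausible values.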
DATA INTEGRATION
• Combines data from disparate sources into meaningful
and valuable information
• Data comes from various sources (technologies)
• Sources include multiple databases, data cubes or flat files
Issues
• Schema Integration
• Redundancy
• Detection and resolution of data value conflicts.
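A small sketch of the first two issues above, using pandas. The two tables, their column names and the choice of an outer join are assumptions made for illustration.

```python
import pandas as pd

# Two hypothetical sources describing the same customers with different schemas.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Mani"]})
billing = pd.DataFrame({"customer": [1, 2, 2, 4], "amount": [100, 250, 250, 75]})

# Schema integration: map both sources onto one key name.
billing = billing.rename(columns={"customer": "cust_id"})

# Redundancy: remove exact duplicate records before combining.
billing = billing.drop_duplicates()

# Combine; an outer join keeps records that appear in only one source,
# and indicator=True records which source each row came from.
merged = crm.merge(billing, on="cust_id", how="outer", indicator=True)

print(merged)
```

The `_merge` column makes it easy to spot records present in only one source, which is often where value conflicts need resolving.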
DATA TRANSFORMATION
• Taking data stored in one format and converting it to
another.
• Needed for datasets in which different columns have different units:
one column may be in kilograms, while another
column is in centimeters.
DATA TRANSFORMATION
• MinMax Scaler
Scales all the data to the range [0, 1]. The formula for the scaled value is:
x_scaled = (x - x_min) / (x_max - x_min)
• Standard Scaler
Scales the values so that the mean is 0 and the standard
deviation (and therefore the variance) is 1.
• MaxAbsScaler
Takes the maximum absolute value of each column and divides each value in the
column by it.
Scales the data to the range [-1, 1].
• Robust Scaler
Standardizes input variables in the presence of outliers by keeping the outliers
out of the scaling statistics: it centers on the median and scales by the interquartile range instead of the mean and standard deviation.
• Quantile Transformer Scaler
Maps the variable's distribution onto a uniform or normal distribution.
The quantile function ranks the observations, which smooths out the relationship between them, and the ranks can be
mapped onto other distributions, such as the uniform or normal distribution.
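The first four scalers can be compared on one small column with scikit-learn. This is a sketch: the data is made up, and all scalers use their default settings.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Hypothetical single feature with one very large value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

min_max = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
standard = StandardScaler().fit_transform(X)   # zero mean, unit standard deviation
max_abs = MaxAbsScaler().fit_transform(X)      # x / max(|x|)
robust = RobustScaler().fit_transform(X)       # (x - median) / IQR

print(min_max.ravel())
print(robust.ravel())
```

With the min-max scaler the four ordinary values are squeezed close to 0 by the single large value, whereas the robust scaler keeps them spread around 0 because the median and IQR ignore the extreme point.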
DATA TRANSFORMATION
• Log Transform
Takes the log of the values in a column and uses these values as the column instead.
It is primarily used to make a skewed distribution less skewed, often close to a
normal distribution.
Reduces the impact of extreme values: very high values are compressed toward the rest of the data.
• Unit Vector Scaler/Normalizer
Normalization is the process of scaling individual samples to have unit norm.
Normalizer works on the rows, not the columns.
With the L1 norm, the values in each row are scaled so that the sum of
their absolute values along the row = 1.
With the L2 norm, the values in each row are scaled so
that the sum of their squares along the row = 1.
Example (L1 norm): in a row whose absolute values sum to 1000, the entries 50, 250 and 400 become
0.05, 0.25 and 0.4.
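Both transforms are short enough to write directly in NumPy. In this sketch the input values are hypothetical, and `log1p` (the log of 1 + x) is an assumed variant chosen so that zeros are allowed.

```python
import numpy as np

# Log transform on a right-skewed column.
skewed = np.array([0.0, 9.0, 99.0, 999.0])
logged = np.log1p(skewed)  # log(1 + x): values three orders apart end up evenly spaced

# Row-wise normalization of one sample.
row = np.array([3.0, 4.0])
l1_row = row / np.abs(row).sum()          # absolute values now sum to 1
l2_row = row / np.sqrt((row ** 2).sum())  # squares now sum to 1, giving [0.6, 0.8]

print(logged, l1_row, l2_row)
```

scikit-learn's `Normalizer` applies the same row-wise scaling to every sample in a matrix.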
HANDLING CATEGORICAL DATA
• Find and Replace
• Label Encoding
• Binary encoding
• One Hot Encoding
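import pandas as pd
# obj_df is assumed to be a DataFrame holding the categorical columns
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
• OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
# Encode each distinct "make" as a number in a new column
obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
obj_df[["make", "make_code"]].head(11)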
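The snippets above assume an existing `obj_df`. A self-contained sketch with a tiny, made-up stand-in shows find and replace, label encoding and one-hot encoding side by side. The column values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for obj_df.
obj_df = pd.DataFrame({
    "num_doors": ["two", "four", "four"],
    "drive_wheels": ["rwd", "fwd", "4wd"],
})

# Find and replace: map text labels onto numbers with an explicit dictionary.
obj_df["num_doors"] = obj_df["num_doors"].map({"two": 2, "four": 4})

# Label encoding: convert to a category and take the integer codes
# (codes follow the sorted category order: 4wd=0, fwd=1, rwd=2).
obj_df["drive_code"] = obj_df["drive_wheels"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(obj_df, columns=["drive_wheels"])

print(one_hot.columns.tolist())
```

Label codes imply an order that may not exist in the data, which is why one-hot encoding is usually preferred for nominal variables.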
SAMPLING
Sampling is done to draw conclusions about populations
from samples.
It enables us to determine a population's characteristics by
directly observing only a portion (or sample) of the
population.
TYPES OF SAMPLING
• Simple Random Sampling
• Systematic Sampling
• Stratified Sampling
• Cluster Sampling
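Three of these schemes can be sketched with pandas. The population, the group sizes and the sampling fractions are assumptions for illustration. Cluster sampling would instead pick whole groups at random.

```python
import pandas as pd

# Hypothetical population of 100 records in two strata (80 "A", 20 "B").
population = pd.DataFrame({
    "id": range(100),
    "group": ["A"] * 80 + ["B"] * 20,
})

# Simple random sampling: every record has the same chance of selection.
simple = population.sample(n=10, random_state=0)

# Systematic sampling: every 10th record from a fixed start.
systematic = population.iloc[::10]

# Stratified sampling: sample 10% within each group to preserve proportions.
stratified = population.groupby("group", group_keys=False).sample(
    frac=0.1, random_state=0
)

print(stratified["group"].value_counts())
```

The stratified sample keeps the 80/20 split of the population (8 records from A, 2 from B), which a simple random sample of the same size does not guarantee.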
RESAMPLING
• Re-sampling is a series of methods used to reconstruct
your sample data sets, including training sets and
validation sets.
• Cross-validation (CV)
• Imbalanced dataset
E.g.:
In a utilities fraud detection data set you have the following
data:
Total Observations = 1000
Fraudulent Observations = 20
Non-Fraudulent Observations = 980
Event Rate = 2%
RESAMPLING TECHNIQUES
• Random Under-Sampling
• Random Over-Sampling
• Cluster-Based Over Sampling
• Informed Over Sampling
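The first two techniques can be sketched on a data set with the same class counts as the fraud example (980 vs. 20). The 10% under-sampling fraction and the 400 over-sampled records are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical data set: 980 non-fraudulent and 20 fraudulent observations.
data = pd.DataFrame({"fraud": [0] * 980 + [1] * 20})
majority = data[data["fraud"] == 0]
minority = data[data["fraud"] == 1]

# Random under-sampling: keep a random 10% of the majority class.
under = pd.concat([majority.sample(frac=0.1, random_state=0), minority])

# Random over-sampling: replicate minority records by sampling with replacement.
over = pd.concat([majority, minority.sample(n=400, replace=True, random_state=0)])

under_rate = under["fraud"].mean()  # 20 / (98 + 20)
over_rate = over["fraud"].mean()    # 400 / (980 + 400)
print(round(under_rate, 3), round(over_rate, 3))
```

Under-sampling discards information from the majority class, while over-sampling repeats minority records and can encourage overfitting, so the choice depends on how much data is available.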
DATA REDUCTION
• Dimension reduction compresses a large set of features onto
a new feature subspace of lower dimension without
losing the important information.
• Dimensionality reduction can be done in two different ways:
• By only keeping the most relevant variables from the
original dataset (this technique is called feature selection)
• By finding a smaller set of new variables, each being a
combination of the input variables, containing basically the
same information as the input variables (this technique is
called feature extraction)
DATA REDUCTION TECHNIQUES
• Missing Value Ratio
• Low Variance Filter
• Random Forest
• High Correlation
• Backward Feature Elimination
• Factor Analysis
• Principal Component Analysis (PCA)
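Two of the listed techniques, the low variance filter and PCA, can be chained with scikit-learn. This is a sketch on synthetic data; the threshold of 0 and the single component are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))

# Hypothetical data: two strongly correlated features plus one constant column.
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    np.ones((200, 1)),
])

# Low variance filter: drop columns whose variance is zero.
X_filtered = VarianceThreshold(threshold=0.0).fit_transform(X)

# PCA: combine the remaining features into one new variable.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_filtered)
explained = pca.explained_variance_ratio_[0]

print(X.shape, X_filtered.shape, X_reduced.shape)
```

Because the two remaining features are almost perfectly correlated, a single component retains nearly all of the variance.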
DISCRETIZATION
• Divides continuous attributes into
intervals.
• Binning
• Histogram analysis
• Equal Frequency Partitioning: Partitioning the values so that each
bin holds roughly the same number of values.
• Equal Width Partitioning: Partitioning the values into bins of a fixed width
determined by the number of bins, e.g. a set of values ranging from 0 to 20.
• Clustering: Grouping similar data together.
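Equal-width and equal-frequency partitioning correspond to `pd.cut` and `pd.qcut` in pandas. The values and the bin counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical continuous attribute.
ages = pd.Series([1, 5, 9, 15, 22, 38, 41, 60])

# Equal-width partitioning: 3 bins that each span the same range of values.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency partitioning: 4 bins that each hold about the same number of values.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```

Equal-width bins can end up unevenly filled when the data is skewed, while equal-frequency bins have uneven widths instead.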
PYTHON PACKAGES/TOOLS FOR DATA MINING
• Scikit-learn
• Orange
• Pandas
• MLPy
• MDP
• PyBrain … and many more
SOME OTHER BASIC PACKAGES
• NumPy and SciPy
• Fundamental Packages for scientific computing with Python
• Contains powerful n-dimensional array objects
• Useful linear algebra, random number and other capabilities
• Pandas
• Contains useful data structures and algorithms
• Matplotlib
• Contains functions for plotting/visualizing data.