Data Preprocessing
Dr.M.Pyingkodi
AP/MCA
Kongu Engineering College
Erode, Tamilnadu
DATA PREPROCESSING
• The process of preparing data for analysis.
• A set of techniques for cleaning and organizing raw
data so that it is suitable for building and
training Machine Learning models.
• Real-world data is often:
• Incomplete
• Inconsistent
• Likely to contain many errors
PREPROCESSING TECHNIQUES
• Data cleaning
• Noise, outliers, missing values, duplicate data
• Dealing with categorical data
• Data integration
• Data transformation
• Data reduction
• Sampling
• Imputation
• Discretization
• Feature extraction
• Splitting the dataset into training and testing sets
• Scaling the features
TYPES OF DATA
• Numerical data
• Discrete - Date, No. of students in a class
• Continuous - Cost of a house
• Categorical data
• Nominal – Gender
• Ordinal – Grades of the student
• Dichotomous – Cancerous, Non-cancerous
DATA CLEANING
• Process of detecting and correcting (or removing)
corrupt or inaccurate records from a record set
• Identifying incomplete, incorrect, inaccurate or
irrelevant parts of the data and then replacing,
modifying, or deleting the dirty or coarse data within
a dataset.
• Duplicate observations
• Irrelevant observations
• Fixing Structural errors
• Managing Unwanted outliers
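Two of the cleaning steps listed above, fixing structural errors and removing duplicate observations, can be sketched in a few lines of plain Python. The record layout (name, city) and the sample values are assumptions made up for illustration; they do not come from the slides.

```python
def clean_records(records):
    """Normalise label spelling, then drop duplicate records."""
    seen = set()
    cleaned = []
    for name, city in records:
        # Structural fix: trim stray whitespace and unify capitalisation.
        row = (name.strip().title(), city.strip().title())
        if row not in seen:  # keep only the first occurrence of each record
            seen.add(row)
            cleaned.append(row)
    return cleaned


# Hypothetical raw data with inconsistent casing, spacing and repeats.
raw = [("asha", "Erode"), ("Asha ", "erode"), ("Ravi", "Salem"), ("Ravi", "Salem")]
print(clean_records(raw))
```

Normalising before de-duplicating matters: "asha" and "Asha " only compare as duplicates once the structural errors are fixed.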
OUTLIERS
Outliers are extreme values that fall a long way outside
of the other observations. For example, in a normal
distribution, outliers may be values on the tails of the
distribution.
FINDING OUTLIERS
• Box plot
• Scatter plot
• Z-score
• Expectation-maximization
• Linear correlations (principal component analysis)
• Cluster, density or nearest-neighbor analysis
• Interquartile range (IQR)
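As a minimal sketch of two of the methods above, the Z-score and IQR rules can be written with the standard library only. The thresholds (3 standard deviations, 1.5 × IQR) are common conventions assumed here, not values given in the slides, and the sample data is invented.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose distance from the mean exceeds `threshold` std devs."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / sd > threshold]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
print(zscore_outliers(data))
print(iqr_outliers(data))
```

Note that a single extreme value inflates the mean and standard deviation it is measured against, so on small samples the IQR rule tends to be the more reliable of the two.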
HANDLING MISSING VALUES
TECHNIQUES OF DEALING WITH MISSING DATA
• Drop missing values/columns/rows
• Imputation
• Imputation is often a better approach than dropping data,
because it keeps the remaining values of the row. Imputation
means replacing (filling in) the missing data with some value.
• There are many ways to impute the data:
• A constant value that belongs to the set of possible
values of that variable, such as 0, distinct from all other
values
• A mean, median or mode value for the column
• A value estimated by another predictive model
• Multiple imputation
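The simple strategies listed above (constant, mean, median, mode) can be sketched as follows. This is an illustration in plain Python that marks missing entries with None; the function name and the sample column are assumptions. Model-based and multiple imputation are not shown.

```python
import statistics


def impute(column, strategy="mean", fill_value=0):
    """Replace None entries in `column` using the chosen strategy."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


# Hypothetical column with two missing entries.
ages = [22, None, 30, 38, None, 30, 50]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "constant", fill_value=0))
```

In a modelling workflow the fill value should be computed on the training set only and then reused for the test set, so that no information leaks from test data.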
DATA INTEGRATION
• Combines data from disparate sources into meaningful
and valuable information
• The data may come from various sources (technologies)
• Sources can include multiple databases, data cubes or flat files
Issues
• Schema integration
• Redundancy
• Detection and resolution of data value conflicts
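The three issues above can be illustrated with a toy merge of two sources. Everything here is hypothetical: the source names, the field names (cust_id, customer_id, amount_paise) and the values were invented for this sketch and are not from the slides.

```python
def integrate(crm_rows, billing_rows):
    """Merge two hypothetical sources into one record per customer."""
    merged = {}
    for row in crm_rows:
        # Schema integration: source A calls the key "cust_id".
        cid = row["cust_id"]
        # Redundancy: a repeated customer simply overwrites the earlier row.
        merged[cid] = {"customer_id": cid, "name": row["name"], "amount_rupees": None}
    for row in billing_rows:
        # Source B calls the same key "customer_id" and stores money in paise.
        cid = row["customer_id"]
        rec = merged.setdefault(
            cid, {"customer_id": cid, "name": None, "amount_rupees": None}
        )
        # Value-conflict resolution: convert to one agreed unit (rupees).
        rec["amount_rupees"] = row["amount_paise"] / 100
    return [merged[cid] for cid in sorted(merged)]


crm = [{"cust_id": 2, "name": "Ravi"}, {"cust_id": 1, "name": "Asha"}]
billing = [
    {"customer_id": 1, "amount_paise": 12550},
    {"customer_id": 3, "amount_paise": 9900},
]
combined = integrate(crm, billing)
print(combined)
```

Customers that appear in only one source keep None in the fields the other source would have supplied, which hands the gap over to the missing-value techniques.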
DATA TRANSFORMATION
• Taking data stored in one format and converting it to
another.
• It is needed for datasets in which different columns have
different units or scales; for example, one column can be in
kilograms while another column is in centimeters.
DATA TRANSFORMATION
• MinMax Scaler
Scales all the data to the range [0, 1]. The scaled value is calculated as:
x_scaled = (x - x_min) / (x_max - x_min)
• Standard Scaler
Scales the values so that the mean is 0 and the standard deviation (and hence the
variance) is 1:
x_scaled = (x - mean) / standard_deviation
• MaxAbsScaler
Takes the maximum absolute value of each column and divides each value in the
column by it.
Scales the data to the range [-1, 1].
• Robust Scaler
Standardizes input variables in the presence of outliers by using statistics that outliers
barely affect: it subtracts the median and divides by the interquartile range (IQR)
instead of using the mean and standard deviation.
• Quantile Transformer Scaler
Maps the variable's distribution onto a target distribution, such as the uniform or normal
distribution, and scales it accordingly.
The mapping is based on the ranks (quantiles) of the observations, so it smooths out
unusual distributions and reduces the influence of outliers.
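The first four scalers can be sketched for a single column using only the standard library. This is an illustration of the formulas, not the scikit-learn implementation; the function names and the sample column are assumptions. The robust version uses the "inclusive" quantile method, which interpolates linearly between data points.

```python
import statistics


def min_max_scale(xs):
    """(x - min) / (max - min): maps the column onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standard_scale(xs):
    """(x - mean) / std: result has mean 0 and standard deviation 1."""
    mean, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / sd for x in xs]


def max_abs_scale(xs):
    """x / max(|x|): maps the column onto [-1, 1]."""
    biggest = max(abs(x) for x in xs)
    return [x / biggest for x in xs]


def robust_scale(xs):
    """(x - median) / IQR: centre and spread are barely moved by outliers."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


# Hypothetical column with one outlier (100).
col = [1, 2, 3, 4, 100]
print(min_max_scale(col))
print(robust_scale(col))
```

With the outlier present, min-max scaling squeezes the first four values into a tiny interval near 0, whereas robust scaling keeps them evenly spread and leaves only the outlier far from the rest.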
DATA TRANSFORMATION
• Log Transform
Take the log of the values in a column and use these values as the column instead.
It is primarily used to convert a skewed distribution to a normal or less-skewed
distribution; the values must be positive.
The log-transformed data often follows a normal or near-normal distribution.
It reduces the impact of both too-low and too-high values.
• Unit Vector Scaler/Normalizer
Normalization is the process of scaling individual samples to have unit norm.
Normalizer works on the rows, not on the columns.
With the L1 norm, the values in each row are divided by the sum of their absolute
values, so that the absolute values along the row sum to 1.
With the L2 norm, the values in each row are divided by the square root of the sum of
their squares, so that the squares of the values along the row sum to 1.
Example (L1 norm): the row 50, 250, 400 sums to 700, so it becomes approximately
0.071, 0.357 and 0.571.
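Both transforms can be sketched with the standard library. The function names are assumptions for illustration; log1p (the log of 1 + x) is used here, as a design choice of this sketch, so that zeros in the column do not raise an error.

```python
import math


def log_transform(column):
    """Replace each value x by log(1 + x); requires x > -1."""
    return [math.log1p(x) for x in column]


def normalize_row(row, norm="l2"):
    """Scale one sample (row) so that its L1 or L2 norm equals 1."""
    if norm == "l1":
        size = sum(abs(v) for v in row)
    elif norm == "l2":
        size = math.sqrt(sum(v * v for v in row))
    else:
        raise ValueError(f"unknown norm: {norm}")
    return [v / size for v in row]


row = [50, 250, 400]
l1 = normalize_row(row, "l1")   # each value divided by 700
l2 = normalize_row(row, "l2")   # each value divided by sqrt(225000)
print([round(v, 3) for v in l1])
print([round(v, 3) for v in l2])
print(log_transform([0, 9, 99]))
```

Under the L1 norm the absolute values of the row add up to 1; under the L2 norm it is the squares that add up to 1.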
HANDLING CATEGORICAL DATA
• Find and Replace
• Label Encoding
• Binary encoding
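• One Hot Encoding
import pandas as pd
# obj_df is assumed to be a DataFrame holding the categorical columns
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
• OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
# Encode each distinct value of "make" as a number
obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
obj_df[["make", "make_code"]].head(11)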
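The snippets above depend on an existing obj_df. A self-contained sketch with a tiny made-up DataFrame shows one-hot encoding for a nominal column and an explicit order for an ordinal column. The "grade" column and its low/medium/high order are assumptions for illustration, and pandas' Categorical is used here in place of scikit-learn's OrdinalEncoder.

```python
import pandas as pd

# Hypothetical data: one nominal column and one ordinal column.
df = pd.DataFrame({
    "drive_wheels": ["fwd", "rwd", "4wd", "fwd"],
    "grade": ["low", "high", "medium", "low"],
})

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(df, columns=["drive_wheels"])

# Ordinal encoding: state the order explicitly so the codes respect it.
order = ["low", "medium", "high"]
df["grade_code"] = pd.Categorical(df["grade"], categories=order, ordered=True).codes

print(one_hot)
print(df[["grade", "grade_code"]])
```

Giving the category order explicitly matters: without it the codes follow alphabetical order (high, low, medium), which would not match the meaning of the grades.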
SAMPLING
Sampling is done to draw conclusions about populations
from samples.
It enables us to determine a population’s characteristics by
directly observing only a portion (or sample) of the
population.
TYPES OF SAMPLING
• Simple Random Sampling
• Systematic Sampling
• Stratified Sampling
• Cluster Sampling
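The four sampling types can be sketched with Python's random module. The function names, parameters and fixed seeds are assumptions chosen for this illustration, not definitions from the slides.

```python
import random


def simple_random_sample(population, k, seed=0):
    """Every item has the same chance of being picked."""
    return random.Random(seed).sample(population, k)


def systematic_sample(population, step, start=0):
    """Pick every `step`-th item, beginning at index `start`."""
    return population[start::step]


def stratified_sample(population, key, fraction, seed=0):
    """Sample the same fraction from each stratum defined by `key`."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(clusters, n_clusters, seed=0):
    """Pick whole clusters at random and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


population = list(range(100))
print(simple_random_sample(population, 5))
print(systematic_sample(population, 25))
```

Stratified sampling preserves the proportions of each group in the sample, which is why it is often preferred when some groups are small.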
RESAMPLING
• Resampling is a set of methods used to rebuild (re-draw)
your sample data sets, including training sets and
validation sets.
• Cross-validation (CV)
• Imbalanced dataset
E.g.:
In a utilities fraud detection data set you have the following
data:
Total Observations = 1000
Fraudulent Observations = 20
Non-Fraudulent Observations = 980
Event Rate = 2% (20 / 1000)
RESAMPLING TECHNIQUES
• Random Under-Sampling
• Random Over-Sampling
• Cluster-Based Over Sampling
• Informed Over Sampling
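The first two techniques can be sketched on a dataset with the 980 / 20 class split of the fraud example. The record layout is made up for illustration; cluster-based and informed over-sampling are not shown.

```python
import random


def random_under_sample(majority, minority, seed=0):
    """Shrink the majority class to the size of the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority


def random_over_sample(majority, minority, seed=0):
    """Grow the minority class by drawing its records with replacement."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


# Hypothetical records: (observation id, label), label 1 = fraudulent.
non_fraud = [(i, 0) for i in range(980)]
fraud = [(i, 1) for i in range(980, 1000)]

under = random_under_sample(non_fraud, fraud)
over = random_over_sample(non_fraud, fraud)
print(len(under), sum(label for _, label in under) / len(under))
print(len(over), sum(label for _, label in over) / len(over))
```

Under-sampling discards most of the majority data, while over-sampling repeats minority records and can encourage overfitting; both should be applied to the training set only.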
DATA REDUCTION
• Dimension reduction compresses a large set of features onto
a new feature subspace of lower dimensionality without
losing the important information.
• Dimensionality reduction can be done in two different ways:
• By only keeping the most relevant variables from the
original dataset (this technique is called feature selection)
• By finding a smaller set of new variables, each being a
combination of the input variables, containing basically the
same information as the input variables (this technique is
called feature extraction)
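The second approach, building new variables as combinations of the inputs, is what principal component analysis does. A minimal sketch with NumPy's singular value decomposition follows; the synthetic data (three columns that are noisy multiples of one hidden variable) is an assumption made up for this illustration.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first principal components via SVD."""
    Xc = X - X.mean(axis=0)                 # centre every column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # directions of largest variance
    explained = S**2 / np.sum(S**2)         # share of variance per component
    return Xc @ components.T, explained[:n_components]


# Synthetic data: three features driven by one hidden variable t.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.05, size=200),
    -t + rng.normal(scale=0.05, size=200),
])

Z, ratio = pca(X, n_components=1)
print(Z.shape, ratio)
```

Because the three columns carry almost the same information, a single new variable retains nearly all of the variance.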
DATA REDUCTION TECHNIQUES
• Missing Value Ratio
• Low Variance Filter
• Random Forest
• High Correlation
• Backward Feature Elimination
• Factor Analysis
• Principal Component Analysis (PCA)
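The first two techniques in the list are simple filters and can be sketched in plain Python. The thresholds (50% missing, variance 0.01) and the sample table are assumptions chosen for illustration, not values from the slides.

```python
import statistics


def missing_value_ratio_filter(table, max_ratio=0.5):
    """Drop columns whose share of missing (None) entries is too high."""
    kept = {}
    for name, col in table.items():
        ratio = sum(v is None for v in col) / len(col)
        if ratio <= max_ratio:
            kept[name] = col
    return kept


def low_variance_filter(table, min_variance=0.01):
    """Drop columns that barely vary and so carry little information."""
    kept = {}
    for name, col in table.items():
        observed = [v for v in col if v is not None]
        if statistics.pvariance(observed) >= min_variance:
            kept[name] = col
    return kept


# Hypothetical table stored as {column name: list of values}.
table = {
    "height": [150, 160, None, 180],
    "mostly_missing": [None, None, None, 1],
    "constant": [1, 1, 1, 1],
}
selected = low_variance_filter(missing_value_ratio_filter(table))
print(list(selected))
```

Variance depends on the unit of each column, so the low variance filter is usually applied after the features have been scaled to a comparable range.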
DISCRETIZATION
• Divides continuous attributes into a finite set of
intervals.
• Binning
• Histogram analysis
• Equal Frequency Partitioning: Partitioning the values so that
each bin holds roughly the same number of observations.
• Equal Width Partitioning: Partitioning the value range into a fixed
number of bins of equal width, e.g. one bin covering the values
0-20.
• Clustering: Grouping similar data together.
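The two partitioning schemes can be sketched as functions that return a bin index for every value. The function names and the sample ages are assumptions for illustration.

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last bin, so clamp the index.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds about the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical ages, skewed towards small values.
ages = [1, 5, 9, 14, 20, 22, 35, 60]
print(equal_width_bins(ages, 3))
print(equal_frequency_bins(ages, 4))
```

On skewed data, equal-width bins leave most values in the first bin, while equal-frequency bins stay balanced. This rank-based sketch can place tied values in different bins.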
PYTHON PACKAGES/TOOLS FOR DATA MINING
• Scikit-learn
• Orange
• Pandas
• MLPy
• MDP
• PyBrain … and many more
SOME OTHER BASIC PACKAGES
• NumPy and SciPy
• Fundamental Packages for scientific computing with Python
• Contains powerful n-dimensional array objects
• Useful linear algebra, random number and other capabilities
• Pandas
• Contains useful data structures and algorithms
• Matplotlib
• Contains functions for plotting/visualizing data.