2. DATA PREPROCESSING
• Process of preparing the data for analysis.
• The technique of preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models.
• Real-world data is often:
• Incomplete
• Inconsistent
• Likely to contain many errors
3. PREPROCESSING TECHNIQUES
• Data cleaning
• Noise, outliers, missing values, duplicate data
• Dealing with categorical data
• Data integration
• Data transformation
• Data reduction
• Sampling
• Imputation
• Discretization
• Feature extraction
• Splitting the dataset into training and testing sets
• Scaling the features
4. TYPES OF DATA
• Numerical data
• Discrete - Date, No. of students in a class
• Continuous - Cost of a house
• Categorical data
• Nominal – Gender
• Ordinal – Grades of the student
• Dichotomous – Cancerous, Non-cancerous
5. DATA CLEANING
• Process of detecting and correcting (or removing)
corrupt or inaccurate records from a record set
• Identifying incomplete, incorrect, inaccurate or
irrelevant parts of the data and then replacing,
modifying, or deleting the dirty or coarse data within
a dataset.
• Duplicate observations
• Irrelevant observations
• Fixing structural errors
• Managing unwanted outliers
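A minimal pandas sketch of these cleaning steps (the DataFrame and its column names are hypothetical):
import pandas as pd

# hypothetical raw data with typical quality problems
df = pd.DataFrame({"city": ["NY", "ny", "Boston", "Boston", None],
                   "price": [250, 250, 310, 310, 9999999]})
df = df.drop_duplicates()             # remove duplicate observations
df["city"] = df["city"].str.upper()   # fix a structural error (casing)
df = df.dropna(subset=["city"])       # drop incomplete observations
df = df[df["price"] < 1000000]        # manage an unwanted outlier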
6. OUTLIERS
Outliers are extreme values that fall a long way outside of the other observations.
For example, in a normal distribution, outliers may be values on the tails of the distribution.
7. FINDING OUTLIERS
• Box plot
• Scatter plot
• Z-score
• Expectation-maximization
• Linear correlations (principal component analysis)
• Cluster, density, or nearest-neighbor analysis
• Interquartile range (IQR)
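A small sketch of the Z-score and IQR rules (the thresholds are conventional choices, not fixed rules):
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95])   # 95 is an obvious outlier

# Z-score rule: flag points far from the mean in standard deviations
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])                  # -> [95]

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])  # -> [95]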
9. TECHNIQUES OF DEALING WITH MISSING DATA
• Drop missing values/columns/rows
• Imputation
• A slightly better approach to handling missing data is imputation: replacing or filling the missing data with some value.
• There are many ways to impute the data:
• A constant value that belongs to the set of possible
values of that variable, such as 0, distinct from all other
values
• A mean, median or mode value for the column
• A value estimated by another predictive model
• Multiple Imputation
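A sketch of mean and constant imputation with scikit-learn's SimpleImputer (model-based and multiple imputation need other tools, e.g. the experimental IterativeImputer):
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# fill missing entries with the column mean
print(SimpleImputer(strategy="mean").fit_transform(X))   # nan -> (1+2+4)/3

# fill with a constant sentinel value such as 0
print(SimpleImputer(strategy="constant", fill_value=0).fit_transform(X))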
10. DATA INTEGRATION
• Combining data from disparate sources into meaningful and valuable information.
• Data may come from various sources (technologies).
• It includes multiple databases, data cubes, or flat files.
Issues:
• Schema Integration
• Redundancy
• Detection and resolution of data value conflicts.
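A minimal pandas sketch of integrating two hypothetical sources on a shared key (schema integration here reduces to agreeing on the join key):
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Ann", "Bob"]})
sales = pd.DataFrame({"cust_id": [1, 2], "total": [120.0, 80.5]})

# join the two sources on the shared key into one combined view
combined = pd.merge(crm, sales, on="cust_id", how="inner")
print(combined)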
11. DATA TRANSFORMATION
• Taking data stored in one format and converting it to
another.
• Needed for datasets in which different columns have different units – e.g., one column may be in kilograms while another is in centimeters.
12. DATA TRANSFORMATION
• MinMax Scaler
It scales all the data between 0 and 1. The formula for the scaled value is:
x_scaled = (x – x_min) / (x_max – x_min)
• Standard Scaler
The Standard Scaler scales the values such that the mean is 0 and the standard deviation (and hence the variance) is 1.
• MaxAbsScaler
Takes the absolute maximum value of each column and divides each value in the column by it, scaling the data to the range [-1, 1].
• Robust Scaler
An approach to standardizing input variables in the presence of outliers: it keeps the outliers out of the calculation by centering and scaling with the median and the interquartile range instead of the mean and standard deviation.
• Quantile Transformer Scaler
Converts the variable distribution to a normal (or uniform) distribution and scales it accordingly.
The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution.
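A sketch comparing these scalers on one toy column (note how the outlier dominates MinMax scaling but not the Robust Scaler):
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   MaxAbsScaler, RobustScaler,
                                   QuantileTransformer)

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 acts as an outlier

print(MinMaxScaler().fit_transform(X))        # scaled into [0, 1]
print(StandardScaler().fit_transform(X))      # mean 0, std 1
print(MaxAbsScaler().fit_transform(X))        # divided by max |x|
print(RobustScaler().fit_transform(X))        # centered/scaled by median, IQR
print(QuantileTransformer(n_quantiles=4,
                          output_distribution="normal").fit_transform(X))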
13. DATA TRANSFORMATION
• Log Transform
Takes the log of the values in a column and uses these values in place of the column.
It is primarily used to convert a skewed distribution to a normal or less-skewed distribution, so that the log-transformed data follows a normal or near-normal distribution. It helps by:
Reducing the impact of too-low values
Reducing the impact of too-high values
• Unit Vector Scaler/Normalizer
Normalization is the process of scaling individual samples to have unit norm. The Normalizer works on the rows.
With the L1 norm, the values in each row are converted so that the sum of their absolute values along the row = 1.
With the L2 norm, the values in each row are squared and summed so that the sum of squares along the row = 1.
E.g., L1-normalizing the row 50, 250, 400 (L1 norm = 700) gives approximately 0.07, 0.36, and 0.57.
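A short sketch of the log transform and of row-wise normalization (log1p is used so zeros are handled):
import numpy as np
from sklearn.preprocessing import Normalizer

# log transform: compress a right-skewed column
skewed = np.array([1.0, 10.0, 100.0, 1000.0])
print(np.log1p(skewed))

# unit-norm scaling works per row
row = np.array([[50.0, 250.0, 400.0]])
print(Normalizer(norm="l1").fit_transform(row))  # absolute values sum to 1
print(Normalizer(norm="l2").fit_transform(row))  # squares sum to 1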
14. HANDLING CATEGORICAL DATA
• Find and Replace
• Label Encoding
• Binary encoding
• One Hot Encoding
# one indicator column per category of drive_wheels
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
• OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder

# fit the encoder on the 'make' column and store its integer codes
ord_enc = OrdinalEncoder()
obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
obj_df[["make", "make_code"]].head(11)
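A sketch of the first two options on a small hypothetical obj_df (find-and-replace via pandas, label encoding via scikit-learn):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

obj_df = pd.DataFrame({"num_doors": ["two", "four", "four"],
                       "make": ["audi", "bmw", "audi"]})

# find and replace: map category strings to numbers by hand
obj_df["num_doors"] = obj_df["num_doors"].replace({"two": 2, "four": 4})

# label encoding: let the encoder assign integer codes
obj_df["make_code"] = LabelEncoder().fit_transform(obj_df["make"])
print(obj_df)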
15. SAMPLING
Sampling is done to draw conclusions about populations from samples;
it enables us to determine a population's characteristics by directly observing only a portion (or sample) of the population.
16. TYPES OF SAMPLING
• Simple Random Sampling
• Systematic Sampling
• Stratified Sampling
• Cluster Sampling
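Sketches of three of these schemes with pandas (cluster sampling would instead sample whole groups; df and its columns are hypothetical):
import pandas as pd

df = pd.DataFrame({"group": list("AABBBB"), "x": range(6)})

# simple random sampling: every row equally likely
print(df.sample(n=3, random_state=0))

# systematic sampling: every k-th row
print(df.iloc[::2])

# stratified sampling: sample within each group to keep proportions
print(df.groupby("group", group_keys=False)
        .apply(lambda g: g.sample(frac=0.5, random_state=0)))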
17. RESAMPLING
• Re-sampling is a series of methods used to reconstruct
your sample data sets, including training sets and
validation sets.
• Cross-validation (CV)
• Imbalanced dataset
E.g.: in a utilities fraud detection data set you have the following data:
Total Observations = 1000
Fraudulent Observations = 20
Non Fraudulent Observations = 980
Event Rate = 2%
18. RESAMPLING TECHNIQUES
• Random Under-Sampling
• Random Over-Sampling
• Cluster-Based Over Sampling
• Informed Over Sampling
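A sketch of the first two techniques, assuming the third-party imbalanced-learn package is installed:
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 18 + [1] * 2)            # 10% event rate

# over-sampling duplicates minority examples until the classes balance
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)

# under-sampling discards majority examples instead
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(np.bincount(y_o), np.bincount(y_u))   # [18 18] and [2 2]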
19. DATA REDUCTION
• Dimension reduction compresses a large set of features onto a new feature subspace of lower dimensionality without losing the important information.
• Dimensionality reduction can be done in two different ways:
• By only keeping the most relevant variables from the
original dataset (this technique is called feature selection)
• By finding a smaller set of new variables, each being a
combination of the input variables, containing basically the
same information as the input variables (this technique is
called dimensionality reduction)
20. DATA REDUCTION TECHNIQUES
• Missing Value Ratio
• Low Variance Filter
• Random Forest
• High Correlation Filter
• Backward Feature Elimination
• Factor Analysis
• Principal Component Analysis (PCA)
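A minimal PCA sketch on synthetic data (the fifth feature is deliberately redundant so two components capture most of the variance):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)  # redundant feature

# project the 5 correlated features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)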
21. DISCRETIZATION
• To divide attributes of a continuous nature into data with intervals.
• Binning
• Histogram analysis
• Equal Frequency Partitioning: partitioning the values based on their number of occurrences in the data set.
• Equal Width Partitioning: partitioning the values into fixed-width gaps based on the number of bins, e.g., a set of values ranging from 0–20.
• Clustering: Grouping the similar data together.
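A short sketch of equal-width and equal-frequency binning with pandas:
import pandas as pd

values = pd.Series([1, 7, 5, 4, 6, 3, 15, 21, 2, 8])

# equal-width partitioning: 4 fixed-size intervals across the range
print(pd.cut(values, bins=4))

# equal-frequency partitioning: roughly the same count per bin
print(pd.qcut(values, q=4))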
22. PYTHON PACKAGES/TOOLS FOR DATA MINING
• Scikit-learn
• Orange
• Pandas
• MLPy
• MDP
• PyBrain … and many more
23. SOME OTHER BASIC PACKAGES
• NumPy and SciPy
• Fundamental Packages for scientific computing with Python
• Contains powerful n-dimensional array objects
• Useful linear algebra, random number and other capabilities
• Pandas
• Contains useful data structures and algorithms
• Matplotlib
• Contains functions for plotting/visualizing data.