2. What is EDA
Exploratory Data Analysis (EDA) is used to analyze and investigate data sets and
summarize their main characteristics, often employing data
visualization methods.
3. Steps in EDA
• Description of data
• Handling missing data
• Handling outliers
• Understanding relationships and new insights through plots
Step 1:
• Load the data and study its dimensions (how many rows and columns are in the data).
• Check that the data types are correct (for example, dates are represented by a date type, not a string) and
convert columns to the correct type where needed.
• Look at measures of central tendency (mean, distribution type, range, minimum and maximum
values, standard deviation, negative values, etc.).
• Check the data for missing values (delete or fill them, depending on the chosen method).
• Examine any time series (check whether the series is interrupted, whether any dates are missing, etc.).
• Work with categorical variables (check the spelling of strings, remove duplicates, standardize names if necessary).
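The Step 1 checklist above can be sketched in pandas. The column names and values here are made up purely for illustration:

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for this sketch.
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-04"],
    "sales": [100.0, None, 130.0],
    "region": ["north", "North ", "south"],
})

# Dimensions and data types
print(df.shape)    # (rows, columns)
print(df.dtypes)

# Convert the date column from string to a proper datetime type
df["date"] = pd.to_datetime(df["date"])

# Central tendency, spread, min/max
print(df["sales"].describe())

# Missing values per column
print(df.isna().sum())

# Standardize categorical spelling (trim whitespace, lower-case)
df["region"] = df["region"].str.strip().str.lower()
```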
4. Contd..
• When the data is ready, move on to generating the aggregated data (second step) to answer your
questions (concatenating tables, grouping data, creating new features, etc.).
• Third step: visualize.
• Final step: storytelling.
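The second step (aggregation and feature creation) might look like this in pandas; the table and column names are invented for the example:

```python
import pandas as pd

# Hypothetical transactions table for illustration
tx = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "amount": [10, 20, 5, 15],
})

# Grouping: total and average amount per month
agg = tx.groupby("month", sort=False)["amount"].agg(["sum", "mean"]).reset_index()

# Creating a new feature from existing columns
tx["is_large"] = tx["amount"] > 12
```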
5. Why EDA is required
On the one hand, EDA is used to answer questions, test business assumptions, and generate hypotheses for
further analysis.
On the other hand, you can also use it to prepare the data for modeling. The thing that
these two probably have in common is a good knowledge of your data to either get the
answers that you need or to develop an intuition for interpreting the results of future
modeling.
EDA also tells you whether the selected features are good enough to model, whether all
the features are required, and whether there are any correlations, based on which we can either go
back to the data pre-processing step or move on to modeling.
6. What will happen if EDA not performed
If EDA is not done properly, it can hamper the further steps of the machine
learning model-building process.
7. Types of EDA
Graphical
Box plots
Heatmaps
Histograms
Line graphs
Pictograms
Scattergrams (scatter plots)
Non-graphical EDA
Data profiling is concerned with summarizing your dataset through descriptive statistics.
The goal of data profiling is to have a solid understanding of your data so that you can afterwards start
querying and visualizing it in various ways.
9. Steps in EDA
Data Sourcing
Identification of variables and data types
Analyzing the basic metrics
Non-Graphical Univariate Analysis
Graphical Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Variable transformations/Variable creation
Missing value treatment
Outlier treatment
Correlation Analysis
Dimensionality Reduction
10. Data Sourcing
Data Sourcing is the process of finding and loading the data into our system.
Broadly there are two ways in which we can find data.
Private Data
Public Data
Web Scraping
12. Analyzing the basic metrics
Descriptive Analysis on the data
Head of the data
Data Structures of the data
Standardisation of the data
13. Non-Graphical Univariate Analysis:
To get the count of unique values:
To get the list & number of unique values:
Filtering based on Conditions:
Finding null values:
Data Type Conversion using to_datetime() and astype() methods
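Each of the non-graphical checks listed above maps to a one-liner in pandas. The toy dataset below is an assumption, loosely modeled on the bank-marketing style columns this deck references:

```python
import pandas as pd

# Assumed toy dataset for illustration
df = pd.DataFrame({
    "job": ["admin", "tech", "admin", None, "tech"],
    "joined": ["2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01"],
    "balance": ["100", "200", "150", "50", "300"],
})

n_unique = df["job"].nunique()        # count of unique values (NaN excluded)
counts = df["job"].value_counts()     # list of unique values with their frequencies
techs = df[df["job"] == "tech"]       # filtering based on a condition
nulls = df["job"].isna().sum()        # number of null values in the column

# Data type conversion with to_datetime() and astype()
df["joined"] = pd.to_datetime(df["joined"])
df["balance"] = df["balance"].astype(int)
```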
14. Graphical Univariate Analysis
Categorical Unordered Univariate Analysis: An unordered variable is a
categorical variable that has no defined order.
A bar chart works best.
Categorical Ordered Univariate Analysis:
Ordered variables are those variables that have a natural rank order. Some
examples of categorical ordered variables from our dataset are:
Month: Jan, Feb, March…
Education: Primary, Secondary, …
A pie chart works well here.
15. Bivariate Analysis
Numeric-Numeric Analysis:
Scatter Plot
Pair Plot
Correlation Matrix
Numeric - Categorical Analysis
We analyze these mainly using the mean, the median, box plots, and bar charts.
Categorical-Categorical Analysis
Bar charts
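A minimal pandas sketch of the numeric-numeric and numeric-categorical cases (the data is invented; for categorical-categorical pairs, `pd.crosstab` gives the counts a bar chart would show):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "balance": [1000, 1500, 2400, 2600, 3100],
    "education": ["primary", "secondary", "secondary", "tertiary", "tertiary"],
})

# Numeric-numeric: correlation matrix (also the input for a scatter/pair plot)
corr = df[["age", "balance"]].corr()

# Numeric-categorical: compare mean and median of a numeric column per category
by_edu = df.groupby("education")["balance"].agg(["mean", "median"])
```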
16. Multivariate Analysis
If we analyze data by taking more than two variables/columns from a dataset into consideration, it is known as Multivariate Analysis.
Let’s see how ‘Education’, ‘Marital’, and ‘Response_rate’ vary with each other.
First, we’ll create a pivot table with the three columns, and after that, we’ll create a heatmap.
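The pivot-table step might look like this; the column names follow the slide's example, but the values are made up, and the resulting table is what a heatmap (e.g. `seaborn.heatmap`) would visualize:

```python
import pandas as pd

# Assumed data mirroring the slide's 'Education' / 'Marital' / response example
df = pd.DataFrame({
    "education": ["primary", "primary", "secondary", "secondary"],
    "marital":   ["single", "married", "single", "married"],
    "response":  [0, 1, 1, 1],
})

# Pivot table of mean response rate by the two categorical variables
pivot = df.pivot_table(values="response", index="education",
                       columns="marital", aggfunc="mean")
```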
17. Missing value treatment
A Simple Option: Drop Columns with Missing Values
A Better Option: Imputation
A count plot is best for identifying missing values.
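Both options from the slide, sketched with pandas alone (the data is hypothetical; scikit-learn's `SimpleImputer` would also work for the imputation path):

```python
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 35.0],
    "city": ["a", "b", None, "b"],
})

# Simple option: drop every column that contains any missing value
dropped = df.dropna(axis=1)

# Better option: imputation - median for numeric, mode for categorical
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])
```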
19. So Far
• What EDA is, why EDA is required, and what will happen if EDA is not performed.
• Types of EDA and Steps in EDA.
Data Sourcing
Identification of variables and data types
Analyzing the basic metrics
Non-Graphical Univariate Analysis
Graphical Univariate Analysis
Bivariate Analysis & Multivariate Analysis
Variable transformations/Variable creation
Missing value treatment & Outlier treatment
Correlation Analysis
Dimensionality Reduction
20. What is EDA
There are three main components of exploring data:
Understanding your variables
Cleaning your dataset
Analyzing relationships between variables
21. What we are going to discuss
• Feature Engineering
• Missing Values Handling (Imputation techniques for categorical and Numerical)
• Cardinality of Categorical Variables (Encoding)
• Outliers Handling
• Plots
• Does correlation imply Causation
• Imbalance Datasets
23. Missing Values Handling
• Missing completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
• Continuous: mean and median imputation, imputation by regression
• Categorical: mode imputation, classifier imputation, cluster imputation
df.isna().any() returns a boolean value for each column. If there is at least one missing value in that
column, the result is True.
df.isna().sum() returns the number of missing values in each column
Using the method parameter, missing values can be replaced with the values before or after them (the last
observation carried forward (LOCF) method), which is common for longitudinal behaviour.
data["Age"] = data["Age"].interpolate(method='linear', limit_direction='forward', axis=0)  # method can also be 'ffill' or 'bfill'
Multiple Imputation ( Multivariate Imputation by Chained Equations MICE)
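The interpolation and carry-forward options can be compared on a small series (the data is made up; MICE itself is available as `sklearn.impute.IterativeImputer`, not shown here):

```python
import pandas as pd

# Hypothetical longitudinal data with gaps
data = pd.DataFrame({"Age": [20.0, None, 30.0, None]})

# Linear interpolation between known values
linear = data["Age"].interpolate(method="linear", limit_direction="forward")

# LOCF: carry the last observation forward
locf = data["Age"].ffill()

# NOCB: fill backward from the next observation
nocb = data["Age"].bfill()
```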
25. Cardinality in Categorical Variables
• Nominal Encoding
• One hot encoding
• One hot encoding with many categories
• Mean encoding
• Ordinal Encoding
• Label Encoding
• Target guided ordinal encoding
• Count/Frequency encoding
27. Label Encoding (ordinal)
Cold < Hot < Very Hot < Warm … 0 < 1 < 2 < 3
One major issue with this approach is that there is no
relation or order between these classes, but the
algorithm might infer some order or
relationship from the integer codes.
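The pitfall can be shown with pandas categoricals: naive label encoding assigns codes in alphabetical order, which is not the temperature order, whereas declaring the order explicitly fixes it. The values follow the slide's example:

```python
import pandas as pd

temps = pd.Series(["Cold", "Hot", "Very Hot", "Warm"])

# Naive label encoding: codes follow alphabetical order
# (Cold=0, Hot=1, Very Hot=2, Warm=3), an order the data does not actually have.
naive = temps.astype("category").cat.codes

# For a truly ordinal variable, declare the order explicitly instead
ordered = pd.Categorical(temps,
                         categories=["Cold", "Warm", "Hot", "Very Hot"],
                         ordered=True).codes
```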
28. Frequency Encoding
•Select a categorical variable you would like to transform
•Group by the categorical variable and obtain counts of
each category
•Join it back with the training dataset
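The three bullets above reduce to a `value_counts` plus a `map` in pandas; the column and values are assumptions for the sketch:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"city": ["a", "b", "a", "c", "a", "b"]})

# Group by the categorical variable and obtain counts of each category,
# then join the counts back onto the training data
freq = df["city"].value_counts()
df["city_freq"] = df["city"].map(freq)
```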
29. Mean/Target Encoding
In mean/target encoding, the label for each category in the
feature is decided by the mean value of the
target variable on the training data. This encoding
method brings out the relation between similar
categories, but the connections are bounded within
the categories and the target itself.
1.Select a categorical variable you would like to
transform.
2. Group by the categorical variable and obtain
aggregated sum over the “Target” variable. (total
number of 1’s for each category in ‘Temperature’)
3. Group by the categorical variable and obtain
aggregated count over “Target” variable
4. Divide the step 2 result by the step 3 result and join it back with
the training set.
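Steps 2-4 collapse into a single `groupby(...).mean()`, since the sum of a 0/1 target per category divided by its count is exactly the per-category mean. The 'Temperature'/'Target' names follow the slide; the values are invented:

```python
import pandas as pd

# Hypothetical training data matching the slide's example columns
df = pd.DataFrame({
    "Temperature": ["Hot", "Hot", "Cold", "Cold", "Cold"],
    "Target":      [1, 0, 1, 1, 0],
})

# Per-category mean of the target (= sum of 1's / count per category)
means = df.groupby("Temperature")["Target"].mean()

# Join the encoding back onto the training set
df["Temp_encoded"] = df["Temperature"].map(means)
```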
31. Home Work
Helmert Encoding
Weight of Evidence Encoding (% of non events / % of events)
Probability Ratio Encoding
Hashing Encoding
Backward Difference Encoding
Leave One Out Encoding
James-Stein Encoding
M-estimator Encoding
Thermometer Encoder
32. Helmert Coding
What is Helmert coding?
It compares each level of a variable with the mean of the subsequent levels of the variable.
Simply put, each level of the variable is compared to the later levels.
Let’s understand this with an example:
If there are L levels, the first comparison is of level 1 vs. the (L-1) other levels. The
weights are then (L-1)/L for the first level and -1/L for each of the other levels. If the
number of levels is 4, then L = 4, so the weights will be 0.75 and -0.25 (3 times).
The next comparison has only L-1 levels (the first level is no longer part of the
comparisons), so now the weights are (L-2)/(L-1) for the first level and -1/(L-1) for
the others (in our case, 2/3 and -1/3), and so on.
This type of encoding is useful when the levels of the categorical variable are ordered
in a meaningful way.
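The weight scheme described above can be computed directly. This is a plain-Python sketch of those contrast weights (libraries such as `category_encoders` provide a ready-made `HelmertEncoder`):

```python
def helmert_weights(L):
    """Contrast weights comparing each level with the mean of the later levels."""
    rows = []
    for i in range(L - 1):
        k = L - i  # number of levels still in this comparison
        # (k-1)/k for the compared level, -1/k for each subsequent level
        row = [0.0] * i + [(k - 1) / k] + [-1.0 / k] * (k - 1)
        rows.append(row)
    return rows

w = helmert_weights(4)
```
Each row sums to zero, as contrast codings require, and the first row reproduces the 0.75 / -0.25 weights from the example.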
33. Outliers Handling
For a Normal Distribution:
Lower Boundary = Mean - 3 * (Standard Deviation)
Upper Boundary = Mean + 3 * (Standard Deviation)
We use the Interquartile Range (IQR) to measure the limits
of outliers if the data doesn’t follow a normal distribution
or is either right-skewed or left-skewed:
Lower Boundary = First Quartile (Q1, 25th percentile) - (1.5 or 3) * IQR
Upper Boundary = Third Quartile (Q3, 75th percentile) + (1.5 or 3) * IQR
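Both boundary rules are a few lines of pandas; the series below is a made-up example with one obvious outlier:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# Normal-distribution rule: mean +/- 3 standard deviations
lower_sd = s.mean() - 3 * s.std()
upper_sd = s.mean() + 3 * s.std()

# IQR rule (1.5 * IQR) for skewed data
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_iqr = q1 - 1.5 * iqr
upper_iqr = q3 + 1.5 * iqr

outliers = s[(s < lower_iqr) | (s > upper_iqr)]
```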
34. Outliers handling
Remove the observations (identified via a box plot) or apply the Winsorize method.
Imputation
As imputed values, we can choose between the mean, median, mode, and boundary
values.
If you don’t want to remove the outliers, you can instead squeeze the range
of the dataset down to a certain range (transformation).
Minkowski Error Method
35. Imbalance Datasets
Under-sampling majority class
Under-sampling the majority class resamples (removes) majority-class points so that their count
matches the minority class.
Over-sampling the minority class by duplication
Over-sampling the minority class resamples (duplicates) minority-class points so that their count
matches the majority class.
36. Imbalance Datasets
Under-sampling the majority class
Over-sampling the minority class by duplication
Over-sampling the minority class using the Synthetic Minority Oversampling Technique
(SMOTE)
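Under-sampling and duplication-based over-sampling can both be done with `DataFrame.sample`; the imbalanced data below is invented. SMOTE, which synthesizes new minority points rather than duplicating them, is provided by the `imblearn` library and is not shown here:

```python
import pandas as pd

# Hypothetical imbalanced data: 8 majority (y=0) vs 2 minority (y=1) rows
df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

minority = df[df["y"] == 1]
majority = df[df["y"] == 0]

# Under-sampling: shrink the majority class to the minority-class size
under = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Over-sampling by duplication: grow the minority class to the majority-class size
over = pd.concat([majority,
                  minority.sample(len(majority), replace=True, random_state=0)])
```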