2. What is EDA
Exploratory Data Analysis (EDA) is used to analyze and investigate data sets and
summarize their main characteristics, often employing data
visualization methods.
3. Steps in EDA
• Description of data
• Handling missing data
• Handling outliers
• Understanding relationships and new insights through plots
Step 1:
• Load the data and study its dimensions (how many rows and columns are in the data).
• Check that the data types are correct (for example, dates are represented by a date type, not a string) and
convert columns to the correct type where needed.
• Look at measures of central tendency (mean, distribution type, range, minimum and maximum
values, standard deviation, negative values, etc.).
• Check the data for missing values (delete or fill them, depending on the chosen method).
• Examine any time series (check whether the series is interrupted, whether any dates are missing, etc.).
• Work with categorical variables (check the spelling of strings, remove duplicates, standardize names if necessary).
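The Step 1 checklist above can be sketched in pandas. The column names and values here are made up purely for illustration:

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for this sketch.
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-04"],
    "sales": [100.0, None, 130.0],
    "region": ["north", "North ", "south"],
})

# Dimensions and data types
print(df.shape)    # (rows, columns)
print(df.dtypes)

# Convert the date column from string to a proper datetime type
df["date"] = pd.to_datetime(df["date"])

# Central tendency, spread, min/max
print(df["sales"].describe())

# Missing values per column
print(df.isna().sum())

# Standardize categorical spelling (trim whitespace, lower-case)
df["region"] = df["region"].str.strip().str.lower()
```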
4. Contd..
• When the data is ready, move on to generating the aggregated data (second step) to answer your
questions (concatenating tables, grouping data, creating new features, etc.).
• Third step: visualize.
• Final step: storytelling.
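The second step (aggregation and feature creation) might look like this in pandas; the table and column names are invented for the example:

```python
import pandas as pd

# Hypothetical transactions table for illustration
tx = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "amount": [10, 20, 5, 15],
})

# Grouping: total and average amount per month
agg = tx.groupby("month", sort=False)["amount"].agg(["sum", "mean"]).reset_index()

# Creating a new feature from existing columns
tx["is_large"] = tx["amount"] > 12
```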
5. Why EDA is required
On the one hand, EDA is used to answer questions, test business assumptions, and generate hypotheses for
further analysis.
On the other hand, you can also use it to prepare the data for modeling. The thing that
these two probably have in common is a good knowledge of your data to either get the
answers that you need or to develop an intuition for interpreting the results of future
modeling.
EDA also tells you whether the selected features are good enough to model, whether all
the features are required, and whether there are any correlations, based on which we can either go
back to the data pre-processing step or move on to modeling.
6. What will happen if EDA not performed
If EDA is not done properly, it can hamper the further steps of the machine
learning model-building process.
7. Types of EDA
Graphical
Box plots
Heatmaps
Histograms
Line graphs
Pictograms
Scattergrams (scatter plots)
Non-graphical EDA
Data profiling is concerned with summarizing your dataset through descriptive statistics.
The goal of data profiling is to have a solid understanding of your data so that you can afterwards start
querying and visualizing it in various ways.
9. Steps in EDA
Data Sourcing
Identification of variables and data types
Analyzing the basic metrics
Non-Graphical Univariate Analysis
Graphical Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Variable transformations/Variable creation
Missing value treatment
Outlier treatment
Correlation Analysis
Dimensionality Reduction
10. Data Sourcing
Data Sourcing is the process of finding and loading the data into our system.
Broadly there are two ways in which we can find data.
Private Data
Public Data
Web Scraping
12. Analyzing the basic metrics
Descriptive Analysis on the data
Head of the data
Data Structures of the data
Standardisation of the data
13. Non-Graphical Univariate Analysis:
To get the count of unique values:
To get the list & number of unique values:
Filtering based on Conditions:
Finding null values:
Data Type Conversion using to_datetime() and astype() methods
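Each of the non-graphical checks listed above maps to a one-liner in pandas. The toy dataset below is an assumption, loosely modeled on the bank-marketing style columns this deck references:

```python
import pandas as pd

# Assumed toy dataset for illustration
df = pd.DataFrame({
    "job": ["admin", "tech", "admin", None, "tech"],
    "joined": ["2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01"],
    "balance": ["100", "200", "150", "50", "300"],
})

n_unique = df["job"].nunique()        # count of unique values (NaN excluded)
counts = df["job"].value_counts()     # list of unique values with their frequencies
techs = df[df["job"] == "tech"]       # filtering based on a condition
nulls = df["job"].isna().sum()        # number of null values in the column

# Data type conversion with to_datetime() and astype()
df["joined"] = pd.to_datetime(df["joined"])
df["balance"] = df["balance"].astype(int)
```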
14. Graphical Univariate Analysis
Categorical Unordered Univariate Analysis: An unordered variable is a
categorical variable that has no defined order.
A bar chart works best.
Categorical Ordered Univariate Analysis:
Ordered variables are those variables that have a natural rank order. Some
examples of categorical ordered variables from our dataset are:
Month: Jan, Feb, March…
Education: Primary, Secondary, …
A pie chart works well here.
15. Bivariate Analysis
Numeric-Numeric Analysis:
Scatter Plot
Pair Plot
Correlation Matrix
Numeric - Categorical Analysis
We analyze these mainly using the mean, the median, box plots, and bar charts.
Categorical-Categorical Analysis
Bar charts
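A minimal pandas sketch of the numeric-numeric and numeric-categorical cases (the data is invented; for categorical-categorical pairs, `pd.crosstab` gives the counts a bar chart would show):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "balance": [1000, 1500, 2400, 2600, 3100],
    "education": ["primary", "secondary", "secondary", "tertiary", "tertiary"],
})

# Numeric-numeric: correlation matrix (also the input for a scatter/pair plot)
corr = df[["age", "balance"]].corr()

# Numeric-categorical: compare mean and median of a numeric column per category
by_edu = df.groupby("education")["balance"].agg(["mean", "median"])
```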
16. Multivariate Analysis
If we analyze data by taking more than two variables/columns from a dataset into consideration, it is known as Multivariate Analysis.
Let’s see how ‘Education’, ‘Marital’, and ‘Response_rate’ vary with each other.
First, we’ll create a pivot table with the three columns, and after that, we’ll create a heatmap.
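The pivot-table step might look like this; the column names follow the slide's example, but the values are made up, and the resulting table is what a heatmap (e.g. `seaborn.heatmap`) would visualize:

```python
import pandas as pd

# Assumed data mirroring the slide's 'Education' / 'Marital' / response example
df = pd.DataFrame({
    "education": ["primary", "primary", "secondary", "secondary"],
    "marital":   ["single", "married", "single", "married"],
    "response":  [0, 1, 1, 1],
})

# Pivot table of mean response rate by the two categorical variables
pivot = df.pivot_table(values="response", index="education",
                       columns="marital", aggfunc="mean")
```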
17. Missing value treatment
A Simple Option: Drop Columns with Missing Values
A Better Option: Imputation
A count plot is best for identifying missing values.
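Both options from the slide, sketched with pandas alone (the data is hypothetical; scikit-learn's `SimpleImputer` would also work for the imputation path):

```python
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 35.0],
    "city": ["a", "b", None, "b"],
})

# Simple option: drop every column that contains any missing value
dropped = df.dropna(axis=1)

# Better option: imputation - median for numeric, mode for categorical
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])
```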
19. So Far
• What EDA is, why EDA is required, and what will happen if EDA is not performed.
• Types of EDA and Steps in EDA.
Data Sourcing
Identification of variables and data types
Analyzing the basic metrics
Non-Graphical Univariate Analysis
Graphical Univariate Analysis
Bivariate Analysis & Multivariate Analysis
Variable transformations/Variable creation
Missing value treatment & Outlier treatment
Correlation Analysis
Dimensionality Reduction
20. What is EDA
There are three main components of exploring data:
Understanding your variables
Cleaning your dataset
Analyzing relationships between variables
21. What we are going to discuss
• Feature Engineering
• Missing Values Handling (Imputation techniques for categorical and Numerical)
• Cardinality of Categorical Variables (Encoding)
• Outliers Handling
• Plots
• Does correlation imply Causation
• Imbalance Datasets
23. Missing Values Handling
• Missing completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
• Continuous: mean and median imputation, imputation by regression
• Categorical: mode imputation, classifier imputation, cluster imputation
df.isna().any() returns a boolean value for each column. If there is at least one missing value in that
column, the result is True.
df.isna().sum() returns the number of missing values in each column
Using the method parameter, missing values can be replaced with the values before or after them (the last
observation carried forward (LOCF) method), which is common for longitudinal behaviour.
data["Age"] = data["Age"].interpolate(method='linear', limit_direction='forward', axis=0)  # method can also be 'ffill' or 'bfill'
Multiple Imputation ( Multivariate Imputation by Chained Equations MICE)
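The interpolation and carry-forward options can be compared on a small series (the data is made up; MICE itself is available as `sklearn.impute.IterativeImputer`, not shown here):

```python
import pandas as pd

# Hypothetical longitudinal data with gaps
data = pd.DataFrame({"Age": [20.0, None, 30.0, None]})

# Linear interpolation between known values
linear = data["Age"].interpolate(method="linear", limit_direction="forward")

# LOCF: carry the last observation forward
locf = data["Age"].ffill()

# NOCB: fill backward from the next observation
nocb = data["Age"].bfill()
```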
25. Cardinality in Categorical Variables
• Nominal Encoding
• One hot encoding
• One hot encoding with many categories
• Mean encoding
• Ordinal Encoding
• Label Encoding
• Target guided ordinal encoding
• Count/Frequency encoding
27. Label Encoding (ordinal)
Cold < Hot < Very Hot < Warm … 0 < 1 < 2 < 3
One major issue with this approach is that there is no
relation or order between these classes, but the
algorithm might infer some order or
relationship from the integer codes.
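The pitfall can be shown with pandas categoricals: naive label encoding assigns codes in alphabetical order, which is not the temperature order, whereas declaring the order explicitly fixes it. The values follow the slide's example:

```python
import pandas as pd

temps = pd.Series(["Cold", "Hot", "Very Hot", "Warm"])

# Naive label encoding: codes follow alphabetical order
# (Cold=0, Hot=1, Very Hot=2, Warm=3), an order the data does not actually have.
naive = temps.astype("category").cat.codes

# For a truly ordinal variable, declare the order explicitly instead
ordered = pd.Categorical(temps,
                         categories=["Cold", "Warm", "Hot", "Very Hot"],
                         ordered=True).codes
```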
28. Frequency Encoding
•Select a categorical variable you would like to transform
•Group by the categorical variable and obtain counts of
each category
•Join it back with the training dataset
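The three bullets above reduce to a `value_counts` plus a `map` in pandas; the column and values are assumptions for the sketch:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"city": ["a", "b", "a", "c", "a", "b"]})

# Group by the categorical variable and obtain counts of each category,
# then join the counts back onto the training data
freq = df["city"].value_counts()
df["city_freq"] = df["city"].map(freq)
```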
29. Mean/Target Encoding
In mean/target encoding, the label for each category in the
feature is decided by the mean value of the
target variable on the training data. This encoding
method brings out the relation between similar
categories, but the connections are bounded within
the categories and the target itself.
1.Select a categorical variable you would like to
transform.
2. Group by the categorical variable and obtain
aggregated sum over the “Target” variable. (total
number of 1’s for each category in ‘Temperature’)
3. Group by the categorical variable and obtain
aggregated count over “Target” variable
4. Divide the step 2 result by the step 3 result and join it back with
the training set.
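Steps 2-4 collapse into a single `groupby(...).mean()`, since the sum of a 0/1 target per category divided by its count is exactly the per-category mean. The 'Temperature'/'Target' names follow the slide; the values are invented:

```python
import pandas as pd

# Hypothetical training data matching the slide's example columns
df = pd.DataFrame({
    "Temperature": ["Hot", "Hot", "Cold", "Cold", "Cold"],
    "Target":      [1, 0, 1, 1, 0],
})

# Per-category mean of the target (= sum of 1's / count per category)
means = df.groupby("Temperature")["Target"].mean()

# Join the encoding back onto the training set
df["Temp_encoded"] = df["Temperature"].map(means)
```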
31. Home Work
Helmert Encoding
Weight of Evidence Encoding (% of non events / % of events)
Probability Ratio Encoding
Hashing Encoding
Backward Difference Encoding
Leave One Out Encoding
James-Stein Encoding
M-estimator Encoding
Thermometer Encoder
32. Helmert Coding
What is Helmert coding?
It compares each level of a variable with the mean of the subsequent levels of the variable.
Simply put, each level of the variable is compared to the later levels.
Let’s understand this with an example:
If there are L levels, the first comparison is of level 1 vs. the (L-1) other levels. The
weights are then (L-1)/L for the first level and -1/L for each of the other levels. If the
number of levels is 4, then L = 4, so the weights will be 0.75 and -0.25 (3 times).
The next comparison has only L-1 levels (the first level is no longer part of the
comparisons), so now the weights are (L-2)/(L-1) for the first level and -1/(L-1) for
the others (in our case, 2/3 and -1/3), and so on.
This type of encoding is useful when the levels of the categorical variable are ordered
in a meaningful way.
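The weight scheme described above can be computed directly. This is a plain-Python sketch of those contrast weights (libraries such as `category_encoders` provide a ready-made `HelmertEncoder`):

```python
def helmert_weights(L):
    """Contrast weights comparing each level with the mean of the later levels."""
    rows = []
    for i in range(L - 1):
        k = L - i  # number of levels still in this comparison
        # (k-1)/k for the compared level, -1/k for each subsequent level
        row = [0.0] * i + [(k - 1) / k] + [-1.0 / k] * (k - 1)
        rows.append(row)
    return rows

w = helmert_weights(4)
```
Each row sums to zero, as contrast codings require, and the first row reproduces the 0.75 / -0.25 weights from the example.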
33. Outliers Handling
For a Normal Distribution:
Lower Boundary = Mean - 3 * (Standard Deviation)
Upper Boundary = Mean + 3 * (Standard Deviation)
We use the Interquartile Range (IQR) to measure the limits
of outliers if the data doesn’t follow a normal distribution
or is either right-skewed or left-skewed:
Lower Boundary = First Quartile (Q1, 25th percentile) - (1.5 or 3) * IQR
Upper Boundary = Third Quartile (Q3, 75th percentile) + (1.5 or 3) * IQR
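Both boundary rules are a few lines of pandas; the series below is a made-up example with one obvious outlier:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# Normal-distribution rule: mean +/- 3 standard deviations
lower_sd = s.mean() - 3 * s.std()
upper_sd = s.mean() + 3 * s.std()

# IQR rule (1.5 * IQR) for skewed data
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_iqr = q1 - 1.5 * iqr
upper_iqr = q3 + 1.5 * iqr

outliers = s[(s < lower_iqr) | (s > upper_iqr)]
```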
34. Outliers handling
Remove the observations (identified via a box plot) or apply the Winsorize method.
Imputation
As imputed values, we can choose between the mean, median, mode, and boundary
values.
If you don’t want to remove the outliers, you can instead squeeze the range
of the dataset down to a certain range (transformation).
Minkowski Error Method
35. Imbalance Datasets
Under-sampling majority class
Under-sampling the majority class resamples (removes) majority-class points so that their count
matches the minority class.
Over-sampling the minority class by duplication
Over-sampling the minority class resamples (duplicates) minority-class points so that their count
matches the majority class.
36. Imbalance Datasets
Under-sampling the majority class
Over-sampling the minority class by duplication
Over-sampling the minority class using the Synthetic Minority Oversampling Technique
(SMOTE)
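Under-sampling and duplication-based over-sampling can both be done with `DataFrame.sample`; the imbalanced data below is invented. SMOTE, which synthesizes new minority points rather than duplicating them, is provided by the `imblearn` library and is not shown here:

```python
import pandas as pd

# Hypothetical imbalanced data: 8 majority (y=0) vs 2 minority (y=1) rows
df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

minority = df[df["y"] == 1]
majority = df[df["y"] == 0]

# Under-sampling: shrink the majority class to the minority-class size
under = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Over-sampling by duplication: grow the minority class to the majority-class size
over = pd.concat([majority,
                  minority.sample(len(majority), replace=True, random_state=0)])
```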