EDA and Feature Preprocessing for Data Analysis

EDA and Feature Preprocessing Session
- Brajkishore Prajapati

What is Exploratory Data Analysis (EDA)?
● Exploratory Data Analysis (EDA) is an approach to analyzing
data with visual techniques. It is used to discover trends and
patterns and to check assumptions with the help of statistical
summaries and graphical representations.
● EDA supports decision-making and guides the choice of feature
preprocessing steps.

Statistical Analysis
1. Descriptive Statistical Analysis
2. Inferential Statistical Analysis
Descriptive statistics summarize the characteristics of a data set.
Inferential statistics allow you to test a hypothesis or assess
whether your data is generalizable to the broader population.

Steps to perform effective EDA
1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis
4. Analysis of Missing values
5. Analysis of Outliers

Univariate Analysis
1. Numerical Features:
● In this section we explore and analyze the numerical features and collect
statistical information from the data.
● The describe() method is used to extract summary statistics.
● Histograms and KDE plots give a visual representation of the data and help us
observe trends and patterns.
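
A minimal sketch of univariate analysis on one numerical feature, assuming pandas and NumPy. The DataFrame and the column name income are hypothetical toy data, not taken from the session.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical feature with a long right tail.
df = pd.DataFrame({"income": [21, 25, 27, 30, 31, 33, 35, 38, 40, 95]})

# describe() reports count, mean, std, min, quartiles and max.
summary = df["income"].describe()
print(summary)

# Positive skewness hints at a long right tail.
skewness = df["income"].skew()
print("skewness:", round(skewness, 2))

# Bin counts are the numbers a histogram would draw (5 equal-width bins).
counts, bin_edges = np.histogram(df["income"], bins=5)
print(counts)
```

To see the same distribution as a picture, df["income"].plot.hist() or seaborn's histplot and kdeplot functions can be used once a plotting backend is available.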

Descriptive Statistics Example

Inferential Statistical Analysis

2. Categorical Features
● In this section we explore and analyze the categorical features and their
classes, and find the important information in the dataset.
● If the problem is a classification problem, check whether the dataset is imbalanced.
● Check the value counts of each category in a variable, along with their
percentage shares, to uncover hidden patterns.
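
The checks above could look roughly like this in pandas. The column name label and its values are hypothetical.

```python
import pandas as pd

# Hypothetical target column for a binary classification problem.
df = pd.DataFrame({"label": ["no"] * 9 + ["yes"] * 1})

# Absolute counts per class.
counts = df["label"].value_counts()

# Percentage share of each class.
shares = df["label"].value_counts(normalize=True) * 100

# A simple imbalance indicator: majority count divided by minority count.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print("imbalance ratio:", imbalance_ratio)
```

A ratio far above 1 suggests that the imbalance should be handled before modelling.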

Bivariate Analysis
1. Numerical - Numerical:
● In this section we explore and analyze two numerical features at a time.
● Finding the correlation between features.
● Verifying assumptions.
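
As an illustration, pairwise Pearson correlation between numerical features can be computed with pandas. The columns hours, score and absences are made-up example data.

```python
import pandas as pd

# Hypothetical numerical features.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 70],
    "absences": [9, 7, 6, 4, 1],
})

# Pairwise Pearson correlation matrix; values lie between -1 and 1.
corr = df.corr(method="pearson")
print(corr.round(3))
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.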

2. Numerical - Categorical
● In this section we analyze one categorical feature against one numerical
variable and perform the hypothesis tests that fall in this section.
● Checking the significance of features w.r.t. the target variable.
● Hypothesis testing
● Statistical tests that fall under this category are:
1. t-test
2. z-test
3. ANOVA test
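
A hedged sketch of two of these tests using SciPy, assuming scipy is installed. The dept and salary columns are hypothetical, and the z-test is left out here because it is not part of scipy.stats.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: a numerical variable (salary) split by a categorical one (dept).
df = pd.DataFrame({
    "dept": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "salary": [50, 52, 51, 49, 53, 60, 62, 61, 59, 63, 50, 51, 52, 50, 49],
})
a = df.loc[df["dept"] == "A", "salary"]
b = df.loc[df["dept"] == "B", "salary"]
c = df.loc[df["dept"] == "C", "salary"]

# Two-sample t-test (Welch's variant, which does not assume equal variances).
t_stat, p_ttest = stats.ttest_ind(a, b, equal_var=False)

# One-way ANOVA: do the three group means differ?
f_stat, p_anova = stats.f_oneway(a, b, c)

print("t-test p-value:", p_ttest)
print("ANOVA p-value:", p_anova)
```

A p-value below the chosen significance level (commonly 0.05) suggests that the numerical variable differs between the categories.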

3. Categorical - Categorical
● In this section we analyze pairs of categorical features. We plot grouped count
plots for the categorical features and perform the hypothesis tests that fall in
this section.
● Checking the significance of features w.r.t. the target variable.
● Hypothesis testing
● The statistical test that falls under this category is the chi-square test.
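
One possible way to run a chi-square test of independence, assuming pandas and SciPy. The gender and churn columns are invented for the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical feature (gender) and categorical target (churn).
df = pd.DataFrame({
    "gender": ["M"] * 50 + ["F"] * 50,
    "churn": ["yes"] * 40 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 40,
})

# Contingency table of observed counts.
table = pd.crosstab(df["gender"], df["churn"])
print(table)

# Chi-square test of independence between the two variables.
chi2, p_value, dof, expected = chi2_contingency(table)
print("chi2:", chi2, "p-value:", p_value, "dof:", dof)
```

A small p-value indicates that the two categorical variables are unlikely to be independent.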

Multivariate Analysis
● In multivariate analysis we take more than two variables at a time, analyze
the trends and patterns, and extract hidden insights from the data.
● A pairplot is one of the graphs we can use to perform multivariate analysis.
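
A pairplot requires a plotting library such as seaborn. As a plotting-free alternative, which is a substitute and not the pairplot itself, three variables can be summarized together with a pandas pivot table. All column names here are hypothetical.

```python
import pandas as pd

# Hypothetical data with two categorical variables and one numerical variable.
df = pd.DataFrame({
    "region": ["N", "N", "S", "S", "N", "S"],
    "segment": ["A", "B", "A", "B", "A", "B"],
    "sales": [10, 20, 30, 40, 14, 44],
})

# Mean sales for every (region, segment) combination.
pivot = df.pivot_table(values="sales", index="region",
                       columns="segment", aggfunc="mean")
print(pivot)
```

Reading across a row or down a column shows how the numerical variable changes with each categorical variable while the other is held fixed.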

Analysis of Missing Values
● Univariate analysis already reveals whether missing values are present. In this
section we examine how the missing values of one feature relate to the values of
other features.
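
A small sketch of this idea, assuming pandas and NumPy: count the missing values, then compare another feature between rows where a value is missing and rows where it is present. The age and income columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: age is missing for some rows.
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan, 31, 58],
    "income": [30, 80, 35, 90, 32, 38],
})

# Number and percentage of missing values per column.
missing_counts = df.isna().sum()
missing_pct = df.isna().mean() * 100
print(missing_counts)
print(missing_pct)

# Flag rows where age is missing and compare income across the two groups.
df["age_status"] = np.where(df["age"].isna(), "missing", "present")
income_by_status = df.groupby("age_status")["income"].mean()
print(income_by_status)
```

A large gap between the two group means suggests the values are not missing at random, which matters when choosing an imputation strategy.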

EDA on Text Data
1. Understanding Problem Statement.
2. Analyzing text statistics
   2.1 Sentence length analysis
   2.2 Word frequency analysis
       (a) Analysis of most occurring stopwords.
       (b) Analysis of most occurring words without stopwords (both unigrams and bigrams).
3. Polarity of sentences.
4. Word Cloud
● Sentence length analysis helps choose the maximum sequence length (the number of tokens each input is padded or truncated to) for the training data.
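
The text statistics listed above can be sketched with the standard library alone. The corpus and the tiny stopword set below are hypothetical; a real project would likely use a fuller stopword list. Polarity and word clouds need extra libraries and are not shown.

```python
from collections import Counter

# Hypothetical corpus and a deliberately tiny stopword list.
corpus = [
    "the movie was great",
    "the plot was not great at all",
    "great acting and a great cast",
]
STOPWORDS = {"the", "was", "at", "all", "and", "a", "not"}

# Sentence length analysis (length measured in whitespace-separated tokens).
tokenized = [sentence.lower().split() for sentence in corpus]
lengths = [len(tokens) for tokens in tokenized]
max_len = max(lengths)

# Word frequency analysis: stopwords and non-stopwords counted separately.
all_tokens = [tok for tokens in tokenized for tok in tokens]
stopword_counts = Counter(tok for tok in all_tokens if tok in STOPWORDS)
unigram_counts = Counter(tok for tok in all_tokens if tok not in STOPWORDS)

# Bigrams built from each sentence after removing stopwords.
bigram_counts = Counter()
for tokens in tokenized:
    content = [tok for tok in tokens if tok not in STOPWORDS]
    bigram_counts.update(zip(content, content[1:]))

print(lengths, max_len)
print(stopword_counts.most_common(2))
print(unigram_counts.most_common(1))
```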

Feature Preprocessing
1. Handling Missing values
● Deleting rows with missing values
● Replacing the missing values with a fixed value based on the insights gained from the data
● Imputing missing values of a continuous variable with the mean or median
● Imputing missing values of a categorical variable with the mode
● Using algorithm-based imputation, e.g. the KNN imputer
● Predicting the missing values with a model
● Forward and backward fill methods
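
Several of these strategies map directly onto pandas calls. The sketch below uses a hypothetical DataFrame with columns age, city and temp; KNN-based and model-based imputation are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a continuous, a categorical and an ordered column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": ["Pune", "Delhi", None, "Pune"],
    "temp": [30.0, np.nan, np.nan, 33.0],
})

# Delete rows that contain any missing value.
dropped = df.dropna()

# Continuous variable: fill with the mean or the median.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# Categorical variable: fill with the mode (most frequent value).
city_mode = df["city"].fillna(df["city"].mode()[0])

# Ordered data (e.g. a time series): forward fill and backward fill.
temp_ffill = df["temp"].ffill()
temp_bfill = df["temp"].bfill()

print(dropped)
print(age_mean.tolist(), city_mode.tolist())
print(temp_ffill.tolist(), temp_bfill.tolist())
```

For KNN-based imputation, scikit-learn provides the KNNImputer class in sklearn.impute.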

2. Handling Outliers
● Two common methods to detect outliers are:
(a) Z-score method
(b) IQR Method

(b) IQR Method: In this method we detect outliers using the Inter Quartile Range (IQR).
IQR = Q3 - Q1 measures the spread of the middle 50% of the data. Any value below
Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR is treated as an outlier.
● Q1 represents the 1st quartile/25th percentile of the data.
● Q2 represents the 2nd quartile/median/50th percentile of the data.
● Q3 represents the 3rd quartile/75th percentile of the data.
● (Q1 - 1.5 x IQR) and (Q3 + 1.5 x IQR) are the lower and upper fences. In a box plot
the whiskers extend to the smallest and largest values that lie within these fences.
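
A minimal sketch of both detection methods in pandas. The data is a hypothetical series, and the z-score threshold of 3 is a common convention assumed here, not something stated in the session.

```python
import pandas as pd

# Hypothetical data: 20 ordinary values and one extreme value.
values = pd.Series([10, 11, 12, 13, 14] * 4 + [102])

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: flag values more than 3 standard deviations from the mean
# (threshold of 3 is an assumed convention; population std via ddof=0).
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

print("fences:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
print("z-score outliers:", z_outliers.tolist())
```

The z-score method relies on the mean and standard deviation, which are themselves pulled by extreme values, so on small samples the IQR method is often the more robust of the two.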

3. Handling Imbalanced dataset
3.1 Random Oversampling
3.2 Random Undersampling
3.3 Synthetic Minority Over-sampling Technique (SMOTE)
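
Random oversampling and undersampling can be sketched with pandas alone. The DataFrame and its label column are hypothetical. SMOTE generates synthetic minority samples and is usually taken from the imbalanced-learn library, so it is not shown here.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 negative rows, 2 positive rows.
df = pd.DataFrame({"x": range(10), "label": ["neg"] * 8 + ["pos"] * 2})
majority = df[df["label"] == "neg"]
minority = df[df["label"] == "pos"]

# Random oversampling: draw minority rows with replacement up to the majority size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
oversampled = pd.concat([majority, minority_up], ignore_index=True)

# Random undersampling: draw majority rows without replacement down to the minority size.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
undersampled = pd.concat([majority_down, minority], ignore_index=True)

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```

Resampling is normally applied to the training split only, so that the test split keeps the original class distribution.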

Feature Preprocessing on Text Data
In any machine learning task, cleaning or preprocessing the data is very important, and
when it comes to unstructured data like text, this process is even more important.

Some of the common text preprocessing/cleaning steps are:
● Lower casing
● Removal of Punctuations
● Removal of stopwords
● Removal of frequent words if not important
● Removal of rare words
● Stemming
● Removing alphanumeric words
● Lemmatization
● Removal of emojis
● Removal of special characters
● Removal of URLs
● Removal of HTML tags
● Spelling correction
● Converting numbers to words and vice versa
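
A few of these steps can be chained with the standard library, as in the sketch below. The function name clean_text and the small stopword set are hypothetical. Stemming, lemmatization and spelling correction need dedicated NLP libraries and are not shown.

```python
import re
import string

# Deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "is", "a", "at", "this"}

def clean_text(text):
    """Lowercase, strip HTML tags, URLs and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text("<p>This is a GREAT movie!!! See https://example.com</p>")
print(cleaned)
```

The order of the steps matters: HTML tags and URLs are removed before punctuation so that their delimiters are still intact when the patterns are applied.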

Short Assignment
1. Analysis on Text Data
2. Pre-processing and Data Cleaning on Text Data

Tools & Technologies
1. Python programming language
2. Numpy
3. Pandas
4. Matplotlib
5. Seaborn
6. EDA
7. Feature Engineering
8. Modelling
