EDA and Feature Preprocessing Session
- Brajkishore Prajapati
What is Exploratory Data Analysis (EDA)?
● Exploratory Data Analysis (EDA) is an approach to analyzing
data with statistical summaries and graphical representations.
It is used to discover trends and patterns and to check
assumptions.
● EDA supports decision-making and guides feature
preprocessing.
Statistical Analysis
1. Descriptive Statistical Analysis
2. Inferential Statistical Analysis
Descriptive statistics summarize the characteristics of a data set.
Inferential statistics allow you to test a hypothesis or assess
whether your data is generalizable to the broader population.
Steps to perform effective EDA
1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis
4. Analysis of Missing values
5. Analysis of Outliers
Univariate Analysis
1. Numerical Features:
● In this section we explore and analyze the numerical features and
collect statistical information from the data.
● The describe method (for example, describe in Pandas) is used to extract summary statistics.
● Histograms and KDE plots give a visual representation of the data
and make trends and patterns easier to observe.
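As a minimal sketch of the numerical univariate step, assuming Pandas and NumPy from the tools list, the snippet below summarizes a small made-up table. The column names and values are hypothetical, and the bin counts stand in for what a histogram would draw.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [22, 25, 31, 35, 41, 48, 52, 60],
    "income": [21.0, 25.5, 32.0, 40.0, 43.5, 51.0, 58.0, 75.0],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = df.describe()
print(summary)

# Bin counts are the numbers a histogram would plot; 4 equal-width bins here.
counts, bin_edges = np.histogram(df["age"], bins=4)
print(counts, bin_edges)
```

For the plots themselves, `df["age"].plot(kind="hist")` and `df["age"].plot(kind="kde")` are the usual Pandas entry points; the KDE variant additionally needs SciPy installed.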
Descriptive Statistics Example
Inferential Statistical Analysis
2. Categorical Features
● In this section we explore and analyze the categorical features and their
classes, and extract the important information from the dataset.
● If the problem is a classification problem, check whether the dataset is imbalanced.
● Check the value counts of each category in a variable and the percentage
share of each category.
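A short sketch of the categorical checks, assuming Pandas: value counts, percentage shares, and a simple imbalance ratio for a made-up target column. The column name `churn` and the 8-to-2 split are hypothetical.

```python
import pandas as pd

# Hypothetical target column for a binary classification problem.
target = pd.Series(["no"] * 8 + ["yes"] * 2, name="churn")

# Absolute count of each category.
counts = target.value_counts()

# Share of each category as a percentage.
shares = target.value_counts(normalize=True) * 100

# Ratio of the largest class to the smallest; values far above 1 suggest imbalance.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```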
Bivariate Analysis
1. Numerical - Numerical:
● In this section we explore and analyze two numerical features
at a time.
● Finding the correlation between features.
● Verifying assumptions.
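One way to look at numerical-numerical relationships, sketched with Pandas on made-up data, is a Pearson correlation matrix. The column names are hypothetical, and Pearson is only one choice (`method="spearman"` is also available for monotonic but non-linear relationships).

```python
import pandas as pd

# Hypothetical data: study hours, exam score and absences for six students.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 64, 70, 74],
    "absences": [9, 7, 6, 4, 3, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")
print(corr)

# Correlation of a single pair of features.
r = df["hours_studied"].corr(df["exam_score"])
print(r)
```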
2. Numerical - Categorical
● In this section we analyze the data with one categorical feature and one
numerical variable, and perform the hypothesis tests that fall in this section.
● Checking the significance of features w.r.t. the target variable.
● Hypothesis testing
● The statistical tests that fall under this category are:
1. t-test
2. z-test
3. ANOVA test
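The numerical-categorical tests can be sketched with SciPy, which is an assumption here since it is not in the tools list of this deck. The groups below are made-up measurements of one numerical variable split by a categorical feature; Welch's variant of the t-test (`equal_var=False`) is one possible choice, and the z-test is not shown.

```python
from scipy import stats

# Hypothetical measurements of one numerical variable for three categories.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3]
group_b = [6.9, 7.2, 6.8, 7.5, 7.1, 6.6, 7.0]
group_c = [5.0, 5.2, 5.4, 5.7, 5.9, 5.6, 5.1]

# Two-sample t-test: do groups A and B have the same mean?
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)

# One-way ANOVA: do all three groups share the same mean?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_anova)
```

A small p-value (commonly below 0.05) is read as evidence that the group means differ, which suggests the categorical feature is informative about the numerical one.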
3. Categorical - Categorical
● In this section we analyze the data with pairs of categorical
features. We plot a grouped count plot for the categorical features and
perform the hypothesis tests that fall in this section.
● Checking the significance of features w.r.t. the target variable.
● Hypothesis testing
● The statistical test that falls under this category is the chi-square test.
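A minimal sketch of the chi-square test of independence, assuming Pandas for the contingency table and SciPy (not listed in this deck) for the test. The feature names and counts are made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 100 customers with a gender and a purchase flag.
df = pd.DataFrame({
    "gender": ["M"] * 50 + ["F"] * 50,
    "purchased": ["yes"] * 35 + ["no"] * 15 + ["yes"] * 15 + ["no"] * 35,
})

# Contingency table: these are the counts a grouped count plot would show.
table = pd.crosstab(df["gender"], df["purchased"])
print(table)

# Chi-square test of independence between the two categorical features.
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)
```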
Multivariate Analysis
● In multivariate analysis we take more than two variables at a time,
analyze the trends and patterns, and extract hidden insights from the data.
● A pairplot is one of the graphs we can use to perform multivariate analysis.
Analysis of Missing Values
● During univariate analysis we can detect the presence of missing values.
In this section we examine how the missing values of each feature
relate to the other features.
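As a sketch of missing-value analysis, assuming Pandas and a made-up table, the snippet below measures how much of each column is missing and then checks whether missingness in one feature depends on another. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in "age" and "income".
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, np.nan, 38],
    "income": [30.0, 22.0, np.nan, 80.0, 20.0, 55.0],
    "employed": ["yes", "no", "yes", "yes", "no", "yes"],
})

# Percentage of missing values per column.
missing_share = df.isna().mean() * 100
print(missing_share)

# Share of rows with missing "age" within each "employed" group.
# A large difference between groups suggests the values are not missing at random.
age_missing_by_employed = df["age"].isna().groupby(df["employed"]).mean()
print(age_missing_by_employed)
```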
EDA on Text Data
1. Understanding Problem Statement.
2. Analyzing text statistics
2.1 Sentence length analysis
2.2 Word frequency analysis
1. Analysis of most occurring stopwords.
2. Analysis of most occurring words without stopwords. (Unigrams and Bigrams both)
3. Polarity of sentences.
4. Word Cloud
● Sentence length analysis helps choose the input sequence length (the number of token embeddings per example) for the training data.
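The text statistics above can be sketched with the Python standard library alone. The corpus and the stopword list below are made up, and whitespace splitting is a simplification of real tokenization.

```python
from collections import Counter

# Hypothetical mini corpus and stopword list.
corpus = [
    "the movie was great and the cast was great",
    "the plot was slow",
    "great soundtrack and great visuals",
]
stopwords = {"the", "was", "and"}

# Sentence length analysis: number of tokens per sentence.
tokens_per_sentence = [sentence.split() for sentence in corpus]
sentence_lengths = [len(tokens) for tokens in tokens_per_sentence]

# Word frequency analysis: stopwords and non-stopwords counted separately.
all_tokens = [word for tokens in tokens_per_sentence for word in tokens]
stopword_counts = Counter(w for w in all_tokens if w in stopwords)
unigram_counts = Counter(w for w in all_tokens if w not in stopwords)

# Bigrams built from each sentence after dropping stopwords.
bigram_counts = Counter()
for tokens in tokens_per_sentence:
    content = [w for w in tokens if w not in stopwords]
    bigram_counts.update(zip(content, content[1:]))

print(sentence_lengths)
print(stopword_counts.most_common(3))
print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```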
Feature Preprocessing
1. Handling Missing values
● Deleting rows with missing values
● Replace the missing values with a fixed value based on the insights gained from the data.
● Impute missing values of continuous variables with the mean or median
● Impute missing values of categorical variables with the mode
● Using an imputation algorithm such as the KNN imputer
● Prediction of missing values
● Forward and backward fill methods
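Several of these strategies can be sketched with Pandas on a made-up table: row deletion, median and mode imputation, and forward and backward fill. The column names are hypothetical; KNN imputation and model-based prediction are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in every column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": ["Pune", "Delhi", None, "Pune"],
    "temp": [30.0, np.nan, np.nan, 33.0],
})

# Deleting rows that contain any missing value.
dropped = df.dropna()

imputed = df.copy()
# Continuous variable: fill with the median (use .mean() for the mean).
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
# Categorical variable: fill with the most frequent value (mode).
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])
# Ordered data: carry the last known value forward, or the next one backward.
imputed["temp_ffill"] = df["temp"].ffill()
imputed["temp_bfill"] = df["temp"].bfill()

print(dropped)
print(imputed)
```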
2. Handling Outliers
● There are two methods to handle the outliers:
(a) Z-score method
(b) IQR Method
IQR method: In this method we detect outliers using the interquartile range (IQR = Q3 - Q1).
The IQR measures the spread of the middle 50% of the data set. Any value below Q1 - 1.5 x IQR
or above Q3 + 1.5 x IQR is treated as an outlier.
● Q1 represents the 1st quartile/25th percentile of the data.
● Q2 represents the 2nd quartile/median/50th percentile of the data.
● Q3 represents the 3rd quartile/75th percentile of the data.
● (Q1 - 1.5 x IQR) and (Q3 + 1.5 x IQR) are the lower and upper bounds;
values in the data set outside these bounds are flagged as outliers.
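Both detection methods can be sketched with Pandas on made-up values. The IQR rule here flags anything below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, and the z-score rule uses a threshold of 3, which is a common convention rather than something fixed by the slides.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value (102).
values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 12,
                    11, 13, 12, 14, 13, 12, 11, 13, 12, 102])

# IQR method: fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = values[(values < lower_fence) | (values > upper_fence)]

# Z-score method: distance from the mean in standard deviations.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

print(lower_fence, upper_fence)
print(iqr_outliers.tolist())
print(z_outliers.tolist())
```

On this sample the IQR rule is stricter than the z-score rule, because the extreme value inflates the mean and standard deviation that the z-score depends on.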
3. Handling Imbalanced dataset
3.1 Random Oversampling
3.2 Random Undersampling
3.3 Synthetic Minority Over-sampling Technique (SMOTE)
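Random oversampling and undersampling can be sketched with Pandas alone on a made-up table with 8 majority and 2 minority rows. The column names and the `random_state` value are hypothetical. SMOTE is not shown; it generates synthetic minority samples instead of duplicating rows and is usually taken from a dedicated library such as imbalanced-learn.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 majority rows, 2 minority rows.
df = pd.DataFrame({
    "feature": range(10),
    "label": ["majority"] * 8 + ["minority"] * 2,
})
majority = df[df["label"] == "majority"]
minority = df[df["label"] == "minority"]

# Random oversampling: draw minority rows with replacement up to the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
oversampled = pd.concat([majority, minority_upsampled], ignore_index=True)

# Random undersampling: keep only as many majority rows as there are minority rows.
majority_downsampled = majority.sample(n=len(minority), replace=False, random_state=0)
undersampled = pd.concat([majority_downsampled, minority], ignore_index=True)

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.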
Feature Preprocessing on Text Data
In any machine learning task, cleaning or preprocessing the data is very important, and
for unstructured data such as text it matters even more.
Some of the common text preprocessing/cleaning steps are:
● Lowercasing
● Removal of punctuation
● Removal of stopwords
● Removal of frequent words if not important
● Removal of rare words
● Stemming
● Removal of alphanumeric words
● Lemmatization
● Removal of emojis
● Removal of special characters
● Removal of URLs
● Removal of HTML tags
● Spelling correction
● Conversion of numbers to words and vice versa
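A few of these cleaning steps can be sketched with the Python standard library: lowercasing, HTML tag removal, URL removal, punctuation removal, and stopword removal. The function name `clean_text`, the stopword list and the regular expressions are illustrative assumptions; stemming, lemmatization and spelling correction need additional libraries and are not shown.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "is", "a", "at", "and", "this"}


def clean_text(text: str) -> str:
    """Apply a handful of common text-cleaning steps in a fixed order."""
    text = text.lower()                                  # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopwords
    return " ".join(tokens)


raw = "<p>This movie is GREAT!!! Watch the trailer at https://example.com/trailer</p>"
cleaned = clean_text(raw)
print(cleaned)
```

The order matters: tags and URLs are removed before punctuation, because stripping punctuation first would break the patterns that identify them.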
Short Assignment
1. Analysis on Text Data
2. Pre-processing and Data Cleaning on Text Data
Tools & Technologies
1. Python programming language
2. Numpy
3. Pandas
4. Matplotlib
5. Seaborn
6. EDA
7. Feature Engineering
8. Modelling
