EXPLORATORY DATA ANALYSIS
What is EDA
• Analyzing and investigating data sets
• Summarizing their main characteristics, often with data visualization methods
Steps in EDA
• Description of data
• Handling missing data
• Handling outliers
• Understanding relationships and new insights through plots
Step 1:
• Load the data and study its dimensions (how many rows and columns it has).
• Check that the data types are correct (for example, dates are stored as dates, not strings) and convert any that are not.
• Look at measures of central tendency and spread (averages, distribution type, range, minimum and maximum values, standard deviation, negative values, etc.).
• Check the data for missing values (delete or fill, depending on the selected method).
• Study any time series (check whether the series is interrupted, whether any dates are missing, etc.).
• Work with categorical variables (check the spelling of strings, remove duplicates, standardize names if necessary).
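The checks in Step 1 can be sketched with pandas as follows. This is a minimal illustration, not the presenter's code: the inline CSV and the column names (date, city, sales) are made up for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
raw = io.StringIO(
    "date,city,sales\n"
    "2024-01-01,Delhi,100\n"
    "2024-01-02,delhi ,120\n"
    "2024-01-04,Mumbai,\n"
)
df = pd.read_csv(raw)

# Dimensions: number of rows and columns.
n_rows, n_cols = df.shape

# Data types: store dates as datetimes, not strings.
df["date"] = pd.to_datetime(df["date"])

# Central tendency and spread: count, mean, std, min, quartiles, max.
summary = df["sales"].describe()

# Missing values per column.
missing_per_column = df.isna().sum()

# Time series: dates absent from an otherwise daily series.
full_range = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
missing_dates = full_range.difference(pd.DatetimeIndex(df["date"]))

# Categorical variables: standardize spelling (trim whitespace, unify case).
df["city"] = df["city"].str.strip().str.title()
```

Here missing_dates holds the one skipped day, and the two spellings of Delhi collapse into a single category.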
Contd.
• Second step: when the data is ready, generate the aggregated data that answers your questions (concatenating tables, grouping data, creating new features, etc.).
• Third step: visualize.
• Final step: storytelling.
Why EDA is required
• On the one hand, EDA helps answer questions, test business assumptions, and generate hypotheses for further analysis.
• On the other hand, you can use it to prepare the data for modeling. What the two uses have in common is a good knowledge of your data, either to get the answers you need or to develop an intuition for interpreting the results of future modeling.
• You learn whether the selected features are good enough to model, whether all the features are required, and whether there are correlations that suggest going back to the data pre-processing step or moving on to modeling.
What happens if EDA is not performed
• If EDA is skipped or done poorly, it can hamper the later steps of building a machine learning model.
Types of EDA
Graphical
• Box plots
• Heatmaps
• Histograms
• Line graphs
• Pictograms
• Scattergrams or scatterplots
Non-graphical EDA
• Data profiling summarizes your dataset through descriptive statistics.
• The goal of data profiling is a solid understanding of your data, so that you can afterwards query and visualize it in various ways.
Steps in EDA
 Data Sourcing
 Identification of variables and data types
 Analyzing the basic metrics
 Non-Graphical Univariate Analysis
 Graphical Univariate Analysis
 Bivariate Analysis
 Multivariate Analysis
 Variable transformations/Variable creation
 Missing value treatment
 Outlier treatment
 Correlation Analysis
 Dimensionality Reduction
Data Sourcing
 Data Sourcing is the process of finding and loading the data into our system.
Broadly there are two ways in which we can find data.
 Private Data
 Public Data
Web Scraping
Variable identification
Analyzing the basic metrics
 Descriptive Analysis on the data
 Head of the data
 Data Structures of the data
 Standardisation on the data
Non-Graphical Univariate Analysis
• Getting the count of unique values
• Getting the list and number of unique values
• Filtering based on conditions
• Finding null values
• Data type conversion using the to_datetime() and astype() methods
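The operations listed above could look like this in pandas. It is a small sketch on a made-up DataFrame; the column names are illustrative assumptions, not taken from the dataset used in the session.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "education": ["primary", "secondary", "secondary", None],
    "age": ["25", "31", "47", "52"],
    "joined": ["2021-01-05", "2021-02-11", "2021-02-11", "2021-03-30"],
})

counts = df["education"].value_counts()      # count of each unique value
levels = df["education"].unique()            # list of unique values (includes the missing one)
n_levels = df["education"].nunique()         # number of unique non-null values
nulls = df.isnull().sum()                    # null values per column

df["age"] = df["age"].astype(int)            # string -> integer
df["joined"] = pd.to_datetime(df["joined"])  # string -> datetime

over_30 = df[df["age"] > 30]                 # filtering based on a condition
```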
Graphical Univariate Analysis
• Categorical unordered univariate analysis: an unordered variable is a categorical variable that has no defined order.
• A bar chart works best.
• Categorical ordered univariate analysis: ordered variables have a natural rank or order. Some examples of categorical ordered variables from our dataset are:
• Month: Jan, Feb, March, ...
• Education: Primary, Secondary, ...
• A pie chart can be used.
Bivariate Analysis
• Numeric-numeric analysis:
• Scatter plot
• Pair plot
• Correlation matrix
• Numeric-categorical analysis:
• Analyzed mainly using the mean, the median, box plots, and bar charts
• Categorical-categorical analysis:
• Bar charts
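The non-graphical side of these three pairings can be sketched as follows. The data and column names are invented for illustration; the plots themselves (scatter plot, pair plot, box plot, bar chart) are left out.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [30, 42, 65, 70, 50, 35],
    "marital": ["single", "married", "married", "married", "single", "single"],
    "response": ["no", "yes", "yes", "no", "no", "yes"],
})

# Numeric-numeric: Pearson correlation matrix.
corr = df[["age", "salary"]].corr()

# Numeric-categorical: mean and median of a numeric column per category.
by_marital = df.groupby("marital")["salary"].agg(["mean", "median"])

# Categorical-categorical: contingency table of counts (the data behind a bar chart).
table = pd.crosstab(df["marital"], df["response"])
```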
Multivariate Analysis
• Analyzing data by taking more than two variables/columns from a dataset into consideration is known as multivariate analysis.
• Let's see how 'Education', 'Marital', and 'Response_rate' vary with each other.
• First, we create a pivot table with the three columns, and after that we create a heatmap.
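A minimal sketch of the pivot-table step, assuming lower-case column names and a made-up 0/1 response column (the real dataset is not shown in the slides):

```python
import pandas as pd

# Hypothetical data; values are invented for illustration.
df = pd.DataFrame({
    "education": ["primary", "primary", "secondary", "secondary", "secondary"],
    "marital": ["single", "married", "single", "married", "married"],
    "response_rate": [0, 1, 1, 0, 1],
})

# Mean response rate for every education x marital combination.
pivot = pd.pivot_table(
    df,
    index="education",
    columns="marital",
    values="response_rate",
    aggfunc="mean",
)
```

The resulting table can then be passed to a heatmap function such as seaborn.heatmap(pivot, annot=True) to get the plot.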
Missing value treatment
• A simple option: drop columns with missing values
• A better option: imputation
• A count plot is a good way to identify missing values
2nd Session on EDA.
So Far
• What EDA is, why EDA is required, and what happens if EDA is not performed.
• Types of EDA and steps in EDA:
• Data sourcing
• Identification of variables and data types
• Analyzing the basic metrics
• Non-graphical univariate analysis
• Graphical univariate analysis
• Bivariate analysis and multivariate analysis
• Variable transformations/variable creation
• Missing value treatment and outlier treatment
• Correlation analysis
• Dimensionality reduction
What is EDA
• There are three main components of exploring data:
• Understanding your variables
• Cleaning your dataset
• Analyzing relationships between variables
What we are going to discuss
• Feature engineering
• Missing values handling (imputation techniques for categorical and numerical variables)
• Cardinality of categorical variables (encoding)
• Outlier handling
• Plots
• Does correlation imply causation?
• Imbalanced datasets
Feature Engineering
• Feature Scaling
• min-max normalization & z-score normalization
• Feature Transformation
• Feature Selection
• Feature Importance
• Feature Extraction/Creation
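The two feature-scaling methods named above can be illustrated with NumPy. This is a sketch on a made-up array; the z-score here uses the population standard deviation, which is an assumption rather than something stated in the slides.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: shift to mean 0 and scale to standard deviation 1.
z_score = (x - x.mean()) / x.std()
```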
Missing Values Handling
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
• Continuous: mean and median imputation, imputation by regression
• Categorical: mode imputation, classifier imputation, cluster imputation
• df.isna().any() returns a boolean value for each column. If there is at least one missing value in that column, the result is True.
• df.isna().sum() returns the number of missing values in each column.
• Missing values can be replaced with the values before or after them using forward fill (ffill, the last observation carried forward or LOCF method) or backward fill (bfill). This suits longitudinal data.
• Linear interpolation: data["Age"] = data["Age"].interpolate(method="linear", limit_direction="forward", axis=0)
• Forward or backward fill: data["Age"] = data["Age"].ffill() or data["Age"] = data["Age"].bfill()
• Multiple imputation (Multivariate Imputation by Chained Equations, MICE)
import pandas as pd
from fancyimpute import IterativeImputer as MICE
# fit_transform returns a NumPy array, so restore the column names
data_fit = pd.DataFrame(MICE().fit_transform(data), columns=data.columns)
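The simpler imputation techniques listed above can be tried out on a toy DataFrame. This is a sketch with invented data and hypothetical output column names, so that each method can be compared side by side.

```python
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 60.0, None],
    "city": ["Delhi", "Delhi", None, "Mumbai", "Delhi"],
})

has_missing = df.isna().any()   # True for columns with at least one gap
n_missing = df.isna().sum()     # number of gaps per column

# Continuous: mean and median imputation.
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Categorical: mode imputation (most frequent value).
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# Longitudinal data: last observation carried forward, and linear interpolation.
df["age_ffill"] = df["age"].ffill()
df["age_linear"] = df["age"].interpolate(method="linear")
```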
Cardinality in Category Variables
• Nominal Encoding
• One hot encoding
• One hot encoding with many categories
• Mean encoding
• Ordinal Encoding
• Label Encoding
• Target guided ordinal encoding
• Count/Frequency encoding
One Hot Encoding
(Note: Dummy Variable)
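One-hot encoding can be sketched with pandas.get_dummies. The 'Temperature' column and its values are borrowed from the later slides as an assumption about the example data.

```python
import pandas as pd

# Hypothetical data reusing the 'Temperature' example.
df = pd.DataFrame({"Temperature": ["Hot", "Cold", "Warm", "Hot"]})

# One column per category.
one_hot = pd.get_dummies(df["Temperature"], prefix="Temperature")

# Dummy-variable form: drop the first category, which the others imply.
dummies = pd.get_dummies(df["Temperature"], prefix="Temperature", drop_first=True)
```

Dropping one column avoids the redundancy usually called the dummy variable trap: with k categories, k-1 indicator columns already carry all the information.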
Label Encoding (ordinal)
Labels are assigned in alphabetical order: Cold < Hot < Very Hot < Warm, i.e. 0 < 1 < 2 < 3
One major issue with this approach is that the assigned codes carry no real relation or order between the classes, but an algorithm might interpret them as ordered or related.
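The issue described above can be seen in a short pandas sketch. The data is invented, and the "meaningful" order used for the second column is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({"Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot"]})

# Label encoding: categories are sorted alphabetically, then numbered,
# giving Cold=0, Hot=1, Very Hot=2, Warm=3.
df["label"] = df["Temperature"].astype("category").cat.codes

# Ordinal encoding with an explicit order (assumed here for illustration).
order = ["Cold", "Warm", "Hot", "Very Hot"]
df["ordinal"] = pd.Categorical(df["Temperature"], categories=order, ordered=True).codes
```

In the label column Warm gets the highest code, which is the arbitrary ordering a model might wrongly treat as meaningful.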
Frequency Encoding
• Select a categorical variable you would like to transform.
• Group by the categorical variable and obtain the count of each category.
• Join it back with the training dataset.
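The three steps above can be sketched as follows, on a made-up training set with a hypothetical output column name.

```python
import pandas as pd

# Step 1: pick the categorical variable (hypothetical training data).
train = pd.DataFrame({"Temperature": ["Hot", "Cold", "Hot", "Warm", "Hot", "Cold"]})

# Step 2: group by the variable and count each category.
counts = train.groupby("Temperature").size()

# Step 3: join the counts back onto the training data.
train["Temperature_freq"] = train["Temperature"].map(counts)
```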
Mean/Target Encoding
In mean (target) encoding, each category of a feature is replaced by the mean value of the target variable for that category, computed on the training data. This encoding brings out the relation between similar categories, but the relations are confined to the categories and the target itself.
1. Select a categorical variable you would like to transform.
2. Group by the categorical variable and obtain the aggregated sum over the "Target" variable (the total number of 1's for each category in 'Temperature').
3. Group by the categorical variable and obtain the aggregated count over the "Target" variable.
4. Divide the result of step 2 by the result of step 3 and join it back with the training dataset.
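Steps 1 to 4 can be sketched in pandas like this. The data and the output column name are invented for illustration.

```python
import pandas as pd

# Step 1: hypothetical training data with a binary target.
train = pd.DataFrame({
    "Temperature": ["Hot", "Cold", "Hot", "Warm", "Hot", "Cold"],
    "Target": [1, 0, 1, 1, 0, 0],
})

# Step 2: sum of the target (number of 1's) per category.
sums = train.groupby("Temperature")["Target"].sum()

# Step 3: count of rows per category.
counts = train.groupby("Temperature")["Target"].count()

# Step 4: divide and join back onto the training data.
means = sums / counts
train["Temperature_mean_enc"] = train["Temperature"].map(means)
```

Steps 2 to 4 are equivalent to taking the group mean directly with train.groupby("Temperature")["Target"].mean().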
Binary Encoding
Home Work
Helmert Encoding
Weight of Evidence Encoding: ln(% of non-events / % of events)
Probability Ratio Encoding
Hashing Encoding
Backward Difference Encoding
Leave One Out Encoding
James-Stein Encoding
M-estimator Encoding
Thermometer Encoder
Helmert Coding
• What is Helmert coding?
• It compares each level of a variable with the mean of the subsequent levels of the variable. Simply put, each level is compared with the levels that come after it. Let's understand this with an example.
• If there are L levels, the first comparison is of the first level vs. the (L−1) other levels. The weights are then (L−1)/L for the first level and −1/L for each of the other levels. With 4 levels, L = 4, so the weights are 0.75 and −0.25 (three times).
• The next comparison has only L−1 levels (the first level is no longer part of the comparison), so the weights are now (L−2)/(L−1) for the first level and −1/(L−1) for the others (in our case, 2/3 and −1/3), and so on.
• This type of encoding is useful when the levels of the categorical variable are ordered in a meaningful way.
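The weights described above can be generated for any number of levels. This is a small sketch; helmert_weights is a hypothetical helper written for the example, not a library function.

```python
from fractions import Fraction

def helmert_weights(n_levels):
    """Return one row of contrast weights per comparison, as described above."""
    rows = []
    for k in range(n_levels - 1):
        remaining = n_levels - k           # levels still taking part in the comparison
        row = [Fraction(0)] * k            # earlier levels have dropped out
        row.append(Fraction(remaining - 1, remaining))               # current level
        row.extend([Fraction(-1, remaining)] * (remaining - 1))      # later levels
        rows.append(row)
    return rows

weights = helmert_weights(4)
# weights[0] is 3/4, -1/4, -1/4, -1/4
# weights[1] is 0, 2/3, -1/3, -1/3
# weights[2] is 0, 0, 1/2, -1/2
```

Each row sums to zero, which is what makes it a contrast between one level and the mean of the later ones.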
Outliers Handling
For a normal distribution:
Lower boundary = Mean - 3 × (Standard deviation)
Upper boundary = Mean + 3 × (Standard deviation)
If the data does not follow a normal distribution (for example, it is right-skewed or left-skewed), use the interquartile range (IQR = Q3 - Q1) to set the outlier limits:
Lower boundary = First quartile (Q1, 25th percentile) - k × IQR
Upper boundary = Third quartile (Q3, 75th percentile) + k × IQR
where k is 1.5 or 3.
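Both sets of boundaries can be computed with NumPy. The sample below is made up, with one obviously extreme value, and k = 1.5 is used for the IQR rule.

```python
import numpy as np

# Hypothetical sample with one extreme value.
x = np.array([10, 12, 11, 13, 12, 11, 14, 95], dtype=float)

# Boundaries for (approximately) normal data: mean +/- 3 standard deviations.
mean, std = x.mean(), x.std()
normal_lower, normal_upper = mean - 3 * std, mean + 3 * std

# Boundaries for skewed data: quartiles +/- 1.5 * IQR.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_lower, iqr_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < iqr_lower) | (x > iqr_upper)]
```

On this small sample the IQR rule flags 95, while the 3-standard-deviation rule does not, because the extreme value itself inflates the standard deviation.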
Outliers handling
• Remove the observations identified by a box plot, or cap them with the Winsorize method.
• Imputation
• As imputed values, we can choose between the mean, median, mode, and boundary values.
• If you don't want to remove the outliers, you can instead compress the values into a narrower range (transformation).
• Minkowski error method
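Three of the treatments above can be sketched with NumPy. This is an illustration under assumptions: the sample is invented, capping at the 1.5 × IQR boundaries stands in for a Winsorize-style treatment, and a log transform is just one possible choice of transformation.

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 95], dtype=float)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (x < lower) | (x > upper)

# Cap outliers at the boundary values.
capped = np.clip(x, lower, upper)

# Impute outliers with the median instead.
median_imputed = np.where(is_outlier, np.median(x), x)

# Keep every observation but compress the range with a log transform.
log_transformed = np.log1p(x)
```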
Imbalanced Datasets
Under-sampling the majority class
Under-sampling resamples the majority class points in the data so that their number equals that of the minority class.
Over-sampling the minority class by duplication
Over-sampling resamples the minority class points in the data, with duplication, so that their number equals that of the majority class.
Imbalanced Datasets
• Under-sampling the majority class
• Over-sampling the minority class by duplication
• Over-sampling the minority class using the Synthetic Minority Oversampling Technique (SMOTE)
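The first two strategies can be sketched with plain pandas random sampling. The data is invented, and SMOTE itself is not implemented here; a ready-made version is available as imblearn.over_sampling.SMOTE in the imbalanced-learn package.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 majority rows (label 0), 2 minority rows (label 1).
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Under-sampling: keep only as many majority rows as there are minority rows.
under = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Over-sampling by duplication: draw minority rows with replacement
# until they match the majority count.
over = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])
```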

EDA by Sastry.pptx

  • 2. What is EDA  To analyze and investigate data sets  summarize their main characteristics, often employing data visualization methods.
  • 3. Steps in EDA • Description of data • Handling missing data • Handling outliers • Understanding relationships and new insights through plots Step 1: • load data, study the dimension (how many rows and columns are in the data). • look to see if the data types are the same (for example, dates are represented by a date, not a string) and change the data to the correct type. • look at measures of the central trend (I look at averages, distribution type, range, minimum and - maximum values, standard deviation, negative values, etc.). • check the data for the absence of values ​​(delete or fill, depending on the selected method). • I am studying a time series (checking if the time series is interrupted, if any dates are missing, etc.). • I work with categorical variables (check the spelling of strings, duplicates, standardize names if necessary).
  • 4. Contd.. • When the data is ready, I move on to generating the aggregated data (second step) answering my questions (concatenating tables, grouping data, creating new functions, etc • Third Step Visualise • Final Step Story Telling
  • 5. Why EDA is required  The one hand to answer questions, test business assumptions, generate hypotheses for further analysis.  On the other hand, you can also use it to prepare the data for modeling. The thing that these two probably have in common is a good knowledge of your data to either get the answers that you need or to develop an intuition for interpreting the results of future modeling.  you can get to know whether the selected features are good enough to model, are all the features required, are there any correlations based on which we can either go back to the Data Pre-processing step or move on to modeling
  • 6. What will happen if EDA is not performed • If EDA is not done properly, it can hamper the later steps of the machine learning model building process.
  • 7. Types of EDA Graphical EDA: • Box plots • Heatmaps • Histograms • Line graphs • Pictograms • Scattergrams or scatterplots Non-graphical EDA: • Data profiling summarizes your dataset through descriptive statistics. • The goal of data profiling is to have a solid understanding of your data so that you can afterwards start querying and visualizing it in various ways.
  • 8.
  • 9. Steps in EDA  Data Sourcing  Identification of variables and data types  Analyzing the basic metrics  Non-Graphical Univariate Analysis  Graphical Univariate Analysis  Bivariate Analysis  Multivariate Analysis  Variable transformations/Variable creation  Missing value treatment  Outlier treatment  Correlation Analysis  Dimensionality Reduction
  • 10. Data Sourcing • Data sourcing is the process of finding and loading the data into our system. Broadly, there are two ways in which we can find data. • Private Data • Public Data • Web Scraping
  • 12. Analyzing the basic metrics  Descriptive Analysis on the data  Head of the data  Data Structures of the data  Standardisation on the data
  • 13. Non-Graphical Univariate Analysis:  To get the count of unique values:  To get the list & number of unique values:  Filtering based on Conditions:  Finding null values:  Data Type Conversion using to_datetime() and astype() methods
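The operations listed on this slide can be sketched in pandas roughly as follows. The DataFrame and its column names are made up for illustration; they are not from the deck's dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "education": ["primary", "secondary", "secondary", None, "tertiary"],
    "age": ["25", "31", "47", "52", "38"],  # numbers stored as strings
    "contact_date": ["2021-01-05", "2021-02-11", "2021-02-11",
                     "2021-03-02", "2021-03-19"],
})

# Count of unique values (missing values are not counted).
n_unique = df["education"].nunique()

# List of unique values and how often each one occurs.
unique_values = df["education"].unique()
value_counts = df["education"].value_counts()

# Number of null values per column.
null_counts = df.isnull().sum()

# Data type conversion with astype() and to_datetime().
df["age"] = df["age"].astype(int)
df["contact_date"] = pd.to_datetime(df["contact_date"])

# Filtering based on a condition.
over_35 = df[df["age"] > 35]
```

Converting the types first matters here: the filter on `age` only works as a numeric comparison once the column is no longer a string.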
  • 14. Graphical Univariate Analysis • Categorical Unordered Univariate Analysis: an unordered variable is a categorical variable that has no defined order. • A bar chart is best. • Categorical Ordered Univariate Analysis: ordered variables are those that have a natural rank or order. Some examples of categorical ordered variables from our dataset are: • Month: Jan, Feb, March... • Education: Primary, Secondary, ... • Pie chart
  • 15. Bivariate Analysis • Numeric-Numeric Analysis: • Scatter Plot • Pair Plot • Correlation Matrix • Numeric-Categorical Analysis: • We analyze them mainly using the mean, median, box plots, and bar charts. • Categorical-Categorical Analysis: • Bar charts
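A minimal sketch of the non-graphical side of these three cases, using pandas only. The columns `age`, `balance`, `marital`, and `response` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "age": [25, 31, 47, 52, 38, 29],
    "balance": [500, 800, 1500, 1700, 1100, 650],
    "marital": ["single", "married", "married",
                "divorced", "single", "single"],
    "response": ["no", "yes", "yes", "no", "no", "yes"],
})

# Numeric-numeric: Pearson correlation matrix.
corr = df[["age", "balance"]].corr()

# Numeric-categorical: mean and median of a numeric column per category.
by_marital = df.groupby("marital")["balance"].agg(["mean", "median"])

# Categorical-categorical: contingency table of counts.
counts = pd.crosstab(df["marital"], df["response"])
```

Each result is a small table that can be passed to a plotting library to produce the scatter plots, box plots, and bar charts named above.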
  • 16. Multivariate Analysis  If we analyze data by taking more than two variables/columns into consideration from a dataset, it is known as Multivariate Analysis.  Let’s see how ‘Education’, ‘Marital’, and ‘Response_rate’ vary with each other.  First, we’ll create a pivot table with the three columns and after that, we’ll create a heatmap
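The pivot-table step could look roughly like this. The rows are invented, and `Response_rate` is assumed to be a 0/1 column so that its mean is a rate.

```python
import pandas as pd

# Hypothetical rows; only the three column names come from the slide.
df = pd.DataFrame({
    "Education": ["primary", "primary", "secondary",
                  "secondary", "tertiary", "tertiary"],
    "Marital": ["single", "married", "single",
                "married", "single", "married"],
    "Response_rate": [0, 1, 1, 1, 0, 1],
})

# Mean response for every Education x Marital combination.
pivot = df.pivot_table(index="Education", columns="Marital",
                       values="Response_rate", aggfunc="mean")

# The heatmap could then be drawn with, for example,
# seaborn.heatmap(pivot, annot=True); plotting is left out here.
```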
  • 17. Missing value treatment • A simple option: drop columns with missing values • A better option: imputation • A count plot is a good way to identify missing values
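The two options can be compared on a small example. This is a sketch with invented data; median and mode are just one possible choice of fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in a numeric and one in a text column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 52.0],
    "city": ["Delhi", "Pune", None, "Pune"],
    "income": [30.0, 42.0, 55.0, 61.0],
})

# Simple option: drop every column that has a missing value.
dropped = df.dropna(axis=1)

# Better option: impute, keeping all columns.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])
```

Dropping leaves only `income` here, which shows how much information the simple option can throw away.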
  • 19. So Far • What is EDA, why EDA is required, what will happen if EDA is not performed. • Types of EDA and steps in EDA: • Data Sourcing • Identification of variables and data types • Analyzing the basic metrics • Non-Graphical Univariate Analysis • Graphical Univariate Analysis • Bivariate Analysis & Multivariate Analysis • Variable transformations/Variable creation • Missing value treatment & Outlier treatment • Correlation Analysis • Dimensionality Reduction
  • 20. What is EDA • There are three main components of exploring data: • Understanding your variables • Cleaning your dataset • Analyzing relationships between variables
  • 21. What we are going to discuss • Feature Engineering • Missing Values Handling (imputation techniques for categorical and numerical variables) • Cardinality of Categorical Variables (Encoding) • Outliers Handling • Plots • Does correlation imply causation? • Imbalanced Datasets
  • 22. Feature Engineering • Feature Scaling • min-max normalization & z-score normalization • Feature Transformation • Feature Selection • Feature Importance • Feature Extraction/Creation
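The two scaling methods named here are min-max normalization, (x - min) / (max - min), and z-score normalization, (x - mean) / std. A minimal sketch on a made-up series:

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])  # illustrative values

# Min-max normalization: rescales to the [0, 1] range.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
# ddof=0 uses the population standard deviation (an assumption here).
z_score = (x - x.mean()) / x.std(ddof=0)
```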
  • 23. Missing Values Handling • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Missing Not at Random (MNAR) • Continuous: mean and median imputation, imputation by regression • Categorical: mode imputation, classifier imputation, cluster imputation • df.isna().any() returns a boolean value for each column: True if there is at least one missing value in that column. • df.isna().sum() returns the number of missing values in each column. • Missing values can be replaced with the values before or after them using forward or backward fill; forward fill is the Last Observation Carried Forward (LOCF) method, suited to longitudinal data. • data["Age"] = data["Age"].interpolate(method='linear', limit_direction='forward', axis=0); use data["Age"].ffill() or data["Age"].bfill() for forward or backward fill. • Multiple imputation (Multivariate Imputation by Chained Equations, MICE)
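A short sketch of the detection and fill calls mentioned on this slide, run on an invented `Age` column:

```python
import numpy as np
import pandas as pd

# Hypothetical data; "Age" has three gaps, "Fare" has none.
data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, np.nan, 32.0],
    "Fare": [7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
})

has_missing = data.isna().any()   # True/False per column
n_missing = data.isna().sum()     # number of missing values per column

# Last observation carried forward (LOCF) and its backward counterpart.
locf = data["Age"].ffill()
backfilled = data["Age"].bfill()

# Linear interpolation between the neighbouring known values.
linear = data["Age"].interpolate(method="linear", limit_direction="forward")
```

On this series LOCF repeats 22 and 26 into the gaps, while linear interpolation fills them with 24, 28, and 30.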
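  • 24. MICE with fancyimpute (data is assumed to be an all-numeric DataFrame):
      import pandas as pd
      from fancyimpute import IterativeImputer as MICE
      # fit_transform returns a NumPy array, so restore the column names
      data_fit = pd.DataFrame(MICE().fit_transform(data), columns=data.columns)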
  • 25. Cardinality in Categorical Variables • Nominal Encoding • One hot encoding • One hot encoding with many categories • Mean encoding • Ordinal Encoding • Label Encoding • Target guided ordinal encoding • Count/Frequency encoding
  • 26. One Hot Encoding (Note: Dummy Variable)
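One way to one-hot encode in pandas is `pd.get_dummies`. The `Temperature` values below are borrowed from the later slides. Dropping the first level is one common reading of the "dummy variable" note, since the dropped column is implied by the others.

```python
import pandas as pd

df = pd.DataFrame({"Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot"]})

# Full one-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["Temperature"], prefix="Temp")

# Dummy encoding: drop the first category to avoid a redundant column.
dummies = pd.get_dummies(df["Temperature"], prefix="Temp", drop_first=True)
```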
  • 27. Label Encoding (ordinal) Labels are assigned alphabetically, so Cold < Hot < Very Hot < Warm becomes 0 < 1 < 2 < 3. One major issue with this approach is that the integer codes need not reflect any real relation or order between the classes, but the algorithm might treat them as ordered or related.
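The ordering issue can be seen by comparing default (alphabetical) codes with codes from an explicitly ordered category. The "true" order used below is an assumption made for illustration.

```python
import pandas as pd

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"])

# Default label encoding: categories are sorted alphabetically,
# giving Cold=0, Hot=1, Very Hot=2, Warm=3.
alphabetical = temps.astype("category").cat.codes

# Assumed natural order, supplied explicitly.
order = ["Cold", "Warm", "Hot", "Very Hot"]
ordered = pd.Categorical(temps, categories=order, ordered=True).codes
```

With the alphabetical codes, Warm (3) ranks above Very Hot (2), which is exactly the spurious relationship the slide warns about.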
  • 28. Frequency Encoding • Select a categorical variable you would like to transform. • Group by the categorical variable and obtain the count of each category. • Join the counts back to the training dataset.
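The three steps map onto a group-by and a merge. This is a sketch; the data and the new column name `Temp_freq` are invented.

```python
import pandas as pd

# Step 1: the categorical variable to transform (hypothetical data).
train = pd.DataFrame({
    "Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                    "Warm", "Warm", "Hot", "Hot", "Cold"],
})

# Step 2: group by the variable and count each category.
counts = (train.groupby("Temperature").size()
          .rename("Temp_freq").reset_index())

# Step 3: join the counts back to the training data.
train = train.merge(counts, on="Temperature", how="left")
```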
  • 29. Mean/Target Encoding In mean target encoding, each category of the feature is replaced with the mean value of the target variable for that category on the training data. This encoding brings out the relation between similar categories, but the connections are bounded within the categories and the target itself. 1. Select a categorical variable you would like to transform. 2. Group by the categorical variable and obtain the aggregated sum over the “Target” variable (the total number of 1’s for each category in ‘Temperature’). 3. Group by the categorical variable and obtain the aggregated count over the “Target” variable. 4. Divide the step 2 result by the step 3 result and join it back to the training data.
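Steps 1 to 4 can be followed literally in pandas. The data and the column name `Temp_mean_enc` are illustrative assumptions.

```python
import pandas as pd

# Step 1: categorical variable plus a binary target (hypothetical data).
train = pd.DataFrame({
    "Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                    "Warm", "Warm", "Hot", "Hot", "Cold"],
    "Target": [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})

# Step 2: number of 1's per category.
sums = train.groupby("Temperature")["Target"].sum()

# Step 3: number of rows per category.
counts = train.groupby("Temperature")["Target"].count()

# Step 4: divide and join the result back to the training data.
means = (sums / counts).rename("Temp_mean_enc")
train = train.join(means, on="Temperature")
```

Steps 2 to 4 are equivalent to a single `groupby(...).mean()`; they are kept separate here to mirror the slide.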
  • 31. Home Work • Helmert Encoding • Weight of Evidence Encoding: ln(% of non-events / % of events) • Probability Ratio Encoding • Hashing Encoding • Backward Difference Encoding • Leave One Out Encoding • James-Stein Encoding • M-estimator Encoding • Thermometer Encoder
  • 32. Helmert Coding • What is Helmert coding? • It compares each level of a variable with the mean of the subsequent levels of the variable. Simply put, each level is compared to the later levels. Let's understand this with an example: • If there are L levels, then the first comparison is of the first level vs. the (L−1) other levels. The weights are then (L−1)/L for the first level and −1/L for each of the other levels. If the number of levels is 4, then L = 4, so the weights will be 0.75 and −0.25 (3 times). • The next comparison has only L−1 levels (the first level is no longer part of the comparisons), so now the weights are (L−2)/(L−1) for the first remaining level and −1/(L−1) for the others (in our case, 2/3 and −1/3). And so on. • This type of encoding is useful when the levels of the categorical variable are ordered in a meaningful way.
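The weights described on this slide can be generated for any number of levels. The helper `helmert_weights` is a hypothetical name written for this sketch, not a library function.

```python
import numpy as np

def helmert_weights(n_levels):
    """Return an (n_levels x n_levels-1) matrix of the weights described above.

    Column k compares level k with the levels after it; earlier levels
    get weight 0 because they are no longer part of the comparison.
    """
    weights = np.zeros((n_levels, n_levels - 1))
    for k in range(n_levels - 1):
        remaining = n_levels - k              # levels still being compared
        weights[k, k] = (remaining - 1) / remaining
        weights[k + 1:, k] = -1.0 / remaining
    return weights

W = helmert_weights(4)
```

For L = 4 the first column is 0.75 followed by three times −0.25, and the second is 0, 2/3, −1/3, −1/3, matching the numbers on the slide. Each column sums to zero.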
  • 33. Outliers Handling For a normal distribution: Lower Boundary = Mean - 3 * (Standard Deviation); Upper Boundary = Mean + 3 * (Standard Deviation). If the data does not follow a normal distribution (it is right-skewed or left-skewed), use the interquartile range (IQR = Q3 - Q1) to set the outlier limits: Lower Boundary = First Quartile (Q1, 25th percentile) - (1.5 or 3) * IQR; Upper Boundary = Third Quartile (Q3, 75th percentile) + (1.5 or 3) * IQR.
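Both sets of boundaries can be computed directly. This is a sketch on made-up values, using the 1.5 multiplier for the IQR rule.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 13, 12, 95], dtype=float)

# Boundaries for (approximately) normal data: mean +/- 3 standard deviations.
mean, std = values.mean(), values.std()
normal_lower = mean - 3 * std
normal_upper = mean + 3 * std

# Boundaries for skewed data: quartiles -/+ 1.5 * IQR.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_lower = q1 - 1.5 * iqr
iqr_upper = q3 + 1.5 * iqr

outliers = values[(values < iqr_lower) | (values > iqr_upper)]
```

On this series the IQR rule flags 95, while the 3-standard-deviation upper boundary lies above 95 because the outlier itself inflates the standard deviation.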
  • 34. Outliers handling • Remove the observations identified by a box plot, or cap them with the Winsorize method • Imputation • As impute values, we can choose between the mean, median, mode, and boundary values. • If you don't want to remove the outliers, you can instead transform the data to compress its range (transformation). • Minkowski Error Method
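Two of these options, sketched on invented values: capping at the boundary values and replacing outliers with the median. Note that capping at the IQR boundaries is a simplification; winsorizing in the strict sense caps at chosen percentiles.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 14, 13, 12, 95], dtype=float)

# IQR boundaries with the 1.5 multiplier.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Option 1: cap outliers at the boundary values (no rows are removed).
capped = values.clip(lower=lower, upper=upper)

# Option 2: impute outliers with the median of the non-outlying values.
median_imputed = values.where(~is_outlier, values[~is_outlier].median())
```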
  • 35. Imbalanced Datasets • Under-sampling the majority class: resample the majority class points so that their number equals that of the minority class. • Over-sampling the minority class by duplication: resample the minority class points (with replacement) so that their number equals that of the majority class.
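Both resampling strategies can be sketched with `DataFrame.sample`. The data and the `label` column are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 majority rows (0), 2 minority rows (1).
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Under-sampling: shrink the majority class to the minority size.
under = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

# Over-sampling by duplication: draw minority rows with replacement
# until they match the majority size.
over = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])
```

Under-sampling discards majority rows, while over-sampling by duplication repeats minority rows, so the choice trades lost information against repeated points.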
  • 36. Imbalanced Datasets • Under-sampling the majority class • Over-sampling the minority class by duplication • Over-sampling the minority class using the Synthetic Minority Oversampling Technique (SMOTE)