SlideShare a Scribd company logo
1 of 14
Exploratory Data Analysis
by Neha Mathur
0
What is Exploratory Data
Analysis
1
EDA is an approach for data analysis using variety
of techniques to gain insights about the data.
• Cleaning and preprocessing
• Statistical Analysis
• Visualization for trend analysis,
anomaly detection, outlier
detection (and removal).
Basic steps in any
exploratory data
analysis:
Importance of EDA
Improve understanding of variables by extracting averages, mean,
minimum, and maximum values, etc.
Discover errors, outliers, and missing values in the data.
Identify patterns by visualizing data in graphs such as bar graphs,
scatter plots, heatmaps and histograms.
2
EDA using Pandas
Import data into workplace(Jupyter notebook, Google colab, Python IDE)
Descriptive statistics
Removal of nulls
Visualization
3
1. Packages and data import
• Step 1 : Import pandas to the workplace.
• “Import pandas”
• Step 2 : Read data/dataset into Pandas dataframe. Different input
formats include:
• Excel : read_excel
• CSV: read_csv
• JSON: read_json
• HTML and many more
4
2. Descriptive
Stats (Pandas)
• Used to make preliminary assessments about the population
distribution of the variable.
• Commonly used statistics:
1. Central tendency :
• Mean – The average value of all the data points. :
dataframe.mean()
• Median – The middle value when all the data points are
put in an ordered list: dataframe.median()
• Mode – The data point which occurs the most in the
dataset :dataframe.mode()
2. Spread : It is the measure of how far the datapoints are away
from the mean or median
• Variance - The variance is the mean of the squares of the
individual deviations: dataframe.var()
• Standard deviation - The standard deviation is the square
root of the variance:dataframe.std()
3. Skewness: It is a measure of asymmetry: dataframe.skew()
Descriptive
Stats (contd.)
Other methods to get a quick look on the data:
• Describe() : Summarizes the central tendency,
dispersion and shape of a dataset’s distribution,
excluding NaN values.
• Syntax: pandas.dataframe.describe()
• Info() :Prints a concise summary of the
dataframe. This method prints information
about a dataframe including the index dtype
and columns, non-null values and memory
usage.
• Syntax: pandas.dataframe.info()
3. Null values
7
Detecting
Detecting Null-
values:
•Isnull(): It is used as an
alias for dataframe.isna().
This function returns the
dataframe with boolean
values indicating missing
values.
•Syntax :
dataframe.isnull()
Handling
Handling null values:
•Dropping the rows with
null values: dropna()
function is used to delete
rows or columns with
null values.
•Replacing missing values:
fillna() function can fill
the missing values with a
special value value like
mean or median.
4. Visualization
• Univariate: Looking at one variable/column at a time
• Bar-graph
• Histograms
• Boxplot
• Multivariate : Looking at relationship between two or more variables
• Scatter plots
• Pie plots
• Heatmaps(seaborn)
8
Bar-Graph,Histogram
and Boxplot
• Bar graph: A bar plot is a plot that presents
data with rectangular bars with lengths
proportional to the values that they represent.
• Boxplot : Depicts numerical data graphically
through their quartiles. The box extends from
the Q1 to Q3 quartile values of the data, with
a line at the median (Q2).
• Histogram: A histogram is a representation of
the distribution of data.
9
Scatterplot, Pieplot
• Scatterplot : Shows the data as a collection of points.
• Syntax: dataframe.plot.scatter(x = 'x_column_name', y = 'y_columnn_name’)
• Pie plot : Proportional representation of the numerical data in a column.
• Syntax: dataframe.plot.pie(y=‘column_name’)
10
Outlier detection
• An outlier is a point or set of data points that lie away from the rest of
the data values of the dataset..
• Outliers are easily identified by visualizing the data.
• For e.g.
• In a boxplot, the data points which lie outside the upper and lower bound can
be considered as outliers
• In a scatterplot, the data points which lie outside the groups of datapoints can
be considered as outliers
11
Outlier removal
• Calculate the IQR as follows:
Calculate the first and third quartile (Q1 and Q3)
Calculate the interquartile range, IQR = Q3-Q1
Find the lower bound which is Q1*1.5
Find the upper bound which is Q3*1.5
Replace the data points which lie outside this range.
They can be replaced by mean or median.
12
References
• More information on EDA tools and Pandas can be found on below
links:
• https://pandas.pydata.org/docs/user_guide/index.html
• https://pandas.pydata.org/docs/user_guide/missing_data.html
• https://pandas.pydata.org/docs/user_guide/visualization.html
13

More Related Content

Similar to EDA.pptx

Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdfJulioRecaldeLara1
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxtangadhurai
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingTony Nguyen
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
MODULE 5- EDA.pptx
MODULE 5- EDA.pptxMODULE 5- EDA.pptx
MODULE 5- EDA.pptxnikshaikh786
 
Pre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_techniquePre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_techniqueBhushan134837
 
2. Data Preprocessing with Numpy and Pandas.pptx
2. Data Preprocessing with Numpy and Pandas.pptx2. Data Preprocessing with Numpy and Pandas.pptx
2. Data Preprocessing with Numpy and Pandas.pptxPeangSereysothirich
 
II B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxII B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxsabithabanu83
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfAmmarAhmedSiddiqui2
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningmy6305874
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitizationVenkata Reddy Konasani
 
EDA tools and making sense of data.pdf
EDA tools and making sense of   data.pdfEDA tools and making sense of   data.pdf
EDA tools and making sense of data.pdf9wldv5h8n
 
Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptxImXaib
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONNandakumar P
 

Similar to EDA.pptx (20)

Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdf
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
MODULE 5- EDA.pptx
MODULE 5- EDA.pptxMODULE 5- EDA.pptx
MODULE 5- EDA.pptx
 
Pre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_techniquePre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_technique
 
2. Data Preprocessing with Numpy and Pandas.pptx
2. Data Preprocessing with Numpy and Pandas.pptx2. Data Preprocessing with Numpy and Pandas.pptx
2. Data Preprocessing with Numpy and Pandas.pptx
 
II B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxII B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptx
 
1234
12341234
1234
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
Data Exploration in R.pptx
Data Exploration in R.pptxData Exploration in R.pptx
Data Exploration in R.pptx
 
EDA tools and making sense of data.pdf
EDA tools and making sense of   data.pdfEDA tools and making sense of   data.pdf
EDA tools and making sense of data.pdf
 
Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptx
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 

Recently uploaded

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 

Recently uploaded (20)

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 

EDA.pptx

  • 2. What is Exploratory Data Analysis 1 EDA is an approach for data analysis using variety of techniques to gain insights about the data. • Cleaning and preprocessing • Statistical Analysis • Visualization for trend analysis, anomaly detection, outlier detection (and removal). Basic steps in any exploratory data analysis:
  • 3. Importance of EDA Improve understanding of variables by extracting averages, mean, minimum, and maximum values, etc. Discover errors, outliers, and missing values in the data. Identify patterns by visualizing data in graphs such as bar graphs, scatter plots, heatmaps and histograms. 2
  • 4. EDA using Pandas Import data into workplace(Jupyter notebook, Google colab, Python IDE) Descriptive statistics Removal of nulls Visualization 3
  • 5. 1. Packages and data import • Step 1 : Import pandas to the workplace. • “Import pandas” • Step 2 : Read data/dataset into Pandas dataframe. Different input formats include: • Excel : read_excel • CSV: read_csv • JSON: read_json • HTML and many more 4
  • 6. 2. Descriptive Stats (Pandas) • Used to make preliminary assessments about the population distribution of the variable. • Commonly used statistics: 1. Central tendency : • Mean – The average value of all the data points. : dataframe.mean() • Median – The middle value when all the data points are put in an ordered list: dataframe.median() • Mode – The data point which occurs the most in the dataset :dataframe.mode() 2. Spread : It is the measure of how far the datapoints are away from the mean or median • Variance - The variance is the mean of the squares of the individual deviations: dataframe.var() • Standard deviation - The standard deviation is the square root of the variance:dataframe.std() 3. Skewness: It is a measure of asymmetry: dataframe.skew()
  • 7. Descriptive Stats (contd.) Other methods to get a quick look on the data: • Describe() : Summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. • Syntax: pandas.dataframe.describe() • Info() :Prints a concise summary of the dataframe. This method prints information about a dataframe including the index dtype and columns, non-null values and memory usage. • Syntax: pandas.dataframe.info()
  • 8. 3. Null values 7 Detecting Detecting Null- values: •Isnull(): It is used as an alias for dataframe.isna(). This function returns the dataframe with boolean values indicating missing values. •Syntax : dataframe.isnull() Handling Handling null values: •Dropping the rows with null values: dropna() function is used to delete rows or columns with null values. •Replacing missing values: fillna() function can fill the missing values with a special value value like mean or median.
  • 9. 4. Visualization • Univariate: Looking at one variable/column at a time • Bar-graph • Histograms • Boxplot • Multivariate : Looking at relationship between two or more variables • Scatter plots • Pie plots • Heatmaps(seaborn) 8
  • 10. Bar-Graph,Histogram and Boxplot • Bar graph: A bar plot is a plot that presents data with rectangular bars with lengths proportional to the values that they represent. • Boxplot : Depicts numerical data graphically through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). • Histogram: A histogram is a representation of the distribution of data. 9
  • 11. Scatterplot, Pieplot • Scatterplot : Shows the data as a collection of points. • Syntax: dataframe.plot.scatter(x = 'x_column_name', y = 'y_columnn_name’) • Pie plot : Proportional representation of the numerical data in a column. • Syntax: dataframe.plot.pie(y=‘column_name’) 10
  • 12. Outlier detection • An outlier is a point or set of data points that lie away from the rest of the data values of the dataset.. • Outliers are easily identified by visualizing the data. • For e.g. • In a boxplot, the data points which lie outside the upper and lower bound can be considered as outliers • In a scatterplot, the data points which lie outside the groups of datapoints can be considered as outliers 11
  • 13. Outlier removal • Calculate the IQR as follows: Calculate the first and third quartile (Q1 and Q3) Calculate the interquartile range, IQR = Q3-Q1 Find the lower bound which is Q1*1.5 Find the upper bound which is Q3*1.5 Replace the data points which lie outside this range. They can be replaced by mean or median. 12
  • 14. References • More information on EDA tools and Pandas can be found on below links: • https://pandas.pydata.org/docs/user_guide/index.html • https://pandas.pydata.org/docs/user_guide/missing_data.html • https://pandas.pydata.org/docs/user_guide/visualization.html 13