DATA EXPLORATION IN PYTHON: WHAT IS
EXPLORATORY DATA ANALYSIS (EDA)
Siddharth Kumar Sahu
ASB22005
Contents:
Importing necessary libraries
Loading the data
Descriptive statistics
Visualization
Correlation analysis
 Data exploration is the process of analyzing, summarizing, and
visualizing data to gain insight into its properties and relationships.
 The steps below outline a typical data exploration workflow in Python.
 Importing necessary libraries: Data exploration in Python commonly
relies on a few libraries: pandas, numpy, matplotlib, and seaborn.
You can import these libraries using
the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading the data:
The next step is to load your data into
Python.
You can use pandas to load data from various
sources such as CSV, Excel, SQL, etc.
data = pd.read_csv('filename.csv')
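As a rough sketch of loading from more than one source, the example below reads a CSV from an in-memory string and then a table from an in-memory SQLite database. The column names, the `people` table, and the sample rows are made up for illustration; in practice you would pass a real file path or database connection.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for 'filename.csv'.
csv_text = "name,age,city\nAsha,31,Pune\nRavi,27,Delhi\nMeera,45,Chennai\n"
csv_data = pd.read_csv(io.StringIO(csv_text))

# Hypothetical SQL source: an in-memory SQLite table holding the same rows.
conn = sqlite3.connect(":memory:")
csv_data.to_sql("people", conn, index=False)
sql_data = pd.read_sql("SELECT name, age FROM people WHERE age > 30", conn)
conn.close()

print(csv_data.shape)  # (rows, columns) of the CSV data
print(sql_data)        # only the rows matching the SQL filter
```

Excel files follow the same pattern through `pd.read_excel`, which needs an optional engine such as openpyxl to be installed.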
Understanding the data:
 It is important to get an overview of the data before diving into the
analysis.
 Overview of the data:
 Display the first few rows of the data using data.head()
 Display the last few rows of the data using data.tail()
 Check the shape of the data using data.shape
 Check the data types of the columns using data.dtypes
 Check for missing values using data.isnull().sum()
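The overview calls listed above can be tried on a small table. This is only a sketch: the DataFrame below is invented for illustration, with one missing value placed in each numeric column so that the missing-value count has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; the names and values are illustrative only.
data = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "John", "Lina"],
    "age": [31, 27, np.nan, 45, 38],
    "income": [52000.0, 48000.0, 61000.0, np.nan, 75000.0],
})

print(data.head(3))   # first three rows
print(data.tail(2))   # last two rows
print(data.shape)     # (number of rows, number of columns)
print(data.dtypes)    # one data type per column

missing = data.isnull().sum()  # count of missing values per column
print(missing)
```

Note that `age` is stored as a floating-point column here because it contains a missing value.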
Descriptive statistics:
 Descriptive statistics provide a summary of the central tendency,
dispersion, and shape of the distribution of a dataset.
 You can use pandas to generate descriptive statistics for numerical
columns using the describe() method, which reports the count, mean,
standard deviation, minimum, quartiles, and maximum of each column:
 data.describe()
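A minimal sketch of describe() on an invented dataset follows. Since describe() covers central tendency and dispersion but not the shape of the distribution, the example also calls `skew()` and `kurt()` as one possible way to look at shape; that pairing is a suggestion here, not something the slides prescribe.

```python
import pandas as pd

# Hypothetical dataset with one numeric and one text column.
data = pd.DataFrame({
    "age": [22, 25, 27, 31, 38, 45, 60],
    "city": ["Pune", "Delhi", "Pune", "Chennai", "Delhi", "Pune", "Delhi"],
})

summary = data.describe()            # numeric columns only by default
print(summary)
print(data.describe(include="all"))  # also summarizes the text column

# Shape of the distribution: a positive skew indicates a longer right tail.
age_skew = data["age"].skew()
age_kurt = data["age"].kurt()
print(age_skew, age_kurt)
```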
Visualization:
 Visualization is a powerful tool for understanding data.
 You can use matplotlib and seaborn to create visualizations such as
histograms, scatter plots, box plots, and heatmaps.
 Here is an example of creating a histogram using matplotlib:
 plt.hist(data['column_name'])
 plt.show()
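The other plot types mentioned above can be sketched in the same way. The example below draws a histogram, a scatter plot, and a box plot side by side on synthetic data; the `height` and `weight` columns and their values are assumptions made for illustration. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, illustrative data: weight loosely depends on height.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)
data = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height - 85 + rng.normal(0, 5, 200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(data["height"], bins=20)                  # distribution of one column
axes[0].set_title("Histogram of height")

axes[1].scatter(data["height"], data["weight"], s=10)  # relationship between two columns
axes[1].set_title("Height vs. weight")

sns.boxplot(data=data, y="weight", ax=axes[2])         # median, quartiles, outliers
axes[2].set_title("Box plot of weight")

fig.tight_layout()
# In an interactive session, call plt.show() here to display the figure.
```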
Correlation analysis:
 Correlation analysis measures how strongly pairs of numerical
variables in a dataset move together.
 You can use pandas to calculate the correlation matrix and seaborn
to create a heatmap to visualize the correlations:
 correlation_matrix = data.corr(numeric_only=True)  # ignore non-numeric columns (pandas 1.5+)
 sns.heatmap(correlation_matrix, annot=True)
 plt.show()
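Here is a small worked sketch of the correlation step on invented data. The column names and values are hypothetical, and the text column is included only to show that `numeric_only=True` (available in pandas 1.5 and later) leaves it out of the matrix. Fixing the color scale to the range -1 to 1 is a presentation choice, not a requirement.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical dataset: three numeric columns and one text column.
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 83],
    "hours_slept": [9, 8, 8, 7, 6, 5],
    "student": ["a", "b", "c", "d", "e", "f"],
})

# Pearson correlation between every pair of numeric columns.
correlation_matrix = data.corr(numeric_only=True)
print(correlation_matrix)

# Heatmap with a fixed color scale so that 0 sits at the midpoint.
ax = sns.heatmap(correlation_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Pearson correlation")
plt.tight_layout()
# In an interactive session, call plt.show() here to display the figure.
```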
Exploratory Data Analysis (EDA):
 Exploratory Data Analysis (EDA) is the process of analyzing and
summarizing a dataset to understand its properties and the
relationships within it.
 The primary goal of EDA is to uncover patterns, trends, and
relationships in the data that can be used to guide further analysis.
 In simpler terms, EDA is a way to explore and understand data
before performing any formal statistical analysis or modeling.
 It involves visualizing the data, identifying outliers and missing
values, calculating summary statistics, and examining the
relationships between variables.
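One common way to flag outliers is the 1.5 × IQR rule, sketched below together with a missing-value count. The rule is an assumption chosen for illustration, not a method the slides specify, and the `income` values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value and one missing value.
data = pd.DataFrame({"income": [48, 52, 50, 47, 55, 51, 49, 250, np.nan, 53]})

# Interquartile range; quantile() skips missing values by default.
q1 = data["income"].quantile(0.25)
q3 = data["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows outside the fences are flagged; missing values compare as False.
is_outlier = (data["income"] < lower) | (data["income"] > upper)
outliers = data[is_outlier]
n_missing = int(data["income"].isnull().sum())

print(f"fences: [{lower}, {upper}]")
print(outliers)
print(f"missing values: {n_missing}")
```

Whether a flagged value is an error or a genuine extreme is a judgment call that depends on the data, so the sketch only identifies candidates and does not drop them.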
 EDA is an important step in data analysis because it can help
identify issues with the data, such as inconsistencies, errors, or
biases.
 By understanding the data, you can make informed decisions about
which statistical techniques and models to use, and how to interpret
the results.
 Overall, EDA is a crucial step in any data analysis project, as it
helps to ensure that the data is properly understood and prepared
before proceeding with further analysis.
Thank you