**Exploratory Data Analysis (EDA) Tools:**
Exploratory Data Analysis (EDA) is a crucial step in understanding and making sense of data in data science projects. Various tools and libraries are available to assist in this process, offering features like visualizations, data profiling, and statistical analysis. Here are some popular EDA tools:
1. **DataPrep**:
- Offers interactive visualizations and fast performance due to its Dask-based computing module.
- Suitable for big data analysis and provides insights through comprehensive visualizations.
- Efficient in handling missing values, checking correlations, and data cleansing[1][3].
2. **Pandas-profiling**:
- Popular for profiling datasets and addressing data privacy concerns.
- Generates detailed reports with relevant features highlighted for EDA.
- Most useful for smaller datasets where privacy is a concern[1][2].
3. **SweetViz**:
- Provides detailed visualizations to understand complex data patterns.
- Offers insights into the dataset through interactive graphs and distribution charts[1].
4. **Lux**:
- Appeals to users comfortable with pandas syntax, offering additional functionality with a simple call.
- Enables users to perform EDA tasks conveniently within the pandas environment[1].
5. **D-Tale**:
- Stands out for its interactive GUI that eliminates the need for coding during EDA tasks.
- Offers a network analyzer for visualizing relationships between factors and responses[1].
These tools cater to different user preferences and requirements, providing a range of functionalities to facilitate effective exploratory data analysis.
3. Steps followed in Handling Data
• Importing the libraries
• Importing the Dataset
• Handling of Missing Data
• Handling of Categorical Data
• Data Visualization
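The steps above can be sketched end to end on a tiny table. This is a minimal illustration, not the deck's own code: the column names and values are hypothetical, mean imputation is only one of several ways to handle missing data, and pandas category codes stand in for whichever encoder is actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the imported dataset.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, np.nan, 40.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0, 58000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# Handling missing data: replace each numeric gap with that column's mean.
numeric_cols = ["Age", "Salary"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Handling categorical data: map each category to an integer code
# (categories are sorted alphabetically before codes are assigned).
df["Country_code"] = df["Country"].astype("category").cat.codes
df["Purchased_code"] = df["Purchased"].astype("category").cat.codes

# Quick look at the cleaned table before any visualization.
print(df)
```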
11. Encoding the categorical data
• There are two categorical variables: country and purchased.
# Categorical data
# Encode the Country variable (column 0 of the feature matrix x)
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
The LabelEncoder class has now encoded the country labels as integers.
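A self-contained version of the snippet above might look like the following. The feature matrix `x` is a hypothetical stand-in (the deck does not show its contents); it is built with `dtype=object` so that the string column can be overwritten with integer codes in place.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical feature matrix: column 0 is Country, the rest are numeric.
x = np.array(
    [["France", 44, 72000],
     ["Spain", 27, 48000],
     ["Germany", 30, 54000],
     ["Spain", 38, 61000]],
    dtype=object,
)

label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

# classes_ lists the original labels in the order of their integer codes.
print(label_encoder_x.classes_)  # ['France' 'Germany' 'Spain']
print(x[:, 0])                   # [0 2 1 2]
```

The codes follow alphabetical order of the labels, so they carry no real ranking between countries.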
12. Encode the dependent variable
• For the second categorical variable, purchased or not purchased, you can use a “labelencoder” object of the LabelEncoder class.
• The OneHotEncoder class is not needed here: the purchased variable has only two categories, no and yes, which are encoded as 0 and 1.
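For contrast, a variable with more than two unordered categories, such as country, is where one-hot encoding is commonly used instead of integer labels. The sketch below assumes a hypothetical single-column array of country names and converts the encoder's sparse result to a dense array for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical country column, shaped (n_samples, 1) as the encoder expects.
country = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
country_onehot = encoder.fit_transform(country).toarray()

# One output column per category, in the order given by categories_.
print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(country_onehot)
```

Each row contains a single 1 in the column of its category, so no artificial ordering between countries is introduced.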
14. One hot encoder
• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)
• The output will be:
• Out: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
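The output above can be reproduced with a small runnable sketch. The `y` values here are an assumption chosen to match the array shown; the deck does not list the underlying labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical "purchased" labels, chosen to match the output shown above.
y = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

print(y)  # [0 1 0 0 1 1 0 1 0 1]

# inverse_transform maps the integer codes back to the original labels.
print(labelencoder_y.inverse_transform(y[:2]))  # ['No' 'Yes']
```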
15. Exploratory Data Analysis (EDA)
• A method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables.
• EDA is normally carried out as a preliminary step before modelling.
16. Purpose of using EDA tools
• Data Visualization
• Correlation and Relationships
• Feature Engineering
• Data Segmentation
• Time Series Analysis
• Missing Data Analysis
• Outlier Analysis
17. EDA
• An approach to analyzing data sets that summarizes their statistical characteristics, often using statistical graphics and other data visualization methods.
• A critical process of performing initial investigations on data to discover patterns, spot anomalies (anomaly detection), test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
• Understand the data first and try to gather as many insights from it as possible.
• In short, EDA is about making sense of data.
18. Read Data set
• import pandas as pd
• import numpy as np
• # read dataset using pandas
• df = pd.read_csv('employees.csv')
• df.head()
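Since `employees.csv` is not included here, the sketch below reads a small in-memory CSV instead. The column names and rows are hypothetical. It shows the first inspection calls that typically follow `df.head()`.

```python
import io

import pandas as pd

# Hypothetical stand-in for employees.csv, with two deliberately missing cells.
csv_text = """name,department,salary,age
Asha,Sales,52000,29
Ben,Engineering,,41
Chen,Engineering,88000,35
Dina,Sales,49000,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())          # first rows
print(df.shape)           # (rows, columns)
df.info()                 # column dtypes and non-null counts
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column
```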
22. Box plots to visualize outliers
• One of the many ways to visualize a data distribution.
• Can be drawn with matplotlib or seaborn.
• Plots q1 (25th percentile), q2 (50th percentile or median) and q3 (75th percentile) of the data, with whiskers bounded by q1 - 1.5*(q3 - q1) and q3 + 1.5*(q3 - q1).
• Outliers are the points plotted beyond the whiskers, above or below the box.
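The quantities a box plot draws can be computed directly. The following is a minimal sketch on hypothetical values, using NumPy's default (linear interpolation) percentiles; other quartile conventions give slightly different bounds.

```python
import numpy as np

# Hypothetical measurements with one obviously extreme value.
values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 45], dtype=float)

q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Bounds used by the 1.5 * IQR rule.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside the bounds are the ones a box plot marks as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"q1={q1}, median={q2}, q3={q3}")
print(f"bounds=({lower_bound}, {upper_bound})")
print("outliers:", outliers)
```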
23. Anomaly Detection – outliers with Boxplot
• Anomalous data is often linked to some sort of problem or rare event, such as hacking, bank fraud, malfunctioning equipment, structural defects or infrastructure failures, or textual errors.
• Outlier detection is the identification of unexpected events, observations, or items that differ significantly from the norm.
• When applied to unlabelled data, it is known as unsupervised anomaly detection.
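As a rough sketch of box-plot-based anomaly detection, matplotlib's `boxplot` returns the artists it draws, and the "fliers" entry holds the points it plotted as outliers. The transaction amounts are hypothetical and carry no labels, so this is the unsupervised case. The `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical, unlabelled transaction amounts with one suspicious value.
amounts = np.array([20, 22, 19, 25, 21, 23, 24, 500, 18, 22], dtype=float)

fig, ax = plt.subplots()
parts = ax.boxplot(amounts)  # default whiskers use the 1.5 * IQR rule
ax.set_ylabel("amount")

# The flier markers are the observations drawn beyond the whiskers.
flagged = parts["fliers"][0].get_ydata()
print("flagged as anomalous:", flagged)
```

A flagged point is only a candidate anomaly; whether it reflects fraud, a faulty sensor, or a typing error still needs domain judgment.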
24. • The pandas “.corr()” function computes the correlation matrix, which can be visualized as a heatmap in seaborn.
25. • In this heatmap, dark shades represent positive correlation while lighter shades represent negative correlation.
• It is good practice to consider removing variables with zero correlation during feature selection.
• A correlation of zero means there is no linear relationship between the two predictors.
• Such features are generally safe to drop.
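A correlation heatmap of this kind can be sketched as follows. The data is synthetic and the column names are hypothetical. Which shade means positive or negative depends on the colour map, so the sketch fixes `cmap`, `vmin`, and `vmax` explicitly rather than relying on defaults.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: one positively related, one negatively related, one unrelated column.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
df = pd.DataFrame({
    "age": age,
    "salary": 1000 * age + rng.normal(0, 5000, n),  # rises with age
    "absences": -0.3 * age + rng.normal(0, 2, n),   # falls with age
    "badge_id": rng.permutation(n),                 # unrelated to the rest
})

corr = df.corr()  # Pearson correlation by default

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)

# Columns whose correlation with a chosen reference column is close to zero.
weak = corr["salary"].abs() < 0.2
print(corr.round(2))
print("weakly correlated with salary:", list(corr.index[weak]))
```

The 0.2 cut-off is arbitrary, and a near-zero Pearson coefficient rules out only a linear relationship, so it is worth plotting a feature before dropping it.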
26. EDA tools
• pandas, numpy, matplotlib and seaborn
• Typical graphical techniques used in EDA are:
• Box plot
• Histogram
• Scatter plot
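The histogram and scatter plot from the list above can be drawn with matplotlib as in this minimal sketch. The data is synthetic and the variable names are hypothetical; the box plot is omitted here to keep the example short.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: salary loosely increases with age.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, 300)
salary = 1000 * age + rng.normal(0, 5000, 300)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single variable.
counts, bin_edges, _ = ax_hist.hist(age, bins=20)
ax_hist.set_xlabel("age")
ax_hist.set_ylabel("count")

# Scatter plot: relationship between two variables.
ax_scatter.scatter(age, salary, s=10)
ax_scatter.set_xlabel("age")
ax_scatter.set_ylabel("salary")

fig.tight_layout()
```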