Agenda
• Significance of Exploratory Data Analysis
• Making sense of data
Steps followed in Handling Data
• Importing the libraries
• Importing the Dataset
• Handling of Missing Data
• Handling of Categorical Data
• Data Visualization
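The missing-data step is listed above but not shown in code. A minimal sketch, assuming a small hypothetical DataFrame (the column names and values are made up) and mean imputation as the strategy, which is one common choice and not necessarily the one used for the dataset in these slides:

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "Age": [38.0, 43.0, np.nan, 48.0],
    "Salary": [68000.0, np.nan, 54000.0, 65000.0],
})

# Count missing values per column before deciding how to handle them.
missing_counts = df.isna().sum()

# Replace each missing value with the mean of its own column.
df_filled = df.fillna(df.mean(numeric_only=True))

print(missing_counts)
print(df_filled)
```

Dropping incomplete rows with `df.dropna()` is the simpler alternative when only a few rows are affected.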
Handling Categorical data
One Hot Encoder
one_hot_encoded_data = pd.get_dummies(data, columns=['Remarks', 'Gender'])
print(one_hot_encoded_data)
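To see what `pd.get_dummies` produces, here is a small self-contained sketch. The three-row DataFrame is a hypothetical stand-in for the employee data, and the `Employee_Id` column is an assumption added only to show that non-encoded columns are passed through unchanged.

```python
import pandas as pd

# Hypothetical stand-in for the employee data used in the slides.
data = pd.DataFrame({
    "Employee_Id": [1, 2, 3],
    "Gender": ["Male", "Female", "Male"],
    "Remarks": ["Good", "Nice", "Good"],
})

# Each listed column is replaced by one indicator column per category.
one_hot_encoded_data = pd.get_dummies(data, columns=["Remarks", "Gender"])

print(one_hot_encoded_data.columns.tolist())
print(one_hot_encoded_data)
```

The new columns are named `<column>_<category>`, so the result stays readable without extra bookkeeping.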
Handling Categorical data
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Retrieving data
data = pd.read_csv('Employee_data.csv')

# Converting type of columns to category
data['Gender'] = data['Gender'].astype('category')
data['Remarks'] = data['Remarks'].astype('category')
Handling Categorical data
# Assigning numerical codes and storing them in new columns
data['Gen_new'] = data['Gender'].cat.codes
data['Rem_new'] = data['Remarks'].cat.codes

# Create an instance of OneHotEncoder
enc = OneHotEncoder()

# Fit the encoder on the coded columns and convert the sparse result to a dense array
enc_data = pd.DataFrame(
    enc.fit_transform(data[['Gen_new', 'Rem_new']]).toarray())

# Merge with the main DataFrame (joined on the row index)
New_df = data.join(enc_data)
print(New_df)
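The code above labels the encoded columns 0, 1, 2, and so on, which makes the joined table hard to read. A possible variant, sketched under the assumption of a small hypothetical DataFrame in place of `Employee_data.csv`, passes the string columns to `OneHotEncoder` directly and builds column names from the encoder's `categories_` attribute:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for Employee_data.csv.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Remarks": ["Good", "Nice", "Good", "Great"],
})

cols = ["Gender", "Remarks"]
enc = OneHotEncoder()
encoded = enc.fit_transform(data[cols]).toarray()

# Build readable column names from the categories the encoder learned.
names = [f"{col}_{cat}" for col, cats in zip(cols, enc.categories_) for cat in cats]
enc_data = pd.DataFrame(encoded, columns=names, index=data.index)

# Join on the row index, as in the slide code.
new_df = data.join(enc_data)
print(new_df)
```

Each row has exactly one 1.0 per original column, so every row of `enc_data` sums to 2 here.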
Output of One hot encoder
Handling Categorical data
(on the output variable y, purchased)
Encoding the categorical data
• The dataset has two categorical variables: country and purchased.
# Categorical data
# for the country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
• The LabelEncoder class encodes the country values as integers.
Encode the dependent variable
• For the second categorical variable, purchased or not purchased, you can use a “labelencoder” object of the LabelEncoder class.
• The OneHotEncoder class is not needed here: the purchased variable has only two categories, yes and no, which are encoded as 0 and 1.
Output
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
Label encoding the dependent variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
• The output will be:
Out: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
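The encoded array alone does not say which label became 0 and which became 1. A minimal sketch, assuming a hypothetical five-element target with the labels "No" and "Yes", shows how to read the mapping back from the fitted encoder:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values; the real y comes from the purchased column.
y = np.array(["No", "Yes", "No", "No", "Yes"])

labelencoder_y = LabelEncoder()
y_encoded = labelencoder_y.fit_transform(y)

# classes_ lists the original labels in the order of their integer codes.
mapping = {str(label): code for code, label in enumerate(labelencoder_y.classes_)}
print(y_encoded)
print(mapping)

# inverse_transform recovers the original labels from the codes.
y_restored = labelencoder_y.inverse_transform(y_encoded)
print(y_restored)
```

`LabelEncoder` sorts the labels before numbering them, so "No" comes before "Yes" regardless of the order in which they appear in the data.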
Exploratory Data Analysis (EDA)
• A method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables.
• EDA is normally carried out as a preliminary step before modelling.
Purpose of using EDA tools
• Data Visualization
• Correlation and Relationships
• Feature Engineering
• Data Segmentation
• Time Series Analysis
• Missing Data Analysis
• Outlier Analysis
EDA
• An approach to analyzing data sets in order to summarize their statistical characteristics,
• using statistical graphics and other data visualization methods.
• A critical process of performing initial investigations on data to discover patterns, spot anomalies (anomaly detection), test hypotheses and check assumptions with the help of summary statistics and graphical representations.
• Understand the data first and try to gather as many insights from it as possible.
• Making sense of data.
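The summary statistics mentioned above are available directly from pandas. A minimal sketch, assuming a hypothetical five-row DataFrame with `Team` and `Salary` columns (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are illustrative only.
df = pd.DataFrame({
    "Team": ["Sales", "Finance", "Sales", "Legal", "Finance"],
    "Salary": [50000, 62000, np.nan, 71000, 58000],
})

shape = df.shape                          # (rows, columns)
summary = df["Salary"].describe()         # count, mean, std, min, quartiles, max
missing = df.isna().sum()                 # missing values per column
team_counts = df["Team"].value_counts()   # frequency of each category

print(shape)
print(summary)
print(missing)
print(team_counts)
```

`describe()` skips missing values, so its `count` entry is the number of non-missing salaries, not the number of rows.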
Read Data set
import pandas as pd
import numpy as np

# Read the dataset using pandas
df = pd.read_csv('employees.csv')
df.head()
Histogram
# Importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(x='Salary', data=df)
plt.show()
Box Plot
• A box plot shows the distribution of data based on the five-number summary:
• Minimum
• First quartile
• Median
• Third quartile
• Maximum.
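The five-number summary can be computed directly with NumPy. A minimal sketch, assuming a hypothetical array of nine salaries in thousands:

```python
import numpy as np

# Hypothetical salaries (in thousands); values are illustrative only.
salaries = np.array([40, 45, 50, 55, 60, 65, 70, 75, 80])

minimum = salaries.min()
# np.percentile uses linear interpolation between data points by default.
q1, median, q3 = np.percentile(salaries, [25, 50, 75])
maximum = salaries.max()

print(minimum, q1, median, q3, maximum)
```

These five values are what the box (q1, median, q3) and the ends of the data range represent in a box plot.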
Boxplot
# Importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='Salary', y='Team', data=df)
plt.show()
Box plots to visualize outliers
• A box plot is one of the many ways to visualize a data distribution.
• It can be drawn with matplotlib or seaborn.
• It plots q1 (25th percentile), q2 (50th percentile, the median) and q3 (75th percentile) of the data, along with whiskers bounded by q1 - 1.5*(q3 - q1) and q3 + 1.5*(q3 - q1).
• Outliers are the points that fall beyond these bounds.
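The same 1.5 × IQR rule can be applied numerically to list the outliers instead of reading them off the plot. A minimal sketch, assuming a hypothetical salary series with one extreme value:

```python
import pandas as pd

# Hypothetical salaries (in thousands) with one obviously extreme value.
salary = pd.Series([40, 45, 50, 55, 60, 65, 70, 75, 80, 200])

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Bounds used by the box plot rule described above.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the points outside the bounds.
outliers = salary[(salary < lower_bound) | (salary > upper_bound)]
print(lower_bound, upper_bound)
print(outliers)
```

Whether a flagged point is an error or a genuine rare event still has to be judged from the context of the data.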
Anomaly Detection – outliers with Boxplot
• anomalous data - linked to some sort of
problem or rare event such as hacking,
bank fraud, malfunctioning equipment,
structural defects / infrastructure
failures, or textual errors.
• outlier detection - identification of
unexpected events, observations, or
items that differ significantly from the
norm.
• If applied to unlabelled data -
unsupervised anomaly detection
• The pandas “.corr()” function computes the correlation matrix, which can be visualized as a heatmap in seaborn.
• In this heatmap, dark shades represent positive correlation while lighter shades represent negative correlation (the mapping depends on the colormap used).
• It is good practice to remove variables with zero correlation during feature selection.
• A correlation of zero means there is no linear relationship between these two predictors.
• Such features are usually safe to drop, though a non-linear relationship may still exist.
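A zero correlation coefficient only rules out a linear relationship. The sketch below, built on a small hypothetical DataFrame, contrasts a perfectly linear column with a squared column that depends entirely on `x` yet has zero correlation with it:

```python
import pandas as pd

# Hypothetical data: x is symmetric around zero.
df = pd.DataFrame({"x": [-2, -1, 0, 1, 2]})
df["linear"] = 3 * df["x"] + 1   # perfectly linear in x
df["squared"] = df["x"] ** 2     # fully determined by x, but not linearly

# Pearson correlation matrix (the default method of DataFrame.corr).
corr = df.corr()
print(corr.round(3))
```

Passing this matrix to `sns.heatmap(corr, annot=True)` would draw the heatmap with the coefficients printed in each cell. The `squared` column is a reminder to look at a scatter plot before dropping a feature on the strength of the coefficient alone.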
EDA tools
• Libraries: pandas, numpy, matplotlib and seaborn
• Typical graphical techniques used in EDA are:
• Box plot
• Histogram
• Scatter plot
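The histogram and box plot are shown in code earlier; the scatter plot is not. A minimal sketch using matplotlib, assuming a hypothetical DataFrame with `Experience` and `Salary` columns (neither column is confirmed to exist in `employees.csv`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "Experience": [1, 3, 5, 7, 9],
    "Salary": [42000, 50000, 61000, 69000, 80000],
})

# One point per row: x is years of experience, y is salary.
fig, ax = plt.subplots()
ax.scatter(df["Experience"], df["Salary"])
ax.set_xlabel("Experience")
ax.set_ylabel("Salary")
ax.set_title("Salary vs. experience")
```

Call `plt.show()` afterwards to display the figure. A scatter plot like this is a quick visual check for the relationship between two numeric variables.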

EDA tools and making sense of data.pdf
