Data Exploratory, Feature
Engineering and Visualization
Dr.M.Shanthi,
ADS
ODD SEM-2024-2025
Unit-I
EDA Fundamentals – Understanding Data Science
– Significance of EDA – Making Sense of Data –
Comparing EDA with Classical and Bayesian
Analysis – Software Tools available for EDA.
Understanding Data Science
• Data Science: Scientific Study of Data.
• Data science involves cross-disciplinary knowledge from computer science,
statistics and mathematics.
• Data Analysis Phases:
1. Data Requirements
2. Data Collection
3. Data Processing
4. Data cleaning
5. EDA → Transformation, Descriptive Statistics.
6. Modeling and Algorithm
7. Data Product
8. Communication – Data Visualization
The Significance of EDA
- Different fields (science, economics, engineering, and
marketing) accumulate and store data in electronic
formats.
- Appropriate, well-founded decisions should be made
using the data collected.
- It is practically impossible to make decisions from large
datasets without the help of computer programs.
- Data mining extracts insights from the data to support
further decisions.
- Exploratory Data Analysis is the first exercise in data mining.
- Visualizing the data helps to understand it and to create
hypotheses for further analysis.
The Significance of EDA
• EDA reveals the ground truth of the content
without making any underlying assumptions.
• Data scientists use EDA to understand what type of
modeling and hypotheses can be created.
• EDA involves:
- summarizing data [pandas]
- statistical analysis [scipy]
- visualization of the data [matplotlib]
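The three activities above can be sketched in a few lines. This is only an illustrative sketch: the DataFrame `df` and its columns `age` and `income` are made-up assumptions, not data from the course.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "age": [23, 25, 31, 35, 41, 52],
    "income": [21, 24, 30, 38, 45, 60],
})

# 1. Summarizing data with pandas (count, mean, std, min, quartiles, max)
summary = df.describe()

# 2. Statistical analysis with scipy: Pearson correlation between the columns
r, p_value = stats.pearsonr(df["age"], df["income"])

# 3. Visualization with matplotlib: scatter plot of the two columns
fig, ax = plt.subplots()
ax.scatter(df["age"], df["income"])
ax.set_xlabel("age")
ax.set_ylabel("income")
```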
Steps in EDA
• Problem Definition
- Define the business problem
- Data analysis plan execution
- Main deliverables
- Obtaining the current status of the data
- Performing cost/benefit Analysis
• Data Preparation
- Sources of the data
- Define data schemas and tables
- Main characteristics of the data
- Clean the dataset
- Delete the non-relevant datasets
- Transform the data
- Divide the data into required chunks for analysis
• Data Analysis
- Summarizing the data
- Finding hidden correlations and relationships in the data
• Development and representation of the results.
- Graphs
- Summary Tables
- Plots
Making sense of Data
Types of data:
1. Numerical data [Quantitative data]
- Discrete data (fixed and distinct values)
Ex: Country code variable
Rank for students
- Continuous data
Can take an infinite number of numerical values within a
specific range
Making Sense of Data
2. Categorical data[Qualitative data]
Categorical data represents the characteristics of an object.
Example:
Gender
Marital status
Movie Genres
Blood Type
Types of drugs
Types:
- Binary (dichotomous) variable: takes exactly two values, of which one is
selected.
- Polytomous variable: takes more than two possible values (e.g. marital status).
Measurement Scales
• Most categorical datasets follow either
nominal or ordinal measurement scales.
• The four measurement scales are:
- Nominal
- Ordinal
- Interval
- Ratio
Measurement Scales
Nominal
• Used for labeling variables without any quantitative value.
• The scales are generally referred to as labels.
• Scales are mutually exclusive and do not carry numerical values.
Example:
• What is the gender?
Male, Female, Third gender/Non-binary,
I prefer not to answer, Other
• The languages that are spoken in a particular country
Tamil, Telugu, Malayalam etc.
• Biological species
• Parts of speech in grammar
Important note: Even when numbers are used as labels on a nominal scale, they
have no concrete numerical value or meaning.
→ No form of arithmetic calculation can be made on nominal measures.
Measurement Scales
For a nominal dataset, you can compute the following:
Frequency → the rate at which a label occurs over a period of time
within the dataset.
Proportion → the frequency divided by the total number of events.
Percentage → the proportion expressed as a percentage.
Visualization → pie chart or bar chart.
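A minimal sketch of these nominal-scale computations with pandas. The `blood` Series and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical nominal variable: blood types of eight people
blood = pd.Series(["A", "O", "B", "O", "AB", "O", "A", "O"], name="blood_type")

frequency = blood.value_counts()                 # how often each label occurs
proportion = blood.value_counts(normalize=True)  # frequency / total number of events
percentage = proportion * 100                    # proportion expressed in percent

print(pd.DataFrame({"frequency": frequency,
                    "proportion": proportion,
                    "percentage": percentage}))
```

The `frequency` Series can be passed directly to a bar or pie chart, since it already maps each label to a count.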
Important note:
The type of data determines the type of computation, the type of model, and
the type of visualization that can be used.
Measurement Scales
Ordinal
- The difference between the ordinal and nominal scales is the
order.
- The order of the values is a significant factor.
- Commonly represented with a Likert scale.
(Diagram to be attached.)
- Ordinal scales express an order of ranking.
- Median → the permitted measure of central tendency.
- The average (mean) is not permitted.
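One way to take the median of ordinal data in pandas is through an ordered categorical. This is a sketch under assumptions: the five Likert levels and the responses are invented, and an odd number of responses is used so that the median falls exactly on one level.

```python
import pandas as pd

# Hypothetical five-point Likert scale, from lowest to highest
levels = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

responses = pd.Series(pd.Categorical(
    ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"],
    categories=levels,
    ordered=True,   # tells pandas that the categories have a ranking
))

# Each response is stored as an integer code giving its rank position (0..4).
# The median of those codes identifies the median level.
median_code = int(responses.cat.codes.median())
median_label = levels[median_code]

print(median_label)
print(responses.min(), "to", responses.max())  # order-based comparisons are allowed
```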
Measurement Scales
Interval
• Both the order and the exact differences between the
values are significant.
• Widely used in statistics.
• Measures of central tendency (mean, median, mode)
and the standard deviation can be computed.
• Example: Temperature.
Measurement Scales
Ratio
Has order, exact differences between values, and an absolute zero.
Possible to apply descriptive and inferential statistics:
- Central tendencies and measures of dispersion (how scattered the data is)
- Coefficient of variation (ratio of the standard deviation to the mean)
Examples:
- Dose amount
- Reaction rate
- Flow rate
- Concentration
- Pulse
- Weight
- Length
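The coefficient of variation mentioned above can be computed as the standard deviation divided by the mean. A minimal sketch, assuming a made-up array of weights and the sample standard deviation (`ddof=1`):

```python
import numpy as np

# Hypothetical ratio-scale data: weights in kg
weights = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

mean = weights.mean()
std = weights.std(ddof=1)   # sample standard deviation
cv = std / mean             # coefficient of variation (unitless)

print(f"mean={mean:.1f}, std={std:.2f}, cv={cv:.3f}")
```

The ratio is only meaningful because the scale has an absolute zero; on an interval scale such as temperature in Celsius it would change with the choice of zero point.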
Comparing EDA with Classical and Bayesian
Analysis
Software tools available for EDA
• Python
• R programming Language
• Weka
• KNIME
Visual Aids for EDA
• Line Chart
• Bar Chart
• Scatter Plot
• Area Plot and stacked plot
• Pie Chart
• Table chart
• Polar Chart
• Histogram
• Lollipop Chart
• Choosing the best Chart
• Other Libraries to explore
Line Chart
• A line chart is used to illustrate the relationship
between two or more continuous variables.
• It can be drawn with the Matplotlib library.
• Example:
- Date vs Stock_price
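A minimal Matplotlib sketch of the Date vs Stock_price example. The dates and prices are made up; only the axis names come from the slide.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: five consecutive days of closing prices
dates = pd.date_range("2024-01-01", periods=5, freq="D")
stock_price = [101.0, 103.5, 102.2, 105.8, 107.1]

fig, ax = plt.subplots()
ax.plot(dates, stock_price, marker="o")  # one line connecting the daily prices
ax.set_xlabel("Date")
ax.set_ylabel("Stock_price")
fig.autofmt_xdate()                      # tilt the date labels so they do not overlap
fig.savefig("line_chart.png")            # the file name is an arbitrary choice
```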
Lollipop chart
• A lollipop chart can be used to display rankings
in the data.
• It is similar to an ordered bar chart.
• Example: the line and the circle on top give a clear
illustration of different types of cars and their
associated mileage.
Bar Chart
• Bar charts are among the most frequently used charts.
• They are used to distinguish objects between distinct
collections in order to track variations over
time.
• Bars can be drawn horizontally or vertically to
represent the categorical variables.
• Example: A pharmacy in Norway keeps track of
the amount of Zoloft sold every month.
Table Chart
• A table chart combines a bar chart and a table.
• Example: Consider the standard LED bulbs that
come in different wattages.
• The chart is based on two variables, the year
and the wattage, and shows the number of units sold in a
particular year.
Histogram
• Histogram plots are used to depict the
distribution of any continuous variable.
• These types of plots are very popular in statistical
analysis.
• Example: Frequency vs years of experience with
Python programming.
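The histogram example can be sketched as follows. The `years_experience` values and the choice of 5 bins are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical survey answers: years of experience with Python
years_experience = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 8, 10]

fig, ax = plt.subplots()
# hist() splits the value range into bins and counts the values in each bin
counts, bin_edges, _ = ax.hist(years_experience, bins=5, edgecolor="black")
ax.set_xlabel("Years of experience with Python")
ax.set_ylabel("Frequency")
```

`counts` holds the bar heights and `bin_edges` the bin boundaries, so the same distribution can also be inspected numerically.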
Scatter Plot
• Scatter plots are also called scatter graphs or
scatter charts.
• They use Cartesian coordinates (x, y) to display the values of two variables.
Cartesian Co-Ordinates
Polar Co-ordinates
Polar Chart
Data Transformation
• Merging database-style dataframes
• Transformation techniques
• Benefits of data transformation
Data transformation
• Concat
• Concat with an axis
• Merge
- inner join
- outer join
- left join
- right join
- merge on index
• Reshaping and pivoting
- stacking
- unstacking
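The operations listed above map onto pandas calls roughly as follows. All DataFrames and column names here are hypothetical and serve only to show the calls.

```python
import pandas as pd

# Hypothetical tables
students = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
more = pd.DataFrame({"id": [4], "name": ["Dev"]})
scores = pd.DataFrame({"id": [2, 3, 5], "score": [88, 92, 75]})

# Concat: stack rows (default axis=0) or place frames side by side (axis=1)
rows = pd.concat([students, more], ignore_index=True)
cols = pd.concat([students, scores], axis=1)

# Merge: database-style joins on a key column
inner = pd.merge(students, scores, on="id", how="inner")  # ids in both tables
outer = pd.merge(students, scores, on="id", how="outer")  # ids in either table
left = pd.merge(students, scores, on="id", how="left")    # all ids from students
right = pd.merge(students, scores, on="id", how="right")  # all ids from scores

# Merge on index instead of on a column
by_index = pd.merge(students.set_index("id"), scores.set_index("id"),
                    left_index=True, right_index=True)

# Reshaping: stack() moves columns into an inner index level, unstack() reverses it
wide = pd.DataFrame({"math": [80, 90], "physics": [70, 85]}, index=["Asha", "Ben"])
stacked = wide.stack()
unstacked = stacked.unstack()
```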
Transformation Techniques
• Data deduplication (removing duplicate rows)
• Replacing values
• Handling missing data
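A short sketch of the first two techniques. The DataFrame and the placeholder values `"N/A"` and `-999` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicated row and two placeholder values
df = pd.DataFrame({
    "city": ["Chennai", "Chennai", "Madurai", "N/A"],
    "sales": [10, 10, -999, 7],
})

n_duplicates = df.duplicated().sum()  # rows identical to an earlier row
deduped = df.drop_duplicates()        # keep the first occurrence of each row

# Replace placeholder values with NaN so they are treated as missing data
cleaned = deduped.assign(
    city=deduped["city"].replace("N/A", np.nan),
    sales=deduped["sales"].replace(-999, np.nan),
)
```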
Transformation Techniques
• Dropping the Missing Values
– Row-wise
– Column-wise
– Based on threshold
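The three ways of dropping missing values correspond to parameters of `DataFrame.dropna`. A minimal sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns "a" and "b" contain missing values, "c" does not
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [7.0, 8.0, 9.0, 10.0],
})

row_wise = df.dropna()             # drop every row that has any missing value
column_wise = df.dropna(axis=1)    # drop every column that has any missing value
by_threshold = df.dropna(thresh=2) # keep rows with at least 2 non-missing values
```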
Transformation Techniques
• Filling the Missing Values
- Fill by zero value
- Fill by Forward/Backward Filling
- Fill by interpolating method
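The filling strategies above can be compared on one small Series. The values are made up; `interpolate()` is used with its default linear method.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

zero_filled = s.fillna(0)        # replace missing values with 0
forward_filled = s.ffill()       # carry the last valid value forward
backward_filled = s.bfill()      # pull the next valid value backward
interpolated = s.interpolate()   # linear interpolation between valid neighbours
```

Forward and backward filling repeat a neighbouring observation, while interpolation estimates intermediate values, which is usually more reasonable for ordered data such as time series.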
Descriptive Statistics
• Simple summaries of the entire dataset.
Central Tendencies
Mean
Median
Mode
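The three central tendencies can be computed directly with pandas. The values in `s` are made up for illustration.

```python
import pandas as pd

# Hypothetical observations
s = pd.Series([2, 3, 3, 5, 7, 10])

mean = s.mean()           # arithmetic average
median = s.median()       # middle value of the sorted data
mode = s.mode().iloc[0]   # most frequent value (mode() returns all ties)

print(mean, median, mode)
```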
Descriptive Statistics
The mean (average) might not always be the best representation of the dataset.
Measures of Dispersion
1. Standard Deviation
2. Variance
3. Skewness (measure of the symmetry or asymmetry of a variable)
- Positive skewness
- Symmetrical
- Negative skewness
4. Kurtosis (heaviness of the tails of the distribution)
- Mesokurtic (excess kurtosis ≈ 0)
- Leptokurtic (excess kurtosis > 0)
- Platykurtic (excess kurtosis < 0)
5. Percentiles (the values below which a given percentage of the data in a
dataset lie)
- 25%
- 50%
- 75%
- 100%
6. Quartiles
- Visualization of quartiles
Skewness
• Measures the asymmetry of a variable's distribution
about its mean.
• Positive
• Negative
• Symmetrical
Skewness
Function: df.skew()
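A small sketch of `df.skew()` on three made-up columns, one for each type of skewness:

```python
import pandas as pd

# Hypothetical columns: a long right tail, a symmetric shape, a long left tail
df = pd.DataFrame({
    "right_skewed": [1, 2, 2, 3, 3, 3, 20],
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "left_skewed": [-20, -3, -3, -3, -2, -2, -1],
})

skew = df.skew()  # one skewness value per column
print(skew)       # positive, approximately zero, negative
```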
Kurtosis
Function: df.kurt()
• Kurtosis is a statistical measure that illustrates
how heavily the tails of a distribution differ
from those of a normal distribution.
• It helps identify whether a given distribution contains
extreme values.
• It is a measure of outlier presence in a given
distribution.
• High kurtosis → many outliers.
Kurtosis
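• There are three types of kurtosis. The values below are excess
kurtosis, as returned by df.kurt(), where a normal distribution scores 0
(in the plain, non-excess definition the reference value is 3):
Mesokurtic → K ≈ 0, tails similar to a normal distribution
Leptokurtic → K > 0, sharp peak and heavy tails → many outliers
Platykurtic → K < 0, flat peak and light tails → few outliers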
Percentile
Function: np.percentile(attribute, 50)
• A percentile is the value below which a given percentage of the
values in a dataset lie; the 50th percentile is the median.
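A short sketch of `np.percentile`. The `scores` array is made up, and NumPy's default linear interpolation between data points is assumed.

```python
import numpy as np

# Hypothetical attribute: eleven evenly spaced scores
scores = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110])

p25 = np.percentile(scores, 25)
p50 = np.percentile(scores, 50)   # same as the median
p75 = np.percentile(scores, 75)

print(p25, p50, p75)
```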
Quartiles
• Quartiles are values that split the given
dataset into quarters.
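Quartiles can be obtained with `Series.quantile`. In this sketch the data is made up, and the interquartile range (IQR) is added as an assumption about how quartiles are typically used.

```python
import pandas as pd

# Hypothetical data
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = s.quantile(0.25)   # first quartile: 25% of the values lie below it
q2 = s.quantile(0.50)   # second quartile: the median
q3 = s.quantile(0.75)   # third quartile: 75% of the values lie below it
iqr = q3 - q1           # interquartile range: spread of the middle half

print(q1, q2, q3, iqr)
```

A box plot is the usual visualization of these values: the box spans q1 to q3 and the line inside it marks q2.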
Grouping Datasets
• Groupby Mechanisms
- Grouping by features, hierarchically
- Aggregating a dataset by groups
- Applying custom aggregation functions to
groups
- Transforming a dataset groupwise
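The four groupby mechanisms can be sketched as follows. The DataFrame, its columns, and the helper `value_range` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({
    "dept": ["A", "A", "B", "B", "B"],
    "year": [2023, 2024, 2023, 2023, 2024],
    "sales": [10, 20, 30, 50, 40],
})

# Grouping by several features gives a hierarchical (multi-level) index
by_dept_year = df.groupby(["dept", "year"])["sales"].sum()

# Aggregating a dataset by groups with built-in functions
summary = df.groupby("dept")["sales"].agg(["sum", "mean"])

# Applying a custom aggregation function to each group
def value_range(values):
    """Difference between the largest and smallest value in a group."""
    return values.max() - values.min()

ranges = df.groupby("dept")["sales"].agg(value_range)

# Transforming groupwise: the result keeps the original row count,
# here each row's share of its department's total sales
df["share"] = df["sales"] / df.groupby("dept")["sales"].transform("sum")
```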
Grouping the Datasets
• Selecting a subset of columns
• Max and Min
• Mean
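Selecting a subset of columns before aggregating keeps the result focused on the numeric columns of interest. A sketch on a made-up `cars` DataFrame:

```python
import pandas as pd

# Hypothetical car data; "color" is deliberately left out of the aggregation
cars = pd.DataFrame({
    "body_style": ["sedan", "sedan", "hatchback", "hatchback"],
    "price": [20000, 30000, 12000, 16000],
    "horsepower": [110, 150, 70, 90],
    "color": ["red", "blue", "red", "white"],
})

# Select only the columns of interest after grouping
grouped = cars.groupby("body_style")[["price", "horsepower"]]

max_values = grouped.max()
min_values = grouped.min()
mean_values = grouped.mean()

print(mean_values)
```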

