Unit 1: DATA PROCESSING AND STATISTICS
Basics of Data and its Processing - Record keeping, statistics and data science,
measurement scales, properties of data, visualization, cleaning the data,
symbolic data analysis. Statistics - Basic statistical measures, variance and
standard deviation, visualizing statistical measures, calculating percentiles,
quartiles and box plots. Missing Data Handling Methods - Finding missing values,
dealing with missing values. Outliers - What are outliers, using Z-scores to find
outliers, modified Z-score, using IQR to detect outliers.
Statistics & Data Science
Data science involves the collection, organization, analysis and visualization of large amounts
of data.
Statisticians, meanwhile, use mathematical models to quantify relationships between
variables and outcomes and make predictions based on those relationships.
Statisticians do not use computer science, algorithms or machine learning to the same degree
as computer scientists.
Data Science vs. Statistics

Definition
Data Science: An interdisciplinary branch of computer science used to gain
valuable information from large data sets using statistics, computers and
technology.
Statistics: A mathematical science for analysing existing data pertaining to
specific problems, applying statistical tools to this data, and presenting the
results for decision-making.

Concept
Data Science:
1. The primary goal is to identify underlying trends and patterns in the data
for decision making.
2. Works well on both quantitative and qualitative data.
Key steps include data mining, data pre-processing, Exploratory Data Analysis
(EDA), and model building and optimization.
Some important techniques include regression and classification.
Statistics:
1. The primary goal is to determine cause-and-effect relationships in the
analysed data; it is a purely mathematical approach.
2. Works only on quantitative data.
Key terms include mean, median, mode, standard deviation (σ) and variance (σ²).
Some important techniques include probability distributions, acceptance
sampling and statistical quality control.

Application Areas
Data Science: Can be applied in specialized areas like computer vision, natural
language processing, disaster management, recommender systems and search
engines, etc.
Statistics: Can be applied in areas where random variations are observed in
sampled data, like medicine, information technology, economics, engineering,
finance, marketing, accounting and business, etc.
Properties of Data
The following are the properties of data:
1) amenability of use,
2) clarity,
3) accuracy, and
4) essence (the quality of being the essence of the matter)
Amenability of use: From the dictionary meaning of data it is learnt
that data are facts used in deciding something. In short, data are
meant to be used as a base for arriving at definitive conclusions. They
are not required, if they are not amenable to use.
Clarity: Data should necessarily display clarity, which is essential for
communicating the essence of the matter. Without clarity, the
meaning desired to be communicated will remain hidden.
Accuracy: Data should be real, complete and accurate. Accuracy is
thus an essential property of data. Since data offer a basis for
deciding something, they must necessarily be accurate if valid
conclusions are to be drawn.
Essence: In social sciences, large quantities of data are collected
which cannot be presented in raw form, nor is it necessary to present
them in that form. They have to be compressed and refined. Data so refined
can present the essence, or derived qualitative value, of the matter.
Data in the sciences consist of observations made from scientific
experiments; these are all measured quantities. Data, thus, are
always the essence of the matter.
Missing Data Handling Methods
The real-world data often has a lot of missing values. The cause of
missing values can be data corruption or failure to record data. The
handling of missing data is very important during the preprocessing of
the dataset as many machine learning algorithms do not support missing
values.
1. Deleting rows with missing values
2. Imputing missing values for continuous variables
3. Imputing missing values for categorical variables
4. Other imputation methods
5. Using algorithms that support missing values
6. Prediction of missing values
7. Imputation using a deep learning library (Datawig)
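Before choosing one of these methods, the missing values have to be found. The following is a minimal sketch using pandas; the DataFrame, its column names and its values are made up for illustration and are not taken from the original data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN/None mark the missing entries.
df = pd.DataFrame({
    "Age": [25, np.nan, 31, 40],
    "Gender": ["Male", "Female", None, "Male"],
    "Income": [50000, 62000, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Fraction of rows that are missing in each column.
missing_fraction = df.isnull().mean()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(missing_fraction)
print(rows_with_missing)
```

The per-column fraction is useful later when deciding whether a column is worth keeping at all.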
Delete Rows with Missing Values:
Missing values can be handled by deleting the rows or columns that contain
null values. If more than half of the rows in a column are null, the
entire column can be dropped. Rows that have a null in one or more
columns can also be dropped.
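The deletion rule described above could be sketched with pandas as follows. The toy DataFrame and the names `mostly_null`, `df_cols_dropped` and `df_rows_dropped` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Notes" is null in 3 of 4 rows (75%).
df = pd.DataFrame({
    "Age": [25, np.nan, 31, 40],
    "Income": [50000, 62000, 58000, 45000],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Step 1: drop columns in which more than half of the rows are null.
mostly_null = df.columns[df.isnull().mean() > 0.5]
df_cols_dropped = df.drop(columns=mostly_null)

# Step 2: drop every remaining row that has a null in any column.
df_rows_dropped = df_cols_dropped.dropna(axis=0, how="any")

print(df_rows_dropped)
```

Dropping the sparse column first keeps more rows: had the rows been dropped first, only the single row with a non-null "Notes" value would have survived.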
Replacing with an arbitrary value
If you can make an educated guess about the missing value, then you can
replace it with some arbitrary value using the following code. E.g., in the
following code, we are replacing the missing values of the ‘Dependents’
column with ‘0’.
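IN:
# Replace the missing values with 0 using the 'fillna' method
# (train_df is assumed to be an already loaded pandas DataFrame)
train_df['Dependents'] = train_df['Dependents'].fillna(0)
# Count the missing values that remain in the column
train_df['Dependents'].isnull().sum()
OUT: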
Replacing with the mean
Replacing with the mode
Replacing with the median
Replacing with the previous value – forward fill
Replacing with the next value – backward fill
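The five replacement strategies listed above can be sketched on a single numeric column. This is an illustration on made-up values, not the original notebook's code; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries.
s = pd.Series([10.0, np.nan, 20.0, 20.0, np.nan, 40.0, 60.0, 30.0])

# Mean, median and mode are computed from the observed values only.
filled_mean = s.fillna(s.mean())
filled_median = s.fillna(s.median())
filled_mode = s.fillna(s.mode()[0])   # mode() returns a Series; take the first mode

# Forward fill copies the previous observed value downward;
# backward fill copies the next observed value upward.
filled_ffill = s.ffill()
filled_bfill = s.bfill()

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
print(filled_ffill.tolist())
print(filled_bfill.tolist())
```

Forward and backward fill only make sense when the row order is meaningful, for example in a time series.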
How to Impute Missing Values for Categorical
Features?
There are two ways to impute missing values for categorical features as
follows:
1. Impute the most frequent value: We will use 'SimpleImputer' in this
case. As this is a non-numeric column, we can't use the mean or
median, but we can use the most frequent value or a constant.
2. Impute the value "Missing": We can impute the value "Missing",
which treats missing entries as a separate category.
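Both options can be sketched as below. Note that this sketch uses plain pandas rather than 'SimpleImputer', to stay dependency-light; in scikit-learn the same two behaviours correspond to `SimpleImputer(strategy="most_frequent")` and `SimpleImputer(strategy="constant", fill_value="Missing")`. The column values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
gender = pd.Series(["Male", "Female", np.nan, "Male", np.nan])

# Option 1: impute the most frequent value (the mode).
most_frequent = gender.mode()[0]
imputed_most_frequent = gender.fillna(most_frequent)

# Option 2: impute the constant "Missing" as its own category.
imputed_constant = gender.fillna("Missing")

print(imputed_most_frequent.tolist())
print(imputed_constant.tolist())
```

The second option preserves the information that a value was absent, which can itself be predictive.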
Outliers
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population.
Outlier detection is the process of identifying data points that differ
markedly from the rest of the data points in the dataset, that is, abnormal
or abnormal-looking data points. Once identified, outliers may be removed
or treated in some other way.
Types of outlier detection
There are two main types of outlier detection: descriptive and
prescriptive.
Descriptive outlier detection simply describes the outliers while
prescriptive outlier detection determines what action, if any,
needs to be taken based on the outlier.
Identifying Outliers using Z-Score
Z-Score is a measure of how many standard deviations a data point is
away from the mean.
Data points whose absolute Z-Score is greater than a chosen threshold are
considered outliers.
Definition of Z-Scores: Z-Scores are calculated by subtracting the
mean of the data set from a data point and dividing the result by the
standard deviation of the data set. The resulting value is a measure of
how many standard deviations a data point is away from the mean.
For example, let's say we have a dataset of test scores for a group of
students. The mean score is 75, and the standard deviation is 5. If a
student scored 85 on the test, we can calculate their Z-score as follows:
Z-score = (85 - 75) / 5 = 2
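The calculation above, and its use for flagging outliers, can be sketched with NumPy. The `scores` array is made up, the helper name `z_scores` is hypothetical, and the threshold of 3 is a common convention assumed here; the text only says "a threshold".

```python
import numpy as np

def z_scores(values):
    """Return (x - mean) / std for every value, using the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# The worked example: score 85, mean 75, standard deviation 5.
z_example = (85 - 75) / 5
print(z_example)

# Hypothetical test scores with one extreme value.
scores = np.array([72, 74, 75, 76, 78, 73, 77, 75, 74, 76, 75, 73, 77, 75, 120])
z = z_scores(scores)

threshold = 3.0  # assumed cut-off
outliers = scores[np.abs(z) > threshold]
print(outliers)
```

With very small samples no point can reach a Z-score of 3, because the extreme value itself inflates the mean and standard deviation; this is one reason the more robust methods are often preferred.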
Modified Z-Score
However, z-scores can be affected by unusually large or small data values, which is
why a more robust way to detect outliers is to use a modified z-score, which is
calculated as:
Modified z-score = 0.6745 * (xi - x̃) / MAD
where:
• xi: A single data value
• x̃: The median of the dataset
• MAD: The median absolute deviation of the dataset
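A minimal sketch of this formula with NumPy follows. The data are made up, the helper name `modified_z_scores` is hypothetical, and the cut-off of 3.5 is a commonly cited convention assumed here rather than something stated in the text.

```python
import numpy as np

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for every value."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = np.median(np.abs(values - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return 0.6745 * (values - median) / mad

# Hypothetical test scores with one extreme value.
scores = np.array([72, 74, 75, 76, 78, 73, 77, 75, 74, 76, 75, 73, 77, 75, 120])
m = modified_z_scores(scores)

threshold = 3.5  # assumed cut-off
outliers = scores[np.abs(m) > threshold]
print(outliers)
```

Because the median and MAD barely move when one extreme value is added, the extreme point stands out far more clearly here than with the ordinary z-score.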
Identifying Outliers using IQR (Interquartile Range)
The IQR is the range between the first quartile (Q1) and the third quartile (Q3)
of the data, that is, IQR = Q3 - Q1.
Outliers are often identified as values outside the range
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
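The IQR rule can be sketched with NumPy as follows. The data are made up, and the quartiles use `np.percentile` with its default linear interpolation; other quartile conventions can give slightly different fences.

```python
import numpy as np

# Hypothetical test scores with one extreme value.
scores = np.array([72, 74, 75, 76, 78, 73, 77, 75, 74, 76, 75, 73, 77, 75, 120])

# First and third quartiles (25th and 75th percentiles).
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(q1, q3, iqr)
print(lower, upper)
print(outliers)
```

These fences are the same ones a box plot uses to decide which points to draw individually beyond the whiskers.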
