Exploratory Data Analysis Unit 1 ppt presentation.pptx
1. Unit 1: DATA PROCESSING AND STATISTICS
Basics of data and its processing: record keeping, statistics and data science,
measurement scales, properties of data, visualization, cleaning the data,
symbolic data analysis. Statistics: basic statistical measures, variance and
standard deviation, visualizing statistical measures, calculating percentiles,
quartiles and box plots. Missing data handling methods: finding missing values,
dealing with missing values. Outliers: what are outliers, using Z-scores to find
outliers, the modified Z-score, using IQR to detect outliers.
3. Statistics & Data Science
Data science involves the collection, organization, analysis and visualization of large amounts
of data.
Statisticians, meanwhile, use mathematical models to quantify relationships between
variables and outcomes and make predictions based on those relationships.
Statisticians do not use computer science, algorithms or machine learning to the same degree
as computer scientists.
4. Data Science vs. Statistics
Definition:
- Data Science: an interdisciplinary branch of computer science used to gain
valuable information from large data using statistics, computers and technology.
- Statistics: a mathematical science for analysing existing data pertaining to
specific problems, applying statistical tools to this data, and presenting the
results for decision-making.
Concept:
- Data Science: the primary goal is to identify underlying trends and patterns
in data for decision making; it works well on both quantitative and qualitative
data. Key steps include data mining, data pre-processing, Exploratory Data
Analysis (EDA), and model building and optimization.
- Statistics: the primary goal is to determine cause-and-effect relationships in
analysed data; it is a purely mathematical approach and works only on
quantitative data. Key terms include the mean, median, mode, standard
deviation (σ) and variance (σ²).
5. Techniques:
- Data Science: some important techniques include regression and classification.
- Statistics: some important techniques include probability distribution,
acceptance sampling and statistical quality control.
Application Areas:
- Data Science: can be applied in specialized areas like computer vision,
natural language processing, disaster management, recommender systems and
search engines, etc.
- Statistics: can be applied in areas where random variations are observed in
sampled data, like medicine, information technology, economics, engineering,
finance, marketing, accounting, and business.
6. Properties of Data
The following are the properties of data:
1) amenability of use,
2) clarity,
3) accuracy, and
4) essence (quality).
7. Amenability of use: From the dictionary meaning of data it is learnt
that data are facts used in deciding something. In short, data are
meant to be used as a base for arriving at definitive conclusions. They
are not required if they are not amenable to use.
Clarity: Data should display the clarity that is essential for
communicating the essence of the matter. Without clarity, the
meaning desired to be communicated will remain hidden.
Accuracy: Data should be real, complete and accurate. Accuracy is
thus an essential property of data. Since data offer a basis for
deciding something, they must necessarily be accurate if valid
conclusions are to be drawn.
8. Essence: In the social sciences, large quantities of data are collected
that cannot be presented in raw form, nor is it necessary to present them
in that form. They have to be compressed and refined. Data so refined
can present the essence, or derived qualitative value, of the matter.
Data in the sciences consist of observations made from scientific
experiments; these are all measured quantities. Data, thus, are
always the essence of the matter.
10. Missing Data Handling Methods
Real-world data often have many missing values. Missing values can be
caused by data corruption or by failure to record data. Handling
missing data is very important during preprocessing of the dataset, as
many machine learning algorithms do not support missing values.
11. 1. Deleting rows with missing values
2. Imputing missing values for continuous variables
3. Imputing missing values for categorical variables
4. Other imputation methods
5. Using algorithms that support missing values
6. Prediction of missing values
7. Imputation using the deep learning library Datawig
12. Delete Rows with Missing Values:
Missing values can be handled by deleting the rows or columns that
contain null values. If a column has more than half of its rows null,
the entire column can be dropped. Rows that have one or more null
column values can also be dropped.
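Both deletions are one-liners in pandas. A minimal sketch, assuming a DataFrame named df (the columns and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame: column "B" is null in 3 of 4 rows
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 2.0],
})

# Drop any column whose null fraction exceeds one half
cols_kept = df.loc[:, df.isnull().mean() <= 0.5]

# Drop any row that has one or more nulls
rows_kept = df.dropna()
```

Here `cols_kept` retains only column "A", and `rows_kept` retains only the one fully populated row.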
13. Replacing with an arbitrary value
If you can make an educated guess about the missing value, then you can
replace it with some arbitrary value using the following code. E.g., in the
following code, we replace the missing values of the 'Dependents'
column with 0.
IN:
# Replace the missing values with 0 using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()
OUT:
0
14. Replacing with the mean
Replacing with the mode
Replacing with the median
Replacing with the previous value – forward fill
Replacing with the next value – backward fill
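Each of these replacements is a one-line pandas call. A sketch on a made-up Series with two missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

mean_filled = s.fillna(s.mean())       # mean of the non-missing values
median_filled = s.fillna(s.median())   # middle non-missing value
mode_filled = s.fillna(s.mode()[0])    # most frequent (first modal) value
ffilled = s.ffill()                    # forward fill: carry the previous value down
bfilled = s.bfill()                    # backward fill: pull the next value up
```

Forward fill leaves a gap unfilled if the very first value is missing; backward fill has the same issue at the end of the Series.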
15. How to Impute Missing Values for Categorical
Features?
There are two ways to impute missing values for categorical features, as
follows:
Impute the Most Frequent Value: We will use 'SimpleImputer' in this
case; as this is a non-numeric column, we can't use the mean or
median, but we can use the most frequent value or a constant.
Impute the Value "Missing": We can impute the value "missing,"
which treats it as a separate category.
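Both options can be sketched with scikit-learn's SimpleImputer (the 'Gender' column and its values below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({"Gender": ["Male", np.nan, "Female", "Male", np.nan]})

# Option 1: fill with the most frequent value in the column
most_frequent = SimpleImputer(strategy="most_frequent")
filled_mode = most_frequent.fit_transform(X)

# Option 2: fill with the constant "Missing", creating a separate category
constant = SimpleImputer(strategy="constant", fill_value="Missing")
filled_const = constant.fit_transform(X)
```

Option 1 fills both gaps with "Male" (the modal value); Option 2 keeps the fact of missingness visible as its own category, which some models can exploit.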
16. Outliers
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population.
Outlier detection is a process used to identify and remove data points
from a dataset that differ from the rest of the data points in the dataset.
OR
Outlier detection is the process of identifying abnormal or anomalous-looking
data points in a dataset.
17. Types of outlier detection
There are two main types of outlier detection: descriptive and
prescriptive.
Descriptive outlier detection simply describes the outliers, while
prescriptive outlier detection determines what action, if any,
needs to be taken based on the outlier.
18. Identifying Outliers using Z-Score
A Z-score is a measure of how many standard deviations a data point is
away from the mean.
Data points with an absolute Z-score greater than a threshold are considered
outliers.
Definition of Z-scores: Z-scores are calculated by subtracting the
mean of the dataset from a data point and dividing the result by the
standard deviation of the dataset: z = (x − μ) / σ. The resulting value is a
measure of how many standard deviations the data point is away from the mean.
19. For example, let's say we have a dataset of test scores for a group of
students. The mean score is 75, and the standard deviation is 5. If a
student scored 85 on the test, we can calculate their Z-score as follows:
Z-score = (85 - 75) / 5 = 2
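The same calculation, vectorized over a whole sample with NumPy (the scores and the threshold of 2 below are illustrative):

```python
import numpy as np

scores = np.array([70.0, 72.0, 75.0, 74.0, 76.0, 73.0, 75.0, 95.0])

mean = scores.mean()
std = scores.std()          # population standard deviation
z = (scores - mean) / std   # Z-score of every point

threshold = 2.0
outliers = scores[np.abs(z) > threshold]
```

Only the score of 95 exceeds the threshold in this sample; the common cutoffs in practice are 2 or 3 standard deviations.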
20. Modified Z-score
However, Z-scores can be affected by unusually large or small data values, which is
why a more robust way to detect outliers is to use a modified Z-score,
calculated as:
Modified Z-score = 0.6745 (xi − x̃) / MAD
where:
• xi: a single data value
• x̃: the median of the dataset
• MAD: the median absolute deviation of the dataset
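A sketch of the modified Z-score on a made-up sample; the cutoff of 3.5 used below is the commonly recommended one (Iglewicz and Hoaglin), not something fixed by the formula itself:

```python
import numpy as np

data = np.array([12.0, 13.0, 14.0, 13.0, 12.0, 14.0, 13.0, 40.0])

median = np.median(data)
mad = np.median(np.abs(data - median))      # median absolute deviation
modified_z = 0.6745 * (data - median) / mad

outliers = data[np.abs(modified_z) > 3.5]
```

Because the median and MAD are barely moved by the extreme value 40, it stands out sharply, whereas a single extreme point inflates the mean and standard deviation that the ordinary Z-score depends on.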
21. Identifying Outliers using IQR (Interquartile Range)
The IQR is the range between the first quartile (Q1) and the third quartile (Q3)
of the data.
Outliers are often identified as values outside the range
[Q1 − 1.5 × IQR, Q3 + 1.5 × IQR].
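A sketch of the IQR rule on an illustrative sample (note that np.percentile uses linear interpolation between data points by default, so quartile values can differ slightly between tools):

```python
import numpy as np

data = np.array([5.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 40.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

outliers = data[(data < lower) | (data > upper)]
```

With Q1 = 8 and Q3 = 12, the fences are [2, 18], so only the value 40 is flagged; these are the same fences a box plot draws as its whiskers.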