22PCOAM21 Data Analytics - Session 3: Data Quality
Department of Computer Science & Engineering (SB-ET)
III B.Tech - II Semester
DATA ANALYTICS
SUBJECT CODE: 22PCOAM21
Academic Year: 2024-2025
By
Dr. M. Gokilavani
GNITC
22PCOAM21 DATA ANALYTICS
UNIT – I
Syllabus
Data Management: Design Data Architecture and manage the data for
analysis, understand various sources of Data like Sensors/Signals/GPS etc.
Data Management, Data Quality (noise, outliers, missing values, duplicate
data) and Data Preprocessing & Processing.
Course Prerequisites
1. Database Management Systems.
2. Knowledge of probability and statistics.
TEXTBOOK:
• Student's Handbook for Associate Analytics - II, III.
• Data Mining: Concepts and Techniques, Han & Kamber, 3rd Edition, Morgan Kaufmann Publishers.
REFERENCES:
• Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
• Data Mining and Analysis: Fundamental Concepts and Algorithms, M. Zaki and W. Meira.
• Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), Jeffrey D. Ullman (Stanford Univ.).
No of Hours Required: 13
UNIT - I LECTURE - 03
Data Quality
• This lecture covers missing-value imputation using the mean, the median, and the distance-based k-Nearest Neighbor approach.
• Data quality issues such as:
i. Noise
ii. Outliers
iii. Missing values
iv. Duplicate data
can significantly impact data analysis and machine learning models.
• These issues can lead to inaccurate results and flawed insights if not addressed properly.
Data Quality
• Data cleaning, a crucial preprocessing step, involves identifying and correcting these issues to improve data quality.
• This session covers its importance and the essential procedures for making sure the data we use is correct, reliable, and appropriate for the purpose for which it was collected.
• Data quality is a major concern in machine learning tasks.
• Why: almost all learning algorithms induce knowledge strictly from data.
• The quality of the extracted knowledge depends heavily on the quality of the data.
• There are two main problems in data quality:
• Missing data: the data is not present.
• Noisy data: the data is present but not correct.
Noise
• Noise refers to irrelevant or incorrect data that is difficult for machines to interpret, often caused by errors in data collection or entry.
• It can be handled in several ways:
i. Binning Method: The data is sorted and partitioned into equal-sized segments, and each segment is smoothed by replacing its values with the segment mean or the boundary values (sketched below).
ii. Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, to predict values.
iii. Clustering: This method groups similar data points together, with outliers either going undetected or falling outside the clusters.
• These techniques help remove noise and improve data quality.
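As a minimal illustration (not from the slides), the sketch below smooths a small made-up sorted series by bin means using equal-frequency binning; the function name and bin count are arbitrary choices:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    # Equal-frequency binning: sort the data, split it into n_bins
    # segments, and replace every value by its segment's mean.
    order = np.argsort(values)
    smoothed = np.empty_like(values, dtype=float)
    for bin_indices in np.array_split(order, n_bins):
        smoothed[bin_indices] = values[bin_indices].mean()
    return smoothed

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
print(smooth_by_bin_means(prices, n_bins=3))
# Bins [4, 8, 15], [21, 21, 24], [25, 28, 34] -> means 9, 22, 29.
```

Smoothing by boundary values would instead replace each value with whichever bin edge (the bin's minimum or maximum) is closer.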
Outliers
• An outlier is a data point that significantly deviates from, and stands out from, the rest of the data in a set.
• It can be either much higher or much lower than the other data points, and its presence can have a significant impact on the results of machine learning algorithms.
• Outliers can be caused by measurement or execution errors.
• The analysis of outlier data is referred to as outlier analysis.
Types of Outliers
There are two main types of outliers:
i. Global outliers:
• Global outliers are isolated data points that are far away from the main body
of the data.
• They are often easy to identify and remove.
ii. Contextual outliers:
• Contextual outliers are data points that are unusual in a specific context but
may not be outliers in a different context.
• They are often more difficult to identify and may require additional
information or domain knowledge to determine their significance.
Outlier Algorithm
Step 1: Calculate the mean of each cluster.
Step 2: Initialize the threshold value.
Step 3: Calculate the distance of the test data point from each cluster mean.
Step 4: Find the nearest cluster to the test data point.
Step 5: If (distance > threshold), flag the point as an outlier.
A sketch of this procedure follows.
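Below is a hedged sketch of these five steps in Python; the data, the use of scikit-learn's KMeans to obtain the cluster means, and the threshold of 2.5 are illustrative assumptions rather than values from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data; the cluster count, threshold, and test points are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # Step 1: cluster means
threshold = 2.5                                                  # Step 2: threshold

def is_outlier(point, centers, threshold):
    dists = np.linalg.norm(centers - point, axis=1)  # Step 3: distance to each cluster mean
    return dists.min() > threshold                   # Steps 4-5: nearest cluster vs threshold

print(is_outlier(np.array([8.0, 8.0]), kmeans.cluster_centers_, threshold))   # far point -> True
print(is_outlier(np.array([0.1, -0.2]), kmeans.cluster_centers_, threshold))  # central point -> False
```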
Outlier Detection Methods
• Outlier detection plays a crucial role in ensuring the quality and accuracy of machine learning models.
• By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability.
• Here's an overview of various outlier detection methods:
• Statistical Methods: Z-Score, Interquartile Range (IQR)
• Distance-Based Methods: K-Nearest Neighbors (KNN), Local Outlier Factor (LOF)
• Clustering-Based Methods: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Hierarchical clustering
• Other Methods: Isolation Forest, One-class Support Vector Machines (OCSVM)
Statistical & Distance-Based Methods
• Z-Score: This method measures how many standard deviations a data point lies from the mean and identifies outliers as points whose Z-scores exceed a certain threshold (typically 3 or -3).
• Interquartile Range (IQR): IQR identifies outliers as data points falling outside the range defined by Q1 - k*(Q3 - Q1) and Q3 + k*(Q3 - Q1), where Q1 and Q3 are the first and third quartiles, and k is a factor (typically 1.5).
Both rules are sketched below.
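A minimal sketch of both rules on synthetic data; the distribution and the two planted outliers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(50, 5, size=200), [95.0, 5.0]])  # two planted outliers

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print("IQR outliers:", data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])
```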
Distance-Based Methods
• K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are far away from them.
• Local Outlier Factor (LOF): This method calculates the local density of data points and identifies outliers as those with significantly lower density compared to their neighbors (see the sketch below).
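The sketch below applies scikit-learn's LocalOutlierFactor to synthetic data with one planted outlier; the parameter choices are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[6.0, 6.0]]])  # one planted outlier

# LOF compares each point's local density with that of its neighbors;
# fit_predict returns -1 for outliers and 1 for inliers.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print(X[labels == -1])  # should include the planted point [6, 6]
```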
Clustering-Based Methods
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters data points based on their density and identifies outliers as points not belonging to any cluster.
• Hierarchical clustering: Hierarchical clustering builds a hierarchy of clusters by iteratively merging or splitting clusters based on their similarity. Outliers can be identified as clusters containing only a single data point or clusters significantly smaller than others.
Other Methods:
• Isolation Forest: Isolation Forest randomly isolates data points by splitting features and identifies outliers as those isolated quickly and easily.
• One-class Support Vector Machines (OCSVM): One-Class SVM learns a boundary around the normal data and identifies outliers as points falling outside the boundary.
DBSCAN and Isolation Forest are sketched below.
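A short sketch of DBSCAN-based and Isolation-Forest-based detection using scikit-learn; the data and the parameters (eps, min_samples, contamination) are made-up illustrations:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(150, 2)),
               [[4.0, 4.0], [-4.0, 3.5]]])  # two planted outliers

# DBSCAN: points assigned the label -1 belong to no cluster (noise/outliers).
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN noise:", X[db.labels_ == -1])

# Isolation Forest: points isolated by few random splits are scored -1.
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
print("Isolation Forest outliers:", X[iso.predict(X) == -1])
```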
Techniques for Handling Outliers
1. Removal:
• This involves identifying and removing outliers from the dataset before training the model. Common methods include:
• Thresholding: Outliers are identified as data points exceeding a certain threshold (e.g., Z-score > 3).
• Distance-based methods: Outliers are identified based on their distance from their nearest neighbors.
• Clustering: Outliers are identified as points not belonging to any cluster or belonging to very small clusters.
A thresholding sketch follows.
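A minimal removal sketch, assuming the Z-score > 3 thresholding rule above; the data is synthetic with two planted outliers:

```python
import numpy as np

rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(100, 10, size=300), [250.0, -40.0]])

# Thresholding: keep only values whose |z-score| is at most 3.
z = (data - data.mean()) / data.std()
cleaned = data[np.abs(z) <= 3]
print(len(data), "->", len(cleaned))  # the planted extremes are dropped
```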
Techniques for Handling Outliers
2. Transformation:
• This involves transforming the data to reduce the influence of outliers. Common methods include:
• Scaling: Standardizing or normalizing the data to have a mean of zero and a standard deviation of one.
• Winsorization: Replacing outlier values with the nearest non-outlier value.
• Log transformation: Applying a logarithmic transformation to compress the data and reduce the impact of extreme values.
Winsorization and the log transform are sketched below.
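A hedged sketch of two of these transformations; winsorization is approximated here by clipping to the 5th and 95th percentiles (an illustrative choice), and np.log1p stands in for the log transform:

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])  # 100.0 is an extreme value

# Winsorization (via percentile clipping): pull extremes back to the
# 5th/95th percentiles instead of deleting them.
low, high = np.percentile(data, [5, 95])
print(np.clip(data, low, high))

# Log transformation: compresses large values so they dominate less.
print(np.log1p(data))
```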
Techniques for Handling Outliers
3. Robust Estimation:
• This involves using algorithms that are less sensitive to outliers. Some examples include:
• Robust regression: Algorithms like L1-regularized regression or Huber regression are less influenced by outliers than least-squares regression (see the sketch below).
• M-estimators: These algorithms estimate the model parameters using a robust objective function that down-weights the influence of outliers.
• Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are less susceptible to the presence of outliers than K-means clustering.
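A small sketch contrasting ordinary least squares with scikit-learn's HuberRegressor on data whose first few targets are corrupted; the data-generating process is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, size=100)
y[:5] += 50.0  # corrupt a few targets to simulate outliers

print(LinearRegression().fit(X, y).coef_)  # distorted by the corrupted points
print(HuberRegressor().fit(X, y).coef_)    # stays close to the true slope of 2
```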
Techniques for Handling Outliers
4. Modeling Outliers:
• This involves explicitly modeling the outliers as a separate group. This can be done by:
• Adding a separate feature: Create a new feature indicating whether a data point is an outlier or not (sketched below).
• Using a mixture model: Train a model that assumes the data comes from a mixture of multiple distributions, where one distribution represents the outliers.
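A minimal sketch of the "separate feature" idea, using an Isolation Forest flag as the indicator column; the flagging method and its parameters are illustrative assumptions (a mixture model, e.g. scikit-learn's GaussianMixture, would be the other route):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 1, size=(10, 2))])  # a small anomalous group

# Add a separate feature: a 0/1 column flagging suspected outliers so a
# downstream model can treat them as their own group.
flags = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1
X_aug = np.column_stack([X, flags.astype(float)])
print(X_aug.shape)  # (210, 3)
```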
Importance of Outlier Detection
If outliers are left unhandled, they can cause:
• Biased models
• Reduced accuracy
• Increased variance
• Reduced interpretability
Imputation of Missing Values
• Imputation denotes a procedure that replaces the missing values in a dataset with plausible values,
• i.e., by considering relationships among the correlated attributes of the dataset.
Mean, median, and k-Nearest Neighbor imputation are sketched below.
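A minimal sketch of mean, median, and kNN imputation using scikit-learn's SimpleImputer and KNNImputer on a tiny made-up matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))    # fill with column means
print(SimpleImputer(strategy="median").fit_transform(X))  # fill with column medians
print(KNNImputer(n_neighbors=2).fit_transform(X))         # average the 2 nearest rows
```

KNNImputer measures row similarity with a NaN-aware Euclidean distance, so the imputed value reflects the relationship among correlated attributes rather than a single column statistic.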
Topics to be covered in next session (Session 4):
• Data Management
Thank you!!!