07/26/2025
Department of Computer Science & Engineering (SB-ET)
III B. Tech - II Semester
DATA ANALYTICS
SUBJECT CODE: 22PCOAM21
Academic Year: 2024-2025
By Dr. M. Gokilavani
GNITC
Department of CSE (SB-ET)
22PCOAM21 DATA ANALYTICS
UNIT – I
Syllabus
Data Management: Design data architecture and manage the data for analysis; understand various sources of data such as sensors, signals, GPS, etc.; Data Management, Data Quality (noise, outliers, missing values, duplicate data), and Data Preprocessing & Processing.
Course Prerequisites
1. Database Management Systems.
2. Knowledge of probability and statistics.
TEXTBOOK:
• Student’s Handbook for Associate Analytics - II, III.
• Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann
Publishers.
REFERENCES:
• Introduction to Data Mining, Tan, Steinbach and Kumar, Addison-Wesley, 2006.
• Data Mining and Analysis: Fundamental Concepts and Algorithms, M. Zaki and W. Meira.
• Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), Jeffrey D. Ullman (Stanford Univ.).
No of Hours Required: 13
UNIT - I LECTURE - 03
Data Quality
• Missing-value imputation using the mean, the median, and the k-Nearest Neighbor approach with distance measures.
• Data quality issues such as
  i. Noise
  ii. Outliers
  iii. Missing values
  iv. Duplicate data
  can significantly impact data analysis and machine learning models.
• These issues can lead to inaccurate results and flawed insights if not addressed properly.
Data Quality
• Data cleaning, a crucial preprocessing step, involves identifying and correcting these issues to improve data quality.
• It ensures that the data we use is correct, reliable, and appropriate for the purpose for which it was collected.
• Data quality is a major concern in learning tasks.
• Why: Almost all learning algorithms induce knowledge strictly from data.
• The quality of the knowledge extracted depends heavily on the quality of the data.
• There are two main problems in data quality:
  • Missing data: the data is not present.
  • Noisy data: the data is present but not correct.
Noise
• Noise refers to irrelevant or incorrect data that is difficult for machines to interpret, often caused by errors in data collection or entry.
• It can be handled in several ways:
  i. Binning: The sorted data is partitioned into equal-sized segments (bins), and each bin is smoothed by replacing its values with the bin mean or the bin boundary values.
  ii. Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, and using the function to predict values.
  iii. Clustering: Similar data points are grouped together; values falling outside the clusters can be treated as noise or outliers.
• These techniques help remove noise and improve data quality.
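The binning technique above can be sketched in a few lines of Python. This is a minimal illustration of smoothing by bin means; the sample values and bin size are taken from a common textbook-style example, not from this lecture.

```python
# Sketch of smoothing by bin means (equal-frequency binning).
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into equal-sized bins, and replace
    every value in a bin with that bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_ = ordered[start:start + bin_size]
        bin_mean = sum(bin_) / len(bin_)
        smoothed.extend([bin_mean] * len(bin_))
    return smoothed

print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3))
# → [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```

Smoothing by bin boundaries would instead replace each value with the nearer of the bin's minimum or maximum.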
Outliers
• An outlier is a data point that significantly deviates from the rest of the data in a set.
• It can be either much higher or much lower than the other data points, and its presence can have a significant impact on the results of machine learning algorithms.
• Outliers can be caused by measurement or execution errors.
• The analysis of outlier data is referred to as outlier analysis.
Types of Outliers
There are two main types of outliers:
i. Global outliers:
• Global outliers are isolated data points that are far away from the main body
of the data.
• They are often easy to identify and remove.
ii. Contextual outliers:
• Contextual outliers are data points that are unusual in a specific context but
may not be outliers in a different context.
• They are often more difficult to identify and may require additional
information or domain knowledge to determine their significance.
Outlier Algorithm
Step 1: Calculate the mean of each cluster.
Step 2: Initialize the threshold value.
Step 3: Calculate the distance of the test data point from each cluster mean.
Step 4: Find the nearest cluster to the test data point.
Step 5: If (Distance > Threshold), flag the point as an outlier.
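The five steps above can be sketched directly in Python. The clusters, test points, and threshold below are illustrative assumptions, not values from the lecture.

```python
import math

def is_outlier(clusters, point, threshold):
    """Flag `point` as an outlier if its distance to the nearest
    cluster mean exceeds `threshold`."""
    # Step 1: mean of each cluster (component-wise).
    means = [tuple(sum(coord) / len(coord) for coord in zip(*cluster))
             for cluster in clusters]
    # Steps 3-4: distance to each mean; keep the nearest.
    nearest = min(math.dist(point, m) for m in means)
    # Step 5: compare against the threshold (Step 2).
    return nearest > threshold

clusters = [[(0, 0), (1, 1), (0, 1)], [(10, 10), (11, 10)]]
print(is_outlier(clusters, (5, 5), threshold=2.0))        # → True
print(is_outlier(clusters, (10.5, 10.0), threshold=2.0))  # → False
```

The threshold is a free parameter here; in practice it is often derived from the spread of distances within each cluster.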
Outlier Detection Methods
• Outlier detection plays a crucial role in ensuring the quality and accuracy of machine learning models.
• By identifying and removing or handling outliers effectively, we can prevent them from biasing the
model, reducing its performance, and hindering its interpretability.
• Here's an overview of various outlier detection methods:
• Statistical Methods
• Z-Score
• Interquartile Range (IQR)
• Distance-Based Methods
• K-Nearest Neighbors (KNN)
• Local Outlier Factor (LOF)
• Clustering-Based Methods
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Hierarchical clustering
• Other Methods
• Isolation Forest
• One-class Support Vector Machines (OCSVM)
Statistical & Distance based Methods
• Z-Score: The Z-score standardizes each value as (x − mean) / standard deviation and flags as outliers those values whose Z-score exceeds a certain threshold (typically 3 or −3).
• Interquartile Range (IQR): IQR identifies outliers as data points falling outside the range defined by
  • Q1 − k·(Q3 − Q1) and Q3 + k·(Q3 − Q1),
  • where Q1 and Q3 are the first and third quartiles, and k is a factor (typically 1.5).
Distance-Based Methods
• K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are far away from them.
• Local Outlier Factor (LOF): LOF calculates the local density of each data point and identifies outliers as points with significantly lower density than their neighbors.
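Both statistical methods above can be sketched with Python's standard library. The sample data is an illustrative assumption; `statistics.quantiles` is used to obtain Q1 and Q3.

```python
from statistics import mean, stdev, quantiles

def zscore_outliers(data, threshold=3.0):
    """Flag values whose |Z-score| exceeds the threshold."""
    mu, sigma = mean(data), stdev(data)
    return [x for x in data if abs((x - mu) / sigma) > threshold]

def iqr_outliers(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(data, n=4)   # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 100]
print(iqr_outliers(data))   # → [100]
```

Note that a single extreme value inflates the standard deviation, so the Z-score rule may need a lower threshold on small samples; the IQR rule is more robust to this.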
Clustering-Based Methods
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters data points based on their density and identifies outliers as points not belonging to any cluster.
• Hierarchical clustering: Hierarchical clustering builds a hierarchy of clusters by iteratively merging or splitting clusters based on their similarity. Outliers can be identified as clusters containing only a single data point, or clusters significantly smaller than the others.
Other Methods:
• Isolation Forest: Isolation Forest randomly isolates data points by splitting on features and identifies outliers as those isolated quickly and easily.
• One-Class Support Vector Machines (OCSVM): One-Class SVM learns a boundary around the normal data and identifies outliers as points falling outside the boundary.
Techniques for Handling Outliers
1. Removal:
• This involves identifying and removing outliers from the dataset before training
the model. Common methods include:
• Thresholding: Outliers are identified as data points exceeding a certain
threshold (e.g., Z-score > 3).
• Distance-based methods: Outliers are identified based on their
distance from their nearest neighbors.
• Clustering: Outliers are identified as points not belonging to any
cluster or belonging to very small clusters.
2. Transformation:
• This involves transforming the data to reduce the influence of
outliers. Common methods include:
• Scaling: Standardizing or normalizing the data to have a mean of zero
and a standard deviation of one.
• Winsorization: Replacing outlier values with the nearest non-outlier
value.
• Log transformation: Applying a logarithmic transformation to
compress the data and reduce the impact of extreme values.
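The transformation techniques above can be sketched with the standard library. The cut-off values and sample data are illustrative assumptions; this variant of winsorization clips to given bounds rather than computing percentiles.

```python
import math

def winsorize(data, low, high):
    """Clip every value into the [low, high] range, replacing
    outliers with the nearest allowed value."""
    return [min(max(x, low), high) for x in data]

data = [1, 2, 3, 4, 5, 200]
print(winsorize(data, low=1, high=5))   # → [1, 2, 3, 4, 5, 5]

# Log transformation compresses the range of extreme values.
print([round(math.log1p(x), 2) for x in data])
```

In full winsorization the bounds are usually set to chosen percentiles of the data (e.g., the 5th and 95th) rather than fixed constants.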
3. Robust Estimation:
• This involves using algorithms that are less sensitive to outliers. Some examples
include:
• Robust regression: Algorithms like L1-regularized regression or Huber
regression are less influenced by outliers than least squares regression.
• M-estimators: These algorithms estimate the model parameters using a robust objective function that down-weights the influence of outliers.
• Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are
less susceptible to the presence of outliers than K-means clustering.
4. Modeling Outliers:
• This involves explicitly modeling the outliers as a separate group. This can be done
by:
• Adding a separate feature: Create a new feature indicating whether a data
point is an outlier or not.
• Using a mixture model: Train a model that assumes the data comes from a
mixture of multiple distributions, where one distribution represents the
outliers.
Importance of outlier detection
Unhandled outliers can lead to:
• Biased models
• Reduced accuracy
• Increased variance
• Reduced interpretability
Imputation of Missing Values
• Imputation denotes a procedure that replaces the missing values in a dataset with plausible values.
• i.e., by exploiting the relationships among correlated attributes of the dataset.
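The mean, median, and k-Nearest Neighbor imputation approaches mentioned at the start of this session can be sketched as follows; the datasets and k are illustrative assumptions, and the kNN sketch assumes at most one missing value per row.

```python
from statistics import mean, median

def impute(values, strategy=mean):
    """Replace None entries with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

ages = [23, 25, None, 31, 41]
print(impute(ages))                    # mean imputation  → [23, 25, 30, 31, 41]
print(impute(ages, strategy=median))   # median imputation → [23, 25, 28, 31, 41]

def knn_impute(rows, k=2):
    """For each missing cell, average that feature over the k complete
    rows nearest in the remaining features (squared Euclidean distance)."""
    complete = [r for r in rows if None not in r]
    result = []
    for r in rows:
        if None not in r:
            result.append(list(r))
            continue
        j = r.index(None)                              # the missing feature
        others = [i for i in range(len(r)) if i != j]  # features used for distance
        nearest = sorted(complete,
                         key=lambda c: sum((r[i] - c[i]) ** 2 for i in others))[:k]
        filled = list(r)
        filled[j] = sum(c[j] for c in nearest) / k
        result.append(filled)
    return result

rows = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, None]]
print(knn_impute(rows, k=2))   # last row filled from its two nearest neighbours
```

Mean imputation is simple but shrinks the variance of the attribute; kNN imputation preserves local structure at the cost of a distance computation per missing value.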
Topics to be covered in next session (Session 4)
• Data Management
Thank you!!!
