G H Patel College of Engineering & Technology
Submitted By: -
Vivek Gandhi
(140110107017)
Gujarat Technological University
Submitted To: -
Ms. Manpreet Bagga
Data Mining
Handling Noisy Data
Noisy Data: -
•Noise: - Random error or variance in a measured variable;
in other words, meaningless data.
Incorrect attribute values may be due to: -
•Faulty data collection
•Data entry problems
•Data transmission problems
•Technology limitations
•Inconsistencies in naming conventions
How to Handle Noisy Data?
•Binning method
•Clustering
•Regression
Binning Method: -
• First sort the data and partition it into bins.
• Then smooth each bin by its mean, its median, or its boundaries.
•Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size (a uniform grid).
• If A and B are the lowest and highest values of the attribute, the
width of each interval is W = (B - A)/N.
• It is the most straightforward method.
• It does not handle skewed data well.
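The width formula W = (B - A)/N can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `equal_width_bins` is hypothetical, the input is the price list used in this deck's smoothing example, and the code assumes the highest value is strictly greater than the lowest.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low, otherwise width is 0
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The highest value would fall at index n_bins, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width)  # 10.0
print(bins)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
```

The bins hold 3, 3, and 6 values, which shows how equal-width intervals can end up unevenly filled when the data are not spread uniformly.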
Binning Method (continued): -
•Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same
number of samples.
• It gives good data scaling.
• Managing categorical attributes can be tricky.
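A minimal sketch of equal-depth partitioning, assuming the data fit in a Python list. The function name `equal_depth_bins` is hypothetical, and the choice to give any leftover values to the first bins is an assumption rather than something the slides specify.

```python
def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (approximately) equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # Assumption: when the count is not divisible, the first `extra` bins get one more value.
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```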
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (9, 22.75 and 29.25, rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
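The two smoothing steps above can be reproduced with a short Python sketch. The function names are hypothetical, the means are rounded to whole dollars to match the slide, and sending a value that is equally close to both boundaries to the lower one is an assumption (no such tie occurs in this data).

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to a whole number."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```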
Cluster Analysis: -
• Outliers may be detected by clustering, where similar values
are organized into groups, or clusters.
• Values that fall outside the set of clusters may be considered outliers.
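As a rough illustration of the idea, the sketch below clusters one-dimensional values by the gaps between sorted neighbours and flags clusters that are too small. This simple gap rule is a stand-in chosen for brevity, not the clustering method of the slides; the function names, the `max_gap` and `min_size` parameters, and the extra reading 95 are all hypothetical.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted values; start a new cluster when the gap to the previous value exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Return values that belong to clusters with fewer than min_size members."""
    return [v for c in cluster_by_gap(values, max_gap) if len(c) < min_size for v in c]


# The deck's price data plus one hypothetical extreme reading (95).
readings = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 95]
print(find_outliers(readings, max_gap=7, min_size=2))  # [95]
```

In practice a general-purpose clustering algorithm would take the place of the gap rule, but the principle is the same: values that do not join any sizeable cluster are candidates for noise.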
Example (binning): -
• Data: - 4, 8, 15, 21, 21, 24, 25, 28, 34
1. Partition into equi-depth bins
Bin1: - 4, 8, 15
Bin2: - 21, 21, 24
Bin3: - 25, 28, 34
2. Smooth by bin means
Bin1: - 9, 9, 9
Bin2: - 22, 22, 22
Bin3: - 29, 29, 29
Example (continued): -
3. Smooth by bin boundaries
Bin1: - 4, 4, 15
Bin2: - 21, 21, 24
Bin3: - 25, 25, 34
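The binning slides also mention smoothing by bin median, which this example does not show. A minimal sketch on the same three bins, using `median` from Python's standard `statistics` module (the function name `smooth_by_medians` is hypothetical):

```python
from statistics import median


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]


bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_medians(bins))
# [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
```

Compared with the mean, the median is less affected by a single extreme value inside a bin.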
Regression: -
• Data can be smoothed by fitting it to a function.
• Linear regression involves finding the “best” line to fit two attributes,
so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
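A minimal sketch of smoothing by simple linear regression, assuming ordinary least squares on two attributes x and y. The function names and the sample values are hypothetical and serve only to illustrate the idea; each noisy y value is replaced by the value the fitted line predicts for its x.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes the x values are not all equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y by the value predicted from its x by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```

Multiple linear regression follows the same pattern with more than one predictor attribute, usually solved with a linear algebra library rather than by hand.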
Editor's Notes

  • #4 Other data problems that require data cleaning: duplicate records, incomplete data, and inconsistent data.