Overview of DATA PREPROCESS..

Published in: Education, Technology
  1. • Why preprocess the data?
     • Data cleaning
     • Data integration and transformation
     • Data reduction
     • Discretization and concept hierarchy generation
  2. Why data preprocessing?
     • Data in the real world is dirty:
         • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
         • noisy: containing errors or outliers
         • inconsistent: containing discrepancies in codes or names
     • No quality data, no quality mining results:
         • Quality decisions must be based on quality data
         • A data warehouse needs consistent integration of quality data
  3. Data cleaning tasks
     • Fill in missing values
     • Identify outliers and smooth out noisy data
     • Correct inconsistent data
     • Typical problems: duplicate records, incomplete data, inconsistent data
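The first two cleaning tasks on slide 3 can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own method: filling gaps with the median and flagging outliers with the 1.5 × IQR rule are assumptions chosen for simplicity, and the function names and the `ages` sample are hypothetical.

```python
from statistics import median, quantiles


def fill_missing(values):
    """Replace None entries with the median of the observed values.

    The median is an assumed choice; the mean or a constant also works.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical sample: two missing entries and one suspicious value.
ages = [23, 25, None, 24, 26, 95, None, 22]
filled = fill_missing(ages)
suspects = iqr_outliers(filled)
print(filled)
print(suspects)
```

The median is used instead of the mean so that the extreme value does not distort the fill value before it has been identified.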
  4. Data integration
     • Combines data from multiple sources into a coherent store
     • Schema integration
         • Integrate metadata from different sources
         • Entity identification problem: identify real-world entities across multiple data sources
     • Detecting and resolving data value conflicts
         • For the same real-world entity, attribute values from different sources differ
         • Possible reasons: different representations or different scales, e.g., metric vs. British units
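The "metric vs. British units" conflict from slide 4 can be illustrated with a small sketch. Everything here is hypothetical: the two sources, their field names, the shared customer key (which sidesteps the entity identification problem), and the tolerance used to decide whether two values really disagree.

```python
LB_TO_KG = 0.45359237  # kilograms per international pound

# Two hypothetical sources describing the same customer with different
# attribute names and different units.
source_a = {"C001": {"name": "Alice", "weight_kg": 61.0}}
source_b = {"C001": {"cust_name": "Alice", "weight_lb": 134.5}}


def integrate(a, b, tolerance_kg=0.5):
    """Merge records that share a key, converting pounds to kilograms.

    A record is flagged as a conflict when the two weights still differ
    by more than tolerance_kg after conversion to a common unit.
    """
    merged = {}
    for key in a.keys() & b.keys():
        kg_a = a[key]["weight_kg"]
        kg_b = b[key]["weight_lb"] * LB_TO_KG
        merged[key] = {
            "name": a[key]["name"],
            "weight_kg": kg_a,
            "conflict": abs(kg_a - kg_b) > tolerance_kg,
        }
    return merged


store = integrate(source_a, source_b)
print(store)
```

Once both values are expressed in kilograms the apparent disagreement disappears, which is the point of resolving scale differences before comparing sources.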
  5. Data transformation
     • Smoothing: remove noise from data
     • Aggregation: summarization, data cube construction
     • Generalization: concept hierarchy climbing
     • Normalization: values scaled to fall within a small, specified range
         • min-max normalization
         • z-score normalization
         • normalization by decimal scaling
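The three normalization methods named on slide 5 can be written out as follows. This is a sketch under stated assumptions: the function names and sample values are hypothetical, the z-score uses the population standard deviation, and the input is assumed to contain at least two distinct values so that no division by zero occurs.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v|) < 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [12000, 73600, 98000]  # hypothetical sample
scaled = min_max(incomes)
standardized = z_score(incomes)
shifted = decimal_scaling([-986, 917])
print(scaled)
print(standardized)
print(shifted)
```

Min-max normalization preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value sets the range; z-score normalization is the usual alternative when the minimum and maximum are unknown or unreliable.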
  6. Data reduction
     • Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
     • Data reduction strategies
         • Data cube aggregation
         • Dimensionality reduction
         • Numerosity reduction
         • Discretization and concept hierarchy generation
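Of the strategies listed on slide 6, aggregation is the simplest to show in code. The sketch below rolls quarterly figures up to annual totals, which is one cell-level step of building a data cube. The sales figures and the function name are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
quarterly_sales = [
    (2008, "Q1", 224), (2008, "Q2", 408), (2008, "Q3", 350), (2008, "Q4", 586),
    (2009, "Q1", 300), (2009, "Q2", 420), (2009, "Q3", 380), (2009, "Q4", 600),
]


def aggregate_by_year(rows):
    """Sum the amounts per year, dropping the quarter dimension."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

Eight rows shrink to two, and any analysis that only needs yearly totals gives the same answer on the reduced data.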
  7. Discretization
     • Divide the range of a continuous attribute into intervals
     • Some classification algorithms only accept categorical attributes
     • Reduce data size by discretization
     • Prepare for further analysis
     • Reduce the number of values for a given continuous attribute by dividing its range into intervals; interval labels can then replace the actual data values
  8. • Binning
     • Histogram analysis
     • Clustering analysis
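Binning, the first method listed on slide 8, can be sketched in two common variants: equal-width (each interval spans the same range) and equal-depth (each interval holds roughly the same number of values), followed by smoothing by bin means. The function names and the `prices` sample are hypothetical, and the equal-width function assumes the data contain at least two distinct values.

```python
def equal_width_labels(values, n_bins):
    """Return, for each value, the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample
labels = equal_width_labels(prices, 3)
bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_means(bins)
print(labels)
print(bins)
print(smoothed)
```

The interval indices in `labels` are the kind of interval labels slide 7 describes as replacements for the actual data values, while `smoothed` shows binning used for noise reduction rather than for producing categorical values.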
