Data preprocessing

  1. Data Preprocessing
  2. Content
     • What is data preprocessing, and why preprocess the data?
     • Data cleaning
     • Data integration
     • Data transformation
     • Data reduction
     PAAS Group
  3. Data preprocessing is a data mining technique that transforms raw data into an understandable format.
  4. Why preprocess the data?
  5. Data Preprocessing
     • Data in the real world is:
       – incomplete: lacking values, certain attributes of interest, etc.
       – noisy: containing errors or outliers
       – inconsistent: lacking compatibility or agreement between two or more facts
     • No quality data, no quality mining results!
       – Quality decisions must be based on quality data
       – A data warehouse needs consistent integration of quality data
  6. Measures of Data Quality
     • Accuracy
     • Completeness
     • Consistency
     • Timeliness
     • Believability
     • Value added
     • Interpretability
     • Accessibility
  7. Data Preprocessing Techniques
     • Data cleaning
     • Data integration
     • Data transformation
     • Data reduction
  8. Major Tasks in Data Preprocessing
     • Data cleaning
       – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
     • Data integration
       – Integration of multiple databases, data cubes, or files
     • Data transformation
       – Normalization and aggregation
     • Data reduction
       – Obtains a representation that is reduced in volume but produces the same or similar analytical results
  9. Data Preprocessing
  10. Data Cleaning
  11. Data Cleaning
      “Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in real-world data.”
  12. Data Cleaning - Missing Values
      • Ignore the tuple
      • Fill in the missing value manually
      • Use a global constant
      • Use the attribute mean
      • Use the most probable value (decision tree, Bayesian formalism)
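A minimal sketch of three of the strategies above (ignoring the tuple, a global constant, and the attribute mean) in plain Python. The `records` list, the `income` attribute, and the sentinel `-1.0` are made up for illustration and do not come from the slides.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing "income" value.
records = [
    {"id": 1, "income": 40.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 60.0},
    {"id": 4, "income": None},
]

def ignore_tuples(rows, attr):
    """Drop every tuple whose value for `attr` is missing."""
    return [r for r in rows if r[attr] is not None]

def fill_constant(rows, attr, constant):
    """Replace missing values with one global constant (returns new dicts)."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace missing values with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    return fill_constant(rows, attr, mean(observed))

dropped = ignore_tuples(records, "income")
with_constant = fill_constant(records, "income", -1.0)
with_mean = fill_mean(records, "income")
```

Ignoring tuples shrinks the data set, while a global constant can look like a real value to a mining algorithm, so which strategy fits depends on how much data is missing and on the downstream method.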
  13. Data Cleaning - Noisy Data
      • Binning
      • Clustering
      • Combined computer and human inspection
      • Regression
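Binning can be sketched as follows, assuming the equal-frequency variant with smoothing by bin means (one of several possible binning schemes; the slides do not say which one is meant). The `prices` values and the bin size are illustrative only.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Illustrative data, not taken from the slides.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
# Bins: [4, 8, 15], [21, 21, 24], [25, 28, 34] -> means 9.0, 22.0, 29.0
```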
  14. Data Cleaning - Inconsistent Data
      • Manually, using external references
      • Knowledge engineering tools
  15. A Few Important Terms
      • Discrepancy detection
        – Human error
        – Data decay
        – Deliberate errors
      • Metadata
      • Unique rules
      • Null rules
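As a rough illustration of how unique rules and null rules support discrepancy detection, the sketch below checks a toy table against one rule of each kind. The function names, the `customers` rows, and the set of null markers are assumptions made for this example.

```python
def unique_rule_violations(rows, attr):
    """Return the values of `attr` that occur more than once.
    A unique rule says every value of the attribute must be distinct."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[attr]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates

def null_rule_violations(rows, attr, null_markers=(None, "", "?")):
    """Return the indexes of rows whose `attr` holds one of the assumed
    null markers. A null rule says how missing values are recorded and
    whether they are allowed; here they are treated as not allowed."""
    return [i for i, row in enumerate(rows) if row[attr] in null_markers]

# Hypothetical rows: "id" should be unique, "email" should not be null.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "c@example.com"},
    {"id": 3, "email": None},
]

duplicate_ids = unique_rule_violations(customers, "id")
missing_emails = null_rule_violations(customers, "email")
```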
  16. Data Integration
  17. Data Integration
      “Data integration means combining data from multiple sources into a coherent data store (data warehouse).”
  18. Data Integration - Issues
      • Entity identification problem
      • Redundancy
      • Tuple duplication
      • Detecting data value conflicts
  19. Data Transformation
  20. Data Transformation
      “Transforming or consolidating data into a form suitable for mining is known as data transformation.”
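Normalization, listed earlier among the major tasks, is one such transformation. The sketch below shows two common variants, min-max scaling and z-score standardization. The slides do not specify a method, so both choices and the `incomes` values are assumptions.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them to the lower bound by convention.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def z_score(values):
    """Center values on their mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Illustrative attribute values.
incomes = [20.0, 40.0, 60.0, 100.0]
scaled = min_max(incomes)
standardized = z_score(incomes)
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range. Z-scores are less affected by one extreme value but do not map into a fixed interval.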
  21. Handling Redundant Data in Data Integration
      • Redundant data often occur when multiple databases are integrated
        – The same attribute may have different names in different databases
        – One attribute may be a “derived” attribute in another table, e.g., annual revenue
  22. Handling Redundant Data in Data Integration
      • Redundant data can often be detected by correlation analysis
      • Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
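A sketch of correlation analysis for numeric attributes, assuming the Pearson correlation coefficient as the measure (the slides do not name one). The attribute values and the 0.9 cut-off are hypothetical. The annual revenue column is derived from the monthly one, so it should be flagged as redundant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes from two integrated sources.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # derived attribute
headcount = [3.0, 9.0, 4.0, 5.0]

THRESHOLD = 0.9  # assumed cut-off for "probably redundant"
r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, headcount)
redundant = abs(r_derived) >= THRESHOLD
```

A high correlation only suggests redundancy. Whether one attribute can safely be dropped still needs a look at what the attributes mean.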
  23. Data Transformation
      • Smoothing: remove noise from data
      • Aggregation: summarization, data cube construction
      • Generalization: concept hierarchy climbing
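Aggregation and generalization can be illustrated together. In the sketch below, a made-up city-to-country mapping stands in for a concept hierarchy, and per-quarter sales are summed into totals per location. All names and figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Lyon": "France", "Paris": "France", "Pune": "India"}

# Hypothetical fact rows: (city, quarter, amount).
sales = [
    ("Lyon", "Q1", 100), ("Lyon", "Q2", 150),
    ("Paris", "Q1", 200),
    ("Pune", "Q1", 50), ("Pune", "Q2", 70),
]

def generalize(rows, hierarchy):
    """Climb one level of the concept hierarchy: replace each city by its country."""
    return [(hierarchy[city], quarter, amount) for city, quarter, amount in rows]

def aggregate(rows):
    """Sum amounts per location, collapsing the quarter dimension."""
    totals = defaultdict(int)
    for location, _quarter, amount in rows:
        totals[location] += amount
    return dict(totals)

by_city = aggregate(sales)
by_country = aggregate(generalize(sales, CITY_TO_COUNTRY))
```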
  24. Data Reduction
  25. Data Reduction
      “Data reduction techniques are applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.”
  26. Data Reduction - Strategies
      • Data cube aggregation
      • Dimension reduction
      • Data compression
      • Numerosity reduction
      • Discretization and concept hierarchy generation
  27. Example of Decision Tree Induction
      Initial attribute set: {A1, A2, A3, A4, A5, A6}
      [Figure: decision tree with test nodes A4?, A6?, and A1?, and leaves labeled Class 1 and Class 2]
      Reduced attribute set: {A1, A4, A6}
  28. Histograms
      • A popular data reduction technique
      • Divide data into buckets and store the average (or sum) for each bucket
      • Can be constructed optimally in one dimension
      • Related to quantization problems
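A minimal sketch of the bucket idea, assuming simple equal-width buckets rather than the optimal construction mentioned above. Only the count and sum per bucket are kept, from which the average follows. The `values` list is illustrative.

```python
def equal_width_histogram(values, num_buckets):
    """Split the value range into equal-width buckets and keep only
    the count and sum of the values that fall into each bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(num_buckets)
    ]
    for v in values:
        # The maximum value lies on the last boundary, so clamp it into the
        # last bucket; if all values are equal (width 0), use bucket 0.
        index = min(int((v - lo) / width), num_buckets - 1) if width else 0
        buckets[index]["count"] += 1
        buckets[index]["sum"] += v
    return buckets

# Illustrative data: nine values reduced to three (count, sum) pairs.
values = [1, 3, 5, 12, 14, 18, 25, 30, 31]
histogram = equal_width_histogram(values, num_buckets=3)
averages = [b["sum"] / b["count"] for b in histogram if b["count"]]
```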
  29. Clustering
      • Partition the data set into clusters and store only the cluster representations
      • Can be very effective if the data is naturally clustered
      • Clustering can be hierarchical, with clusters stored in multidimensional index tree structures
  30. Sampling
      • Allows a large data set to be represented by a much smaller random sample (subset) of the data
      • Suppose a large data set, D, contains N tuples
      • Methods to reduce data set D:
        – Simple random sample without replacement (SRSWOR)
        – Simple random sample with replacement (SRSWR)
        – Cluster sample
        – Stratified sample
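The four sampling methods can be sketched with Python's standard `random` module. The data set `D`, the `age_group` stratum key, the cluster size, and the sampling fraction are all invented for this example. A seeded `random.Random` keeps the run repeatable.

```python
import random
from collections import defaultdict

def srswor(data, n, rng):
    """Simple random sample without replacement: no tuple is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: a tuple may be drawn again."""
    return rng.choices(data, k=n)

def cluster_sample(data, cluster_size, num_clusters, rng):
    """Group tuples into consecutive clusters (e.g. pages), then take
    every tuple from a random subset of the clusters."""
    clusters = [data[i:i + cluster_size] for i in range(0, len(data), cluster_size)]
    chosen = rng.sample(clusters, num_clusters)
    return [row for cluster in chosen for row in cluster]

def stratified_sample(data, key, fraction, rng):
    """Split tuples into strata by `key`, then draw the same fraction
    (at least one tuple) from every stratum without replacement."""
    strata = defaultdict(list)
    for row in data:
        strata[key(row)].append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample

rng = random.Random(0)  # fixed seed, chosen arbitrarily
# Hypothetical data set D with N = 20 tuples and a skewed stratum attribute.
D = [{"id": i, "age_group": "young" if i < 15 else "senior"} for i in range(20)]

wor = srswor(D, 5, rng)
wr = srswr(D, 5, rng)
clustered = cluster_sample(D, cluster_size=4, num_clusters=2, rng=rng)
stratified = stratified_sample(D, key=lambda r: r["age_group"], fraction=0.2, rng=rng)
```

Stratified sampling guarantees that the small "senior" group is represented, which a simple random sample of the same size might miss.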
  31. Sampling
      [Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR]
  32. Sampling
      [Figure: raw data and the resulting cluster/stratified sample]