
Data preprocessing



  1. Data Preprocessing, by S. Dinesh Babu, II MCA
  2. Definition • Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. • Data in the real world is dirty.
  3. Measures for data quality: a multidimensional view ◦ Accuracy: correct or wrong, accurate or not ◦ Completeness: not recorded, unavailable, … ◦ Consistency: some modified but some not, dangling, … ◦ Timeliness: timely update? ◦ Believability: how trustworthy is the data? ◦ Interpretability: how easily can the data be understood?
  4. Major Tasks in Data Preprocessing • Data Cleaning • Data Integration • Data Reduction • Data Transformation and Data Discretization
  5. Data Cleaning: Incomplete Data • Data is not always available, e.g., Age: " " • Missing data may be due to ◦ equipment malfunction ◦ data that was inconsistent with other recorded data and thus deleted ◦ data not entered due to misunderstanding ◦ certain data not being considered important at the time of entry
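The slides do not prescribe a remedy, but a minimal sketch of spotting and filling missing values with pandas might look like this; the column names and the median-fill strategy are illustrative assumptions, not part of the original slides:

```python
import numpy as np
import pandas as pd

# Hypothetical records: a blank string and NaN both stand for a missing age.
df = pd.DataFrame({"name": ["Asha", "Ravi", "Mala"],
                   "age": [" ", 34, np.nan]})

# Treat blank strings as missing, then coerce the column to numeric.
df["age"] = pd.to_numeric(df["age"].replace(r"^\s*$", np.nan, regex=True))

print(df["age"].isna().sum())                     # 2 missing entries
df["age"] = df["age"].fillna(df["age"].median())  # one common fill strategy
```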
  6. Noisy Data • Noisy data is unstructured or erroneous data. • It increases the amount of storage space required. • Causes: hardware failure, programming errors.
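The slide lists causes only; one standard remedy for noisy numeric values (smoothing, which reappears on slide 12) is binning. A sketch with illustrative prices, using equal-depth bins and smoothing by bin means:

```python
import numpy as np

# Hypothetical noisy attribute, sorted for equal-depth (equal-frequency) binning.
prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))

# Partition into 3 bins of 3 values and replace each value by its bin mean.
bins = prices.reshape(3, 3)
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```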
  7. Data Cleaning as a Process • Missing values, noise, and inconsistencies contribute to inaccurate data. • The first step in data cleaning as a process is discrepancy detection. • Discrepancies can be caused by several factors: ◦ poorly designed data entry forms ◦ human error in data entry
  8. The data should also be examined regarding: ◦ Unique rule: each attribute value must be different from all other attribute values. ◦ Consecutive rule: no missing values between the lowest and highest values of the attribute. ◦ Null rule: specifies the use of blanks, question marks, or special characters to indicate missing or unknown values.
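A sketch of how the three rules could be checked with pandas; the emp_id/dept columns and the choice of "?" and blank as null markers are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"emp_id": [101, 102, 102, 105],
                   "dept": ["sales", "?", "hr", ""]})

# Unique rule: every emp_id must differ from all others.
duplicates = df[df["emp_id"].duplicated(keep=False)]

# Consecutive rule: no gaps between the lowest and highest emp_id.
expected = set(range(df["emp_id"].min(), df["emp_id"].max() + 1))
missing_ids = expected - set(df["emp_id"])        # {103, 104}

# Null rule: blanks and question marks are the agreed markers for "unknown".
null_count = df["dept"].isin(["", "?"]).sum()     # 2

print(duplicates, missing_ids, null_count, sep="\n")
```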
  9. Data Integration • The merging of data from multiple data stores. • It can help avoid redundancies and inconsistencies. • It improves the accuracy and speed of the subsequent data mining process.
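A sketch of integrating two hypothetical data stores on a shared key with pandas, then dropping a redundant attribute so the same fact is not stored twice; all names and values here are illustrative:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Mala"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "cust_name": ["Asha", "Asha", "Mala"],  # redundant copy
                       "amount": [250, 120, 90]})

# Merge the stores on the shared key, then remove the redundant attribute.
merged = customers.merge(orders, on="cust_id", how="left")
merged = merged.drop(columns=["cust_name"])
print(merged)
```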
  10. Data Reduction • The goal is to obtain a reduced representation of the data set that is much smaller in volume. Strategies for data reduction include the following (a sketch of two of them follows the list): • Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube. • Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
  11. • Dimensionality reduction, where encoding mechanisms are used to reduce the data set size. • Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations, such as ◦ parametric models ◦ nonparametric methods such as clustering, sampling, and the use of histograms.
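A sketch of two of the listed strategies, attribute subset selection (here a crude filter that drops attributes with a single value) and numerosity reduction by simple random sampling; the columns, sizes, and thresholds are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50_000, 8_000, 1_000),
                   "age": rng.integers(18, 70, 1_000),
                   "country_code": 91})  # constant, hence irrelevant

# Attribute subset selection: drop attributes that carry no information.
reduced = df[[c for c in df.columns if df[c].nunique() > 1]]

# Numerosity reduction: keep a 10% simple random sample of the rows.
sample = reduced.sample(frac=0.10, random_state=0)
print(reduced.shape, sample.shape)  # (1000, 2) (100, 2)
```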
  12. Data Transformation • In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following: • Smoothing: remove noise from the data • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: attribute values scaled to fall within a small, specified range, e.g., min-max normalization
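Min-max normalization maps a value v of attribute A to v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min. A sketch with illustrative incomes, scaled to the range [0, 1]:

```python
import numpy as np

incomes = np.array([12_000, 73_600, 98_000], dtype=float)

# Min-max normalization to the target range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
v_min, v_max = incomes.min(), incomes.max()
normalized = (incomes - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(normalized.round(3))  # [0.    0.716 1.   ]
```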
  13. Data Discretization • Discretization: divide the range of a continuous attribute into intervals ◦ Interval labels can then be used to replace actual data values ◦ Reduces data size ◦ Split (top-down) vs. merge (bottom-up) approaches ◦ Discretization can be performed recursively on an attribute ◦ Prepares data for further analysis, e.g., classification
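A sketch of top-down (split) discretization into equal-width intervals with pandas; the age values and interval labels are illustrative:

```python
import pandas as pd

ages = pd.Series([6, 15, 23, 31, 44, 52, 67, 79])

# Split the continuous range into 3 equal-width intervals and replace
# each value with its interval label.
labels = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(labels.tolist())
# ['young', 'young', 'young', 'middle', 'middle', 'middle', 'senior', 'senior']
```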
  14. Three types of attributes ◦ Nominal: values from an unordered set, e.g., color, profession ◦ Ordinal: values from an ordered set, e.g., military or academic rank ◦ Numeric: integer or real values
