Data Preprocessing
By
S. Dinesh Babu
II MCA
Definition
 Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
 Data in the real world is dirty: it is often incomplete, noisy, and inconsistent.
Measures for data quality: a multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not,
dangling, …
◦ Timeliness: timely update?
◦ Believability: how trustworthy are the data?
◦ Interpretability: how easily the data can be
understood?
Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data
Discretization
Data Cleaning: Incomplete
 Data is not always available
 Ex: Age = ” ” (the value was never recorded)
 Missing data may be due to
◦ equipment malfunction
◦ inconsistent with other recorded data and thus deleted
◦ data not entered due to misunderstanding
◦ certain data may not be considered important at the
time of entry
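As an illustration, here is a minimal sketch of two common ways to handle missing values using pandas; the DataFrame and the Age column are hypothetical examples, not data from these slides.

import numpy as np
import pandas as pd

# Hypothetical records with a missing Age value
df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mala"],
                   "Age": [23, np.nan, 31]})

# Option 1: ignore the tuple (drop rows whose Age is missing)
df_dropped = df.dropna(subset=["Age"])

# Option 2: fill in the missing value with the attribute mean
df_filled = df.fillna({"Age": df["Age"].mean()})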
Noisy Data
 Noisy data is corrupted or meaningless data, often unstructured.
 It needlessly increases the amount of storage space required.
Causes:
◦ Hardware failure
◦ Programming errors
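One simple way to spot noisy numeric values is the interquartile-range (IQR) rule, a standard technique not named on these slides; the sketch below uses hypothetical sensor readings, one of which is clearly corrupted.

import pandas as pd

readings = pd.Series([10.1, 9.8, 10.3, 250.0, 10.0])  # 250.0 is likely noise

q1, q3 = readings.quantile([0.25, 0.75])
iqr = q3 - q1
# Flag values far outside the interquartile range as probable noise
noisy = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)
print(readings[noisy])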
Data Cleaning as a Process
 Missing values, noise, and inconsistencies contribute to
inaccurate data.
 The first step in data cleaning as a process is
discrepancy detection.
 Discrepancies can be caused by several factors:
◦ Poorly designed data entry forms
◦ Human error in data entry
The data should also be examined regarding:
o Unique rule:
Each attribute value must be different from all other attribute
values.
o Consecutive rule:
No missing values between the lowest and highest values of the
attribute.
o Null rule:
Specifies how blanks, question marks, or special characters
indicate the null condition.
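A minimal sketch of checking the three rules on a single attribute with pandas; the id values are hypothetical.

import pandas as pd

ids = pd.Series([101, 102, 104, 104, None])
known = ids.dropna().astype(int)

# Unique rule: report values that appear more than once
duplicates = known[known.duplicated()]

# Consecutive rule: report gaps between the lowest and highest values
gaps = set(range(known.min(), known.max() + 1)) - set(known)

# Null rule: count entries recorded as null (blanks, "?", etc. would be
# mapped to NaN first, per whatever convention the null rule specifies)
null_count = ids.isna().sum()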
Data Integration
 Data integration merges data from multiple data stores.
 It can help reduce and avoid redundancies and
inconsistencies.
 It improves the accuracy and speed of the subsequent
data mining process.
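For example, a minimal sketch of integrating two hypothetical data stores on a shared key with pandas:

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Anu", "Ravi"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [250, 90, 400]})

# Merge the two stores on the common key attribute
merged = customers.merge(orders, on="cust_id", how="inner")
# Redundant attributes (the same information under different names)
# would be detected and dropped after the merge.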
Data Reduction
 To obtain a reduced representation of the data set that is
much smaller in volume, yet produces the same (or almost the
same) analytical results.
Strategies for data reduction include the following:
 Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
 Attribute subset selection, where irrelevant, weakly
relevant, or redundant attributes or dimensions may be
detected and removed.
 Dimensionality reduction, where encoding mechanisms are
used to reduce the data set size.
 Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations, such as:
◦ Parametric models
◦ Nonparametric methods such as clustering, sampling,
and the use of histograms.
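A minimal sketch of two of these strategies with pandas and NumPy; the DataFrame and column names are hypothetical.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.random(10_000),
                   "irrelevant": rng.random(10_000)})

# Attribute subset selection: drop an attribute judged irrelevant
reduced = df.drop(columns=["irrelevant"])

# Numerosity reduction by sampling: keep a 10% random sample of the rows
sample = df.sample(frac=0.10, random_state=42)

# Numerosity reduction with a histogram: 20 bin counts stand in
# for the 10,000 raw values of x
counts, bin_edges = np.histogram(df["x"], bins=20)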
Data Transformation
 In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
Data transformation can involve the following:
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small,
specified range
◦ min-max normalization (see the sketch below)
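Min-max normalization maps a value v of attribute A to v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min. A minimal sketch with hypothetical values, scaling to the range [0, 1]:

import pandas as pd

values = pd.Series([200, 300, 400, 600, 1000])
new_min, new_max = 0.0, 1.0

normalized = ((values - values.min()) / (values.max() - values.min())
              * (new_max - new_min) + new_min)
# 200 -> 0.0, 1000 -> 1.0, the remaining values fall in between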
Data Discretization
 Discretization: Divide the range of a continuous attribute
into intervals
◦ Interval labels can then be used to replace actual data
values
◦ Reduces data size
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an
attribute
◦ Prepare for further analysis, e.g., classification
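A minimal sketch of top-down (splitting) discretization into equal-width intervals with pandas; the ages and interval labels are hypothetical.

import pandas as pd

ages = pd.Series([3, 17, 25, 38, 52, 71])

# Divide the continuous range into three equal-width intervals and
# replace each raw value with its interval label
labels = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])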
 Three types of attributes
◦ Nominal—values from an unordered set, e.g., color, profession
◦ Ordinal—values from an ordered set, e.g., military or academic
rank
◦ Numeric—quantitative values, e.g., integers or real numbers
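In pandas, the three attribute types can be represented explicitly; the columns and category values below are hypothetical.

import pandas as pd

df = pd.DataFrame({
    # Nominal: values from an unordered set
    "color": pd.Categorical(["red", "blue"]),
    # Ordinal: values from an ordered set
    "rank": pd.Categorical(["major", "captain"],
                           categories=["captain", "major"], ordered=True),
    # Numeric: integer or real numbers
    "salary": [52000.0, 61000.0],
})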
Thank You
