Data preprocessing is a technique used to prepare raw data for analysis by cleaning it, integrating data from multiple sources, reducing redundant features, and transforming the data. It involves techniques such as data cleaning to handle missing or inconsistent values, data integration to merge different data sources, data transformation such as normalization, and data reduction like aggregation or dimensionality reduction. Preprocessing resolves issues in raw data to prepare it for further analysis by techniques like data mining algorithms.
Data preprocessing techniques for data cleaning and transformation
1.
2. • Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
• Real-world data is often incomplete,
inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain
many errors.
3. • Data preprocessing is a proven method of
resolving such issues.
• Data preprocessing prepares raw data for
further processing.
• Data preprocessing is used database-driven
applications such as customer relationship
management and rule-based applications (like
neural networks).
4. Number of data preprocessing
techniques
•
•
•
•
Data cleaning
Data integration
Data transformation
Data reduction
5.
6. Data Preprocessing Techniques
• Data cleaning : can be applied to remove
noise and correct inconsistencies in the data.
• Data integration :merges data from multiple
sources into a coherent data store, such as a
data warehouse.
• Data transformations :such as
normalization, may be applied.
• Data reduction : can reduce the data size by
aggregating, eliminating redundant features,
or clustering ,for instance.
7. • routines work to “clean” the data by filling in
missing values, smoothing noisy data, identifying
or removing outliers, and resolving
inconsistencies.
• If users believe the data are dirty, they are
unlikely to trust the results of any data mining
that has been applied to it.
• Although most mining routines have some
procedures for dealing with incomplete or noisy
data, they are not always robust.
8. • Therefore, a useful preprocessing step is to
run your data through some data cleaning
routines.
9. • Include data from multiple sources in your
analysis.
• This would involve integrating multiple
databases, data cubes, or files, that is, data
integration.
• Yet some attributes representing a given
concept may have different names in different
databases, causing inconsistencies and
redundancies.
10. • Having a large amount of redundant data may
slow down or confuse the knowledge
discovery process.
• Clearly, in addition to data cleaning, steps
must be taken to help avoid redundancies
during data integration.
• Typically, data cleaning and data integration
are performed as a preprocessing step when
preparing the data for a data warehouse.
11. • Additional data cleaning can be performed to
detect and remove redundancies that may
have resulted from data integration.
12. • Getting back to your data, you have decided,
say, that you would like to use a distance
based mining algorithm for your analysis, such
as neural networks, nearest-neighbor
classifiers, or clustering.
• methods provide better results if the data to
be analyzed have been normalized, that is,
scaled to a specific range such as [0.0, 1.0].
13. • You soon realize that data transformation
operations, such as normalization and
aggregation, are additional data preprocessing
procedures that would contribute toward the
success of the mining process.
14. • Data reduction obtains a reduced
representation of the data set that is much
smaller in volume, yet produces the same (or
almost the same) analytical results.
• There are a number of strategies for data
reduction.
• These include data aggregation , attribute
subset selection , dimensionality reduction and
numerosity reduction.
15. DATA REDUCTION
• Data can also be “reduced” by generalization
with the use of concept hierarchies, where lowlevel concepts, such as city for customer location,
are replaced with higher-level concepts, such as
region or province or state.
• A concept hierarchy organizes the concepts into
varying levels of abstraction.
• Data discretization is a form of data reduction
that is very useful for the automatic generation of
concept hierarchies from numerical data.