2. DATA PRE-PROCESSING:
Data pre-processing is an often neglected but
important step in the data mining process.
If the data contain much irrelevant or redundant
information, or are noisy and unreliable,
then knowledge discovery during the training
phase becomes more difficult.
Data pre-processing includes cleaning,
normalization, transformation, feature extraction
and selection, etc.
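As a small illustration of the normalization step listed above, min-max scaling (one common normalization technique; the function name and sample values are illustrative) can be sketched as:

```python
def min_max_normalize(values):
    """Scale a list of numbers into the [0, 1] range (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero for constant data
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```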
3. Data Pre-processing Methods:
Raw data is highly susceptible to noise, missing
values, and inconsistency. The quality of data
affects the data mining results.
In order to help improve the quality of the data
and, consequently, of the mining results, raw data
is pre-processed so as to improve the efficiency
and ease of the mining process.
5. Data cleaning:
Data cleaning is the process of detecting
and correcting (or removing) corrupt or
inaccurate records from a record set, table, or
database. It refers to identifying incomplete,
incorrect, inaccurate, or irrelevant parts of the
data and then replacing, modifying, or
deleting the dirty or coarse data.
Data cleansing may be performed
interactively with data wrangling tools, or as
batch processing through scripting.
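A minimal batch-style cleaning script in the spirit of the description above might look as follows (the field names, required-field list, and age range are illustrative assumptions, not from the source):

```python
def clean_records(records, required=("name", "age")):
    """Keep only records whose required fields are present and valid."""
    cleaned = []
    for rec in records:
        # detect and remove incomplete records
        if any(rec.get(f) in (None, "") for f in required):
            continue
        # detect and remove inaccurate records (e.g. impossible ages)
        if not (0 <= rec["age"] <= 120):
            continue
        cleaned.append(rec)
    return cleaned

raw = [
    {"name": "Ann", "age": 34},
    {"name": "", "age": 28},     # incomplete: missing name
    {"name": "Bob", "age": -5},  # inaccurate: negative age
]
print(clean_records(raw))  # only Ann's record survives
```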
7. Data Integration:
Data integration primarily supports the
analytical processing of large data sets by
aligning, combining and presenting each data set
from organizational departments and external
remote sources to fulfill integrator objectives.
Data integration is generally implemented in
data warehouses (DW) through specialized
software that hosts large data repositories from
internal and external resources.
Data is extracted, amalgamated and presented
as a unified form.
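The extract-amalgamate-present idea can be sketched in miniature: two departmental data sets are aligned on a shared key and presented as one unified record per entity. The department names, field names, and key are assumptions for illustration:

```python
def integrate(sales, support):
    """Align two departmental data sets on a shared customer id and
    combine them into one unified record per customer."""
    unified = {}
    for rec in sales:
        unified.setdefault(rec["id"], {})["revenue"] = rec["revenue"]
    for rec in support:
        unified.setdefault(rec["id"], {})["tickets"] = rec["tickets"]
    return unified

sales = [{"id": 1, "revenue": 500}, {"id": 2, "revenue": 120}]
support = [{"id": 1, "tickets": 3}]
print(integrate(sales, support))
# {1: {'revenue': 500, 'tickets': 3}, 2: {'revenue': 120}}
```

A real data warehouse does this at much larger scale with specialized software, but the aligning-and-combining step is the same in principle.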
8. Data Transformation:
Data transformation is the process of converting
data or information from one format to another,
usually from the format of a source system into
the required format of a new destination system.
Data transformation involves the use of a
special program that is able to read the data's
original base language, determine the language
into which the data must be translated for it
to be usable by the new program or system, and
then transform that data.
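As a hedged sketch of converting data from one format to another, the following reads CSV (a hypothetical source format) and emits JSON (a hypothetical destination format) using Python's standard library:

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Transform CSV text (source format) into a JSON string (destination format)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

src = "name,age\nAnn,34\nBob,41\n"
print(csv_to_json(src))
# [{"name": "Ann", "age": "34"}, {"name": "Bob", "age": "41"}]
```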
9. Two key phases:
Data Mapping:
The assignment of elements from the source
base or system to the destination, capturing
all transformations that occur. This becomes more
complicated when there are complex
transformations, such as many-to-one or
one-to-many transformation rules.
Code Generation:
The creation of the actual transformation
program. The resulting data map specification is
used to create an executable program to run on
computer systems.
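The two phases can be sketched together: a declarative data-map specification (source field to destination field) is turned into an executable transformation function, which is code generation in miniature. The field names are illustrative assumptions:

```python
# Data mapping: a specification assigning source fields to destination fields.
FIELD_MAP = {"cust_name": "name", "cust_age": "age"}

def make_transformer(field_map):
    """Code generation in miniature: build an executable transform
    function from the declarative data-map specification."""
    def transform(record):
        return {dst: record[src] for src, dst in field_map.items()}
    return transform

transform = make_transformer(FIELD_MAP)
print(transform({"cust_name": "Ann", "cust_age": 34}))  # {'name': 'Ann', 'age': 34}
```

Real transformation tools emit or compile an actual program from the map; here a closure stands in for the generated executable.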
10. Data reduction:
Data reduction is the transformation of
numerical or alphabetical digital
information derived empirically
or experimentally into a corrected, ordered, and
simplified form.
When information is derived from instrument
readings there may also be a transformation
from analog to digital form.
When the data are already in digital form the
'reduction' of the data typically involves some
editing, scaling, encoding, sorting, collating, and
producing tabular summaries.
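A small sketch of reducing digital readings into a corrected, ordered, simplified form: the values are scaled, sorted, and collapsed into a tabular summary. The scale factor and summary fields are illustrative assumptions:

```python
from statistics import mean

def reduce_readings(readings, scale=1.0):
    """Reduce raw readings to an ordered, simplified summary:
    scale the values, sort them, and tabulate min/mean/max."""
    scaled = sorted(r * scale for r in readings)
    return {
        "n": len(scaled),
        "min": scaled[0],
        "max": scaled[-1],
        "mean": mean(scaled),
    }

print(reduce_readings([3.0, 1.0, 2.0], scale=10))
# {'n': 3, 'min': 10.0, 'max': 30.0, 'mean': 20.0}
```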
11. When the observations are discrete but the
underlying phenomenon is continuous
then smoothing and interpolation are often
needed. Often the data reduction is undertaken
in the presence of reading or measurement
errors.
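One common smoothing technique for discrete samples of a continuous phenomenon is a simple moving average, sketched below (the window size and sample values are illustrative):

```python
def moving_average(samples, window=3):
    """Smooth discrete samples with a simple moving average,
    damping reading/measurement errors such as the spike below."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]

# The erroneous spike at 9 is smoothed toward its neighbours.
print(moving_average([1, 2, 9, 2, 1]))
```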