2. Need, Objectives and Techniques of data preprocessing
Data Cleaning: Handling of Missing values and Noisy Data, Data cleaning as a process
Data Integration: Schema integration, Controlling redundancies using correlation.
Data Transformation: Smoothing, Aggregation, Generalization, Attribute construction,
Normalization
Data Reduction: Data Cube Aggregation; Attribute Subset Selection, Dimensionality Reduction,
Numerosity Reduction
Collected by : Dr. Dipali Meher
Agenda
3. Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format.
It is also an important step in data mining, as we cannot work with raw data.
The quality of the data should be checked before applying machine learning or data mining algorithms.
5. Need of Data Preprocessing
Preprocessing of data is mainly done to check data quality. Quality can be checked along the following dimensions.
•Real-world data are generally:
•Incomplete: Missing attribute values, missing certain attributes of importance, or containing only aggregate data
•Noisy: Containing errors or outliers
•Inconsistent: Containing discrepancies in codes or names
Accuracy: To check whether the data entered is correct or not.
Completeness: To check whether all required data is recorded and available.
Consistency: To check whether copies of the same data kept in different places match.
Interpretability: The understandability of the data.
Believability: The data should be trustworthy.
Timeliness: The data should be kept up to date.
6. Objectives of Data Preprocessing
To transform the raw data into an understandable format
To transform data into a usable format
To eliminate inconsistencies in data
To remove duplicates in data
To provide more accurate data for mining
To handle incorrect or missing values in data
To reduce the dimensionality of data
Accurate data gives accurate results
7. Data Preprocessing
Data Cleaning
Data Integration
Data Transformation: e.g. -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48
Data Reduction: e.g. attributes A1, A2, …, A200 reduced to A1, A2, …, A50
8. Data Cleaning: Handling of Missing values and Noisy Data
Data cleaning deals with two problems: missing data and noisy data.
Missing data: Missing data is the case wherein some attributes or attribute values are absent from the data set. This situation can be handled by either ignoring the values or filling in the missing values.
Noisy data: This is data with errors, or data which has no meaning at all. Such data can either lead to invalid results or create problems for the mining process itself. The problem of noisy data can be solved with binning, regression, and clustering.
9. Data cleaning as a process
Missing data:
Ignore the tuple
Fill in the missing values manually
Use a global constant to fill in the missing value
Use a measure of central tendency for the attribute (such as the mean or median) to fill in the missing value
Use the attribute mean or median for all samples belonging to the same class as the given tuple
Use the most probable value to fill in the missing value
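As a hedged sketch in plain Python (the function name is illustrative, not a library API), filling missing values with the attribute mean might look like:

```python
# Sketch: fill missing values (represented as None) with the attribute mean.
# fill_missing_with_mean is an illustrative name, not a library function.
def fill_missing_with_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(fill_missing_with_mean(ages))  # mean of 25, 30, 35 is 30.0
```

The same pattern works with the median or a per-class mean by changing how `mean` is computed.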
10. Data Integration
Data integration combines data from multiple homogeneous and heterogeneous sources into a coherent store.
Data integration may produce redundancies and inconsistencies in the resulting data set.
There are two important tasks in data integration:
1. Detecting and resolving data value and schema conflicts
2. Handling redundancy
11. Schema Integration
Integrate metadata from different sources
The same attribute or object may have different names in different databases.
E.g., cust_id is the same as cust_no or cno.
15. Data Transformation
(a) Smoothing: This is the process of removing noise and unnecessary data so as to improve the quality of the data.
(b) Aggregation: This is the process of collecting data from heterogeneous sources and summarizing it into a uniform format. This improves the quality of the data.
(c) Discretization: Large data sets are complex to handle. Discretization is the process of breaking up continuous data into small intervals; the resulting interval-based chunks preserve the ordering of the data and are supported by existing mining frameworks.
(d) Attribute construction: To improve the efficiency of the mining process, new attributes are constructed from the existing attributes.
(e) Generalization: This is the process of converting low-level attributes to high-level attributes using a concept hierarchy.
(f) Normalization: In normalization, attribute values are scaled so that they fall within a specified range.
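As a hedged sketch in plain Python (illustrative function name), decimal-scaling normalization reproduces the example shown on the Data Preprocessing overview slide (-2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48):

```python
# Sketch: decimal-scaling normalization.
# Each value is divided by 10^j, with the smallest j such that max(|v'|) <= 1.
def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / (10 ** j) > 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-2, 32, 100, 59, 48]))
# -> [-0.02, 0.32, 1.0, 0.59, 0.48]
```

Min-max normalization, v' = (v - min)/(max - min), is an alternative when a fixed [0, 1] range is required.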
18. Data Reduction
(a) Attribute Subset Selection: When data is collected from various sources, it may contain duplicate attributes, and some attributes are irrelevant. Attribute subset selection removes such redundant and unnecessary attributes from the data set, resulting in a smaller, improved data set.
(b) Data Cube Aggregation: In this reduction method, aggregation operations are applied to the selected data so as to obtain the data in a much simpler, summarized form.
(c) Numerosity Reduction: In this reduction method, the actual data is replaced with a smaller mathematical model of the data.
(d) Dimensionality Reduction: In this reduction method, data encoding or transformation is applied to obtain a reduced or compressed representation of the original data.
23. Data Discretization
Large data sets are complex to handle.
Discretization is the process of breaking up continuous data into small intervals.
The data size is thereby reduced, while the interval-based data keeps the sequence (order) of the original continuous values.
Every interval is given its own label, and these interval labels can later replace the actual data values.
These continuous chunks are supported by existing mining frameworks.
24. Data Discretization
1. Top-down Discretization: If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
2. Bottom-up Discretization: If the process starts by considering all of the continuous values as potential split points and removes some by merging neighboring values to form intervals, it is called bottom-up discretization or merging.
25. Concept Hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior).
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives.
26. Data Discretization Methods
Following are some data discretization methods for numeric data:
1. Binning: This is a top-down, unsupervised splitting technique based on a specified number of bins. The values of an attribute are grouped into a number of equal-width or equal-frequency bins, and the values in each bin are then smoothed using the bin mean or bin median. Applied recursively, this method can generate a concept hierarchy.
2. Histogram Analysis: A histogram distributes an attribute's observed values into disjoint subsets, often called buckets or bins.
3. Cluster Analysis: Cluster analysis is a common form of data discretization. A clustering algorithm is applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups.
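As a hedged sketch in plain Python (kmeans_1d and its naive seeding are illustrative assumptions, not a library API), cluster-based discretization of a numeric attribute might look like:

```python
# Sketch: discretize a numeric attribute with a simple 1-D k-means.
# The seeding scheme (evenly spaced sorted values) is a naive illustrative choice.
def kmeans_1d(values, k, iters=20):
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            # assign each value to its nearest cluster center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [sorted(c) for c in clusters if c]

# Each resulting cluster becomes one interval of the discretized attribute.
print(kmeans_1d([1, 2, 3, 20, 21, 22, 100, 101], k=3))
# -> [[1, 2, 3], [20, 21, 22], [100, 101]]
```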
27. Binning
The stored values are distributed into a number of buckets or bins, and each bin value is then replaced by the bin mean or median.
It is a top-down splitting technique based on a specified number of bins.
It is an unsupervised discretization technique because it does not use class information.
- Equal-width (distance) partitioning
- Equal-depth (frequency) partitioning
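As a hedged sketch in plain Python (illustrative function name; assumes the number of values divides evenly into the bins), equal-depth binning with smoothing by bin means might look like:

```python
# Sketch: equal-depth (frequency) binning, then smoothing by bin means.
# Assumes len(values) is a multiple of n_bins; leftovers would need handling.
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)
    depth = len(data) // n_bins
    bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]
    # every value in a bin is replaced by that bin's mean
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# -> [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

Replacing `sum(b) / len(b)` with the middle element would give smoothing by bin medians instead.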
28. Binning: Equal width (distance) partitioning
Divides the range into N intervals of equal size: a uniform grid.
If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N.
- Outliers may become dominant
- Skewed data may not be handled well
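As a hedged sketch in plain Python (illustrative function name), equal-width partitioning with W = (B - A)/N might look like:

```python
# Sketch: equal-width partitioning; returns the bin index (0..n-1) of each value.
def equal_width_bins(values, n):
    a, b = min(values), max(values)
    w = (b - a) / n  # W = (B - A) / N
    # clamp the maximum value into the last bin
    return [min(int((v - a) / w), n - 1) for v in values]

print(equal_width_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Note how a single large outlier would stretch W and push most values into the first bin, illustrating the weaknesses listed above.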
32. Cluster Analysis
Clustering can be used to generate a concept hierarchy for an attribute A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy.
In the top-down approach, an initial cluster may be further decomposed into several sub-clusters, forming a lower level of the hierarchy.
In the bottom-up approach, clusters may be repeatedly grouped with neighboring clusters to form higher-level concepts.
33. References
Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham and S. Sridhar, Pearson Education, ISBN 81-7758-785-4
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 81-312-0535-5