2. Need, Objectives and Techniques of data preprocessing
Data Cleaning: Handling of Missing values and Noisy Data, Data cleaning as a process
Data Integration: Schema integration, Controlling redundancies using correlation.
Data Transformation: Smoothing, Aggregation, Generalization, Attribute construction,
Normalization
Data Reduction: Data Cube Aggregation; Attribute Subset Selection, Dimensionality Reduction,
Numerosity Reduction
Collected by : Dr. Dipali Meher
Agenda
3. Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format.
It is also an important step in data mining, as we cannot work with raw data.
The quality of the data should be checked before applying machine learning or data mining algorithms.
5. Need of Data Preprocessing
Preprocessing of data is mainly done to check data quality. Quality can be checked along the following dimensions.
•Real-world data are generally:
•Incomplete: Missing attribute values, missing certain attributes of importance, or containing only aggregate data
•Noisy: Containing errors or outliers
•Inconsistent: Containing discrepancies in codes or names
Accuracy: To check whether the data entered is correct or not.
Completeness: To check whether all required data is recorded and available.
Consistency: To check whether copies of the same data kept in different places match.
Interpretability: The understandability of the data.
Believability: The data should be trustworthy.
Timeliness: The data should be kept up to date.
6. Objectives of Data Preprocessing
To transform the raw data into an understandable format
To transform data into a usable format
To eliminate inconsistencies in data
To remove duplicates in data
To provide more accurate data for mining
To handle incorrect or missing values in data
To reduce the dimensionality of data
Accurate data gives accurate results
7. Data Preprocessing
Data Cleaning
Data Integration
Data Transformation: e.g. -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48
Data Reduction: e.g. attributes A1, A2, …, A200 reduced to A1, A2, …, A50
8. Data Cleaning: Handling of Missing values and Noisy Data
Data cleaning deals with two problems: missing data and noisy data.
Missing data: Missing data is the case wherein some attributes or attribute values are absent from the data set. This situation can be handled by either ignoring the values or filling in the missing values.
Noisy data: This is data with errors, or data which has no meaning at all. Such data can either lead to invalid results or create problems for the mining process itself. The problem of noisy data can be solved with binning, regression, and clustering.
9. Data cleaning as a process
Missing data:
Ignore the tuple
Fill in the missing values manually
Use a global constant to fill in the missing value
Use a measure of central tendency for the attribute (such as the mean or median) to fill in the missing value
Use the attribute mean or median for all samples belonging to the same class as the given tuple
Use the most probable value to fill in the missing value
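As a hedged sketch in plain Python (the function name is illustrative, not a library API), filling missing values with the attribute mean might look like:

```python
# Sketch: fill missing values (represented as None) with the attribute mean.
# fill_missing_with_mean is an illustrative name, not a library function.
def fill_missing_with_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(fill_missing_with_mean(ages))  # mean of 25, 30, 35 is 30.0
```

The same pattern works with the median or a per-class mean by changing how `mean` is computed.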
10. Data Integration
Data integration combines data from multiple homogeneous and heterogeneous sources into a coherent store.
Data integration may produce redundancies and inconsistencies in the resulting data set.
There are two important tasks in data integration:
1. Detecting and resolving data value and schema conflicts
2. Handling redundancy
11. Schema Integration
Integrate metadata from different sources
The same attribute or object may have different names in different databases.
E.g., cust_id is the same as cust_no or cno.
15. Data Transformation
(a) Smoothing: This is the process of removing noise and unnecessary data so as to improve the quality of the data.
(b) Aggregation: This is the process of collecting data from heterogeneous sources and summarizing it into a uniform format. This improves the quality of the data.
(c) Discretization: Large data sets are complex to handle. Discretization is the process of breaking up continuous data into small intervals; the resulting interval-based chunks preserve the ordering of the data and are supported by existing mining frameworks.
(d) Attribute construction: To improve the efficiency of the mining process, new attributes are constructed from the existing attributes.
(e) Generalization: This is the process of converting low-level attributes to high-level attributes using a concept hierarchy.
(f) Normalization: In normalization, attribute values are scaled so that they fall within a specified range.
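As a hedged sketch in plain Python (illustrative function name), decimal-scaling normalization reproduces the example shown on the Data Preprocessing overview slide (-2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48):

```python
# Sketch: decimal-scaling normalization.
# Each value is divided by 10^j, with the smallest j such that max(|v'|) <= 1.
def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / (10 ** j) > 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-2, 32, 100, 59, 48]))
# -> [-0.02, 0.32, 1.0, 0.59, 0.48]
```

Min-max normalization, v' = (v - min)/(max - min), is an alternative when a fixed [0, 1] range is required.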
18. Data Reduction
(a) Attribute Subset Selection: When data is collected from various sources, it may contain duplicate attributes, and some attributes are irrelevant. Attribute subset selection removes such redundant and unnecessary attributes from the data set, resulting in a smaller, improved data set.
(b) Data Cube Aggregation: In this reduction method, aggregation operations are applied to the selected data so as to obtain the data in a much simpler, summarized form.
(c) Numerosity Reduction: In this reduction method, the actual data is replaced with a smaller mathematical model of the data.
(d) Dimensionality Reduction: In this reduction method, data encoding or transformation is applied to obtain a reduced or compressed representation of the original data.
23. Data Discretization
Large data sets are complex to handle.
Discretization is the process of breaking up continuous data into small intervals.
The data size is thereby reduced, while the interval-based data keeps the sequence (order) of the original continuous values.
Every interval is given its own label, and these interval labels can later replace the actual data values.
These continuous chunks are supported by existing mining frameworks.
24. Data Discretization
1. Top-down Discretization: If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
2. Bottom-up Discretization: If the process starts by considering all of the continuous values as potential split points and removes some by merging neighboring values to form intervals, it is called bottom-up discretization or merging.
25. Concept Hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior).
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives.
26. Data Discretization Methods
Following are some data discretization methods for numeric data:
1. Binning: This is a top-down, unsupervised splitting technique based on a specified number of bins. The values of an attribute are grouped into a number of equal-width or equal-frequency bins, and the values in each bin are then smoothed using the bin mean or bin median. Applied recursively, this method can generate a concept hierarchy.
2. Histogram Analysis: A histogram distributes an attribute's observed values into disjoint subsets, often called buckets or bins.
3. Cluster Analysis: Cluster analysis is a common form of data discretization. A clustering algorithm is applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups.
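As a hedged sketch in plain Python (kmeans_1d and its naive seeding are illustrative assumptions, not a library API), cluster-based discretization of a numeric attribute might look like:

```python
# Sketch: discretize a numeric attribute with a simple 1-D k-means.
# The seeding scheme (evenly spaced sorted values) is a naive illustrative choice.
def kmeans_1d(values, k, iters=20):
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            # assign each value to its nearest cluster center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [sorted(c) for c in clusters if c]

# Each resulting cluster becomes one interval of the discretized attribute.
print(kmeans_1d([1, 2, 3, 20, 21, 22, 100, 101], k=3))
# -> [[1, 2, 3], [20, 21, 22], [100, 101]]
```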
27. Binning
The stored values are distributed into a number of buckets or bins, and each bin value is then replaced by the bin mean or median.
It is a top-down splitting technique based on a specified number of bins.
It is an unsupervised discretization technique because it does not use class information.
- Equal-width (distance) partitioning
- Equal-depth (frequency) partitioning
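As a hedged sketch in plain Python (illustrative function name; assumes the number of values divides evenly into the bins), equal-depth binning with smoothing by bin means might look like:

```python
# Sketch: equal-depth (frequency) binning, then smoothing by bin means.
# Assumes len(values) is a multiple of n_bins; leftovers would need handling.
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)
    depth = len(data) // n_bins
    bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]
    # every value in a bin is replaced by that bin's mean
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# -> [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

Replacing `sum(b) / len(b)` with the middle element would give smoothing by bin medians instead.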
28. Binning: Equal width (distance) partitioning
Divides the range into N intervals of equal size: a uniform grid.
If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N.
- Outliers may become dominant
- Skewed data may not be handled well
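As a hedged sketch in plain Python (illustrative function name), equal-width partitioning with W = (B - A)/N might look like:

```python
# Sketch: equal-width partitioning; returns the bin index (0..n-1) of each value.
def equal_width_bins(values, n):
    a, b = min(values), max(values)
    w = (b - a) / n  # W = (B - A) / N
    # clamp the maximum value into the last bin
    return [min(int((v - a) / w), n - 1) for v in values]

print(equal_width_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Note how a single large outlier would stretch W and push most values into the first bin, illustrating the weaknesses listed above.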
32. Cluster Analysis
Clustering can be used to generate a concept hierarchy for an attribute A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy.
In the top-down approach, an initial cluster may be further decomposed into several sub-clusters, forming a lower level of the hierarchy.
In the bottom-up approach, clusters may be repeatedly grouped with neighboring clusters to form higher-level concepts.
33. References
Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham and S. Sridhar, Pearson Education, ISBN 81-7758-785-4
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 81-312-0535-5