Data preprocessing

  1. Data preprocessing
     • Need
     • Importance
     • Major tasks in data preprocessing
  2. Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Data in the real world is:
     • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
     • Noisy: containing errors or outliers
     • Inconsistent: containing discrepancies in codes or names
     No quality data, no quality mining results!
     • Quality decisions must be based on quality data
     • A data warehouse needs consistent integration of quality data
  3. To extract the required information from huge, incomplete, noisy, and inconsistent sets of data, it is necessary to use data preprocessing.
  4. Data in the real world is dirty:
     • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
       - e.g., occupation=“ ”
     • Noisy: containing errors or outliers
       - e.g., Salary=“-10”
     • Inconsistent: containing discrepancies in codes or names
       - e.g., Age=“42”, Birthday=“03/07/1997”
       - e.g., was rating “1, 2, 3”, now rating “A, B, C”
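
A minimal sketch of how such dirty records could be flagged in practice; the pandas DataFrame and its column names are hypothetical, chosen to mirror the examples above:

    import pandas as pd

    # Hypothetical records mirroring the dirty-data examples above.
    df = pd.DataFrame({
        "occupation": ["engineer", " ", "clerk"],
        "salary": [52000, -10, 61000],
    })

    incomplete = df["occupation"].str.strip() == ""   # occupation=" "
    noisy = df["salary"] < 0                          # Salary="-10"
    print(df[incomplete | noisy])                     # rows needing cleaning
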
  5. Major tasks in data preprocessing:
     • Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
     • Data integration: integration of multiple databases, data cubes, or files
     • Data transformation: normalization and aggregation
     • Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
     • Data discretization: part of data reduction, of particular importance for numerical data
  6. Data cleaning tasks:
     • Fill in missing values
     • Identify outliers and smooth out noisy data
     • Correct inconsistent data
  7. How to handle missing values:
     • Ignore the tuple
     • Fill in the missing value manually
     • Fill it in automatically with
       - a global constant, e.g., “unknown” (effectively a new class!)
       - the attribute mean for all samples belonging to the same class (smarter)
       - the most probable value: inference-based, such as a Bayesian formula or regression
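
A sketch of two of the automatic strategies above (global constant and same-class attribute mean), assuming a hypothetical pandas DataFrame whose column names are illustrative only:

    import pandas as pd

    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B"],
        "income": [30.0, None, 50.0, 70.0],
        "city":   ["Pune", None, "Delhi", "Delhi"],
    })

    # Global constant: replace a missing categorical value with "unknown".
    df["city"] = df["city"].fillna("unknown")

    # Same-class mean: fill income with the mean of tuples in the same class.
    df["income"] = df.groupby("class")["income"].transform(
        lambda s: s.fillna(s.mean()))
    print(df)
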
  8. How to handle noisy data:
     • Binning
       - first sort the data and partition it into (equal-frequency) bins
       - then smooth by bin means, bin medians, bin boundaries, etc.
     • Regression
       - smooth by fitting the data to regression functions
     • Clustering
       - detect and remove outliers
     • Combined computer and human inspection
       - detect suspicious values and have a human check them (e.g., deal with possible outliers)
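
A sketch of equal-frequency binning followed by smoothing by bin means, using pandas; the price values are a made-up example:

    import pandas as pd

    prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equal-frequency (equal-depth) partitioning into 3 bins.
    bins = pd.qcut(prices, q=3, labels=False)

    # Smoothing by bin means: each value is replaced by its bin's mean.
    smoothed = prices.groupby(bins).transform("mean")
    print(smoothed.tolist())
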
  9. Data integration:
     • Combines data from multiple sources into a coherent store
     • Schema integration: e.g., A.cust-id ≡ B.cust-#
       - integrate metadata from different sources
     • Entity identification problem:
       - identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
     • Detecting and resolving data value conflicts
       - for the same real-world entity, attribute values from different sources differ
       - possible reasons: different representations, different scales
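
A sketch of resolving a schema conflict like A.cust-id vs. B.cust-# in code: two hypothetical sources name the customer key differently, and a merge reconciles them (tables and column names are assumptions for illustration):

    import pandas as pd

    a = pd.DataFrame({"cust_id": [1, 2], "name": ["Bill Clinton", "Ada Lovelace"]})
    b = pd.DataFrame({"cust_no": [1, 2], "city": ["Delhi", "Pune"]})

    # Merge on the differently named keys, then drop the duplicate column.
    merged = pd.merge(a, b, left_on="cust_id", right_on="cust_no")
    merged = merged.drop(columns="cust_no")
    print(merged)
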
 10. Handling redundancy in data integration:
     • Redundant data often occur when integrating multiple databases
       - Object identification: the same attribute or object may have different names in different databases
       - Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
     • Redundant attributes may be detected by correlation analysis
     • Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
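
A sketch of correlation-based redundancy detection: a derived attribute (here, annual revenue assumed to be 12 × monthly revenue) shows up as a perfect correlation. The data and column names are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "monthly_revenue": [10.0, 20.0, 30.0, 40.0],
        "annual_revenue":  [120.0, 240.0, 360.0, 480.0],  # derivable attribute
        "employees":       [3, 9, 5, 12],
    })

    corr = df.corr()
    # A correlation near 1.0 flags the pair as likely redundant.
    print(corr.loc["monthly_revenue", "annual_revenue"])  # 1.0
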
 11. Data transformation:
     • Smoothing: remove noise from the data
     • Aggregation: summarization, data cube construction
     • Generalization: concept hierarchy climbing
     • Normalization: scale values to fall within a small, specified range
       - min-max normalization
       - z-score normalization
       - normalization by decimal scaling
     • Attribute/feature construction: new attributes constructed from the given ones
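
A sketch of the three normalization methods named above, written against a made-up value array (min-max maps to [0, 1]; z-score centers by the mean and scales by the standard deviation; decimal scaling divides by the smallest power of 10 that brings every value below 1):

    import numpy as np

    x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max normalization: (x - min) / (max - min)
    min_max = (x - x.min()) / (x.max() - x.min())

    # Z-score normalization: (x - mean) / std
    z_score = (x - x.mean()) / x.std()

    # Decimal scaling: divide by 10**j, the smallest j with max(|x|/10**j) < 1.
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    decimal = x / 10**j

    print(min_max, z_score, decimal, sep="\n")
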
 12. Why data reduction?
     • A database/data warehouse may store terabytes of data
     • Complex data analysis/mining may take a very long time to run on the complete data set
     Data reduction:
     • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
 13. Data reduction strategies:
     • Aggregation
     • Sampling
     • Dimensionality reduction
     • Feature subset selection
     • Feature creation
     • Discretization and binarization
     • Attribute transformation
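
A sketch of two of these strategies, sampling and dimensionality reduction, assuming scikit-learn is available; the data set is randomly generated for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    data = rng.normal(size=(10_000, 50))   # hypothetical large data set

    # Sampling: keep a random 1% of the rows without replacement.
    sample = data[rng.choice(len(data), size=100, replace=False)]

    # Dimensionality reduction: project 50 features onto 5 components.
    reduced = PCA(n_components=5).fit_transform(sample)
    print(reduced.shape)   # (100, 5)
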
 14. Discretization:
     • Three types of attributes:
       - Nominal: values from an unordered set
       - Ordinal: values from an ordered set
       - Continuous: real numbers
     • Discretization: divide the range of a continuous attribute into intervals
       - some classification algorithms only accept categorical attributes
       - reduces data size
       - prepares data for further analysis
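
A sketch of dividing a continuous attribute's range into intervals with pandas (equal-width intervals via cut; the age values and interval labels are hypothetical):

    import pandas as pd

    ages = pd.Series([5, 17, 23, 31, 46, 58, 70])

    # Divide the continuous range into 3 equal-width intervals,
    # turning a continuous attribute into an ordinal one.
    groups = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
    print(groups.tolist())
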
 15. Good data preparation is key to producing valid and reliable models!
 16. Reference: http://www.slideshare.net/dataminingtools/data-processing-5005815