Transcript of "Dmblog"

  1. Data Preprocessing. Presented by P.Veeralakshmi, M.C.A.
  2. Why Data Preprocessing?
     • Data in the real world is dirty.
     • Incomplete: incomplete data may come from human, hardware, or software problems, e.g., occupation=“ ”.
     • Noisy: noisy data may come from faulty data collection instruments, e.g., Salary=“-10”.
  3. Why Data Preprocessing? (cont.)
     • Inconsistent: a functional dependency is violated, e.g., Age=“42” but Birthday=“03/07/1997”.
  4. Major Tasks in Data Preprocessing
     • Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
     • Data integration: integration of multiple databases, data cubes, or files.
     • Data transformation: normalization and aggregation.
     • Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results.
     • Data discretization: part of data reduction, but of particular importance, especially for numerical data.
  5. Descriptive Data Summarization
     • Descriptive data summarization techniques can be used to identify which data values should be treated as noise or outliers.
     • Measures of central tendency include:
       1. mean
       2. median
       3. mode
       4. midrange
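
The four measures of central tendency listed on this slide can be sketched with Python's standard library. The function name `central_tendency` and the sample values are hypothetical, chosen only for illustration.

```python
import statistics

def central_tendency(values):
    """Return the four measures named on the slide for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # statistics.mode returns the most frequent value; the sample
        # below has a single mode, so the result is unambiguous.
        "mode": statistics.mode(values),
        # Midrange: average of the smallest and largest values.
        "midrange": (min(values) + max(values)) / 2,
    }

# Hypothetical salary figures (in thousands), not taken from the slides.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110]
summary = central_tendency(salaries)
print(summary)
```

In this sample the single large value (110) pulls the midrange well above the median, which is one way such summaries can hint at a possible outlier.
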
  6. Graphic Displays of Basic Descriptive Data Summaries
     • Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical data presentations, other displays of data summaries include:
       • histograms
       • quantile plots
       • q-q plots
       • scatter plots
       • loess curves
  7. Histogram
  8. Quantile plot
  9. Q-Q plot
  10. Loess curve
  11. Data Cleaning
     • Data cleaning (or data cleansing) routines attempt to:
       • fill in missing values
       • smooth out noise
       • identify outliers
       • correct inconsistencies
  12. Missing Values
     1. Ignore the tuple.
     2. Fill in the missing value manually.
     3. Use a global constant to fill in the missing value.
     4. Use the attribute mean to fill in the missing value.
     5. Use the attribute mean for all samples belonging to the same class as the given tuple.
     6. Use the most probable value to fill in the missing value.
     Method 6, however, is a popular strategy.
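
Methods 4 and 5 above (attribute mean, and attribute mean within the same class) can be sketched as follows. This is a minimal illustration, assuming missing values are represented as `None`; the function names, the `risk` and `income` keys, and the sample rows are all hypothetical.

```python
from statistics import mean

def fill_with_mean(values):
    """Method 4: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]

def fill_with_class_mean(rows, class_key, attr):
    """Method 5: replace None with the mean of attr within the same class.

    Assumes every class has at least one known value for attr.
    """
    by_class = {}
    for row in rows:
        if row[attr] is not None:
            by_class.setdefault(row[class_key], []).append(row[attr])
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if new_row[attr] is None:
            new_row[attr] = class_means[new_row[class_key]]
        filled.append(new_row)
    return filled

# Hypothetical customer rows, not taken from the slides.
customers = [
    {"risk": "low", "income": 40},
    {"risk": "low", "income": None},
    {"risk": "low", "income": 60},
    {"risk": "high", "income": 20},
    {"risk": "high", "income": None},
]
print(fill_with_mean([10, None, 30]))
print(fill_with_class_mean(customers, "risk", "income"))
```
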
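  13. Noisy Data
     1. Binning
       • The sorted values are distributed into a number of “buckets,” or bins. Example: Bin = 4, 8, 15.
       • Smoothing by bin means: each value in the bin is replaced by the bin mean, 9, giving Bin = 9, 9, 9.
       • Smoothing by bin boundaries: each value is replaced by the closest bin boundary, giving Bin = 4, 4, 15.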
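
The binning example on this slide can be reproduced with a short sketch. The first bin (4, 8, 15) comes from the slide; the remaining values in `prices` are assumed here only to show more than one bin, and the function names are hypothetical. Equal-frequency bins are assumed.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# 4, 8, 15 is the slide's bin; the other six values are illustrative.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins)[0])       # [9.0, 9.0, 9.0]
print(smooth_by_boundaries(bins)[0])  # [4, 4, 15]
```

A value exactly halfway between the two boundaries is assigned to the lower boundary in this sketch; that tie-breaking choice is an assumption.
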
  14. Noisy Data (cont.)
     2. Regression
       • Data can be smoothed by fitting the data to a function.
       • Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.
       • Multiple linear regression is an extension of linear regression.
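
Smoothing by linear regression can be sketched with an ordinary least-squares fit of one attribute on another. This is a minimal illustration in plain Python; the function names and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # raises ZeroDivisionError if all xs are equal
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_with_line(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy observations scattered around y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
print(fit_line(xs, ys))
print(smooth_with_line(xs, ys))
```
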
  15. Noisy Data (cont.)
     3. Clustering
       • Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”
       • The values that fall outside of the set of clusters may be considered outliers.
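
The idea can be illustrated on one-dimensional data with a deliberately simple stand-in for a real clustering algorithm: sorted values are grouped whenever neighbours lie within `max_gap` of each other, and groups smaller than `min_size` are reported as outliers. The grouping rule, both parameters, and the sample data are assumptions made for this sketch, not the method used in the slides.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def cluster_outliers(values, max_gap, min_size=2):
    """Values in clusters smaller than min_size are treated as outliers."""
    return [v for c in gap_clusters(values, max_gap) if len(c) < min_size for v in c]

# Hypothetical readings: two dense groups and one isolated value.
readings = [10, 11, 12, 50, 51, 52, 200]
print(gap_clusters(readings, max_gap=5))
print(cluster_outliers(readings, max_gap=5))  # [200]
```
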
  16. Some Rules
     • The data should also be examined regarding unique rules, consecutive rules, and null rules.
     • Unique rule: each value of the given attribute must be different from all other values of that attribute.
     • Consecutive rule: there can be no missing values between the lowest and highest values of the attribute.
     • Null rule: specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
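
The three rules can be checked with small helper functions. This is a sketch under stated assumptions: the consecutive rule is checked for integer-valued attributes only, and the set of null markers (`NULL_MARKERS`) is a hypothetical choice, since the slide does not say which strings a given data set uses.

```python
from collections import Counter

def unique_violations(values):
    """Unique rule: return the values that occur more than once."""
    return sorted(v for v, count in Counter(values).items() if count > 1)

def consecutive_gaps(values):
    """Consecutive rule: return integers missing between the min and max."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]

# Assumed markers for the null condition; adjust to the data set at hand.
NULL_MARKERS = {"", "?", "N/A"}

def is_null(value, markers=NULL_MARKERS):
    """Null rule: treat None and any assumed marker string as null."""
    return value is None or str(value).strip() in markers

# Hypothetical check numbers and field values.
print(unique_violations([101, 102, 102, 103]))
print(consecutive_gaps([101, 102, 104, 107]))
print(is_null(" ? "), is_null("42"))
```
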
  17. Data Integration
     • Data integration combines data from multiple sources into a coherent data store.
     • Data integration techniques:
       • schema integration
       • redundancy
       • correlation analysis
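
Correlation analysis is one way to spot redundant numeric attributes: a coefficient close to +1 or -1 suggests that one attribute can largely be derived from the other. The sketch below computes the Pearson correlation coefficient in plain Python; the function name and the sample attributes are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical attributes from two sources: monthly and annual salary.
monthly_salary = [2000, 2500, 3000, 4000]
annual_salary = [24000, 30000, 36000, 48000]
print(pearson(monthly_salary, annual_salary))
```
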
  18. Data Transformation
     • In data transformation, the data are transformed or consolidated into forms appropriate for mining.
     • Data transformation can involve the following:
       • Smoothing: removes noise from the data.
       • Aggregation: summary or aggregation operations are applied to the data. Example: daily sales data may be aggregated to compute monthly and annual totals.
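
The daily-sales example can be sketched as follows, assuming each record is a `(date, amount)` pair. The function name and the sample figures are hypothetical.

```python
from collections import defaultdict
from datetime import date

def aggregate_sales(daily_sales):
    """Sum (date, amount) pairs into monthly and annual totals."""
    monthly = defaultdict(float)
    annual = defaultdict(float)
    for day, amount in daily_sales:
        monthly[(day.year, day.month)] += amount
        annual[day.year] += amount
    return dict(monthly), dict(annual)

# Hypothetical daily sales records.
daily_sales = [
    (date(2023, 1, 5), 100.0),
    (date(2023, 1, 20), 50.0),
    (date(2023, 2, 1), 25.0),
    (date(2024, 1, 1), 10.0),
]
monthly_totals, annual_totals = aggregate_sales(daily_sales)
print(monthly_totals)
print(annual_totals)
```
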
  19. Data Transformation (cont.)
     • Generalization: low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies.
     • Normalization: the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
     • Attribute construction: new attributes are constructed from the given set of attributes and added to it.
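
One common way to scale an attribute into a small range is min-max normalization. The slide does not name a specific method, so this choice is an assumption; the function name and sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (high - low)
    return [new_min + (v - low) * scale for v in values]

# Hypothetical income values.
incomes = [12000, 54000, 98000]
print(min_max_normalize(incomes))             # range 0.0 to 1.0
print(min_max_normalize(incomes, -1.0, 1.0))  # range -1.0 to 1.0
```
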
  20. Data Reduction
     • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume.
     1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
     2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
  21. Data Reduction (cont.)
     3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
     4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations, such as:
       • clustering
       • sampling
       • histograms
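
Of the numerosity reduction options listed, sampling is the simplest to sketch. The example below draws a simple random sample without replacement; the function name, the `seed` parameter, and the sample rows are hypothetical, and other sampling schemes are equally valid.

```python
import random

def simple_random_sample(rows, n, seed=None):
    """Draw n rows without replacement; a seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

# Hypothetical data set of 1000 tuple identifiers reduced to 10.
tuples = list(range(1000))
reduced = simple_random_sample(tuples, 10, seed=42)
print(len(reduced))
```
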
  22. Data Discretization
     • Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
       • Binning
       • Histogram analysis
       • Entropy-based discretization
       • Interval merging by χ² analysis
       • Cluster analysis
       • Discretization by intuitive partitioning
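
Dividing an attribute's range into intervals can be sketched with equal-width binning, the simplest of the techniques listed. The function name, the choice of `k` intervals, and the sample ages are assumptions made for illustration.

```python
def equal_width_discretize(values, k):
    """Map each value to an interval index 0..k-1 using k equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / k
    if width == 0:
        # All values are identical, so they share a single interval.
        return [0 for _ in values]
    # The maximum value would land in index k, so clamp it to k - 1.
    return [min(int((v - low) / width), k - 1) for v in values]

# Hypothetical ages split into three equal-width intervals.
ages = [0, 5, 10, 15, 20, 30]
print(equal_width_discretize(ages, 3))  # [0, 0, 1, 1, 2, 2]
```

Each continuous value is replaced by an interval label, so the attribute now takes at most `k` distinct values.
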
  23. Thank you