Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of
interest, or containin...
 Data Cleaning / Data Cleansing
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve incon...
DESCRIPTIVE DATA SUMMARIZATION

Measuring the Central Tendency

Distributive Measure

Algebraic Measure

Holistic Measure
...
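The variance of N observations $x_1, x_2, \ldots, x_N$ is

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right]$

where $\bar{x}$ is the mean value of the observations.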
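
As a quick check of the variance formula, the sketch below computes the variance both from the definition and from the equivalent sum-of-squares form, then takes the square root to get the standard deviation. The function names and the sample values are illustrative assumptions, not part of the slides.

```python
import math


def variance(xs):
    """Population variance from the definition: mean squared deviation."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_shortcut(xs):
    """Same quantity from the running sums of x_i and x_i^2."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values (assumption)
sigma2 = variance(data)
sigma = math.sqrt(sigma2)  # standard deviation
print(sigma2, variance_shortcut(data), sigma)
```

The second form needs only the two sums, so it can be computed in a single pass over the data.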

The standard deviation is $\sigma$.
The $\sigma$ of the observations is the square root of t...
The basic properties of the standard deviation $\sigma$:
$\sigma$ measures spread about the mean and should be used only
when the mean...


Apart from bar charts, pie charts and line graphs, there
are other popular types of graphs for the display of data...
 Plotting histograms, or frequency histograms, is a graphical method
for summarizing the distribution of a given attribut...
[Figure: histogram example; legend: Series 1, Series 2, Series 3]
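
A histogram with uniform bucket width can be sketched in a few lines. The helper name `equal_width_histogram` and the sample values are assumptions made for illustration; the sketch also assumes the attribute has at least two distinct values.

```python
def equal_width_histogram(values, num_buckets):
    """Count how many values fall into each of num_buckets equal-width buckets."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value belongs to the last bucket, not to a new one.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts


values = [0, 2, 3, 5, 7, 8, 8, 11, 14, 20]  # illustrative values (assumption)
edges, counts = equal_width_histogram(values, 4)
for left, right, count in zip(edges, edges[1:], counts):
    # One text "rectangle" per bucket.
    print(f"[{left:5.1f}, {right:5.1f}) {'#' * count}")
```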
 It is a simple and effective way for a univariate data distribution
 First, it displays all of the data for the given a...
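
The points of a quantile plot can be computed as in the sketch below: the data are sorted, and the i-th smallest value is paired with $f_i = (i - 0.5)/N$. The function name and the sample values are illustrative assumptions.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


points = quantile_plot_points([40, 10, 30, 20])  # illustrative values (assumption)
for f, x in points:
    print(f, x)
```

Plotting f on the horizontal axis against x on the vertical axis shows every data value together with the approximate fraction of the data at or below it.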
 A quantile-quantile plot, or q-q plot graphs the quantiles of one

variate distribution against the corresponding quanti...
[Figure: chart with Series 1-3 over Category 1-4]
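
One way to compute the points of a q-q plot is sketched below. It assumes the convention $f_i = (i - 0.5)/N$ for placing sorted values, and linear interpolation when the two samples differ in size; both choices, and all names and sample values, are assumptions made for illustration.

```python
def quantile_at(sorted_xs, f):
    """Interpolated quantile, with the i-th sorted value placed at (i - 0.5) / N."""
    n = len(sorted_xs)
    pos = f * n - 0.5  # fractional zero-based index
    if pos <= 0:
        return sorted_xs[0]
    if pos >= n - 1:
        return sorted_xs[-1]
    lo = int(pos)
    frac = pos - lo
    return sorted_xs[lo] + frac * (sorted_xs[lo + 1] - sorted_xs[lo])


def qq_points(sample_a, sample_b):
    """Pair matching quantiles of two samples, using the smaller sample's size."""
    a, b = sorted(sample_a), sorted(sample_b)
    m = min(len(a), len(b))
    fs = [(i - 0.5) / m for i in range(1, m + 1)]
    return [(quantile_at(a, f), quantile_at(b, f)) for f in fs]


branch_1 = [40, 50, 60, 70]  # illustrative values (assumption)
branch_2 = [45, 55, 65, 75]
print(qq_points(branch_1, branch_2))
```

If the points lie above the line y = x, the second distribution is shifted toward higher values than the first.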
 A Scatter plot used for determining if there appears to be a

relationship, pattern, or trend between two numerical attr...
[Figures: three scatter plots of Y-Values]
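
A scatter plot shows a relationship visually; a numeric companion is a correlation coefficient. The slides do not name a specific coefficient, so the sketch below uses Pearson's r as an assumption. It requires both attributes to have non-zero spread, and the sample values are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]  # illustrative values (assumption)
ys = [2, 4, 6, 8]
print(pearson_r(xs, ys))        # close to +1: points rise along a line
print(pearson_r(xs, ys[::-1]))  # close to -1: points fall along a line
```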
 It is another graphic aid that adds a smooth curve to a scatter plot in order

to provide better perception of the patte...
[Figure: scatter plot of Y-Values]
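
A simplified loess fit can be sketched as follows. This is not a full loess implementation: it assumes the polynomial degree is fixed at 1 (local straight lines), uses tricube weights, and takes the smoothing parameter `alpha` as the fraction of points in each neighbourhood. The function name and sample values are illustrative.

```python
import math


def loess_fit(xs, ys, alpha=0.5):
    """Fitted value at each x from a weighted local linear regression."""
    n = len(xs)
    q = min(n, max(3, math.ceil(alpha * n)))  # neighbourhood size
    fitted = []
    for x0 in xs:
        # Indices of the q points closest to x0.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:q]
        d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
        sw = swx = swy = swxx = swxy = 0.0
        for i in nearest:
            # Tricube weight: 1 at x0, falling to 0 at the farthest neighbour.
            w = (1 - (abs(xs[i] - x0) / d_max) ** 3) ** 3
            sw += w
            swx += w * xs[i]
            swy += w * ys[i]
            swxx += w * xs[i] ** 2
            swxy += w * xs[i] * ys[i]
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate case: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


xs = list(range(8))  # illustrative values (assumption)
ys = [1.0, 3.2, 4.9, 7.1, 9.0, 10.8, 13.1, 15.0]
print([round(v, 2) for v in loess_fit(xs, ys, alpha=0.6)])
```

A larger `alpha` uses more neighbours per fit and gives a smoother curve; a smaller one follows the data more closely.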
 Descriptive data summaries provide valuable insight into the
overall behavior of our data
 By helping to identify noise...
 Data Cleaning or Data Cleansing routines attempt to fill in missing
values, smooth out noise while identifying out...
Many tuples have no recorded value for several attributes
Methods:

1. Ignore the tuple :
This is usually done when the cl...
4. Use the attribute mean to fill in the missing value:
Use the mean value to replace the missing value for particular
att...
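
Filling a missing value with the attribute mean, either over all tuples or over tuples of the same class, might look like the sketch below. The record layout, the function names, and the income figures are assumptions for illustration; the sketch also assumes every class has at least one known value.

```python
from collections import defaultdict


def fill_with_mean(rows, attr):
    """Replace missing (None) values of attr with the overall attribute mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean} if r[attr] is None else dict(r) for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values of attr with the mean over the tuple's own class."""
    groups = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            groups[r[class_attr]].append(r[attr])
    means = {c: sum(vals) / len(vals) for c, vals in groups.items()}
    return [
        {**r, attr: means[r[class_attr]]} if r[attr] is None else dict(r)
        for r in rows
    ]


customers = [  # illustrative records (assumption)
    {"credit_risk": "low", "income": 60000},
    {"credit_risk": "low", "income": None},
    {"credit_risk": "high", "income": 20000},
    {"credit_risk": "high", "income": 40000},
    {"credit_risk": "low", "income": 80000},
]
print(fill_with_mean(customers, "income")[1])
print(fill_with_class_mean(customers, "income", "credit_risk")[1])
```

Both functions return new records and leave the input untouched, so the original missing-value pattern stays available for inspection.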
 Methods 3 to 6 bias the data. The filled-in value may not be correct

 Method 6 is a popular strategy to predict missin...
 What is Noise?

 Noise is a random error or variance in a measured variable.
 Data Smoothing techniques:
Binning
Regre...
 Binning

first sort data and partition into (equal-frequency) bins then
one can smooth by bin means, smooth by bin media...
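
Equal-frequency binning with smoothing by bin means and by bin boundaries can be sketched as below. The function names and the sample values are illustrative assumptions; the last bin may be shorter when the number of values is not a multiple of the bin size.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the data and cut it into consecutive bins of bin_size values."""
    xs = sorted(values)
    return [xs[i:i + bin_size] for i in range(0, len(xs), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values (assumption)
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```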
Combines data from multiple sources into a coherent store
 Schema integration:
Integrate metadata from different sources
...
Redundant data occur often when integration of multiple databases
Object identification: The same attribute or object may ...
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy cli...


Min-max normalization: to [new_minA, new_maxA]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$

◦ Ex. Let inco...
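
The three normalization methods can be checked numerically with the sketch below. The income figures for min-max and z-score are the ones used in the slide transcript; the function names and the decimal-scaling sample values are assumptions for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


print(round(min_max(73600, 12000, 98000), 3))  # income example from the slides
print(z_score(73600, 54000, 16000))            # income example from the slides
print(decimal_scaling([-986, 917]))            # illustrative values (assumption)
```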
Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very...
 Data preparation or preprocessing is a big issue for both data
warehousing and data mining
 Descriptive data summarizat...
Data mining
Published in: Education
  1. (No text on this slide.)
  2. Data in the real world is dirty. Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation="". Noisy: containing errors or outliers, e.g., Salary="-10". Inconsistent: containing discrepancies in codes or names, e.g., Age="42" with Birthday="03/07/1997"; a rating that was "1, 2, 3" and is now "A, B, C"; a discrepancy between duplicate records.
  3. Data Cleaning / Data Cleansing: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Data Integration: integration of multiple databases, data cubes, or files. Data Transformation: normalization and aggregation. Data Reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
  4. DESCRIPTIVE DATA SUMMARIZATION. Measuring the Central Tendency: Distributive Measure, Algebraic Measure, Holistic Measure. Measuring the Dispersion of Data: Range, Quartiles, Outliers and Boxplots; Variance and Standard Deviation. Graphic Displays.
  5. The variance of N observations $x_1, x_2, \ldots, x_N$ is $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right]$
  6. where $\bar{x}$ is the mean value of the observations. The standard deviation $\sigma$ of the observations is the square root of the variance $\sigma^2$.
  7. The basic properties of the standard deviation $\sigma$: $\sigma$ measures spread about the mean and should be used only when the mean is chosen as the measure of center. $\sigma = 0$ only when there is no spread, that is, when all observations have the same value; otherwise $\sigma > 0$.
  8. Apart from bar charts, pie charts and line graphs, there are other popular types of graphs for the display of data summaries and distributions: histograms, quantile plots, q-q plots, scatter plots, and loess curves.
  9. Plotting histograms, or frequency histograms, is a graphical method for summarizing the distribution of a given attribute. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The width of each bucket is uniform. Each bucket is represented by a rectangle. The resulting graph is referred to as a bar chart.
  10. [Figure: histogram example; legend: Series 1, Series 2, Series 3]
  11. A quantile plot is a simple and effective way to look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information. The mechanism differs slightly from percentile computation: $f_i = (i - 0.5)/N$.
  12. (No text on this slide.)
  13. A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool that allows the user to view whether there is a shift in going from one distribution to another.
  14. [Figure: chart with Series 1-3 over Category 1-4]
  15. A scatter plot is used for determining if there appears to be a relationship, pattern, or trend between two numerical attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as a point in the plane. It is useful for viewing bivariate data to see clusters of points and outliers, or correlation relationships.
  16. [Figure: scatter plot of Y-Values]
  17. [Figure: scatter plot of Y-Values]
  18. [Figure: scatter plot of Y-Values]
  19. A loess curve is another graphic aid that adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence. The word loess is short for "local regression". Two values are needed: the smoothing parameter $\alpha$, and $\lambda$, the degree of the polynomials that are fitted by the regression.
  20. [Figure: scatter plot of Y-Values]
  21. Descriptive data summaries provide valuable insight into the overall behavior of our data. By helping to identify noise and outliers, they are especially useful for data cleaning.
  22. Data Cleaning or Data Cleansing routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Topics: handling missing values; data smoothing techniques; data cleaning as a process.
  23. Many tuples have no recorded value for several attributes. Methods: 1. Ignore the tuple: this is usually done when the class label is missing; it is not very effective. 2. Fill in the missing value manually: this is time-consuming and may not be feasible for a large data set. 3. Use a global constant to fill in the missing value: replace all missing attribute values by the same constant.
  24. 4. Use the attribute mean to fill in the missing value: use the mean value to replace the missing value for a particular attribute. 5. Use the attribute mean for all samples belonging to the same class as the given tuple: e.g., if classifying customers according to credit_risk, replace the missing value with the average income value for customers in the same credit-risk category as the given tuple. 6. Use the most probable value to fill in the missing value: this may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
  25. Methods 3 to 6 bias the data; the filled-in value may not be correct. Method 6 is a popular strategy to predict missing values, because it considers the values of the other attributes in its estimation of the missing value. In some cases, a missing value may not imply an error in the data, e.g., a phone number left NULL on an application form.
  26. What is noise? Noise is a random error or variance in a measured variable. Data smoothing techniques: binning, regression, clustering.
  27. Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc. Regression: smooth by fitting the data to regression functions. Clustering: detect and remove outliers.
  28. Data integration combines data from multiple sources into a coherent store. Schema integration: integrate metadata from different sources. Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton. Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different.
  29. Redundant data often occur when multiple databases are integrated. Object identification: the same attribute or object may have different names in different databases. Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue. Redundant attributes may be detected by correlation analysis. Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
  30. Smoothing: remove noise from data. Aggregation: summarization, data cube construction. Generalization: concept hierarchy climbing. Normalization: values scaled to fall within a small, specified range, by min-max normalization, z-score normalization, or normalization by decimal scaling.
  31. Min-max normalization to [new_minA, new_maxA]: $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$. Ex. Let income range from 12,000 to 98,000 dollars, normalized to [0.0, 1.0]. Then an income of 73,600 dollars is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$. Z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation): $v' = \frac{v - \mu_A}{\sigma_A}$. Ex. Let $\mu_A = 54{,}000$ and $\sigma_A = 16{,}000$. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$. Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$.
  32. Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis/mining may take a very long time to run on the complete data set. Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. Data reduction strategies: data cube aggregation; dimensionality reduction, e.g., remove unimportant attributes; data compression; numerosity reduction, e.g., fit data into models; discretization and concept hierarchy generation.
  33. Data preparation, or preprocessing, is a big issue for both data warehousing and data mining. Descriptive data summarization is needed for quality data preprocessing. Data preparation includes data cleaning and data integration; data reduction and feature selection; discretization. A lot of methods have been developed, but data preprocessing is still an active area of research.
  34. (No text on this slide.)