Dbm630 lecture04
1. DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 4
Data Mining Concepts
Data Preprocessing and Postprocessing
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. Topics
Data Mining vs. Machine Learning vs. Statistics
Instances with attributes and concepts (input)
Knowledge Representation (output)
Why do we need data preprocessing and postprocessing?
Engineering the input
Data cleaning
Data integration
Data transformation and data reduction
Engineering the output
Combining multiple models
3. Data Mining vs. Machine Learning
We are overwhelmed with electronic/recorded data; how can we discover knowledge from such data?
Data Mining (DM) is a process of discovering patterns in
data. The process must be automatic or semi-automatic.
Many techniques have been developed within a field
known as Machine Learning (ML).
DM is a practical topic that involves learning in a practical, not a theoretical, sense, whereas ML focuses on the theoretical side.
DM is for gaining knowledge, not just good prediction.
DM = ML + topic-oriented + knowledge-oriented
4. DM&ML vs. Statistics
DM = Statistics + Marketing
Machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses.
Statistics has been more concerned with testing hypotheses.
Very similar schemes have been developed in parallel in
machine learning and statistics, e.g., decision tree induction,
classification and regression tree, nearest-neighbor methods.
Most learning algorithms use statistical tests when constructing
rules or trees and for correcting models that are “overfitted” in
that they depend too strongly on the details of particular
examples used for building the model.
5. Generalization as Search
An aspect that distinguishes ML from statistical approaches is the search through a space of possible concept descriptions for one that fits the data.
Three properties that are important for characterizing a machine learning process are:
language bias: the concept description language, e.g., decision
tree, classification rule, association rules
search bias: the order in which the space is explored, e.g.,
greedy search, beam search
overfitting-avoidance bias: the way to avoid overfitting to the
particular training data, e.g., forward pruning or backward
pruning.
7. Input: Concepts, Instance & Attributes
Concept description
the thing that is to be learned (learning result)
hard to pin down precisely but
intelligible and operational
Instances (‘examples’, given as input)
Information that the learner is given
A single table vs. multiple tables (denormalization into a single table)
Denormalization sometimes produces apparent regularities, such as a supplier and the supplier's address always matching.
Attributes (features)
Each instance is characterized by a fixed, predefined set of features
or attributes
8. Input: Concepts, Instance & Attributes
Attribute types include ordinal, numeric, and nominal; concepts (the things to be predicted) may be numeric (play-time) or nominal (play). Each row is one instance (example); “?” marks a missing value.

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     85     87        True   Sony     85         Y
sunny     80     90        False  HP       90         Y
overcast  87     75        True   Ford     63         Y
rainy     70     95        True   Ford     5          N
rainy     75     65        False  HP       56         Y
sunny     90     94        True   ?        25         N
rainy     65     86        True   Nokia    5          N
overcast  88     92        True   Honda    86         Y
rainy     79     75        False  Ford     78         Y
overcast  85     88        True   Sony     74         Y
9. Independent vs. Dependent Instances
Normally, the input data are represented as a set of independent instances.
But there are many problems involving relationships between objects; that is, some instances depend on others.
Ex.: A family tree and the sister-of relation, under the closed-world assumption.
[Figure: family tree: Harry & Sally are parents of Steven, Bruce, and Demi; Richard & Julia are parents of Tison, Diana, and Bill; Tison & Demi are parents of Nina and Rica]

first person   second person   sister?
Steven         Demi            Y
Bruce          Demi            Y
Tison          Diana           Y
Bill           Diana           Y
Nina           Rica            Y
Rica           Nina            Y
all the rest                   N

(Listing every pair explicitly would also include negative rows such as Harry–Sally N, Harry–Steven N, Steven–Peter N, and Steven–Bruce N; the closed-world assumption lets “all the rest” default to N.)
10. Independent vs. Dependent Instances
The same data can be represented as flat tables: one row per person, plus the sister-of pairs to be learned.

name     gender  parent1  parent2
Harry    Male    ?        ?
Sally    Female  ?        ?
Richard  Male    ?        ?
Julia    Female  ?        ?
Steven   Male    Harry    Sally
Bruce    Male    Harry    Sally
Demi     Female  Harry    Sally
Tison    Male    Richard  Julia
Diana    Female  Richard  Julia
Bill     Male    Richard  Julia
Nina     Female  Tison    Demi
Rica     Female  Tison    Demi

first person   second person   sister?
Steven         Demi            Y
Bruce          Demi            Y
Tison          Diana           Y
Bill           Diana           Y
Nina           Rica            Y
Rica           Nina            Y
all the rest                   N

The relation itself can be expressed as the rule:
sister_of(X,Y) :- female(Y), parent(Z,X), parent(Z,Y).

Denormalization joins the two tables into a single one:

first person (gender, parent1, parent2)   second person (gender, parent1, parent2)   sister?
Steven (Male, Harry, Sally)               Demi (Female, Harry, Sally)                Y
Bruce (Male, Harry, Sally)                Demi (Female, Harry, Sally)                Y
Tison (Male, Richard, Julia)              Diana (Female, Richard, Julia)             Y
Bill (Male, Richard, Julia)               Diana (Female, Richard, Julia)             Y
Nina (Female, Tison, Demi)                Rica (Female, Tison, Demi)                 Y
Rica (Female, Tison, Demi)                Nina (Female, Tison, Demi)                 Y
all the rest                                                                         N
11. Problems of Denormalization
A large table with duplicate values included.
Relations among instances (rows) are ignored.
Some regularities in the data are merely reflections of the original
database structure but might be found by the data mining process,
e.g., supplier and supplier address.
Some relations are not finite, e.g., the ancestor-of relation. Inductive logic programming can use recursion to deal with these situations (the infinite number of possible instances):
If person1 is a parent of person2,
then person1 is an ancestor of person2.
If person1 is a parent of person2 and
person2 is an ancestor of person3,
then person1 is an ancestor of person3.
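As an illustrative sketch (not part of the original slides), the recursive definition can be evaluated over a finite parent relation; the pairs below are taken from the family tree of slide 9, and the function name is a hypothetical helper:

# Sketch: the two ancestor rules as recursion over a finite parent relation.
parent = {("Harry", "Steven"), ("Harry", "Bruce"), ("Harry", "Demi"),
          ("Richard", "Tison"), ("Richard", "Diana"), ("Richard", "Bill"),
          ("Tison", "Nina"), ("Tison", "Rica"),
          ("Demi", "Nina"), ("Demi", "Rica")}

def descendants(person):
    """Everyone of whom `person` is an ancestor, via the two rules above."""
    direct = {c for p, c in parent if p == person}   # base rule: parent
    result = set(direct)
    for child in direct:
        result |= descendants(child)                 # recursive rule
    return result

print(descendants("Harry"))  # Steven, Bruce, Demi, Nina, Rica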
12. Missing, Inaccurate, duplicated values
Many practical datasets may include three types of errors:
Missing values
frequently indicated by out-of-range entries (e.g., a negative number)
unknown vs. unrecorded vs. irrelevant values
Inaccurate values
typographical errors: misspelling, mistyping
measurement errors: errors generated by a measuring machine
intended errors: e.g., entering the zip code of the rental agency instead of the renter's zip code
Duplicated values
repetition of data gives such data more influence on the result.
13. Output: Knowledge Representation
There are many different ways for representing the
patterns that can be discovered by machine learning.
Some popular ones are:
Decision tables
Decision trees
Classification rules
Association rules
Rules with exceptions
Rules involving relations
Trees for numeric prediction
Instance-based representation
Clusters
14. Decision Tables
The simplest, most rudimentary way of representing the output from machine learning or data mining.
Ex.: A decision table for the weather data to decide whether or not to “play”:

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     hot    high      True   Sony     85         Y
sunny     hot    high      False  HP       90         Y
overcast  hot    normal    True   Ford     63         Y
rainy     mild   high      True   Ford     5          N
rainy     cool   low       False  HP       56         Y
sunny     hot    low       True   Sony     25         N
rainy     cool   normal    True   Nokia    5          N
overcast  mild   high      True   Honda    86         Y
rainy     mild   low       False  Ford     78         Y
overcast  hot    high      True   Sony     74         Y

Two issues arise: (1) how to make a smaller, condensed table with useless attributes omitted, and (2) how to cope with a case that does not appear in the table.
15. Decision Trees (1)
A “divide-and-conquer” approach to the problem of learning.
Ex.: A decision tree (DT) for the contact lens data to decide which type
of contact lens is suitable.
Tear production rate?
├─ reduced → none
└─ normal → astigmatism?
    ├─ no → soft
    └─ yes → spectacle prescription?
        ├─ myope → hard
        └─ hyperope → none
16. Decision Trees (2)
Nodes in a DT involve testing a particular attribute against a constant. However, it is also possible to compare two attributes with each other, or to use some function of one or more attributes.
If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible values of the attribute.
In this case, the same attribute will not be tested again
further down the tree.
In the case that the attribute values are divided into two subsets, the attribute might be tested more than once along a path.
17. Decision Trees (3)
If the attribute is numeric, the test at a node usually
determines whether its value is greater or less than a
predetermined constant.
If a missing value is treated as an attribute value, there will be a third branch.
Alternatively, a three-way split may be used: (1) less-than, equal-to, and greater-than, or (2) below, within, and above.
18. Classification Rules (1)
A popular alternative to decision trees; a rule set read in order is also called a decision list.
Ex.: If outlook = sunny and humidity = high then play = yes
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
The corresponding decision table:

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     hot    high      True   Sony     85         Y
sunny     hot    high      False  HP       90         Y
overcast  hot    normal    True   Ford     63         Y
rainy     mild   high      True   Ford     5          N
rainy     cool   low       False  HP       56         Y
sunny     hot    low       True   Sony     25         N
rainy     cool   normal    True   Nokia    5          N
overcast  mild   high      True   Honda    86         Y
rainy     mild   low       False  Ford     78         Y
overcast  hot    high      True   Sony     74         Y
19. Classification Rules (2)
A set of rules is interpreted in sequence.
The antecedent (or precondition) is a series of tests, while the consequent (or conclusion) gives the class or classes assigned to instances covered by the rule.
It is easy to read a set of rules directly off a decision tree, but the opposite direction is not quite so straightforward.
Ex.: the replicated subtree problem. The two rules
If a and b then x
If c and d then x
cannot be expressed as a single tree without repeating the subtree that tests c and d under each outcome of the tests on a and b.
[Figure: decision tree for the two rules, with the c/d subtree replicated]
20. Classification Rules (3)
One reason why classification rules are popular:
Each rule seems to represent an independent “nugget” of
knowledge.
New rules can be added to an existing rule set without disturbing those already there (in the DT case, it is necessary to reshape the whole tree).
If a rule set gives multiple classifications for a particular
example, one solution is to give no conclusion at all.
Another solution is to count how often each rule fires on the
training data and go with the most popular one.
Another problem occurs when an instance is encountered that the rules fail to classify at all.
Solutions: (1) fail to classify, or (2) choose the most popular class.
21. Classification Rules (4)
In a particularly straightforward situation, rules lead to a class that is boolean (y/n), and only rules leading to one outcome (say, yes) are expressed.
This is a form of the closed-world assumption.
The resulting rules cannot conflict, and there is no ambiguity in rule interpretation.
A set of rules can then be written as a logic expression in disjunctive normal form (a disjunction (OR) of conjunctive (AND) conditions).
22. Association Rules (1)
Association rules are really no different from classification rules
except that they can predict any attribute, not just the class.
This gives them the freedom to predict combinations of
attributes, too.
Association rules (ARs) are not intended to be used together as
a set, as classification rules are
Different ARs express different regularities that underlie the dataset, and they generally predict different things.
From even a small dataset, a large number of ARs can be
generated. Therefore, some constraints are needed for finding
useful rules. Two most popular ones are (1) support and (2)
confidence.
23. Association Rules (2)
For example, for a rule x ⇒ y: s = p(x,y), c = p(x,y)/p(x).
If temperature = hot then humidity = high (s = 3/10, c = 3/5)
If windy = true and play = Y then humidity = high and outlook = overcast (s = 2/10, c = 2/4)
If windy = true and play = Y and humidity = high then outlook = overcast (s = 2/10, c = 2/3)
outlook temp. humidity windy Sponsor play-time play
sunny hot high True Sony 85 Y
sunny hot high False HP 90 Y
overcast hot normal True Ford 63 Y
rainy mild high True Ford 5 N
rainy cool low False HP 56 Y
sunny hot low True Sony 25 N
rainy cool normal True Nokia 5 N
overcast mild high True Honda 86 Y
rainy mild low False Ford 78 Y
overcast hot high True Sony 74 Y
24. Rules with Exception (1)
For classification rules, incremental modifications can be
made to a rule set by expressing exceptions to existing
rules rather than by reengineering the entire set. Ex.:
If petal-length >= 2.45 and petal-length < 4.45
then Iris-versicolor
Sepal length  Sepal width  Petal length  Petal width  type
5.1           3.5          2.6           0.2          Iris-setosa   (a new case)

If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor
EXCEPT if petal-width < 1.0 then Iris-setosa

Of course, we can have exceptions to the exceptions, exceptions to these, and so on.
25. Rules with Exception (2)
Rules with exceptions can be used to represent the entire concept description in the
first place.
Ex.:
Default: Iris-setosa
  except if petal-length >= 2.45 and petal-length < 5.355
             and petal-width < 1.75
         then Iris-versicolor
           except if petal-length >= 4.95 and petal-width < 1.55
                  then Iris-virginica
           else if sepal-length < 4.95 and sepal-width >= 2.45
                  then Iris-virginica
  else if petal-length >= 3.35
         then Iris-virginica
           except if petal-length < 4.85 and sepal-length < 5.95
                  then Iris-versicolor
26. Rules with Exception (3)
Rules with exceptions can be shown to be logically equivalent to if-then-else statements.
Provided the user finds it plausible, an expression in terms of (common) rules and (rare) exceptions can be easier to grasp than the equivalent if-then-else structure.
27. Rules involving relations (1)
So far, the conditions in rules have involved testing an attribute value against a constant. Such rules are called propositional (they have the expressive power of propositional calculus).
However, there are situations where a more expressive form of rule would provide a more intuitive and concise concept description.
Ex.: the concept of standing up.
There are two classes: standing and lying.
The information given is the width, the height, and the number of sides of each block.
[Figure: assorted blocks, some labeled standing and some labeled lying]
28. Rules involving relations (2)
A propositional rule set produced for this data might be:
If width >= 3.5 and height < 7.0 then lying
If height >= 3.5 then standing
A rule set with relations, by contrast, is:
If width(b) > height(b) then lying
If height(b) > width(b) then standing

width  height  sides  class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       3      standing
9      1       4      lying
10     2       3      lying
29. Trees for numeric prediction
Instead of predicting categories, predicting numeric
quantities is also very important.
We can use a regression equation.
There are two further knowledge representations: the regression tree and the model tree.
Regression trees are decision trees with averaged numeric values at the leaves.
It is possible to combine regression equations with regression trees; the resulting model is a model tree, a tree whose leaves contain linear expressions.
31. Instance-based representation (1)
The simplest form of learning is plain memorization.
When a new instance is encountered, memory is searched for the training instance that most strongly resembles it.
This is a completely different way of representing the
“knowledge” extracted from a set of instances: just store
the instances themselves and operate by relating new
instances whose class is unknown to existing ones whose
class is known.
Instead of creating rules, work directly from the
examples themselves.
32. Instance-based representation (2)
Instance-based learning is lazy, deferring the real work as long
as possible.
Other methods are eager, producing a generalization as soon
as the data has been seen.
In instance-based learning, each new instance is compared with
existing ones using a distance metric, and the closest existing
instance is used to assign the class to the new one. This is also
called the nearest-neighbor classification method.
Sometimes more than one nearest neighbor is used, and the
majority class of the closest k neighbors is assigned to the new
instance. This is termed the k-nearest-neighbor method.
33. Instance-based representation (3)
When computing the distance between two examples, the
standard Euclidean distance may be used.
When nominal attributes are present, we may use the following
procedure.
A distance of 0 is assigned if the values are identical, otherwise the
distance is 1.
Some attributes will be more important than others, so some kind of attribute weighting is needed. Obtaining suitable attribute weights from the training set is a key problem.
It may not be necessary, or desirable, to store all the training instances:
to reduce the nearest-neighbor calculation time, and
to avoid unrealistic amounts of storage.
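The distance computation and the k-nearest-neighbor vote can be sketched as follows; the mixed numeric/nominal handling and per-attribute weights follow the description above, while the function names and data layout are assumptions:

import math

# Sketch: weighted distance over mixed numeric/nominal attributes,
# and a k-nearest-neighbor majority vote.
def distance(a, b, numeric_idx, weights):
    total = 0.0
    for i, w in enumerate(weights):
        if i in numeric_idx:
            total += w * (a[i] - b[i]) ** 2          # numeric: Euclidean term
        else:
            total += w * (0.0 if a[i] == b[i] else 1.0)  # nominal: 0 or 1
    return math.sqrt(total)

def knn_classify(query, training, numeric_idx, weights, k=3):
    # training is a list of (instance, class) pairs.
    nearest = sorted(training,
                     key=lambda t: distance(query, t[0], numeric_idx, weights))[:k]
    votes = [c for _, c in nearest]
    return max(set(votes), key=votes.count)          # majority class of k neighbors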
34. Instance-based representation (4)
Generally some regions of attribute space are more
stable with regard to class than others, and just a few
examples are needed inside stable regions.
An apparent drawback to instance-based representations is that they do not make explicit the structures that are learned.
35. Clusters
The output takes the form of a diagram that shows how the instances fall into clusters.
The simplest case involves associating a cluster number with each instance (Fig. a).
Some clustering algorithms allow one instance to belong to more than one cluster, shown as a Venn diagram (Fig. b).
Some algorithms associate instances with clusters probabilistically rather than categorically (Fig. c), e.g., a table of membership probabilities over clusters 1–3:

     1    2    3
a  0.4  0.3  0.3
b  0.6  0.3  0.1
c  0.1  0.4  0.5
d  0.5  0.2  0.3
e  0.6  0.3  0.1
f  0.4  0.1  0.5
g  0.1  0.4  0.5
h  0.2  0.7  0.1

Other algorithms produce a hierarchical structure of clusters, called a dendrogram (Fig. d).
Clustering may be combined with other learning methods for better performance.
[Figures a, b, d: cluster assignment, overlapping clusters as a Venn diagram, and a dendrogram over instances a–h]
36. Why Data Preprocessing? (1)
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. Ex: occupation = “”
noisy: containing errors or outliers. Ex: salary = “-10”
inconsistent: containing discrepancies in codes or names. Ex: Age = “42” but Birthday = “01/01/1997”; a rating was “1,2,3” but is now “A,B,C”
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data
37. Why Data Preprocessing? (2)
To integrate multiple sources of data into a more meaningful whole.
To transform data into a form that makes sense and is more descriptive.
To reduce the size, (1) in the cardinality aspect and/or (2) in the variety aspect, in order to improve computational time and accuracy.
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility.
38. Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation and data reduction
Normalization and aggregation
Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization: data reduction, especially for numerical
data
39. Forms of Data Preprocessing
Data Cleaning
Data Integration
Data Transformation
Data Reduction
40. Data Cleaning
Topics in Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Advanced techniques for automatic data cleaning
Improving decision trees
Robust regression
Detecting anomalies
41. Missing Data
Data is not always available
e.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
failure to register the history or changes of the data
Missing data may need to be inferred.
42. How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
The most popular strategy; it preserves the relationship between the missing attribute and the other attributes
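A sketch of the mean-based strategies, using the humidity column of the example table on the next slide (the row whose play label is “?” is excluded from the class-conditional mean, which matches the 86.9 and 86.4 figures shown there):

# Sketch: filling a missing humidity value with the attribute mean,
# or with the mean over instances of the same class (play = Y).
humidity = [87, 90, 75, 95, None, 94, 86, 92, 75, 88]   # None = missing
play     = ["Y", "Y", "?", "N", "Y", "N", "N", "Y", "Y", "Y"]

known = [h for h in humidity if h is not None]
attribute_mean = sum(known) / len(known)                 # ≈ 86.9

same_class = [h for h, p in zip(humidity, play)
              if h is not None and p == "Y"]
class_mean = sum(same_class) / len(same_class)           # 86.4

filled = [h if h is not None else class_mean for h in humidity]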
43. How to Handle Missing Data?
(Examples)

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     85     87        True   Sony     85         Y
sunny     80     90        False  HP       90         Y
overcast  87     75        True   Ford     63         ?
rainy     70     95        True   Ford     5          N
rainy     75     ?         False  HP       56         Y
sunny     90     94        True   ?        25         N
rainy     65     86        True   Nokia    5          N
overcast  88     92        True   Honda    86         Y
rainy     79     75        False  Ford     78         Y
overcast  85     88        ?      Sony     74         Y

(1) Ignore the tuple: drop the row whose class label (play) is missing.
(2) Fill in manually: e.g., check the missing windy value by hand.
(3) Use a global constant: fill the missing Sponsor with “unknown”.
(4) Attribute mean: fill the missing humidity with the mean of the known values, 86.9.
(5) Class-conditional mean: fill the missing humidity with the mean over play = Y instances, 86.4.
(6) Most probable value: predict the missing value with a Bayesian formula or a decision tree.
44. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
45. How to Handle Noisy Data
Binning method (Data smoothing):
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human
Regression
smooth by fitting the data into regression functions
46. Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
divides the range into N intervals of equal size (a uniform grid)
if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
the most straightforward approach, but outliers may dominate the presentation (since the lowest/highest values are used)
skewed (asymmetrical) data is not handled well
Equal-depth (frequency) partitioning:
divides the range into N intervals, each containing approximately the same number of samples
good data scaling, but managing categorical attributes can be tricky
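Both schemes in a short sketch (the price values are assumed for illustration):

# Sketch: equal-width vs. equal-depth partitioning of one numeric attribute.
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34])
N = 3

# Equal-width: interval width W = (B - A) / N over the attribute's range.
A, B = values[0], values[-1]
W = (B - A) / N
equal_width = [[v for v in values if A + i * W <= v < A + (i + 1) * W]
               for i in range(N)]
equal_width[-1].append(B)        # the maximum closes the last interval

# Equal-depth: each bin gets the same number of samples.
depth = len(values) // N
equal_depth = [values[i * depth:(i + 1) * depth] for i in range(N)]
print(equal_width)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 27, 29, 34]]
print(equal_depth)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 27, 29, 34]]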
47. Binning Methods for Data Smoothing
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34

Partition into (equi-depth) bins (3 bins of depth 4):
- Bin 1: 4, 8, 9, 15 (mean = 9, median = 8.5)
- Bin 2: 21, 21, 24, 25 (mean = 22.75, median = 22.5)
- Bin 3: 26, 27, 29, 34 (mean = 29, median = 28)

Smoothing by bin means (each value in a bin is replaced by the mean value of the bin):
- Bin 1: 9, 9, 9, 9
- Bin 2: 22.75, 22.75, 22.75, 22.75
- Bin 3: 29, 29, 29, 29

Smoothing by bin medians (similarly, each value is replaced by the bin median):
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28, 28, 28, 28

Smoothing by bin boundaries (the minimum and maximum values in a given bin are identified as the bin boundaries; each value is replaced by the closer boundary):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
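The same smoothing in code, as a sketch over the equi-depth bins above:

# Sketch: smoothing by bin means and by bin boundaries.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 27, 29, 34]]

# Every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
# [[9.0, 9.0, 9.0, 9.0], [22.75, ...], [29.0, ...]]

# Every value moves to the closer of the bin's minimum and maximum.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]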
48. Cluster Analysis
[Figure: cluster analysis: data points grouped into clusters, with outliers falling outside every cluster]
Clustering can detect, and allow removal of, outliers.
49. Regression
[Figure: scatter plot with fitted regression line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′]
Regression smooths data by fitting it to a regression function.
50. Automatic Data Cleaning
(Improving Decision Trees)
Improving decision trees: relearn the tree with misclassified instances removed, or prune away some subtrees.
Better strategy (of course): let human expert check
misclassified instances
When systematic noise is present, it is better not to modify the data.
Also, attribute noise should be left in the training set, while (unsystematic) class noise in the training set should be eliminated if possible.
51. Automatic Data Cleaning
(Robust Regression - I)
Statistical methods that address the problem of outliers are called robust.
Possible way of making regression more robust:
Minimize absolute error instead of squared error
Remove outliers (e.g., the 10% of points farthest from the regression plane)
Minimize the median instead of the mean of squares (copes with outliers in any direction)
Finds the narrowest strip covering half the observations
52. Automatic Data Cleaning
(Robust Regression - II)
[Figure: robust regression fits, including least-absolute-error and least-perpendicular-distance lines]
53. Automatic Data Cleaning
(Detecting Anomalies)
Visualization is the best way of detecting anomalies (but often can't be done)
Automatic approach:
committee of different learning schemes, e.g., decision tree, nearest-neighbor learner, and a linear discriminant function
Conservative approach: only delete instances which are
incorrectly classified by all of them
Problem: might sacrifice instances of small classes
54. Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., how to match A.cust-num with B.customer-id
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different
sources are different
possible reasons: different representations, different scales,
e.g., metric vs. British units
55. Handling Redundant Data in Data
Integration
Redundant data occur often when multiple databases are integrated:
The same attribute may have different names in different databases.
One attribute may be a “derived” attribute in another table, e.g., annual revenue.
Some redundancies can be detected by correlation analysis. The correlation between attributes A and B is
$R_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B}$
where $\bar{A}$, $\bar{B}$ are the means and $\sigma_A$, $\sigma_B$ the standard deviations, with
$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{n \sum x^2 - (\sum x)^2}{n(n-1)}$
If $R_{A,B} > 0$, A and B are positively correlated; if $R_{A,B} = 0$, A and B are independent; if $R_{A,B} < 0$, A and B are negatively correlated.
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
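A direct transcription of the formula, as a sketch:

import math

# Sketch: sample correlation R_{A,B} for redundancy detection;
# a and b are equal-length lists of attribute values.
def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cov / ((n - 1) * sd_a * sd_b)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: fully redundant attribute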
56. Data Transformation and Data Reduction
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified
range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
57. Data Transformation: Normalization
min-max normalization:
$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$
z-score normalization:
$v' = \frac{v - \bar{A}}{\sigma_A}$, with the mean $\bar{A}$ and standard deviation $\sigma_A$ computed as on the previous slide
normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
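A sketch of the three normalizations (the 12,000–98,000 income range in the usage line is an assumed example):

# Sketch: min-max, z-score, and decimal-scaling normalization.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Maps [vmin, vmax] linearly onto [new_min, new_max].
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    # Divide by 10^j for the smallest j that brings all values below 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73600, 12000, 98000))   # ≈ 0.716
print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]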
58. Data Reduction
Data Reduction Strategies
A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
Data reduction strategies
Data cube aggregation (reduce rows)
Dimensionality reduction (reduce columns)
Numerosity reduction (reduce columns or values)
Discretization / Concept hierarchy generation (reduce values)
59. Three Types of Data Reduction
Three types of data reduction:
Reduce the number of columns (features or attributes)
Reduce the number of rows (cases, examples, or instances)
Reduce the number of values in a column (numeric/nominal)

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     85     87        True   Sony     85         Y
sunny     80     90        False  HP       90         Y
overcast  87     75        True   Ford     63         Y
rainy     70     95        True   Ford     5          N
rainy     75     65        False  HP       56         Y

(Columns correspond to attributes, rows to instances, and cells to values.)
60. Data Cube Aggregation
Ex.: You are interested in annual sales rather than the total per quarter; the data can be aggregated so that the resulting data summarize total sales per year instead of per quarter.
The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
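A minimal aggregation sketch with hypothetical quarterly figures:

# Sketch: rolling quarterly sales up to annual sales.
quarterly = {("2010", "Q1"): 224, ("2010", "Q2"): 408,
             ("2010", "Q3"): 350, ("2010", "Q4"): 586,
             ("2011", "Q1"): 262, ("2011", "Q2"): 480}

annual = {}
for (year, _), sales in quarterly.items():
    annual[year] = annual.get(year, 0) + sales
print(annual)   # {'2010': 1568, '2011': 742}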
61. Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution given
the values of all features
reduces the number of patterns and makes them easier to understand
Heuristic methods (due to exponential number of
choices):
decision-tree induction (wrapper approach)
independent assessment (filter method)
step-wise forward selection
step-wise backward elimination
combining forward selection+backward elimination
62. Decision Tree Induction
(Wrapper Approach)
Initial attribute set: {A1, A2, A3, A4, A5, A6}

A4?
├─ A1? → Class 1 / Class 2
└─ A6? → Class 1 / Class 2

Reduced attribute set: {A1, A4, A6}
63. Numerosity Reduction
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Log-linear models: obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces (estimate the probability of each cell in a larger cuboid from the smaller cuboids)
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
64. Regression
Linear regression: Y = a + bX
The two parameters, a and b, specify the line and are estimated using the data at hand,
by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = a + b1X1 + b2X2.
Many nonlinear functions can be transformed into the
above.
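The least-squares estimates have a closed form; a sketch with assumed data points:

# Sketch: least-squares estimates for Y = a + bX.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))   # (1.0, 2.0): the line Y = 1 + 2X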
65. Histograms
A popular data reduction technique
Divide data into buckets and store average (or sum) for
each bucket
Related to quantization problems.
[Figure: histogram of price values with equal-width buckets from 10,000 to 90,000]
66. Clustering
Partition data set into clusters, and one can store
cluster representation only
Can be very effective if the data is clustered, but not if the data is “smeared” (dirty)
Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.
67. Sampling
Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew (bias)
Develop adaptive sampling methods
Stratified sampling:
approximate the percentage of each class (or subpopulation of interest) in the overall database
used in conjunction with skewed (biased) data
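A sketch of stratified sampling that keeps each class's share of the data (the function name and data layout are assumptions):

import random

# Sketch: sample the same fraction from every class's stratum.
def stratified_sample(instances, labels, fraction, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for inst, lab in zip(instances, labels):
        by_class.setdefault(lab, []).append(inst)
    sample = []
    for group in by_class.values():
        k = max(1, round(fraction * len(group)))   # keep at least one per class
        sample.extend(rng.sample(group, k))
    return sample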
68. Sampling
[Figure: simple random sampling from the raw data]
69. Sampling
[Figure: raw data contrasted with a cluster/stratified sample]
70. Discretization and concept hierarchy generation
Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Prepare for further analysis
71. Discretization and Concept hierarchy
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual
data values.
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
72. Discretization and Concept hierarchy generation
- numeric data
Binning (see sections before)
Histogram analysis (see sections before)
Clustering analysis (see sections before)
Entropy-based discretization
Keywords:
Supervised discretization
Entropy-based discretization
Unsupervised discretization
Clustering, Binning, Histogram
73. Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy after
partitioning is
info(S,T) = (|S1|/|S|) × info(S1) + (|S2|/|S|) × info(S2)
The boundary that minimizes the entropy function over
all possible boundaries is selected as a binary
discretization.
The process is recursively applied to partitions obtained
until some stopping criterion is met, e.g.,
info(S) - info(S,T) < threshold
Experiments show that it may reduce data size and
improve classification accuracy
74. Entropy-Based Discretization
Ex.: the temperature attribute of the weather data:
64  65  68  69  70  71  72   75   80  81  83  85
y   n   y   y   y   n   y/n  y/y  n   y   y   n
$info(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$
For the split Temp = 71.5:
$info([4,2],[5,3]) = \frac{6}{14}\,info([4,2]) + \frac{8}{14}\,info([5,3]) = 0.939$ bits
versus $info([9,5]) = 0.940$ bits for the unsplit set.
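The numbers above can be reproduced with a short sketch:

import math

# Sketch: entropy of a class distribution, e.g., info([9, 5]).
def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Weighted entropy after splitting into two intervals.
def split_info(c1, c2):
    n = sum(c1) + sum(c2)
    return sum(c1) / n * info(c1) + sum(c2) / n * info(c2)

print(split_info([4, 2], [5, 3]))   # ≈ 0.939 bits
print(info([9, 5]))                 # ≈ 0.940 bits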
75. Specification of a set of attributes (Concept
hierarchy generation)
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy:
country: 15 distinct values
province_or_state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
76. Why Postprocessing?
To improve the acquired model (the mined knowledge).
Techniques to combine several mining approaches to
find better results
[Figure: the same input data fed to Method 1, Method 2, ..., Method N; the N outputs are combined into a single result]
77. Combining Multiple Models Engineering the Output
(Overview)
Basic idea of “meta” learning schemes: build
different “experts” and let them vote
Advantage: often improves predictive performance
Disadvantage: produces output that is very hard to analyze
Schemes we will discuss are bagging, boosting and
stacking (or stacked generalization)
These approaches can be applied to both numeric prediction and nominal classification
78. Combining Multiple Models
(Bagging - general)
Employs the simplest way of combining predictions: voting/averaging
Each model receives equal weight
“Idealized” version of bagging:
Sample several training sets of size n (instead of just having one training set of size n)
Build a classifier for each training set
Combine the classifiers' predictions
This improves performance in almost all cases if the learning scheme is unstable (e.g., decision trees)
79. Combining Multiple Models
(Bagging - algorithm)
Model generation
Let n be the number of instances in the training data.
For each of t iterations:
  Sample n instances with replacement from the training set.
  Apply the learning algorithm to the sample.
  Store the resulting model.
Classification
For each of the t models:
  Predict the class of the instance using the model.
Return the class that has been predicted most often.
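A sketch of the algorithm; "learn" stands for any assumed base learner that maps a training sample to a model, and a model maps an instance to a class:

import random

# Sketch: bagging with t bootstrap samples and a majority vote.
def bag(train, learn, t=10, seed=0):
    rng = random.Random(seed)
    n = len(train)
    models = []
    for _ in range(t):
        sample = [rng.choice(train) for _ in range(n)]   # n with replacement
        models.append(learn(sample))
    def predict(instance):
        votes = [model(instance) for model in models]
        return max(set(votes), key=votes.count)          # most-often-predicted class
    return predict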
80. Combining Multiple Models
(Boosting - general)
Also uses voting/averaging, but models are weighted according to their performance
Iterative procedure: new models are influenced by
performance of previously built ones
New model is encouraged to become expert for instances
classified incorrectly by earlier models
Intuitive justification: models should be experts that
complement each other
(There are several variants of this algorithm)
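One concrete variant is AdaBoost-style reweighting; the sketch below assumes a weight-aware base learner learn(train, labels, weights) and class labels in {-1, +1}:

import math

# Sketch: boosting by reweighting (an AdaBoost-flavored variant).
def boost(train, labels, learn, rounds=10):
    n = len(train)
    w = [1.0 / n] * n
    ensemble = []                                   # (weight, model) pairs
    for _ in range(rounds):
        model = learn(train, labels, w)
        miss = [model(x) != y for x, y in zip(train, labels)]
        err = sum(wi for wi, m in zip(w, miss) if m)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)     # model weight by performance
        w = [wi * math.exp(alpha if m else -alpha) for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]                # next model focuses on errors
        ensemble.append((alpha, model))
    def predict(x):
        return 1 if sum(a * m(x) for a, m in ensemble) >= 0 else -1
    return predict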
81. Combining Multiple Models
(Stacking - I)
Hard to analyze theoretically: “black magic”
Uses “meta learner” instead of voting to combine
predictions of base learners
Predictions of base learners (level-0 models) are used as
input for meta learner (level-1 model)
Base learners usually have different learning schemes
Predictions on the training data can't be used to generate data for the level-1 model!
A cross-validation-like scheme is employed instead
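A sketch of the scheme; base_learners and meta_learner are assumed factories of the form learn(X, y) -> model, where model.predict(x) returns a class:

# Sketch: stacking with cross-validation-generated level-1 training data.
def stack(X, y, base_learners, meta_learner, k=5):
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    # Level-1 training data: held-out (cross-validated) predictions only.
    meta_X = [[None] * len(base_learners) for _ in range(n)]
    for j, learn in enumerate(base_learners):
        for fold in folds:
            train_idx = [i for i in range(n) if i not in fold]
            model = learn([X[i] for i in train_idx], [y[i] for i in train_idx])
            for i in fold:
                meta_X[i][j] = model.predict(X[i])
    level1 = meta_learner(meta_X, y)                     # level-1 (meta) model
    level0 = [learn(X, y) for learn in base_learners]    # refit on all data
    def predict(x):
        return level1.predict([m.predict(x) for m in level0])
    return predict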