2. Introduction
Three types of attributes:
• Nominal — values from an unordered set.
• Ordinal — values from an ordered set.
• Continuous — real numbers.
Discretization:
• Divide the range of a continuous attribute into intervals
• Some classification algorithms only accept categorical
attributes.
• Discretization reduces data size.
• Prepare for further analysis.
3. Discretization
• Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals.
• Interval labels can then be used to replace actual data
values.
Concept hierarchies
• Reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
5. Attribute values can be discretized by distributing the
values into bins and replacing each value in a bin by the bin
mean or bin median.
This technique can be applied recursively to the resulting
partitions in order to generate concept hierarchies.
Binning does not use class information and is therefore an
unsupervised discretization technique.
It is sensitive to the user-specified number of bins.
6. Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size (a uniform grid).
• If A and B are the lowest and highest values of the attribute, the
width of the intervals will be W = (B-A)/N.
• This is the most straightforward approach.
• But outliers may dominate the presentation.
• Skewed data is not handled well.
Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing
approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
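A minimal Python sketch of equal-depth partitioning follows. The function name `equal_depth_bins` is hypothetical, and the choice to give any leftover samples to the first bins is an assumption of this sketch, not something the slide specifies.

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional sample each.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_depth_bins(prices, 3))
```

Because the split is purely positional, equal values can end up in different bins when they straddle a cut point.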
7. * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins:
- Bin 1 (4-14): 4, 8, 9
- Bin 2 (15-24): 15, 21, 21, 24
- Bin 3 (25-34): 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries (each value is replaced by the closer of its bin's minimum and maximum values):
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
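The equi-width partition and the smoothing by bin means can be reproduced with a short Python sketch. The function names are hypothetical. The sketch assumes right-closed intervals (a value on an interior edge goes to the lower bin), which is what the bin ranges 4-14, 15-24, 25-34 imply, and it rounds bin means to the nearest integer.

```python
import math


def equal_width_bins(values, n_bins):
    """Assign values to n_bins intervals of width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Right-closed intervals; the minimum value goes to the first bin.
        idx = max(0, math.ceil((v - lo) / width) - 1)
        bins[idx].append(v)
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the rounded bin mean."""
    return [[round(sum(b) / len(b))] * len(b) if b else [] for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_width_bins(prices, 3)
print(bins)                   # [[4, 8, 9], [15, 21, 21, 24], [25, 26, 28, 29, 34]]
print(smooth_by_means(bins))  # [[7, 7, 7], [20, 20, 20, 20], [28, 28, 28, 28, 28]]
```

With floating-point widths, values that sit exactly on an edge may need more careful handling than this sketch provides.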
8. Histograms can also be used for discretization.
Partitioning rules can be applied to define the ranges of values.
The histogram analysis algorithm can be applied
recursively to each partition in order to automatically
generate a multilevel concept hierarchy, with the procedure
terminating once a prespecified number of concept levels
has been reached.
A minimum interval size can be used per level to control the
recursive procedure.
This specifies the minimum width of a partition, or the
minimum number of values in each partition at each level.
9. Histograms are a popular data reduction technique.
Divide data into buckets and store the average (or sum) for each
bucket.
Can be constructed optimally in one dimension using dynamic
programming.
Related to quantization problems.
[Figure: example histogram; axis ticks run from 0 to 40 and from 10,000 to 90,000.]
10. Several techniques for determining buckets
• Equiwidth – the width of each bucket range is uniform
• Equidepth – each bucket contains roughly the same number
of contiguous samples
• V-Optimal – the histogram with the least variance, where
histogram variance is a weighted sum of the variances of the
original values that each bucket represents, and bucket
weight = number of values in the bucket
• MaxDiff – the difference between each pair of adjacent values
is considered, and a bucket boundary is placed between the
pairs having the B – 1 largest differences, where B is the
user-defined number of buckets
V-Optimal and MaxDiff tend to be the most accurate and
practical.
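As an illustration of the MaxDiff idea, here is a minimal Python sketch. The function name `maxdiff_buckets` is hypothetical. It uses plain differences between adjacent sorted values, and breaking ties in favor of the earlier gap is an assumption of this sketch.

```python
def maxdiff_buckets(values, num_buckets):
    """Cut the sorted values at the (num_buckets - 1) largest adjacent gaps."""
    data = sorted(values)
    # (gap size, index of the left element of the pair)
    gaps = [(data[i + 1] - data[i], i) for i in range(len(data) - 1)]
    # Largest gaps first; ties are broken by position (an assumption).
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[:num_buckets - 1]
    cut_after = sorted(i for _, i in largest)

    buckets, start = [], 0
    for i in cut_after:
        buckets.append(data[start:i + 1])
        start = i + 1
    buckets.append(data[start:])
    return buckets


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(maxdiff_buckets(prices, 3))
# [[4, 8, 9], [15], [21, 21, 24, 25, 26, 28, 29, 34]]
```

For the price data, the two largest gaps are the jumps from 9 to 15 and from 15 to 21, so 15 ends up in a bucket of its own.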
11. A clustering algorithm can be applied to partition data
into clusters or groups.
Each cluster forms a node of a concept hierarchy,
where all nodes are at the same conceptual level.
Each cluster may be further decomposed into subclusters,
forming a lower level in the hierarchy.
Clusters may also be grouped together to form a
higher-level concept hierarchy.
Clusterings can be hierarchical and can be stored in
multi-dimensional index tree structures.
13. Given a set of samples S, if S is partitioned
into two intervals S1 and S2 using boundary T,
the entropy after partitioning is
$$E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$
• S1 and S2 correspond to the samples in S satisfying the
conditions A < T and A >= T, respectively
The boundary that minimizes the entropy
function over all possible boundaries is
selected as a binary discretization.
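The boundary selection can be sketched in Python as follows. The function names are hypothetical. The sketch assumes Ent is Shannon entropy over class labels (base 2) and that candidate boundaries are midpoints between adjacent distinct attribute values; neither detail is spelled out on the slide.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels; assumed form of Ent."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition_entropy(samples, t):
    """E(S, T): size-weighted entropy of S1 (A < t) and S2 (A >= t)."""
    s1 = [label for value, label in samples if value < t]
    s2 = [label for value, label in samples if value >= t]
    n = len(samples)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)


def best_boundary(samples):
    """Return the candidate boundary with minimal E(S, T).

    Needs at least two distinct attribute values.
    """
    values = sorted({value for value, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: partition_entropy(samples, t))


# (attribute value, class label) pairs; illustrative data only.
samples = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
print(best_boundary(samples))  # 2.5
```

Here the boundary 2.5 separates the two classes perfectly, so the entropy after partitioning is zero.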
14. The process is recursively applied to the partitions
obtained until some stopping criterion is met,
e.g., when the information gain Ent(S) − E(S, T) becomes
smaller than a chosen threshold δ.
Experiments show that it may reduce data size
and improve classification accuracy.
15. The 3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
If an interval covers 3, 6, 7, or 9 distinct values at the
most significant digit, partition the range into 3 intervals
(equi-width for 3, 6, and 9; grouped 2-3-2 for 7).
If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 equi-width intervals.
If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 equi-width intervals.
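The rule can be sketched in Python as below. The function name `three_four_five` and its parameters are hypothetical; `unit` is the place value of the most significant digit, and `low` and `high` are assumed to be already rounded to multiples of it. For 7 distinct values the sketch uses the 2-3-2 grouping that is customary for this rule, which is an assumption here.

```python
def three_four_five(low, high, unit):
    """Partition [low, high] by the 3-4-5 rule; returns (start, end) pairs."""
    distinct = (high - low) // unit
    if distinct == 7:
        # Assumed 2-3-2 grouping for 7 distinct values.
        cuts = [low, low + 2 * unit, low + 5 * unit, high]
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"rule does not cover {distinct} distinct values")
        width = (high - low) / parts
        cuts = [low + i * width for i in range(parts + 1)]
    return list(zip(cuts, cuts[1:]))


# A range of 3 distinct values at the millions digit gives 3 intervals.
print(three_four_five(-1_000_000, 2_000_000, 1_000_000))
```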
17. Step 1 – Min = -$351,976, Max = $4,700,896, low (5th
percentile) = -$159,876, high (95th percentile) = $1,838,761.
Step 2 – For low and high, the most significant digit is at
$1,000,000, so low is rounded down to -$1,000,000 and high is
rounded up to $2,000,000.
Step 3 – The interval ranges over 3 distinct values at the most
significant digit, so the 3-4-5 rule partitions it into 3 intervals:
-$1,000,000 to $0, $0 to $1,000,000, and $1,000,000 to $2,000,000.
Step 4 – Examine the Min and Max values to see how they “fit” into
the first-level partitions. The first partition covers the Min value,
so adjust its left boundary to make the partition smaller. The last
partition doesn’t cover the Max value, so create a new partition
(rounding Max up to the next significant digit): $2,000,000 to $5,000,000.
Step 5 – Recursively, each interval can be further partitioned
using the 3-4-5 rule to form the next lower level of the hierarchy.
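The rounding in Step 2 and the coverage check in Step 4 can be reproduced with a small Python sketch. The helper name `msd_round` is hypothetical, and the sketch assumes integer dollar amounts.

```python
def msd_round(low, high):
    """Round low down and high up at the most significant digit."""
    # Place value of the most significant digit of the larger magnitude.
    unit = 10 ** (len(str(max(abs(low), abs(high)))) - 1)
    low_rounded = (low // unit) * unit        # floor to a multiple of unit
    high_rounded = -((-high) // unit) * unit  # ceiling to a multiple of unit
    return unit, low_rounded, high_rounded


minimum, maximum = -351_976, 4_700_896
unit, low_r, high_r = msd_round(-159_876, 1_838_761)

print(unit, low_r, high_r)        # 1000000 -1000000 2000000
print((high_r - low_r) // unit)   # 3 distinct values at the most significant digit
print(low_r <= minimum)           # True: the first partition covers Min
print(maximum <= high_r)          # False: a new partition is needed for Max
```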
18. Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Example: a relational database may contain the attributes street,
city, province_or_state, country
• An expert defines the ordering of the hierarchy, such as street <
city < province_or_state < country
Specification of a portion of a hierarchy by explicit data
grouping
• Example: for province_or_state and country, {Alberta,
Saskatchewan, Manitoba} is grouped into prairies_Canada, and
{British Columbia, prairies_Canada} is grouped into Western Canada
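An explicit grouping like this one can be represented as a child-to-parent mapping. The Python sketch below is only illustrative; the names `parent` and `ancestors` are hypothetical.

```python
# Child -> parent links taken from the grouping example.
parent = {
    "Alberta": "prairies_Canada",
    "Saskatchewan": "prairies_Canada",
    "Manitoba": "prairies_Canada",
    "British Columbia": "Western Canada",
    "prairies_Canada": "Western Canada",
}


def ancestors(value):
    """Return the higher-level concepts above value, nearest first."""
    chain = []
    while value in parent:
        value = parent[value]
        chain.append(value)
    return chain


print(ancestors("Alberta"))  # ['prairies_Canada', 'Western Canada']
```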
19. Specification of a set of attributes, but not of their partial
ordering.
• Automatically generate the attribute ordering, based on the
observation that an attribute defining a high-level concept has a
smaller number of distinct values than an attribute defining a
lower-level concept
• Example: country (15), province_or_state (365), city (3567), street
(674,339)
Specification of only a partial set of attributes
• Try to parse the database schema to determine the complete
hierarchy.
20. Concept hierarchy can be automatically generated based on
the number of distinct values per attribute in the given
attribute set.
The attribute with the most distinct values is placed at the
lowest level of the hierarchy.
country (15 distinct values)
province_or_state (365 distinct values)
city (3567 distinct values)
street (674,339 distinct values)
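The ordering heuristic amounts to sorting attributes by their number of distinct values. A minimal Python sketch follows; the function name `order_hierarchy` is hypothetical, and the counts are the ones quoted in the example.

```python
def order_hierarchy(distinct_counts):
    """Order attributes from highest level (fewest distinct values) to lowest."""
    return sorted(distinct_counts, key=distinct_counts.get)


counts = {
    "street": 674_339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
print(order_hierarchy(counts))
# ['country', 'province_or_state', 'city', 'street']
```

In practice the counts would come from the data itself, for example from a distinct-value count per column, and the generated order may still need a manual check.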