3.
Data Preparation
[Bar chart: Effort (%), 0 to 60, across the phases Objectives Determination, Data Preparation, Data Mining, and Analysis & Assimilation]
4.
Data Preparation
• Data Cleaning and Quality
• Types of Data
• Categorical versus Continuous Data
• Problem of Missing Data
– Imputation
– Missing Data Plots
• Problem of Outliers
• Dimension Reduction, Quantization, Sampling
5.
Data Preparation
• Quality
– Data may not have any statistically significant patterns or
relationships
– Results may be inconsistent with other data sets
– Data often of uneven quality, e.g. made up by respondent
– Opportunistically collected data may have biases or errors
– Discovered patterns may be too specific or too general to be useful
6.
Data Preparation
• Noise - Incorrect Values
– Faulty data collection instruments, e.g. sensors
– Transmission errors, e.g. intermittent errors from
satellite or Internet transmissions
– Data entry problems
– Technology limitations
– Naming conventions misused
7.
Data Preparation
• Noise - Incorrect Classification
– Human judgment
– Time varying
– Uncertainty/Probabilistic nature of data
8.
Data Preparation
• Redundant/Stale data
– Variables have different names in different databases
– Raw variable in one database is a derived variable in
another
– Irrelevant variables destroy speed (dimension reduction
needed)
– Changes in variable over time not reflected in database
9.
Data Preparation
• Data cleaning
• Selecting an appropriate data set and/or sampling strategy
• Transformations
10.
Data Preparation
• Data Cleaning
– Duplicate removal (tool based)
– Missing value imputation (manual, statistical)
– Identify and remove data inconsistencies
– Identify and refresh stale data
– Create unique record (case) ID
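A minimal pandas sketch of the cleaning steps on the slide above. The file name, the column names (age, last_updated), the refresh date, and the choice of a median fill for imputation are illustrative assumptions, not anything prescribed by the slides.

```python
import pandas as pd

# Hypothetical input file and column names, purely for illustration.
df = pd.read_csv("customers.csv")

# Duplicate removal (tool based)
df = df.drop_duplicates()

# Missing value imputation (a simple statistical fill with the column median)
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Identify and remove data inconsistencies, e.g. impossible negative ages
df = df[df["age"] >= 0]

# Identify and refresh stale data, e.g. drop records older than the last refresh
df = df[pd.to_datetime(df["last_updated"]) >= pd.Timestamp("2024-01-01")]

# Create a unique record (case) ID
df = df.reset_index(drop=True)
df["case_id"] = df.index + 1
```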
11.
Data Preparation
• Categorical versus Continuous Data
– Most statistical theory and many graphics tools were developed
for continuous data
– Much, if not most, of the data in databases is categorical
– The computer science view often converts continuous data into
categorical data, e.g. salaries categorized as low, medium,
high, because categories are better suited to Boolean operations
12.
Data Preparation
• Problem of Missing Values
– Missing values in massive data sets may or may not be
a problem
• Missing data may be irrelevant to the desired result, e.g. cases with
missing demographic data may not help if I am trying to create a
selection mechanism for good customers based on demographics
• Massive data sets if acquired by instrumentation may have few
missing values anyway
• Imputation has model assumptions
– Suggest making a Missing Value Plot
13.
Data Preparation
• Missing Value Plot
– A plot of variables by cases
– Missing values colored red
– Special case of “color
histogram” with binary data
– “Color histogram” also
known as “data image”
– This example is 67
dimensions by 1000 cases
– This example is also fake
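A sketch of the missing value plot described above, drawn with matplotlib as a binary data image (variables by cases, missing entries in red). The 1000 x 67 array is simulated here, echoing the slide's note that its own example is fake.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Simulated data: 1000 cases by 67 variables with ~5% of values missing at random.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 67))
data[rng.random(data.shape) < 0.05] = np.nan

# Missing value plot: a binary "data image" of variables by cases,
# with missing values colored red.
missing = np.isnan(data)
plt.imshow(missing.T, aspect="auto", interpolation="nearest",
           cmap=ListedColormap(["white", "red"]))
plt.xlabel("case")
plt.ylabel("variable")
plt.title("Missing value plot")
plt.show()
```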
14.
Data Preparation
• Problem of Outliers
– Outliers are easy to detect in low dimensions
– A high-dimensional outlier may not show up in low-dimensional
projections
– MVE (minimum volume ellipsoid) and MCD (minimum covariance
determinant) algorithms are exponentially computationally complex
• Visualization tools may help
– Fisher information matrix and convex hull peeling are more
feasible but still too complex for Massive datasets
• Some angle based methods are promising
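As one concrete illustration of the MCD idea mentioned above, scikit-learn ships a FAST-MCD implementation (MinCovDet); the sketch below flags multivariate outliers by robust Mahalanobis distance on simulated data. The chi-square cutoff and the planted outliers are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

# Simulated data: 10,000 cases in 5 dimensions, with 20 planted outliers
# that may be hard to see in any single low-dimensional projection.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
X[:20] += 8.0

# FAST-MCD gives a robust location/scatter estimate; squared Mahalanobis
# distances from that fit are large for outlying cases.
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)

# Flag cases beyond the 99.9% chi-square quantile (an assumed threshold).
outliers = d2 > chi2.ppf(0.999, df=X.shape[1])
print("flagged cases:", int(outliers.sum()))
```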
15.
Data Preparation
• Database Sampling
– Exhaustive search may not be practically feasible because of the
size of the databases
– The KDD system must be able to assist in the selection of
appropriate parts of the databases to be examined
– For sampling to work, the data must satisfy certain
conditions (not ordered, no systematic biases)
– Sampling can be a very expensive operation, especially when
the sample is taken from data stored in a DBMS. Sampling
5% of the database can be more expensive than a sequential
full scan of the data.
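Because drawing, say, a 5% sample at random from a DBMS can cost more than a sequential full scan, one common workaround is reservoir sampling, which draws a uniform random sample in a single pass over the rows. The sketch below assumes the rows arrive as any Python iterable, for example a database cursor.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Draw a uniform random sample of k rows in one sequential pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # position to possibly replace
            if j < k:
                reservoir[j] = row
    return reservoir

# Usage: sample 1,000 rows from a cursor/iterator over the full table.
# sample = reservoir_sample(db_cursor, k=1000, seed=42)
```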
16.
Data Compression
• Often data preparation involves data
compression
– Sampling
– Quantization
17.
Data Quantization
Thinning vs Binning
• People’s first thought about Massive Data is usually
statistical subsampling
• Quantization is engineering’s success story
• Binning is the statistician’s quantization
18.
Data Quantization
• Images are quantized in 8 to 24 bits, i.e. 256 to 16
million levels.
• Signals (audio on CDs) are quantized in 16 bits,
i.e. 65,536 levels
• Ask a statistician how many bins to use and the likely
response is a few hundred; ask a CS data miner and the
likely response is 3
• For a terabyte data set, 10^6 bins
19.
Data Quantization
• Binning, but at microresolution
• Conventions
– d = dimension
– k = # of bins
– n = sample size
– Typically k << n
20.
Data Quantization
• Choose E[W | Q = y_j] = mean of the observations
in the j-th bin = y_j
• In other words, E[W | Q] = Q
• The quantizer is self-consistent
21.
Data Quantization
• E[W] = E[Q]
• If θ is a linear unbiased estimator, then so is E[θ|Q]
• If h is a convex function, then E[h(Q)] ≤ E[h(W)].
– In particular, E[Q^2] ≤ E[W^2] and var(Q) ≤ var(W).
• E[Q(Q-W)] = 0
• cov (W-Q) = cov (W) - cov (Q)
• E[(W-P)^2] ≥ E[(W-Q)^2] where P is any other quantizer.
23.
Distortion due to Quantization
• Distortion is the error due to quantization.
• In simple terms, E[(W-Q)^2].
• Distortion is minimized when the
quantization regions, Sj, are most like a
(hyper-) sphere.
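A numpy sketch, on simulated data, of the self-consistent quantizer and distortion discussed on the slides above: equal-width bins with the bin mean as representor give E[W | Q] = Q, and the moment identities, including E[(W-Q)^2] = var(W) - var(Q), can be checked numerically. The sample size and the choice of k = 256 bins are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)   # W: the raw observations (simulated)
k = 256                        # number of bins (an assumption)

# Equal-width bins; the representor of each bin is the mean of the
# observations falling in it, so E[W | Q] = Q (self-consistency).
edges = np.linspace(w.min(), w.max(), k + 1)
idx = np.clip(np.digitize(w, edges) - 1, 0, k - 1)
bin_mean = np.array([w[idx == j].mean() if np.any(idx == j) else 0.0
                     for j in range(k)])
q = bin_mean[idx]              # Q: the quantized data

print(w.mean(), q.mean())                        # E[W] = E[Q]
print(w.var() >= q.var())                        # var(Q) <= var(W): True
print(np.mean((w - q) ** 2), w.var() - q.var())  # distortion E[(W-Q)^2] = var(W) - var(Q)
```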
24.
Geometry-based Quantization
• Need space-filling tessellations
• Need congruent tiles
• Need as spherical as possible
25.
Geometry-based Quantization
• In one dimension
– Only polytope is a straight line segment (also bounded
by a one-dimensional sphere).
• In two dimensions
– Only polytopes are equilateral triangles, squares and
hexagons
30.
Geometry-based Quantization
[Figures: 24-cell with cuboctahedron envelope; hexagonal prism]
31.
Geometry-based Quantization
• Using 10^6 bins is computationally and visually feasible.
• Fast binning: for data in the range [a, b] and for k bins,
j = fixed[k*(x_i - a)/(b - a)]
gives the index of the bin for x_i in one dimension, where fixed[·]
truncates to an integer.
• Computational complexity is 4n + 1 = O(n).
• Memory requirements drop to 3k: location of bin + # items in bin +
representor of bin, i.e. storage complexity is 3k.
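A vectorized numpy version of the fast binning rule above, keeping only the 3k quantities the slide lists (bin location, # items in bin, representor of bin). Here astype(int) plays the role of fixed[·], and the default range [a, b] = [min, max] is an assumption.

```python
import numpy as np

def fast_bin_1d(x, k, a=None, b=None):
    """Fast 1-D binning: O(n) time, 3k storage (bin location, count, representor)."""
    x = np.asarray(x, dtype=float)
    a = x.min() if a is None else a
    b = x.max() if b is None else b

    # j = fixed[k * (x_i - a) / (b - a)]; clip so that x_i == b lands in the last bin.
    j = np.clip((k * (x - a) / (b - a)).astype(int), 0, k - 1)

    counts = np.bincount(j, minlength=k)              # # items in each bin
    sums = np.bincount(j, weights=x, minlength=k)     # running sums for the representors
    centers = a + (np.arange(k) + 0.5) * (b - a) / k  # location of each bin
    representors = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
    return centers, counts, representors

# Usage: reduce a huge sample to k = 10**6 bins in a single O(n) pass.
# centers, counts, reps = fast_bin_1d(data, k=10**6)
```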
32.
Geometry-based Quantization
• In two dimensions
– Each hexagon is indexed by 3 parameters.
– Computational complexity is 3 times the 1-D complexity,
i.e. 12n + 3 = O(n).
– Complexity for squares is 2 times 1-D complexity.
– Ratio is 3/2.
– Storage complexity is still 3k.
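For the two-dimensional hexagonal tiling, matplotlib's hexbin routine performs the assignment and counting directly; this sketch only reports counts per hexagon rather than full bin representors, the data are simulated, and the gridsize is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated 2-D data, for illustration only.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 100_000))

# Hexagonal binning: each point is assigned to a hexagonal tile.
hb = plt.hexbin(x, y, gridsize=50, cmap="viridis")
plt.colorbar(hb, label="# items in bin")
plt.title("Hexagonal binning of 100,000 points")
plt.show()

# hb.get_offsets() gives the hexagon centers, hb.get_array() the counts.
```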
33.
Geometry-based Quantization
• In 3 dimensions
– For truncated octahedron, there are 3 pairs of square
sides and 4 pairs of hexagonal sides.
– Computational complexity is 28n+7 = O(n).
– Computational complexity for a cube is 12n+3.
– Ratio is 7/3.
– Storage complexity is still 3k.
34.
Quantization Strategies
• Optimally, for purposes of minimizing distortion, use the
roundest polytope in d dimensions.
– Complexity is always O(n).
– Storage complexity is 3k.
– # tiles grows exponentially with dimension, the so-called
curse of dimensionality.
– Higher dimensional geometry is poorly known.
– Computational complexity grows faster than for the
hypercube.
35.
Quantization Strategies
• For purposes of simplicity, always use hypercube or d-
dimensional simplices
– Computational complexity is always O(n).
– Methods for data adaptive tiling are available
– Storage complexity is 3k.
– # tiles grows exponentially with dimension.
– Both polytopes depart from spherical shape rapidly as d increases.
– Hypercube approach is known as datacube in computer science
literature and is closely related to multivariate histograms in
statistical literature.
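A sketch of the hypercube (datacube / multivariate histogram) strategy using numpy.histogramdd on simulated data. The 10 bins per axis and d = 4 are assumptions; the tile count of bins_per_axis**d makes the exponential growth in d explicit.

```python
import numpy as np

# Simulated data: n cases in d = 4 dimensions (illustrative).
rng = np.random.default_rng(0)
d, n = 4, 1_000_000
X = rng.normal(size=(n, d))

# Hypercube binning = multivariate histogram (the "datacube"):
# 10 bins per axis gives 10**d tiles, which grows exponentially with d.
bins_per_axis = 10
counts, edges = np.histogramdd(X, bins=bins_per_axis)

print(counts.shape)          # (10, 10, 10, 10): # tiles = bins_per_axis ** d
print(counts.sum() == n)     # every case falls in exactly one hypercube
```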
36.
Quantization Strategies
• Conclusions on Geometric Quantization
– Geometric approach good to 4 or 5 dimensions.
– Adaptive tilings may improve rate at which # tiles
grows, but probably destroy spherical structure.
– Good for large n, but weaker for large d.
37.
Quantization Strategies
• Alternate Strategy
– Form bins via clustering
• Known in the electrical engineering literature as vector
quantization.
• Distance-based clustering is O(n^2), which implies poor
performance for large n.
• Not terribly dependent on dimension, d.
• Clusters may be very out of round, not even convex.
– Conclusion
• Cluster approach may work for large d, but fails for large n.
• Not particularly applicable to “massive” data mining.
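A sketch of the vector quantization strategy using scipy's k-means routines. The sample size is kept modest here since, as the slide notes, distance-based clustering scales poorly in n, and the number of code vectors k is an assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Simulated data: modest n, fairly large d (clustering cost grows quickly with n).
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 20))

k = 256                        # number of code vectors (bins); an assumption
codebook, _ = kmeans(X, k)     # cluster centers serve as the bin representors
codes, dist = vq(X, codebook)  # assign each case to its nearest code vector

print(codebook.shape)          # the representors, one row per code vector
print(np.mean(dist ** 2))      # empirical distortion E[(W - Q)^2]
```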
38.
Quantization Strategies
• Third strategy
– Density-based clustering
• Density estimation with kernel estimators is O(n).
• Uses modes m_α to form clusters
• Put x_i in cluster α if it is closest to mode m_α.
• This procedure is distance based, but with complexity O(kn),
not O(n^2).
• Normal mixture densities may be an alternative approach.
• Roundness may be a problem.
– But quantization based on density-based clustering
offers promise for both large d and large n.
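A sketch of the density-based strategy using scikit-learn's MeanShift, one available mode-seeking implementation: each case is pulled toward the nearest density mode m_α and clustered there. The bandwidth settings and the three-mode simulated data are assumptions, and MeanShift's practical cost differs from the O(kn) figure quoted above.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Simulated data: a three-mode mixture in 3 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(2_000, 3)) for c in (-3.0, 0.0, 3.0)])

# Kernel-density-driven mode seeking: each x_i climbs to its nearest mode m_alpha.
bw = estimate_bandwidth(X, quantile=0.2, n_samples=500)
ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)

print(ms.cluster_centers_)       # the modes m_alpha (cluster representors)
print(np.bincount(ms.labels_))   # # cases assigned to each mode
```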
39.
Data Quantization
• Binning does not lose fine structure in tails as sampling
might.
• Roundoff analysis applies.
• At this scale of binning, discretization is not likely to be much
less accurate than the accuracy of the recorded data.
• Discretization: a finite number of bins yields discrete
variables, more compatible with categorical data.
40.
Data Quantization
• Analysis on a finite subset of the integers
has theoretical advantages
– Analysis is less delicate
• different forms of convergence are equivalent
– Analysis is often more natural since data is
already quantized or categorical
– Graphical analysis of numerical data is not
much changed, since 10^6 pixels is at the limit of
the HVS (human visual system)