Statistical Data Mining



  1. 1. Statistical Data Mining Lecture 2 Edward J. Wegman George Mason University
  2. 2. Data Preparation
  3. 3. Data Preparation [bar chart: Effort (%), scale 0–60, by phase: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]
  4. 4. Data Preparation • Data Cleaning and Quality • Types of Data • Categorical versus Continuous Data • Problem of Missing Data – Imputation – Missing Data Plots • Problem of Outliers • Dimension Reduction, Quantization, Sampling
  5. 5. Data Preparation • Quality – Data may not have any statistically significant patterns or relationships – Results may be inconsistent with other data sets – Data often of uneven quality, e.g. made up by respondent – Opportunistically collected data may have biases or errors – Discovered patterns may be too specific or too general to be useful
  6. 6. Data Preparation • Noise - Incorrect Values – Faulty data collection instruments, e.g. sensors – Transmission errors, e.g. intermittent errors from satellite or Internet transmissions – Data entry problems – Technology limitations – Naming conventions misused
  7. 7. Data Preparation • Noise - Incorrect Classification – Human judgment – Time varying – Uncertainty/Probabilistic nature of data
  8. 8. Data Preparation • Redundant/Stale data – Variables have different names in different databases – Raw variable in one database is a derived variable in another – Irrelevant variables destroy speed (dimension reduction needed) – Changes in variable over time not reflected in database
  9. 9. Data Preparation • Data cleaning • Selecting an appropriate data set and/or sampling strategy • Transformations
  10. 10. Data Preparation • Data Cleaning – Duplicate removal (tool based) – Missing value imputation (manual, statistical) – Identify and remove data inconsistencies – Identify and refresh stale data – Create unique record (case) ID
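The cleaning steps on this slide can be sketched in a few lines of code. A minimal pure-Python illustration on a toy list of records; the field names ("age", "income") and the mean-imputation rule are assumptions for the example, not from the lecture:

```python
# Toy data-cleaning sketch: duplicate removal, mean imputation of
# missing values, and creation of a unique record (case) ID.

def clean(records):
    # 1. Duplicate removal: keep the first occurrence of each record.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Missing-value imputation: replace None with the column mean.
    for field in ("age", "income"):
        vals = [r[field] for r in unique if r[field] is not None]
        col_mean = sum(vals) / len(vals)
        for r in unique:
            if r[field] is None:
                r[field] = col_mean

    # 3. Create a unique record (case) ID.
    for i, r in enumerate(unique):
        r["case_id"] = i
    return unique

rows = [{"age": 30, "income": 50}, {"age": 30, "income": 50},
        {"age": None, "income": 70}]
print(clean(rows))  # 2 records survive; the missing age is imputed
```

A real pipeline would do the same steps with database tooling; the point is only the order of operations the slide lists.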
  11. 11. Data Preparation • Categorical versus Continuous Data – Most statistical theory and many graphics tools were developed for continuous data – Much, if not most, of the data in databases is categorical – The computer science view often converts continuous data to categorical, e.g. salaries categorized as low, medium, high, because categories are better suited to Boolean operations
  12. 12. Data Preparation • Problem of Missing Values – Missing values in massive data sets may or may not be a problem • Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics • Massive data sets, if acquired by instrumentation, may have few missing values anyway • Imputation has model assumptions – Suggest making a Missing Value Plot
  13. 13. Data Preparation • Missing Value Plot – A plot of variables by cases – Missing values colored red – Special case of “color histogram” with binary data – “Color histogram” also known as “data image” – This example is 67 dimensions by 1000 cases – This example is also fake
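The data behind a missing-value plot is just a binary variables-by-cases matrix. A tiny sketch (toy 3 × 3 data, not the 67 × 1000 example on the slide); a plotting tool would render the 1-cells in red, here they are printed as "R":

```python
# Build the binary missingness matrix behind a missing-value plot
# ("data image"): rows = cases, columns = variables, 1 = missing.

data = [[1.2, None, 3.0],
        [None, 2.2, 3.1],
        [1.5, 2.0, None]]

mask = [[1 if v is None else 0 for v in row] for row in data]
for row in mask:
    print("".join("R" if m else "." for m in row))  # R = missing (red)
```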
  14. 14. Data Preparation • Problem of Outliers – Outliers easy to detect in low dimensions – A high dimensional outlier may not show up in low dimensional projections – MVE or MCD algorithms are exponentially computationally complex • Visualization tools may help – Fisher Info Matrix and Convex Hull Peeling more feasible but still too complex for Massive datasets • Some angle based methods are promising
  15. 15. Data Preparation • Database Sampling – Exhaustive search may not be practically feasible because of the database's size – KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined – For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases) – Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS; sampling 5% of the database can be more expensive than a sequential full scan of the data.
  16. 16. Data Compression • Often data preparation involves data compression – Sampling – Quantization
  17. 17. Data Quantization Thinning vs Binning • People's first thought about massive data is usually statistical subsampling • Quantization is engineering's success story • Binning is the statistician's quantization
  18. 18. Data Quantization • Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels • Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels • Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3 • For a terabyte data set, 10⁶ bins
  19. 19. Data Quantization • Binning, but at microresolution • Conventions – d = dimension – k = # of bins – n = sample size – Typically k << n
  20. 20. Data Quantization • Choose E[W | Q = y_j] = mean of the observations in the jth bin = y_j • In other words, E[W | Q] = Q • The quantizer is self-consistent
  21. 21. Data Quantization • E[W] = E[Q] • If θ is a linear unbiased estimator, then so is E[θ | Q] • If h is a convex function, then E[h(Q)] ≤ E[h(W)] – In particular, E[Q²] ≤ E[W²] and var(Q) ≤ var(W) • E[Q(Q − W)] = 0 • cov(W − Q) = cov(W) − cov(Q) • E[(W − P)²] ≥ E[(W − Q)²], where P is any other quantizer
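The first and third of these properties are easy to check numerically: bin the data, replace each observation W by its bin mean Q (the self-consistent quantizer), and compare moments. A pure-Python sketch on simulated Gaussian data (toy data, assumed for illustration):

```python
# Check E[W] = E[Q] and var(Q) <= var(W) for a self-consistent quantizer.
import random

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10000)]
k, a, b = 100, min(w), max(w)

# Assign each observation to one of k equal-width bins.
idx = [min(int(k * (x - a) / (b - a)), k - 1) for x in w]
sums, counts = [0.0] * k, [0] * k
for x, j in zip(w, idx):
    sums[j] += x
    counts[j] += 1
bin_means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
q = [bin_means[j] for j in idx]      # Q = E[W | bin]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

assert abs(mean(w) - mean(q)) < 1e-9   # E[W] = E[Q]
assert var(q) <= var(w)                # var(Q) <= var(W)
```

E[W] = E[Q] holds exactly because each bin mean is weighted by its bin count; the variance inequality holds because quantization discards the within-bin variation.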
  22. 22. Data Quantization
  23. 23. Distortion due to Quantization • Distortion is the error due to quantization • In simple terms, E[(W − Q)²] • Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere
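A quick numerical illustration of distortion E[(W − Q)²]: for a uniform scalar quantizer on [0, 1] with k equal bins, the distortion is about 1/(12k²), so it shrinks rapidly as k grows. A minimal sketch on simulated uniform data (toy data, assumed for illustration):

```python
# Empirical distortion E[(W - Q)^2] of a uniform midpoint quantizer,
# shrinking as the number of bins k grows.
import random

random.seed(1)
w = [random.uniform(0, 1) for _ in range(5000)]

def distortion(w, k):
    # Quantize each x to the midpoint of its bin on [0, 1].
    q = [(min(int(k * x), k - 1) + 0.5) / k for x in w]
    return sum((x - y) ** 2 for x, y in zip(w, q)) / len(w)

d10, d100 = distortion(w, 10), distortion(w, 100)
assert d100 < d10  # finer quantization -> lower distortion
```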
  24. 24. Geometry-based Quantization • Need space-filling tessellations • Need congruent tiles • Need as spherical as possible
  25. 25. Geometry-based Quantization • In one dimension – The only polytope is a straight line segment (the one-dimensional ball) • In two dimensions – The only polytopes that tile the plane are equilateral triangles, squares and hexagons
  26. 26. Geometry-based Quantization • In 3 dimensions – Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron • In 4 dimensions – 4-simplex, hypercube, 24-cell [figure: truncated octahedron tessellation]
  27. 27. Geometry-based Quantization • Dimensionless Second Moment for 3-D Polytopes: Tetrahedron* .1040042…; Cube* .0833333…; Octahedron .0825482…; Hexagonal Prism* .0812227…; Rhombic Dodecahedron* .0787451…; Truncated Octahedron* .0785433…; Dodecahedron .0781285…; Icosahedron .0778185…; Sphere .0769670
  28. 28. Geometry-based Quantization [figures: tetrahedron, cube, octahedron, truncated octahedron, dodecahedron, icosahedron]
  29. 29. Geometry-based Quantization [figure: rhombic dodecahedron]
  30. 30. Geometry-based Quantization [figures: 24-cell with cuboctahedron envelope; hexagonal prism]
  31. 31. Geometry-based Quantization • Using 10⁶ bins is computationally and visually feasible • Fast binning: for data in the range [a, b] and k bins, j = floor[k(x_i − a)/(b − a)] gives the index of the bin for x_i in one dimension • Computational complexity is 4n + 1 = O(n) • Memory requirements drop to 3k – location of bin + # items in bin + representor of bin – i.e. storage complexity is 3k
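The one-dimensional fast-binning rule on this slide is a one-pass, O(n) loop. A minimal sketch (toy data; the clamp for x = b is an implementation detail added here):

```python
# Fast 1-D binning: j = floor(k * (x - a) / (b - a)) indexes the bin of x.

def bin_counts(xs, a, b, k):
    counts = [0] * k
    for x in xs:
        j = int(k * (x - a) / (b - a))   # bin index of x
        if j == k:                       # x == b lands in the last bin
            j = k - 1
        counts[j] += 1
    return counts

print(bin_counts([0.0, 0.1, 0.5, 0.99, 1.0], 0.0, 1.0, 4))
# prints [2, 0, 1, 2]
```

Per point this is one subtraction, one multiplication, one division, and one truncation, which matches the 4n + 1 operation count on the slide.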
  32. 32. Geometry-based Quantization • In two dimensions – Each hexagon is indexed by 3 parameters – Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n) – Complexity for squares is 2 times the 1-D complexity – Ratio is 3/2 – Storage complexity is still 3k
  33. 33. Geometry-based Quantization • In 3 dimensions – For truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides. – Computational complexity is 28n+7 = O(n). – Computational complexity for a cube is 12n+3. – Ratio is 7/3. – Storage complexity is still 3k.
  34. 34. Quantization Strategies • Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions – Complexity is always O(n) – Storage complexity is 3k – # tiles grows exponentially with dimension, the so-called curse of dimensionality – Higher-dimensional geometry is poorly known – Computational complexity grows faster than for the hypercube
  35. 35. Quantization Strategies • For purposes of simplicity, always use the hypercube or d-dimensional simplices – Computational complexity is always O(n) – Methods for data-adaptive tiling are available – Storage complexity is 3k – # tiles grows exponentially with dimension – Both polytopes depart from spherical shape rapidly as d increases – The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature
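The hypercube/datacube strategy generalizes the 1-D binning rule coordinate-wise: each point maps to a tuple of per-axis bin indices. A minimal sketch (toy data; storing only occupied tiles in a dict, with count and mean representor per tile, is an implementation choice made here):

```python
# Hypercube ("datacube") binning in d dimensions: per-axis floor rule,
# keyed by the tuple of bin indices; each tile keeps (count, representor).

def hypercube_bin(points, lo, hi, k):
    tiles = {}
    for p in points:
        key = tuple(min(int(k * (x - a) / (b - a)), k - 1)
                    for x, a, b in zip(p, lo, hi))
        cnt, sums = tiles.get(key, (0, [0.0] * len(p)))
        tiles[key] = (cnt + 1, [s + x for s, x in zip(sums, p)])
    # Representor = mean of the points falling in each tile.
    return {key: (c, [s / c for s in sums])
            for key, (c, sums) in tiles.items()}

pts = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.9)]
print(hypercube_bin(pts, (0.0, 0.0), (1.0, 1.0), 2))
```

Only occupied tiles consume memory, which is what keeps the 3k storage pattern workable even though the number of potential tiles grows as kᵈ.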
  36. 36. Quantization Strategies • Conclusions on Geometric Quantization – Geometric approach good to 4 or 5 dimensions. – Adaptive tilings may improve rate at which # tiles grows, but probably destroy spherical structure. – Good for large n, but weaker for large d.
  37. 37. Quantization Strategies • Alternate Strategy – Form bins via clustering • Known in the electrical engineering literature as vector quantization • Distance-based clustering is O(n²), which implies poor performance for large n • Not terribly dependent on the dimension, d • Clusters may be very out of round, not even convex – Conclusion • The cluster approach may work for large d, but fails for large n • Not particularly applicable to “massive” data mining
  38. 38. Quantization Strategies • Third strategy – Density-based clustering • Density estimation with kernel estimators is O(n) • Uses modes m_α to form clusters • Put x_i in cluster α if it is closest to mode m_α • This procedure is distance based, but with complexity O(kn), not O(n²) • Normal mixture densities may be an alternative approach • Roundness may be a problem – But quantization based on density-based clustering offers promise for both large d and large n
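The assignment step of this third strategy is one pass over n points against k modes, hence O(kn). A minimal 1-D sketch; the modes are supplied by hand here, standing in for the output of the O(n) kernel density estimation step the slide describes:

```python
# Density-based clustering, assignment step: put x_i in cluster alpha
# if it is closest to mode m_alpha. One pass: O(k n).

def assign_to_modes(xs, modes):
    labels = []
    for x in xs:
        # Index of the nearest mode (1-D distance for simplicity).
        labels.append(min(range(len(modes)), key=lambda a: abs(x - modes[a])))
    return labels

modes = [0.0, 5.0]                      # assumed modes of the density
xs = [-0.3, 0.2, 4.8, 5.1, 2.4]
print(assign_to_modes(xs, modes))       # prints [0, 0, 1, 1, 0]
```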
  39. 39. Data Quantization • Binning does not lose fine structure in the tails as sampling might • Roundoff analysis applies • At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data • Discretization – a finite number of bins implies discrete variables, more compatible with categorical data
  40. 40. Data Quantization • Analysis on a finite subset of the integers has theoretical advantages – Analysis is less delicate • different forms of convergence are equivalent – Analysis is often more natural since the data is already quantized or categorical – Graphical analysis of numerical data is not much changed, since 10⁶ pixels is at the limit of the human visual system (HVS)