Statistical Data Mining

Transcript

  • 1. Statistical Data Mining, Lecture 2. Edward J. Wegman, George Mason University
  • 2. Data Preparation
  • 3. Data Preparation [Bar chart: Effort (%) by project phase: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]
  • 4. Data Preparation • Data Cleaning and Quality • Types of Data • Categorical versus Continuous Data • Problem of Missing Data – Imputation – Missing Data Plots • Problem of Outliers • Dimension Reduction, Quantization, Sampling
  • 5. Data Preparation • Quality – Data may not have any statistically significant patterns or relationships – Results may be inconsistent with other data sets – Data often of uneven quality, e.g. made up by respondent – Opportunistically collected data may have biases or errors – Discovered patterns may be too specific or too general to be useful
  • 6. Data Preparation • Noise - Incorrect Values – Faulty data collection instruments, e.g. sensors – Transmission errors, e.g. intermittent errors from satellite or Internet transmissions – Data entry problems – Technology limitations – Naming conventions misused
  • 7. Data Preparation • Noise - Incorrect Classification – Human judgment – Time varying – Uncertainty/Probabilistic nature of data
  • 8. Data Preparation • Redundant/Stale data – Variables have different names in different databases – Raw variable in one database is a derived variable in another – Irrelevant variables destroy speed (dimension reduction needed) – Changes in variable over time not reflected in database
  • 9. Data Preparation • Data cleaning • Selecting an appropriate data set and/or sampling strategy • Transformations
  • 10. Data Preparation • Data Cleaning – Duplicate removal (tool based) – Missing value imputation (manual, statistical) – Identify and remove data inconsistencies – Identify and refresh stale data – Create unique record (case) ID
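A minimal pandas sketch of three of the cleaning steps listed above (duplicate removal, simple imputation, and creation of a unique case ID); the DataFrame and its column names are invented for illustration, and column-median imputation stands in for whatever imputation model is actually appropriate.

```python
# Hypothetical toy data; "age" and "salary" are made-up column names.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 34, np.nan, 51, 28],
    "salary": [48000, 48000, 61000, np.nan, 39000],
})

# Duplicate removal (tool based)
df = df.drop_duplicates()

# Missing value imputation (here: simple column-median imputation)
df = df.fillna(df.median(numeric_only=True))

# Create a unique record (case) ID
df = df.reset_index(drop=True)
df["case_id"] = df.index
```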
  • 11. Data Preparation • Categorical versus Continuous Data – Most statistical theory and many graphics tools were developed for continuous data – Much, if not most, of the data in databases is categorical – The computer science view often converts continuous data into categorical data, e.g. salaries categorized as low, medium, or high, because categories are better suited to Boolean operations
  • 12. Data Preparation • Problem of Missing Values – Missing values in massive data sets may or may not be a problem • Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics • Massive data sets, if acquired by instrumentation, may have few missing values anyway • Imputation has model assumptions – Suggest making a Missing Value Plot
  • 13. Data Preparation • Missing Value Plot – A plot of variables by cases – Missing values colored red – Special case of “color histogram” with binary data – “Color histogram” also known as “data image” – This example is 67 dimensions by 1000 cases – This example is also fake
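A minimal matplotlib sketch of the missing value plot ("data image") described above: variables by cases, with missing entries shown in red. The data here is simulated, echoing the slide's own simulated 67-dimension by 1000-case example.

```python
# Missing value plot: one row per variable, one column per case, missing in red.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

rng = np.random.default_rng(0)
n_cases, n_vars = 1000, 67
data = rng.normal(size=(n_cases, n_vars))
data[rng.random((n_cases, n_vars)) < 0.05] = np.nan   # inject ~5% missingness

missing = np.isnan(data)                              # binary: True where missing
plt.imshow(missing.T, aspect="auto", interpolation="nearest",
           cmap=ListedColormap(["lightgray", "red"]))
plt.xlabel("case")
plt.ylabel("variable")
plt.title("Missing value plot (data image)")
plt.show()
```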
  • 14. Data Preparation • Problem of Outliers – Outliers easy to detect in low dimensions – A high dimensional outlier may not show up in low dimensional projections – MVE or MCD algorithms are exponentially computationally complex • Visualization tools may help – Fisher Info Matrix and Convex Hull Peeling more feasible but still too complex for Massive datasets • Some angle based methods are promising
  • 15. Data Preparation • Database Sampling – Exhaustive search may not be practically feasible because of database size – The KDD system must be able to assist in the selection of appropriate parts of the databases to be examined – For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases) – Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS. Sampling 5% of the database can be more expensive than a sequential full scan of the data.
  • 16. Data Compression • Often data preparation involves data compression – Sampling – Quantization
  • 17. Data Quantization: Thinning vs. Binning • People's first thought about massive data is usually statistical subsampling • Quantization is engineering's success story • Binning is the statistician's quantization
  • 18. Data Quantization • Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels. • Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels • Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3 • For a terabyte data set, 10⁶ bins
  • 19. Data Quantization • Binning, but at microresolution • Conventions – d = dimension – k = # of bins – n = sample size – Typically k << n
  • 20. Data Quantization • Choose E[W|Q = y_j] = mean of observations in the j-th bin = y_j • In other words, E[W|Q] = Q • The quantizer is self-consistent
  • 21. Data Quantization • E[W] = E[Q] • If θ is a linear unbiased estimator, then so is E[θ|Q] • If h is a convex function, then E[h(Q)] ≤ E[h(W)]. – In particular, E[Q²] ≤ E[W²] and var(Q) ≤ var(W). • E[Q(Q-W)] = 0 • cov(W-Q) = cov(W) - cov(Q) • E[(W-P)²] ≥ E[(W-Q)²] where P is any other quantizer.
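The properties above are easy to check numerically. The numpy sketch below (simulated data and an arbitrary number of equal-width bins, both my choices) builds Q as the bin mean of W and verifies E[Q] = E[W] and var(Q) ≤ var(W), and reports the distortion E[(W-Q)²] discussed on the following slides.

```python
# Check quantizer properties: Q = E[W | bin], so E[Q] = E[W] and var(Q) <= var(W).
import numpy as np

rng = np.random.default_rng(1)
w = rng.gamma(shape=2.0, scale=3.0, size=100_000)     # raw data W (arbitrary)
k = 256                                               # number of bins (arbitrary)

edges = np.linspace(w.min(), w.max(), k + 1)
idx = np.clip(np.digitize(w, edges) - 1, 0, k - 1)    # bin index of each observation

sums = np.bincount(idx, weights=w, minlength=k)
counts = np.bincount(idx, minlength=k)
bin_means = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
q = bin_means[idx]                                    # Q: each W replaced by its bin mean

print(w.mean(), q.mean())          # equal (up to rounding): E[W] = E[Q]
print(w.var(), q.var())            # var(Q) <= var(W)
print(np.mean((w - q) ** 2))       # distortion E[(W - Q)^2]
```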
  • 22. Data Quantization
  • 23. Distortion due to Quantization • Distortion is the error due to quantization. • In simple terms, E[(W-Q)²]. • Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere.
  • 24. Geometry-based Quantization • Need space-filling tessellations • Need congruent tiles • Need as spherical as possible
  • 25. Geometry-based Quantization • In one dimension – Only polytope is a straight line segment (also bounded by a one-dimensional sphere). • In two dimensions – Only polytopes are equilateral triangles, squares and hexagons
  • 26. Geometry-based Quantization • In 3 dimensions – Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron. • In 4 dimensions – 4-simplex, hypercube, 24-cell. [Figure: truncated octahedron tessellation]
  • 27. Geometry-based Quantization – Dimensionless Second Moment for 3-D Polytopes:
        Tetrahedron*            0.1040042…
        Cube*                   0.0833333…
        Octahedron              0.0825482…
        Hexagonal Prism*        0.0812227…
        Rhombic Dodecahedron*   0.0787451…
        Truncated Octahedron*   0.0785433…
        Dodecahedron            0.0781285…
        Icosahedron             0.0778185…
        Sphere                  0.0769670
  • 28. Geometry-based Quantization [Figures: tetrahedron, cube, octahedron, truncated octahedron, dodecahedron, icosahedron]
  • 29. Geometry-based Quantization [Figure: rhombic dodecahedron, from http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html]
  • 30. Geometry-based Quantization [Figures: 24-cell with cuboctahedron envelope; hexagonal prism]
  • 31. Geometry-based Quantization • Using 10⁶ bins is computationally and visually feasible. • Fast binning: for data in the range [a, b] and for k bins, j = fix[k(x_i - a)/(b - a)] gives the index of the bin for x_i in one dimension. • Computational complexity is 4n + 1 = O(n). • Memory requirements drop to 3k: location of bin, # items in bin, and representative of bin; i.e. storage complexity is 3k.
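A small numpy version of the fast binning rule above, j = fix[k(x_i - a)/(b - a)], returning the 3k stored quantities the slide names (bin location, count, and a representative value per bin). The clamp at the right endpoint and the use of the bin mean as the representative are implementation choices made here, not part of the slide.

```python
# O(n) fast binning in one dimension; storage is O(k): edge, count, representative.
import numpy as np

def fast_bin(x, a, b, k):
    j = np.floor(k * (x - a) / (b - a)).astype(int)   # j = fix[k(x - a)/(b - a)]
    j = np.clip(j, 0, k - 1)                          # send x == b to the last bin
    counts = np.bincount(j, minlength=k)
    sums = np.bincount(j, weights=x, minlength=k)
    means = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
    edges = a + (b - a) * np.arange(k) / k            # location (left edge) of each bin
    return edges, counts, means

x = np.random.default_rng(2).uniform(0.0, 10.0, size=1_000_000)
edges, counts, means = fast_bin(x, a=0.0, b=10.0, k=10**6)   # microresolution binning
```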
  • 32. Geometry-based Quantization • In two dimensions – Each hexagon is indexed by 3 parameters. – Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n). – Complexity for squares is 2 times the 1-D complexity. – Ratio is 3/2. – Storage complexity is still 3k.
  • 33. Geometry-based Quantization • In 3 dimensions – For truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides. – Computational complexity is 28n+7 = O(n). – Computational complexity for a cube is 12n+3. – Ratio is 7/3. – Storage complexity is still 3k.
  • 34. Quantization Strategies • Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions. – Complexity is always O(n). – Storage complexity is 3k. – # tiles grows exponentially with dimension, the so-called curse of dimensionality. – Higher-dimensional geometry is poorly known. – Computational complexity grows faster than for the hypercube.
  • 35. Quantization Strategies • For purposes of simplicity, always use the hypercube or d-dimensional simplices. – Computational complexity is always O(n). – Methods for data-adaptive tiling are available. – Storage complexity is 3k. – # tiles grows exponentially with dimension. – Both polytopes depart from spherical shape rapidly as d increases. – The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature.
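Since the hypercube tiling is just a multivariate histogram (the computer-science "datacube"), numpy's histogramdd gives a direct small-scale illustration; the data, dimension, and bins-per-axis below are arbitrary choices.

```python
# Hypercube quantization = multivariate histogram ("datacube").
# With d dimensions and m bins per axis, the number of tiles is m**d,
# which is the exponential growth in d noted above.
import numpy as np

rng = np.random.default_rng(3)
d, m = 3, 20                                   # dimension and bins per axis
x = rng.normal(size=(100_000, d))

counts, edges = np.histogramdd(x, bins=m)
print(counts.shape)                            # (20, 20, 20): m**d tiles
print(int(counts.sum()))                       # 100000: every case falls in some tile
```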
  • 36. Quantization Strategies • Conclusions on Geometric Quantization – Geometric approach good to 4 or 5 dimensions. – Adaptive tilings may improve rate at which # tiles grows, but probably destroy spherical structure. – Good for large n, but weaker for large d.
  • 37. Quantization Strategies • Alternate Strategy – Form bins via clustering • Known in the electrical engineering literature as vector quantization. • Distance-based clustering is O(n²), which implies poor performance for large n. • Not terribly dependent on dimension, d. • Clusters may be very out of round, not even convex. – Conclusion • The cluster approach may work for large d, but fails for large n. • Not particularly applicable to “massive” data mining.
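A minimal numpy sketch of vector quantization in the Lloyd / k-means style, as one concrete instance of forming bins via clustering; the data, codebook size, and iteration count are arbitrary. (Each Lloyd pass costs O(nk) distance evaluations; the O(n²) figure on the slide refers to pairwise distance-based clustering.)

```python
# Lloyd-style vector quantization: alternate nearest-codeword assignment and
# codeword (cluster mean) update; the codebook plays the role of the bins.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=(5_000, 8))                          # n points in d = 8 dimensions
k = 50
centers = x[rng.choice(len(x), size=k, replace=False)]   # initial codebook

for _ in range(20):
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                           # nearest codeword per point
    for j in range(k):
        members = x[labels == j]
        if len(members):
            centers[j] = members.mean(axis=0)

q = centers[labels]                                      # quantized (binned) data
distortion = np.mean(((x - q) ** 2).sum(axis=1))
```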
  • 38. Quantization Strategies • Third strategy – Density-based clustering • Density estimation with kernel estimators is O(n). • Uses modes m_α to form clusters. • Put x_i in cluster α if it is closest to mode m_α. • This procedure is distance based, but with complexity O(kn), not O(n²). • Normal mixture densities may be an alternative approach. • Roundness may be a problem. – But quantization based on density-based clustering offers promise for both large d and large n.
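As an illustration of the mode-seeking idea on this slide, scikit-learn's MeanShift estimates a kernel density and assigns each point to the mode it climbs to; it is used here only as a readily available stand-in, not as the specific procedure the lecture has in mind.

```python
# Density/mode-based clustering: each x_i ends up in the cluster of the mode m_alpha
# it ascends to under the kernel density estimate.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(5)
x = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(500, 2)),
               rng.normal(loc=3.0, scale=0.5, size=(500, 2))])   # two simulated modes

bw = estimate_bandwidth(x, quantile=0.2)      # kernel bandwidth chosen from the data
ms = MeanShift(bandwidth=bw).fit(x)

modes = ms.cluster_centers_                   # the density modes m_alpha
labels = ms.labels_                           # cluster label alpha for each x_i
print(len(modes), "modes found")
```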
  • 39. Data Quantization • Binning does not lose fine structure in the tails as sampling might. • Roundoff analysis applies. • At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data. • Discretization: a finite number of bins implies discrete variables, more compatible with categorical data.
  • 40. Data Quantization • Analysis on a finite subset of the integers has theoretical advantages – Analysis is less delicate • different forms of convergence are equivalent – Analysis is often more natural since the data is already quantized or categorical – Graphical analysis of numerical data is not much changed, since 10⁶ pixels is at the limit of the human visual system (HVS)