Statistical Data Mining: A Short Course for the Army ...


  1. 1. Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics Edward J. Wegman George Mason University Jeffrey L. Solka Naval Surface Warfare Center
  2. 2. Statistical Data Mining Agenda <ul><li>Introduction and Complexity </li></ul><ul><li>Data Preparation and Compression </li></ul><ul><li>Databases and Data Mining via Association Rules </li></ul><ul><li>Clustering, Classification, and Discrimination </li></ul><ul><li>Pattern Recognition and Intrusion Detection </li></ul><ul><li>Color Theory and Design </li></ul><ul><li>Visual Data Mining </li></ul><ul><li>CrystalVision Installation and Practice </li></ul>
  3. 3. Introduction to Data Mining
  4. 4. Introduction to Data Mining <ul><li>What is Data Mining All About </li></ul><ul><li>Hierarchy of Data Set Size </li></ul><ul><li>Computational Complexity and Feasibility </li></ul><ul><li>Data Mining Defined & Contrasted with EDA </li></ul><ul><li>Examples </li></ul>
  5. 5. Introduction to Data Mining <ul><li>Why Data Mining </li></ul><ul><li>What is Knowledge Discovery in Databases </li></ul><ul><li>Potential Applications </li></ul><ul><ul><li>Fraud Detection </li></ul></ul><ul><ul><li>Manufacturing Processes </li></ul></ul><ul><ul><li>Targeting Markets </li></ul></ul><ul><ul><li>Scientific Data Analysis </li></ul></ul><ul><ul><li>Risk Management </li></ul></ul><ul><ul><li>Web Intelligence </li></ul></ul>
  6. 6. Introduction to Data Mining <ul><li>Data Mining: On what kind of data? </li></ul><ul><ul><li>Relational Databases </li></ul></ul><ul><ul><li>Data Warehouses </li></ul></ul><ul><ul><li>Transactional Databases </li></ul></ul><ul><ul><li>Advanced </li></ul></ul><ul><ul><ul><li>Object-relational </li></ul></ul></ul><ul><ul><ul><li>Spatial, Temporal, Spatiotemporal </li></ul></ul></ul><ul><ul><ul><li>Text, www </li></ul></ul></ul><ul><ul><ul><li>Heterogeneous, Legacy, Distributed </li></ul></ul></ul>
  7. 7. Introduction to Data Mining <ul><li>Data Mining: Why now? </li></ul><ul><ul><li>Confluence of multiple disciplines </li></ul></ul><ul><ul><ul><li>Database systems, data warehouses, OLAP </li></ul></ul></ul><ul><ul><ul><li>Machine learning </li></ul></ul></ul><ul><ul><ul><li>Statistical and data analysis methods </li></ul></ul></ul><ul><ul><ul><li>Visualization </li></ul></ul></ul><ul><ul><ul><li>Mathematical programming </li></ul></ul></ul><ul><ul><ul><li>High performance computing </li></ul></ul></ul>
  8. 8. Introduction to Data Mining <ul><li>Why do we need data mining? </li></ul><ul><ul><li>Large number of records (cases) (10^8 to 10^12 bytes) </li></ul></ul><ul><ul><li>High dimensional data (variables) (10 to 10^4 attributes) </li></ul></ul> How do you explore millions of records, tens or hundreds of fields, and find patterns?
  9. 9. Introduction to Data Mining <ul><li>Why do we need data mining? </li></ul><ul><ul><ul><li>Only a small portion, typically 5% to 10%, of the collected data is ever analyzed. </li></ul></ul></ul><ul><ul><ul><li>Data that may never be explored continues to be collected out of fear that something that may prove important in the future may be missing. </li></ul></ul></ul><ul><ul><ul><li>Magnitude of data precludes most traditional analysis (more on complexity later). </li></ul></ul></ul>
  10. 10. Introduction to Data Mining <ul><li>KDD and data mining have roots in traditional database technology </li></ul><ul><ul><ul><li>As databases grow, the ability of the decision support process to exploit traditional (i.e., Boolean) query languages is limited. </li></ul></ul></ul><ul><ul><ul><ul><li>Many queries of interest are difficult or impossible to state in traditional query languages </li></ul></ul></ul></ul><ul><ul><ul><ul><li>“Find all cases of fraud in IRS tax returns.” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>“Find all individuals likely to ignore Census questionnaires.” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>“Find all documents relating to this customer’s problem.” </li></ul></ul></ul></ul>
  11. 11. Complexity
  12. 12. Complexity <ul><li>Descriptor | Data Set Size in Bytes | Storage Mode </li></ul><ul><li>Tiny | 10^2 | Piece of Paper </li></ul><ul><li>Small | 10^4 | A Few Pieces of Paper </li></ul><ul><li>Medium | 10^6 | A Floppy Disk </li></ul><ul><li>Large | 10^8 | Hard Disk </li></ul><ul><li>Huge | 10^10 | Multiple Hard Disks </li></ul><ul><li>Massive | 10^12 | Robotic Magnetic Tape Storage Silos </li></ul><ul><li>Supermassive | 10^15 | Distributed Data Archives </li></ul><ul><li>The Huber-Wegman Taxonomy of Data Set Sizes </li></ul>
  13. 13. Complexity <ul><li>O(n): Calculate Means, Variances, Kernel Density Estimates </li></ul><ul><li>O(n log(n)): Calculate Fast Fourier Transforms </li></ul><ul><li>O(nc): Calculate Singular Value Decomposition of an r x c Matrix; Solve a Multiple Linear Regression </li></ul><ul><li>O(n^2): Solve most Clustering Algorithms </li></ul><ul><li>O(a^n): Detect Multivariate Outliers </li></ul><ul><li>Algorithmic Complexity </li></ul>
  14. 14. Complexity
  15. 15. Complexity
  16. 16. Complexity
  17. 17. Complexity
  18. 18. Complexity
  19. 19. Complexity
  20. 20. Complexity
  21. 21. Complexity
  22. 22. Complexity
  23. 23. Complexity <ul><li>Scenarios </li></ul><ul><li>Typical high resolution workstations: 1280 x 1024 = 1.31 x 10^6 pixels </li></ul><ul><li>Realistic using Wegman, immersion, 4:5 aspect ratio: 2333 x 1866 = 4.35 x 10^6 pixels </li></ul><ul><li>Very optimistic using 1 minute of arc, immersion, 4:5 aspect ratio: 8400 x 6720 = 5.64 x 10^7 pixels </li></ul><ul><li>Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio: 17,284 x 13,828 = 2.39 x 10^8 pixels </li></ul>
  24. 24. Massive Data Sets <ul><li>One Terabyte Dataset </li></ul><ul><li>vs </li></ul><ul><li>One Million Megabyte Data Sets </li></ul><ul><li>Both difficult to analyze </li></ul><ul><li>but for different reasons </li></ul>
  25. 25. Massive Data Sets: Commonly Used Language <ul><li>Data Mining = DM </li></ul><ul><li>Knowledge Discovery in Databases = KDD </li></ul><ul><li>Massive Data Sets = MD </li></ul><ul><li>Data Analysis = DA </li></ul>
  26. 26. Massive Data Sets
  27. 27. Data Mining of Massive Datasets <ul><li>Data Mining is a kind of Exploratory Data Analysis with Little or No Human Interaction using Computationally Feasible Techniques, i.e., the Attempt to find Interesting Structure unknown a priori </li></ul>
  28. 28. Massive Data Sets <ul><li>Major Issues </li></ul><ul><ul><li>Complexity </li></ul></ul><ul><ul><li>Non-homogeneity </li></ul></ul><ul><li>Examples </li></ul><ul><ul><li>Huber’s Air Traffic Control </li></ul></ul><ul><ul><li>Highway Maintenance </li></ul></ul><ul><ul><li>Ultrasonic NDE (nondestructive evaluation) </li></ul></ul>
  29. 29. Massive Data Sets <ul><li>Air Traffic Control </li></ul><ul><ul><li>6 to 12 radar stations, several hundred aircraft, 64-byte record per radar per aircraft per antenna turn </li></ul></ul><ul><ul><li>A megabyte of data per minute </li></ul></ul>
  30. 30. Massive Data Sets <ul><li>Highway Maintenance </li></ul><ul><ul><li>Maintenance records and measurements of road quality spanning several decades </li></ul></ul><ul><ul><li>Records of uneven quality </li></ul></ul><ul><ul><li>Records missing </li></ul></ul>
  31. 31. Massive Data Sets <ul><li>NDE using Ultrasound </li></ul><ul><ul><li>Inspection of cast iron projectiles </li></ul></ul><ul><ul><li>Time series of length 256, 360 degrees, 550 levels = 50,688,000 observations per projectile </li></ul></ul><ul><ul><li>Several thousand projectiles per day </li></ul></ul>
  32. 32. Massive Data Sets: A Distinction <ul><li>Human Analysis of the Structure of Data and Pitfalls </li></ul><ul><li>vs </li></ul><ul><li>Human Analysis of the Data Itself </li></ul><ul><li>Limits of the human visual system (HVS) and computational complexity limit the latter </li></ul><ul><li>The former is the basis for design of the analysis engine </li></ul>
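
The feasibility argument behind slides 12 and 13 can be made concrete by combining the Huber-Wegman size classes with the complexity classes. The sketch below is only illustrative and rests on assumptions that are not in the slides: it treats the data set size in bytes as the item count n, uses a base-2 logarithm, and assumes a machine that performs 10^9 operations per second.

```python
import math

# Data set sizes in bytes from the Huber-Wegman taxonomy (slide 12).
# Assumption: the byte count is used directly as the item count n.
SIZES = {
    "Tiny": 1e2, "Small": 1e4, "Medium": 1e6, "Large": 1e8,
    "Huge": 1e10, "Massive": 1e12, "Supermassive": 1e15,
}

# Operation counts for three of the complexity classes on slide 13.
COSTS = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: n ** 2,
}

OPS_PER_SECOND = 1e9  # assumption: illustrative machine speed, not from the slides


def seconds_needed(n, cost, ops_per_second=OPS_PER_SECOND):
    """Estimated run time in seconds for a cost function applied to n items."""
    return cost(n) / ops_per_second


for name, n in SIZES.items():
    row = ", ".join(
        f"{label}: {seconds_needed(n, cost):.3g} s" for label, cost in COSTS.items()
    )
    print(f"{name:<13} n = {n:.0e}  {row}")
```

Under these assumptions an O(n^2) clustering pass over a massive (10^12) data set needs about 10^15 seconds, roughly 30 million years, while an O(n) pass over the same data takes under 20 minutes. That gap is the reason the course stresses computationally feasible techniques.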
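
The data volumes quoted on slides 29 and 31 follow from simple products, and a short sketch shows where they come from. The function names are hypothetical, and two inputs are assumptions rather than figures from the slides: 300 aircraft stands in for "several hundred", and 5 antenna turns per minute (one turn every 12 seconds) is an assumed rotation rate.

```python
def atc_bytes_per_minute(stations, aircraft, record_bytes=64, turns_per_minute=5):
    """Bytes of radar data per minute for the air traffic control example.

    One record is produced per radar, per aircraft, per antenna turn.
    turns_per_minute is an assumption; the slides do not give a rotation rate.
    """
    return stations * aircraft * record_bytes * turns_per_minute


def nde_observations(series_length=256, angles=360, levels=550):
    """Observations per projectile for the ultrasonic inspection example."""
    return series_length * angles * levels


# Assumed aircraft count of 300 for "several hundred".
low = atc_bytes_per_minute(stations=6, aircraft=300)
high = atc_bytes_per_minute(stations=12, aircraft=300)
print(f"Air traffic control: {low:,} to {high:,} bytes per minute")

per_projectile = nde_observations()
print(f"Ultrasonic NDE: {per_projectile:,} observations per projectile")
```

With these assumed inputs the radar stream comes out at roughly half a megabyte to a little over one megabyte per minute, in line with the slide, and the per-projectile count reproduces the 50,688,000 observations stated on slide 31.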
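
Slide 32's point that the human visual system limits direct inspection of the data can be illustrated with the display scenarios from slide 23. The sketch below is a rough illustration, not a method from the course: it computes the pixel count for each scenario and, as an assumed measure of crowding, how many bytes of a massive (10^12 byte) data set would have to share each pixel.

```python
# Display scenarios from slide 23 as (width, height) in pixels.
SCENARIOS = {
    "Typical workstation": (1280, 1024),
    "Realistic (Wegman)": (2333, 1866),
    "Very optimistic (1 minute of arc)": (8400, 6720),
    "Wildly optimistic (Maar)": (17284, 13828),
}

MASSIVE_BYTES = 1e12  # "Massive" class in the Huber-Wegman taxonomy


def pixel_count(width, height):
    """Total number of pixels on a width x height display."""
    return width * height


def bytes_per_pixel(dataset_bytes, width, height):
    """Bytes of data that would have to share each pixel (assumed crowding measure)."""
    return dataset_bytes / pixel_count(width, height)


for name, (w, h) in SCENARIOS.items():
    print(
        f"{name}: {pixel_count(w, h):,} pixels, "
        f"{bytes_per_pixel(MASSIVE_BYTES, w, h):,.0f} bytes per pixel"
    )
```

Even the most optimistic scenario leaves thousands of bytes competing for every pixel, which is consistent with the slide's conclusion that human analysis should target the structure of the data rather than the data itself.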
