Statistical Data Mining: A Short Course for the Army ...

  • Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics. Edward J. Wegman, George Mason University; Jeffrey L. Solka, Naval Surface Warfare Center
  • Statistical Data Mining Agenda
    • Introduction and Complexity
    • Data Preparation and Compression
    • Databases and Data Mining via Association Rules
    • Clustering, Classification, and Discrimination
    • Pattern Recognition and Intrusion Detection
    • Color Theory and Design
    • Visual Data Mining
    • CrystalVision Installation and Practice
  • Introduction to Data Mining
  • Introduction to Data Mining
    • What is Data Mining All About
    • Hierarchy of Data Set Size
    • Computational Complexity and Feasibility
    • Data Mining Defined & Contrasted with EDA
    • Examples
  • Introduction to Data Mining
    • Why Data Mining
    • What is Knowledge Discovery in Databases
    • Potential Applications
      • Fraud Detection
      • Manufacturing Processes
      • Targeting Markets
      • Scientific Data Analysis
      • Risk Management
      • Web Intelligence
  • Introduction to Data Mining
    • Data Mining: On what kind of data?
      • Relational Databases
      • Data Warehouses
      • Transactional Databases
      • Advanced
        • Object-relational
        • Spatial, Temporal, Spatiotemporal
        • Text, www
        • Heterogeneous, Legacy, Distributed
  • Introduction to Data Mining
    • Data Mining: Why now?
      • Confluence of multiple disciplines
        • Database systems, data warehouses, OLAP
        • Machine learning
        • Statistical and data analysis methods
        • Visualization
        • Mathematical programming
        • High performance computing
  • Introduction to Data Mining
    • Why do we need data mining?
      • Large number of records (cases) (10^8 to 10^12 bytes)
      • High dimensional data (variables) (10 to 10^4 attributes)
    How do you explore millions of records, tens or hundreds of fields, and find patterns?
  • Introduction to Data Mining
    • Why do we need data mining?
        • Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
        • Data that may never be explored continue to be collected for fear that something that later proves important would otherwise be missing.
        • Magnitude of data precludes most traditional analysis (more on complexity later).
  • Introduction to Data Mining
    • KDD and data mining have roots in traditional database technology
        • As databases grow, the ability of the decision support process to exploit traditional (i.e., Boolean) query languages becomes limited.
          • Many queries of interest are difficult/impossible to state in traditional query languages
          • “Find all cases of fraud in IRS tax returns.”
          • “Find all individuals likely to ignore Census questionnaires.”
          • “Find all documents relating to this customer’s problem.”
  • Complexity
  • Complexity
    • The Huber-Wegman Taxonomy of Data Set Sizes
    • Descriptor | Data Set Size in Bytes | Storage Mode
    • Tiny | 10^2 | Piece of Paper
    • Small | 10^4 | A Few Pieces of Paper
    • Medium | 10^6 | A Floppy Disk
    • Large | 10^8 | Hard Disk
    • Huge | 10^10 | Multiple Hard Disks
    • Massive | 10^12 | Robotic Magnetic Tape Storage Silos
    • Supermassive | 10^15 | Distributed Data Archives
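The taxonomy above can be turned into a small lookup. The following is a minimal sketch, not part of the course material: the function name `describe_size` is hypothetical, and assigning a size to the descriptor whose nominal power of ten is nearest on a log scale is an assumption, since the slide lists only nominal sizes and no boundaries.

```python
import math

# Nominal sizes from the Huber-Wegman taxonomy, as powers of ten in bytes.
TAXONOMY = [
    (2, "Tiny"),
    (4, "Small"),
    (6, "Medium"),
    (8, "Large"),
    (10, "Huge"),
    (12, "Massive"),
    (15, "Supermassive"),
]


def describe_size(n_bytes: int) -> str:
    """Return the descriptor whose nominal size is nearest in order of magnitude.

    The nearest-on-a-log-scale rule is an illustrative assumption; the
    taxonomy itself does not define class boundaries.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    exponent = math.log10(n_bytes)
    return min(TAXONOMY, key=lambda row: abs(row[0] - exponent))[1]


for size in (100, 10**6, 5 * 10**9, 10**12):
    print(size, describe_size(size))
```

Under this rule a 5-gigabyte file falls in the "Huge" class, which matches the "multiple hard disks" storage mode of the era the taxonomy describes.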
  • Complexity
    • Algorithmic Complexity
    • O(n): Calculate Means, Variances, Kernel Density Estimates
    • O(n log(n)): Calculate Fast Fourier Transforms
    • O(nc): Calculate Singular Value Decomposition of an r x c Matrix; Solve a Multiple Linear Regression
    • O(n^2): Solve most Clustering Algorithms
    • O(a^n): Detect Multivariate Outliers
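To get a feel for how these complexity classes interact with the data set sizes in the taxonomy, the sketch below tabulates nominal operation counts and run times. It is an illustration under stated assumptions, not the course's own feasibility table: the machine speed `ASSUMED_FLOPS`, the column count `c = 10`, and the exponential base `a = 2` are all hypothetical values, and constant factors are ignored.

```python
import math


def log10_operations(n: int, c: int = 10, a: float = 2.0) -> dict:
    """Return log10 of the nominal operation count for each complexity class.

    Working in log10 keeps the exponential class from overflowing a float.
    c (number of columns) and a (exponential base) are illustrative values.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    log_n = math.log10(n)
    return {
        "O(n)": log_n,
        "O(n log n)": log_n + math.log10(math.log2(n)),
        "O(nc)": log_n + math.log10(c),
        "O(n^2)": 2 * log_n,
        "O(a^n)": n * math.log10(a),
    }


ASSUMED_FLOPS = 1e9  # hypothetical machine speed, operations per second

for exponent in (2, 4, 6, 8, 10, 12):
    n = 10 ** exponent
    for label, log_ops in log10_operations(n).items():
        log_seconds = log_ops - math.log10(ASSUMED_FLOPS)
        print(f"n = 10^{exponent:<2}  {label:<10}  about 10^{log_seconds:.1f} seconds")
```

Even at this assumed speed, the quadratic class passes 10^7 seconds (roughly four months) at n = 10^8, and the exponential class is out of reach for all but the tiniest data sets.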
  • Complexity
    • Scenarios
    • Typical high resolution workstations, 1280x1024 = 1.31x10^6 pixels
    • Realistic using Wegman, immersion, 4:5 aspect ratio, 2333x1866 = 4.35x10^6 pixels
    • Very optimistic using 1 minute arc, immersion, 4:5 aspect ratio, 8400x6720 = 5.64x10^7 pixels
    • Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio, 17,284x13,828 = 2.39x10^8 pixels
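The pixel counts in these scenarios are simple products of the display dimensions and can be checked directly. The sketch below does only that arithmetic; the dictionary labels are shortened versions of the slide's wording, and the function name `pixel_count` is hypothetical.

```python
# Display dimensions (width, height) for each scenario on the slide.
SCENARIOS = {
    "Typical high resolution workstation": (1280, 1024),
    "Realistic (Wegman, immersion)": (2333, 1866),
    "Very optimistic (1 minute arc, immersion)": (8400, 6720),
    "Wildly optimistic (Maar(2), immersion)": (17284, 13828),
}


def pixel_count(width: int, height: int) -> int:
    """Total number of addressable pixels on a width x height display."""
    return width * height


for name, (width, height) in SCENARIOS.items():
    pixels = pixel_count(width, height)
    # height / width should be close to 0.8 for a 4:5 aspect ratio.
    print(f"{name}: {width} x {height} = {pixels:,} pixels "
          f"(aspect {height / width:.3f})")
```

Comparing these counts with the data set sizes in the taxonomy shows why even the most optimistic display cannot devote one pixel to each observation of a massive data set.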
  • Massive Data Sets
    • One terabyte data set vs. one million one-megabyte data sets
    • Both are difficult to analyze, but for different reasons
  • Massive Data Sets: Commonly Used Language
    • Data Mining = DM
    • Knowledge Discovery in Databases = KDD
    • Massive Data Sets = MD
    • Data Analysis = DA
  • Data Mining of Massive Datasets
    • Data Mining is a kind of Exploratory Data Analysis with Little or No Human Interaction using Computationally Feasible Techniques, i.e., the Attempt to find Interesting Structure unknown a priori
  • Massive Data Sets
    • Major Issues
      • Complexity
      • Non-homogeneity
    • Examples
      • Huber’s Air Traffic Control
      • Highway Maintenance
      • Ultrasonic NDE
  • Massive Data Sets
    • Air Traffic Control
      • 6 to 12 Radar stations, several hundred aircraft, 64-byte record per radar per aircraft per antenna turn
      • megabyte of data per minute
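The quoted data rate can be reproduced with back-of-the-envelope arithmetic. The sketch below is hedged accordingly: the slide gives the number of radar stations, the record size, and "several hundred" aircraft, but not the antenna rotation rate, so the 300 aircraft and 5 antenna turns per minute used here are hypothetical values chosen only for illustration.

```python
def bytes_per_minute(radars: int, aircraft: int,
                     record_bytes: int = 64,
                     turns_per_minute: float = 5.0) -> float:
    """Data volume per minute: one record per radar per aircraft per antenna turn.

    record_bytes comes from the slide; turns_per_minute is an assumed
    antenna rotation rate that the slide does not state.
    """
    return radars * aircraft * record_bytes * turns_per_minute


# 6 to 12 radar stations (from the slide), 300 aircraft (assumed).
for radars in (6, 12):
    rate = bytes_per_minute(radars, aircraft=300)
    print(f"{radars} radars: {rate / 1e6:.2f} megabytes per minute")
```

With these assumed values the rate lands between roughly half a megabyte and one megabyte per minute, the same order of magnitude as the figure on the slide.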
  • Massive Data Sets
    • Highway Maintenance
      • Maintenance records and measurements of road quality spanning several decades
      • Records of uneven quality
      • Records missing
  • Massive Data Sets
    • NDE using Ultrasound
      • Inspection of cast iron projectiles
      • Time series of length 256, 360 degrees, 550 levels = 50,688,000 observations per projectile
      • Several thousand projectiles per day
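The observation count per projectile is the product of the three dimensions given on the slide. The sketch below checks it and scales it to a daily volume; the figure of 3,000 projectiles per day is a hypothetical stand-in for "several thousand", and the function names are illustrative.

```python
def observations_per_projectile(series_length: int = 256,
                                angles: int = 360,
                                levels: int = 550) -> int:
    """Observations per projectile: one time series per angle per level."""
    return series_length * angles * levels


def observations_per_day(projectiles_per_day: int) -> int:
    """Daily observation count for a given inspection throughput."""
    return observations_per_projectile() * projectiles_per_day


ASSUMED_PROJECTILES_PER_DAY = 3000  # hypothetical; the slide says "several thousand"

print(f"{observations_per_projectile():,} observations per projectile")
print(f"{observations_per_day(ASSUMED_PROJECTILES_PER_DAY):,} observations per day")
```

At this assumed throughput the inspection line produces on the order of 10^11 observations per day, which places a single day's output between the "Huge" and "Massive" classes of the taxonomy even before storage overhead is counted.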
  • Massive Data Sets: A Distinction
    • Human Analysis of the Structure of Data and Pitfalls vs. Human Analysis of the Data Itself
    • Limits of the human visual system (HVS) and computational complexity limit the latter
    • The former is the basis for the design of the analysis engine