NPACI AHM 2001 Tutorial on Data Mining for Scientific Applications

Presentation Transcript

  • NPACI AHM 2001 Tutorial on Data Mining for Scientific Applications. Chaitan Baru, Tony Fountain, San Diego Supercomputer Center
  • Tutorial Objectives
    • Provide overview of the infrastructure – technologies and techniques – for:
      • data mining, database systems
    • Provide some illustrative examples of how the infrastructure can be used in scientific applications
    • Present plans for the SDSC Knowledge and Information Discovery Lab (SKIDL)
    • Identify potential collaborations – for applications as well as infrastructure
    • Our emphasis is on the infrastructure
  • Tutorial Outline
    • 8:00 - 8:15 Data intensive computing in NPACI (Baru)
    • 8:15 - 9:15 Introduction to data mining (Fountain)
    • 9:15 - 10:15 DBMS support for analysis of large-scale data (Baru)
    • 10:15 - 10:30 BREAK
    • 10:30 - 12:00 Examples of data mining tools (Fountain)
    • Next steps...
  • NPACI DICE
    • Focus on data, information, and knowledge management:
      • Persistent archives
        • Use of XML and archival storage systems (e.g. HPSS) for data storage
      • Metadata-based access to data sets (Extensible Metadata Catalog, eMCAT)
      • Distributed data handling (Storage Resource Broker, SRB)
      • Information mediation (Mediation of Information using XML, MIX)
      • Model-based mediation (NeuroMIX), use of Topic Maps
  • HPSS
    • Capacity
      • Total >400TB
      • Current usage: >240TB stored
    • Load
      • Transfer rate: 1TB/day
    • SRB provides a “container” mechanism for better usage and improved efficiency
  • The SDSC Storage Resource Broker (SRB)
    • [Architecture: applications (SRB clients) access distributed storage resources (DB2, Oracle, ObjectStore; HPSS, UniTree; UNIX file systems, ftp) through the SRB middleware of SRB servers and the MCAT catalog]
    • Metadata-based access to data sets stored in distributed, heterogeneous storage resources
    • Platforms: Solaris, Linux, NT, AIX, HP-UX, IRIX
  • Current Usage of SRB
    • Collections
      • Digital Sky: ~4TB, ~8 million files
      • Digital Embryo: ~700GB, millions of files
      • Digital library collections (ADL, UCB, Michigan): ~1 million files
      • HyperLTER – hyperspectral data
      • Particle Physics Data Grid
    • Upcoming collections
      • SLAC...
  • Mediation of Information using XML (MIX)
    • [Architecture: data sources exposed as XML views through wrappers, integrated by the MIXm mediator]
    • Blended Browsing and Querying (BBQ) interface
    • Definition of mediated view in XML
    • XML Matching And Structuring (XMAS) query language
    • Lazy evaluation of XMAS queries using DOM-VXD
  • From data management infrastructure to knowledge discovery infrastructure
    • The Affymetrix story
      • “Technology built for Wall Street helps bioinformatics companies as well…”
    • The “scientist in the middle”
    • The infrastructure is a tool to help the scientist, not a replacement!
  • The Infrastructure Supports:
    • Exploratory data analysis of large data sets
      • efficient ad hoc statistical processing
    • Parallel data access, subsetting, and analysis
    • Data intensive approach to model building and verification
      • including, fusion of different forms of data (e.g. database tables, instrument outputs, remote sensing data, maps, …)
    • – Employ, and build upon, existing (commercial, freeware) tools and software packages, as much as possible
  • The SDSC Knowledge and Information Discovery Lab (SKIDL)
    • Initial hardware platform
      • 2-processor Sun, 512MB memory, 36 GB local disk
      • Upgrade to:
        • 20 processor Sun, 6 GB memory, 400 GB local disk
        • Access to additional disk storage via storage area network (SAN)
      • Possible further upgrade (via CalIT2)
        • Additional 4 GB memory, 1 TB SAN disk, Gigabit Ethernet capability
    • Software
      • High-performance, parallel database systems and file systems
        • DB2
        • Oracle, GPFS
      • Suite of data mining tools
        • Intelligent Miner, MineSet, Bayesian network tools
        • S-Plus, Darwin, Clementine, SAS
      • Presentation, visualization: ESRI ArcIMS, ...
  • Data Mining Tony Fountain NPACI ESS SDSC Knowledge & Information Discovery Lab
  • Overview (DM101)
    • Part 1:
      • Definition
      • Motivations
      • Methods, Techniques, & Tools
    • Part 2:
      • Examples & Demos
      • Data Mining to Decision Support
  • Overview (DM101)
    • Part 1:
      • Definition
      • Motivations
      • Methods, Techniques, & Tools
    • Database 605 – Chaitan Baru
    • Part 2:
      • Examples & Demos
      • Data Mining to Decision Support
  • Outline (DM101)
    • Part 1 – What is data mining?
      • Direct
      • Contributions from other disciplines
      • Motivations & context
      • Example applications
      • Analytical methods:
        • Association Rules
        • Classification & Prediction
        • Clustering
        • OLAP
      • MSU data set
  • Definition
    • The search for interesting patterns…
  • Definition
    • The search for interesting patterns,
    • in large databases…
  • Definition
    • The search for interesting patterns,
    • in large databases,
    • that were collected for other applications…
  • Definition
    • The search for interesting patterns,
    • in large databases,
    • that were collected for other applications,
    • using machine learning algorithms…
  • Definition
    • The search for interesting patterns,
    • in large databases,
    • that were collected for other applications,
    • using machine learning algorithms,
    • and high-performance computers…
  • Definition
    • The search for interesting patterns,
    • in large databases,
    • that were collected for other applications,
    • using machine learning algorithms,
    • and high-performance computers,
    • for fun and profit!
  • Definition
    • The search for interesting patterns,
    • in large databases,
    • that were collected for other applications,
    • using machine learning algorithms,
    • and high-performance computers,
    • for science and society!
  • KDD Process: Knowledge Discovery and Data Mining
    • Collection
    • Processing/Cleansing/Correction/Formatting
    • Mining/Analysis/Modeling
    • Presentation/Visualization
    • Application/Decision Support
    • Management/Integration/Warehousing
  • Data Mining & Knowledge Discovery KD, KDD, KDD(D)*
    • What’s in a name?
        • Database
        • Data Mining
        • Discovery
        • Derivation
        • Decision Support
  • Contributions to Data Mining
    • Artificial Intelligence
    • High Performance Computing
    • Statistics
    • Database Systems
    • Operations Research
    • GIS
    • Visualization
  • The Case for Data Mining: Data Reality
    • Controlled experimental data collection is an ideal
    • Legacy archives and independent collection activities
    • Deluge from new sources
      • Remote Sensing
      • Instrumentation & Wireless Communications
      • Simulation Models
    • Growth of data collections vs. analysts
    • Many types of data, many uses, many types of queries
    • Advances in computational infrastructure provide new opportunities for access and integration
    • Paradigm shift: hypothesis-driven data collection to data mining (KDD)
  • The Revolution in Ecology
    • Computational Ecology and Eco-Informatics
    • Instrumentation & Remote Sensing
      • Amphibian urls and hyperspectral data
      • Tropical glaciers in Ohio
    • Computer Simulations
      • Coupled biogeochemistry, ocean, atmosphere…
    • Ecology without boots!
  • Classic Applications - Commercial
    • Fraud Detection – credit card
    • Churning – long-distance carriers
    • Targeted Marketing – customer profiles
    • Stock Market – futures trading
    • Market Basket Analysis
    • Soon to be classic: FL 2000 election
  • Classic Applications - Science
    • Volcanoes on Venus - Classification
      • Burl et al., NASA, Caltech
    • Astronomical clustering – Autoclass, Bayesian Clustering
      • Cheeseman, Stutz, NASA
    • Oil spills from remote sensing data – Decision Trees
      • Kubat et al., Ottawa
    • Biodiversity analysis – Genetic algorithms, Bayesian Nets
      • Stockwell, SDSC/UCSD
    • … ???
  • Classic Applications - Science
    • Volcanoes on Venus - Classification
      • Burl et al., NASA, Caltech
    • Astronomical clustering – Autoclass, Bayesian Clustering
      • Cheeseman, Stutz, NASA
    • Oil spills from remote sensing data – Decision Trees
      • Kubat et al., Ottawa
    • Biodiversity analysis – Genetic algorithms, Bayesian Nets
      • Stockwell, SDSC/UCSD
    • YOUR NAME HERE!! (1800-SKIDLME)
  • Data Mining Tools (suites)
    • SPSS - Clementine
      • http://www.spss.com/clementine/
    • Oracle - Darwin
      • http://www.oracle.com/ip/analyze/warehouse/datamining/
    • SGI - MineSet
      • http://www.sgi.com/software/mineset/
    • IBM - Intelligent Miner
      • http://www-4.ibm.com/software/data/iminer/fordata/
    • http://www.kdnuggets.com/software/index.html
  • Data Mining Analytical Techniques (patterns, hypotheses, models)
    • Statistical Methods
      • Descriptive, Modeling, Data Reduction…
    • Associations
      • Simple relations in categorical data
    • Classification & Prediction
      • Model induction - Supervised learning
    • Clustering
      • Concept discovery - Unsupervised learning
  • Association Rule Mining
    • Associations
      • Simple rules in categorical data
    • Sample applications
      • Market Basket Analysis
      • Buys(Milk) => Buys(Eggs)
      • Transaction Processing
      • Income(Hi) & Single(Y) => Owns(Computer)
    • Search for Strong Rules
      • Support(A => B) = P(A & B), the fraction of transactions containing both A and B
      • Confidence(A => B) = P(B | A) = P(A & B) / P(A)
  • Association Rule Mining – Example
    • Transactions (animals observed, temperature):
      • Bird, Lion, Snake, 70
      • Hyena, Lion, Bird, 70
      • Antelope, Snake, Tiger, 70
      • Snake, Bird, Hyena, 80
      • Lion, Antelope, Bird, 70
    • Rule R1: [70 => (Bird & Lion)]
      • Support: P(Bird & Lion & 70) = 3/5 = 60%
      • Confidence: P((Bird & Lion) | 70) = P(Bird & Lion & 70) / P(70) = (3/5) / (4/5) = 75%
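    • A minimal Python sketch of these support and confidence calculations over the toy transactions above (illustration only; the temperature reading is treated as just another item):
      # Toy transactions from the slide; each is a set of items.
      transactions = [
          {"Bird", "Lion", "Snake", "70"},
          {"Hyena", "Lion", "Bird", "70"},
          {"Antelope", "Snake", "Tiger", "70"},
          {"Snake", "Bird", "Hyena", "80"},
          {"Lion", "Antelope", "Bird", "70"},
      ]

      def support(itemset):
          """Fraction of transactions that contain every item in `itemset`."""
          hits = sum(1 for t in transactions if itemset <= t)
          return hits / len(transactions)

      def confidence(antecedent, consequent):
          """P(consequent | antecedent) = support(A and B) / support(A)."""
          return support(antecedent | consequent) / support(antecedent)

      A, B = {"70"}, {"Bird", "Lion"}
      print("support(70 => Bird & Lion)    =", support(A | B))     # 3/5 = 0.6
      print("confidence(70 => Bird & Lion) =", confidence(A, B))   # 0.75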
  • Classification
    • Classification and prediction
      • Create model for distinguishing concepts
      • Labeled training data
      • Metrics based on accuracy rates and cross-validation
    • Numerous methods
      • Decision trees
      • Neural Nets
      • Bayesian Networks
      • Regression
    • Many applications
      • Identifying credit risks
      • Predicting biological productivity
      • Medical diagnosis
      • Classifying toxic risks…
  • Classification – Decision Tree
    • Training data (Precipitation, Ecosystem): (63, Prairie), (116, Forest), (5, Desert), (104, Forest), (120, Forest), (2, Desert)
  • Classification – Decision Tree
    • First split on Precipitation < 60:
      • Precipitation < 60: (5, Desert), (2, Desert)
      • Precipitation >= 60: (104, Forest), (63, Prairie), (116, Forest), (120, Forest)
  • Classification – Decision Tree
    • Second split on Precip < 100 (within the Precip >= 60 branch):
      • 60 <= Precip < 100: (63, Prairie)
      • Precip >= 100: (104, Forest), (116, Forest), (120, Forest)
  • Classification – Decision Tree
    • Resulting rule: IF (Precip < 60) THEN Desert ELSE IF (Precip < 100) THEN Prairie ELSE Forest
  • Pruned Decision Tree
    • Single split on Precipitation < 60:
      • Precipitation < 60: (5, Desert), (2, Desert)
      • Precipitation >= 60: (104, Forest), (63, Prairie), (116, Forest), (120, Forest)
  • Pruned Decision Tree
    • Resulting rule: IF (Precip < 60) THEN Desert ELSE [P(Forest) = 0.75 and P(Prairie) = 0.25]
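    • A minimal Python sketch of inducing a one-split tree (a decision stump) on the toy (precipitation, ecosystem) data above by exhaustive threshold search; illustration only, not the tutorial's mining tools. On this data the search picks the threshold 63 rather than 60 but yields the same structure as the pruned tree:
      from collections import Counter

      data = [(63, "Prairie"), (116, "Forest"), (5, "Desert"),
              (104, "Forest"), (120, "Forest"), (2, "Desert")]

      def majority(rows):
          """Most common class label among the rows."""
          return Counter(label for _, label in rows).most_common(1)[0][0]

      def best_stump(rows):
          """Try a split at each observed value; keep the split with the fewest errors."""
          best = None
          for threshold in sorted({p for p, _ in rows}):
              left = [r for r in rows if r[0] < threshold]
              right = [r for r in rows if r[0] >= threshold]
              if not left or not right:
                  continue
              l_label, r_label = majority(left), majority(right)
              errors = sum(lab != l_label for _, lab in left) + \
                       sum(lab != r_label for _, lab in right)
              if best is None or errors < best[0]:
                  best = (errors, threshold, l_label, r_label)
          return best

      errors, threshold, l_label, r_label = best_stump(data)
      print(f"IF precip < {threshold} THEN {l_label} ELSE {r_label}  ({errors} training error(s))")
      # -> IF precip < 63 THEN Desert ELSE Forest (1 training error: the single Prairie case)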
  • Clustering
    • Cluster Analysis – Concept Discovery
      • Create models for discovered concepts
      • No known class labels
      • Metrics based on cluster similarity
    • Numerous methods
      • K-means (partitioning)
      • Bayesian Networks
      • Hierarchical clustering
      • Neural Networks
    • Example applications
      • Identifying common subpopulations
      • Creating taxonomies (biological, manufacturing, commerce)
      • Discovering failure patterns in manufactured parts
      • Locating environmental risk areas…
  • Clustering – K-Means
    • Example points (Precipitation, Temperature): (49, 32), (76, 17), (45, 49), (63, 62), (70, 71), (81, 8)
  • Clustering – K-Means: discovered clusters
    • C3: Temperature 50-80, Precipitation 50-80
    • C2: Temperature 25-55, Precipitation 35-60
    • C1: Temperature 0-25, Precipitation 70-85
  • Clustering – K-Means: clusters interpreted as ecosystems
    • C3: Temperature 50-80, Precipitation 50-80, Ecosystem Forest
    • C2: Temperature 25-55, Precipitation 35-60, Ecosystem Prairie
    • C1: Temperature 0-25, Precipitation 70-85, Ecosystem Desert
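    • A minimal plain-Python K-Means sketch run on the example (precipitation, temperature) points above with k = 3 (illustration only; real runs would use tools such as those listed later):
      import math, random

      points = [(49, 32), (76, 17), (45, 49), (63, 62), (70, 71), (81, 8)]

      def kmeans(points, k, iterations=20, seed=0):
          random.seed(seed)
          centroids = random.sample(points, k)          # start from k of the points
          for _ in range(iterations):
              # Assignment step: attach each point to its nearest centroid.
              clusters = [[] for _ in range(k)]
              for p in points:
                  i = min(range(k),
                          key=lambda c: math.hypot(p[0] - centroids[c][0],
                                                   p[1] - centroids[c][1]))
                  clusters[i].append(p)
              # Update step: move each centroid to the mean of its cluster.
              for i, cluster in enumerate(clusters):
                  if cluster:
                      centroids[i] = tuple(sum(v) / len(cluster) for v in zip(*cluster))
          return centroids, clusters

      for centroid, members in zip(*kmeans(points, k=3)):
          print(f"centroid {centroid} -> {members}")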
  • On-line Analytical Processing (OLAP)
    • On-line Transaction Processing (OLTP) vs. OLAP
      • Analysis & decision support are more compute intensive
    • Concept hierarchies - (representing forests & trees)
      • Space: site, county, state, country…
      • Time: day, week, month….
      • Taxonomic hierarchies …
    • Methods: rules, explicit specification, clustering
    • Multidimensional data & efficient access/selection
    • Operations: slice, dice, roll up, drill down, pivot
  • Concept Hierarchy for Precipitation
    • 0-12 inches
      • low (0-3): 0-1, 2-3
      • med (4-8): 4-6, 7-8
      • high (9-12): 9-10, 11-12
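    • A small sketch of representing this concept hierarchy and rolling a raw precipitation value up to coarser levels (bin boundaries taken from the hierarchy above; illustration only):
      # Leaf bins grouped under their mid-level concepts (inches of precipitation).
      HIERARCHY = {
          "low (0-3)":   ["0-1", "2-3"],
          "med (4-8)":   ["4-6", "7-8"],
          "high (9-12)": ["9-10", "11-12"],
      }

      def leaf_bin(precip_inches):
          """Map a raw value to its leaf bin label."""
          for leaves in HIERARCHY.values():
              for leaf in leaves:
                  lo, hi = (int(x) for x in leaf.split("-"))
                  if lo <= precip_inches <= hi:
                      return leaf
          raise ValueError("value outside the 0-12 inch hierarchy")

      def roll_up(precip_inches):
          """Return the value at every level, finest to coarsest."""
          leaf = leaf_bin(precip_inches)
          level = next(lvl for lvl, leaves in HIERARCHY.items() if leaf in leaves)
          return leaf, level, "0-12 inches"

      print(roll_up(7))   # ('7-8', 'med (4-8)', '0-12 inches')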
  • OLAP Examples
    • Slice
      • For (precip = “4-8 inches”)
    • Dice
      • For (precip = “4-8 inches” AND week = “120”)
    • Drill down (specification)
      • On time from months to weeks
    • Roll up (generalization, summarization)
      • On Space from counties to states
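    • A small sketch of slice, dice, and roll up on plain Python records (the field names and values here are illustrative, not the exact MSU schema):
      rows = [
          {"region": "A", "week": 119, "precip": "4-8 inches", "solrad": 120.0},
          {"region": "A", "week": 120, "precip": "4-8 inches", "solrad": 135.0},
          {"region": "B", "week": 120, "precip": "0-3 inches", "solrad": 150.0},
      ]

      # Slice: fix a single dimension.
      slice_ = [r for r in rows if r["precip"] == "4-8 inches"]

      # Dice: fix a sub-cube on two (or more) dimensions.
      dice = [r for r in rows if r["precip"] == "4-8 inches" and r["week"] == 120]

      # Roll up: aggregate away a dimension (sum solrad per region over all weeks).
      rollup = {}
      for r in rows:
          rollup[r["region"]] = rollup.get(r["region"], 0.0) + r["solrad"]

      print(len(slice_), len(dice), rollup)   # 2 1 {'A': 255.0, 'B': 150.0}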
  • MSU Data Set
    • Agricultural productivity simulation
      • Integrates land use, climate, ecosystem data
      • Remote sensing, computer simulations, field observations
    • Inputs – geographic & climatic parameters
      • Max and min temperatures
      • Solar radiation
      • Precipitation ….
    • Outputs – ecosystem
      • Leaf area index
      • Crop yield
      • Soil Water …
  • Statistics of MSU Simulation Data
      • 20 years, daily records
      • 1053 regions
      • 5 million rows
      • Approx 300MB
      • Stuart Gage, MSU Computational Ecology and Visualization Lab
      • http://www.cevl.msu.edu/index.html
  • Example: DBMS support for OLAP
    • SQL support for rollup
      • SELECT region, week, day_of_week, sum(solrad) FROM msu.details_table
      • GROUP BY ROLLUP (region, week, day_of_week)
      • ORDER BY region, week, day_of_week
    • Output is summation of solrad by
      • (region, week, day_of_week)
      • (region, week, –)
      • (region, –, –)
      • (–, –, –)
  • DBMS Support for Large Data Analysis
    • Large database support
    • Parallel processing
    • OLAP functions
    • New data types, object extensions, spatial data, XML…
    • Distributed databases
  • Dealing with large databases
    • In the beginning…
      • Database size ≤ max logical filesystem size (2GB on UNIX)
    • Tablespaces
      • A tablespace can have multiple tablespace containers
      • Size of tablespace container ≤ max filesystem size
    [Diagram: originally a whole database mapped onto a single filesystem holding tables T1, T2, T3, ...; with tablespaces, T1 and T3 are placed in Tablespace1 and T2 in Tablespace2, each spread across several filesystems]
  • Tablespaces...
    • Different types of tablespace containers
      • DBMS managed (“raw”)
      • File system managed (“cooked”)
    • Different types of tablespaces
      • Regular data and indexes (typical max size of 64GB)
      • Large objects (LOB’s) and temporary data (typical max is 2TB)
    • Larger page sizes for containers (4K to 32K)
      • Max. TS size for regular data increases to 512GB
    • What if a given table is > 512GB?
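    • A quick back-of-the-envelope check of the limits above, assuming a 2^24-page addressing limit for regular tablespaces (an assumption consistent with the 64GB-at-4KB and 512GB-at-32KB figures on this slide):
      MAX_PAGES = 2 ** 24   # assumed per-tablespace page limit

      for page_kb in (4, 8, 16, 32):
          max_bytes = MAX_PAGES * page_kb * 1024
          print(f"{page_kb:2d}KB pages -> max regular tablespace ~ {max_bytes / 2**30:.0f}GB")
      # 4KB -> 64GB, 8KB -> 128GB, 16KB -> 256GB, 32KB -> 512GB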
  • Loading large databases
    • The relevant industry benchmark is TPC-H (www.tpc.org)
    • Evolved from TPC-D Benchmark
      • First audited benchmark was performed in December 1995
      • 100GB database, 32-node IBM SP
    • Current largest benchmark runs are for 1TB database
    • Largest table in benchmark has
        • ~ 70% of data (700GB)
        • 6 billion rows
    • Measures single user performance (“power metric”) and multi-user performance (“throughput metric”)
  • Large Database Benchmarks
    • Results from IBM, April 2000
      • Loading 1TB database takes about 7.5 hours
      • Total disk = 9.7 times the raw size of database
      • Hardware configuration
        • 32 4-way IBM SP nodes, 4GB/node (128GB), 35x9GB disks/node
      • Total 5-year cost of system: $9.3M
      • Power: 12,812; QphH: 12,867; Price/perf: $725
    • Results from HP, Feb. 13th, 2001
      • Loading database takes 5.25 hours
      • Total disk = 10.2 times raw size of database
      • Hardware configuration
        • 64 processor Superdome, 96GB memory, 3 disk arrays with 558 18.2GB drives
      • Total 5-year cost of system: $9.6M
      • Power: 13,730; QphH: 9,755; Price/perf: $985
  • Large Database Benchmarks
    • IBM (cluster of SMPs) vs. HP (SMP), based solely on analysis of published TPC-H numbers:
      • HP is 7.2% better in power (12,812 vs 13,730)
      • IBM is 24.2% better in throughput (12,867 vs 9,755)
      • IBM is 3.2% better in price ($9.3M vs $9.6M)
      • IBM is 36% better in price/performance ($725 vs $985)
    • TPC-C Benchmark example – IBM
      • 32x4 processors, 4GB/node (128GB), 218 18GB disks/node
      • Total managed storage of ~125TB
      • 440,879 tpm-c
      • Total cost: $14.2M
    • See www.tpc.org for all results
  • Large Database Benchmarks
    • High-end database sizes
      • “several customers with 100TB of managed disk” – IBM
      • “customer has requested 1PB (that’s petabyte) of on-line storage for bioinformatics application over next 5 years” – Sun
      • “TB’s are passé, think PB’s” – IBM Life Sciences rep
      • Legacy formats are files, but newer data will be in DBMS
    • Dealing with very large data sizes
      • Interfacing to archival storage
      • Parallelism
  • Linking DBMS to archival storage: the DB2/HPSS Project
    • Example DDL (DB2 table data staged to HPSS):
      CREATE TABLESPACE HPSS-TSPACE MANAGED BY DATABASE
      USING FILE (HPSS <hpss-filename> <size> DISKBUF <path> <size>);
    • [Diagram: database tables in the HPSS_TSPACE tablespace are staged through a DB2 disk buffer and the HPSS disk cache into HPSS containers C1-C5]
    • Joint project with IBM TJ Watson Research Center
    • DB2 provides link to Tivoli ADSM
    • Oracle also supports interface to archival storage
  • Parallelism in Database Systems
    • Example database
    • INPUT table:
    • region int — spatial region, county
    • year smallint — year (1972 - 1990)
    • day smallint— day of year (1-366)
    • solrad int — solar radiation
    • tmax float — max day temp. (-33, 44)
    • tmin float — min day temp. (-45, 29.5)
    • pp float — precipitation (mm)
    • dd float — degree days (heat)
    • OUTPUT table:
    • region int
    • year smallint
    • day smallint
    • x_albers int — x-coordinate
    • y_albers int — y-coordinate
    • tdd10 float — total degree days
    • add float — total anthesis degree days
    • tlai float — total leaf area index
    • seed float — total seed biomass (g/m²)
    • yield float — final yield (tons/ha)
    • twater float — total soil water evaporation + total transpiration
    • ttsw float — Maximum water available
  • Generating query graphs
    • Convert SQL queries to query execution plans consisting of low-level query operators
      • Q1 : Select all regions where max temp is greater than 40 degrees, over the entire period of the study:
        • SELECT distinct(region) FROM Input WHERE tmax>40
      • Q2 : Select solar radiation and total leaf area index values for all days and regions in the year 1978:
        • SELECT solrad, tlai FROM Input A, Output B
        • WHERE A.region=B.region AND A.year=B.year AND A.day=B.day
    [Query plans: Q1 = Read INPUT table, then Apply tmax>40, then Remove duplicates and format output. Q2 = Read INPUT table and Read OUTPUT table, then Join (region, day, year), then Format output]
  • Levels of Query Parallelism
    • Inter-query
      • Execute multiple queries (Q1 and Q2) at the same time
    • Inter-operator (intra-query)
      • Concurrently execute multiple operators in the query
      • Pipeline through the operators, e.g. read and join
    [Pipeline: Read INPUT table and Read OUTPUT table feed the Join (region, day, year), which feeds Format output]
  • Levels of Parallelism...
    • Intra-operator
      • Data parallelism
      • Employ multiple processes for each operator
    [Diagram: multiple Read table processes scan the INPUT and OUTPUT tables in parallel and feed the Join and Format output operators]
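    • A minimal Python sketch of intra-operator (data) parallelism: a filtered scan of the INPUT rows is partitioned across worker processes and the partial results are merged. This only illustrates the idea; a parallel DBMS does this inside the engine:
      from multiprocessing import Pool

      def scan_partition(rows):
          """One worker's share of the 'Read INPUT table' + filter (tmax > 40) operator."""
          return [r for r in rows if r["tmax"] > 40]

      if __name__ == "__main__":
          rows = [{"region": i % 10, "tmax": (i * 7) % 50} for i in range(100_000)]
          n_workers = 4
          partitions = [rows[i::n_workers] for i in range(n_workers)]   # round-robin declustering
          with Pool(n_workers) as pool:
              hot = [r for part in pool.map(scan_partition, partitions) for r in part]
          print(len(hot), "rows with tmax > 40")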
  • Parallel Architecture models and DBMS
    • Shared-everything
      • memory, process space, disk subsystem are all common
    • Shared disk
      • Separate memory/process space
      • Disk subsystem/filesystem is common
    • Shared nothing
      • Separate memory, disks, OS…
      • Only communication “bus” is shared
  • Shared Everything
    • SMP: Symmetric Multi-Processors
    • Provide well-balanced systems
    • Shared workload, resilient to “unexpected” workload
    • Dynamic allocation of processes to query operators (inter- as well as intra-query)
    • Expensive and don’t scale to large configurations
  • Shared Disk
    • Some of the classic architectures map to this, VaxCluster, IBM mainframes (could make a comeback with SAN’s)
    • Can share I/O workload, dynamic partitioning of data
    • Only need to scale I/O subsystem, and not memory
  • Shared Nothing
    • Highly scalable
    • Static partitioning of data
    • Cannot share workload
    • Cluster of SMP’s provides advantages of shared-nothing and SMP’s
  • Combining Nodegroups and Tablespaces
    • [Diagram: on a shared-nothing system, one layout places Tablespace1 (INPUT) and Tablespace2 (OUTPUT) in the same nodegroup; another places Tablespace1 (INPUT) in Nodegroup1 and Tablespace2 (OUTPUT) in Nodegroup2, with the Read INPUT / Read OUTPUT / Join (region, day, year) / Format output plan running across them]
  • The DBMS/Application bottleneck
    • Serial communication between DBMS and app.
      • [Diagram: a single Application process receives the results]
  • The DBMS/Application bottleneck
    • Parallel communication between DBMS and app.
      • [Diagram: multiple App processes receive results in parallel]
  • DBMS / DM software connection
    • [Diagram: the database platform and the data mining platform exchange data: extract data subsets, generate results, store session results, and hand off to presentation (e.g. GIS, 3D)]
  • Performance Tuning
    • Sample set of Database Manager configuration parameters:
    • CPU speed (millisec/instruction) (CPUSPEED) = 9.700848e-07
    • Comm. bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+00
    • Max number of existing agents (MAXAGENTS) = 400
    • Initial number of agents in pool (NUM_INITAGENTS) = 0
    • Max number of coord. Agents (MAX_COORDAGENTS)
    • Max no. of concurrent coord. agents (MAXCAGENTS)
    • Maximum query degree of parallelism (MAX_QUERYDEGREE) = ANY
    • Enable intra-partition parallelism (INTRA_PARALLEL) = NO
  • Database Tuning
    • Sample set of Database configuration parameters:
    • Default query optimization class (DFT_QUERYOPT) = 9
    • Degree of parallelism (DFT_DEGREE) = 1
    • Database heap (4KB) (DBHEAP) = 1200
    • Catalog cache size (4KB) (CATALOGCACHE_SZ) = 64
    • Log buffer size (4KB) (LOGBUFSZ) = 8
    • Utilities heap size (4KB) (UTIL_HEAP_SZ) = 5000
    • Buffer pool size (pages) (BUFFPAGE) = 128000
    • Max storage for lock list (4KB) (LOCKLIST) = 100
    • Number of asynch page cleaners (NUM_IOCLEANERS) = 1
    • Number of I/O servers (NUM_IOSERVERS) = 3
    • Sequential detect flag (SEQDETECT) = YES
    • Default prefetch size (pages) (DFT_PREFETCH_SZ) = 32
  • Examples of data exploration
    • Testing temporal relationships (sensitivity analysis)
      • Can conditions from day N-1 be used to predict the output of day N?
      • How far back can we go?
      • Input table: msu.combined
      • Generated output: (Region, Year, Day, Input_i, Output_i, Output_(i-1))
  • “Flattening” the table
    • Example SQL query:
      SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
      FROM msu.combined A, msu.combined B
      WHERE A.region=B.region AND A.year=B.year AND A.day=B.day-1
    • Query Explain facility (the resulting access plans are shown below)
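    • A runnable sketch of the same self-join against a tiny in-memory SQLite table (toy values; the msu. schema prefix is dropped because SQLite has no schemas):
      import sqlite3

      con = sqlite3.connect(":memory:")
      con.execute("CREATE TABLE combined (region INT, year INT, day INT, solrad INT, tlai REAL)")
      con.executemany(
          "INSERT INTO combined VALUES (?, ?, ?, ?, ?)",
          [(17003, 1978, d, 1000 + d, 0.1 * d) for d in range(1, 6)],
      )

      flattened = con.execute("""
          SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
          FROM combined A, combined B
          WHERE A.region = B.region AND A.year = B.year AND A.day = B.day - 1
      """).fetchall()

      for row in flattened:
          print(row)   # each row pairs a day's values with the next day's tlai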
  • “Flattening” the table – query plan
    Access Table Name = MSU.COMBINED
    |  #Columns = 5
    |  Relation Scan
    |  |  Prefetch: Eligible
    |  Insert Into Sorted Temp Table  ID = t1
    |  |  #Columns = 4
    |  |  #Sort Key Columns = 1
    |  |  |  Key 1: YEAR (Ascending)
    Access Temp Table  ID = t1
    |  Relation Scan
    |  |  Prefetch: Eligible
    Merge Join
    |  Access Table Name = MSU.COMBINED
    |  |  #Columns = 4
    |  |  Relation Scan
    |  |  |  Prefetch: Eligible
    |  |  Insert Into Sorted Temp Table  ID = t2
    |  |  |  #Columns = 4
    |  |  |  #Sort Key Columns = 1
    |  |  |  |  Key 1: YEAR (Ascending)
    |  Access Temp Table  ID = t2
    |  |  Relation Scan
    |  |  |  Prefetch: Eligible
    |  Residual Predicate(s)
    |  |  #Predicates = 2
    Return Data to Application
    |  #Columns = 7
  • “Flattening” the table, with indexing – query plan
    Access Table Name = MSU.COMBINED
    |  #Columns = 5
    |  Relation Scan
    |  |  Prefetch: Eligible
    |  Insert Into Sorted Temp Table  ID = t1
    |  |  #Columns = 4
    |  |  #Sort Key Columns = 1
    |  |  |  Key 1: REGION (Ascending)
    Access Temp Table  ID = t1
    |  Relation Scan
    |  |  Prefetch: Eligible
    Nested Loop Join
    |  Access Table Name = MSU.COMBINED
    |  |  #Columns = 4
    |  |  Index Scan: Name = MSU.C_RYD
    |  |  |  Index Columns:
    |  |  |  |  1: REGION (Ascending)
    |  |  |  |  2: YEAR (Ascending)
    |  |  |  |  3: DAY (Ascending)
    |  |  |  Data Prefetch: Eligible 157
    |  |  |  Index Prefetch: Eligible 157
    Return Data to Application
    |  #Columns = 7
  • Declustering the table
    • Partition the table by Region and/or Year
      • Linearly scalable join operation
    • Testing spatial relationships/sensitivity
      • Compare region R with a specified neighborhood of R
      • Compare region R with other “similar” regions (spatial clustering)
      • Decluster table by year/day
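    • A small sketch of declustering by hash-partitioning rows on Region across a set of nodes (node count and rows are made up for illustration):
      N_NODES = 4

      def node_for(region):
          """Hash the partitioning key to a node number."""
          return hash(region) % N_NODES

      rows = [{"region": r, "year": 1978, "day": d} for r in range(1, 11) for d in (1, 2)]

      # Each "node" gets only its share of the rows, so scans and joins split N_NODES ways.
      partitions = {n: [] for n in range(N_NODES)}
      for row in rows:
          partitions[node_for(row["region"])].append(row)

      for n, part in partitions.items():
          print(f"node {n}: {len(part)} rows, regions {sorted({r['region'] for r in part})}")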
  • Built-in support for OLAP
    • Example table
      • INPUT (Region, Week, Day_of_week, Solrad)
      • 2 regions, 1978, 250 days/year (500 rows)
    • SQL support for rollup
      • SELECT region, week, day_of_week, sum(solrad)
      • FROM Input
      • GROUP BY ROLLUP (region, week, day_of_week)
      • ORDER BY region, week, day_of_week
    • Output is summation of solrad by
      • (region, week, day_of_week), (region, week, –)
      • (region, –, –), (–, –, –)
    • Output rows (Region, Week, Day_of_Week, SUM(Solrad)):
    • 17003 1 1 1661.0
    • 17003 1 2 2654.0
    • 17003 1 3 2709.0
    • 17003 1 4 2101.0
    • 17003 1 5 1197.0
    • 17003 1 6 1605.0
    • 17003 1 7 1133.0
    • 17003 1 - 13060.0
    • … .
    • 17003 36 1 6030.0
    • 17003 36 2 6222.0
    • 17003 36 3 6351.0
    • 17003 36 4 6387.0
    • 17003 36 5 6160.0
    • 17003 36 - 31150.0
    • 17003 - - 1206273.0
    • - - - 2398149.0
  • The “cube” operator
    • SQL query
      • SELECT region, week, day_of_week, sum(solrad)
      • FROM Input
      • GROUP BY CUBE (region, week, day_of_week)
      • ORDER BY region, week, day_of_week
    • Output is summation of solrad by
      • (region, week, day_of_week), (region, week, –), (region, –, –), (–, –, –)
      • (region, –, day_of_week)
      • (–, week, day_of_week)
      • (–, week, –)
      • (–, –, day_of_week)
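    • A small plain-Python sketch that makes the ROLLUP and CUBE grouping sets concrete by computing them over a few made-up rows (a DBMS produces the same groupings in a single pass):
      from collections import defaultdict
      from itertools import combinations

      rows = [
          {"region": 17003, "week": 1, "day_of_week": 1, "solrad": 1661.0},
          {"region": 17003, "week": 1, "day_of_week": 2, "solrad": 2654.0},
          {"region": 17005, "week": 1, "day_of_week": 1, "solrad": 1500.0},
      ]
      dims = ("region", "week", "day_of_week")

      def aggregate(rows, keep):
          """Sum solrad grouped by the dimensions in `keep`; the others become '-'."""
          sums = defaultdict(float)
          for r in rows:
              key = tuple(r[d] if d in keep else "-" for d in dims)
              sums[key] += r["solrad"]
          return dict(sums)

      # ROLLUP(region, week, day_of_week): the prefixes of the dimension list.
      rollup_sets = [dims[:i] for i in range(len(dims), -1, -1)]
      # CUBE(region, week, day_of_week): every subset of the dimension list.
      cube_sets = [c for n in range(len(dims), -1, -1) for c in combinations(dims, n)]

      for keep in rollup_sets:
          print("ROLLUP grouping", keep or "()", "->", aggregate(rows, set(keep)))
      print("CUBE grouping sets:", cube_sets)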
  • Distributed data mining
    • “Function shipping” vs. “data shipping”
    • Generalization of the “operator pushdown” notion
      • “DataCutter” operations in SRB
      • Source/wrapper-side processing in MIX
    • Need to understand which operations can be distributed and how
    • Web-based infrastructure for OLAP and DM
      • XML for Analysis
  • “Remote” operations in SRB
    • [Diagram: an application (SRB client) reaches MCAT and the SRB servers through the SRB middleware; DataCutter and other “remote” operations execute near the data]
  • Wrapper-side processing in MIX
    • [Diagram: an application queries the MIXm mediator, which integrates XML data sources through wrappers; processing can be pushed down into the wrappers at the sources]
  • The role of XML
    • Representing, exchanging metadata
      • image headers, instrumentation information, descriptive metadata...
    • Expressing service descriptions
      • Web-based services
    • Exchanging data among services
      • “Raw” data: sequence information, GIS information…
      • Results of analysis: rowsets, multidimensional cubes,...
  • XML for Analysis
    • [Diagram: client functionality (UI, client functions) issues Discover and Execute calls over SOAP/HTTP to a web-service provider, which runs the Discover/Execute implementation against the data source and returns data over the same path]
  • Examples - Overview
    • Intelligent Miner – Data Analysis and Mining
      • Interface, database connectivity, data creation
      • Statistical routines
      • Classification
        • Decision Tree
        • Neural Network
      • Clustering
    • Netica - Probabilistic Modeling and Decision Support
      • Belief networks, probabilistic queries
      • Statistical decision theory, decision models, influence diagrams
  • Probabilistic Modeling and Bayesian Belief Networks
    • [Example network over the variables Precipitation, Solar Radiation, Productivity, and Yield]
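    • A minimal sketch of a belief network over these four variables, assuming a hypothetical structure (Precipitation and Solar Radiation influence Productivity, which influences Yield) and made-up conditional probabilities; the query is answered by brute-force enumeration:
      from itertools import product

      P_precip = {"hi": 0.4, "lo": 0.6}                  # prior (made up)
      P_solrad = {"hi": 0.5, "lo": 0.5}                  # prior (made up)
      # P(Productivity = hi | Precipitation, SolarRadiation)  (made up)
      P_prod_hi = {("hi", "hi"): 0.9, ("hi", "lo"): 0.6, ("lo", "hi"): 0.5, ("lo", "lo"): 0.2}
      # P(Yield = hi | Productivity)  (made up)
      P_yield_hi = {"hi": 0.8, "lo": 0.3}

      def joint(precip, solrad, prod, yld):
          """P(precip, solrad, prod, yld) under the assumed factorization."""
          p = P_precip[precip] * P_solrad[solrad]
          p *= P_prod_hi[(precip, solrad)] if prod == "hi" else 1 - P_prod_hi[(precip, solrad)]
          p *= P_yield_hi[prod] if yld == "hi" else 1 - P_yield_hi[prod]
          return p

      # Query P(Yield = hi | Precipitation = hi) by summing over the hidden variables.
      num = den = 0.0
      for solrad, prod, yld in product(("hi", "lo"), repeat=3):
          p = joint("hi", solrad, prod, yld)
          den += p
          if yld == "hi":
              num += p
      print("P(Yield = hi | Precip = hi) =", round(num / den, 3))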
  • Statistical Decision Theory*
    • Normative model of rational decision making
    • Decision: Irrevocable allocation of resources
    • Beliefs: Probability theory
    • Preferences: Utility theory
    • Expected Utility = Σ Probability × Utility (summed over outcomes)
    • Value of Information = EU(A | I) – EU(A)
    • Principle of Rationality:
        • Maximize Expected Utility
    • (*Rational agents are your friends.)
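    • A small sketch of these formulas on a made-up decision problem (hypothetical actions, probabilities, and utilities):
      P_state = {"wet": 0.3, "dry": 0.7}                 # beliefs (made up)
      utility = {                                        # preferences (made up)
          ("irrigate", "wet"): 40, ("irrigate", "dry"): 80,
          ("skip",     "wet"): 90, ("skip",     "dry"): 20,
      }
      actions = ("irrigate", "skip")

      def expected_utility(action):
          """EU(A) = sum over states of P(state) * U(action, state)."""
          return sum(P_state[s] * utility[(action, s)] for s in P_state)

      eu_best = max(expected_utility(a) for a in actions)             # act now
      # With perfect information we learn the state first, then pick the best action.
      eu_informed = sum(P_state[s] * max(utility[(a, s)] for a in actions) for s in P_state)
      value_of_information = eu_informed - eu_best                    # EU(A | I) - EU(A)

      print(f"EU(best action) = {eu_best}, EU with information = {eu_informed}, VOI = {value_of_information}")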
  • BK2 Induction Algorithm
    • Data Mining via Probabilistic Model Induction
    • Discover Network Structure and Parameters
    • Greedy Algorithm – ML gradient search
    • Encode background Knowledge – Preferences
  • Model Day 1
  • Model Day 120
  • Other Mining Applications
    • Spatial Data Mining
    • Time Series
    • Sequence Mining
    • Text Data Mining
    • Multimedia Database Mining
    • Web Mining
    • Network Traffic Analysis
  • Acknowledgements
    • Students
      • Peter Shin, Ankur Jain
    • Science Collaborator
      • Stuart Gage, MSU – shared his data set and many insights about the data
    • SDSC
      • Mike Vildibill, Deputy Dir, providing hardware resources for SKIDL
      • Josh Polterock / Dave Archbell – help with software installation, maintenance
    • Funding support
      • NPACI ESS: support for Tony Fountain, Ankur Jain
      • NPACI DICE: support for Chaitan Baru
      • NSF REU: support for Peter Shin