Scientific Applications Of

  • Scientific Applications of Data Mining
    • Bioinformatics Seminar, August 28, 2002
    • Gary Lindstrom, School of Computing, University of Utah
  • Outline
    • What is data mining?
    • Where has it been successfully applied?
    • How can it be applied to scientific applications?
    • Research Opportunities
  • What Is Data Mining?
    • One definition (Robert Grossman)
      • Data mining is the semi-automatic discovery of patterns, associations, anomalies, structures, and changes in large data sets
  • Data Mining
    • Characteristics
      • Large data, vs. small data
      • Discovery, not validation
      • Data driven, not hypothesis driven
      • Automated, not manual application
    • Supported by
      • Statistics, machine learning, databases, high performance computing
  • The Data Gap
    • Exponential growth of data
      • More automation, greater throughput, more models, e.g. simulated
    • But: linear increase in number of researchers
      • Sift the sand, rather than searching a sensor
  • Classical Data Mining Applications
    • Retail
      • Market basket analysis
    • Political science
      • Targeting campaign resources
    • Financial
      • Exploiting market trends & imbalances
  • Decision Support Systems
    • Generic term for analytic and historic uses of DBs
      • Contrast with: operational uses
      • Commonly known as On-Line Transaction Processing (OLTP)
    • Data warehouses
      • Data culled from operational DBs, with history and derived summary data
  • Data Warehouses vs. Databases
      • Replicate data from distributed sources
      • Do not require strict currency of data
      • Oriented toward complex, often statistical queries
      • Often based on materialized views of operational data
        • Views which have been expanded into real tables
  • Tools for DSS
    • Ad hoc SQL-style queries
      • Optimized for large, complex data
    • On-Line Analytic Processing (OLAP)
      • Queries optimized for aggregation operations
      • Data is viewed as multidimensional array
      • Influenced by end-user tools such as spreadsheets
    • Data mining
      • Exploratory data analysis
      • Looking for interesting unanticipated patterns in the data
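  • Data Warehousing
    • [Figure: warehouse architecture. External data sources are extracted, transformed, loaded, and refreshed into the data warehouse, which is described by a metadata repository and serves OLAP, data mining, and visualization tools.]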
  • Creating And Maintaining A Warehouse
    • Challenges
      • Schema design for integrated information
      • Operations
        • Cleaning (curation): filling gaps, correcting errors
        • Transforming: making consistent with new schema
        • Loading: also sorting and summarizing
        • Refreshing: incorporating updates to operational data
        • Purging: aging out old data
    • Role of metadata
      • Sources of data, schema conversion information, refresh history, etc.
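The warehouse operations listed above can be sketched in Python as a small pipeline. This is a minimal illustration under stated assumptions, not a description of any particular warehouse product: the row layout, the rule that a missing amount becomes 0.0, and the purge cutoff date are all hypothetical. Refreshing would rerun the same steps on new operational rows.

```python
from datetime import date

# Hypothetical operational rows with a gap (None) and inconsistent city spelling.
operational = [
    {"id": 1, "city": "salt lake city", "amount": "25.0", "day": date(2002, 8, 1)},
    {"id": 2, "city": "Provo", "amount": None, "day": date(2002, 8, 2)},
    {"id": 3, "city": "PROVO", "amount": "8.5", "day": date(1999, 1, 5)},
]

def clean(row):
    """Cleaning: fill gaps (assumed policy: a missing amount becomes "0.0")."""
    amount = row["amount"] if row["amount"] is not None else "0.0"
    return {**row, "amount": amount}

def transform(row):
    """Transforming: make the row consistent with the warehouse schema."""
    return {
        "id": row["id"],
        "city": row["city"].title(),     # one spelling per city
        "amount": float(row["amount"]),  # numeric measure
        "day": row["day"],
    }

def load(rows):
    """Loading: also sort the rows (here by date)."""
    return sorted(rows, key=lambda r: r["day"])

def purge(warehouse, cutoff):
    """Purging: age out rows older than the cutoff date."""
    return [r for r in warehouse if r["day"] >= cutoff]

warehouse = load(transform(clean(r)) for r in operational)
warehouse = purge(warehouse, date(2000, 1, 1))

# Derived summary data: total amount per city.
summary = {}
for r in warehouse:
    summary[r["city"]] = summary.get(r["city"], 0.0) + r["amount"]
```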
  • OLAP
    • Focuses on data reduction in context
      • In SQL terms, lots of GROUP BY, aggregation, and HAVING operators
    • Multidimensional data model
      • More appropriate than operational DB tables
      • Based on numeric measures
      • Each measure depends on a set of dimensions
  • Example: Retail Sales
    • Dimensions are Product, Location, and Time
    • [Figure: three-dimensional array of sales values with axes pid, timeid, and locid]
  • Working With Multidimensional Data
    • Can be represented as a conventional table
      • n dimensions need n+1 columns
      • Called a fact table
    • OLAP systems may store all data in relational form
      • These are Relational OLAP or ROLAP systems
    • Each dimension can have multiple components
      • Example: Location = (Country, State, City)
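As a rough illustration of the fact table idea, the Python sketch below stores three dimensions plus one measure in four columns (n+1 with n = 3) and then views the same rows as a sparse multidimensional array. The column names follow the retail example; the `Sale` type, the `location` lookup, and all values are hypothetical.

```python
from collections import namedtuple

# Hypothetical fact-table row: three dimension keys plus one numeric measure.
Sale = namedtuple("Sale", ["pid", "timeid", "locid", "sales"])

# Illustrative values only, not data from the talk.
fact_table = [
    Sale(11, 1, 1, 25),
    Sale(11, 2, 1, 8),
    Sale(12, 1, 1, 30),
    Sale(12, 1, 2, 14),
]

# The same data as a sparse multidimensional array:
# the key is the coordinate, the value is the measure.
cube = {(r.pid, r.timeid, r.locid): r.sales for r in fact_table}

# A dimension with several components: Location = (Country, State, City).
location = {
    1: ("USA", "Utah", "Salt Lake City"),
    2: ("USA", "Utah", "Provo"),
}

country, state, city = location[2]
```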
  • OLAP Queries
    • Samples
      • Find the total sales
      • Find total sales for each city
      • Find total sales for each state
      • Find the top five products ranked by total sales
    • OLAP query jargon
      • Dimensionality reduction
        • Aggregating on a dimension, e.g., total sales by city
      • Roll-up
        • Given total sales by city, find total sales by state
      • Drill-down
        • Given total sales by state, ask for total sales by city
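The sample queries and the roll-up operation can be sketched in plain Python. The rows and the helper `total_by` are hypothetical; a real OLAP system would express these as aggregation queries over the fact table.

```python
from collections import defaultdict

# Hypothetical rows: (state, city, product, sales). Illustrative values only.
rows = [
    ("UT", "Salt Lake City", "pen", 10),
    ("UT", "Salt Lake City", "ink", 5),
    ("UT", "Provo", "pen", 7),
    ("CA", "Fresno", "pen", 4),
    ("CA", "Fresno", "pencil", 6),
]

def total_by(rows, key):
    """Aggregate the sales measure, grouping rows by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

total_sales = sum(r[3] for r in rows)                 # find the total sales
by_city = total_by(rows, lambda r: (r[0], r[1]))      # total sales for each city
by_state = total_by(rows, lambda r: r[0])             # total sales for each state

# Roll-up: derive state totals from the city totals alone.
rolled_up = defaultdict(int)
for (state, _city), amount in by_city.items():
    rolled_up[state] += amount

# Top five products ranked by total sales.
by_product = total_by(rows, lambda r: r[2])
top_products = sorted(by_product.items(), key=lambda kv: kv[1], reverse=True)[:5]
```

Drill-down goes the other way: it cannot be computed from `by_state` alone and has to return to the finer-grained rows, which is why `by_city` is built from `rows`.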
  • OLAP Operations
    • Pivoting
      • Spreadsheet-like summaries
      • Example: given tabular representation of sales
        • Pivoting on Location and Time gives table of total sales for each location and each time value
      • Can be combined with aggregation
        • E.g., yearly sales by state
    • Cross tabulation
      • Displays result of pivoting
      • Aggregation values shown as summary rows and columns
      • As extra rows and columns added to original table
  • OLAP Operations (cont’d)
    • Slicing
      • Equality selection on one or more dimensions
    • Dicing
      • Range selection
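Pivoting, cross tabulation, slicing, and dicing can be illustrated on a small in-memory fact table. This is a sketch with hypothetical values; the column order (pid, timeid, locid, sales) is an assumption carried over from the retail example.

```python
from collections import defaultdict

# Hypothetical fact rows: (pid, timeid, locid, sales). Illustrative values only.
facts = [
    (11, 1, 1, 25), (11, 2, 1, 8), (12, 1, 1, 30),
    (11, 1, 2, 35), (12, 2, 2, 20), (12, 3, 2, 10),
]

# Pivot on Location and Time: total sales for each (locid, timeid) pair.
pivot = defaultdict(int)
for pid, timeid, locid, sales in facts:
    pivot[(locid, timeid)] += sales

# Cross tabulation: summary totals per location (rows), per time (columns),
# plus a grand total.
row_totals = defaultdict(int)
col_totals = defaultdict(int)
for (locid, timeid), sales in pivot.items():
    row_totals[locid] += sales
    col_totals[timeid] += sales
grand_total = sum(row_totals.values())

# Slicing: equality selection on a dimension (locid == 1).
slice_loc1 = [f for f in facts if f[2] == 1]

# Dicing: range selection on one or more dimensions (timeid in 1..2).
dice = [f for f in facts if 1 <= f[1] <= 2]
```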
  • OLAP Naturally Leads to Data Mining
    • Seeks interesting trends or patterns in large datasets
      • An example of exploratory data analysis
      • Related to knowledge discovery and machine learning
    • Mining for rules
      • Association rules: motivated by retail market basket analysis
  • Market Basket Analysis
    • Market basket
      • A collection of items purchased by a customer in one transaction
      • Retailers want to learn of items often purchased together
        • For promotional and display grouping purposes
      • Simple tabular representation
        • Purchases(transid, custid, date, item, price, quantity)
  • Association Rules
    • Seek rules of the form:
        • { pen } => { ink }
      • Meaning:
        • If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction
  • Important Measures for Association Rules
    • Support
      • % of transactions containing all items mentioned in rule
      • Low support reduces interest in the rule
    • Confidence
      • % of transactions containing the LHS that also contain RHS
      • Indicates degree of correlation
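Written as formulas, with T the set of all transactions and X => Y a rule with LHS X and RHS Y (this notation is introduced here for illustration and is not from the talk), the two measures are:

```latex
\mathrm{support}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}{|T|},
\qquad
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}
         {\left|\{\, t \in T : X \subseteq t \,\}\right|}
```

As a hypothetical example, if 4 transactions all contain a pen and 3 of them also contain ink, then { pen } => { ink } has support 3/4 = 75% and confidence 3/4 = 75%.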
  • Using Association Rules For Prediction
    • Always somewhat risky
      • Because ultimate goal is understanding causality
      • Which is not directly reflected in transaction data
  • There Can Be High Support and Confidence
    • … but no causality
    • Example: pencils and pens are often bought together
      • And pens and ink are often bought together
      • Hence pencils and ink are often bought together
    • But there is no causal link between pencils and ink
      • Hence sale promotions on pencils and ink probably won’t be effective
  • Finding Association Rules
    • Seek rules with:
      • Support greater than minsup
      • Confidence greater than minconf
    • Steps
      • Find frequent item sets
        • Sets of items with support >= minsup
      • Break each frequent item set into LHS and RHS of candidate rules
        • Keep those with confidence >= minconf
  • Testing Candidate Rules
    • Confidence calculation for each candidate rule
      • Maintain two counters: lhscount, rhscount
      • Scan entire customer transaction table
      • Count in lhscount occurrences of all items in LHS
      • If LHS is present, tally in rhscount if all items in RHS are present
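The counter scheme above can be sketched in Python. The (transid, item) rows are hypothetical and use only two columns of the Purchases table; the function name `rule_confidence` is likewise an assumption, not something defined in the talk.

```python
from collections import defaultdict

# Hypothetical (transid, item) rows. Illustrative values only.
rows = [
    (111, "pen"), (111, "ink"), (111, "milk"),
    (112, "pen"), (112, "ink"),
    (113, "pen"), (113, "milk"),
]

# Group rows into one basket (set of items) per transaction.
baskets = defaultdict(set)
for transid, item in rows:
    baskets[transid].add(item)

def rule_confidence(lhs, rhs, baskets):
    """Scan all baskets once, maintaining lhscount and rhscount."""
    lhs, rhs = set(lhs), set(rhs)
    lhscount = 0  # baskets containing every LHS item
    rhscount = 0  # of those, baskets that also contain every RHS item
    for basket in baskets.values():
        if lhs <= basket:
            lhscount += 1
            if rhs <= basket:
                rhscount += 1
    # Confidence is undefined when the LHS never occurs; return 0.0 in that case.
    return rhscount / lhscount if lhscount else 0.0

pen_ink = rule_confidence({"pen"}, {"ink"}, baskets)
```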
  • Identifying Frequent Item Sets
    • The a priori property:
      • Every subset of a frequent item set is also a frequent item set
    • This leads to an iterative algorithm
      • Identify frequent item sets of one item
      • Iteratively, seek to extend frequent item sets by adding an item
  • Finding Frequent Itemsets
    • foreach item, check if it is a frequent itemset
    • repeat
      • foreach new frequent itemset I_k with k items
        • generate all itemsets I_{k+1} with k+1 items, I_k ⊂ I_{k+1}
      • scan all transactions once and check if the generated (k+1)-itemsets are frequent
    • until no new frequent itemsets are found
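A minimal Python sketch of this level-wise search follows. It is an illustration under stated assumptions rather than the talk's implementation: the transactions are hypothetical, `minsup` is taken as a fraction of transactions, and for clarity each candidate is counted with its own pass over the data instead of one shared scan per level. Candidates are also pruned using the a priori property before counting.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Return all itemsets whose support (a fraction) is >= minsup."""
    n = len(transactions)

    def is_frequent(itemset):
        return sum(itemset <= t for t in transactions) / n >= minsup

    items = sorted({item for t in transactions for item in t})
    # Level 1: frequent itemsets of one item.
    current = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(current)

    while current:
        known = set(current)
        # Extend each new frequent k-itemset by one item.
        candidates = {s | {i} for s in current for i in items if i not in s}
        # A priori pruning: every k-item subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in known for sub in combinations(c, len(c) - 1))
        }
        current = [c for c in candidates if is_frequent(c)]
        frequent.extend(current)
    return frequent

# Hypothetical market baskets. Illustrative values only.
transactions = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
result = frequent_itemsets(transactions, minsup=0.7)
```

With these baskets and minsup = 0.7, the frequent itemsets are {pen}, {ink}, {milk}, {pen, ink}, and {pen, milk}; no three-item candidate survives pruning because {ink, milk} is not frequent.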
  • Generalized Association Rules
    • Grouping by transaction attributes
      • Example: group by custid
      • Association can be across multiple transactions by same customer
    • Group by categories
      • Example: rules of form
        • { apparel } => { stationery }
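Both generalizations amount to changing how rows are grouped into baskets before mining. The sketch below regroups hypothetical Purchases rows by custid and replaces items with categories; the `category` mapping and the helper `baskets` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical Purchases rows reduced to (transid, custid, item).
purchases = [
    (111, 201, "pen"), (111, 201, "ink"),
    (112, 105, "shirt"),
    (113, 105, "pen"),
    (114, 201, "jacket"),
]

# Hypothetical item-to-category hierarchy.
category = {
    "pen": "stationery", "ink": "stationery",
    "shirt": "apparel", "jacket": "apparel",
}

def baskets(rows, group_key, item_key):
    """Group rows into baskets keyed by group_key(row)."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[group_key(row)].add(item_key(row))
    return dict(grouped)

# Ordinary baskets: one per transaction, items as purchased.
by_transaction = baskets(purchases, lambda r: r[0], lambda r: r[2])

# Generalized baskets: one per customer, items replaced by their category.
by_customer_category = baskets(purchases, lambda r: r[1], lambda r: category[r[2]])
```

In this toy data no single transaction contains both apparel and stationery, yet every customer basket does, so { apparel } => { stationery } only becomes visible after regrouping by custid.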
  • Generalized Association Rules (Cont’d)
    • Sequential patterns
      • Identify frequently arising buying patterns over time
    • Classification rules
      • “If age is in a certain range and balance is in a certain range, then the customer is likely to default on a loan.”
  • Example: Managing Microarray Data
    • MS thesis by John Kokinis, 2000
    • ArrayBank
      • Tool for management of microarray gene expression data sets
      • Implemented in Visual Basic / Sybase
    • Figures from MS thesis
      • Pages 20, 45, 46
      • http://www.cs.utah.edu/~kokinis/THESIS.pdf
  • Example: Mining Simulated Combustion Data
    • Joint work with
      • Brijesh Garabadu, School of Computing
      • Zoran Djurisic, Chem. & Fuels Engg.
    • The problem
      • Combustion model for powdered coal furnaces
      • Which conditions control NOx pollution?
  • The Data
    • Multidimensional space
      • Pressure, fuel mix, oxygen concentration
      • Can explore (simulate) any combination
        • But which to look at?
    • Need to:
      • Locate relevant subspaces
      • Characterize important events
      • Develop causal hypotheses
  • Techniques Applied
    • Cluster analysis
      • Which datasets are similar?
    • Neural networks
      • Which datasets are interesting?
    • Decision trees
      • Which features best explain similarities?
  • Cluster Analysis: Unsupervised Learning
    • At outset, category structure of the data is unknown
      • All that is known is a collection of observations
    • Objective: to discover a category structure that fits the observations
      • i.e. finding natural groups in data
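As one concrete instance of finding natural groups, the sketch below runs k-means on a handful of two-dimensional points. The talk does not say which clustering algorithm was used, so k-means is an assumption chosen for brevity; the points and the initial centers are hypothetical.

```python
def kmeans(points, initial_centers, iterations=10):
    """Plain k-means: alternate nearest-center assignment and center update."""
    centers = list(initial_centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabeled observations forming two visible groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
]
centers, clusters = kmeans(points, initial_centers=[points[0], points[3]])
```

No labels are supplied: the two groups are recovered from the observations alone, which is what makes this unsupervised.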
  • Combustion Application
    • Cluster analysis was used to detect relationships among various species
      • Are the behaviors of any two species related?
      • Is the concentration of one species dependent on that of one or more other species? 
    • One confirmed hypothesis:
      • CH reaches its peak concentration either before or at the same time as H reaches its peak concentration
      • An important engineering observation
  • Artificial Neural Networks
    • A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
    • Combustion application
      • Finding different kinds of patterns (increasing, decreasing, etc.) in the lifetime of a species during the combustion process
      • This can be used to test various hypotheses as well as to detect patterns of specific species in previously unseen data
  • Neural Networks: Supervised Learning
  • Application Technique
    • Training set data are labeled by the user
      • These labeled data are used to train the ANN
    • The ANN is then used to classify previously unseen data
      • e.g., species in a particular combustion
      • Into a particular pattern class
    • For example, NO shows two different trends under differing conditions
    • A trained ANN can be used to classify the datasets according to the trend of NO
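The label, train, and classify workflow can be sketched with the simplest possible network, a single threshold unit trained by the perceptron rule. The talk does not describe the network architecture actually used, so this is only an assumption-laden illustration: the concentration series, the choice of successive differences as inputs, and the class labels are all hypothetical.

```python
def differences(series):
    """Input features: successive differences of a concentration series."""
    return [b - a for a, b in zip(series, series[1:])]

def predict(weights, bias, x):
    """Threshold unit: 1 = increasing trend, 0 = decreasing trend."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0

def train_perceptron(samples, labels, epochs=50, rate=0.1):
    """Perceptron rule: adjust the weights only when a sample is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            weights = [w + rate * error * v for w, v in zip(weights, x)]
            bias += rate * error
    return weights, bias

# Training set labeled by the user (hypothetical series).
training_series = [
    [0.1, 0.2, 0.4, 0.7],  # increasing
    [0.0, 0.3, 0.5, 0.6],  # increasing
    [0.9, 0.6, 0.4, 0.1],  # decreasing
    [0.8, 0.7, 0.3, 0.2],  # decreasing
]
labels = [1, 1, 0, 0]

weights, bias = train_perceptron([differences(s) for s in training_series], labels)

# Classify previously unseen series.
unseen_up = predict(weights, bias, differences([0.2, 0.3, 0.5, 0.9]))
unseen_down = predict(weights, bias, differences([0.7, 0.5, 0.2, 0.1]))
```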
  • Decision Trees
    • Characterize data by features
      • e.g., species concentration at an instant
    • Categorize data sets
      • Manually, or use ANN
      • e.g., according to the trend of NO
    • Use decision tree algorithm to discover clustering criteria
  • Sample Output
    === Classifier model (full training set) ===
    J48 pruned tree
    ---------------------
    CO <= 0.002945
    |   OH <= 0.000016
    |   |   CO <= 0.000166: yes (17.0/1.0)
    |   |   CO > 0.000166: no (3.0)
    |   OH > 0.000016: yes (30.0)
    CO > 0.002945: no (60.0/1.0)
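The pruned tree in the sample output reads directly as nested conditionals. The Python function below transcribes it; the function name and argument names are hypothetical, and the class labels "yes" and "no" are returned as printed because the output does not say what they denote.

```python
def classify(co, oh):
    """Transcription of the J48 pruned tree from the sample output."""
    if co <= 0.002945:
        if oh <= 0.000016:
            return "yes" if co <= 0.000166 else "no"
        return "yes"
    return "no"

# Leaf annotations in Weka's J48 output give (instances reaching the leaf /
# instances misclassified there); a single number means no misclassifications.
leaves = [(17.0, 1.0), (3.0, 0.0), (30.0, 0.0), (60.0, 1.0)]
total = sum(n for n, _ in leaves)
errors = sum(e for _, e in leaves)
training_accuracy = (total - errors) / total
```

On the training set this gives 108 of 110 instances classified correctly, using only the CO and OH concentrations as features.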
  • Research Opportunities
    • Try it!
      • In your area, on your data, for new results
    • Features
      • Definition, efficient extraction
    • Community building
      • Sharing data mining results
  • PMML
    • Predictive Model Markup Language
    • XML-based representation of data mining models, including association rules
    • Developed by Data Mining Group
      • Industrial and university research collaboration
  • An Excellent Tutorial
    • Used for material in this talk
      • Data Mining Scientific and Engineering Applications
        • Tutorial at SC2001, November 12, 2001 by R. Grossman, C. Kamath and V. Kumar
    • http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html