
Scientific Applications Of Data Mining


    1. Scientific Applications of Data Mining
       Bioinformatics Seminar, August 28, 2002
       Gary Lindstrom, School of Computing, University of Utah
    2. Outline
       • What is data mining?
       • Where has it been successfully applied?
       • How can it be applied to scientific applications?
       • Research opportunities
    3. What Is Data Mining?
       • One definition (Robert Grossman)
         • Data mining is the semi-automatic discovery of patterns, associations, anomalies, structures, and changes in large data sets
    4. Data Mining
       • Characteristics
         • Large data, vs. small data
         • Discovery, not validation
         • Data driven, not hypothesis driven
         • Automated, not manual application
       • Supported by
         • Statistics, machine learning, databases, high performance computing
    5. The Data Gap
       • Exponential growth of data
         • More automation, greater throughput, more models, e.g. simulated
       • But: linear increase in number of researchers
         • Sift the sand, rather than searching a sensor
    6. Classical Data Mining Applications
       • Retail
         • Market basket analysis
       • Political science
         • Targeting campaign resources
       • Financial
         • Exploiting market trends & imbalances
    7. Decision Support Systems
       • Generic term for analytic and historic uses of DBs
         • Contrast with: operational uses
         • Commonly known as On-Line Transaction Processing (OLTP)
       • Data warehouses
         • Data culled from operational DBs, with history and derived summary data
    8. Data Warehouses vs. Databases
       • Replicate data from distributed sources
       • Do not require strict currency of data
       • Oriented toward complex, often statistical queries
       • Often based on materialized views of operational data
         • Views which have been expanded into real tables
    9. Tools for DSS
       • Ad hoc SQL-style queries
         • Optimized for large, complex data
       • On-Line Analytic Processing (OLAP)
         • Queries optimized for aggregation operations
         • Data is viewed as a multidimensional array
         • Influenced by end-user tools such as spreadsheets
       • Data mining
         • Exploratory data analysis
         • Looking for interesting unanticipated patterns in the data
    10. Data Warehousing
        [Figure: data warehousing architecture. Labels: External Data Source; EXTRACT, TRANSFORM, LOAD, REFRESH; Data Warehouse; Metadata Repository; SERVES; OLAP, Data Mining, Visualization.]
    11. Creating And Maintaining A Warehouse
        • Challenges
          • Schema design for integrated information
          • Operations
            • Cleaning (curation): filling gaps, correcting errors
            • Transforming: making consistent with new schema
            • Loading: also sorting and summarizing
            • Refreshing: incorporating updates to operational data
            • Purging: aging out old data
        • Role of metadata
          • Sources of data, schema conversion information, refresh history, etc.
    12. OLAP
        • Focuses on data reduction in context
          • In SQL terms, lots of GROUP BY, aggregation, and HAVING operators
        • Multidimensional data model
          • More appropriate than operational DB tables
          • Based on numeric measures
          • Each measure depends on a set of dimensions
    13. Example: Retail Sales
        • Dimensions are Product, Location and Time
        [Figure: three-dimensional sales cube with axes pid, timeid and locid; the visible slice shows the values 8 10 10 / 30 20 50 / 25 8 15.]
    14. Working With Multidimensional Data
        • Can be represented as a conventional table
          • n dimensions need n+1 columns
          • Called a fact table
        • OLAP systems may store all data in relational form
          • These are Relational OLAP or ROLAP systems
        • Each dimension can have multiple components
          • Example: Location = (Country, State, City)
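The fact-table idea can be sketched in a few lines. This is only an illustration: the `Sale` type, the `cell` helper, and the row values are made up for the example and are not taken from the talk. Three dimension columns plus one measure column give the n+1 columns mentioned above.

```python
from collections import namedtuple

# Hypothetical fact rows: three dimension columns plus one measure column.
Sale = namedtuple("Sale", ["pid", "timeid", "locid", "sales"])

fact_table = [
    Sale(11, 1, 1, 25),
    Sale(11, 2, 1, 8),
    Sale(12, 1, 1, 30),
    Sale(12, 1, 2, 20),
]

def cell(table, pid, timeid, locid):
    """Look up one cell of the cube; a missing cell counts as zero sales."""
    return sum(row.sales for row in table
               if (row.pid, row.timeid, row.locid) == (pid, timeid, locid))
```

Each row is one populated cell of the multidimensional array, so sparse cubes cost nothing for their empty cells.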
    15. OLAP Queries
        • Samples
          • Find the total sales
          • Find total sales for each city
          • Find total sales for each state
          • Find the top five products ranked by total sales
        • OLAP query jargon
          • Dimensionality reduction
            • Aggregating on a dimension, e.g., total sales by city
          • Roll-up
            • Given total sales by city, find total sales by state
          • Drill-down
            • Given total sales by state, ask for total sales by city
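A minimal sketch of roll-up and drill-down, assuming a toy list of (state, city, sales) rows. The data and the helper names `total_by` and `top_n` are invented for illustration; a real OLAP system would run such queries inside the database.

```python
from collections import defaultdict

# Toy fact rows (illustrative values only): (state, city, sales)
rows = [
    ("UT", "Salt Lake City", 30),
    ("UT", "Provo", 20),
    ("WI", "Madison", 25),
    ("WI", "Madison", 15),
]

def total_by(rows, key):
    """Aggregate sales over every dimension not named by key."""
    totals = defaultdict(int)
    for state, city, sales in rows:
        totals[key(state, city)] += sales
    return dict(totals)

def top_n(totals, n):
    """Rank groups by total sales, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

by_city = total_by(rows, lambda state, city: (state, city))  # drill-down level
by_state = total_by(rows, lambda state, city: state)         # roll-up level
grand_total = sum(sales for _, _, sales in rows)
```

Rolling up only changes the grouping key from (state, city) to state; drilling down is the reverse move.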
    16. OLAP Operations
        • Pivoting
          • Spreadsheet-like summaries
          • Example: given tabular representation of sales
            • Pivoting on Location and Time gives table of total sales for each location and each time value
          • Can be combined with aggregation
            • E.g., yearly sales by state
        • Cross tabulation
          • Displays result of pivoting
          • Aggregation values shown as summary rows and columns
          • As extra rows and columns added to original table
    17. OLAP Operations (cont’d)
        • Slicing
          • Equality selection on one or more dimensions
        • Dicing
          • Range selection
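Pivoting with cross-tabulation margins, slicing, and dicing can be sketched as follows. The rows and the function names are assumptions made for this example, not part of the talk.

```python
from collections import defaultdict

# Toy fact rows (illustrative values only): (locid, timeid, pid, sales)
facts = [
    (1, 1, 11, 25), (1, 2, 11, 8), (1, 1, 12, 30),
    (2, 1, 11, 35), (2, 2, 12, 10),
]

def pivot(facts):
    """Cross-tabulate total sales by (locid, timeid), with summary margins."""
    table = defaultdict(int)
    for locid, timeid, _pid, sales in facts:
        table[(locid, timeid)] += sales
        table[(locid, "total")] += sales    # summary column
        table[("total", timeid)] += sales   # summary row
        table[("total", "total")] += sales  # grand total
    return dict(table)

def slice_(facts, locid):
    """Slicing: equality selection on the location dimension."""
    return [f for f in facts if f[0] == locid]

def dice(facts, time_lo, time_hi):
    """Dicing: range selection on the time dimension."""
    return [f for f in facts if time_lo <= f[1] <= time_hi]
```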
    18. OLAP Naturally Leads to Data Mining
        • Seeks interesting trends or patterns in large datasets
          • An example of exploratory data analysis
          • Related to knowledge discovery and machine learning
        • Mining for rules
          • Association rules: motivated by retail market basket analysis
    19. Market Basket Analysis
        • Market basket
          • A collection of items purchased by a customer in one transaction
          • Retailers want to learn of items often purchased together
            • For promotional and display grouping purposes
          • Simple tabular representation
            • Purchases(transid, custid, date, item, price, quantity)
    20. Association Rules
        • Seek rules of the form:
            • { pen } => { ink }
          • Meaning:
            • If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction
    21. Important Measures for Association Rules
        • Support
          • % of transactions containing all items mentioned in rule
          • Low support reduces interest in the rule
        • Confidence
          • % of transactions containing the LHS that also contain RHS
          • Indicates degree of correlation
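The two measures can also be written as formulas. The symbols are assumptions introduced here and do not appear in the slides: T is the set of all transactions, and the rule is X ⇒ Y with X the LHS item set and Y the RHS item set.

```latex
\[
\operatorname{support}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}{|T|},
\qquad
\operatorname{confidence}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}
         {\left|\{\, t \in T : X \subseteq t \,\}\right|}
\]
```

Multiply by 100 to obtain the percentages used on the slide.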
    22. Using Association Rules For Prediction
        • Always somewhat risky
          • Because ultimate goal is understanding causality
          • Which is not directly reflected in transaction data
    23. There Can Be High Support and Confidence
        • … but no causality
        • Example: pencils and pens are often bought together
          • And pens and ink are often bought together
          • Hence pencils and ink are often bought together
        • But there is no causal link between pencils and ink
          • Hence sale promotions on pencils and ink probably won’t be effective
    24. Finding Association Rules
        • Seek rules with:
          • Support greater than minsup
          • Confidence greater than minconf
        • Steps
          • Find frequent item sets
            • Sets of items with support >= minsup
          • Break each frequent item set into LHS and RHS of candidate rules
            • Keep those with confidence >= minconf
    25. Testing Candidate Rules
        • Confidence calculation for each candidate rule
          • Maintain two counters: lhscount, rhscount
          • Scan entire customer transaction table
          • Count in lhscount the transactions containing all items in LHS
          • If LHS is present, tally in rhscount if all items in RHS are also present
          • Confidence is then rhscount / lhscount
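The single-scan test can be sketched as below. The counter names `lhscount` and `rhscount` follow the slide; the function name `rule_confidence`, the representation of baskets as Python sets, and the sample baskets are assumptions made for illustration.

```python
def rule_confidence(transactions, lhs, rhs):
    """Scan all transactions once; return (support, confidence) as fractions."""
    lhs, rhs = set(lhs), set(rhs)
    lhscount = rhscount = 0
    for basket in transactions:
        if lhs <= basket:        # every LHS item is in the basket
            lhscount += 1
            if rhs <= basket:    # ... and every RHS item as well
                rhscount += 1
    support = rhscount / len(transactions) if transactions else 0.0
    confidence = rhscount / lhscount if lhscount else 0.0
    return support, confidence

# Made-up baskets, one set of items per transaction.
baskets = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
```

With these baskets, `rule_confidence(baskets, {"pen"}, {"ink"})` sees the LHS in all four transactions and the RHS in three of them.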
    26. Identifying Frequent Item Sets
        • The a priori property:
          • Every subset of a frequent item set is also a frequent item set
        • This leads to an iterative algorithm
          • Identify frequent item sets of one item
          • Iteratively, seek to extend frequent item sets by adding an item
    27. Finding Frequent Itemsets
          foreach item, check if it is a frequent itemset
          repeat
              foreach new frequent itemset I_k with k items
                  generate all itemsets I_{k+1} with k+1 items, I_k ⊂ I_{k+1}
              scan all transactions once and check if the generated (k+1)-itemsets are frequent
          until no new frequent itemsets are found
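A runnable sketch of this loop, under some assumptions: transactions are Python sets, `minsup` is a fraction between 0 and 1, and the name `frequent_itemsets` is invented here. The pruning step, which discards any candidate that has an infrequent subset, is a direct use of the a priori property and goes slightly beyond the pseudocode above. For brevity the sketch rescans the transactions once per candidate rather than once per level.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Return every itemset whose support (as a fraction) is >= minsup."""
    n = len(transactions)
    if n == 0:
        return set()

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= minsup}
    frequent = set(current)
    while current:
        # Extend each newly found frequent k-itemset by one more item.
        candidates = {s | {i} for s in current for i in items if i not in s}
        # A priori pruning: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, len(c) - 1))}
        current = {c for c in candidates if support(c) >= minsup}
        frequent |= current
    return frequent

# Made-up baskets for illustration.
baskets = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
```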
    28. Generalized Association Rules
        • Grouping by transaction attributes
          • Example: group by custid
          • Association can be across multiple transactions by same customer
        • Group by categories
          • Example: rules of form
            • { apparel } => { stationery }
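Grouping by a transaction attribute only changes how baskets are formed before rule mining starts. In this sketch the rows are a made-up subset of the Purchases table, reduced to (transid, custid, item), and `baskets_by` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical rows of Purchases, reduced to (transid, custid, item).
purchases = [
    (111, 201, "pen"),
    (111, 201, "ink"),
    (112, 105, "pen"),
    (113, 201, "milk"),
    (114, 105, "ink"),
]

def baskets_by(rows, column):
    """Group items into baskets keyed by a column (0 = transid, 1 = custid)."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[column]].add(row[2])
    return dict(groups)

per_transaction = baskets_by(purchases, 0)
per_customer = baskets_by(purchases, 1)
```

Customer 105 never buys a pen and ink in the same transaction, but the per-customer basket contains both, so a rule relating them can now be found.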
    29. Generalized Association Rules (Cont’d)
        • Sequential patterns
          • Identify frequently arising buying patterns over time
        • Classification rules
          • “If age is in a certain range and balance is in a certain range, then the customer is likely to default on a loan.”
    30. Example: Managing Microarray Data
        • MS thesis by John Kokinis, 2000
        • ArrayBank
          • Tool for management of microarray gene expression data sets
          • Implemented in Visual Basic / Sybase
        • Figures from MS thesis
          • Pages 20, 45, 46
          • http://www.cs.utah.edu/~kokinis/THESIS.pdf
    31. Example: Mining Simulated Combustion Data
        • Joint work with
          • Brijesh Garabadu, School of Computing
          • Zoran Djurisic, Chem. & Fuels Engg.
        • The problem
          • Combustion model for powdered coal furnaces
          • Which conditions control NOx pollution?
    32. The Data
        • Multidimensional space
          • Pressure, fuel mix, oxygen concentration
          • Can explore (simulate) any combination
            • But which to look at?
        • Need to:
          • Locate relevant subspaces
          • Characterize important events
          • Develop causal hypotheses
    33. Techniques Applied
        • Cluster analysis
          • Which datasets are similar?
        • Neural networks
          • Which datasets are interesting?
        • Decision trees
          • Which features best explain similarities?
    34. Cluster Analysis: Unsupervised Learning
        • At outset, category structure of the data is unknown
          • All that is known is a collection of observations
        • Objective: to discover a category structure which fits the observations
          • i.e., finding natural groups in data
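The talk does not say which clustering algorithm was used. As one common example of discovering groups without labels, here is a minimal k-means sketch; the choice of k-means, the starting centroids, and the sample points are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else old
            for cl, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up observations forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (6.0, 6.0)])
```

No labels are supplied anywhere: the two groups emerge from the distances alone.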
    35. Combustion Application
        • Cluster analysis was used to detect relationships among various species
          • Are the behaviors of any two species related?
          • Is the concentration of one species dependent on that of one or more other species?
        • One confirmed hypothesis:
          • CH reaches its peak concentration either before or at the same time as H reaches its peak concentration
          • An important engineering observation
    36. Artificial Neural Networks
        • A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
        • Combustion application
          • Finding different kinds of patterns (increasing / decreasing, etc.) in the lifetime of a species during the combustion process
          • This can be used to test various hypotheses as well as to detect patterns of specific species in previously unseen data
    37. Neural Networks: Supervised Learning
    38. Application Technique
        • Training set data are labeled by the user
          • These labeled data are used to train the ANN
        • The ANN is then used to classify previously unseen data
          • e.g., species in a particular combustion
          • Into a particular pattern class
        • For example, NO shows two different trends under differing conditions
        • A trained ANN can be used to classify the datasets according to the trend of NO
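The train-then-classify workflow can be sketched with a single-neuron perceptron, the simplest kind of ANN. The slides do not describe the network actually used, so the architecture, the feature choice (differences between consecutive samples), and the labeled series below are all assumptions made for illustration.

```python
def diffs(series):
    """Features: differences between consecutive concentration samples."""
    return [b - a for a, b in zip(series, series[1:])]

def predict(weights, bias, features):
    """Return 1 (increasing trend) if the weighted sum is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(examples, epochs=20, rate=1.0):
    """Perceptron learning rule on (features, label) pairs."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

# User-labeled training series (made up): 1 = increasing, 0 = decreasing.
labeled = [
    ([0.1, 0.2, 0.4], 1),
    ([0.0, 0.3, 0.5], 1),
    ([0.5, 0.3, 0.1], 0),
    ([0.4, 0.2, 0.0], 0),
]
weights, bias = train_perceptron([(diffs(s), y) for s, y in labeled])
```

Once trained, `predict(weights, bias, diffs(series))` assigns a previously unseen series to one of the two trend classes.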
    39. Decision Trees
        • Characterize data by features
          • e.g., species concentration at an instant
        • Categorize data sets
          • Manually, or use ANN
          • e.g., according to the trend of NO
        • Use decision tree algorithm to discover clustering criteria
    40. Sample Output
          === Classifier model (full training set) ===

          J48 pruned tree
          ---------------------
          CO <= 0.002945
          |   OH <= 0.000016
          |   |   CO <= 0.000166: yes (17.0/1.0)
          |   |   CO > 0.000166: no (3.0)
          |   OH > 0.000016: yes (30.0)
          CO > 0.002945: no (60.0/1.0)
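Read as code, the pruned tree above is a short chain of threshold tests. The function name `classify` is hypothetical, and the slide does not say what the "yes" and "no" classes stand for, so they are passed through unchanged. In J48 output, a leaf annotation such as (17.0/1.0) gives the number of training instances reaching the leaf and how many of them were misclassified.

```python
def classify(co, oh):
    """Apply the J48 tree from the sample output to CO and OH concentrations."""
    if co <= 0.002945:
        if oh <= 0.000016:
            return "yes" if co <= 0.000166 else "no"
        return "yes"
    return "no"
```

The thresholds are the discovered criteria: they state which feature ranges separate the two classes of datasets.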
    41. Research Opportunities
        • Try it!
          • In your area, on your data, for new results
        • Features
          • Definition, efficient extraction
        • Community building
          • Sharing data mining results
    42. PMML
        • Predictive Model Markup Language
        • XML-based representation of association rules
        • Developed by Data Mining Group
          • Industrial and university research collaboration
    43. An Excellent Tutorial
        • Used for material in this talk
          • Data Mining Scientific and Engineering Applications
            • Tutorial at SC2001, November 12, 2001 by R. Grossman, C. Kamath and V. Kumar
        • http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html
