Data Mining: An Overview David Madigan [email_address] http://stat.rutgers.edu/~madigan
Overview
- Brief Introduction to Data Mining
- Data Mining Algorithms
- Specific Examples
  - Algorithms: Disease Clusters
  - Algorithms: Model-Based Clustering
  - Algorithms: Frequent Items and Association Rules
- Future Directions, etc.
Of “Laws”, Monsters, and Giants…
- Moore’s law: processing “capacity” doubles every 18 months: CPU, cache, memory
- Its more aggressive cousin:
  - Disk storage “capacity” doubles every 9 months
- What do the two “laws” combined produce?
  - A rapidly growing gap between our ability to generate data and our ability to make use of it.
What is Data Mining?
- Finding interesting structure in data
- Structure: refers to statistical patterns, predictive models, hidden relationships
- Examples of tasks addressed by Data Mining:
  - Predictive Modeling (classification, regression)
  - Segmentation (Data Clustering)
  - Summarization
  - Visualization
 
 
(Three example slides credited to Ronny Kohavi, ICML 1998.)
Stories: Online Retailing
Data Mining Algorithms
“A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns.”
- “well-defined”: can be encoded in software
- “algorithm”: must terminate after some finite number of steps
Hand, Mannila, and Smyth
Algorithm Components
1. The task the algorithm is used to address (e.g., classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g., a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g., accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g., steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when the data are too large to reside in memory)
 
Backpropagation data mining algorithm
(Network diagram: inputs x1, x2, x3, x4; hidden units h1, h2; output y, i.e., a 4-2-1 architecture.)
- vector of p input values multiplied by a p × d1 weight matrix
- resulting d1 values individually transformed by a non-linear function
- resulting d1 values multiplied by a d1 × d2 weight matrix
Backpropagation (cont.)
- Parameters: the weight matrices described above
- Score: e.g., sum of squared prediction errors
- Search: steepest descent; search for structure?
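To make the algorithm components concrete, here is a minimal sketch of the 4-2-1 network above trained by steepest descent on a squared-error score. The data, learning rate, and sigmoid non-linearity are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 100 cases, 4 inputs, 1 binary target (assumed, not from the slides)
X = rng.normal(size=(100, 4))
y = (X[:, 0] - 0.5 * X[:, 2] > 0).astype(float).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Structure: p = 4 inputs -> d1 = 2 hidden units -> 1 output (the 4-2-1 network above)
W1 = rng.normal(scale=0.1, size=(4, 2))   # p x d1 weight matrix
W2 = rng.normal(scale=0.1, size=(2, 1))   # d1 x d2 weight matrix

lr = 0.5
for epoch in range(2000):
    # Forward pass: inputs times W1, non-linear transform, result times W2
    H = sigmoid(X @ W1)
    y_hat = sigmoid(H @ W2)

    # Score: sum of squared errors; backward pass gives its gradient (steepest descent)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    grad_W2 = H.T @ delta_out
    delta_hidden = (delta_out @ W2.T) * H * (1 - H)
    grad_W1 = X.T @ delta_hidden

    W2 -= lr * grad_W2 / len(X)
    W1 -= lr * grad_W1 / len(X)

print("final sum of squared errors:",
      float(np.sum((sigmoid(sigmoid(X @ W1) @ W2) - y) ** 2)))
```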
Models and Patterns
Models:
- Prediction: linear regression; piecewise linear; nonparametric regression; classification (logistic regression, naïve Bayes/TAN/Bayesian networks, neural networks, support vector machines, trees, etc.)
- Probability Distributions: parametric models; mixtures of parametric models; graphical Markov models (categorical, continuous, mixed)
- Structured Data: time series; Markov models; Mixture Transition Distribution models; hidden Markov models; spatial models
Bias-Variance Tradeoff
- High bias - low variance
- Low bias - high variance: “overfitting”, modeling the random component
- The score function should embody the compromise
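One concrete way to see the tradeoff (an illustration with synthetic data, not from the slides): fit polynomials of increasing degree and compare training error with error on fresh data. Low-degree fits have high bias; high-degree fits have low bias but high variance, which typically shows up as a widening gap between training and test error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: a smooth signal plus noise (assumed example)
def make_data(n):
    x = np.sort(rng.uniform(0, 1, n))
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, y_train = make_data(25)
x_test, y_test = make_data(200)

for degree in (1, 4, 9):
    coef = np.polyfit(x_train, y_train, degree)               # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```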
The Curse of Dimensionality
X ~ MVN_p(0, I)
- Gaussian kernel density estimation
- Bandwidth chosen to minimize MSE at the mean
- Suppose we want the relative MSE at the mean held below a fixed level; the number of data points required grows explosively with dimension:

Dimension   # data points
1           4
2           19
3           67
6           2,790
10          842,000
Patterns
- Global: clustering via partitioning; hierarchical clustering; mixture models
- Local: outlier detection; changepoint detection; bump hunting; scan statistics; association rules
Scan Statistics via Permutation Tests
(Figure: a curve representing a road, with “x” marks along it.)
- The curve represents a road
- Each “x” marks an accident
- A red “x” denotes an injury accident; a black “x” means no injury
- Is there a stretch of road where there is an unusually large fraction of injury accidents?
Scan with Fixed Window
- If we know the length of the “stretch of road” that we seek, we could slide a window of that length along the road and find the most “unusual” window location
(Figure: the accidents along the road, with a fixed-length window highlighted.)
How Unusual is a Window?
- Let p_W and p_¬W denote the true probability of being red inside and outside the window, respectively. Let (x_W, n_W) and (x_¬W, n_¬W) denote the corresponding counts
- Use the GLRT for comparing H0: p_W = p_¬W versus H1: p_W ≠ p_¬W
- -2 log λ here has an asymptotic chi-square distribution with 1 df
- λ measures how unusual a window is
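For reference, the Bernoulli generalized likelihood ratio referred to here has the standard form. With \(\hat{p} = (x_W + x_{\neg W})/(n_W + n_{\neg W})\), \(\hat{p}_W = x_W/n_W\), and \(\hat{p}_{\neg W} = x_{\neg W}/n_{\neg W}\),

\[
\lambda =
\frac{\hat{p}^{\,x_W + x_{\neg W}}\,(1-\hat{p})^{\,n_W + n_{\neg W} - x_W - x_{\neg W}}}
     {\hat{p}_W^{\,x_W} (1-\hat{p}_W)^{\,n_W - x_W}\;
      \hat{p}_{\neg W}^{\,x_{\neg W}} (1-\hat{p}_{\neg W})^{\,n_{\neg W} - x_{\neg W}}},
\]

so a small \(\lambda\) (equivalently a large \(-2\log\lambda\)) marks an unusual window.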
Permutation Test
- Since we look at the smallest λ over all window locations, we need the distribution of the smallest λ under the null hypothesis that there are no clusters
- Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x’s
  (Each relabelling of the x’s yields its own smallest λ, e.g., 0.376, 0.233, 0.412, 0.222, …)
- Look at the position of the observed smallest λ in this distribution to get the scan-statistic p-value (e.g., if the observed smallest λ is the 5th smallest, the p-value is 0.005)
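A minimal sketch of the fixed-window procedure, using made-up accident labels and a hypothetical window length (illustrative only, not the tutorial's code). It works with -2 log λ, so "smallest λ" becomes "largest statistic".

```python
import numpy as np

rng = np.random.default_rng(2)

def neg2loglam(x_w, n_w, x_out, n_out):
    """-2 log GLRT lambda for P(red) inside vs. outside the window (Bernoulli model)."""
    def loglik(x, n, p):
        if p <= 0 or p >= 1:
            return 0.0 if x in (0, n) else -np.inf
        return x * np.log(p) + (n - x) * np.log(1 - p)
    p0 = (x_w + x_out) / (n_w + n_out)
    l0 = loglik(x_w, n_w, p0) + loglik(x_out, n_out, p0)
    l1 = loglik(x_w, n_w, x_w / n_w) + loglik(x_out, n_out, x_out / n_out)
    return -2 * (l0 - l1)

def best_window(labels, width):
    """Largest -2 log lambda (i.e., smallest lambda) over all windows of `width` accidents."""
    n, total = len(labels), labels.sum()
    best = 0.0
    for start in range(n - width + 1):
        x_w = labels[start:start + width].sum()
        best = max(best, neg2loglam(x_w, width, total - x_w, n - width))
    return best

# Illustrative accident labels along the road: 1 = injury (red), 0 = no injury
labels = rng.binomial(1, 0.2, size=60)
labels[25:33] = rng.binomial(1, 0.7, size=8)   # a planted "bad stretch"

width = 8                                      # hypothetical known window length
observed = best_window(labels, width)

# Permutation test: relabel the colors at random and recompute the best window each time
perm = [best_window(rng.permutation(labels), width) for _ in range(999)]
p_value = (1 + sum(s >= observed for s in perm)) / (1 + len(perm))
print(f"observed statistic {observed:.2f}, permutation p-value {p_value:.3f}")
```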
Variable Length Window
- No need to use a fixed-length window: examine all possible windows up to, say, half the length of the entire road
(Figure legend: one marker color = fatal accident, the other = non-fatal accident.)
Spatial Scan Statistics
- The spatial scan statistic uses, e.g., circles instead of line segments
 
Spatial-Temporal Scan Statistics
- The spatial-temporal scan statistic uses cylinders, where the height of the cylinder represents a time window
Other Issues
- A Poisson model is also common (instead of the Bernoulli model)
- Covariate adjustment
- Andrew Moore’s group at CMU: efficient algorithms for scan statistics
Software: SaTScan + others
http://www.satscan.org
http://www.phrl.org
http://www.terraseer.com
Association Rules: Support and Confidence
- Find all the rules Y ⇒ Z with minimum confidence and support
  - support, s: the probability that a transaction contains {Y & Z}
  - confidence, c: the conditional probability that a transaction containing Y also contains Z
- With minimum support 50% and minimum confidence 50%, we have
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)
(Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both.)
Mining Association Rules—An Example
- For rule A ⇒ C:
  - support = support({A & C}) = 50%
  - confidence = support({A & C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent
(Min. support 50%, min. confidence 50%.)
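A toy computation of support and confidence. The four-transaction database below is a hypothetical one chosen so the quoted numbers come out; the slide's own database is not in the transcript.

```python
# Hypothetical transaction database consistent with the quoted figures
transactions = [
    {"A", "C"},
    {"A", "C", "D"},
    {"A", "B"},
    {"B", "E"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs & rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(f"A => C: support {support({'A', 'C'}):.0%}, confidence {confidence({'A'}, {'C'}):.1%}")
print(f"C => A: support {support({'A', 'C'}):.0%}, confidence {confidence({'C'}, {'A'}):.1%}")
# A => C: support 50%, confidence 66.7%
# C => A: support 50%, confidence 100.0%
```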
Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset (i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets)
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
The Apriori Algorithm
- Join Step: C_k is generated by joining L_{k-1} with itself
- Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k
    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;
The Apriori Algorithm — Example
(Worked example: scan database D for candidate 1-itemsets C_1 and frequent 1-itemsets L_1; join to form C_2, scan D for L_2; join to form C_3, scan D for L_3.)
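A compact sketch of the join/prune/scan loop above (illustrative Python, with a made-up four-transaction database; not the tutorial's code and not necessarily the slide's database D).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, level by level, as in the pseudo-code."""
    items = sorted({item for t in transactions for item in t})
    # L1 = frequent 1-itemsets
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    L = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(L)
    k = 1
    while L:
        # Join step: candidates of size k+1 from unions of frequent k-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in L for sub in combinations(c, k))}
        # Scan the database and count each surviving candidate
        counts = {c: sum(c <= set(t) for t in transactions) for c in candidates}
        L = {s: cnt for s, cnt in counts.items() if cnt >= min_support}
        frequent.update(L)
        k += 1
    return frequent

# Made-up example database; min_support is a count (2 of 4 transactions = 50%)
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(D, min_support=2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
```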
Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
  - age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
- Single-dimension vs. multi-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions (thousands!)
 
Model-based Clustering Padhraic Smyth, UCI
 
 
 
 
 
 
 
Mixtures of {Sequences, Curves, …}
Generative Model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
[Note: given p(D_i | c_k), we can define an EM algorithm]
Example: Mixtures of SFSMs
- Simple model for traversal on a Web site (equivalent to a first-order Markov chain with an end state)
- Generative model for large sets of Web users: different behaviors <=> mixture of SFSMs
- The EM algorithm is quite simple: weighted counts
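A sketch of the "weighted counts" EM for a mixture of first-order Markov chains over page categories. The synthetic sessions, two components, and the omission of an explicit end state are all assumptions for illustration; this is not the WebCanvas implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
S = 4  # number of page categories (assumed)

def sample_session(p_start, p_trans, length):
    seq = [rng.choice(S, p=p_start)]
    for _ in range(length - 1):
        seq.append(rng.choice(S, p=p_trans[seq[-1]]))
    return seq

# Synthetic users drawn from two different browsing behaviors
startA, transA = np.full(S, 1 / S), np.full((S, S), 0.1) + 0.6 * np.eye(S)
startB, transB = np.full(S, 1 / S), np.full((S, S), 0.3) - 0.05 * np.eye(S)
transA /= transA.sum(1, keepdims=True); transB /= transB.sum(1, keepdims=True)
sessions = [sample_session(startA, transA, 20) for _ in range(150)] + \
           [sample_session(startB, transB, 20) for _ in range(150)]

K = 2
pi = np.full(K, 1 / K)                            # mixture weights
start = rng.dirichlet(np.ones(S), size=K)         # per-component start probabilities
trans = rng.dirichlet(np.ones(S), size=(K, S))    # per-component transition matrices

def loglik(seq, k):
    ll = np.log(start[k, seq[0]])
    for a, b in zip(seq, seq[1:]):
        ll += np.log(trans[k, a, b])
    return ll

for it in range(50):
    # E-step: responsibility of each component for each user's whole session
    logr = np.array([[np.log(pi[k]) + loglik(s, k) for k in range(K)] for s in sessions])
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr); r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted counts of start states and transitions
    pi = r.mean(axis=0)
    start = np.full((K, S), 1e-6)
    trans = np.full((K, S, S), 1e-6)
    for w, seq in zip(r, sessions):
        for k in range(K):
            start[k, seq[0]] += w[k]
            for a, b in zip(seq, seq[1:]):
                trans[k, a, b] += w[k]
    start /= start.sum(axis=1, keepdims=True)
    trans /= trans.sum(axis=2, keepdims=True)

print("estimated mixture weights:", np.round(pi, 2))
```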
WebCanvas: Cadez, Heckerman, et al, KDD 2000
 
 
 
Discussion
- What is data mining? Hard to pin down – who cares?
- Textbook statistical ideas with a new focus on algorithms
- Lots of new ideas too
Privacy and Data Mining Ronny Kohavi, ICML 1998
Analyzing Hospital Discharge Data David Madigan Rutgers University
Comparing Outcomes Across Providers
Florence Nightingale wrote in 1863:
“In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison… I am fain to sum up with an urgent appeal for adopting some uniform system of publishing the statistical records of hospitals.”
Data
- Data of various kinds are now available; e.g., data concerning all Medicare/Medicaid hospital admissions in the standard format UB-92; covers >95% of all admissions nationally
- Considerable interest in using these data to compare providers (hospitals, physician groups, physicians, etc.)
- In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers
Pennsylvania Healthcare Cost Containment Council, 2000-1, n=800,000
Fields in each discharge record include: ADDOW, ADHOUR, ADMDX, ADSOURCE, ADTYPE, AGE, AGECAT, ANCLRCHG, APRDRG, APRMDC, APRROM, APRSOI, ATTID, BILLTYPE, CANCER1, CANCER2, COUNTY, DCDOW, DCHOUR, DCSTATUS, DRGHC4, DRGHOSP, DRUGCHG, ECODE, EQUIPCHG, ESTPAYER, ETHNIC, HREGION, LOS, MAID, MDCHC4, MISCCHG, MKTSHARE, MQGCELL, MQGCLUST, MQNRSP, MQSEV, NAIC, NONCVCHG, OCCUR1, OCCUR2, OPERID, PAF, PAYTYPE1–PAYTYPE3, PCMU, PDX, PPX, PPXDOW, PRIVZIP, PROFCHG, PSEUDOID, PTSEX, QUARTER, RACE, REFID, ROOMCHG, SDX1–SDX8, SPECLCHG, SPX1–SPX5, SPX1DOW–SPX5DOW, STATE, SYSID, TOTALCHG, YEAR.
Risk Adjustment
- Discharge data like these allow for comparisons of, e.g., mortality rates for the CABG procedure across hospitals
- Some hospitals accept riskier patients than others; a fair comparison must account for such differences
- PHC4 (and many other organizations) use “indirect standardization”
- http://www.phc4.org
 
Hospital Responses
 
p-value computation
- n = 463; suppose the actual number of deaths = 40
- e = 29.56 (the risk-adjusted expected number of deaths)
- p-value: the probability of observing a death count at least this large given the expected count (a hedged computation is sketched below)
- p-value < 0.05
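The slide's p-value formula is not in the transcript; a reasonable reconstruction, stated as an assumption, is to treat the death count as Poisson (or Binomial) with mean equal to the risk-adjusted expectation and compute an upper tail probability. For these numbers either version falls below 0.05.

```python
from scipy.stats import poisson, binom

n, e, observed = 463, 29.56, 40

# P(X >= 40) under a Poisson model for the death count (assumed model)
p_pois = poisson.sf(observed - 1, e)

# Same tail probability under a Binomial(n, e/n) model (assumed model)
p_binom = binom.sf(observed - 1, n, e / n)

print(f"Poisson tail p-value  : {p_pois:.3f}")
print(f"Binomial tail p-value : {p_binom:.3f}")
# Both are below 0.05, consistent with the slide's conclusion.
```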
Concerns
- Ad-hoc groupings of strata
- Adequate risk adjustment for outcomes other than mortality? Sensitivity analysis? Hopeless?
- Statistical testing versus estimation
- Simpson’s paradox
Hospital A (SMR = 24/18 = 1.33; p-value = 0.07):
Risk Cat.   Rate   Actual Number   Expected Number   N
Low         1%     8               8 (1%)            800
High        8%     16              10 (5%)           200

Hospital B (SMR = 66/42 = 1.57; p-value = 0.0002):
Risk Cat.   Rate   Actual Number   Expected Number   N
Low         1%     2               2 (1%)            200
High        8%     64              40 (5%)           800

(The two hospitals have identical stratum-specific death rates, yet indirect standardization gives them different SMRs and p-values because their case mixes differ.)
Hierarchical Model
- Patients -> physicians -> hospitals
- Build a model using data at each level and estimate quantities of interest
Bayesian Hierarchical Model MCMC via WinBUGS
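One generic form such a patients-within-hospitals Bayesian hierarchical model can take (a sketch under assumptions; the model actually fitted on this slide may differ) is

\[
y_{ij} \mid p_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad
\mathrm{logit}(p_{ij}) = x_{ij}^{\top}\beta + u_j,
\]
\[
u_j \sim N(0, \sigma^2), \qquad \beta \sim N(0, \tau^2 I), \qquad \sigma \sim \text{a diffuse prior},
\]

where \(y_{ij}\) indicates death for patient \(i\) in hospital \(j\), \(x_{ij}\) are the risk-adjustment covariates, and \(u_j\) is a hospital-level random effect; MCMC (e.g., in WinBUGS) draws from the joint posterior.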
Goldstein and Spiegelhalter, 1996
Discussion
- Markov chain Monte Carlo + compute power enable hierarchical modeling
- Software is a significant barrier to the widespread application of better methodology
- Are these data useful for the study of disease?