1. A Statistical Perspective on Data Mining

   John Maindonald (Centre for Mathematics and Its Applications,
   Australian National University)
   July 2, 2009
2. Content of the Talk

   Data Mining (DM); Machine Learning (ML)
   – Statistical Learning as an over-riding theme
   From regression to Data Mining
   The tradition and practice of Data Mining
   Validation/accuracy assessment
   The hazards of size: observations? variables?
   Insight from plots
   Comments on automation
   Summary
3. Computing Technology is Transforming Science

   Huge data sets[1], new types of data
      Web pages, medical images, expression arrays, genomic data,
      NMR spectroscopy, sky maps, ...
   Sheer size brings challenges
      Data management; some new analysis challenges.
      Many observations, or many variables?
   New algorithms; “algorithmic” models
      Trees, random forests, Support Vector Machines, ...
   Automation, especially for data collection
      There are severe limits to automated analysis.
   Synergy: computing power with new theory.

   Data Mining & Machine Learning fit in this mix. Both use
   Statistical Learning approaches.

   [1] cf. the hype re Big Data in Weiss and Indurkhya (1997).
4. DM & ML Jargon – Mining, Learning and Training

   Mining is used in two senses:
      Mining for data
      Mining to extract meaning, in a scientific/statistical sense.
   Pre-analysis data extraction and processing may rely heavily on
   computer technology. Additionally, the design of data collection
   requires attention. In mining to extract meaning, statistical
   considerations are pervasive.

   Learning & Training
      The (computing) machine learns from data.
      Use training data to train the machine or software.
5. Example 1: Record Times for Track and Road Races

   [Figure: log-log plot of world record time versus distance, for
   road and track races, from 100 m (9.6 sec) to 290 km (24 h), with
   a fitted line; the lower panel shows residuals on the log_e
   scale.]

   Is the line the whole story?
      The ratio of the largest to the smallest time is ~3000.
      A difference from the line of >15%, for the longest race, is
      not visually obvious!
   The plot of residuals
      Here, differences from the line have been scaled up by a
      factor of ~20.
6. World record times – Learning from the data

   Fit a smooth to residuals. (“Let the computer rip”!)
   Is the curve “real”? The algorithm says “Yes”.

   [Figure: residuals from the line, with the fitted smooth; the
   lower panel shows residuals from the smooth.]

   Questions remain:
      “Real”, in what sense?
      Will the pattern be the same in 2030?
      Is it consistent across geographical regions?
      Does it partly reflect greater attention paid to some
      distances?

   So why/when the smooth, rather than the line?
   Martians may have a somewhat different curve!
7. Example 2: the Forensic Glass Dataset (1987 data)

   Find a rule to predict the type of any new piece of glass:
      Window float (70), Window non-float (76), Vehicle window (17),
      Containers (13), Tableware (9), Headlamps (29).

   [Figure: 2-D visual summary of a classification, Axis 1 versus
   Axis 2, with the six glass types marked.]

   Variables are %’s of Na, Mg, ..., plus refractive index
   (214 rows × 10 columns).

   NB: These data are well past their “use by” date!
8. Statistical Learning – Commentary

   1. Continuous outcome (eg, times vs distances)
      Allow data to largely choose the form of response.
      Often, use methodology to refine a theoretical model (large
      datasets more readily highlight discrepancies).
   2. Categorical outcome (eg, forensic glass data)
      Theory is rarely much help in choosing the form of model.
      Theoretical accuracy measures are usually unavailable.
   Both types of outcome
      Variable selection and/or model tuning wreaks havoc with most
      theoretically based accuracy estimates. Hence the use of
      empirical approaches.
9. Selection – An Aid to Seeing What is not There!

   Random data, 32 points, split randomly into 3 groups.
      (1) Select the 15 “best” features, from 500.
      (2) Show a 2-D view of the classification that uses the 15
      “best” features.

   [Figure: plot on the first two discriminant axes, in which the
   three (purely random) groups appear well separated.]

   NB: This demonstrates the usefulness of simulation.
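The slide's demonstration can be mimicked with a short simulation. This is a sketch with our own variable names; the slide itself used discriminant analysis, whereas here a simple between-group spread score stands in for the feature ranking:

```python
import random
from statistics import mean

# 32 points of pure noise, 500 features, an arbitrary 3-way grouping.
random.seed(1)
n, p, k_best = 32, 500, 15
group = [i % 3 for i in range(n)]
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

def separation(j):
    """Spread of the three group means for feature j (noise only!)."""
    means = [mean(X[i][j] for i in range(n) if group[i] == g)
             for g in range(3)]
    return max(means) - min(means)

ranked = sorted(range(p), key=separation, reverse=True)
best15 = ranked[:k_best]
# The selected features separate the random groups far better than a
# typical feature does - a pure selection effect.
print(separation(best15[0]) > separation(ranked[p // 2]))   # True
```

With enough candidate features, some will always look discriminating, even though every value here is pure noise.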
10. Selection Without Spurious Pattern – Train/Test

   Training/Test
      Split data into training and test sets.
      Train (NB: steps 1 & 2); use test data for assessment.
   Cross-validation
      Simple version: train on subset 1, test on subset 2;
      then train on subset 2, test on subset 1.
      More generally, data are split into k parts (eg, k = 10). Use
      each part in turn for testing, with the other data used to
      train.
   Warning – what the books rarely acknowledge
      All methods give accuracies based on sampling from the source
      population. For observational data, the target usually differs
      from the source.
11. Cross-Validation

   Steps
      Split data into k parts (below, k = 4).
      At the ith repeat or fold (i = 1, ..., k) use the ith part for
      testing, and the other k − 1 parts for training.
      Combine the performance estimates from the k folds.

   [Figure: schematic of 4-fold cross-validation – at each fold, one
   of the four parts (n1, n2, n3, n4 observations) is the test set;
   the remaining three parts are used for training.]
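The fold construction described in the steps above can be sketched in a few lines (a minimal illustration; function and variable names are our own):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal folds; yield one
    (train, test) pair of index lists per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With n = 12 and k = 4, each fold tests on 3 points, trains on 9:
sizes = [(len(tr), len(te)) for tr, te in kfold_indices(12, 4)]
print(sizes)   # [(9, 3), (9, 3), (9, 3), (9, 3)]
```

Every observation appears in exactly one test set, so the k performance estimates together cover the whole dataset.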
12. Selection Without Spurious Pattern – the Bootstrap

   Bootstrap sampling
      Here are three bootstrap samples from the numbers 1 ... 10:
      1 3 6 6 6 6 6 8 9 10    (5 6’s; omits 2, 4, 5, 7)
      2 2 3 4 6 8 8 9 10 10   (2 2’s, 2 8’s, 2 10’s; omits 1, 5, 7)
      1 1 1 1 3 3 4 6 7 8     (4 1’s, 2 3’s; omits 2, 5, 9, 10)

   Bootstrap sampling – putting it to use
      Take repeated (with replacement) random samples of the
      observations, of the same size as the initial sample.
      Repeat the analysis on each new sample. (NB: in the example
      above, repeat both step 1 (selection) and step 2 (analysis).)
      Variability between samples indicates statistical variability.
      Combine the separate analysis results into an overall result.
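Bootstrap samples like those above can be generated directly (a sketch; names are our own, and the particular values drawn depend on the seed):

```python
import random

def bootstrap_sample(data, rng):
    """One with-replacement resample, the same size as the original."""
    return [rng.choice(data) for _ in data]

rng = random.Random(2009)
data = list(range(1, 11))        # the numbers 1..10, as on the slide
for _ in range(2):
    s = sorted(bootstrap_sample(data, rng))
    print(s, "omits", sorted(set(data) - set(s)))
```

Each resample typically repeats some values and omits others, which is exactly what drives the between-sample variability that the bootstrap measures.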
13. JM’s Preferred Methods for Classification Data

   Linear Discriminant Analysis
      It is simple, in the style of classical regression.
      It leads naturally to a 2-D plot.
      The plot may suggest trying methods that build in weaker
      assumptions.
   Random forests
      It is clear why it works and does not overfit.
      No other method consistently outperforms it.
      It is simple, and highly automatic to use.

   For illustration, use data with 3 outcome categories.
14. Example 3 – Clinical Classification of Diabetes

   Predict class (overt, chemical, or normal) using relative weight,
   fasting plasma glucose, glucose area, insulin area, and SSPG
   (steady state plasma glucose).

   [Figure: 1st versus 2nd discriminant axis, with the three classes
   marked.]

   Linear Discriminant Analysis: CV accuracy = 89%.
15. Tree-based Classification

   [Figure: classification tree with splits glucArea >= 420.5 and
   fpg >= 117, and leaves overt, chemical and normal.]

   Tree-based classification: CV accuracy = 97.2%.
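A two-split tree of this kind amounts to a pair of nested if-statements. In this sketch the thresholds are those shown on the slide, but which branch leads to which leaf is our assumption (high glucose area suggesting diabetes, high fasting plasma glucose suggesting the overt form), not read off the figure:

```python
def classify(gluc_area, fpg):
    """Two-split classification tree; thresholds from the slide,
    branch orientation assumed."""
    if gluc_area >= 420.5:
        return "overt" if fpg >= 117 else "chemical"
    return "normal"

print(classify(300.0, 100.0))   # normal
print(classify(600.0, 130.0))   # overt
print(classify(600.0, 100.0))   # chemical
```

The appeal of trees is exactly this: the fitted rule is directly readable as a cascade of simple threshold tests.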
16. Random forests – A Forest of Trees!

   Each tree is for a different random with-replacement
   (“bootstrap”) sample of the data, and a sample of the variables.

   [Figure: twelve classification trees, each grown on a different
   bootstrap sample, splitting on variables such as fpg, glucArea,
   SSPG, Insulin and relwt.]

   Each tree has one vote; the majority wins.
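The voting step can be sketched as follows (the “trees” here are toy stumps invented for illustration, not the fitted trees in the figure):

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree casts one vote; the majority class wins
    (ties broken arbitrarily)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three toy decision stumps on hypothetical feature dictionaries:
trees = [
    lambda x: "overt" if x["fpg"] >= 117 else "normal",
    lambda x: "overt" if x["glucArea"] >= 420.5 else "normal",
    lambda x: "normal",
]
print(forest_predict(trees, {"fpg": 130, "glucArea": 600}))   # overt
```

Because each tree sees a different bootstrap sample and variable subset, the individual votes disagree in informative ways, and the majority is more stable than any single tree.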
17. Plot derived from random forest analysis

   [Figure: 1st versus 2nd ordinate axis, with the overt, chemical
   and normal classes marked.]

   This plot tries hard to reflect the probabilities of group
   membership assigned by the analysis. It does not result from a
   “scaling” of the feature space.
18. When are crude data mining approaches enough?

   Often, rough and ready approaches, ignoring distributional and
   other niceties, work well! In other cases, rough and ready
   approaches can be highly misleading!

   Some key issues
      Watch for source/target (eg, 2008/2009) differences.
      Allow for effects of model and variable selection.
      Do not try to interpret individual model coefficients, unless
      you know how to avoid the traps.

   Skill is needed to differentiate between problems where
   relatively cavalier approaches may be acceptable, and problems
   that demand greater care.
19. Different Relationships Between Source & Target

   Source versus target           Are data available from target?
   1: Identical (or nearly so)    Yes; data from source suffice
   2: Source & target differ      Yes
   3: Source & target differ      No, but change can be modeled
                                  (cf: multi-level models; time
                                  series)
   4: Source & target differ      No; must make an informed guess

   Where source & target differ, it may be possible to get insight
   from matched historical source/target data. There are major
   issues here, which might occupy several further lectures!
20. The Hazards of Size: Observations? Variables?

   Many observations
      Additional structure often comes with increased size – data
      may be less homogeneous, span larger regions of time or space,
      or be from more countries.
      Or there may be extensive information about not much! (eg,
      temperatures, at second intervals, for a day.) SEs from
      modeling that ignores this structure may be misleadingly
      small.
      In large homogeneous datasets, spurious effects are a risk:
      small SEs increase the risk of detecting spurious effects that
      arise, eg, from sampling bias (likely in observational data)
      and/or from measurement error.
   Many variables (features)
      Huge numbers of variables spread information thinly! This is a
      challenge to the analyst.
21. The Hazards of High Dimensional Data

   Select, or summarize?
      For selection, beware of selection effects – with enough
      lottery tickets, some kind of prize is pretty much inevitable.
      A pattern based on the “best” 15 features, out of 500, may
      well be meaningless!
      Summary measures may involve selection.
   Beware of over-fitting
      Over-fitting reduces real accuracy.
      Preferably, use an algorithm that does not overfit with
      respect to the source population.
      Unless optimized with respect to the target, some over-fitting
      may be inevitable!
      Any algorithm can be misused to overfit! (Even those that do
      not overfit!)
22. Why plot the data?

   Which are the difficult points?
      Some points may be mislabeled (faulty medical diagnosis?).
      Improvement of classification accuracy is a useful goal only
      if misclassified points are in principle classifiable.
   What if points are not well represented in 2-D or 3-D?
      One alternative is to identify points that are outliers on a
      posterior odds (of group membership) scale.
23. New Technology has Many Attractions

   Are you by now convinced that
      Data analysis is dapper
      Data mining is delightful
      Analytics is amazing
      Regression is rewarding
      Trees are tremendous
      Forests are fascinating
      Statistics is stupendous?
   Some or all of these?

   Even with the best modern software, however, it is hard work to
   do data analysis well.
24. The Science of Data Mining

   Getting the science right is more important than finding the
   true and only best algorithm! (There is none!)
      Get to grips with the statistical issues.
      Know the target (1987 glass is not 2008 glass).
      Understand the traps (too many to talk about here[2]).
      Grasp the basic ideas of time series, multi-level models, and
      comparisons based on profiles over time.
   Use the technology critically and with care!
   Do it, where possible, with graphs.

   [2] Maindonald (2006) notes a number of common traps, with
   extensive references. Berk (2008) has excellent advice.
25. References

   Berk, R. 2008. Statistical Learning from a Regression
   Perspective. [Berk’s insightful commentary injects needed
   reality checks into the discussion of data mining and
   statistical learning.]

   Maindonald, J.H. 2006. Data Mining Methodological Weaknesses and
   Suggested Fixes. Proceedings of the Australasian Data Mining
   Conference (Aus06).

   Maindonald, J.H. and Braun, W.J. 2007. Data Analysis and
   Graphics Using R – An Example-Based Approach. 2nd edn, Cambridge
   University Press. [Statistics, with hints of data mining!]

   Rosenbaum, P.R. 2002. Observational Studies. 2nd edn, Springer.
   [Observational data from an experimental perspective.]

   Wood, S.N. 2006. Generalized Additive Models. An Introduction
   with R. Chapman & Hall/CRC. [This has an elegant treatment of
   linear models and generalized linear models, as a lead-in to
   methods for fitting smooth models.]
26. Web Sites

   http://www.sigkdd.org/
   [Association for Computing Machinery Special Interest Group on
   Knowledge Discovery and Data Mining.]

   http://www.amstat.org/profession/index.cfm?fuseaction=dataminingfaq
   [Comments on many aspects of data mining.]

   http://www.cs.ucr.edu/~eamonn/TSDMA/
   [UCR Time Series Data Mining Archive.]

   http://kdd.ics.uci.edu/
   [UCI KDD Archive.]

   http://en.wikipedia.org/wiki/Data_mining
   [This (Dec 12 2008) has useful links. Lacking in sharp critical
   commentary. It emphasizes commercial data mining tools.]

   The R package mlbench has “a collection of artificial and
   real-world machine learning benchmark problems, including, e.g.,
   several data sets from the UCI repository.”
