# A Statistical Perspective on Data Mining


1. A Statistical Perspective on Data Mining. John Maindonald (Centre for Mathematics and Its Applications, Australian National University). July 2, 2009.
2. Content of the Talk
   - Data Mining (DM); Machine Learning (ML) – Statistical Learning as an over-riding theme
   - From regression to Data Mining
   - The tradition and practice of Data Mining
   - Validation/accuracy assessment
   - The hazards of size: observations? variables?
   - Insight from plots
   - Comments on automation
   - Summary
3. Computing Technology is Transforming Science
   - Huge data sets¹, new types of data: web pages, medical images, expression arrays, genomic data, NMR spectroscopy, sky maps, ...
   - Sheer size brings challenges: data management, and some new analysis challenges. Many observations, or many variables?
   - New algorithms; "algorithmic" models: trees, random forests, Support Vector Machines, ...
   - Automation, especially for data collection. There are severe limits to automated analysis.
   - Synergy: computing power with new theory. Data Mining & Machine Learning fit in this mix. Both use Statistical Learning approaches.

   ¹ cf. the hype re Big Data in Weiss and Indurkhya (1997)
4. DM & ML Jargon – Mining, Learning and Training
   - Mining is used in two senses: mining for data, and mining to extract meaning, in a scientific/statistical sense.
   - Pre-analysis data extraction & processing may rely heavily on computer technology. Additionally, the design of data collection requires attention.
   - In mining to extract meaning, statistical considerations are pervasive.
   - Learning & training: the (computing) machine learns from data. Use training data to train the machine or software.
5. Example 1: Record Times for Track and Road Races
   [Plot: log(time) vs log(distance) for road and track races, from 100 m (9.6 sec) to 290 km (24 h), with a fitted straight line; lower panel shows residuals on the log scale.]
   - Is the line the whole story? The ratio of largest to smallest time is ~3000.
   - A difference from the line of >15%, for the longest race, is not visually obvious!
   - In the plot of residuals, differences from the line have been scaled up by a factor of ~20.
6. World Record Times – Learning from the Data
   - Fit a smooth to the residuals. ('Let the computer rip'!) Is the curve 'real'? The algorithm says "Yes".
   [Plot: residuals from the line, with a fitted smooth; the lower panel shows residuals from the smooth.]
   - Questions remain: 'Real', in what sense? Will the pattern be the same in 2030? Is it consistent across geographical regions? Does it partly reflect greater attention paid to some distances?
   - So why/when the smooth, rather than the line? Martians may have a somewhat different curve!
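The straight-line fit on the log-log scale can be sketched as below. The distance/time figures are rough, illustrative record-time values (not the talk's dataset), and the fit is ordinary least squares of log10(time) on log10(distance):

```python
import math

# Illustrative (distance in km, time in hours) pairs, approximating
# world-record times; NOT the dataset used in the talk.
records = [(0.1, 0.0027), (0.4, 0.0120), (1.5, 0.0570), (10.0, 0.4400),
           (42.2, 2.0500), (100.0, 6.1000), (290.0, 24.0000)]

# Least-squares fit of log10(time) on log10(distance).
xs = [math.log10(d) for d, _ in records]
ys = [math.log10(t) for _, t in records]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

# Residuals on the log scale: small departures from the line that the
# raw plot hides, because times span ~3000-fold.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```

The slope comes out a little above 1: record times grow slightly faster than in proportion to distance, which is why the residual plot, not the raw plot, carries the interesting information.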
7. Example 2: The Forensic Glass Dataset (1987 data)
   Find a rule to predict the type of any new piece of glass.
   [Plot: a 2-D visual summary of a classification into six types – window float (70), window non-float (76), vehicle window (17), containers (13), tableware (9), headlamps (29).]
   - Variables are %'s of Na, Mg, ..., plus refractive index. (214 rows × 10 columns.)
   - NB: These data are well past their "use by" date!
8. Statistical Learning – Commentary
   1. Continuous outcome (e.g., times vs distances): allow the data to largely choose the form of response. Often, use the methodology to refine a theoretical model (large datasets more readily highlight discrepancies).
   2. Categorical outcome (e.g., forensic glass data): theory is rarely much help in choosing the form of model, and theoretical accuracy measures are usually unavailable.
   - Both types of outcome: variable selection and/or model tuning wreaks havoc with most theoretically based accuracy estimates. Hence the use of empirical approaches.
9. Selection – An Aid to Seeing What is Not There!
   [Plot: 2-D view, on two discriminant axes, of random data – 32 points, split randomly into 3 groups.]
   - (1) Select the 15 'best' features, from 500.
   - (2) Show a 2-D view of the classification that uses the 15 'best' features.
   - NB: This demonstrates the usefulness of simulation.
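The simulation behind this slide is easy to reproduce. A minimal sketch, assuming a simple "spread of group means" score as the feature-ranking criterion (the talk does not specify one): generate pure noise, rank 500 noise features against random group labels, and keep the "best" 15. Those 15 will show apparent group separation that is entirely spurious.

```python
import random

random.seed(1)
n_points, n_features, n_keep = 32, 500, 15

# Pure noise: 32 points assigned at random to 3 groups, 500 noise features.
labels = [random.randrange(3) for _ in range(n_points)]
features = [[random.gauss(0, 1) for _ in range(n_points)]
            for _ in range(n_features)]

def separation(feature, labels):
    """Spread of the group means: large values look like 'real' structure."""
    means = []
    for g in set(labels):
        vals = [x for x, lab in zip(feature, labels) if lab == g]
        means.append(sum(vals) / len(vals))
    return max(means) - min(means)

# Ranking 500 noise features and keeping the 'best' 15 manufactures
# apparent group differences where none exist.
ranked = sorted(range(n_features),
                key=lambda j: separation(features[j], labels), reverse=True)
best = ranked[:n_keep]
```

Any plot built from `best` will show well-separated groups, even though every feature is noise: that is the selection effect the slide warns about.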
10. Selection Without Spurious Pattern – Train/Test
   - Training/test: split the data into training and test sets. Train (NB: steps 1 & 2); use the test data for assessment.
   - Cross-validation, simple version: train on subset 1, test on subset 2; then train on subset 2, test on subset 1. More generally, the data are split into k parts (e.g., k = 10). Use each part in turn for testing, with the other data used to train.
   - Warning – what the books rarely acknowledge: all methods give accuracies based on sampling from the source population. For observational data, the target usually differs from the source.
11. Cross-Validation Steps
   - Split the data into k parts (below, k = 4).
   - At the ith repeat or fold (i = 1, ..., k), use the ith part for testing, the other k−1 parts for training.
   - Combine the performance estimates from the k folds.
   [Diagram: n = n₁ + n₂ + n₃ + n₄ observations; at each of the 4 folds, one part is the test set and the remaining three are used for training.]
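The steps above can be sketched as a short routine. This is a minimal k-fold scheme with consecutive folds, as in the diagram; the `fit` and `score` arguments are placeholders for whatever model and accuracy measure are in use:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds, sizes as equal as possible."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cross_validate(xs, ys, fit, score, k=4):
    """Train on k-1 folds, test on the held-out fold, average the k scores."""
    n = len(xs)
    scores = []
    for test_idx in kfold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / k
```

Crucially, per the previous slide, any selection or tuning (steps 1 & 2) must happen inside `fit`, i.e., separately within each fold, or the cross-validated accuracy inherits the same spurious pattern that selection creates.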
12. Selection Without Spurious Pattern – the Bootstrap
   - Bootstrap sampling: here are three bootstrap samples from the numbers 1 ... 10:
     - 1 3 6 6 6 6 6 8 9 10 (five 6's; omits 2, 4, 5, 7)
     - 2 2 3 4 6 8 8 9 10 10 (two 2's, two 8's, two 10's; omits 1, 5, 7)
     - 1 1 1 1 3 3 4 6 7 8 (four 1's, two 3's; omits 2, 5, 9, 10)
   - Putting it to use: take repeated (with replacement) random samples of the observations, of the same size as the initial sample. Repeat the analysis on each new sample (NB: in the example above, repeat both step 1 (selection) & step 2 (analysis)). Variability between samples indicates statistical variability. Combine the separate analysis results into an overall result.
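Bootstrap resampling is a few lines of code. A minimal sketch (the `statistic` argument stands in for the whole analysis, selection steps included):

```python
import random

def bootstrap_samples(data, n_samples, seed=0):
    """Draw samples of the same size as data, with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_samples)]

def bootstrap_estimates(data, statistic, n_samples=1000, seed=0):
    """Re-run the analysis on each resample; the spread of the results
    estimates the statistical variability of the statistic."""
    return [statistic(s) for s in bootstrap_samples(data, n_samples, seed)]
```

For example, `bootstrap_estimates(data, statistic=mean)` gives a collection of resampled means whose spread approximates the sampling variability of the mean.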
13. JM's Preferred Methods for Classification Data
   - Linear discriminant analysis: it is simple, in the style of classical regression. It leads naturally to a 2-D plot. The plot may suggest trying methods that build in weaker assumptions.
   - Random forests: it is clear why it works and does not overfit. No other method consistently outperforms it. It is simple, and highly automatic to use.
   - For illustration, use data with 3 outcome categories.
14. Example 3 – Clinical Classification of Diabetes
   [Plot: first two linear discriminant axes, with points classed as overt, chemical, or normal.]
   - Predict class (overt, chemical, or normal) using relative weight, fasting plasma glucose, glucose area, insulin area, & SSPG (steady state plasma glucose).
   - Linear discriminant analysis: CV accuracy = 89%.
15. Tree-Based Classification
   [Tree diagram: splits on glucArea ≥ 420.5 and fpg ≥ 117, with leaves overt, chemical, and normal.]
   - Tree-based classification: CV accuracy = 97.2%.
16. Random Forests – A Forest of Trees!
   - Each tree is for a different random with-replacement ('bootstrap') sample of the data, and a random sample of the variables.
   [Diagram: twelve example trees, each splitting on variables such as fpg, glucArea, SSPG, Insulin, or relwt, with leaves overt, chemical, and normal.]
   - Each tree has one vote; the majority wins.
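The voting step is simple to sketch. Below, hypothetical one-split "stumps" stand in for fitted trees (the thresholds echo those in the diagram; a real forest would grow each tree on a bootstrap sample of the observations and a random subset of the variables, and the class labels here are simplified to two):

```python
from collections import Counter

def forest_predict(trees, x):
    """Each fitted tree casts one vote; the majority class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy stand-ins for fitted trees: each is a single threshold rule.
trees = [
    lambda x: "overt" if x["fpg"] >= 117 else "not overt",
    lambda x: "overt" if x["glucArea"] >= 420.5 else "not overt",
    lambda x: "overt" if x["fpg"] >= 119 else "not overt",
]
```

Because the trees are grown on different resamples and variable subsets, their errors are only weakly correlated, which is why averaging their votes stabilises the prediction.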
17. Plot Derived from Random Forest Analysis
   [Plot: a 2-D ordination, with points classed as overt, chemical, or normal.]
   - This plot tries hard to reflect the probabilities of group membership assigned by the analysis.
   - It does not result from a 'scaling' of the feature space.
18. When Are Crude Data Mining Approaches Enough?
   - Often, rough and ready approaches, ignoring distributional and other niceties, work well! In other cases, rough and ready approaches can be highly misleading!
   - Some key issues: watch for source/target (e.g., 2008/2009) differences. Allow for the effects of model and variable selection. Do not try to interpret individual model coefficients, unless you know how to avoid the traps.
   - Skill is needed to differentiate between problems where relatively cavalier approaches may be acceptable, and problems that demand greater care.
19. Different Relationships Between Source & Target
   Are data available from the target?
   - 1: Identical (or nearly so) – Yes; data from the source suffice.
   - 2: Source & target differ – Yes.
   - 3: Source & target differ – No, but the change can be modeled (cf. multi-level models; time series).
   - 4: Source & target differ – No; must make an informed guess.

   Where source & target differ, it may be possible to get insight from matched historical source/target data. There are major issues here, which might occupy several further lectures!
20. The Hazards of Size: Observations? Variables?
   - Many observations: additional structure often comes with increased size – data may be less homogeneous, span larger regions of time or space, or be from more countries. Or there may be extensive information about not much! (e.g., temperatures, at second intervals, for a day). SEs from modeling that ignores this structure may be misleadingly small.
   - In large homogeneous datasets, spurious effects are a risk: small SEs increase the risk of detecting spurious effects that arise, e.g., from sampling bias (likely in observational data) and/or from measurement error.
   - Many variables (features): huge numbers of variables spread information thinly! This is a challenge to the analyst.
21. The Hazards of High-Dimensional Data
   - Select, or summarize? For selection, beware of selection effects – with enough lottery tickets, some kind of prize is pretty much inevitable. A pattern based on the 'best' 15 features, out of 500, may well be meaningless! Summary measures may involve selection.
   - Beware of over-fitting: over-fitting reduces real accuracy. Preferably, use an algorithm that does not overfit with respect to the source population. Unless optimized with respect to the target, some over-fitting may be inevitable!
   - Any algorithm can be misused to overfit! (even those that do not overfit!)
22. Why Plot the Data?
   - Which are the difficult points? Some points may be mislabeled (faulty medical diagnosis?). Improvement of classification accuracy is a useful goal only if misclassified points are in principle classifiable.
   - What if points are not well represented in 2-D or 3-D? One alternative is to identify points that are outliers on a posterior odds (of group membership) scale.
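The posterior-odds idea can be sketched as follows. Given each point's estimated probability for its assigned group (e.g., from LDA or a random forest), flag points whose odds fall below a cutoff. The 1:9 cutoff here is an arbitrary illustration, not a recommendation from the talk:

```python
import math

def log_posterior_odds(p_assigned):
    """Log odds of the assigned group versus all the others."""
    return math.log(p_assigned / (1.0 - p_assigned))

def odds_outliers(prob_assigned, cutoff=math.log(1 / 9)):
    """Indices of points whose assigned-group posterior odds are worse
    than the cutoff (here, worse than 1:9) -- candidate mislabeled or
    poorly represented points."""
    return [i for i, p in enumerate(prob_assigned)
            if log_posterior_odds(p) < cutoff]
```

Unlike a 2-D projection, this flags doubtful points even when the group structure does not project well into two or three dimensions.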
23. New Technology Has Many Attractions
   Are you by now convinced that: data analysis is dapper; data mining is delightful; analytics is amazing; regression is rewarding; trees are tremendous; forests are fascinating; statistics is stupendous? Some or all of these?
   Even with the best modern software, however, it is hard work to do data analysis well.
24. The Science of Data Mining
   - Getting the science right is more important than finding the true and only best algorithm! (There is none!)
   - Get to grips with the statistical issues. Know the target (1987 glass is not 2008 glass). Understand the traps (too many to talk about here²). Grasp the basic ideas of time series, multi-level models, and comparisons based on profiles over time.
   - Use the technology critically and with care! Do it, where possible, with graphs.

   ² Maindonald, J.H. (2006) notes a number of common traps, with extensive references. Berk (2008) has excellent advice.
25. References
   - Berk, R. 2008. Statistical Learning from a Regression Perspective. [Berk's insightful commentary injects needed reality checks into the discussion of data mining and statistical learning.]
   - Maindonald, J.H. 2006. Data Mining Methodological Weaknesses and Suggested Fixes. Proceedings of the Australasian Data Mining Conference (AusDM06).
   - Maindonald, J.H. and Braun, W.J. 2007. Data Analysis and Graphics Using R – An Example-Based Approach. 2nd edn, Cambridge University Press. [Statistics, with hints of data mining!]
   - Rosenbaum, P.R. 2002. Observational Studies. 2nd edn, Springer. [Observational data from an experimental perspective.]
   - Wood, S.N. 2006. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC. [This has an elegant treatment of linear models and generalized linear models, as a lead-in to methods for fitting generalized additive models.]
26. Web Sites
   - http://www.sigkdd.org/ [Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining.]
   - http://www.amstat.org/profession/index.cfm?fuseaction=dataminingfaq [Comments on many aspects of data mining.]
   - http://www.cs.ucr.edu/~eamonn/TSDMA/ [UCR Time Series Data Mining Archive.]
   - http://kdd.ics.uci.edu/ [UCI KDD Archive.]
   - http://en.wikipedia.org/wiki/Data_mining [This (Dec 12 2008) has useful links, but lacks sharp critical commentary. It emphasizes commercial data mining tools.]
   - The R package mlbench has "a collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository."