What’s New in Data Mining?

Padhraic Smyth
Information and Computer Science
University of California, Irvine

Document Clustering
•   Techniques
     – model-based mixture clustering
     – mixtures of multinomials
     – mixtures o...
Example of a Document-Term Matrix

[Figure: document-term matrix, roughly 500 documents by 200 terms]

Most Likely Terms in Component 5 (weight = 0.08)

   TERM       p(t|k)
   wri…

Most Likely Terms in Component 1 (weight = 0.11)

   TERM       p(t|k)
   ar…

Pixel Representation of Mixture Components

[Figure: component models rendered as pixel images]

Example: Web Log Mining

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /...
Clusters of Dynamic Behavior

[Figure: state-transition diagrams for Cluster 1 and Cluster 2]

WebCanvas: Cadez, Heckerman, et al., KDD 2000
Application: Product Purchasing Data

[Figure: sparse transactions × products matrix]

Application: Recommender Systems

[Figure: sparse customers × products matrix]

Approaches to Recommender Systems

•   Collaborative Filtering
     – “infer your interests from people with similar behav...
Final Comments
•   Successful data mining requires integration/understanding of
     – statistics
     – computer science
...
Pointers
•   Papers:
     – www.ics.uci.edu/~datalab
     – e.g., “Data mining: data analysis on a grand scale?”, P. Smyth...
What’s New in Data Mining?

Padhraic Smyth
Information and Computer Science
University of California, Irvine

Invited Talk at the NonParametrics/Data Mining Workshop, SMU, Dallas
© Padhraic Smyth, December 2000

Outline of Talk

• What is Data Mining?
• Computer Science and Statistics: the Interface
• Hot Topics in Data Mining
• Conclusions

Technological Driving Factors

• Larger, cheaper memory
   – Moore’s law for magnetic disk density: “capacity doubles every 18 months” (Jim Gray, Microsoft)
   – storage cost per byte falling rapidly
• Faster, cheaper processors
   – the CRAY of 10 years ago is now on your desk
• Success of relational database technology
   – everybody is a “data owner”
• Flexible modeling paradigms
   – GLMs, trees, etc.
   – computationally intensive modeling, massive search

The Emergence of Data Mining

• Distinct threads of evolution
   – AI/machine learning
      • 1989 KDD workshop -> ACM SIGKDD 2000
      • focus on “automated discovery, novelty”
   – Database research
      • focus on massive data sets (since 1995)
      • e.g., ACM SIGMOD -> association rules, scalable algorithms
   – “Data owners”
      • what can we do with all this data in commercial databases?
      • primarily customer-oriented transaction data
      • industry-dominated, applications-oriented

The Emergence of Data Mining

• The “Mother-in-Law phenomenon”
   – even your mother-in-law has heard about data mining
   – people are hoping they can do data analysis without the “nuisance factor” of statistics
• Beware of the hype!
   – remember expert systems, neural nets, etc.
   – basically sound ideas that were oversold, creating a backlash

What is data mining?

“the art of fishing over alternative models…”
      – M. C. Lovell, The Review of Economics and Statistics, February 1983

“Data-driven discovery of models and patterns from massive observational data sets”

“The magic phrase to put in every funding proposal you write to NSF, DARPA, NASA, etc.”

“The magic phrase you use to sell your…
   – database software
   – statistical analysis software
   – parallel computing hardware
   – consulting services”

What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

   – Statistics, Inference
   – Languages and Representations
   – Engineering, Data Management
   – Retrospective Analysis

Who is involved in Data Mining?

• Business applications
   – customer-oriented, transaction-oriented applications
   – very specific applications in fraud, ecommerce, credit scoring
      • in-house applications (e.g., AT&T, Microsoft, Amazon, etc.)
      • consulting firms: considerable hype factor!
   – largely involve the application of existing statistical ideas, scaled up to massive data sets (“engineering”)
• Academic researchers
   – mainly in computer science
   – extensions of existing ideas, significant “bandwagon effect”
   – database-oriented: “what can we compute quickly?”
• Bottom line
   – primarily computer scientists, often with little knowledge of statistics; main focus is on algorithms

Current Data Mining Software Toolkits

1. General-purpose tools
   – software systems for data mining (IBM, SGI, etc.)
      • just simple statistical algorithms with SQL?
      • limited support for statistical inference, temporal, spatial data
      • also: “born-again” statistical software packages
   – some successes (difficult to validate)
      • banking, marketing, retail
      • mainly useful for large-scale EDA?
   – “mining the miners” (Jerry Friedman):
      • similar to expert systems/neural networks hype in the ’80s?

Transaction Data and Association Rules

[Figure: sparse transactions × items matrix]

• Supermarket example (Srikant and Agrawal, 1997)
   – #items = 500,000, #transactions = 1.5 million

Transaction Data and Association Rules

• Example of an association rule:
      If a customer buys beer they will also buy chips
   – p(chips | beer) = “confidence”
   – p(beer) = “support”
• Algorithm: basically a fast way to compute correlations

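The support and confidence quantities above can be computed directly from a transaction list. A minimal sketch, on toy data with made-up item names (real algorithms such as Apriori avoid rescanning the data for every candidate rule):

```python
# Toy transaction list; the items and values are illustrative only.
transactions = [
    {"beer", "chips", "milk"},
    {"beer", "chips"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"beer", "chips", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate p(consequent | antecedent): support of both over support of antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)
```

On this toy data, support({beer}) is 0.8 and the confidence of beer -> chips is 0.75.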
Current Data Mining Software

2. Special-purpose (“niche”) applications
   – fraud detection, ecommerce profiling, credit scoring, etc.
   – often solve high-dimensional classification/regression problems
   – fraud detection: telecom (AT&T), credit cards (HNC)
   – profiling -> advertising
      • profile: “histogram” of products/terms
      • Engage: database of 70 million internet user profiles
   – common theme: “track the customer!”
   – difficult to validate claims of success (few publications)

General Characteristics of Data Mining Applications

• Emphasis on predictive modeling
   – scoring, classification, detection
• Massive data sets
   – significant “data engineering” component
   – variable selection, “feature definition”
   – offline: computational issues in model fitting
   – online: real-time response (e.g., e-commerce)
• “Scaling up” traditional ideas
   – e.g., wide use of CART (decision trees)
   – often modified to handle large-scale issues

Myths and Legends in Data Mining

• “Data analysis can be fully automated”
   – human judgement is critical in almost all applications
   – “semi-automation” is, however, very useful
• “Association rules are useful”
   – association rules are essentially lists of correlations
   – none or few documented successful applications
   – compare with decision trees (numerous applications)
• “With massive data sets you don’t need statistics”
   – massiveness can bring more heterogeneity and noise
      • even more statistics!

Outline

• What is Data Mining?
• Computer Science and Statistics: the Interface

Historical Perspective

[Timeline diagram, 1950-2000, Statistics on the left and Computer Science/Engineering on the right, built up incrementally across several slides: Statistical Pattern Recognition and AI (1950s-60s); EDA (Statistics); Trees (Statistics) and ML: Trees/Rules (CS, 1980s); MARS (Statistics) and Neural Networks (CS, ~1990); Flexible Predictors (~2000); KDD (CS, 1990s); DB (CS, 1970s) and OLAP (CS); and finally Data Mining (~2000)]

Observations

• Significant synergy/convergence of CS and Statistics emerged from neural networks
   – flexible prediction models = “super offspring”
   – role of NIPS, Snowbird meetings, etc.
• Data Mining/KDD is still back where neural nets were 10 years ago
   – DM: “our stuff is cool and we don’t really need statistics, do we?”
   – Statistics: “what are these guys talking about, and why don’t they know some basic statistics?”
   – Nonetheless… the DM folks have some very interesting applications and some interesting approaches

Where Work is Published

• Statistics
   – Statistical Inference: JASA, JRSS
   – Statistical Pattern Recognition: IEEE PAMI, ICPR, ICCV
• Computer Science
   – Neural Networks: NIPS, Neural Comp.
   – Machine Learning: ICML, COLT, ML Journal, UAI, www.jmlr.org
   – Data Mining: KDD, IJDMKD
   – Databases: SIGMOD, VLDB

The Predictive Modeling Cycle

[Diagram: a cycle linking Modeling/Inference, Computation/Algorithms, and Evaluation/Interpretation]

The Computer Scientist’s View

[The same cycle, with Computation/Algorithms looming largest]

A Statistician’s View

[The same cycle, with Modeling/Inference looming largest]

The Customer’s View

[The same cycle, with Evaluation/Interpretation looming largest]

Educational Differences

• Computer scientists:
   – undergraduate exposure to statistics
      • cookbook hypothesis tests
   – little or no exposure to mathematical modeling
   – good at algorithms, data structures
• Statisticians:
   – undergraduate exposure to CS
      • how to write Fortran code
   – little or no exposure to data structures/algorithms
   – how to learn the “art” of data analysis?
• Bottom line
   – need a new breed of “data engineers”
   – note: easier to go from statistics to CS than vice versa

Cultural Differences

• Computer scientists:
   – little exposure to the “modeling art” of data analysis
   – stick to a small set of well-understood models and problems
   – “close to the data”: they often have ready access to data
   – business-oriented culture
• Statisticians:
   – applied statisticians often very good at the “art” component
   – little experience with the data management/engineering part
   – papers focus on inference/models, not algorithms
   – science-oriented culture
• Bottom line
   – computer scientists get more attention since they are much more marketing-savvy (less worried about objectivity) than statisticians

[Diagram: triangle linking Modeling, Computation, and Evaluation]

[Diagram: components of a data mining method, organized around Modeling: Data Set, Task, Representation, Objective Function, Optimization Algorithm, Data Access, Evaluation and Deployment]

Example: CART

   – Data set: multivariate (flat file)
   – Task: prediction
   – Representation: hierarchical, piecewise-constant mapping
   – Objective function: cross-validation
   – Search: greedy
   – Evaluation: accuracy and interpretability
   – Emphasis on predictive power and flexibility of the model

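The “greedy search” row above is the heart of CART. A hedged sketch of one node’s split search, under simplifying assumptions (binary 0/1 labels, a single numeric feature, Gini impurity; the function names are illustrative, and a real tree would recurse on the two sides):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of class 1
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Greedily scan all thresholds on one feature; return (threshold, score)
    minimizing the weighted Gini impurity of the two sides."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best
```

For perfectly separable data such as xs = [1, 2, 3, 4], ys = [0, 0, 1, 1], the search finds the threshold 2 with impurity 0.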
Example: Association Rules

   – Data set: transaction data (linear data scans)
   – Task: exploratory
   – Representation: sets of local rules / conditional probabilities
   – Objective function: thresholds on p
   – Search: systematic
   – Evaluation: ????
   – Emphasis on computational efficiency and data access

The Reductionist Viewpoint

• General framework for modeling
   – reduce problems to fundamental components
   – think in terms of
      • application first
      • modeling second
      • algorithm third
   – ultimately the application should “drive” the algorithm
   – allows systematic comparison and synthesis
      • for work on synthesis, see Buntine et al., KDD 99
   – clarifies the relative roles of statistics, databases, search, etc.
   – see Hand, Mannila, and Smyth, MIT Press, May(?) 2001

Implications

• The “renaissance data miner” is skilled in:
   – statistics: theories and principles of inference
   – modeling: languages and representations for data
   – optimization and search
   – algorithm design and data management
• The educational problem
   – is it necessary to know all these areas in depth?
   – is it possible?
   – do we need a new breed of professionals?
• The applications viewpoint
   – how does a scientist or business person keep up with all these developments?
   – how can they choose the best approach for their problem?

Outline

• What is Data Mining?
• Computer Science and Statistics: the Interface
• Hot Topics in Data Mining

Subspecies of Data Miners

• SIGMOD/VLDB conferences
   – database issues: querying, efficiency; no modeling
   – fast querying/association rule algorithms
• SIGKDD conferences
   – algorithm focus: scaling machine learning/stats methods
   – rule-finding algorithms
• Machine Learning conference
   – algorithmic focus
   – decision trees, reinforcement learning
• NIPS
   – originally neural networks, but now mathematical/probabilistic learning; heavy statistical influence
   – SVMs, boosting, Gaussian processes, latent variable models
• ICPR (Pattern Recognition), SIGIR, etc.
   – speech, images, classifiers, etc.: engineering applications

Hot Topics, New Directions from Computer Science

• Flexible predictive modeling
   – neural networks, boosting, SVMs
• Engineering of scale
   – scaling up statistical methods to new large-scale applications
• Hidden/latent variable models
   – wide-scale application of EM, e.g., HMMs for speech
• Pattern finding
   – associations, rules, bumps: “non-global” patterns
• Heterogeneous data
   – modeling structured data, e.g., the Web, multimedia (video/audio)

Flexible Predictive Modeling

• Model combining:
   – Stacking
      • linear combinations of models with cross-validated weights
   – Bagging
      • equally weighted models from bootstrap samples
   – Boosting
      • iterative re-training on data points in error
• Flexible model forms
   – decision trees, neural networks, support vector machines
• Common theme:
   – many of these ideas were popularized in computer science
   – later “legitimized” by statisticians (e.g., by Breiman, Friedman)

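Of the combining schemes listed, bagging is the simplest to sketch. In this toy version the “model” fit to each bootstrap sample is deliberately trivial (a sample mean) so the bootstrap-and-average structure stands out; the names and parameters are illustrative, not a production ensemble:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data, with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models=50, seed=0):
    """Fit one trivial 'model' (a mean) per bootstrap sample,
    then combine the fits with equal weights, as in bagging."""
    rng = random.Random(seed)
    fits = [sum(s) / len(s)
            for s in (bootstrap_sample(data, rng) for _ in range(n_models))]
    return sum(fits) / len(fits)
```

Swapping the per-sample mean for a tree learner and the average for a vote gives the usual bagged-trees picture.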
Example of a Document-Term Matrix

[Figure: document-term matrix, roughly 500 documents by 200 terms]

Application: Flexible Classification Models for Text

• The Web represents a huge data set of text documents
   – problem: classification of Web pages into “topic categories”
   – e.g., automated creation of topic hierarchies for Yahoo
   – automated crawlers for information gathering
• Technical challenges
   – standard representation of a Web page?
      • typically use “list of term vectors”
      • very high-dimensional information
   – other information: images, page structure, etc.
• Current activity
   – much data mining research on document classification:
      • Web page -> high-dimensional term vector -> flexible classifier
   – commercial companies: Whizbang, Autonomy, IBM, etc.

2. Scale: How far away are the data?

[Diagram: CPU, RAM, Disk]

• Random access times: RAM ~10^-8 seconds, disk ~10^-3 seconds
• Effective distances (scaled so RAM = 1 meter): disk = 100 km

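The “effective distance” numbers follow directly from scaling both access times so that one RAM reference corresponds to one meter of travel:

```python
# Scale access times so one RAM reference = 1 meter of travel;
# the disk's ~10^-3 s access then maps to about 100 km.
ram_seconds, disk_seconds = 1e-8, 1e-3
ram_meters = 1.0
disk_meters = ram_meters * (disk_seconds / ram_seconds)
```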
Basic Idea

[Diagram: Massive Database (“slow” memory) -> Approximate Model of Data (“fast”/cached memory) -> Human or Algorithm]

Comments:
1. Even if the data fit in main memory, there are many advantages to clever data structures (e.g., see Andrew Moore’s talk).
2. Particularly relevant for massive streams of transaction data, e.g., telephone data (see Diane Lambert’s talk).

2. Scalable Algorithms

• “Scaling down the data”, or data approximation
   – work from clever data summarizations (e.g., sufficient statistics)
   – e.g., “data squashing” (DuMouchel et al., AT&T, KDD ’99)
      • create a small “pseudo data set”
      • similar statistical properties to the original (massive) data set
      • now run your standard algorithm on the pseudo-data
      • interesting theoretical (statistical) basis
• “Scaling up the algorithm”
   – data structures/caching strategies to speed up known algorithms
      • ADTrees, etc., from Andrew Moore (CMU)
      • scalable decision trees (Johannes Gehrke, Cornell)
   – can get orders-of-magnitude speed improvements

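A minimal illustration of the sufficient-statistics idea (not the data-squashing algorithm itself): three running numbers summarize an arbitrarily long stream, after which mean and variance queries never need to touch the raw data again.

```python
class RunningStats:
    """Summarize a stream in O(1) memory via its sufficient statistics
    (count, sum, sum of squares) for mean and variance queries."""
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def variance(self):
        m = self.mean()
        return self.total_sq / self.n - m * m

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
```

The same pattern generalizes to counts, cross-products, and histograms for other estimators.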
3. Pattern Finding

• Patterns = unusual, hard-to-find local “pockets” of data
   – finding patterns is not the same as global model fitting
   – the simplest example of patterns are association rules
   – much other work on rule finding in data mining/AI
   – other applications:
      • motif finding in protein sequences
      • unusual objects in sky-survey data
• “Bump-hunting”
   – PRIM algorithm of Friedman and Fisher (1999)
   – finds multivariate “boxes” in high-dimensional spaces where the mean of the target variable is higher
   – trades off “support” against “mean height”
   – effective and flexible
      • e.g., finding small, highly profitable groups of customers

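A toy, one-dimensional rendering of PRIM’s peeling step (the real algorithm works on multivariate boxes and also “pastes” edges back on; the peel fraction `alpha` and the function names here are illustrative):

```python
def peel_1d(xs, ys, alpha=0.25, min_support=2):
    """Shrink a 1-D box on x, repeatedly trimming the edge whose removal
    most raises the mean of y inside the box: support traded for height.
    Returns the (lo, hi) x-values of the final box."""
    pairs = sorted(zip(xs, ys))
    lo, hi = 0, len(pairs)

    def box_mean(a, b):
        return sum(y for _, y in pairs[a:b]) / (b - a)

    while hi - lo > min_support:
        k = max(1, int(alpha * (hi - lo)))
        peel_left = box_mean(lo + k, hi)    # mean after trimming the left edge
        peel_right = box_mean(lo, hi - k)   # mean after trimming the right edge
        if max(peel_left, peel_right) <= box_mean(lo, hi):
            break  # no peel raises the in-box mean: stop
        if peel_left >= peel_right:
            lo += k
        else:
            hi -= k
    return pairs[lo][0], pairs[hi - 1][0]
```

On data whose target is high only for x >= 4, the box shrinks onto that region.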
“Bump-Hunting”

[Figures only, across six slides]

Pattern Finding (ctd.)

• Contrast sets (Bay and Pazzani, KDD ’99)
   – individuals or objects categorized into two groups
      • e.g., students enrolled in CS and in Engineering
   – high-dimensional multivariate measurements on each
   – automatically produces a summary of significant differences between groups
   – combines massive search with statistical estimation
• Time-series pattern spotting
   – “find me a shape that looks like this”
   – semi-Markov deformable templates (Ge and Smyth, KDD 2000)
   – significantly outperforms template matching and DTW
   – Bayesian approach integrates prior knowledge with data

72. Example: Deformable Templates
•   Segmental hidden semi-Markov model (Ge and Smyth, KDD 2000)
•   Each waveform segment corresponds to a state in the model
[figure: waveform segments mapped to states S1, S2, …, ST]
73. End-Point Detection in Semiconductor Manufacturing
[figure: pattern-based end-point detection — the original pattern and the detected pattern plotted against time (seconds)]
74. Heterogeneous Data Modeling
•   Clustering objects (sequences, curves, etc.)
     – probabilistic approach: define a mixture of models (Cadez, Gaffney, and Smyth, KDD 2000)
     – unified framework for clustering objects of different dimensions
     – applications:
          • curve clustering
               – e.g., mixtures of regression models (Gaffney and Smyth, KDD ‘99)
               – video movement, gene-expression data, storm trajectories
          • sequence clustering
               – e.g., mixtures of Markov models
               – clustering of MSNBC Web data (Cadez et al., KDD ‘00)
75. [figures: trajectories of centroids of a moving hand in video streams (x-position vs. time), with estimated cluster trajectories (x- and y-position vs. time)]
76. 4. (Un)Structured Data (e.g., Text, Web)
•   Applications
     – classification of text documents
          • automatic classification of emails as junk/non-junk
          • automatic creation of taxonomies for Web-page portals such as Yahoo!
     – discovery of authoritative documents
          • search engines (Yahoo!)
          • citation rankings
•   Techniques
     – “vector-space” model
     – adaptations of simple classification and clustering algorithms
     – graph-based techniques
•   Challenges for statistics
     – scale of the problem: huge documents, huge Web
     – structure and semantics of documents and the Web
77. Document Clustering
•   Techniques
     – model-based mixture clustering
     – mixtures of multinomials
     – mixtures of conditional-independence models
•   Connections to statistics
     – probabilistic models are well known (latent class models)
     – EM algorithm for training
•   Differences from statistics
     – scale and nature of applications
     – e.g., Hofmann (Brown U.), probabilistic latent semantic analysis
     – e.g., Lafferty (CMU), maxent models for text prediction
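The latent-class/EM connection can be made concrete with a mixture of multinomials fit by EM. A minimal sketch follows, assuming a toy six-term vocabulary invented to echo the hardware and religion clusters on the neighboring slides; real document clustering adds careful initialization, convergence checks, and much larger vocabularies.

```python
import math
import random

def em_multinomial_mixture(docs, vocab_size, k, n_iter=50, seed=0):
    """EM for a mixture of multinomials (a latent class model).
    docs: list of term-count vectors. Returns mixing weights,
    per-component term probabilities, and per-doc responsibilities."""
    rng = random.Random(seed)
    weights = [1.0 / k] * k
    # random smoothed initialization of p(term | component)
    theta = []
    for _ in range(k):
        row = [rng.random() + 0.5 for _ in range(vocab_size)]
        s = sum(row)
        theta.append([p / s for p in row])
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each document
        resp = []
        for doc in docs:
            logs = [math.log(max(weights[j], 1e-300))
                    + sum(c * math.log(theta[j][t]) for t, c in enumerate(doc) if c)
                    for j in range(k)]
            m = max(logs)
            ps = [math.exp(l - m) for l in logs]
            s = sum(ps)
            resp.append([p / s for p in ps])
        # M-step: re-estimate weights and term probabilities (smoothed)
        weights = [sum(r[j] for r in resp) / len(docs) for j in range(k)]
        for j in range(k):
            counts = [1e-3] * vocab_size
            for doc, r in zip(docs, resp):
                for t, c in enumerate(doc):
                    counts[t] += r[j] * c
            tot = sum(counts)
            theta[j] = [c / tot for c in counts]
    return weights, theta, resp

# toy vocabulary: [drive, scsi, disk, god, faith, bibl]
hw_docs = [[3, 2, 2, 0, 0, 0]] * 20
rel_docs = [[0, 0, 0, 3, 2, 2]] * 20
w, theta, resp = em_multinomial_mixture(hw_docs + rel_docs, vocab_size=6, k=2)
```

After fitting, the two document types end up assigned to different components, and each component’s `theta` row is exactly the p(t|k) column shown on the example-cluster slides.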
78. [figure: example of a document-term matrix, documents by terms]
79. Example of a Document Cluster
Most likely terms in component 5 (weight = 0.08):
    TERM      p(t|k)
    write     0.571
    drive     0.465
    problem   0.369
    mail      0.364
    articl    0.332
    hard      0.323
    work      0.319
    system    0.303
    good      0.296
    time      0.273
Highest-lift terms in component 5 (weight = 0.08):
    TERM      LIFT   p(t|k)   p(t)
    scsi      7.7    0.13     0.02
    drive     5.7    0.47     0.08
    hard      4.9    0.32     0.07
    card      4.2    0.23     0.06
    format    4.0    0.12     0.03
    softwar   3.8    0.21     0.05
    memori    3.6    0.14     0.04
    install   3.6    0.14     0.04
    disk      3.5    0.12     0.03
    engin     3.3    0.21     0.06
80. Example of a Document Cluster
Most likely terms in component 1 (weight = 0.11):
    TERM       p(t|k)
    articl     0.684
    good       0.368
    dai        0.363
    fact       0.322
    god        0.320
    claim      0.294
    apr        0.279
    fbi        0.256
    christian  0.256
    group      0.239
Highest-lift terms in component 1 (weight = 0.11):
    TERM       LIFT   p(t|k)   p(t)
    fbi        8.3    0.26     0.03
    jesu       5.5    0.16     0.03
    fire       5.2    0.20     0.04
    christian  4.9    0.26     0.05
    evid       4.8    0.24     0.05
    god        4.6    0.32     0.07
    gun        4.2    0.17     0.04
    faith      4.2    0.12     0.03
    kill       3.8    0.22     0.06
    bibl       3.7    0.11     0.03
81. Pixel Representation of Mixture Components
[figure: the 10 component models displayed as rows over 100 terms]
82. Example: Web Log Mining
Raw server log (one request per line):
    128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
    128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
    128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
    128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
    128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
    128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
    128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
    128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
    128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
    128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
    128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
    128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
    128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
    128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
    128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
    128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
    128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
Derived per-user page sequences:
    User 1: 2 3 2 2 3 3 3 1 1 1 3 1 3 3 3 3
    User 2: 3 3 3 1 1 1
    User 3: 7 7 7 7 7 7 7 7
    User 4: 1 5 1 1 1 5 1 5 1 1 1 1 1 1
    User 5: 5 1 1 5 …
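The preprocessing step implied on this slide, from raw log to per-user page sequences, can be sketched in a few lines. The column positions are assumed from the sample log (client IP in the first field, requested page in the fourteenth), and page IDs are assigned in order of first appearance, so they need not match the slide’s numbering.

```python
from collections import defaultdict

def sessions_from_log(lines):
    """Group an IIS-style comma-separated web log into per-user page
    sequences: one list of integer page IDs per client IP."""
    page_ids = {}
    seqs = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        ip, path = fields[0], fields[13]  # column positions as in the sample log
        pid = page_ids.setdefault(path, len(page_ids) + 1)
        seqs[ip].append(pid)
    return dict(seqs)

log = [
    "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
    "128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,",
]
sessions = sessions_from_log(log)
# two users: the first requested pages 1 then 2, the second page 1 only
print(sessions)
```

Real log preprocessing would also split one IP’s requests into time-bounded sessions and filter image requests; this sketch only shows the sequence representation that the clustering slides build on.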
83. Clusters of Dynamic Behavior
[figure: three clusters of page-transition behavior over states A, B, C, D]
84. [figure]
85. WebCanvas (Cadez, Heckerman, et al., KDD 2000) [figure]
86. Application: Product Purchasing Data
[figure: transactions-by-product-categories matrix]
87. Application: Recommender Systems
[figure: sparse transactions-by-products purchase matrix; the new customer’s row has a few known purchases (x) and many unknown entries (?)]
88. Application: Recommender Systems
[figure: same purchase matrix as the previous slide]
     – high-dimensional inference/prediction problem
     – sparse data
     – recommendations must be made in real time!
89. Approaches to Recommender Systems
•   Collaborative filtering
     – “infer your interests from people with similar behavior”
     – essentially a nearest-neighbor algorithm
     – considerable commercial interest (e.g., NetPerceptions, Firefly)
     – scalability problems
•   Model-based recommenders
     – model the joint distribution of the products explicitly
     – example:
          • dependency networks from Microsoft Research
          • decision-tree models / MRFs
          • extremely fast; shipping in Microsoft products
          • Heckerman et al. (2000), Journal of Machine Learning Research, www.jmlr.org
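The nearest-neighbor flavor of collaborative filtering can be sketched in a few lines. The 0/1 purchase matrix is a toy invention, and cosine similarity is one common (assumed) choice of similarity; the scalability problem noted above is visible here, since every recommendation scans all past transactions.

```python
import math

def recommend(transactions, target, k=2, n_rec=2):
    """Score the new customer's unseen products by the purchases of the
    k most similar past transactions (cosine similarity on 0/1 vectors)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    neighbors = sorted(transactions, key=lambda t: cosine(t, target),
                       reverse=True)[:k]
    # vote only on products the new customer has not bought yet
    scores = {j: sum(t[j] for t in neighbors)
              for j in range(len(target)) if target[j] == 0}
    return sorted(scores, key=scores.get, reverse=True)[:n_rec]

# toy 0/1 transactions-by-products matrix
past = [
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
new_customer = [1, 1, 0, 0, 0]
recs = recommend(past, new_customer)
```

The two nearest neighbors share the new customer’s first two purchases, so their other products (indices 2 and 4) are recommended; product 3, bought only by a dissimilar customer, is not.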
90. Final Comments
•   Successful data mining requires integration/understanding of
     – statistics
     – computer science
     – the application discipline
•   Current practice of data mining
     – computer scientists focused on business applications
     – relatively little statistical sophistication, but some new ideas
     – considerable “hype” factor
•   Opportunities for statisticians
     – new problems: e.g., statistical scalability
     – new applications: e.g., inference from Web and text data
     – a ready audience for statistical techniques
          • need better marketing!
91. Pointers
•   Papers:
     – www.ics.uci.edu/~datalab
     – e.g., “Data mining: data analysis on a grand scale?”, P. Smyth (2000), Statistical Methods in Medical Research
•   Web resources:
     – www.kdnuggets.com
•   Interface ‘01
     – data mining and bioinformatics themes
     – June 13-16, 2001, Costa Mesa, CA
•   Text (forthcoming)
     – Principles of Data Mining
          • D. J. Hand, H. Mannila, P. Smyth
          • MIT Press, May 2001?