(Talk in Powerpoint Format)



  1. 1. Data Mining. Predictive: classification, regression, time series analysis, prediction. Descriptive: clustering, association rules, summarization, sequence discovery. AI, machine learning, neural networks, deductive databases
  2. 2. <ul><li>Detecting regularities in data (bird flu cases) </li></ul><ul><li>Detecting rare occurrences, rare events </li></ul><ul><li>Finding “causal” relationships </li></ul>Discovering useful information in large data sets
  3. 3. Opportunities <ul><li>Collecting vast amounts of data has become possible. </li></ul><ul><li>Ex1: Astronomy: petabytes of information are collected </li></ul><ul><li>Laboratory for Cosmological Data Mining (LCDM) </li></ul><ul><li>1 petabyte (PB) = 2^50 bytes </li></ul><ul><li> = 1,125,899,906,842,624 bytes. </li></ul><ul><li>1 petabyte = 1,024 terabytes </li></ul><ul><li>1 terabyte (TB) = 1,024 gigabytes </li></ul><ul><li>=> The armchair astronomer </li></ul>
  4. 4. <ul><li>Ex2: Biology: huge sequences of nucleotides have been collected. (The human genome contains more than 3.2 billion base pairs and more than 30 000 genes). </li></ul><ul><li>http://www.genomesonline.org </li></ul><ul><li>Very little of that has been interpreted yet. </li></ul>
  5. 5. <ul><li>Ex: Physics, Geography, weather data, … </li></ul><ul><li>Business, … </li></ul>Data in many forms <ul><li>numerical </li></ul><ul><ul><li>discrete </li></ul></ul><ul><ul><li>continuous </li></ul></ul><ul><li>categorical </li></ul><ul><li>raw data </li></ul><ul><li>cleaned data </li></ul><ul><li>complete records </li></ul><ul><li>Incomplete records (missing data) </li></ul><ul><li>formatted data </li></ul><ul><li>unformatted data </li></ul>
  6. 6. Tasks <ul><li>Fit data to model </li></ul><ul><ul><li>Descriptive </li></ul></ul><ul><ul><li>Predictive </li></ul></ul><ul><li>Finding the “best” model ??? </li></ul><ul><ul><li>Beware of model overfitting! </li></ul></ul><ul><li>Interpreting results </li></ul><ul><li>Evaluating models (ex: lift charts) </li></ul><ul><li>=> Usually a lot of going back and forth between model(s) and data </li></ul>
  7. 7. Another complementary tack: Interactive visual data exploration <ul><li>Remarkable properties of the human visual system. (ex: analysis of a pseudo random number generator) </li></ul><ul><li>Various visual representation schemes </li></ul><ul><ul><li>Simultaneous viewing </li></ul></ul><ul><ul><li>(fast) sequential viewing </li></ul></ul><ul><li>Animating data (dynamic queries) </li></ul>Other possibilities: converting data to sounds, etc.
  8. 8. Two broad approaches to Learning <ul><li>Supervised learning </li></ul><ul><li>ex: want to discover a model to help classify stars, based on emission spectra. </li></ul><ul><li>In the “training set” the correct classification of the stars is known. </li></ul><ul><li>The resulting model is used to predict the class of a new star (not in the training set) </li></ul><ul><li>Unsupervised learning </li></ul><ul><li>ex: want to group a set of stars into a small number of sufficiently homogeneous sub-groups of stars </li></ul>
  9. 9. Many techniques Fast evolving field <ul><li>Statistical </li></ul><ul><ul><li>Descriptive stats, graphics, .. </li></ul></ul><ul><ul><li>Regression analysis </li></ul></ul><ul><ul><li>Principal components analysis </li></ul></ul><ul><ul><li>Time series analysis </li></ul></ul><ul><ul><li>Cluster analysis (use of a distance measure) </li></ul></ul><ul><ul><li>Naïve Bayes classifiers </li></ul></ul><ul><li>Artificial intelligence </li></ul><ul><ul><li>Rule induction (Machine Learning) </li></ul></ul><ul><ul><li>Various inference techniques (various logics, deductive databases,…) </li></ul></ul>
  10. 10. <ul><ul><li>Pattern matching (speech recognition) </li></ul></ul><ul><ul><li>Neural networks (many approaches) </li></ul></ul><ul><ul><li>Genetic algorithms </li></ul></ul><ul><ul><li>Bayesian networks (probably the best approach to model complex causal structures) </li></ul></ul><ul><li>Information retrieval </li></ul><ul><ul><li>Many specialized models (vector model,…) </li></ul></ul><ul><ul><li>Concepts of Precision and Recall </li></ul></ul><ul><li>Many ad hoc techniques </li></ul><ul><ul><li>Co-occurrence analysis </li></ul></ul><ul><ul><li>MK generality analysis </li></ul></ul><ul><ul><li>Association analysis </li></ul></ul>
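The precision and recall concepts listed under information retrieval can be illustrated with a minimal sketch. The function name and the document ids below are hypothetical and chosen only for illustration; precision is taken as the fraction of retrieved documents that are relevant, recall as the fraction of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents returned, 5 documents relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Returning more documents tends to raise recall and lower precision, which is why the two figures are usually reported together.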
  11. 11. One famous technique Ross Quinlan’s ID3 algorithm
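  12. 12. The weather data

      | Object | Outlook  | Temperature | Humidity | Windy | Class |
      |--------|----------|-------------|----------|-------|-------|
      | 1      | sunny    | hot         | high     | FALSE | N     |
      | 2      | sunny    | hot         | high     | TRUE  | N     |
      | 3      | overcast | hot         | high     | FALSE | P     |
      | 4      | rain     | mild        | high     | FALSE | P     |
      | 5      | rain     | cool        | normal   | FALSE | P     |
      | 6      | rain     | cool        | normal   | TRUE  | N     |
      | 7      | overcast | cool        | normal   | TRUE  | P     |
      | 8      | sunny    | mild        | high     | FALSE | N     |
      | 9      | sunny    | cool        | normal   | FALSE | P     |
      | 10     | rain     | mild        | normal   | FALSE | P     |
      | 11     | sunny    | mild        | normal   | TRUE  | P     |
      | 12     | overcast | mild        | high     | TRUE  | P     |
      | 13     | overcast | hot         | normal   | FALSE | P     |
      | 14     | rain     | mild        | high     | TRUE  | N     |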
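As a rough illustration of how ID3 would pick the root attribute for the weather data, the sketch below computes the entropy of the class and the information gain of each attribute. It covers only the attribute-selection step, not the recursive tree construction, and the names `entropy`, `information_gain`, `DATA` and `ATTRIBUTES` are assumptions made for this example rather than anything from the talk.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]

# One tuple per object 1..14: (Outlook, Temperature, Humidity, Windy, Class)
DATA = [
    ("sunny", "hot", "high", "FALSE", "N"),
    ("sunny", "hot", "high", "TRUE", "N"),
    ("overcast", "hot", "high", "FALSE", "P"),
    ("rain", "mild", "high", "FALSE", "P"),
    ("rain", "cool", "normal", "FALSE", "P"),
    ("rain", "cool", "normal", "TRUE", "N"),
    ("overcast", "cool", "normal", "TRUE", "P"),
    ("sunny", "mild", "high", "FALSE", "N"),
    ("sunny", "cool", "normal", "FALSE", "P"),
    ("rain", "mild", "normal", "FALSE", "P"),
    ("sunny", "mild", "normal", "TRUE", "P"),
    ("overcast", "mild", "high", "TRUE", "P"),
    ("overcast", "hot", "normal", "FALSE", "P"),
    ("rain", "mild", "high", "TRUE", "N"),
]


def entropy(rows):
    """Entropy (in bits) of the class label, stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(rows, index):
    """Reduction in class entropy obtained by splitting on column `index`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[index], []).append(row)
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name:12s} gain = {gain:.3f} bits")
print("Attribute chosen for the root:", best)
```

On this data the largest gain belongs to Outlook, so ID3 would split on it first and then repeat the same computation inside each branch.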
  13. 14. From decision trees to rules <ul><li>Reading rules from a tree </li></ul><ul><ul><li>Unambiguous </li></ul></ul><ul><ul><li>Rule order does not matter </li></ul></ul><ul><ul><li>Alternative rules for the same conclusion are ORed </li></ul></ul><ul><ul><li>But the resulting rules can be overly complex </li></ul></ul>
  14. 15. Rules can be much more compact than trees <ul><li>Ex: if x=1 and y=1 then class=a </li></ul><ul><li>if z=1 and w=1 then class=a </li></ul><ul><li>Otherwise class=b </li></ul>
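The three-rule example can be written directly as code, which gives a feel for how compact the rule form is. This is only a sketch: the function name `classify` and the encoding of the attributes as integer arguments are assumptions.

```python
def classify(x, y, z, w):
    """Apply the rule set: two rules for class 'a', default class 'b'."""
    if x == 1 and y == 1:
        return "a"
    if z == 1 and w == 1:
        return "a"
    return "b"  # the "otherwise" rule


print(classify(1, 1, 0, 0))  # first rule fires
print(classify(0, 0, 1, 1))  # second rule fires
print(classify(1, 0, 0, 1))  # no rule fires, default class
```

A decision tree for the same concept that tests x and y first has to repeat the z and w tests under every branch where the first rule fails, so the tree grows while the rule set stays at three lines.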
  15. 16. From rules to decision trees <ul><li>Rule disjunction results in overly complex trees. </li></ul><ul><li>Ex: write as a tree </li></ul><ul><ul><li>If a and b then x </li></ul></ul><ul><ul><li>If c and d then x ( Fig. 3.2 ) </li></ul></ul><ul><ul><li>(replicated sub-tree problem) </li></ul></ul><ul><li>Ex: tree and rules of equivalent complexity </li></ul><ul><li>Ex: tree much more complex than rules </li></ul>
  16. 17. To learn from examples, the examples must be rich enough <ul><li>Ex: sister-of relation ( fig 2-1 ) </li></ul><ul><li>Denormalization ( fig 2-3 ) </li></ul>Importance of data preparation
  17. 18. Attributes <ul><li>An attribute may be irrelevant in a given context (ex: number of wheels for a ship in a database of transportation vehicles) => Create value “irrelevant” </li></ul>
  18. 19. Software tools <ul><li>Many commercial packages </li></ul><ul><ul><li>CART ( http://www.salford-systems.com/landing.php ) </li></ul></ul><ul><ul><li>SPSS modules </li></ul></ul><ul><ul><li>WEKA (free) ( http://www.cs.waikato.ac.nz/~ml/weka/ ) </li></ul></ul><ul><ul><li>For a larger list: http://www.kdnuggets.com/software/suites.html </li></ul></ul><ul><li>Many field-specific packages </li></ul><ul><ul><li>In the context of GRID computing </li></ul></ul><ul><li>Demonstrating WEKA </li></ul>
  19. 20. Ad hoc methods <ul><li>Co-occurrence analysis </li></ul><ul><li>MK generality analysis </li></ul>
  20. 21. Term Co-occurrence Analysis <ul><li>The following approach measures the strength of association between a term i and a term j of the set of documents by: </li></ul><ul><li>$e(i,j)^2 = \dfrac{(C_{ij})^2}{C_i \cdot C_j}$ </li></ul><ul><li>Where: </li></ul><ul><li>$C_i$ is the number of documents indexed by term i </li></ul><ul><li>$C_j$ is the number of documents indexed by term j </li></ul><ul><li>$C_{ij}$ is the number of documents indexed by both terms i and j </li></ul>
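The association measure defined on this slide can be computed with a few lines of code. The sketch below assumes each document is represented as a set of index terms; the function name `association_strengths` and the four sample documents are hypothetical.

```python
from itertools import combinations


def association_strengths(doc_terms):
    """Return {(i, j): e(i,j)^2} for every pair of terms that co-occur.

    doc_terms maps a document id to the set of terms indexing that document.
    """
    count = {}       # C_i: number of documents indexed by term i
    pair_count = {}  # C_ij: number of documents indexed by both i and j
    for terms in doc_terms.values():
        for term in terms:
            count[term] = count.get(term, 0) + 1
        # sorted() gives each pair a canonical (alphabetical) order
        for i, j in combinations(sorted(terms), 2):
            pair_count[(i, j)] = pair_count.get((i, j), 0) + 1
    return {
        (i, j): c_ij ** 2 / (count[i] * count[j])
        for (i, j), c_ij in pair_count.items()
    }


# Hypothetical indexed collection
docs = {
    "doc1": {"clustering", "regression"},
    "doc2": {"clustering", "regression", "visualization"},
    "doc3": {"clustering"},
    "doc4": {"visualization"},
}
strengths = association_strengths(docs)
for pair, value in sorted(strengths.items()):
    print(pair, round(value, 3))
```

The value lies between 0 and 1 and reaches 1 only when the two terms always appear together; pairs that never co-occur are simply absent from the result.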
  21. 23. Interactive Data Visualization <ul><li>Fish eye views </li></ul><ul><li>Hyperbolic trees </li></ul><ul><li>Linear Visual data sequences </li></ul><ul><li>Dynamic queries </li></ul>
  22. 25. Tree Maps <ul><li>Financial Data http://www.smartmoney.com/marketmap/ </li></ul>
  23. 26. Conclusion <ul><li>Current state of the art (Graphical Models – Markov networks) </li></ul><ul><li>Still an art </li></ul><ul><li>Ethical issues </li></ul>
  24. 27. Bayesian Networks <ul><li>Objective: determine probability estimates that a given sample belongs to a class </li></ul><ul><li>Probability(x ∈ Class | attribute values) </li></ul><ul><li>Bayesian network: </li></ul><ul><ul><li>One node for each attribute </li></ul></ul><ul><ul><li>Nodes connected in an acyclic graph </li></ul></ul><ul><ul><li>Conditional independence </li></ul></ul>
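The conditional independence encoded by the acyclic graph is what makes the class probability computable: the joint distribution factors into one local term per node. As a sketch, write $a_1, \dots, a_n$ for the attribute nodes, $c$ for the class node, and $\mathrm{pa}(\cdot)$ for the parents of a node in the graph (this notation is assumed here and does not come from the slides):

```latex
P(c, a_1, \dots, a_n)
  = P\bigl(c \mid \mathrm{pa}(c)\bigr)\,
    \prod_{i=1}^{n} P\bigl(a_i \mid \mathrm{pa}(a_i)\bigr),
\qquad
P(c \mid a_1, \dots, a_n)
  = \frac{P(c, a_1, \dots, a_n)}{\sum_{c'} P(c', a_1, \dots, a_n)}
```

Naïve Bayes is the special case in which the class is the only parent of every attribute, so each factor reduces to $P(a_i \mid c)$.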
  25. 29. Learning a Bayesian network from data <ul><li>Function for evaluating a given network based on the data </li></ul><ul><li>Function for searching through the space of possible networks </li></ul><ul><li>K2 and TAN algorithms </li></ul>
  26. 30. Bayesian Networks; Graphical Models = Markov models, undirected edges