slides session 1



  1. 1. Data Mining. Luc Dehaspe, K.U.L. Computer Science Department; Marc Van Hulle, K.U.L. Neurophysiology Department
  2. 2. Course overview Data Mining Session 1: Introduction Session 2-3: Data warehousing/preparation Session 4-6: Symbolic Data Mining techniques Session 7: Application + Evaluation of Data Mining results <ul><li>Session 8-14: Numeric Data Mining methods </li></ul><ul><ul><ul><ul><ul><li>statistical techniques </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>self-organizing techniques </li></ul></ul></ul></ul></ul>(Hands-on) Exercise sessions
  3. 3. Exercise session <ul><li>Part 1 (L. Dehaspe) </li></ul><ul><ul><li>2* 2.5 h “paper-and-pencil” sessions </li></ul></ul><ul><ul><ul><li>application of algorithms </li></ul></ul></ul><ul><li>Part 2 (M. Van Hulle) </li></ul><ul><ul><li>hands-on exercises </li></ul></ul>
  4. 4. Exam <ul><li>Written exam, closed book </li></ul><ul><li>Part 1 (Sessions 1-7): 50% </li></ul><ul><ul><li>Coverage </li></ul></ul><ul><ul><ul><li>Questions RESTRICTED TO CONTENT OF SLIDES </li></ul></ul></ul><ul><ul><ul><li>Occasional pointers to additional material: I do not expect you to study this material </li></ul></ul></ul><ul><ul><li>Questions </li></ul></ul><ul><ul><ul><li>One main question: apply+understand algorithm (30%) </li></ul></ul></ul><ul><ul><ul><li>Two smaller questions: explain concept, compute model quality, … (2*10%) </li></ul></ul></ul><ul><li>Part 2 (Sessions 8-14): 50% (explained later by Marc Van Hulle) </li></ul>
  5. 5. Working definition data mining <ul><li>tools to search data for patterns and relationships that lead to better business decisions </li></ul><ul><ul><li>“business”: commercial/scientific </li></ul></ul>
  6. 6. Overview <ul><li>myths and facts </li></ul><ul><li>the Data Mining process </li></ul><ul><li>methods </li></ul><ul><ul><li>visual </li></ul></ul><ul><ul><li>non-visual </li></ul></ul>
  7. 7. Myths and facts <ul><li>New technology cycle </li></ul><ul><ul><li>phase 1: hype </li></ul></ul><ul><ul><ul><li>unrealistic expectations </li></ul></ul></ul><ul><ul><ul><li>“naive” users </li></ul></ul></ul><ul><ul><li>phase 2: frustration </li></ul></ul><ul><ul><li>phase 3: rejection </li></ul></ul><ul><li>Alternative: realistic view on vital technology </li></ul>
  8. 8. Myth 1: tabula rasa (virgin territory) <ul><li>Data mining methods are fundamentally different from previous methods </li></ul>Fact <ul><li>Underlying ideas often decades old </li></ul><ul><ul><li>neural networks: 1940 </li></ul></ul><ul><ul><li>k-nearest neighbour: 1950 </li></ul></ul><ul><ul><li>CART (regression trees): 1960 </li></ul></ul><ul><li>Novel </li></ul><ul><ul><li>integrated applications to general “business” problems </li></ul></ul><ul><ul><li>more data, more computing power </li></ul></ul><ul><ul><li>non-academic users </li></ul></ul>
  9. 9. Data Mining <ul><li>Myth: the magic bullet </li></ul>[Figure labels: Performance, Data, Task, Solution, Problem, Meta learning] Solution: integration of tools, mixture of old and new
  10. 10. Take home lesson 1 <ul><li>Not: 1 optimal method </li></ul><ul><li>But: portfolio of tools, mixture of old and new </li></ul>
  11. 11. Myth 2: manna from heaven <ul><li>Data mining produces surprising results </li></ul><ul><ul><li>that will turn your “business” upside-down </li></ul></ul><ul><ul><li>without any input of domain expert knowledge </li></ul></ul><ul><ul><li>without any tuning of the technology </li></ul></ul>Fact <ul><li>incremental changes rather than revolutionary </li></ul><ul><ul><li>long term competitive advantage </li></ul></ul><ul><ul><li>occasional breakthroughs (e.g. link between aspirin and Reye's syndrome) </li></ul></ul><ul><li>technology assistant to the domain expert </li></ul><ul><li>careful selection required of: </li></ul><ul><ul><li>goal </li></ul></ul><ul><ul><li>technology </li></ul></ul>
  12. 12. Take home lesson 2 <ul><li>Crucial combination of </li></ul><ul><ul><li>“business” (application domain) expertise </li></ul></ul><ul><ul><li>data mining technology expertise </li></ul></ul>
  13. 13. Data Mining process model <ul><li>Definition </li></ul><ul><li>Link with the scientific method </li></ul>
  14. 14. The data mining process <ul><ul><li>process : iterative; learn to ask better questions </li></ul></ul><ul><ul><li>valid : patterns can be generalized to new data </li></ul></ul><ul><ul><li>novel and useful : offer a competitive advantage </li></ul></ul><ul><ul><li>understandable : contribute to insight in the domain </li></ul></ul>The non-trivial process of finding valid, novel, potentially useful, and ultimately understandable patterns in data
  15. 15. Interrogating the database: look-up queries. What is the average toxicity of cadmium chloride? How many earthquakes occurred last year? Which customers have car insurance? How did HIV patient p123 react to AZT? [Data sources: biological data, clinical data, chemical data]
  16. 16. Interrogating the database: finding patterns (Data Mining). What is the relation between geological features and the occurrence of earthquakes? What is the relation between in vitro activity and chemical structure? What is the relation between the HIV patient’s therapy history and response to AZT? What is the profile of returning customers? [Data sources: biological data, clinical data, chemical data]
  17. 17. Science [Figure: items 5-8 labelled ACTIVE, items 1-4 labelled NON-ACTIVE]
  18. 18. Science collect data build hypothesis verify hypothesis formulate theory The formation of hypotheses is the most mysterious of all the categories of scientific method. Where they come from, no one knows. A person is sitting somewhere, minding his own business, and suddenly - flash! - he understands something he didn’t understand before. Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance The actual discovery of such an explanatory hypothesis is a process of creation, in which imagination as well as knowledge is involved. Irving Copi, Introduction to Logic, 1986 <ul><li>Tycho Brahe (1546-1601) </li></ul><ul><li>observational genius </li></ul><ul><li>collected data on Mars </li></ul><ul><li>Johannes Kepler (1571-1630) </li></ul><ul><li>mined Brahe’s data </li></ul><ul><li>discovered laws of planetary motion </li></ul>
  19. 19. Evolution of data generation Data source Data analyst Data < 1950 > 2000 Data Rich Knowledge Poor Everyone, even the most patient and thorough investigator, must pick and choose, deciding which facts to study and which to pass over. Irving Copi, Introduction to Logic, 1986
  20. 20. The scientific method collect data build hypothesis verify hypothesis formulate theory Data Mining Statistics - OLAP care inspiration Knowledge discovery in Databases Data warehousing
  21. 21. Data Mining <ul><li>Definition : </li></ul><ul><li>Extracting or “mining” knowledge from large amounts of data </li></ul>CRISP-DM process model
  22. 22. Data mining in industry <ul><li>An in silico research assistant allowing researchers to </li></ul><ul><ul><li>Explore integrated database </li></ul></ul><ul><ul><li>For variety of research purposes (“business goals”) </li></ul></ul><ul><ul><li>Using optimal selection of data mining technologies </li></ul></ul>pattern knowledge
  23. 23. Data Mining process model CRISP-DM
  24. 24. Business understanding <ul><li>Which are the business goals? </li></ul><ul><li>Translation to data mining problem definition </li></ul><ul><li>Design of a plan to meet objectives </li></ul>
  25. 25. Data understanding <ul><li>First collection of data </li></ul><ul><li>Becoming familiar with the data </li></ul><ul><li>Judge data quality </li></ul><ul><li>Discovery of </li></ul><ul><ul><li>first insights </li></ul></ul><ul><ul><li>interesting subsets </li></ul></ul>
  26. 26. Data preparation <ul><li>Extract final data set from original set </li></ul><ul><li>Selection of </li></ul><ul><ul><li>tables </li></ul></ul><ul><ul><li>records </li></ul></ul><ul><ul><li>attributes </li></ul></ul><ul><li>transformation </li></ul><ul><li>data cleaning </li></ul>
  27. 27. Modelling <ul><li>Selection of modelling techniques </li></ul><ul><li>calibrating parameters </li></ul><ul><li>regular backtracking to adapt data to technology </li></ul><ul><li>(some techniques discussed further on) </li></ul>
  28. 28. Evaluation <ul><li>Decide whether to use Data Mining results </li></ul><ul><li>Verification of all steps </li></ul><ul><li>Check whether business goals have been met </li></ul>
  29. 29. Deployment <ul><li>Organisation & presentation of new insights </li></ul><ul><li>variable complexity </li></ul><ul><ul><li>deliver report </li></ul></ul><ul><ul><li>implement software that allows process to be repeated </li></ul></ul>
  30. 30. Visual Data Mining methods <ul><li>Pro </li></ul><ul><ul><li>an image has a broader information bandwidth than text </li></ul></ul><ul><ul><li>(cf. a picture is worth a thousand words) </li></ul></ul><ul><li>Con </li></ul><ul><ul><li>problems with representation of > 3 dimensions </li></ul></ul><ul><ul><li>not effective in case of color blindness </li></ul></ul><ul><ul><li>interpretation can reveal more about the viewer (subject) than about the data (object) </li></ul></ul><ul><ul><ul><li>stars, clouds, Rorschach test </li></ul></ul></ul>
  31. 31. Visual Data Mining methods <ul><li>Error detection </li></ul>
  32. 32. Visual Data Mining methods <ul><li>Linkage analysis </li></ul>
  33. 33. Visual Data Mining methods <ul><li>Conditional probabilities </li></ul>
  34. 34. Visual Data Mining methods <ul><li>landscapes </li></ul>
  35. 35. Visual Data Mining methods <ul><li>Scatter plots </li></ul>
  36. 36. Non-visual data mining methods <ul><li>Statistics - OLAP </li></ul><ul><ul><li>descriptive: average, median, standard deviation, distribution </li></ul></ul><ul><ul><li>hypothesis testing: (observed differences)/(random variation) </li></ul></ul><ul><ul><li>discriminant analysis </li></ul></ul><ul><ul><li>predictive regression analysis: linear, non-linear </li></ul></ul><ul><ul><li>clustering </li></ul></ul><ul><li>Neural networks </li></ul><ul><li>Decision trees and rules </li></ul><ul><li>Conceptual clustering </li></ul><ul><li>Association rules </li></ul>
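The first two statistical items above (descriptive measures, and hypothesis testing as observed difference over random variation) can be sketched in a few lines. This is a minimal illustration, not the course's own material: the sample values are made up, and the ratio below uses the Welch-style standard error as one common choice for "random variation".

```python
from math import sqrt
from statistics import mean, median, pstdev, variance


def describe(values):
    """Descriptive statistics: average, median, population standard deviation."""
    return {"mean": mean(values), "median": median(values), "std": pstdev(values)}


def difference_ratio(sample_a, sample_b):
    """(observed difference) / (random variation).

    The denominator is the standard error of the difference of the two
    means (an assumption; other test statistics are possible).
    """
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical measurements, for illustration only.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
ratio = difference_ratio([5, 6, 7], [1, 2, 3])
```

A large ratio suggests the observed difference is unlikely to come from random variation alone; turning it into a p-value would require a reference distribution, which is left out here.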
  37. 37. (Non-)visual Data Mining methods OLAP - Data cubes <ul><li>Online analytical processing </li></ul><ul><li>Classical statistical methods + database technology </li></ul><ul><ul><li>real-time calculations </li></ul></ul><ul><ul><li>powerful visualisation methods </li></ul></ul>[Figure: data cube with dimensions City (Leuven, NY, Tokyo, Casablanca, Rio), Date (1-7), and Product (Juice, Cola, Milk, Cream, Toothpaste, Soap, Pizza, Cheese); fact data: sales volume in $100; visible cell values: 10, 50, 35, 60, 20, 15, 70, 25]
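A data cube stores one fact (here, sales volume) per combination of dimension values, and OLAP operations aggregate over the dimensions that are not of interest. The sketch below shows a roll-up over a tiny cube. The dimension names follow the slide, but the individual facts and the function name `roll_up` are hypothetical, since the slide does not say which value belongs to which cell.

```python
from collections import defaultdict

DIMENSIONS = ("city", "date", "product")

# Hypothetical facts: (city, date, product, sales volume in $100).
facts = [
    ("Leuven", 1, "Juice", 10),
    ("Leuven", 1, "Cola", 50),
    ("Leuven", 2, "Juice", 35),
    ("Tokyo", 1, "Juice", 60),
    ("Tokyo", 2, "Cola", 20),
    ("Rio", 2, "Milk", 15),
]


def roll_up(rows, keep):
    """Sum the sales volume over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


sales_by_city = roll_up(facts, ("city",))
sales_by_product = roll_up(facts, ("product",))
grand_total = roll_up(facts, ())
```

Keeping two dimensions, e.g. `roll_up(facts, ("city", "product"))`, gives one face of the cube; keeping none gives the grand total.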
  38. 38. Non-Visual Data Mining methods Regression
  39. 39. Non-visual Data Mining methods Discriminant analysis <ul><li>R.A. Fisher, 1936 </li></ul><ul><li>discovers planes that separate classes </li></ul>
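As a rough illustration of how a separating plane can be found, the sketch below computes Fisher's linear discriminant for two classes in two dimensions, where the "plane" reduces to a line. The points, the class names, and the choice of the midpoint between the class means as threshold are assumptions made for the example.

```python
def mean_vector(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter_matrix(points, centre):
    """Sum of outer products of the deviations from the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - centre[0], p[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_discriminant(class_a, class_b):
    """Return (w, threshold) with w = Sw^-1 (mean_a - mean_b)."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter_matrix(class_a, ma), scatter_matrix(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumed non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    # Threshold: projection of the midpoint between the two class means.
    threshold = 0.5 * sum(w[i] * (ma[i] + mb[i]) for i in range(2))
    return w, threshold


def classify(point, w, threshold):
    """Points projecting above the threshold fall on class A's side."""
    return "A" if w[0] * point[0] + w[1] * point[1] > threshold else "B"


# Hypothetical training points, for illustration only.
class_a = [(4, 2), (5, 3), (6, 2), (5, 1)]
class_b = [(1, 5), (2, 6), (3, 5), (2, 4)]
w, threshold = fisher_discriminant(class_a, class_b)
```

The line `w[0]*x + w[1]*y = threshold` is the separating boundary; with more attributes the same formula yields a plane or hyperplane.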
  40. 40. Non-Visual Data Mining methods Neural Networks <ul><li>Represent functions whose output is a discrete value, a real value, or a vector </li></ul><ul><li>Neurobiological motivation </li></ul><ul><li>Network parameters are tuned on the basis of input-output examples (backpropagation) </li></ul><ul><li>e.g. input from sensors </li></ul><ul><ul><li>camera (face recognition) </li></ul></ul><ul><ul><li>microphone (speech recognition) </li></ul></ul>
  41. 41. Non-Visual Data Mining methods Decision trees
  42. 42. Non-Visual Data Mining methods Decision trees <ul><li>Attribute selection </li></ul><ul><ul><li>information gain </li></ul></ul><ul><ul><li>“how well does an attribute distribute the data according to their target class?” </li></ul></ul><ul><ul><li>select the attribute with the maximal reduction of entropy, where Entropy = </li></ul></ul><ul><ul><ul><li>$-p_M \log_2 p_M - p_F \log_2 p_F$ (with $p_M$, $p_F$ the proportions of the two target classes) </li></ul></ul></ul>
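The attribute-selection step can be sketched as follows: compute the entropy of the target class before the split, subtract the weighted entropy of the subsets after splitting on an attribute, and prefer the attribute with the largest difference. The records and attribute names below are hypothetical, chosen so that one attribute separates the classes perfectly and the other not at all.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p)) over the classes."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction of entropy obtained by splitting `rows` on `attribute`."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


# Hypothetical car buyers, for illustration only.
buyers = [
    {"Frame": "2-Door", "Color": "Red", "Sex": "M"},
    {"Frame": "2-Door", "Color": "Blue", "Sex": "M"},
    {"Frame": "2-Door", "Color": "Red", "Sex": "M"},
    {"Frame": "2-Door", "Color": "Blue", "Sex": "M"},
    {"Frame": "4-Door", "Color": "Red", "Sex": "F"},
    {"Frame": "4-Door", "Color": "Blue", "Sex": "F"},
    {"Frame": "4-Door", "Color": "Red", "Sex": "F"},
    {"Frame": "4-Door", "Color": "Blue", "Sex": "F"},
]

gain_frame = information_gain(buyers, "Frame", "Sex")
gain_color = information_gain(buyers, "Color", "Sex")
```

In this toy data set `Frame` removes all uncertainty about the target (gain of 1 bit) while `Color` removes none, so a tree learner using this criterion would split on `Frame` first.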
  43. 43. Non-Visual Data Mining methods Decision rules <ul><li>IF </li></ul><ul><ul><li>Frame = 2-Door AND </li></ul></ul><ul><ul><li>Engine  V6 AND </li></ul></ul><ul><ul><li>Age < 50 AND </li></ul></ul><ul><ul><li>Cost > 30K AND </li></ul></ul><ul><ul><li>Color = Red </li></ul></ul><ul><li>THEN </li></ul><ul><ul><li>buyer is highly likely to be male </li></ul></ul>
  44. 44. Non-Visual Data Mining methods Clustering Eisen et al, PNAS 1998 Cholesterol biosynthesis Cell cycle Early response Signaling and angiogenesis Wound healing
  45. 45. Non-Visual Data Mining methods Conceptual clustering <ul><li>Groups examples and provides description of each group </li></ul>[Figure: cluster hierarchy] Root: all examples; A: Age < 20; B: Age = 20-40; b1: Age = 20-40 and Frame = 2-Door; b2: Age = 20-40 and Frame = 4-Door; C: Age = 40-60; D: Age > 60; d1: Age > 60 and Frame = 2-Door; d2: Age > 60 and Frame = 4-Door
  46. 46. Non-Visual Data Mining methods Association rules <ul><li>IF-THEN rules show relationships </li></ul><ul><li>e.g. Which products are bought together? </li></ul>Item sets: Wine and Pizza (60%); Wine, Pizza, Floppy, and Cheese (40%). Association rule: IF Wine and Pizza THEN Floppy and Cheese. Frequency: 40%. Accuracy: 40% / 60% ≈ 67%
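The two rule measures on this slide can be computed directly from a list of transactions. In the sketch below the ten baskets are hypothetical, constructed so that the item-set percentages match the slide; `frequency` and `accuracy` follow the slide's terms, which are more commonly called support and confidence.

```python
def frequency(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def accuracy(transactions, if_items, then_items):
    """Frequency of the full item set divided by frequency of the IF part."""
    full_set = set(if_items) | set(then_items)
    return frequency(transactions, full_set) / frequency(transactions, if_items)


# Hypothetical baskets, for illustration only.
transactions = (
    [{"Wine", "Pizza", "Floppy", "Cheese"}] * 4
    + [{"Wine", "Pizza"}] * 2
    + [{"Wine"}, {"Pizza", "Cheese"}, {"Milk"}, {"Soap"}]
)

rule_frequency = frequency(transactions, {"Wine", "Pizza", "Floppy", "Cheese"})
rule_accuracy = accuracy(transactions, {"Wine", "Pizza"}, {"Floppy", "Cheese"})
```

Here the rule holds in 4 of 10 baskets (frequency 0.4), and in 4 of the 6 baskets that contain Wine and Pizza (accuracy of about 0.67).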
  47. 47. Evaluation: pitfalls Post hoc ergo propter hoc Everyone who drank Stella in the year 1743 is now dead. Therefore, Stella is fatal.
  48. 48. Evaluation: pitfalls Correlation does not imply Causality <ul><li>Palm size correlates with your life expectancy </li></ul><ul><li>The larger your palm, the shorter you will live, on average. </li></ul>Why? Women have smaller palms and live 6 years longer on average. Be careful with actions inspired by data mining results!
  49. 49. Evaluation: pitfalls Hypothesis validation <ul><li>descriptive statistics: 1 hypothesis </li></ul><ul><li>data mining: 1 hypothesis- SPACE </li></ul><ul><ul><li>much higher probability of random relationships </li></ul></ul><ul><ul><li>validation on separate data set required </li></ul></ul>
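The effect of searching a whole hypothesis space can be made concrete with a small calculation. Assuming, for simplicity, that each hypothesis is tested independently at significance level `alpha` and that no real relationship exists, the chance that at least one hypothesis looks significant by accident is 1 - (1 - alpha)^m for m hypotheses. The function name and the independence assumption are illustrative choices, not part of the slides.

```python
def prob_spurious_pattern(num_hypotheses, alpha=0.05):
    """Chance that at least one of `num_hypotheses` independent tests
    passes at level `alpha` when there is in fact nothing to find."""
    return 1.0 - (1.0 - alpha) ** num_hypotheses


single = prob_spurious_pattern(1)      # one hypothesis, as in descriptive statistics
hundred = prob_spurious_pattern(100)   # a modest hypothesis space
```

With one hypothesis the risk equals `alpha`; with a hundred it already exceeds 99%, which is why a pattern found by mining should be confirmed on a separate data set.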