  1. 1. Data Mining and Knowledge Discovery. Part of the “New Media and eScience” MSc Programme and the “Statistics” MSc Programme, Fall semester, 2004/05. Nada Lavrač, Jožef Stefan Institute, Ljubljana, Slovenia. Thanks to Blaz Zupan, Saso Dzeroski and Peter Flach for contributing some slides to this course material.
  2. 2. Course participants <ul><li>I. NMeS MPSJS students </li></ul><ul><ul><li>Robert Blatnik </li></ul></ul><ul><ul><li>Joel Plisson </li></ul></ul><ul><ul><li>Jadran Prodan </li></ul></ul><ul><ul><li>Viljem Tisnikar </li></ul></ul><ul><li>II. Statistics students </li></ul><ul><ul><li>Borut Kodrič </li></ul></ul><ul><ul><li>Borut Rajer </li></ul></ul><ul><ul><li>Maja Sever </li></ul></ul><ul><li>III. Other participants </li></ul><ul><li>Dept. of Knowledge Technologies members, students, scholars </li></ul><ul><ul><li>Matjaz Depolli, Borut Lužar, Primož Lukšič, … </li></ul></ul><ul><li>Faculty of Mechanical Engineering MSc students </li></ul><ul><ul><li>Jože Jenkole, Viktor Zaletelj, Damir Husejnagič, Andrej Jermol </li></ul></ul>
  3. 3. Courses in Knowledge Technologies: Fall 2004/05 10 Nov. 15h-19h Data Mining and Knowledge Discovery prof. dr. Nada Lavrač 11 Nov 12h - 13h ??????? Concept of Sustainable Development prof. dr. Ivo Šlaus 11 Nov. 15h-19h Decision Support prof. dr. Marko Bohanec 17 Nov. 15h-19h Selected topics in New Media and eScience prof. dr. Sašo Džeroski
  4. 4. Courses in Knowledge Technologies: Fall 2004/05 25 Nov. 15h-19h ??????? Data Mining and Knowledge Discovery prof. dr. Nada Lavrač 15 Dec. 15h-19h New Media and Knowledge Management Nada Lavrač, Mitja Jermol Tanja Urbančič Sašo Džeroski Tomaž Erjavec 13 Jan. 15h-19h Language Technologies to be defined Text and Web Mining, Active Learning, Relational Data Mining, Equation Discovery, .. Mladenić, Grobelnik, Todorovski, ...
  5. 5. Advanced Course on Knowledge Technologies: ACAI-05 Ljubljana, June 27–July 8, 2005
  6. 6. Credits and coursework <ul><li>“New Media and eScience” MSc Programme </li></ul><ul><li>6 credits </li></ul><ul><li>30 hours </li></ul><ul><ul><li>10 – lectures </li></ul></ul><ul><ul><li>10 – hands-on </li></ul></ul><ul><ul><li>10 – seminar </li></ul></ul><ul><li>Individual workload distribution and/or consultations: to be agreed by mail/phone </li></ul><ul><li>“Statistics” MSc Programme </li></ul><ul><li>12 credits </li></ul><ul><li>36 hours </li></ul><ul><ul><li>24 – lectures </li></ul></ul><ul><ul><li>12 – seminar </li></ul></ul><ul><li>Individual workload distribution and/or consultations: to be agreed by mail/phone </li></ul>
  7. 7. Credits and coursework: Sample individual programmes <ul><li>“New Media and eScience” MSc Programme </li></ul><ul><li>6 credits, 30 hours </li></ul><ul><ul><li>Lectures (with/without ACAI lectures) </li></ul></ul><ul><ul><li>e.g., ACAI hands-on (1x, 2x or 3x4 hours) </li></ul></ul><ul><ul><li>Seminar based on the results of ACAI hands-on work </li></ul></ul><ul><li>“Statistics” MSc Programme </li></ul><ul><li>12 credits, 36 hours </li></ul><ul><ul><li>Lectures (e.g., with ACAI lectures) </li></ul></ul><ul><ul><li>e.g., WEKA ACAI hands-on (1x4 hours) </li></ul></ul><ul><ul><li>Individual seminar work, using your own data (e.g., using WEKA for survey data analysis) </li></ul></ul>
  8. 8. Outline of 10 Nov. and 25 Nov. lectures on DM and KDD <ul><li>I. Introduction </li></ul><ul><ul><li>Data Mining and KDD process </li></ul></ul><ul><ul><li>Why DM: Examples of discovered patterns and applications </li></ul></ul><ul><ul><li>Classification of DM tasks and techniques </li></ul></ul><ul><ul><li>Visualization and overview of DM tools </li></ul></ul><ul><ul><li>(Ch. 1,2,11,12,13 of DM&DS book) </li></ul></ul><ul><li>II. DM Techniques </li></ul><ul><ul><li>Classification of DM tasks and techniques </li></ul></ul><ul><ul><li>Predictive DM </li></ul></ul><ul><ul><ul><li>Decision Tree induction (Ch. 3 of Mitchell’s book) </li></ul></ul></ul><ul><ul><ul><li>Learning sets of rules ( Ch. 7 of IDA book, Ch. 10 of Mitchell’s book ) </li></ul></ul></ul><ul><ul><li>Descriptive DM </li></ul></ul><ul><ul><ul><li>Association rule induction </li></ul></ul></ul><ul><ul><ul><li>Subgroup discovery </li></ul></ul></ul><ul><ul><ul><li>Hierarchical clustering </li></ul></ul></ul><ul><li>III. Evaluation </li></ul><ul><ul><li>Evaluation methodology </li></ul></ul><ul><ul><li>Evaluation measures </li></ul></ul><ul><li>IV. Relational Data Mining </li></ul><ul><ul><li>What is RDM? </li></ul></ul><ul><ul><li>Propositionalization </li></ul></ul><ul><ul><li>Inductive Logic Programming </li></ul></ul><ul><ul><li>(Ch. 3,4,11 of RDM book) </li></ul></ul><ul><li>V. Concluding Remarks </li></ul>
  9. 9. Introduction to data mining <ul><li>Data Mining (DM) and related areas </li></ul><ul><li>Why DM: Examples of discovered patterns and applications </li></ul><ul><li>Classification of DM tasks and techniques </li></ul><ul><li>Visualization and overview of DM tools </li></ul>
  10. 10. What is data mining <ul><li>Extraction of useful information from data: discovering relationships that have not been previously known </li></ul><ul><li>The viewpoint in this course: DM is the application of machine learning techniques to “hard” real-life problems </li></ul>
  11. 11. Related areas <ul><li>Database technology </li></ul><ul><li>and data warehouses </li></ul><ul><li>efficient storage, access and manipulation of data </li></ul>[Diagram labels: DM, statistics, machine learning, visualization, text and Web mining, soft computing, pattern recognition, databases]
  12. 12. Related areas <ul><li>Statistics, </li></ul><ul><li>machine learning, </li></ul><ul><li>pattern recognition </li></ul><ul><li>and soft computing* </li></ul><ul><li>techniques for classification and knowledge extraction from data </li></ul>* neural networks, fuzzy logic, genetic algorithms, probabilistic reasoning
  13. 13. Related areas <ul><li>Text and Web mining </li></ul><ul><li>Web page analysis </li></ul><ul><li>text categorization </li></ul><ul><li>acquisition, filtering and structuring of textual information </li></ul><ul><li>natural language processing </li></ul>
  14. 14. Related areas <ul><li>Visualization </li></ul><ul><li>visualization of data and discovered knowledge </li></ul>
  15. 15. Point of view in this tutorial <ul><li>Data mining with machine learning methods </li></ul><ul><li>Emphasis on the relation to statistics </li></ul>
  16. 16. Machine learning and statistics <ul><li>Both have a long tradition of developing inductive techniques for data analysis </li></ul><ul><ul><li>reasoning from properties of data samples to properties of a population </li></ul></ul><ul><li>DM = statistics + marketing? No! DM = statistics + ... + machine learning </li></ul><ul><li>Statistics is particularly appropriate for hypothesis testing and data analysis under certain theoretical expectations about data distribution, independence, random sampling, sample size, … </li></ul><ul><li>Machine learning is particularly appropriate for inducing generalizations that consist of easily understandable patterns, induced from both large and small samples </li></ul>
  17. 17. DM and KDD <ul><li>DM is a way of doing data analysis, aimed at finding patterns, revealing hidden regularities and relationships </li></ul><ul><li>Knowledge Discovery in Databases (KDD) provides a broader view: </li></ul><ul><li> - KDD is defined as “the process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” * </li></ul><ul><li> - KDD provides tools to automate the entire process of data analysis, including the statistician’s art of hypothesis selection </li></ul><ul><li>DM is the key element in this much more elaborate KDD process </li></ul>* Usama M. Fayyad et al., The KDD Process for Extracting Useful Knowledge from Volumes of Data. Comm. ACM, Nov. 1996
  18. 18. The KDD process <ul><li>KDD involves several phases: </li></ul><ul><ul><li>data preparation (selection, pre-processing, transformation) </li></ul></ul><ul><ul><li>data mining </li></ul></ul><ul><ul><li>interpretation and evaluation of discovered patterns </li></ul></ul><ul><li>Data mining is the key phase, 15-25% of the KDD process </li></ul>
  19. 19. Part I. Introduction <ul><li>Data Mining and the KDD process </li></ul><ul><li>Why DM: Examples of discovered patterns and applications </li></ul><ul><li>Classification of DM tasks and techniques </li></ul><ul><li>Visualization and overview of DM tools </li></ul>
  20. 20. The SolEuNet Project <ul><li>European 5FP project “Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise”, 2000-2003 </li></ul><ul><li>Scientific coordinator: Jožef Stefan Institute; administrative coordinator: Fraunhofer Gesellschaft </li></ul><ul><li>3 M€, 12 partners (8 academic and 4 business) from 7 countries </li></ul><ul><li>Main project objectives: </li></ul><ul><ul><li>development of prototype solutions for end-users </li></ul></ul><ul><ul><li>foundation of a virtual enterprise for marketing data mining and decision support expertise, involving business and academia </li></ul></ul>
  21. 21. Data mining application prototypes <ul><li>Mediana – analysis of media research data </li></ul><ul><li>Kline & Kline – improved brand name recognition </li></ul><ul><li>Australian financial house – customer quality evaluation, stock market prediction </li></ul><ul><li>Czech health farm – predict the use of resources </li></ul><ul><li>UK County Council - analysis of traffic accident data </li></ul><ul><li>Portuguese statistical bureau – Web page access analysis for better page organization </li></ul><ul><li>Detection of coronary heart disease risk groups </li></ul><ul><li>Analysis of online dating </li></ul><ul><li>EC Harris, UK - analysis of building construction projects </li></ul><ul><li>European Commission - analysis of 5FP IST projects: better understanding of large amounts of text documents, “clique” identification </li></ul>
  22. 22. Mediana case study <ul><li>Questionnaires about journal/magazine reading, watching TV programs and listening to radio programs, published annually since 1992, with about 1200 questions/attributes (frequency of reading/listening/watching, distribution w.r.t. sex, age, education, buying power, interests, ...) </li></ul><ul><li>Data for 1998, about 8000 questionnaires </li></ul><ul><li>Good quality, “clean” data </li></ul><ul><li>Table of n-tuples (rows: individuals, columns: attributes) </li></ul>
  23. 23. Mediana case study <ul><li>Target patterns: </li></ul><ul><ul><li>Which other journals/magazines are read by readers of a particular journal/magazine? </li></ul></ul><ul><ul><li>What are the properties of individuals who are consumers of a particular medium? </li></ul></ul><ul><ul><li>Which properties are distinctive for readers of various journals? </li></ul></ul><ul><li>Induced models: description (association rules, clusters) and classification (decision trees, classification rules) </li></ul>
  24. 24. Decision trees <ul><li>Finding reader profiles: decision tree for classifying people into readers and non-readers of a teenage magazine </li></ul>
  25. 25. Classification rules Set of Rules: if Cond then Class Interpretation: if-then ruleset, or if-then-else decision list Class: Reading of daily newspaper EN (Evening News) if a person does not read MM (Maribor Magazine) and rarely reads the weekly magazine “7Days” then the person does not read EN (Evening News) else if a person rarely reads MM and does not read the weekly magazine SN (Sunday News) then the person reads EN else if a person rarely reads MM then the person does not read EN else the person reads EN.
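The if-then-else decision list on this slide can be sketched as ordinary code, where the first rule whose condition holds decides the class. This is only an illustration: the function name and the string encoding of reading frequency (`"never"`, `"rarely"`, `"often"`) are assumptions, not part of the original material.

```python
def reads_evening_news(reads_mm: str, reads_7days: str, reads_sn: str) -> bool:
    """Ordered decision list for the class 'reads EN (Evening News)'.

    Each argument is a reading frequency, encoded here (an assumption)
    as "never", "rarely" or "often". Rules are tried top to bottom and
    the first matching rule determines the prediction.
    """
    if reads_mm == "never" and reads_7days == "rarely":
        return False  # rule 1: does not read EN
    elif reads_mm == "rarely" and reads_sn == "never":
        return True   # rule 2: reads EN
    elif reads_mm == "rarely":
        return False  # rule 3: does not read EN
    else:
        return True   # default rule: reads EN


if __name__ == "__main__":
    print(reads_evening_news("rarely", "often", "never"))
```

Because the rules are ordered, rule 3 only applies to people who rarely read MM and did not already match rule 2. Read as an unordered if-then ruleset instead, each rule would have to be interpreted on its own.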
  26. 26. Association rules <ul><li>Rules X => Y, where X and Y are conjunctions of binary attributes </li></ul><ul><li>Support: Sup(X,Y) = #XY / #D = p(XY) </li></ul><ul><li>Confidence: Conf(X,Y) = #XY / #X = p(XY) / p(X) = p(Y|X) </li></ul><ul><li>Task: Find all association rules that satisfy minimum support and minimum confidence constraints. </li></ul><ul><li>Example association rule about readers of the yellow press daily newspaper SloN (Slovenian News): </li></ul><ul><li>read_Love_Stories_Magazine => read_SloN </li></ul><ul><li>sup = 3.5% (3.5% of the whole dataset population reads both LSM and SloN) </li></ul><ul><li>conf = 61% (61% of those reading LSM also read SloN) </li></ul>
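The support and confidence definitions above can be checked on a small example. The sketch below is a minimal illustration that assumes each questionnaire is represented as the set of binary attributes that are true for that respondent. The ten toy rows are invented for the example and are not Mediana data.

```python
from typing import FrozenSet, List

Row = FrozenSet[str]


def support(dataset: List[Row], x: Row, y: Row) -> float:
    """Sup(X,Y) = #XY / #D: fraction of rows containing every attribute of X and Y."""
    n_xy = sum(1 for row in dataset if (x | y) <= row)
    return n_xy / len(dataset)


def confidence(dataset: List[Row], x: Row, y: Row) -> float:
    """Conf(X,Y) = #XY / #X: among rows satisfying X, the fraction also satisfying Y."""
    n_x = sum(1 for row in dataset if x <= row)
    n_xy = sum(1 for row in dataset if (x | y) <= row)
    return n_xy / n_x if n_x else 0.0


# Toy data (hypothetical): one frozenset of true attributes per respondent.
dataset = [
    frozenset({"read_LSM", "read_SloN"}),
    frozenset({"read_LSM", "read_SloN"}),
    frozenset({"read_LSM"}),
    frozenset({"read_SloN"}),
    frozenset({"read_SloN"}),
    frozenset(),
    frozenset(),
    frozenset({"read_LSM", "read_SloN"}),
    frozenset(),
    frozenset({"read_SloN"}),
]

x = frozenset({"read_LSM"})
y = frozenset({"read_SloN"})
sup = support(dataset, x, y)       # 3 of 10 rows contain both attributes
conf = confidence(dataset, x, y)   # 3 of the 4 LSM readers also read SloN
print(f"sup = {sup:.2f}, conf = {conf:.2f}")
```

A rule miner would enumerate candidate X and Y and keep only the rules whose support and confidence both exceed the chosen minimum thresholds. This sketch only scores one given rule.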
  27. 27. Association rules Finding profiles of readers of the Delo daily newspaper 1. read_Marketing_magazine 116 => read_Delo 95 (0.82) 2. read_Financial_News 223 => read_Delo 180 (0.81) 3. read_Views 201 => read_Delo 157 (0.78) 4. read_Money 197 => read_Delo 150 (0.76) 5. read_Vip 181 => read_Delo 134 (0.74) Interpretation: Most readers of Marketing magazine, Financial News, Views, Money and Vip also read Delo.
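As a quick check, the values in parentheses can be recomputed from the two counts given for each rule. The sketch reads the first count as #X (readers of the antecedent magazine) and the second as #XY (those who also read Delo). That reading is an interpretation of the slide, though the ratios do reproduce the listed values.

```python
# (antecedent, #X, #XY) as listed on the slide; the meaning of the two
# counts is an assumed interpretation, see the note above.
rules = [
    ("read_Marketing_magazine", 116, 95),
    ("read_Financial_News", 223, 180),
    ("read_Views", 201, 157),
    ("read_Money", 197, 150),
    ("read_Vip", 181, 134),
]


def rule_confidences(rule_counts):
    """Return {antecedent: #XY / #X}, rounded to two decimals."""
    return {name: round(n_xy / n_x, 2) for name, n_x, n_xy in rule_counts}


confidences = rule_confidences(rules)
for name, conf in confidences.items():
    print(f"{name} => read_Delo  conf = {conf:.2f}")
```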
  28. 28. Analysis of UK traffic accidents <ul><li>End-user: Hampshire County Council (HCC, UK) </li></ul><ul><ul><li>Can records of road traffic accidents be analysed to produce road safety information valuable to county surveyors? </li></ul></ul><ul><ul><li>HCC is sponsored to carry out a research project Road Surface Characteristics and Safety </li></ul></ul><ul><ul><li>Research includes an analysis of the STATS19 Accident Report Form Database to identify trends over time in the relationships between recorded road-user type/injury, vehicle position/damage, and road surface characteristics </li></ul></ul>
  29. 29. STATS19 Data Base 10 <ul><li>Over 5 million accidents recorded in 1979-1999 </li></ul><ul><li>3 data tables </li></ul>Accident ACC7999 (~5 mil. accidents, 30 variables) Where? When? How many? Vehicle VEH7999 (~9 mil. vehicles, 24 variables) Which vehicles? What movement? Which consequences? Casualty CAS7999 (~7 mil. injuries, 16 variables) Who was injured? What injuries? ...
  30. 30. Data understanding
  31. 31. Data quality: Accident location
  32. 32. Data preparation <ul><li>There are 51 police force areas in the UK </li></ul><ul><li>For each area we count the number of accidents in each: </li></ul><ul><ul><li>Year </li></ul></ul><ul><ul><li>Month </li></ul></ul><ul><ul><li>Day of Week </li></ul></ul><ul><ul><li>Hour of Day </li></ul></ul>
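The counting step described on this slide might look as follows. This is a sketch under assumptions: the `(area, timestamp)` record layout and the four sample records are hypothetical and do not reflect the actual STATS19 schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical accident records: (police force area, date and time).
accidents = [
    ("Hampshire", datetime(1998, 1, 9, 17, 30)),
    ("Hampshire", datetime(1998, 1, 16, 17, 5)),
    ("Hampshire", datetime(1998, 8, 15, 11, 0)),
    ("Kent", datetime(1999, 1, 9, 8, 15)),
]


def count_by_period(records):
    """Count accidents per area for each time granularity.

    Returns one Counter per granularity, keyed by (area, value).
    Day of week follows datetime.weekday(): 0 = Monday ... 6 = Sunday.
    """
    counts = {
        "year": Counter(),
        "month": Counter(),
        "day_of_week": Counter(),
        "hour": Counter(),
    }
    for area, ts in records:
        counts["year"][(area, ts.year)] += 1
        counts["month"][(area, ts.month)] += 1
        counts["day_of_week"][(area, ts.weekday())] += 1
        counts["hour"][(area, ts.hour)] += 1
    return counts


counts = count_by_period(accidents)
print(counts["month"][("Hampshire", 1)])  # accidents in Hampshire in January
```

Pairs of these dimensions (for example month by hour) can be counted the same way by using a three-part key such as `(area, ts.month, ts.hour)`.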
  33. 33. Data preparation
  34. 34. Simple visualization of short time series <ul><li>Used for data understanding </li></ul><ul><li>Very informative and easy to understand format </li></ul><ul><li>UK traffic accident analysis: Distributions of number of accidents over different time periods (year, month, day of week, and hour) </li></ul>
  35. 35. Year/Month distribution Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Darker color - MORE accidents
  36. 36. Day of Week/Month distribution All weekdays (Mon – Fri) are worse in deep winter, Friday the worst SUN FRI SAT MON THU TUES WED Jan Feb Mar Apr May Jun July Aug Sept Oct Nov Dec
  37. 37. Hour/Month distribution <ul><li>More Accidents at “Rush Hour”, Afternoon Rush hour is the worst </li></ul><ul><li>More holiday traffic (less rush hour) in August </li></ul>Jan Feb Mar Apr May Jun Jul Aug Sept Oct Nov Dec
  38. 38. Day of Week/Hour distribution <ul><li>1. More accidents at “rush hour”; the afternoon rush hour is the worst and lasts longer, with an “early finish” on Fridays </li></ul><ul><li>2. More leisure traffic on Saturday/Sunday </li></ul>SUN FRI SAT MON THU TUES WED
  39. 39. Traffic: different modeling approaches <ul><li>association rule learning </li></ul><ul><li>static subgroup discovery </li></ul><ul><li>dynamic subgroup discovery </li></ul><ul><li>clustering of short time series </li></ul><ul><li>text mining </li></ul><ul><li>multi-relational approaches </li></ul><ul><li>… </li></ul>
  40. 40. Some discovered association rules <ul><li>Association rules: Road number and Severity of accident </li></ul><ul><ul><li>The probability of a fatal or serious accident on the “K8” road is 2.2 times greater than the probability of fatal or serious accidents in the county generally. </li></ul></ul><ul><ul><li>The probability of fatal accidents on the “K7” road is 2.8 times greater than the probability of fatal accidents in the county generally (when the road is dry and the speed limit = 70). </li></ul></ul>
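Statements of the form "N times greater than in the county generally" compare a conditional probability with the overall one. The sketch below shows that ratio under assumptions: the function name and all counts are invented for illustration and are not STATS19 figures.

```python
def relative_probability(severe_on_road: int, total_on_road: int,
                         severe_in_county: int, total_in_county: int) -> float:
    """Ratio P(severe | road) / P(severe) computed from accident counts."""
    p_road = severe_on_road / total_on_road
    p_county = severe_in_county / total_in_county
    return p_road / p_county


# Hypothetical counts: 22 of 100 accidents on one road are fatal or serious,
# against 1000 of 10000 accidents in the county as a whole.
ratio = relative_probability(22, 100, 1000, 10000)
print(f"{ratio:.1f} times the county-wide probability")
```

In association rule terms this ratio is the lift of the rule road => severity: the rule's confidence divided by the overall frequency of the consequent.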
  41. 41. Analysis of documents of European IST project <ul><li>Data source: </li></ul><ul><li>List of IST project descriptions as 1-2 page text summaries from the Web (database www.cordis.lu/ ) </li></ul><ul><li>IST 5FP has 2786 projects, in which 7886 organizations participate </li></ul><ul><li>Analysis tasks: </li></ul><ul><li>Visualization of project topics </li></ul><ul><li>Analysis of collaboration </li></ul><ul><li>Connectedness between organizations </li></ul><ul><li>Community/clique identification </li></ul><ul><li>Thematic consortia identification </li></ul><ul><li>Simulation of 6FP IST </li></ul>
  42. 42. Analysis of documents of European IST project
  43. 43. Visualization into 25 project groups Health Data analysis Knowledge Management Mobile computing
  44. 44. Institutional Backbone of IST Telecommunication Transport Electronics No. of joint projects
  45. 45. Collaboration between countries (top 12) Most active country Number of collaborations
  46. 46. Part I. Introduction <ul><li>Data Mining and the KDD process </li></ul><ul><li>Why DM: Examples of discovered patterns and applications </li></ul><ul><li>Classification of DM tasks and techniques </li></ul><ul><li>Visualization and overview of DM tools </li></ul>
  47. 47. Types of DM tasks <ul><li>Predictive DM: </li></ul><ul><ul><li>Classification (learning of rulesets, decision trees, ...) </li></ul></ul><ul><ul><li>Prediction and estimation (regression) </li></ul></ul><ul><ul><li>Predictive relational DM (RDM, ILP) </li></ul></ul><ul><li>Descriptive DM: </li></ul><ul><ul><li>description and summarization </li></ul></ul><ul><ul><li>dependency analysis (association rule learning) </li></ul></ul><ul><ul><li>discovery of properties and constraints </li></ul></ul><ul><ul><li>segmentation (clustering) </li></ul></ul><ul><ul><li>subgroup discovery </li></ul></ul><ul><li>Text, Web and image analysis </li></ul>[Diagrams: examples marked +, - and x, with hypothesis H]
  48. 48. Predictive vs. descriptive induction <ul><li>Predictive induction </li></ul><ul><li>Descriptive induction </li></ul>[Diagrams: sets of + and - examples]
  49. 49. Predictive vs. descriptive induction <ul><li>Predictive induction: Inducing classifiers for solving classification and prediction tasks, </li></ul><ul><ul><li>Classification rule learning, Decision tree learning, ... </li></ul></ul><ul><ul><li>Bayesian classifier, ANN, SVM, ... </li></ul></ul><ul><ul><li>Data analysis through hypothesis generation and testing </li></ul></ul><ul><li>Descriptive induction: Discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks </li></ul><ul><ul><li>Symbolic clustering, Association rule learning, Subgroup discovery, ... </li></ul></ul><ul><ul><li>Exploratory data analysis </li></ul></ul>
  50. 50. Predictive vs. descriptive induction: A rule learning perspective <ul><li>Predictive induction: Induces rulesets acting as classifiers for solving classification and prediction tasks </li></ul><ul><li>Descriptive induction: Discovers individual rules describing interesting regularities in the data </li></ul><ul><li>Therefore: Different goals, different heuristics, different evaluation criteria </li></ul>
  51. 51. Supervised vs. unsupervised learning: A rule learning perspective <ul><li>Supervised learning: Rules are induced from labeled instances (training examples with class assignment) - usually used in predictive induction </li></ul><ul><li>Unsupervised learning: Rules are induced from unlabeled instances (training examples with no class assignment) - usually used in descriptive induction </li></ul><ul><li>Exception: Subgroup discovery </li></ul><ul><li>Discovers individual rules describing interesting regularities in the data from labeled examples </li></ul>
  52. 52. Subgroups vs. classifiers <ul><li>Classifiers: </li></ul><ul><ul><li>Classification rules aim at pure subgroups </li></ul></ul><ul><ul><li>A set of rules forms a domain model </li></ul></ul><ul><li>Subgroups: </li></ul><ul><ul><li>Rules describing subgroups aim at a significantly higher proportion of positives </li></ul></ul><ul><ul><li>Each rule is an independent chunk of knowledge </li></ul></ul><ul><li>Link: </li></ul><ul><ul><li>Subgroup discovery (SD) can be viewed as a form of cost-sensitive classification </li></ul></ul>
  53. 53. Part I. Introduction <ul><li>Data Mining and the KDD process </li></ul><ul><li>Why DM: Examples of discovered patterns and applications </li></ul><ul><li>Classification of DM tasks and techniques </li></ul><ul><li>Visualization and overview of DM tools </li></ul>
  54. 54. Visualization <ul><li>can be used on its own (usually for description and summarization tasks) </li></ul><ul><li>can be used in combination with other DM techniques, for example </li></ul><ul><ul><li>visualization of decision trees </li></ul></ul><ul><ul><li>cluster visualization </li></ul></ul><ul><ul><li>visualization of association rules </li></ul></ul><ul><ul><li>subgroup visualization </li></ul></ul>
  55. 55. Data visualization: Scatter plot
  56. 56. Daisy Graph Visualization by B. Zupan et al.
  57. 57. Daisy Graph Patients were mostly female
  58. 58. Daisy Graph The older the patient, the higher the difference of HHS between two follow-ups
  59. 59. Data visualization: time dependency Cumulative ineffectiveness of antibiotics gentamycin, clindamycin, cefpiramide, and cefotaxim [Bohanec et al., “PTAH: A system for supporting nosocomial infection therapy”, IDAMAP book, 1997]
  60. 60. Subgroup visualization Subgroups of patients with CHD risk [Gamberger, Lavrac & Wettschereck, IDAMAP2002]
  61. 61. Subgroup visualization Subgroups of patients with CHD risk [Gamberger, Lavrac & Wettschereck, IDAMAP2002]
  62. 62. Subgroup visualization Subgroups of patients with CHD risk [Gamberger & Lavrac, ICML2002]
  63. 63. DB Miner: Association rule visualization
  64. 64. MineSet: Association Rule Visualization
  65. 65. MineSet: Decision tree visualization
  66. 66. DM tools
  67. 67. Clementine
  68. 68. S-Plus
  69. 69. Part I: Summary <ul><li>KDD is the overall process of discovering useful knowledge in data </li></ul><ul><ul><li>many steps including data preparation, cleaning, transformation, pre-processing </li></ul></ul><ul><li>Data Mining is the data analysis phase in KDD </li></ul><ul><ul><li>DM takes only 15%-25% of the effort of the overall KDD process </li></ul></ul><ul><ul><li>employing techniques from machine learning and statistics </li></ul></ul><ul><li>Predictive and descriptive induction have different goals: classifier vs. pattern discovery </li></ul><ul><li>Many application areas </li></ul><ul><li>Many powerful tools available </li></ul>
  70. 70. Part I: Introduction. Questions