Data Mining Techniques

Transcript

  • 1. Data Mining Techniques Wojtek Kowalczyk www.cs.vu.nl/~wojtek www.cs.vu.nl/~wojtek/DataMine www.cs.vu.nl/ci/DataMine/DIANA wojtek@cs.vu.nl R4.50
  • 2. Outline • Organization of the course • What is Data Mining? • Course overview • Data Mining Tasks • Data Mining Cycle • Data Mining Techniques
  • 3. Objectives of the course • Provide an overview of the most common algorithms and techniques used in Data Mining (lectures) • Provide extensive “hands-on” experience with applying these techniques (practicum) • Provide a survey of typical (and future) applications of data mining
  • 4. Organization of the course • 12 lectures (1sp) + 3 assignments (3sp) (1sp = 40 hrs of work) • no exams; grades based on assignments (theory & practice) • assignments on: 8.03, 12.04, 03.05 • deadlines 3 weeks later: 5.04, 3.05, 24.05 • work in pairs(?); registration obligatory (before 1.03) by e-mail to dmt@few.vu.nl Subject: DMT-registration Body: Full name; e-mail address; student number; {AI|BWI|…} Full name; e-mail address; student number; {AI|BWI|…}
  • 5. Materials • Slides, notes, assignments: www.cs.vu.nl/~wojtek/DataMine • Book: “Data Mining” by Ian H. Witten and Eibe Frank, www.cs.waikato.ac.nz/~ml/weka/book.html • Internet: www.kdnuggets.com • Further readings from different perspectives: - business aspects: Berry & Linoff - theory: Hand, Mannila, Smyth; Tan, Steinbach, Kumar - latest: proceedings of KDD, PKDD, PAKDD, ML, ...
  • 6. Origins of Data Mining • Every day the world creates a few exabytes of data: 1 exabyte = 1000 petabytes, 1 petabyte = 1000 terabytes, 1 terabyte = 1000 gigabytes • Only 4% of the data is used for any purpose (IBM) • If we could only do something useful with this data ... • ... the field of DATA MINING is born
  • 7. Sources of data • satellites (images) • business: banks, telecom, insurance, retail, airlines, … • internet (only a few terabytes in the late 90s) • libraries (e.g., Library of Congress: 20 TB - 3 PB) • law enforcement agencies (FBI fingerprints DB: 1 PB) • Bioinformatics? RFID tags? Homeland security?
  • 8. Typical data mining applications • fraud detection (credit cards, telecom, insurance, taxes, …) • credit scoring and control (“to give or not to give?”) • marketing (mailing selection, modeling churn/retention, attrition, cross-selling, market basket analysis, etc.) • Customer Relationship Management (CRM) • criminal investigations (text mining) • … In Holland every citizen is “present” in 800-1000 databases !!!
  • 9. What is Data Mining? • Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data. (U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, KDD-96) • Data mining is an area in the intersection of machine learning, statistics, and databases. (M. Holsheimer, M. Kersten, H. Mannila and H. Toivonen) • Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for a business advantage. (SAS Institute)
  • 10. Sorts of Data Mining Tasks Predictive Data Mining (“supervised”): • Classification • Regression • Time series Knowledge Discovery (“unsupervised”): • Deviation Detection • Segmentation • Clustering • Association Rules • Summarization • Visualization
  • 11. Examples • Medical diagnosis: soft or hard contact lenses • Credit application scoring: grant a loan or not? • Fraud detection: is the transaction suspicious or not? • Direct mailing: who should be offered a given product? • CPU performance: how to configure computers? • Remote sensing: determine water pollution from spectral images • Load forecasting: predict future demand for electric power • Intelligent ATMs: how much cash will be there tomorrow? • identify groups of similar credit card users • automatically organize incoming e-mails • characterize interests of an Internet user • etc.
  • 12. Contact lenses: a classification task Can I use contact lenses? Possible output: none, soft, hard. Decision based on: - age - spectacle prescription - astigmatism - tear production rate
  • 13. Hypothetical Decision Table
      age             prescription   astigmatism   tear p.r.   lenses
      young           myope          no            reduced     none
      young           myope          no            normal      soft
      young           hypermetrope   yes           reduced     none
      pre-presbyopic  myope          no            reduced     none
      pre-presbyopic  hypermetrope   yes           normal      soft
      pre-presbyopic  hypermetrope   yes           reduced     none
      presbyopic      myope          no            normal      hard
      presbyopic      myope          no            reduced     none
      presbyopic      hypermetrope   yes           reduced     none
  • 14. Classifiers: classification procedures • A set of “if-then” rules • A decision tree • A Neural Network • A formula (e.g. “scoring model”) • A classification procedure
  • 15. Figure 1.1 Rules for the contact lens data.
      If tear production rate = reduced then recommendation = none
      If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
      If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
      If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
      If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
      If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
      If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
      If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
      If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
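The rules of Figure 1.1 can be read as an ordered rule list. The sketch below is one possible encoding, not part of the original slides: it assumes the rules are tried top to bottom and the first match wins, and the names `RULES` and `recommend` as well as the dictionary keys are illustrative choices.

```python
# Ordered (conditions, recommendation) pairs transcribed from Figure 1.1.
# Assumption: rules are evaluated top to bottom and the first match wins.
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(patient):
    """Return the recommendation of the first rule whose conditions all hold."""
    for conditions, lenses in RULES:
        if all(patient.get(key) == value for key, value in conditions.items()):
            return lenses
    return None  # no rule applies


patient = {"age": "young", "prescription": "myope",
           "astigmatic": "no", "tear": "normal"}
print(recommend(patient))
```

Keeping the rules as data rather than nested `if` statements makes it easy to compare them with rule sets produced by a learning algorithm.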
  • 16. Figure 1.2 Decision tree for the contact lens data.
  • 17. CPU performance: regression problem Computer’s CPU performance (PRP) depends on a number of factors: - cycle time (MYCT) - main memory (MMIN, MMAX) - cache (CACH) - number of channels (CHMIN, CHMAX) Problem: express PRP as a function of all these factors.
  • 18. PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX Figure 3.6(a) Models for the CPU performance data: linear regression.
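As a small illustration, the linear model of Figure 3.6(a) can be evaluated directly. The coefficients below are copied from the slide; the function name `predict_prp` and the input values in the usage example are hypothetical and serve only to show the arithmetic.

```python
# Coefficients of the linear regression model from Figure 3.6(a).
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Weighted sum of the six attributes plus the intercept."""
    return INTERCEPT + sum(COEFFICIENTS[name] * machine[name]
                           for name in COEFFICIENTS)


# Hypothetical configuration, chosen only for illustration.
machine = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(machine), 3))
```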
  • 19. Figure 3.6(b) Models for the CPU performance data: regression tree.
  • 20. Figure 3.6(c) Models for the CPU performance data: model tree.
  • 21. Association Rules A shop sells products a, b, …, z. Clients buy them in collections, e.g., {a, c}, {c, d, z}, … Each set is called a “transaction” or an “item set”. What are the most frequent item sets? What are the most significant “association rules”, e.g., {c, g} ==> {z}?
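To make the question "what are the most frequent item sets?" concrete, here is a brute-force sketch that counts every item set of a given size. It is not one of the efficient algorithms used in practice; the function name `frequent_item_sets`, its parameters, and the sample transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_item_sets(transactions, size, min_count):
    """Count all item sets of `size` items and keep those seen at least `min_count` times."""
    counts = Counter()
    for transaction in transactions:
        # Sorting gives every item set one canonical tuple representation.
        for item_set in combinations(sorted(transaction), size):
            counts[item_set] += 1
    return {item_set: n for item_set, n in counts.items() if n >= min_count}


# Hypothetical transactions in the style of the slide.
transactions = [{"a", "c"}, {"c", "d", "z"}, {"c", "g", "z"}, {"a", "c", "g", "z"}]
print(frequent_item_sets(transactions, size=2, min_count=2))
```

Enumerating all subsets of every transaction grows very quickly with the item set size, which is why this approach only works for toy data.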
  • 22. Association Rules II Rule Significance is measured in terms of: - support (percentage of transactions that match LHS) - confidence (accuracy of the rule) Problems: • combinatorial explosion of item sets • huge number of rules • two conflicting performance measures (we want rules to have big support and high accuracy) There are efficient algorithms for finding rules !!!
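The two measures can be computed with a few lines of code. The sketch below follows the definitions on this slide: support is the fraction of transactions that contain the left-hand side, and confidence is the fraction of those that also contain the right-hand side. Note that many texts instead define support over the left-hand and right-hand sides together. The function names and sample transactions are hypothetical.

```python
def support(transactions, lhs):
    """Fraction of transactions that contain every item of the left-hand side."""
    return sum(1 for t in transactions if lhs <= t) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions matching the left-hand side, the fraction that also contain the right-hand side."""
    matching = [t for t in transactions if lhs <= t]
    if not matching:
        return 0.0  # the rule never fires, so there is nothing to measure
    return sum(1 for t in matching if rhs <= t) / len(matching)


# Hypothetical transactions; the rule tested is {c, g} ==> {z}.
transactions = [{"a", "c"}, {"c", "d", "z"}, {"c", "g", "z"},
                {"c", "g"}, {"a", "c", "g", "z"}]
print(support(transactions, {"c", "g"}))
print(confidence(transactions, {"c", "g"}, {"z"}))
```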
  • 23. Interdependencies: Link Analysis What influences what and to what extent? [Graph over the nodes s, x, a, h, r, d] s=smoker, x=sex, a=age, h=health, r=resistance, d=life/death • Bayesian networks: graphical models of knowledge • Networks constructed from data and knowledge !!!
  • 24. Putting similar things together: Clustering Example: Credit card users might be clustered according to the way they use their cards: • frequent/infrequent usage • domestic/foreign transactions • high/low amounts of money • transactions of specific type • … Then a separate fraud detection system may be developed for every group. Or various products might be offered…
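The slide does not name a clustering algorithm. As one common choice, the sketch below uses plain k-means on two made-up features per card user (transactions per month, average amount). The function name `kmeans`, the data, and the starting centroids are all assumptions for illustration, and no feature scaling is applied.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            # Index of the nearest centroid by squared Euclidean distance.
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
            )
            clusters[nearest].append(point)
        # New centroid = mean of the members; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(column) / len(members) for column in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical users: (transactions per month, average amount).
users = [(2, 20.0), (3, 25.0), (4, 30.0), (40, 200.0), (42, 210.0), (44, 190.0)]
centroids, clusters = kmeans(users, centroids=[(2, 20.0), (40, 200.0)])
print(centroids)
```

In practice the features would usually be rescaled first, since the attribute with the largest range otherwise dominates the distance.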
  • 25. Characteristics of the data: • Huge quantities • Redundancy • Irrelevancy • Bad quality: missing values, incompleteness, inconsistency, errors, outdated, outliers • High dimensionality • Unstructured (e.g. textual)
  • 26. Data Mining Cycle • Problem understanding and formulating • Identification of relevant data • Data gathering • Data cleaning • Data preprocessing • Model building • Model analysis • Model implementation • Model maintenance
  • 27. Accents 1) Algorithms & Techniques 2) Technical skills (AWK, Matlab, Weka) 3) Performance Challenge 4) Applications 5) Recent Developments (text mining, web mining, mining data streams, etc.)
  • 28. Data Preprocessing • exploratory data analysis • discretization and grouping of values • reduction of dimensionality • feature extraction • treatment of missing values and outliers • sampling
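Two of the preprocessing steps listed above, treatment of missing values and discretization, can be sketched in a few lines. Mean imputation and equal-width binning are only one possible choice for each step; the function names and the sample ages are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to the index of one of `n_bins` equally wide intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would fall just outside the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


ages = [20.0, None, 40.0, 60.0]  # hypothetical attribute with one missing value
filled = impute_mean(ages)
print(filled)
print(equal_width_bins(filled, n_bins=2))
```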
  • 29. Model Building • Rule Induction • Decision Trees • Bayesian Classifiers • Regression Trees • Association Rules • Instance-based learning • Clustering Algorithms • Combining models: Bagging, Boosting, Stacking, etc.
  • 30. To remember: • There are various definitions of “Data Mining” • Most common tasks of Data Mining are: Classification, Regression/numerical prediction, Discovery of Associations, Clustering • The road “from data to results” involves many steps • The course covers 3 aspects of DM: data preprocessing, model building, model evaluation