Data mining & Decison Trees


This presentation gives a brief overview of data mining and of several decision tree (DT) induction algorithms.


  1. Presented and contributed by: Ahmet Selman Bozkır, Hacettepe University, Ph.D. student (November 29, 2011)
  2. • What is data mining?
     • Motivation: why data mining?
     • Classification of data mining systems
     • Architecture: typical data mining system
     • Data mining functionality
  3. • Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
     • Data mining: a misnomer? Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
     • Watch out: is everything “data mining”? Not (deductive) query processing, and not expert systems or small ML/statistical programs
  4. • The data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data accumulated in, and/or to be analyzed in, databases, data warehouses, and other information repositories
     • We are drowning in data, but starving for knowledge!
     • Solution: data warehousing and data mining
       ▪ Data warehousing and on-line analytical processing
       ▪ Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  5. • Data analysis and decision support
       ▪ Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
       ▪ Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
       ▪ Fraud detection and detection of unusual patterns (outliers)
     • Other applications
       ▪ Text mining (newsgroups, email, documents) and Web mining
       ▪ Bioinformatics and bio-data analysis
  6. • Target marketing
       ▪ Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
       ▪ Determine customer purchasing patterns over time
     • Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
     • Customer profiling: what types of customers buy what products (clustering or classification)
     • Customer requirement analysis
       ▪ Identify the best products for different groups of customers
       ▪ Predict what factors will attract new customers
  7. • Finance planning and asset evaluation
       ▪ Cash flow analysis and prediction
       ▪ Cross-sectional and time-series analysis (financial-ratio analysis, trend analysis, etc.)
     • Resource planning: summarize and compare the resources and spending
     • Competition
       ▪ Monitor competitors and market directions
       ▪ Group customers into classes and apply a class-based pricing procedure
       ▪ Set pricing strategy in a highly competitive market
  8. • Approaches: clustering and model construction for fraud, outlier analysis
     • Applications: health care, retail, credit card services, telecommunications
       ▪ Auto insurance: rings of collisions
       ▪ Money laundering: suspicious monetary transactions
       ▪ Medical insurance: professional patients, rings of doctors, and rings of references; unnecessary or correlated screening tests
       ▪ Telecommunications: phone-call fraud; build a phone-call model (destination of the call, duration, time of day or week) and analyze patterns that deviate from the expected norm
       ▪ Retail industry: analysts estimate that 38% of retail shrink is due to dishonest employees
       ▪ Anti-terrorism
  9. [Figure: the knowledge discovery process: databases → data cleaning and data integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation. Data mining is the core of the knowledge discovery process.]
  10. [Figure: layers of the data-mining environment, with increasing potential to support business decisions from bottom to top: data sources (paper, files, information providers, database systems, OLTP), handled by the DBA; data warehouses / data marts (OLAP, MDA); data exploration (statistical analysis, querying and reporting), by the data analyst; data mining (information discovery); data presentation (visualization techniques), by the business analyst; and making decisions, by the end user.]
  11. • Learning the application domain: relevant prior knowledge and goals of the application
     • Creating a target data set: data selection
     • Data cleaning and preprocessing (may take 70% of the effort!)
     • Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
     • Choosing the functions of data mining: summarization, classification, regression, association, clustering
     • Choosing the mining algorithm(s)
     • Data mining: search for patterns of interest
     • Pattern evaluation and knowledge presentation: visualization, transformation, removal of redundant patterns, etc.
     • Use of discovered knowledge
  12. [Figure: architecture of a typical data mining system: graphical user interface, pattern evaluation, data mining engine, knowledge base, database or data warehouse server, data cleaning & filtering and data integration, on top of databases and a data warehouse.]
  13. • General functionality
       ▪ Descriptive data mining
       ▪ Predictive data mining
     • Different views, different classifications
       ▪ Kinds of databases to be mined
       ▪ Kinds of knowledge to be discovered
       ▪ Kinds of techniques utilized
       ▪ Kinds of applications adapted
  14. • Concept description: characterization and discrimination
       ▪ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
     • Association (correlation and causality)
       ▪ Diaper → Beer [0.5%, 75%], i.e., support and confidence (a small sketch follows below)
     • Classification and prediction
       ▪ Construct models (functions) that describe and distinguish classes or concepts for future prediction, e.g., classify countries based on climate, or classify cars based on gas mileage
       ▪ Presentation: decision tree, classification rules, neural network
       ▪ Predict some unknown or missing numerical values
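The pair of numbers attached to the Diaper → Beer rule above are its support and confidence. As a rough sketch (not part of the original slides; the transactions below are made-up illustration data), they can be computed like this:

```python
# Illustrative sketch: support and confidence of an association rule X -> Y.
# The transaction data are invented for this example.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "bread"},
    {"milk", "bread"},
]

def support(itemset, data):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in data if itemset <= t) / len(data)

def confidence(antecedent, consequent, data):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, data) / support(antecedent, data)

print(support({"diaper", "beer"}, transactions))       # rule support
print(confidence({"diaper"}, {"beer"}, transactions))  # rule confidence
```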
  15. • Cluster analysis
       ▪ Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
       ▪ Maximize intra-class similarity and minimize inter-class similarity
     • Outlier analysis
       ▪ Outlier: a data object that does not comply with the general behavior of the data
       ▪ Noise or exception? No! Useful in fraud detection and rare-events analysis
  16. • Data mining: discovering interesting patterns from large amounts of data
     • A natural evolution of database technology, in great demand, with wide applications
     • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
     • Mining can be performed in a variety of information repositories
     • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
     • Data mining systems and architectures
     • Major issues in data mining
  17. • R. Agrawal, J. Han, and H. Mannila, Readings in Data Mining: A Database Perspective, Morgan Kaufmann (in preparation)
     • J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001
  18. Thank you!
  19. • A decision tree (DT) is a hierarchical classification and prediction model
     • It is organized as a rooted tree with two types of nodes: decision (internal) nodes and terminal (leaf) nodes
     • It is a supervised data mining model used for classification or prediction
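As a minimal sketch of what such a supervised model looks like in practice (using scikit-learn and its bundled iris data purely for illustration; the library is an assumption, not something named in the slides):

```python
# Minimal sketch: fitting and querying a decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)               # supervised learning from labelled examples

print(clf.predict(X[:5]))   # predicted class for the first five instances
```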
  21. Chance and terminal nodes
     • Each internal node of a DT is a decision point where some condition is tested
     • The result of this condition determines which branch of the tree is taken next
     • Such nodes are therefore called decision nodes, chance nodes, or non-terminal nodes
     • Chance nodes partition the available data at that point so as to maximize differences in the dependent variable
  22. Terminal nodes
     • The leaf nodes of a DT are called terminal nodes
     • They indicate the class into which a data instance will be classified
     • They have just one incoming branch
     • They have no child nodes (no outgoing branches)
     • No conditions are tested at terminal nodes
     • Traversing the tree from the root to a leaf produces the production rule for that class
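To make the root-to-leaf-path-as-production-rule idea concrete, here is a small sketch (again assuming scikit-learn) that prints a fitted tree as nested conditions; each printed path from the root to a leaf corresponds to one classification rule:

```python
# Sketch: reading production rules off a fitted decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each root-to-leaf path in this printout is one production rule.
print(export_text(clf, feature_names=list(iris.feature_names)))
```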
  24. Advantages of DTs
     • Easy to understand and interpret
     • Work for categorical and continuous data
     • High classification performance (generally)
     • A DT can grow to any depth
     • On-the-fly prediction
     • Pruning a DT is very easy
     • Work with missing or null values
  25. Advantages (contd.)
     • Can be used to identify outliers
     • Production rules can be obtained directly from the built DT
     • Relatively faster than other classification models
     • A DT can be used even when domain experts are absent
     • Provide a clear indication of which fields are important for prediction and classification
  26. Disadvantages
     • Class-overlap problem (due to the curse of dimensionality)
     • Complex production rules
     • A DT can be sub-optimal (for this reason ensemble methods were developed)
     • Some decision tree algorithms can deal only with binary-valued attributes
  28. • Training set: used to derive the classifier (generally 70%-80% of the data)
     • Test set: used to measure accuracy (generally 20%-30% of the data)
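A small sketch of the 70/30 split described above, assuming scikit-learn's helper functions (any equivalent splitting mechanism would do):

```python
# Sketch: 70% training / 30% test split, then accuracy on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)        # 70% train, 30% test

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))   # accuracy on unseen data
```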
  29. • Construction phase: the initial decision tree is built in this phase
       ▪ Q: How are nodes split? A: Different algorithms use different approaches
     • Pruning phase: lower branches are removed in this stage to improve performance
       ▪ Q: Why? A: To avoid overfitting/overtraining
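One way to see the two phases in code is the sketch below, which assumes scikit-learn; its post-pruning is cost-complexity pruning rather than the specific procedures discussed in these slides, so treat it only as an illustration of grow-then-prune:

```python
# Sketch: construction phase (grow a full tree), then pruning phase
# (remove lower branches via cost-complexity pruning, ccp_alpha).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)   # construction phase
path = full_tree.cost_complexity_pruning_path(X, y)            # candidate alphas

# Larger ccp_alpha prunes more aggressively; the largest alpha collapses
# the tree to its root, so the second largest is used here for illustration.
pruned_tree = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)   # pruning phase

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)
```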
  30. • ID3 (available everywhere)
     • C4.5 / C5.0 (Weka, SPSS Clementine)
     • CART (SPSS Clementine)
     • CHAID (SPSS Clementine, etc.)
     • Microsoft Decision Trees (MS Analysis Services)
     • Random Forests (Statistica)
  31. ID3 induction algorithm
     • ID3 (Iterative Dichotomiser)
     • Introduced in 1986 by Quinlan
     • Designed for classification only
     • Works on categorical attributes only
     • Uses the entropy measure as its splitting criterion
     • No missing-value handling
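A short sketch of the entropy criterion ID3 uses; the class counts below are made up, and the information gain of a candidate split is the parent node's entropy minus the size-weighted entropy of its children:

```python
# Sketch: entropy and information gain, the splitting criterion used by ID3.
import numpy as np

def entropy(class_counts):
    """Shannon entropy of a node, given the count of each class in it."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                     # 0 * log(0) is treated as 0
    return -(p * np.log2(p)).sum()

# Made-up example: a parent node with 9 positive / 5 negative instances,
# split into two children of sizes 8 and 6.
parent = entropy([9, 5])
children = [([6, 2], 8), ([3, 3], 6)]             # (class counts, node size)
weighted = sum(n / 14 * entropy(c) for c, n in children)

print(parent - weighted)                           # information gain of the split
```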
  32. C4.5 induction algorithm
     • Introduced by Quinlan in 1993
     • An extension of the ID3 algorithm
     • Designed for classification only
     • Numerical attributes can be input
     • Uses the entropy measure as its splitting criterion
     • Uses multi-way splits
     • Missing-value handling is provided
     • Tree pruning is also provided
  33. CART (Classification and Regression Trees)
     • Introduced by Breiman et al. in 1984
     • Uses the binary recursive partitioning method
     • Designed for both classification and regression
     • Works on both categorical and numerical attributes
     • Uses the Gini measure as its splitting criterion
     • Uses two-way (binary) splits
     • Missing-value handling is provided
     • Tree pruning is also provided
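For contrast with entropy, a sketch of the Gini impurity CART uses to score candidate two-way splits (the class counts are again made up):

```python
# Sketch: Gini impurity, the splitting measure used by CART.
import numpy as np

def gini(class_counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - (p ** 2).sum()

# Made-up binary split of a node holding 9 positive / 5 negative instances.
left, right = [7, 1], [2, 4]
weighted = 8 / 14 * gini(left) + 6 / 14 * gini(right)

print(gini([9, 5]) - weighted)   # impurity reduction achieved by the split
```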
  34. CHAID (Chi-squared Automatic Interaction Detection)
     • Introduced by Kass in 1980
     • Designed for both classification and regression
     • Works on both categorical and numerical attributes
     • Uses Karl Pearson's chi-squared (χ²) test as its splitting criterion
     • Uses multi-way splits
     • Missing-value handling is provided
     • Avoids tree pruning
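A sketch of the kind of chi-squared test CHAID applies when judging whether a candidate split is statistically significant; the contingency table of predictor category versus target class is invented, and SciPy is assumed only for illustration:

```python
# Sketch: Pearson chi-squared test on a predictor-vs-class contingency table,
# the kind of significance test CHAID uses when choosing and merging splits.
from scipy.stats import chi2_contingency

# Made-up counts: rows are categories of one candidate predictor,
# columns are the target classes.
table = [[30, 10],
         [12, 28],
         [20, 20]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the split is worth making
```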
  35. Microsoft Decision Trees
     • Introduced by Microsoft in 1999
     • Designed for both classification and regression
     • Works on both categorical and numerical attributes
     • Offers entropy, Bayesian K2, and Bayesian Dirichlet Equivalent with Uniform prior (BDEU) as splitting-criterion choices
     • Uses multi-way splits and supports binary splitting
     • Missing-value handling is provided
     • Avoids tree pruning
  36. • Overfitting: an induced tree may overfit the training data
       ▪ Too many branches, some of which may reflect anomalies due to noise or outliers
       ▪ Poor accuracy on unseen samples
  37. Two approaches to avoid overfitting
     • Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
       ▪ It is difficult to choose an appropriate threshold
     • Postpruning: remove branches from a “fully grown” tree to obtain a sequence of progressively pruned trees
       ▪ Use a set of data different from the training data to decide which is the “best pruned tree”
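A sketch of prepruning in practice, assuming scikit-learn and illustrative threshold values: growth of a branch is halted when a split would not improve the impurity measure by at least the chosen threshold, or when other limits such as depth are reached.

```python
# Sketch: prepruning by halting tree growth early with threshold parameters
# (values are illustrative; in practice tune them on validation data).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

prepruned = DecisionTreeClassifier(
    max_depth=3,                  # hard limit on tree depth
    min_samples_leaf=5,           # do not create very small leaves
    min_impurity_decrease=0.01,   # skip splits below this goodness threshold
    random_state=0,
).fit(X, y)

print(prepruned.get_depth(), prepruned.get_n_leaves())
```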
  38. [Plot: training error and validation error over time, illustrating overfitting]
