
# Decision Tree Construction

### Decision Tree Construction

1. The Software Infrastructure for Electronic Commerce
    - Databases and Data Mining
    - Lecture 4: An Introduction to Data Mining (II)
    - Johannes Gehrke, http://www.cs.cornell.edu/johannes
2. Lectures Three and Four
    - Data preprocessing
    - Multidimensional data analysis
    - Data mining
        - Association rules
        - Classification trees
        - Clustering
3. Types of Attributes
    - Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
    - Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
    - Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
4. Classification
    - Goal: Learn a function that assigns a record to one of several predefined classes.
5. Classification Example
    - Example training database
        - Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
        - Age is ordered; Car-type is a categorical attribute
        - Class label indicates whether the person bought the product
        - Dependent attribute is categorical
6. Regression Example
    - Example training database
        - Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
        - Spent indicates how much the person spent during a recent visit to the web site
        - Dependent attribute is numerical
7. Types of Variables (Review)
    - Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
    - Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
    - Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
8. Definitions
    - Random variables $X_1, \dots, X_k$ (predictor variables) and $Y$ (dependent variable)
    - $X_i$ has domain $\mathrm{dom}(X_i)$; $Y$ has domain $\mathrm{dom}(Y)$
    - $P$ is a probability distribution on $\mathrm{dom}(X_1) \times \dots \times \mathrm{dom}(X_k) \times \mathrm{dom}(Y)$
    - Training database $D$ is a random sample from $P$
    - A predictor $d$ is a function $d: \mathrm{dom}(X_1) \times \dots \times \mathrm{dom}(X_k) \to \mathrm{dom}(Y)$
9. Classification Problem
    - If $Y$ is categorical, the problem is a classification problem, and we use $C$ instead of $Y$, with $|\mathrm{dom}(C)| = J$.
    - $C$ is called the class label; $d$ is called a classifier.
    - Let $r$ be a record randomly drawn from $P$. Define the misclassification rate of $d$: $RT(d,P) = P(d(r.X_1, \dots, r.X_k) \ne r.C)$
    - Problem definition: Given a dataset $D$ that is a random sample from probability distribution $P$, find a classifier $d$ such that $RT(d,P)$ is minimized.
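
The misclassification rate RT(d,P) is defined over the unknown distribution P, so in practice it can only be estimated on a sample. A minimal Python sketch of that empirical estimate, assuming a classifier is any function of the predictor values; the records and the `always_yes` classifier are made up for illustration:

```python
def misclassification_rate(classifier, records):
    """Fraction of records whose predicted class differs from the true class."""
    errors = sum(
        1 for predictors, label in records if classifier(*predictors) != label
    )
    return errors / len(records)


# Hypothetical sample of ((age, car_type), class_label) records.
labeled_sample = [
    ((25, "Minivan"), "YES"),
    ((22, "Sports"), "NO"),
    ((40, "Truck"), "YES"),
    ((35, "Sports"), "NO"),
]


def always_yes(age, car_type):
    """Trivial classifier used only to exercise the estimate."""
    return "YES"


rate = misclassification_rate(always_yes, labeled_sample)
```

Estimating the rate on the same records used to build the classifier tends to be optimistic, which is why held-out data is normally used.
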
10. Regression Problem
    - If $Y$ is numerical, the problem is a regression problem.
    - $Y$ is called the dependent variable; $d$ is called a regression function.
    - Let $r$ be a record randomly drawn from $P$. Define the mean squared error of $d$: $RT(d,P) = E\big[(r.Y - d(r.X_1, \dots, r.X_k))^2\big]$
    - Problem definition: Given a dataset $D$ that is a random sample from probability distribution $P$, find a regression function $d$ such that $RT(d,P)$ is minimized.
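
As with classification, the expectation over P is not available directly, so the mean squared error is usually approximated by averaging over a sample. A small sketch under that assumption; the records and the constant regression function are hypothetical:

```python
def mean_squared_error(regressor, records):
    """Average squared difference between the true and predicted values."""
    total = sum((y - regressor(*predictors)) ** 2 for predictors, y in records)
    return total / len(records)


# Hypothetical sample of ((age, car_type), amount_spent) records.
spend_sample = [
    ((25, "Minivan"), 100.0),
    ((40, "Truck"), 200.0),
    ((35, "Sports"), 300.0),
]


def constant_200(age, car_type):
    """Piecewise constant model with a single piece."""
    return 200.0


mse = mean_squared_error(constant_200, spend_sample)
```
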
11. Goals and Requirements
    - Goals:
        - To produce an accurate classifier/regression function
        - To understand the structure of the problem
    - Requirements on the model:
        - High accuracy
        - Understandable by humans, interpretable
        - Fast construction for very large training databases
12. Different Types of Classifiers
    - Linear discriminant analysis (LDA)
    - Quadratic discriminant analysis (QDA)
    - Density estimation methods
    - Nearest neighbor methods
    - Logistic regression
    - Neural networks
    - Fuzzy set theory
    - Decision trees
13. Difficulties with LDA and QDA
    - Multivariate normal assumption often not true
    - Not designed for categorical variables
    - Form of classifier in terms of linear or quadratic discriminant functions is hard to interpret
14. Histogram Density Estimation
    - Curse of dimensionality
    - Cell boundaries are discontinuities. Beyond boundary cells, the estimate falls abruptly to zero.
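
Both difficulties are easy to see in a one-dimensional sketch. The code below is an illustrative histogram estimator with equal-width cells (the data and the cell layout are assumptions): the estimate jumps at cell boundaries and is exactly zero outside the covered range. With b cells per dimension and k predictors, the number of cells grows as b^k, which is the curse of dimensionality.

```python
def histogram_density(data, low, high, bins):
    """Build a 1-D histogram density estimate with equal-width cells on [low, high)."""
    width = (high - low) / bins
    counts = [0] * bins
    for x in data:
        if low <= x < high:
            counts[int((x - low) / width)] += 1
    n = len(data)

    def density(x):
        if not (low <= x < high):
            return 0.0  # beyond the boundary cells the estimate drops to zero
        return counts[int((x - low) / width)] / (n * width)

    return density


# Hypothetical ages; four cells of width 10 covering [20, 60).
ages = [21, 23, 25, 34, 36, 52]
f = histogram_density(ages, low=20, high=60, bins=4)
```

Here `f(29.9)` and `f(30.0)` differ even though the two points are almost identical, because they fall on opposite sides of a cell boundary.
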
15. Kernel Density Estimation
    - How to choose kernel bandwidth h?
        - The optimal h depends on a criterion
        - The optimal h depends on the form of the kernel
        - The optimal h might depend on the class label
        - The optimal h might depend on the part of the predictor space
    - How to choose form of the kernel?
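
To make the role of h concrete, here is a minimal one-dimensional kernel density estimator. The Gaussian kernel, the data, and the two bandwidth values are illustrative assumptions, not choices made in the lecture:

```python
import math


def gaussian_kde(data, h):
    """Return a density estimate using a Gaussian kernel with bandwidth h."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return density


# Hypothetical ages; the same data smoothed with two different bandwidths.
ages = [21, 23, 25, 34, 36, 52]
narrow = gaussian_kde(ages, h=1.0)   # spiky: mass concentrated at the data points
wide = gaussian_kde(ages, h=10.0)    # smooth: mass spread across the gaps
```

A small h puts almost no density in the gap around age 45, while a large h fills it in, so the resulting classifier can change noticeably with the bandwidth.
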
16. K-Nearest Neighbor Methods
    - Difficulties:
        - Data must be stored; for classification of a new record, all data must be available
        - Computationally expensive in high dimensions
        - Choice of k is unknown
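
A brute-force sketch shows where these difficulties come from: every prediction scans the full training set, and k must be supplied by the user. The records (age, income in thousands) and the Euclidean distance are assumptions made for illustration:

```python
import math
from collections import Counter


def knn_classify(training, query, k):
    """Label a query point by majority vote among its k nearest training records."""
    # Every stored record is compared with the query, so all data must be kept.
    neighbors = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical numerical predictors: (age, income in thousands).
training = [
    ((25.0, 30.0), "YES"),
    ((27.0, 32.0), "YES"),
    ((45.0, 80.0), "NO"),
    ((50.0, 90.0), "NO"),
    ((48.0, 85.0), "NO"),
]

prediction = knn_classify(training, (26.0, 31.0), k=3)
```
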
17. Difficulties with Logistic Regression
    - Few goodness of fit and model selection techniques
    - Categorical predictor variables have to be transformed into dummy vectors.
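
The dummy-vector transformation can be sketched in a few lines. Dropping the first category as a baseline is one common convention, assumed here; the helper name `dummy_encode` is hypothetical:

```python
def dummy_encode(value, categories):
    """Encode a categorical value as a 0-1 vector, using the first category as baseline."""
    return [1 if value == category else 0 for category in categories[1:]]


car_types = ["Minivan", "Sports", "Truck"]

encoded_minivan = dummy_encode("Minivan", car_types)  # baseline: all zeros
encoded_sports = dummy_encode("Sports", car_types)
encoded_truck = dummy_encode("Truck", car_types)
```

A categorical variable with m values becomes m - 1 numerical columns, so variables with many categories inflate the model considerably.
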
18. Neural Networks and Fuzzy Set Theory
    - Difficulties:
        - Classifiers are hard to understand
        - How to choose network topology and initial weights?
        - Categorical predictor variables?
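19. What are Decision Trees?
    - [Figure: example decision tree. The root splits on Age (< 30 vs. >= 30); the Age < 30 branch splits on Car Type (Minivan vs. Sports, Truck); the three leaves are labeled YES, NO, YES. A second panel shows the same partition along an Age axis (0, 30, 60) against Car Type (Minivan; Sports, Truck).]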
20. Decision Trees
    - A decision tree T encodes d (a classifier or regression function) in the form of a tree.
    - A node t in T without children is called a leaf node. Otherwise t is called an internal node.
21. Internal Nodes
    - Each internal node has an associated splitting predicate. Most common are binary predicates. Example predicates:
        - Age <= 20
        - Profession in {student, teacher}
        - 5000*Age + 3*Salary - 10000 > 0
22. Internal Nodes: Splitting Predicates
    - Binary univariate splits:
        - Numerical or ordered $X$: $X \le c$, $c \in \mathrm{dom}(X)$
        - Categorical $X$: $X \in A$, $A \subset \mathrm{dom}(X)$
    - Binary multivariate splits:
        - Linear combination split on numerical variables: $\sum_i a_i X_i \le c$
    - $k$-ary ($k > 2$) splits are analogous
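
The three kinds of binary splitting predicate can be written as small functions that map a record to true or false. This is a sketch, assuming records are dictionaries keyed by attribute name; the factory function names are hypothetical:

```python
def numerical_split(attribute, c):
    """Univariate split on a numerical or ordered attribute: X <= c."""
    return lambda record: record[attribute] <= c


def categorical_split(attribute, subset):
    """Univariate split on a categorical attribute: X in A."""
    return lambda record: record[attribute] in subset


def linear_split(coefficients, c):
    """Multivariate split on numerical attributes: sum of a_i * X_i <= c."""
    return lambda record: (
        sum(a * record[name] for name, a in coefficients.items()) <= c
    )


record = {"Age": 20, "Profession": "student", "Salary": 3000}

young = numerical_split("Age", 20)
academic = categorical_split("Profession", {"student", "teacher"})
low_score = linear_split({"Age": 5000, "Salary": 3}, 10000)
```

For this record, `young` and `academic` are true, while `low_score` is false because 5000*20 + 3*3000 exceeds 10000.
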
23. Leaf Nodes
    - Consider leaf node $t$
    - Classification problem: Node $t$ is labeled with one class label $c \in \mathrm{dom}(C)$
    - Regression problem: Two choices
        - Piecewise constant model: $t$ is labeled with a constant $y \in \mathrm{dom}(Y)$.
        - Piecewise linear model: $t$ is labeled with a linear model $Y = y_t + \sum_i a_i X_i$
24. Example
    - Encoded classifier:
        - If (age < 30 and carType = Minivan) Then YES
        - If (age < 30 and (carType = Sports or carType = Truck)) Then NO
        - If (age >= 30) Then YES
    - [Figure: the decision tree from slide 19, with leaves YES, NO, YES.]
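
The example tree can be encoded directly as internal nodes with splitting predicates and leaves with class labels. This is an illustrative sketch, not the lecture's implementation; the class names are hypothetical, and the label of the age >= 30 leaf is taken from the tree figure, whose leaves read YES, NO, YES:

```python
class Leaf:
    """Leaf node: carries the class label returned for records that reach it."""

    def __init__(self, label):
        self.label = label


class Internal:
    """Internal node: a binary splitting predicate with one child per outcome."""

    def __init__(self, predicate, if_true, if_false):
        self.predicate = predicate
        self.if_true = if_true
        self.if_false = if_false


def classify(node, record):
    """Follow splitting predicates from the root until a leaf is reached."""
    while isinstance(node, Internal):
        node = node.if_true if node.predicate(record) else node.if_false
    return node.label


tree = Internal(
    lambda r: r["age"] < 30,
    Internal(lambda r: r["carType"] == "Minivan", Leaf("YES"), Leaf("NO")),
    Leaf("YES"),  # assumed from the figure's leaf labels (YES, NO, YES)
)

answer = classify(tree, {"age": 25, "carType": "Truck"})
```

Each root-to-leaf path corresponds to one if-then rule, which is what makes the model easy to read.
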
25. Choice of Classification Algorithm?
    - Example study: (Lim, Loh, and Shih, Machine Learning 2000)
        - 33 classification algorithms
        - 16 (small) data sets (UC Irvine ML Repository)
        - Each algorithm applied to each data set
    - Experimental measurements:
        - Classification accuracy
        - Computational speed
        - Classifier complexity
26. Classification Algorithms
    - Tree-structured classifiers:
        - IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1, LMDT, CAL5, T1
    - Statistical methods:
        - LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
    - Neural networks:
        - LVQ, RBF
27. Experimental Details
    - 16 primary data sets; 16 more data sets were created by adding noise
    - Categorical predictor variables were converted to 0-1 dummy variables if necessary
    - Error rates for 6 data sets were estimated from supplied test sets; 10-fold cross-validation was used for the other data sets
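
For readers unfamiliar with 10-fold cross-validation, the procedure can be sketched as follows. This is a generic illustration rather than the study's code: the fold assignment is a simple interleaving without shuffling, and the majority-class learner and the data are made up.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds of nearly equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validated_error(records, train, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the errors."""
    errors = 0
    for fold in k_fold_indices(len(records), k):
        held_out = set(fold)
        training = [r for i, r in enumerate(records) if i not in held_out]
        classifier = train(training)
        errors += sum(
            1 for i in fold if classifier(records[i][0]) != records[i][1]
        )
    return errors / len(records)


def train_majority(training):
    """Hypothetical learner: always predict the most frequent training label."""
    labels = [label for _, label in training]
    majority = max(set(labels), key=labels.count)
    return lambda predictors: majority


# Made-up data: 20 records, 6 labeled YES and 14 labeled NO.
records = [(i, "YES" if i % 10 < 3 else "NO") for i in range(20)]
error = cross_validated_error(records, train_majority, k=10)
```

Every record is used for testing exactly once, so the pooled error uses all of the data without testing a classifier on records it was trained on. In practice the records would be shuffled before being assigned to folds.
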
28. Ranking by Mean Error Rate

    | Rank | Algorithm | Mean Error | Time |
    |---|---|---|---|
    | 1 | Polyclass | 0.195 | 3 hours |
    | 2 | Quest Multivariate | 0.202 | 4 min |
    | 3 | Logistic Regression | 0.204 | 4 min |
    | 6 | LDA | 0.208 | 10 s |
    | 8 | IND CART | 0.215 | 47 s |
    | 12 | C4.5 Rules | 0.220 | 20 s |
    | 16 | Quest Univariate | 0.221 | 40 s |
    | … | | | |

29. Other Results
    - Number of leaves for tree-based classifiers varied widely (median number of leaves between 5 and 32, after removing some outliers)
    - Mean misclassification rates for the top 26 algorithms are not statistically significantly different; the bottom 7 algorithms have significantly higher error rates
30. Decision Trees: Summary
    - Powerful data mining model for classification (and regression) problems
    - Easy to understand and to present to non-specialists
    - Tips:
        - Even if black-box models sometimes give higher accuracy, construct a decision tree anyway
        - Construct decision trees with different splitting variables at the root of the tree
31. Clustering
    - Input: Relational database with fixed schema
    - Output: k groups of records called clusters, such that the records within a group are more similar to each other than to records in other groups
    - More difficult than classification (unsupervised learning: no record labels are given)
    - Usage:
        - Exploratory data mining
        - Preprocessing step (e.g., outlier detection)
32. Clustering (Contd.)
    - In clustering we partition a set of records into meaningful sub-classes called clusters.
    - Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.
    - Clustering helps users to detect inherent groupings and structure in a data set.
33. Clustering (Contd.)
    - Example input database: Two numerical variables
    - How many groups are here?
    - Requirements: Need to define “similarity” between records
34. Graphical Representation
35. Clustering (Contd.)
    - Output of clustering:
        - Representative points for each cluster
        - Labeling of each record with its cluster number
        - Other description of each cluster
    - Important: Use the “right” distance function
        - Scale or normalize all attributes. Example: seconds, hours, days
        - Assign different weights to attributes according to their importance
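
The effect of scaling and weighting on the distance function can be illustrated with a short sketch. Min-max scaling and a weighted Euclidean distance are assumed here as one reasonable choice among several; the records are hypothetical:

```python
import math


def min_max_scale(records):
    """Rescale every attribute to the range [0, 1]."""
    columns = list(zip(*records))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(record, lows, highs)
        )
        for record in records
    ]


def weighted_distance(a, b, weights):
    """Euclidean distance in which each attribute has its own weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Hypothetical records: (session length in seconds, account age in days).
raw = [(60.0, 1.0), (3600.0, 2.0), (120.0, 30.0)]
scaled = min_max_scale(raw)

raw_gap = weighted_distance(raw[0], raw[1], (1.0, 1.0))
scaled_gap = weighted_distance(scaled[0], scaled[1], (1.0, 1.0))
```

On the raw values, the attribute measured in seconds dominates the distance simply because of its unit; after scaling, both attributes contribute on the same footing, and the weights can then express which attribute matters more.
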
36. Clustering: Summary
    - Finding natural groups in data
    - Common post-processing steps:
        - Build a decision tree with the cluster label as class label
        - Try to explain the groups using the decision tree
        - Visualize the clusters
        - Examine the differences between the clusters with respect to the fields of the dataset
    - Try different numbers of clusters
37. Web Usage Mining
    - Data sources:
        - Web server log
        - Information about the web site:
            - Site graph
            - Metadata about each page (type, objects shown)
            - Object concept hierarchies
    - Preprocessing:
        - Detect session and user context (cookies, user authentication, personalization)
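
Session detection is often done with a simple inactivity timeout. The sketch below assumes each log entry already carries a user identifier (for example from a cookie or a login) and a timestamp in seconds; the 30-minute timeout and the entry layout are assumptions, not part of the lecture:

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)


def sessionize(log_entries, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, url) entries into per-user lists of visited URLs."""
    by_user = defaultdict(list)
    for user_id, timestamp, url in log_entries:
        by_user[user_id].append((timestamp, url))

    sessions = []
    for user_id, hits in by_user.items():
        hits.sort()
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((user_id, [url for _, url in current]))
                current = []
            current.append(hit)
        sessions.append((user_id, [url for _, url in current]))
    return sessions


# Hypothetical web server log entries.
log = [
    ("u1", 0, "/home"),
    ("u1", 120, "/cart"),
    ("u2", 200, "/home"),
    ("u1", 4000, "/home"),
]
sessions = sessionize(log)
```

The resulting sessions are the unit on which association rules or sequential patterns would then be mined.
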
38. Web Usage Mining (Contd.)
    - Data mining
        - Association rules
        - Sequential patterns
        - Classification
    - Action
        - Personalized pages
        - Cross-selling
    - Evaluation and measurement
        - Deploy personalized pages selectively
        - Measure effectiveness of each implemented action
39. Large Case Study: Churn
    - Telecommunications industry
    - Try to predict churn (whether a customer will switch long-distance carrier)
    - Dataset:
        - 5000 records (tiny dataset, but manageable here in class)
        - 21 attributes, both numerical and categorical (very few attributes)
        - Data is already cleaned! No missing values, inconsistencies, etc. (again, for classroom purposes)
40. Churn Example: Dataset Columns
    - State
    - Account length: Number of months the customer has been with the company
    - Area code
    - Phone number
    - International plan: yes/no
    - Voice mail: yes/no
    - Number of voice: Average number of voice messages per day
    - Total (day, evening, night, international) minutes: Average number of minutes charged
    - Total (day, evening, night, international) calls: Average number of calls made
    - Total (day, evening, night, international) charge: Average amount charged per day
    - Number customer service calls: Number of calls made to customer support in the last six months
    - Churned: Did the customer switch long-distance carriers in the last six months?
41. Churn Example: Analysis
    - We start out by getting familiar with the dataset
        - Record viewer
        - Statistics visualization
        - Evidence classifier
        - Visualizing joint distributions
        - Visualizing geographic distribution of churn
42. Churn Example: Analysis (Contd.)
    - Building and interpreting data mining models
        - Decision trees
        - Clustering
43. Evaluating Data Mining Tools
44. Evaluating Data Mining Tools
    - Checklist:
        - Integration with current applications and your data management infrastructure
        - Ease of use
        - Automation
        - Scalability to large datasets
            - Number of records
            - Number of attributes
            - Datasets larger than main memory
            - Support of sampling
        - Export of models into your enterprise
        - Stability of the company that offers the product
45. Integration With Data Management
    - Proprietary storage format?
    - Native support of major database systems:
        - IBM DB2, Informix, Oracle, SQL Server, Sybase
        - ODBC
        - Support of parallel database systems
    - Integration with your data warehouse
46. Cost Considerations
    - Proprietary or commodity hardware and operating system
        - Client and server might be different
        - What server platforms are supported?
    - Support staff needed
    - Training of your staff members
        - Online training, tutorials
        - On-site training
        - Books, course material
47. Data Mining Projects
    - Checklist:
        - Start with well-defined business questions
        - Have a champion within the company
        - Define measures of success and failure
    - Main difficulty: No automation
        - Understanding the business problem
        - Selecting the relevant data
        - Data transformation
        - Selection of the right mining methods
        - Interpretation
48. Understand the Business Problem
    - Important questions:
        - What is the problem that we need to solve?
        - Are there certain aspects of the problem that are especially interesting?
        - Do we need data mining to solve the problem?
        - What information is actionable, and when?
        - Are there important business rules that constrain our solution?
        - Which people should we keep in the loop, and with whom should we discuss intermediate results?
        - Who are the (internal) customers of the effort?
49. Hiring Outside Experts?
    - Factors:
        - One-time problem versus ongoing process
        - Source of data
        - Deployment of data mining models
        - Availability and skills of your own staff
50. Hiring Experts
    - Types of experts:
        - Your software vendor
        - Consulting companies/centers/individuals
    - Your goal: Develop in-house expertise
51. The Data Mining Market
    - Revenues for the data mining market: \$8 billion (Mega Group 1/1999)
    - Sales of data mining software (Two Crows Corporation 6/99)
        - 1998: \$50 million
        - 1999: \$75 million
        - 2000: \$120 million
    - Hardware companies often use their data mining software as loss-leaders (Examples: IBM, SGI)
52. Knowledge Management in General
    - Percent of information technology executives citing the systems used in their knowledge management strategy (IW 4/1999)
        - Relational Database: 95%
        - Text/Document Search: 80%
        - Groupware: 71%
        - Data Warehouse: 65%
        - Data Mining Tools: 58%
        - Expert Database/AI Tools: 25%
53. Crossing the Chasm
    - Data mining is currently trying to cross this chasm.
    - Great opportunities, but also great perils.
        - You have a unique advantage by applying data mining “the right way”.
        - It is not yet common knowledge how to apply data mining “the right way”.
        - No major cooking recipes to make a data mining project work (yet).
54. Summary
    - Database and data mining technology is crucial for any enterprise
    - We talked about the complete data management infrastructure
        - DBMS technology
        - Querying
        - WWW/DBMS integration
        - Data warehousing and dimensional modeling
        - OLAP
        - Data mining
55. Additional Material: Web Sites
    - Data mining companies, jobs, courses, publications, datasets, etc.: www.kdnuggets.com
    - ACM Special Interest Group on Knowledge Discovery and Data Mining: www.acm.org/sigkdd
56. Additional Material: Books
    - U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
    - Michael Berry & Gordon Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, 1997.
    - Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Oct 1999.
    - Michael Berry & Gordon Linoff, Mastering Data Mining, John Wiley & Sons, 2000.
57. Additional Material: Database Systems
    - IBM DB2: www.ibm.com/software/data/db2
    - Oracle: www.oracle.com
    - Sybase: www.sybase.com
    - Informix: www.informix.com
    - Microsoft: www.microsoft.com/sql
    - NCR Teradata: www.ncr.com/product/teradata
58. Questions?
    - “Prediction is very difficult, especially about the future.”
    - Niels Bohr