A General Framework for Accurate and Fast Regression by Data Summarization in Random Decision Trees

by Yao Wu on Mar 08, 2011

Presentation Transcript

• A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees. Wei Fan, IBM T.J. Watson; Joe McCloskey, US Department of Defense; Philip Yu, IBM T.J. Watson
• Three DM Problems
• Classification:
• Label: given set of labels in training data.
• Probability Estimation:
• Similar to the above setting: estimate the probability that x is an example of class y.
• Difference: no ground truth is given, i.e., the training data contains no true probabilities.
• Regression:
• Target value: continuous values.
• Model Approximation
• True model or correct model.
• Generates y for each x with probability P(y|x).
• Normally never known in reality.
• Perfect model: never makes mistakes, i.e., always makes the same prediction as the true model.
• Not always possible due to:
• Stochastic nature of the problem
• Noise in training data
• Data is insufficient
• Optimal Model
• Loss function L(t,y) to evaluate performance.
• The optimal decision y* is the label that minimizes the expected loss when x is sampled repeatedly.
• Examples
• 0-1 loss: y* is the label that appears the most often, i.e., if P(fraud|x) > 0.5, predict fraud
• cost-sensitive loss: the label that minimizes the “empirical risk”.
• If P(fraud|x) * \$1000 > \$90, i.e., P(fraud|x) > 0.09, predict fraud
• MSE or mean square error: predict average
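The three examples above can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names, the probability values, and the shape of the cost function (a fixed \$90 overhead whenever fraud is predicted, a \$1000 loss when a fraud is missed) are hypothetical choices made to match the numbers on the slide, not something the presentation specifies.

```python
def expected_loss(decision, probs, loss):
    """Expected loss of `decision` when the true value t occurs with probs[t]."""
    return sum(p * loss(t, decision) for t, p in probs.items())


def optimal_decision(candidates, probs, loss):
    """Return the candidate with the smallest expected loss."""
    return min(candidates, key=lambda y: expected_loss(y, probs, loss))


def zero_one(t, y):
    """0-1 loss: one unit of loss for any wrong label."""
    return 0.0 if t == y else 1.0


# Assumed cost model: investigating costs a flat overhead, a missed fraud
# costs the full transaction amount.
AMOUNT, OVERHEAD = 1000.0, 90.0


def fraud_cost(t, y):
    if y == "fraud":
        return OVERHEAD
    return AMOUNT if t == "fraud" else 0.0


def squared(t, y):
    """Squared error for regression targets."""
    return (t - y) ** 2


labels = ["fraud", "normal"]
probs = {"fraud": 0.10, "normal": 0.90}  # hypothetical P(y|x)

by_zero_one = optimal_decision(labels, probs, zero_one)
by_cost = optimal_decision(labels, probs, fraud_cost)

# Regression: hypothetical distribution of the target value for one x.
target_probs = {100.0: 0.5, 150.0: 0.25, 200.0: 0.25}
grid = [100.0 + 0.5 * i for i in range(201)]  # candidate predictions 100..200
by_mse = optimal_decision(grid, target_probs, squared)
mean = sum(t * p for t, p in target_probs.items())
```

With P(fraud|x) = 0.10, the 0-1 loss picks the majority label while the cost-sensitive loss picks fraud, because 0.10 exceeds the 0.09 threshold; under squared error the best grid point coincides with the mean of the target distribution.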
• How do we look for optimal models?
• Don’t impose “exact forms”:
• Decision Trees, Classification based on Association rules, Production rules
• Learners estimate structure as well as parameters
• NP-hard for most “model representations”
• Impose “exact forms”:
• logistic regression functions, linear regression model, etc
• Learners estimate parameter ONLY. Structure is pre-fixed
• Inductive Bias.
• A decision tree is a rather flexible, efficient, yet powerful representation.
• Consider Decision Tree
• Compromise between accuracy and model complexity
• We assume that the simplest-structured hypothesis that fits the data is the best.
• We employ all kinds of heuristics to look for it.
• info gain, gini index, Kearns-Mansour, etc
• pruning: MDL pruning, reduced error-pruning, cost-based pruning.
• Reality: tractable, but still pretty expensive
• Truth: none of the purity check functions guarantees accuracy on the test data.
• Random Decision Tree - classification, regression, probability estimation
• Key characteristics:
• Structure is randomly picked.
• Statistics are summarized from training data.
• At each node, an un-used feature is chosen randomly
• A discrete feature is un-used if it has never been chosen previously on a given decision path starting from the root to the current node.
• A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen
• Continued
• We stop when one of the following happens:
• A node becomes too small.
• Or the total height of the tree exceeds some limit,
• such as the total number of features.
• Node Statistics
• Classification and Probability Estimation:
• Each node of the tree keeps the number of examples belonging to each class.
• Regression:
• Each node of the tree keeps the mean value of examples sorted into the node
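The construction rules and node statistics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all names are hypothetical, the structure is drawn before any data is seen, only the height limit is used as a stopping rule, and an empty child is handled at prediction time by falling back to its parent's mean. The feature names and target values are made-up data.

```python
import random


def build_tree(features, max_depth, rng, depth=0, used=frozenset()):
    """Draw a random tree structure without looking at training data.

    `features` maps a name to a list of values (discrete feature) or to a
    (low, high) tuple (continuous feature).
    """
    node = {"sum": 0.0, "n": 0}  # running statistics for the node mean
    unused = [name for name in features if name not in used]
    if depth >= max_depth or not unused:
        return node  # leaf
    name = rng.choice(unused)
    spec = features[name]
    node["feature"] = name
    if isinstance(spec, tuple):
        # Continuous: stays reusable, each use draws a fresh threshold.
        node["threshold"] = rng.uniform(*spec)
        branches, next_used = 2, used
    else:
        # Discrete: used at most once on any root-to-node path.
        node["values"] = list(spec)
        branches, next_used = len(spec), used | {name}
    node["children"] = [
        build_tree(features, max_depth, rng, depth + 1, next_used)
        for _ in range(branches)
    ]
    return node


def child_index(node, x):
    """Index of the branch that example x follows at this node."""
    value = x[node["feature"]]
    if "threshold" in node:
        return 0 if value < node["threshold"] else 1
    return node["values"].index(value)


def add_example(node, x, y):
    """Summarize one training example into every node on its path."""
    while True:
        node["sum"] += y
        node["n"] += 1
        if "children" not in node:
            return
        node = node["children"][child_index(node, x)]


def predict(node, x):
    """Mean target of the deepest non-empty node on x's path."""
    while "children" in node:
        child = node["children"][child_index(node, x)]
        if child["n"] == 0:
            break  # no training data below: keep the parent's mean
        node = child
    return node["sum"] / node["n"]


features = {"age": (18.0, 80.0), "edu": ["HS", "BS", "PhD"]}
train = [
    ({"age": 25.0, "edu": "BS"}, 60.0),
    ({"age": 41.0, "edu": "PhD"}, 150.0),
    ({"age": 52.0, "edu": "PhD"}, 160.0),
    ({"age": 36.0, "edu": "HS"}, 100.0),
]
tree = build_tree(features, max_depth=len(features), rng=random.Random(0))
for x, y in train:
    add_example(tree, x, y)
prediction = predict(tree, {"age": 45.0, "edu": "PhD"})
```

Because the structure never depends on the targets, training reduces to one pass that updates sums and counts, which is where the efficiency claimed for random trees comes from.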
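• Classification/Probability Estimation
• During classification, each tree outputs a posterior probability, e.g., P(P1|x) = 0.3:
[Figure: example tree with node tests B1 < 0.5, B2 > 0.7, and B1 > 0.3 on Y/N branches; one leaf holds class counts P1: 200, P2: 10 and another holds P1: 30, P2: 70.]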
• Regression
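• During regression, each tree outputs the average value of the training examples that fall within the node reached by x.
[Figure: example tree with node tests Age > 30, Capt > 70%, and Edu = PhD on Y/N branches; leaves hold averages such as Avg AGI = 100K and Avg AGI = 150K.]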
• Classification
• The predictions from multiple random trees are averaged as the final output.
• Classification: a loss function is needed to turn the averaged probability into a label.
• Training can be very efficient. Particularly true for very large datasets.
• Natural multi-class probability.
• Natural multi-label classification and probability estimation.
• Imposes very few assumptions about the structure of the model.
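The averaging step can be sketched as follows. This is a minimal Python illustration rather than the authors' code: the function names are hypothetical, and the class counts reuse the numbers from the example tree earlier in the transcript, treated here as the leaves that a single x reaches in two different trees (an assumption made only to have some data).

```python
def leaf_posterior(counts):
    """Turn the class counts stored in a node into posterior probabilities."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def average_posteriors(leaf_counts_per_tree):
    """Average the per-tree posteriors for one example x."""
    posteriors = [leaf_posterior(counts) for counts in leaf_counts_per_tree]
    labels = set()
    for posterior in posteriors:
        labels.update(posterior)
    return {
        label: sum(p.get(label, 0.0) for p in posteriors) / len(posteriors)
        for label in sorted(labels)
    }


def average_regression(leaf_means):
    """Regression: the final output is the mean of the per-tree node means."""
    return sum(leaf_means) / len(leaf_means)


# Hypothetical leaves reached by the same x in two random trees.
leaves = [{"P1": 30, "P2": 70}, {"P1": 200, "P2": 10}]
avg = average_posteriors(leaves)

# Hypothetical node means (in $K) reached by one x in two regression trees.
agi = average_regression([100.0, 150.0])
```

Multi-class output needs no extra machinery in this scheme, since each node already stores one count per class.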
• Number of trees
• Sampling theory:
• A random decision tree can be thought of as a sample from a large (infinite when continuous features exist) population of trees.
• Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 are enough.
• Worst scenario
• Only one feature is relevant. All the rest are noise.
• Probability:
• Variance Deduction:
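The formulas for this slide did not survive in the transcript, and the sketch below does not attempt to reconstruct them. It is only a back-of-the-envelope Python calculation under assumptions chosen here for illustration: k discrete features, a decision path of a given depth that tests that many distinct features picked uniformly at random, and trees drawn independently. Under those assumptions the single relevant feature lies on a given path with probability depth / k, and averaging independent estimates divides the variance by the number of trees. All function names are hypothetical.

```python
def p_relevant_on_path(k, depth):
    """P(the one relevant feature is among `depth` distinct random picks of k)."""
    return depth / k


def p_some_tree_uses_it(k, depth, n_trees):
    """P(at least one of n independent trees tests the relevant feature)."""
    return 1.0 - (1.0 - p_relevant_on_path(k, depth)) ** n_trees


def variance_of_average(single_tree_variance, n_trees):
    """Variance of the mean of n independent, equal-variance estimates."""
    return single_tree_variance / n_trees


# Example: 10 features, paths half as deep as the number of features.
one_tree = p_some_tree_uses_it(k=10, depth=5, n_trees=1)
ten_trees = p_some_tree_uses_it(k=10, depth=5, n_trees=10)
thirty_trees = p_some_tree_uses_it(k=10, depth=5, n_trees=30)
```

Even in this pessimistic setting the chance that every tree misses the relevant feature shrinks geometrically with the number of trees, which is consistent with the slide's remark that a few tens of trees are usually enough.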
• Donation Dataset - classification and prob estimation
• Decide to whom to send a charity solicitation letter.
• It costs \$0.68 to send a letter.
• Loss function
• Result
• Credit Card Fraud - classification and prob estimation
• Detect whether a transaction is fraudulent.
• There is an overhead to detect a fraud, chosen from {\$60, \$70, \$80, \$90}
• Loss Function
• Result
• Comparing with Boosting
• Boosting does not handle multi-class problems naturally; it needs a scheme such as ECOC (error-correcting output codes).
• Does not output probabilities.
• Inefficient.
• Choosing the number of boosting rounds is tricky. Sometimes, more rounds can lead to overfitting.
• Implementation needs careful numerical manipulation.
• Comparing with Bagging
• Can be very inefficient, particularly for very large datasets
• i.e., bootstrap sampling needs a linear scan of the data.
• Does not output reliable probabilities.
• Probability Estimation
• Probability Estimation
• Overfitting
• Non-overfitting of RDT
• Selectivity
• Tolerance to data insufficiency
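• GUIDE: multiple linear regression (MLR), y = a + a1*x1 + a2*x2 + … + ak*xk
[Figure: example tree with node tests Age > 30, Capt > 70%, and Edu = PhD on Y/N branches; each leaf holds an MLR model.]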
• Regression: single independent variable
• RDT
• Depends on a combination of 5 independent variables
• RDT
• It grows like …
• Comparing with GUIDE
• One needs to decide which variables are grouping variables and which are independent variables, a non-trivial task.
• If all variables are categorical, GUIDE becomes a single CART regression tree.
• It relies on strong assumptions and a greedy search, which can sometimes lead to very unexpected results, like the one given earlier.
• Conclusion
• Imposing a particular form of model is not a good way to train highly accurate models.
• It may not even be efficient for some forms of models.
• RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately.
• Selected Bibliography of RDT
• ICDM’03: “Is random model better? On its accuracy and efficiency” (Fan, Wang, Yu and Ma)
• AAAI’04: “On the Optimality of Posterior Probability Estimation by Random Decision Tree” (Fan)
• ICDM’05: “Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches” (Fan, Greengrass, McCloskey, Yu, and Drummey)
• ICDM’05: “Learning through Changes: An Empirical Study of Dynamic Behaviors of Probability Estimation Trees” (Zhang, Buckles, Peng, and Xu)
• Master's thesis by Tony Liu, supervised by Kai Ming Ting, “The Utility of Randomness in Decision Tree Construction”, Monash University, 2005
• KDD’06: “A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees”