Would you like greater confidence that the models you build are genuinely useful and can drive rational decisions? This slideshow will show how to build the most useful models that fully exploit all the information in your data, simply and easily.
Join us for an upcoming live webcast to learn more about using JMP: http://www.jmp.com/uk/about/events/webcasts/
And if you'd like to try JMP, here's how: http://www.jmp.com/uk/software/try-jmp.shtml?product=jmp&ref=top
We start off by defining what a statistical model is.
A statistical model is a function, f(X), that we use to predict some response or outcome, that we label Y.
Here, X represents one or more continuous or categorical predictor variables.
We write the statistical model as Y = f(X) + residual error.
The residual error is the leftover part of the variation in Y that we cannot predict with the function, f(X).
It turns out that during the process of building and evaluating statistical models, the residual error plays a key role, and examining and understanding that residual error can help us greatly as we seek to build a good model.
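As a minimal sketch of this idea (in Python, using a made-up straight-line relationship rather than data from the slides), we can fit a simple model and look at the residual error directly:

import numpy as np

# Hypothetical example: Y depends on one predictor X plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
Y = 3.0 + 2.0 * X + rng.normal(scale=1.5, size=200)   # "true" f(X) plus noise

# Fit a simple straight-line model f(X) = b0 + b1*X by least squares.
b1, b0 = np.polyfit(X, Y, deg=1)
predicted = b0 + b1 * X

# The residual error is the part of Y the model cannot predict.
residuals = Y - predicted
print("Residual standard deviation:", residuals.std().round(2))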
The Bootstrap Forest method applies ideas from data sampling and model averaging to help build predictive models with very good performance.
The first idea that is applied is Bootstrapping. A bootstrap sample from a data table is a 100% sample of the data, but drawn with replacement. So, for instance, if we had a data table with 1000 rows of data, then a bootstrap sample would also have 1000 rows of data. But the bootstrap sample would not be the same as the original data table, because sampling with replacement will lead to some rows being sampled more than once (2, 3, 4, or even more times), and some rows not being sampled at all. Typically about 37% of the rows are not selected at all.
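A small Python sketch of the sampling step (assuming a table with 1000 rows, not tied to any particular data set) makes the "with replacement" idea concrete:

import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000                       # imagine a data table with 1000 rows

# A bootstrap sample: draw 1000 row indices *with replacement*.
boot_rows = rng.integers(0, n_rows, size=n_rows)

# Some rows appear 2, 3, 4 or more times; others are never drawn at all.
never_sampled = n_rows - len(np.unique(boot_rows))
print(f"Rows left out of this bootstrap sample: {never_sampled} (~{never_sampled / n_rows:.0%})")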
If you build a decision tree model on that bootstrap sample, it may not, by itself, be a very good tree, because the bootstrap sample it was built on is not fully representative of the actual data. But if we repeat this process over and over again, and average all the models built across many bootstrap samples, this can lead to a model that performs very well. This approach, of averaging models built across many bootstrap samples, is known as Bootstrap Aggregation, or “bagging”. Bagged decision tree models, on average, perform better than a single decision tree built on the original data.
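Here is a hedged sketch of bagging in Python using scikit-learn decision trees (this stands in for the general idea, not JMP's own implementation; X, y, and X_new are assumed to be NumPy arrays of predictors, responses, and new rows to score):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_new, n_trees=100, seed=0):
    """Average the predictions of trees grown on many bootstrap samples."""
    rng = np.random.default_rng(seed)
    n_rows = len(y)
    predictions = np.zeros((n_trees, len(X_new)))
    for t in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)       # bootstrap sample
        tree = DecisionTreeRegressor(random_state=t).fit(X[rows], y[rows])
        predictions[t] = tree.predict(X_new)
    return predictions.mean(axis=0)                       # "bagged" prediction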
To improve the model performance even more, we apply sampling of factors during the tree building process. For each tree built (on a separate bootstrap sample), and for each split decision, a random subset of all the possible factors is selected and the optimal split is found among them. Typically we choose a small subset, say 25% of all the candidate factors. So at each split decision, about 75% of the factors are ignored, at random. This gives each factor an opportunity to contribute to the model.
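The sampling step itself is simple; a toy Python sketch (the factor names here are purely hypothetical) shows what happens at one split decision:

import numpy as np

rng = np.random.default_rng(1)
factors = ["age", "income", "region", "tenure", "usage", "segment", "channel", "score"]

# At each split decision, consider only a random ~25% of the candidate factors;
# the other ~75% are ignored for that particular split.
n_keep = max(1, round(0.25 * len(factors)))
candidates_for_this_split = rng.choice(factors, size=n_keep, replace=False)
print(candidates_for_this_split)   # the split is chosen only among these factors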
Combining bagging and variable sampling results in a tree building method known as a Bootstrap Forest. This is also known as a random forest technique.
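As one way to see the two ideas working together, scikit-learn's random forest (again, a stand-in rather than JMP's Bootstrap Forest) grows many trees on bootstrap samples and uses max_features to control the fraction of factors considered at each split; the data below is synthetic and only for illustration:

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic data standing in for a real data table.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# 100 trees, each grown on a bootstrap sample, considering ~25% of the
# factors at every split decision -- bagging plus factor sampling.
forest = RandomForestRegressor(n_estimators=100, max_features=0.25,
                               bootstrap=True, random_state=0).fit(X, y)
print("In-sample R^2:", round(forest.score(X, y), 3))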
Boosting is a newer idea in data mining, where models are built in layers.
(go to the next slide)