This slide deck presents an introduction to statistical modeling by Don McCormack of JMP. Don presents at Building Better Models seminars throughout the world. Upcoming complimentary US seminars are listed here: http://jmp.com/about/events/seminars/
We start off by defining what a statistical model is.
A statistical model is a function, f(X), that we use to predict some response or outcome, that we label Y.
Here, X represents one or more continuous or categorical predictor variables.
We write the statistical model as Y = f(X) + residual error.
The residual error is the left over part of the variation in Y that we cannot predict with the function, f(X).
It turns out that during the process of building and evaluating statistical models, the residual error plays a key role, and examining and understanding that residual error can help us greatly as we seek to build a good model.
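To make the Y = f(X) + residual error idea concrete, here is a minimal sketch. The fitted model f and the observed responses are made-up numbers for illustration only; the point is that the residuals are whatever is left over after the model's prediction.

```python
# Toy illustration of Y = f(X) + residual error.
# f(x) plays the role of a fitted statistical model; the data are hypothetical.
def f(x):
    return 2.0 * x + 1.0  # a made-up fitted model

xs = [1.0, 2.0, 3.0, 4.0]   # predictor values
ys = [3.2, 4.9, 7.1, 8.8]   # observed responses

# The residual for each observation is Y minus the model's prediction f(X).
residuals = [y - f(x) for x, y in zip(xs, ys)]
print(residuals)
```

Small residuals like these suggest the model explains most of the variation in Y; large or patterned residuals are exactly what we examine when judging a model.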
Decision tree models are known by a number of different names and algorithms: recursive partitioning, CHAID (Chi-Square Automatic Interaction Detection), and CART (Classification and Regression Trees).
Decision tree models are nothing more than a series of nested if statements. The IF-THEN and ELSE outcomes and conditions for the nested IF statements can be viewed as branches in a tree.
Decision tree models are constructed in a way that for each branch in the tree, the difference in the average response is maximized. These models are grown by adding more and more branches in the tree, with each successive branch explaining more of the variation in the response variable.
Consider a modeling example where the goal is to predict the likelihood that an individual will default on a loan. This is the same example that was used for illustrating how to construct an ROC Curve.
Overall in this data, we have 27900 individuals with a 3.23% default rate. We also have 24 candidate “X” variables that can be used to model and explain the default rate.
The algorithm for the decision tree will search through each one of these candidate variables, and through each unique level or combination of levels for each variable, and will find the “split” in the data, based on one variable, that maximizes the difference in the response rate. (Actually, what is found is the split that has the largest LogWorth, which is just –log10(p-value) for the statistical test of the mean difference (or proportion difference) between the two groups.)
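Here is a small sketch of the LogWorth idea for a binary response, using a Pearson chi-square test on a 2x2 table (defaulter vs. not, in each side of a candidate split). The counts in the example call are hypothetical, not taken from the seminar's data set; for 1 degree of freedom the chi-square p-value can be computed with the standard-library `erfc` function.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]
    (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def logworth(a, b, c, d):
    chi2 = chi2_2x2(a, b, c, d)
    # For 1 df, a chi-square variable is a squared standard normal,
    # so the p-value is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return -math.log10(p)

# Hypothetical split: left branch 30/500 defaulters, right branch 20/1000.
print(round(logworth(30, 470, 20, 980), 2))
```

Larger LogWorth means a smaller p-value, so the algorithm picks the split with the biggest LogWorth.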
In this example, the 1st split that is chosen is based on the AGE variable. Across all the possible splits, the algorithm found that the split that maximized the difference in the default rate was AGE < 28 versus AGE >= 28. For the under-28 group, the default rate was 6.51%; for the 28-and-older group, it was 2.24%.
Once this step is done, to find the next branch in the decision tree, we repeat the algorithm across each “partition” of the data, that is, across each branch of the tree. So we look in the under-28 group and find the split that maximizes LogWorth, and do the same in the 28-and-older group. Then whichever of those has the overall maximum LogWorth, we use that split in the model.
For the 2nd split, it finds that the types of credit cards the under 28 group has describes the most variability in the default rate. For those with more traditional credit cards (Mastercard, VISA, Cheque Card), the default rate is 3.13%, while those with non-traditional or no credit cards have a default rate of 8.16%.
To find the next split in the tree we repeat the process again across the 3 partitions of the data, represented by the 3 branches of the tree, to find the split that maximizes LogWorth.
The 3rd split in the tree is found in the 28-and-older group, based on the number of telephone numbers the person has. Those who have fewer than 2 have a default rate of 4.35%, and those with 2 or more have a rate of 1.89%.
So with just 3 steps of the tree building process, we have gone from a “trunk” of a tree with a rate of 3.23%, to a tree with 4 branches that have rates of 8.2% to 1.9%.
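Written out as the nested IF statements described earlier, the three-split tree looks like the sketch below. The branch rates are the ones quoted above; the specific card names in the membership test are an assumption for illustration, since the data set's exact credit card categories are not listed here.

```python
# The three-split decision tree from the example, as nested IF statements.
# The card categories are hypothetical stand-ins for "traditional" cards.
def predicted_default_rate(age, card_type, n_phones):
    if age < 28:
        if card_type in ("Mastercard", "VISA", "Cheque Card"):
            return 0.0313   # under 28, traditional card
        else:
            return 0.0816   # under 28, non-traditional or no card
    else:
        if n_phones < 2:
            return 0.0435   # 28 and older, fewer than 2 phone numbers
        else:
            return 0.0189   # 28 and older, 2 or more phone numbers

print(predicted_default_rate(25, "VISA", 1))   # 0.0313
```

Every observation falls into exactly one branch, and the branch's rate is the tree's prediction for it.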
We would normally continue this process of adding branches to the tree, so that we grow out a model that predicts well.
One problem that decision trees have is that it is very easy to overfit a decision tree to the data. You could imagine that if we allowed the tree algorithm to continue until every branch contained just ONE observation, the tree would fit the data perfectly. One way to prevent this from happening is to set a minimum branch size requirement, so that no branch in the tree will have fewer than a specified number or percentage of rows.
Another good idea is to use hold out validation. As you grow the tree, you can check to see how well that tree is predicting the validation data, and if that stops improving as you add more complexity into the tree model, you could stop adding splits to the tree.
The Bootstrap Forest method applies ideas from data sampling and model averaging that help build predictive models with very good performance.
The first idea that is applied is bootstrapping. A bootstrap sample from a data table is a 100% sample of the data, but performed with replacement. So, for instance, if we had a data table with 1000 rows of data, then a bootstrap sample would also have 1000 rows of data. But the bootstrap sample would not be the same as the original data table, because sampling with replacement will lead to some rows being sampled more than once (2, 3, 4, or even more times), and some rows not being sampled at all. Typically about 37% of the data (roughly 1/e) is not selected at all.
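A bootstrap sample is easy to sketch with the standard library. The example below draws 1000 rows with replacement from a 1000-row table and checks what fraction of the original rows were never picked; the seed is arbitrary.

```python
import random

random.seed(1)
n = 1000
rows = list(range(n))

# A bootstrap sample: n draws WITH replacement from the n original rows.
boot = random.choices(rows, k=n)

# Because of replacement, some rows appear several times and some not at all.
never_picked = n - len(set(boot))
print(never_picked / n)  # typically close to 1/e, about 0.37
```

The duplicated and omitted rows are exactly what makes each bootstrap sample differ from the original table.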
If you build a decision tree model on that bootstrap sample, it may not, by itself, be a very good tree, because the bootstrapped data used to build it is so unrepresentative of the actual data. But if we repeat this process over and over again, and average all the models built across many bootstrap samples, this can lead to a model that performs very well. This approach, of averaging models built across many bootstrap samples is known as Bootstrap Aggregation, or “bagging”. Bagged decision tree models, on average, perform better than a single decision tree built on the original data.
To improve the model performance even more, we apply sampling of factors during the tree building process. For each tree built (on a separate bootstrap sample), for each split decision, a random subset of all the possible factors is selected and the optimal split is found among them. Typically we choose a small subset, say 25% of all the candidate factors. So at each split decision, about 75% of the factors are ignored, at random. This gives each factor an opportunity to contribute to the model.
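The per-split factor sampling can be sketched in a couple of lines. The factor names below are placeholders standing in for the 24 candidate variables mentioned in the example; in a real forest this draw is repeated fresh at every split decision of every tree.

```python
import random

random.seed(7)
factors = [f"X{i}" for i in range(1, 25)]  # 24 hypothetical candidate factors

# At one split decision, only a random ~25% subset of factors is eligible.
k = max(1, len(factors) // 4)
candidates = random.sample(factors, k)
print(candidates)  # the only factors considered for this one split
```

Because the eligible subset changes from split to split, strong factors cannot dominate every split, and weaker factors get a chance to contribute.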
Combining bagging and variable sampling results in a tree building method known as a Bootstrap Forest. This is also known as a random forest technique.
Boosting is a newer idea in data mining, where models are built in layers.
(go to the next slide)
Neural Networks are sometimes described as “black box” methods, in that the underlying models are complex and most applications of Neural Networks rely on software to fit the models for you automatically.
It turns out that Neural Networks are nothing more than highly flexible non-linear models. While mathematically complex, the structure of a Neural Network model can be viewed as a network of connected blocks and nodes.
To illustrate this, consider a simple example where there is one response variable, Y, and there are two predictor variables, X1 and X2. (next slide)
This is a diagram representation of an example Neural Network.
On the left hand side of the graph are the factors or inputs to the model. On the right hand side is the response or output from the model.
In between is a network of nodes and links that represent the model. The types of neural networks you can fit in JMP (Pro) have either one or two layers, called “hidden” layers, and within each layer you can have one or more nodes. The nodes have a symbol overlaid on them that represents a mathematical transformation.
The way this neural network is constructed is as follows:
You take a linear combination of the inputs into each node in the 2nd hidden layer, and then that value is transformed by the associated node’s “activation function”. So, for instance, for the top node in the 2nd hidden layer, you take a linear combination, a + b*X1 + c*X2, and you transform it using the “S”-shaped function, which is a hyperbolic tangent function.
For the middle node in the 2nd hidden layer, you take a separate linear combination, d + e*X1 + f*X2 into that node and perform a linear (or multiplicative scale) transformation on that value.
For the bottom node in the 2nd hidden layer, you take yet another linear combination of the inputs, g + h*X1 + i * X2, and you transform them with a Gaussian function (also known as a radial function).
What comes out of each of the nodes in the 2nd layer is a new transformed value. In this NN, those transformed values are treated like inputs to the 1st hidden layer, and the same approach is used: linear combinations of the 2nd hidden layer results are brought into each of the 1st hidden layer nodes, and another transformation is applied in each node.
What comes out of the end of the network, the output of the 1st hidden layer, is then used to make a prediction of the output, the Y variable, by taking a linear combination of those values. The very final part of the NN, between the 1st hidden layer and the output, is essentially a simple linear regression.
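The forward pass just described can be sketched directly. All the weights below are made-up numbers (in practice they are all estimated simultaneously when the network is fit), and the 1st hidden layer is simplified here to a single TanH node; the layer numbering follows the convention above, with data flowing from the inputs through the 2nd hidden layer, then the 1st, then to the output.

```python
import math

# Hypothetical weights throughout; a real fit estimates these from data.
def second_layer(x1, x2):
    # Each node: a linear combination of the inputs, then an activation.
    t = math.tanh(0.5 + 1.0 * x1 - 0.5 * x2)          # TanH ("S"-shaped) node
    l = 0.2 + 0.3 * x1 + 0.4 * x2                     # Linear node
    g = math.exp(-(0.1 + 0.6 * x1 + 0.2 * x2) ** 2)   # Gaussian (radial) node
    return t, l, g

def first_layer(t, l, g):
    # One TanH node taking the 2nd layer's outputs as its inputs.
    return math.tanh(-0.1 + 0.8 * t + 0.5 * l - 0.3 * g)

def predict(x1, x2):
    h = first_layer(*second_layer(x1, x2))
    # Final step: a simple linear regression on the last hidden layer output.
    return 1.5 + 2.0 * h

print(predict(0.3, -0.7))
```

Even this tiny network is a highly flexible non-linear function of X1 and X2, which is the point of the "network of transformations" view.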
One way to think about what a neural network does is that it finds appropriate transformations of the inputs so that you can use those transformed values to build a linear regression model.
The description that I just gave about this neural network is somewhat simplistic, in that the estimation (or model building) process is done simultaneously, treating this neural network as one big non-linear model.
Also, it is important to mention that this particular Neural Network is just one of many neural networks we could fit to this data. The number of layers and the number of each type of node to use is left up to the user to specify.
So what is the big picture on Neural Networks?
These types of models are mathematically complex, but they can be easily represented and understood using the network diagram.
They are complex models that can be more computationally challenging and time consuming to fit.
The key strength of Neural Networks is their flexibility. They can model a wide variety of relationships between inputs and outputs.
That strength is also a potential weakness of neural nets. Because they are so good at fitting data, they have a tendency to overfit. Because of that tendency, it is good to use some sort of validation strategy when you are building models and deciding which network model is the right one to use. In JMP, we require you to use some sort of validation, whether it is holdout or k-fold cross-validation.
In addition, the underlying analytic method used to estimate the model parameters is designed to help reduce the overfitting problem. There is a very nice whitepaper on the JMP web site that describes this in greater detail, and the link is shown at the bottom of this slide.
During this seminar, we will use various measures of model quality.
For models where we are predicting a continuous response variable, the metrics we use are based on the sum of the squared error (SSE) in prediction. The most common metric we use is the R^2 (R-squared).
R^2 is defined as 1 – (SSE) / (TSS), where
TSS is the total variability in the response. If the model predicts perfectly, then SSE is zero and R^2 is one. If the model doesn’t predict any better than using the average of the response variable, the R^2 is zero. R^2 will range between [0,1] when evaluating the model on the training data, but it can actually become negative when evaluating the model on a validation or test data set. A negative R^2 means the model is making worse predictions than just using the average response, and is a strong indication of overfitting.
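The R^2 = 1 − SSE/TSS definition, including the negative case, can be checked with a few lines of code. The data and predictions below are made up for illustration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/TSS, computed from paired actual/predicted values."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y, p) == 0 or (y - p) ** 2 for y, p in zip(actual, predicted))
    tss = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - sse / tss

ys = [1.0, 2.0, 3.0, 4.0]
# A good model: predictions close to the actual values, R^2 near 1.
print(r_squared(ys, [1.1, 1.9, 3.0, 4.1]))
# A model worse than just predicting the mean: R^2 goes negative.
print(r_squared(ys, [4.0, 3.0, 2.0, 1.0]))
```

The second call returns exactly −3: its squared errors are four times the total sum of squares, so 1 − 4 = −3.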
There are other model quality measures based on the SSE, such as AIC (Akaike’s Information Criterion) and BIC (Bayesian Information Criterion), but we will not focus on those metrics in this seminar.
When the response is categorical, there are generalizations of R^2 which are used to help determine the appropriate model size, but we can also use other measures to assess model quality. When you have a categorical response model, the prediction model you build is a probability model: it predicts the probability of the outcome levels for the response. You can then use those probability models to make classification decisions about the expected response. This leads to metrics such as the misclassification rate, and a cross-classification matrix known as the confusion matrix.
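Here is a minimal sketch of turning predicted probabilities into classifications and then into a confusion matrix and misclassification rate. The actual outcomes, probabilities, and the 0.5 threshold are all illustrative assumptions.

```python
def confusion_matrix(actual, prob, threshold=0.5):
    """Classify as 1 when the predicted probability >= threshold,
    then cross-tabulate actual vs. predicted outcomes."""
    tp = fp = tn = fn = 0
    for y, p in zip(actual, prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1          # true positive
        elif y == 0 and pred == 1:
            fp += 1          # false positive
        elif y == 0 and pred == 0:
            tn += 1          # true negative
        else:
            fn += 1          # false negative
    return tp, fp, tn, fn

actual = [1, 0, 1, 0, 1, 0]                 # hypothetical outcomes
prob   = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]     # hypothetical predicted P(Y=1)
tp, fp, tn, fn = confusion_matrix(actual, prob)
misclassification_rate = (fp + fn) / len(actual)
print(tp, fp, tn, fn, misclassification_rate)
```

Changing the threshold trades false positives against false negatives, which is the same tradeoff the ROC curve summarizes across all thresholds.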
Additionally, these probability models can be used to “sort” the data, and ROC curves summarize the ability of these models to effectively sort the data.
The ROC curve is a “mystical” entity. Many predictive modeling methods use ROC curves to measure model quality. But a simple and understandable explanation of ROC curves is hard to find.
To help explain what an ROC curve is, we will use an example where a model was built to predict the probability of a binary outcome. In this case, the variable, GB, stands for “good”/“bad”, and is an indicator of whether an individual defaulted on a loan or not. GB==1 indicates that the individual did NOT pay off the loan, and we have a model formula that predicts the probability that GB==1. What we do is sort the data, based on that model, from high predicted probability to low probability.
If the model is predicting well, then what we would expect, upon sorting the data by predicted probability, is that we would have a lot of the defaulters at the top of the sorted table, and a lot of the non-defaulters at the bottom of the table.
What is shown is the ROC curve which describes the model’s sorting ability. Zooming in on the lower left hand corner of the graph, we can see how this relates to the sorted table.
In the sorted table, the first 9 rows are sorted TOWARDS the target, that is, those rows have the highest predicted probability of default and they were actually defaulters. So, for each one of those rows, we draw the graph up, a total of 9 steps.
Notice that the next 8 rows are not sorted toward the desired target, that is they have a relatively high probability of default, but they were not defaulters. We draw the graph over, horizontally, 8 steps, one step for each one of those rows.
The next 9 rows are sorted toward the target, and again we draw the graph up 9 steps.
This is a verbal description of how to construct the ROC curve. No need to go over this again, as it was just described in the last slide.
The step size for each row is based on the number of 1s and 0s in the data set, so that the graph, if we draw it based on the whole table, is plotted on the [0,1] x [0,1] range. For a model that is predicting well, you would expect that the graph, initially, would go more “up” than “over”, and the graph would be raised above the 45-degree line shown on the chart. The higher the curve is above that 45-degree line, the better the model is doing at sorting.
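The up/over construction just described can be sketched directly: sort by predicted probability, step up by 1/(number of 1s) for each 1 and over by 1/(number of 0s) for each 0, then measure the area under the resulting curve. The small data set is hypothetical.

```python
def roc_points(actual, prob):
    """ROC curve points: walk the data sorted from highest to lowest
    predicted probability, stepping up for each 1 and over for each 0."""
    n1 = sum(actual)
    n0 = len(actual) - n1
    pairs = sorted(zip(prob, actual), reverse=True)
    x = y = 0.0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            y += 1.0 / n1   # step up
        else:
            x += 1.0 / n0   # step over
        points.append((x, y))
    return points

def auc(points):
    # Area under the step curve by the trapezoid rule.
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        total += (x1 - x0) * (y0 + y1) / 2.0
    return total

actual = [1, 0, 1, 0, 1, 0]                 # hypothetical outcomes
prob   = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]     # hypothetical predicted P(Y=1)
print(auc(roc_points(actual, prob)))
```

Scaling the steps by the counts of 1s and 0s is what puts every ROC curve on the same [0,1] x [0,1] range, so areas are comparable across models.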
Because ROC curves are always on the same scale, we can also use the area under the curve (AUC) as a metric of model quality.
A model that predicts perfectly (and therefore sorts perfectly) would have an ROC curve that goes up all the way to 1, and then over all the way to 1, so it looks like an upper right angle. A model that does no better than just picking or sorting at random would have an ROC curve that looks roughly like the 45-degree line. So the more the ROC curve is above the 45-degree line, the better the model.
Each point on this curve corresponds to a particular probability that can be used as a threshold for making decisions about an observation. Take, for instance, the point indicated on the graph (click animation to show). If the associated probability threshold were used to make decisions about the “Good/Bad” outcome, we would expect to correctly classify about 57% of the “Bad” credit risks, but at the tradeoff cost of incorrectly classifying about 27% of “Good” credit risk individuals as a “Bad” credit risk. (An expected question: how do I find that decision threshold? Unfortunately, with JMP you have to search the sorted table to find it. In JMP 11 there will also be a “cumulative” gains chart that is easier to use to find this threshold in the table. The Y axis for the cumulative gains chart will be the same as the ROC curve (sensitivity), but the X axis will be the sorted portion of the data, which you can then use to somewhat easily find the row and look up its associated predicted probability.)