What Is a Model, Anyhow?
1. What Is Predictive Modeling?
4250 258th Ave SE
Issaquah, WA 98029
425.996.8732 Office
bill.cassill@numericalalchemy.com
Copyright 2009 Numerical Alchemy, Inc.
This material is not to be distributed or in any way duplicated without the prior consent of the author.
2. What Is a Model?
• Predictive modeling refers to a class of techniques that determine the most likely outcome given a set of inputs. Frequently, these inputs consist of past data that will be used to predict a future outcome or event.
[Diagram: a predictive model often uses past data (e.g. last month) to predict future events (e.g. 2 months out). Model inputs A through D feed into a predicted outcome or event.]
3. What Are Models Used For?
• Models currently have many uses.
• Some examples include:
  – Which people are a good credit risk?
  – What is someone's accident risk based on age, gender, and past driving history?
  – Who is most likely to buy my products in the next 90 days?
  – Who is most likely to stop doing business with my company in the near future?
  – Which purchase transactions represent a significant fraud risk?
• All of these questions can be answered with predictive modeling.
4. A Tangled Web of Data
• What can make the prediction task complex is when we are faced with hundreds or thousands of potential factors that can be used as inputs.
• The obvious questions arise:
  – Which ones should I use?
  – How many of the factors are truly relevant or predictive?
  – How do I know if I have the "right" model?
• All of these questions can be answered by a good analyst or statistician.
5. Outcome Variables
• Several types of outcome variables can be predicted using statistical modeling techniques.
• These include:
  – Continuous values like future customer profitability and future sales volumes
  – Binary outcomes (1 = event occurs, 0 = event does not occur) like whether someone buys something (or not) or defaults on a credit card (or not)
  – Multi-category outcomes like small, medium, and large
• However, by far the most popular outcomes to model are the continuous and binary variety.
6. Prediction and Scores
• Once a model has been built, it can be used to generate scores (i.e. predicted values) on new data. Depending on the outcome being modeled, these scores can take on a couple of different varieties.
• Predicted scores for binary outcomes are represented as a probability score: a 0 to 1 decimal score representing the percentage chance that the modeled event will occur for a given case.
• For continuous values, predicted scores take on the scale and characteristics of the original outcome variable.
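To make this concrete, here is a minimal sketch of how a fitted binary model turns a case's inputs into a 0-to-1 probability score. The model, its coefficients, and the scored case are all hypothetical; only the standard library is used.

```python
import math

def probability_score(inputs, weights, intercept):
    """Turn a case's inputs into a 0-to-1 probability score
    using a hypothetical fitted logistic model."""
    linear = intercept + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-linear))  # logistic function maps to (0, 1)

# Hypothetical fitted coefficients and a new case to score
weights, intercept = [0.8, -1.2], -0.5
score = probability_score([2.0, 1.0], weights, intercept)
print(round(score, 3))  # a value strictly between 0 and 1
```

A continuous-outcome model would instead return the linear prediction directly, on the scale of the original outcome variable.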
7. Finding the "Right Model"
• There are many measures that tell you how predictive your model is. The problem is that no matter how predictive your model is on one set of data, it may lose its predictive power once applied to another set of data.
• One example is using demographic data to predict store-level retail sales during the summer months. The predictors we observe for the Southeastern U.S. may not prove useful when applied to West Coast locations.
• Similarly, an algorithm that predicts summer sales well may prove useless in predicting the spike in sales during the November and December Christmas season.
8. Validation Is the Key
• The way to truly test how well a model performs is to test it on an external data set.
• The data the model is built on is typically called the "development sample," while the data set used to validate the model is called the "validation sample."
• Ideally, both samples will be pulled from the same population of cases. By creating random samples, we can be fairly sure that we are creating data sets that are representative of the population of interest.
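As a sketch of the idea (the function name and 70/30 split are illustrative choices, not a prescription), drawing both samples at random from the same population might look like:

```python
import random

def split_sample(cases, dev_fraction=0.7, seed=42):
    """Randomly split a population of cases into a development
    sample (to build the model) and a validation sample (to test it)."""
    shuffled = cases[:]                 # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

population = list(range(1000))          # stand-in for real cases
development, validation = split_sample(population)
print(len(development), len(validation))  # 700 300
```

Because the split is random, both samples should be representative of the same underlying population, which is exactly what validation requires.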
9. Lift Charts
• One way to tell how well a model performs is by looking at something called a lift chart. To construct one, follow these basic steps:
  1. Sort the cases in the data set in descending order from the highest predicted score to the lowest (i.e. the highest scores are at the top).
  2. Cut the file into 10% chunks called "deciles," where the top decile contains the 10% of cases with the highest scores.
  3. Calculate the lift value by dividing the average value of the outcome variable within each decile by the average value for the entire sample.
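The three steps above can be sketched in plain Python. The data here is a toy example; in practice each pair would be a case's model score and its observed outcome.

```python
def decile_lift(scored_cases):
    """scored_cases: list of (predicted_score, outcome) pairs.
    Returns the lift value for each of the 10 deciles."""
    # Step 1: sort descending so the highest scores are at the top
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    n = len(ranked)
    overall_avg = sum(outcome for _, outcome in ranked) / n
    lifts = []
    # Step 2: cut the file into ten 10% chunks ("deciles")
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        decile_avg = sum(outcome for _, outcome in chunk) / len(chunk)
        # Step 3: lift = decile average / overall sample average
        lifts.append(decile_avg / overall_avg)
    return lifts

# Toy example: higher scores tend to go with the event (outcome = 1)
cases = [(i / 100, 1 if i >= 85 else 0) for i in range(100)]
lifts = decile_lift(cases)
print(lifts[0])  # top decile lift, well above 1 for a useful model
```

A lift of 1.0 in a decile means that decile performs no better than the sample average.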
10. Lift Charts (cont.)
• Once we've done the basic data manipulation described on the previous page, we can make a chart like the one shown below. The good thing about models is that we can use them to identify and target our actions to a much smaller number of cases.

[Sample lift chart: average decile value vs. average sample value. The average rate for the outcome event is 1.5% of the total cases. However, for the top decile (the 10% of cases with the highest scores), the percentage of cases experiencing the event is 6%, a lift of 4 times the sample average. It is better to target these top-scoring cases than the rest.]

• In terms of application, if this model were developed to identify likely buyers of a product, we would want to focus our marketing efforts on those in the top one or two deciles, who have a much stronger likelihood to purchase vs. those who are very unlikely to purchase.
11. Gains Charts
• Gains charts are another way to determine how well a model performs.
• Like lift charts, we sort the data in descending order from highest score to lowest score. Next, we cut the file into 10% chunks.
• However, unlike a lift chart, the idea is to see how much of the target event we are capturing as we move from the top of the data file to the bottom.
12. Gains Charts (cont.)
• We compare the cumulative capture of the "event" cases to the cumulative capture rate if the file had simply been sorted in a random order.

[Sample gains chart: cumulative % of event cases captured, model vs. random sort. In this example, the model captures 45% of all the "event" cases within the top 10% of the file, and better than 75% of the "event" cases within the top 30% of the file. These results can be compared to a random sorting of the file, where we could expect to capture 10% of the "event" cases within the top 10% of the file and 30% of the "event" cases within the top 30% of the file.]
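The cumulative-capture calculation behind a gains chart can be sketched the same way, again with toy data:

```python
def cumulative_capture(scored_cases):
    """scored_cases: list of (predicted_score, outcome) pairs.
    Returns, for each decile cut, the cumulative share of 'event'
    cases captured from the top of the file down."""
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    n = len(ranked)
    total_events = sum(outcome for _, outcome in ranked)
    captures = []
    for d in range(1, 11):                    # top 10%, 20%, ..., 100% of file
        top = ranked[: d * n // 10]
        captured = sum(outcome for _, outcome in top)
        captures.append(captured / total_events)
    return captures

# Toy example: higher scores tend to go with the event (outcome = 1)
cases = [(i / 100, 1 if i >= 85 else 0) for i in range(100)]
caps = cumulative_capture(cases)
print(caps)  # rises toward 1.0; a random sort would rise as 0.1, 0.2, ...
```

The gap between this curve and the straight "random sort" line is what the gains chart visualizes.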
13. Using the Model
• Once the model has been developed and validated, it is time to use it. To use it, fresh data is scored to generate predictions for the cases or population of interest.
• Typically, models are deployed in one of three fashions:
  – One-time or infrequent, occasional use
  – Regularly scheduled rescoring (e.g. weekly, monthly, quarterly), depending upon when fresh data becomes available
  – Scoring in real time; this is most appropriate for applications like transaction fraud detection or continuously learning predictive algorithms
14. Tracking the Model
• Like almost everything else, models age and can become less predictive over time.
• Because of this, it is important to periodically reassess a model's performance.
• This can be done using the standard lift and gains charts. By comparing the model performance over different time periods, the degree of performance decay can be assessed on an ongoing basis.
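One simple way to track decay, sketched with hypothetical scoring periods, is to recompute top-decile lift on each fresh period and watch the trend:

```python
def top_decile_lift(scored_cases):
    """Top-decile lift: the event rate in the highest-scoring 10%
    of cases divided by the event rate in the whole sample."""
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    n = len(ranked)
    overall = sum(o for _, o in ranked) / n
    top = ranked[: n // 10]
    return (sum(o for _, o in top) / len(top)) / overall

# Hypothetical periods: the model separates events well at first, then the
# event spreads beyond the high scorers, so top-decile lift decays.
period_1 = [(i / 100, 1 if i >= 90 else 0) for i in range(100)]
period_2 = [(i / 100, 1 if i >= 90 or i % 10 == 0 else 0) for i in range(100)]
lift_old = top_decile_lift(period_1)
lift_new = top_decile_lift(period_2)
print(lift_old, lift_new)  # the second period's lift is lower
```

A steadily shrinking top-decile lift over successive periods is the signal that the model is aging.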
15. Putting a Model Out to Pasture
• When a model finally loses its luster, it is time to retire it.
• However, the decision as to when to retire an existing model can be somewhat subjective.
• When you do make this decision, you are faced with the prospect of creating a new model to replace the one you are going to retire.
• Don't panic! This is just part of the model lifecycle. Simply create the new one and then switch them out.
16. Final Comments
• Congratulations! You can now claim to be an educated user of predictive analytics.
• At this point, you should have an idea of:
  – What a model does
  – What it can be used for
  – How to assess its predictive accuracy
  – The basic model lifecycle
• We hope you have enjoyed this little overview, and best of luck in your application of predictive analytics.