2. GOALS FOR TODAY’S PRESENTATION
Overview of predictive analytics and modeling process
Share a use case that illustrates PA
3. THE MODELING PROCESS
DEFINE QUESTION
EXPLORE AND SELECT DATA
MODEL
EVALUATE
DEPLOY AND MONITOR
4. USE CASE PROFILE
Science center in the Midwest
Approx. 800,000 visitors a year
Approx. 20,000 member households
The Raiser’s Edge for fundraising
Ticketmaster VISTA for ticketing
5. THE BUSINESS QUESTION
How do we make more money?
6. THE BUSINESS QUESTION
How do we make more money?
What are the factors that affect visitation?
7. BRAINSTORMING THE ANSWER
What do we think the factors are?
Exhibits
Day of the week
Seasonality
Holidays
These are the “predictors” – use these to create the modeling database
8. EXPLORING THE DATA
Generally become familiar with the data
Where are the outliers?
Are you finding evidence of bad data?
Do you have the data you need?
Transform the data so it is ready to be modeled
9. EXPLORE THE DATA
10. EXPLORE THE DATA
11. EXPLORE THE DATA
12. MODELING: FIRST PASS
[Figure: Normal Probability Plot (response is ADM). X-axis: Standardized Residual; Y-axis: Percent]
13. MODELING: FIRST PASS = 44%
Predictor Coef P
Constant 1085.08 0
Mon -651.48 0
Tue -650.91 0
Wed -266.8 0
Thur -308.87 0
Fri -56.84 0.388
Sat 507.88 0
Apr -128.2 0.412
May -253.93 0.011
June 370 0.001
July 1019.8 0
Aug 843.4 0
Sept -392.99 0
Oct -398.2 0
Nov -179.2 0.014
Holiday -214.8 0.053
Holiday Wkn 355.26 0
EXH2 578.5 0.01
EXH3 448.9 0.069
EXH4 62.6 0.908
EXH5 629.3 0.01
Active Exh+ -3.2 0.995
14. EVALUATE AND IMPROVE
SECOND PASS = 66%
[Figure: Normal Probability Plot (response is ADM). X-axis: Standardized Residual; Y-axis: Percent]
15. EVALUATE AND IMPROVE
THIRD PASS = 85%
[Figure: Normal Probability Plot (response is ADM). X-axis: Standardized Residual; Y-axis: Percent]
17. THE FINAL MODEL
18. THE FINAL MODEL
Constant: 1124
Predictor (top)         Effect  Predictor (bottom)  Effect
Community Open House      4952  September             -515
New Year's Week           3332  December              -541
Labor Day Weekend         3058  October               -565
President's Day           3009  New Year's Day       -1273
Martin Luther King Day    2798  Fourth of July       -1971
Good Friday               1776  Easter Monday        -2201
July                      1349  Red White and Boom   -3689
19. COMPILE DATA TO PREDICT ADMISSIONS
20. COMPILE DATA TO PREDICT
21. PREDICTION LINE FIT PLOT
[Figure: July Admissions Line Fit Plot. X-axis: July 2012, days 1 to 31; Y-axis: Admissions, 0 to 5000]
22. COMPARING TO REALITY
[Figure: July Admissions Line Fit Plot. X-axis: July 2012, days 1 to 31; Y-axis: Admissions, 0 to 5000]
23. SO WHAT?
The model translates your strategy into numbers
Business decisions could include…
Adding or reducing staffing and volunteers more strategically
Opening the right number of ticket windows
Opening an auxiliary room to handle lunch overflow
Planning for shuttle parking and security
Leveling visitation - if you know a day will likely be low attendance, you could move events or group outings
Using a visitation model, you can…
Invest resources more efficiently
Improve the visitor experience
Editor's Notes
Begin by reviewing what this part of the presentation is aiming to accomplish
I would like to give you an overview of the PA process – the methodology you might follow if you were to do this
I’ll spend most of our time illustrating how we used this model with a client
And I will share with you the resulting model and how one might deploy it
Before we jump into the work we’ve done with our client, I wanted to give you a quick overall review of the modeling process and then we’ll walk through each of these stages as we applied them to our client.
Define question
You must know what you want to predict before you start the process – what is the business driver?
It is easy to jump into a project without a clear understanding of the business problem that is to be addressed
Projects starting with “Let’s run this data through some predictive algorithms to see what we get” are doomed to fail
Explore and select data
Once you have clearly defined the business question, look at your data
Consider what data you think might be important to the question you have defined
Prepare the data – is it clean? Is it ready for the modeling tool?
Modeling
Exploratory Data Analysis – first look at the data you selected: Where are the outliers? Why are they there? Are you finding bad data?
Choose a model - Different models are used to answer different questions
Build model
Once you’ve explored and prepared the data and chosen the model, you are ready to build the model based on a subset of data you selected
Point the model at the historical data to “train the model” and improve it in iterations
Evaluate
Did the model properly address the business question?
Did you use the right data to answer the business question?
What did you find surprising about the results? Are there surprises that are worth further investigation to make sure the data you’re using is effective?
Deploy and monitor
This is the step where you release the model into your organization’s decision-making process. This is an important step. It brings us to this place where we take a look at our results and ask “So what?”. It’s not enough to create a fancy looking model if it’s not going to lead us somewhere where we can have specific changes that we can apply to how we run our business.
Embed the model in your reports and BI
At a high level, here’s the modeling process we’re going to walk through.
I’m going to use a case study from one of our clients to highlight this process and share a little bit about what we’ve learned working with them.
Let me tell you a little bit about our client. We are working with a mid-size science center.
They get approximately 800,000 visitors a year
They average about 20,000 member households a year
They use multiple systems throughout the organization but two of the main ones are The Raiser’s Edge for fundraising and Ticketmaster VISTA for ticketing. For the purposes of our work with this client, we’ve focused on data from Raiser’s Edge and VISTA for now.
This gives you a little sense of the organization and you’ll certainly learn more about them as we go along.
So now that you know a little about the client behind our use case, let me walk you through the modeling process that we’ve used for this client.
As you’ll recall, we need to start by defining our business question. Without a business goal, predictive modeling is just another answer in search of a question.
As we worked with our client, the general question they started with was: “How do we make more money? How do we earn a return?”
It’s a good question and one that I’m sure we’ve all thought about at some point, but this is a big place to start.
We need specificity for this to be meaningful: the question must be specific to your business. It must take into account the specifics of your organization, or it will not be valuable and it will be hard to know if the modeling accomplished its goal.
We need to break this down to get to something that is meaningful. So we talked about the different ways they make money.
They sell memberships. They get donations. They sell tickets to the science center. They sell IMAX tickets. They have a gift shop. They have a café. They have events.
We decided to focus on General Admission ticketing.
So, our question went from the broad (and not terribly usable) “How do we make more money?” to the focused and ready-for-PA “What are the factors that affect visitation?”
By creating a model that predicts visitation (the bread and butter for this organization), our client will be in a better position to plan for it – to make strategic business decisions that, ultimately, will earn a return.
Some of the issues that we were expecting to address included staffing: some days they were swamped and some days they were not, and they struggled to plan for staffing.
No one knows your organization better than you do. Before you dive into the data, you should consider what you think the answer to the question is. With this client, we guessed that the factors that affect visitation were: SEE POINTS ON SLIDE
These “predictors” informed what data we needed to use. If we want to look at how exhibits affect attendance, we need to extract the exhibit data into the “analysis dataset”
This is a first pass in an iterative process. You will learn through the process that not all of the initial factors are impactful and some predictors will be missing.
We know the question, we guessed at the answer, and we created a set of data that included the prediction (admission) and all the predictors. Now, we begin the data exploration.
The purpose of the data exploration is…
To get a basic understanding of the data; just looking at the data can be illuminating
What trends might exist?
Are there outliers? Modeling is, in many ways, an exercise in explaining the outliers.
Are you noticing data that looks odd or wrong? Data exploration can highlight data entry errors or anomalous transaction processing issues.
You may also learn, when looking at the data, that you will need to transform it for analysis. For example, creating yes/no fields for exhibits was more useful than one field that listed all the exhibits. Same with holidays, we learned that the specific holiday was more relevant than just a generic “holiday.”
The tools you use for data analysis can include simple charts and graphs in Excel and more sophisticated tools that use statistical and data mining algorithms
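The yes/no transformation described above can be sketched in a few lines of Python. This is only an illustration: the row layout, the field names, the `to_flags` helper and the "Dinosaurs" exhibit are assumptions made for the example, not the client's actual data.

```python
# Hypothetical raw rows: one text field listing exhibits, one field naming the holiday.
raw_days = [
    {"date": "2012-07-04", "exhibits": "Titanic", "holiday": "Fourth of July"},
    {"date": "2012-07-07", "exhibits": "Titanic;Dinosaurs", "holiday": ""},
    {"date": "2012-09-03", "exhibits": "", "holiday": "Labor Day"},
]

EXHIBITS = ["Titanic", "Dinosaurs"]          # "Dinosaurs" is a made-up exhibit name
HOLIDAYS = ["Fourth of July", "Labor Day"]


def to_flags(row, exhibit_names, holiday_names):
    """Turn list-style text fields into one 1/0 column per exhibit and per holiday."""
    on_floor = {name.strip() for name in row["exhibits"].split(";") if name.strip()}
    flags = {"date": row["date"]}
    for name in exhibit_names:
        flags["exh_" + name] = 1 if name in on_floor else 0
    for name in holiday_names:
        flags["hol_" + name] = 1 if row["holiday"] == name else 0
    return flags


analysis_rows = [to_flags(row, EXHIBITS, HOLIDAYS) for row in raw_days]
for row in analysis_rows:
    print(row)
```

Each exhibit and each specific holiday gets its own column, so the model can estimate a separate effect for each one instead of a single generic "holiday" or "exhibit" effect.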
We also looked at a probability plot – this uses statistics (standard deviation) to help highlight outliers
The blue line shows what is expected. You immediately see the curve.
On the high end there were some outliers and we began to explain those with the free days.
We also discovered that “outreach” days were skewing the data (these were days where attendees from schools, etc. were added as admissions)
There are a lot of outliers on the low end, and it turns out those are days when the museum is closed. While it may seem obvious, it was in the data and we needed to go through a step to find it and tag it. We also noticed some “closed” days where there are admissions. It turns out these were data entry anomalies that need to be removed. With each step the analysis data gets better and better, and it has the ancillary benefit of highlighting some opportunities to improve processing.
It is essential that you identify outliers. “Outliers increase the failures of mode” – in other words, they will mess up your model – they can pull it down, prop it up, or actually cause it to completely flip.
To find them, as we’ve shown you here, you use statistics, graphs, and common sense.
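As a rough sketch of the "statistics" part, the snippet below standardizes a series and flags anything more than three standard deviations from the mean. The admissions numbers are invented, and the check runs on raw values for simplicity, whereas the plots in this deck use standardized residuals from the model.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=3.0):
    """Return the indexes whose standardized value (z-score) exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]


# Hypothetical daily admissions: 19 ordinary days plus one free "open house" day.
admissions = [1100, 1250, 980, 1300, 1150, 1050, 1200, 1000, 1120, 1180,
              1230, 990, 1080, 1210, 1140, 1060, 1190, 1010, 1090, 9000]

for index in flag_outliers(admissions):
    print("Check day", index, "with admissions of", admissions[index])
```

A flagged day is a prompt to ask why, not an instruction to delete it. That is where the graphs and common sense come in.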
So let’s talk about the models we built for this client. We chose to begin by using linear regression.
We added our predictors and ran a first pass
Day of Week
Month
Exhibits
If it was a Holiday
If it was a weekend of a Holiday
We looked at the graphs and assessed the indicators.
The blue line represents what the model predicts; the red dots are the actuals. What we see is that there are a lot of “errors”
The dots on the high end are much too far away, the model is missing something…
The overall results form an S-curve; we want this to be straighter…
We don’t want to make this too technical, but I thought I’d share some of the data that is behind the model. This is where it starts to get real (and cool, imo).
Without going into the weeds too much, what this tells us is that the “Constant” is 1085, meaning that we start with that number and adjust it up and down depending on the predictor.
On a Monday you would subtract 651
On Saturday, you would add 507
In July, you add 1019
The “P” column is a statistical indicator of the relevance of the predictor. Anything .05 or greater is suspect.
We are creating a formula that will drive our model
This also told us that our model explained only 44% of the variation in the data – not good enough
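To make the "formula" concrete, here is a minimal sketch that adds up the first-pass coefficients from the slide 13 table. Only the day and month effects are copied in; the holiday and exhibit rows are left out for brevity, and the `predict_admissions` helper is an illustrative name. Sunday and the months missing from the table have no coefficient, presumably because they are the baseline folded into the constant.

```python
# Coefficients copied from the first-pass table on slide 13 (day and month rows only).
CONSTANT = 1085.08
EFFECTS = {
    "Mon": -651.48, "Tue": -650.91, "Wed": -266.8, "Thur": -308.87,
    "Fri": -56.84, "Sat": 507.88,
    "Apr": -128.2, "May": -253.93, "June": 370.0, "July": 1019.8,
    "Aug": 843.4, "Sept": -392.99, "Oct": -398.2, "Nov": -179.2,
}


def predict_admissions(active_flags):
    """Start from the constant and add the effect of every predictor that applies."""
    return CONSTANT + sum(EFFECTS[flag] for flag in active_flags)


# A Saturday in July: 1085.08 + 507.88 + 1019.80
saturday_in_july = predict_admissions(["Sat", "July"])
print(round(saturday_in_july, 2))
```

For a Saturday in July the arithmetic is 1085.08 + 507.88 + 1019.80 = 2612.76 predicted admissions under the first-pass model.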
So, we looked closer at the outliers…
We tagged the free “open house” days
We tagged the 4th of July, Christmas and New Year’s instead of making them generic “holidays”
We removed some of the less relevant predictors
We ran the analysis again
The chart certainly improved: much more of the admissions data is explained by the model, 66% to be exact.
We can still see that the predictions are weaker as the admissions get larger
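One way to shortlist "less relevant" predictors is to apply the .05 rule of thumb to the P column of the first-pass table. The sketch below does only that. The notes do not say which predictors were actually dropped, so treat the output as a list of candidates to review rather than a record of what was done.

```python
# P values copied from the first-pass table on slide 13.
P_VALUES = {
    "Mon": 0.0, "Tue": 0.0, "Wed": 0.0, "Thur": 0.0, "Fri": 0.388, "Sat": 0.0,
    "Apr": 0.412, "May": 0.011, "June": 0.001, "July": 0.0, "Aug": 0.0,
    "Sept": 0.0, "Oct": 0.0, "Nov": 0.014,
    "Holiday": 0.053, "Holiday Wkn": 0.0,
    "EXH2": 0.01, "EXH3": 0.069, "EXH4": 0.908, "EXH5": 0.01, "Active Exh+": 0.995,
}

SUSPECT_THRESHOLD = 0.05  # "anything .05 or greater is suspect"

suspect = [name for name, p in P_VALUES.items() if p >= SUSPECT_THRESHOLD]
print(suspect)
```

Note that the generic "Holiday" flag lands just over the threshold, which fits the decision to tag specific holidays instead.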
We focused on the high days that were not explained and we found a data entry error – some entries were miscoded as admissions (they were actually large video conferencing events)
We found that other high days were Friday nights where families were given discounts and special programming
We decided to add in the weather variable to see what impact that had
We cleaned those up and recreated the analysis data set
Now we see a marked improvement in the quality of the model
The outliers are now within the bounds that we are more comfortable with (all within 3 standard deviations, fwiw)
The model accurately described 85% of the admissions data – that is good enough to start making some business decisions
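The notes do not name the statistic behind the 44%, second-pass and 85% figures. Assuming it is R-squared, the usual "percent explained" measure for a regression, it could be computed as in this sketch. The actual and predicted numbers here are invented purely to show the arithmetic.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical daily admissions and model predictions.
actual = [1000, 2000, 3000, 4000]
predicted = [1200, 1900, 2800, 4100]

print(round(r_squared(actual, predicted), 2))
```

A value of 1 means the predictions match the actuals exactly; a value of 0 means the model does no better than always guessing the average day.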
Here’s what the model ends up looking like. It’s a formula that predicts – that is what a model is.
This is not the actual model for the client, but the data is fairly representative of what you might find.
This is pretty ugly and not really usable as is
There are a few ways you can represent it
You can look at models in decision trees
It moves from left to right, adjusting projected admissions based on the flow
For example, if you follow the flow of a Saturday in July all the way to the end, it will show you the predicted visitation number for that scenario
Here is a look at a snippet of the data behind the model. You can see pretty easily how you could use this to build a spreadsheet or report to project visitation based on the factors that you know.
For example, you can see that the primary factors that increase visitation are Good Friday, the month of July and a large exhibit, while the 4th of July and Tuesdays decrease it.
Holidays are impactful in both directions
Exhibits – surprisingly, we found that most exhibits are not a huge impact, but one, Titanic, was. The next question is to model why it had such an inordinate impact on visitation.
To some, this may all seem to be common sense. In some respects it is, but it is many layers of common sense interacting with each other dynamically. In practice, developing and using predictive models will always outperform a pure "common sense" approach to targeting. The reason is that good models are better able to make correct judgment calls, and simultaneously take into account multiple factors and variables.
With our “formula” in hand, we can create a 365-row spreadsheet that predicts our visitation every day of the year
You simply tag all future dates based on the variables that the model said are significant. For example, July 1, 2012 is tagged 1 for the month of July and 1 for Sunday. If you were also having an exhibit that day, you would tag it too.
To illustrate this we created one worksheet with all of the Flags for each Variable (assigned a 1 or 0 for ‘yes’ or ‘no’).
We did this one pretty manually in Excel to illustrate it. However, there are more automated/sophisticated approaches that leverage SSIS tools and our BI tool, JCA Answers.
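For readers who prefer a script to a manual worksheet, the tagging step might look something like the sketch below. It uses only a few effects from the slide 18 snippet (the constant, four month effects and the Fourth of July), so it is a partial, illustrative model. The `build_calendar` helper and the column names are assumptions, and the weekday is recorded but not scored because the snippet lists no day-of-week effects.

```python
from datetime import date, timedelta

# Partial, illustrative effects copied from the slide 18 snippet.
CONSTANT = 1124
MONTH_EFFECTS = {7: 1349, 9: -515, 10: -565, 12: -541}
FOURTH_OF_JULY_EFFECT = -1971

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def build_calendar(year):
    """Return one row per day of the year with 1/0 flags and a predicted value."""
    rows = []
    day = date(year, 1, 1)
    while day.year == year:
        is_fourth = 1 if (day.month, day.day) == (7, 4) else 0
        rows.append({
            "date": day.isoformat(),
            "weekday": WEEKDAYS[day.weekday()],
            "is_july": 1 if day.month == 7 else 0,
            "is_fourth_of_july": is_fourth,
            "predicted_admissions": (CONSTANT
                                     + MONTH_EFFECTS.get(day.month, 0)
                                     + FOURTH_OF_JULY_EFFECT * is_fourth),
        })
        day += timedelta(days=1)
    return rows


calendar_2012 = build_calendar(2012)
by_date = {row["date"]: row for row in calendar_2012}
print(len(calendar_2012), by_date["2012-07-01"], by_date["2012-07-04"])
```

Because 2012 is a leap year, the sheet comes out at 366 rows rather than 365. With these partial effects, July 1 scores 1124 + 1349 = 2473 and July 4 scores 1124 + 1349 - 1971 = 502.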
Here is the result that shows admissions after the data has been flagged
It’s important to remember that there is no such thing as a perfect prediction – all predictions have error (aka ‘residual’).
The model will tell you what your error is and you can look at it using a “line fit plot”
It shows the Predicted Admissions with the error range.
Of course, as we said, the model explains 85% of the causes; that still leaves 15% unexplained. As we continue to identify and explain outliers, the predictions will improve.
In the chart above, you can see the actuals (orange line), compared to the predictions. What is the reason the two highlighted areas were so outside the prediction? Was it a variable that we didn’t include, like weather? Are the data for these days complete and accurate (perhaps we didn’t flag an event)? In following up, we will note the larger deviations and seek to identify them. And the model will improve.
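The follow-up described here, noting the larger deviations so they can be investigated, could be automated along these lines. The predicted and actual numbers and the tolerance of 500 are made up for illustration; in practice the tolerance would come from the error range the model reports.

```python
def large_deviations(predicted, actual, tolerance):
    """Return (day, residual) pairs where actual minus predicted is larger than the tolerance."""
    flagged = []
    for day, (p, a) in enumerate(zip(predicted, actual), start=1):
        residual = a - p
        if abs(residual) > tolerance:
            flagged.append((day, residual))
    return flagged


# Hypothetical predictions and actuals for the first five days of a month.
predicted = [2473, 2100, 2200, 502, 2300]
actual = [2600, 2050, 3400, 450, 1500]

to_investigate = large_deviations(predicted, actual, tolerance=500)
print(to_investigate)
```

Each flagged day becomes a question for the team: was there an unflagged event, a weather effect, or a data entry problem?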
Knowing what your constant is and knowing the numeric effects of time of year, day of week, holidays, exhibits, etc., you can literally plug your yearly plan into a spreadsheet and see the projected visitation.
So now that we have a pretty strong model that can predict visitation, what do we do with it?
Everything we looked at today was just a starting place.