1. Modeling and Analysis for the Non-Statistician Presented by: Andrew Curtis, Vice President; Richard Pless, Consultant
2. 1 Models are developed using a six-step process. % Effort: 1. Research Design 10% 2. Data Checking and Variable Creation 30% 3. Create Analysis Files 30% 4. Calibrate Scoring Model 10% 5. Model Evaluation 10% 6. Model Implementation 10% 1. Research Design
3. 2 Research design requires the input of both marketers and analysts. Is the problem solvable through modeling? Do we have representative promotions from which to develop a model? Do we need to be concerned about selection bias? Will we be able to pull all the information we need to score the model off of our database in a timely manner? 1. Research Design
4. 3 Research Design--Unsolvable Problems. Prospecting models for a niche marketer: some lists work really well; all others are unprofitable, even in the first decile. Finding all prospective buyers: it is impossible to accurately predict all behavior, and all models leave some revenue on the table. 1. Research Design
5. 4 Research Design--Unrepresentative Promotions. Album promotion during a major tour. Retail sale announcement during major clearance. Veterans magazine solicitation during the Gulf War. 1. Research Design
6. 5 Research Design--Selection Bias. The model is built off a series of mailings for business-appropriate suits, dresses, and accessories. The mailings were mailed to women only. If the resulting model is put into production without the gender pre-screen, then males will end up getting contacted, probably quite unprofitably. 1. Research Design
7. 6 Research Design--Timely Scoring Data. The model looks for number of Web applicants from a given ZIP code in the prior week but the data can only be pulled monthly. At best, the model can only be scored accurately once a month. The predictor which uses the information is ineffective. 1. Research Design
8. 7 Rule #1: Garbage In, Garbage Out! Bad Data In, Bad Models Out! Analysis is only as good as the data being analyzed. All input data must be checked for reasonableness, timeliness, and completeness. When information is extracted from multiple sources, verify that all data are appended to the “master file” appropriately. You must engage in on-going quality control! 2. Data Checking
9. 8 Study and scrutinize the data dictionary! Understand every field in the database. Eliminate fields that are too new, poorly filled, or unreliable. Look at distributions of values for each field. Know what every field means. Understand every value in the field. If there is a “Z”, find out what “Z” means. Work with finance to define the business rules for properly counting orders, revenue, and other business drivers. 2. Data Checking
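A field audit like the one described above can be sketched in a few lines. This is a minimal illustration, not part of the deck; the records, the `gender` field, and the mystery "Z" code are hypothetical.

```python
from collections import Counter

def audit_field(records, field, valid_values):
    """Tally every value seen in a field and flag codes not in the data dictionary."""
    counts = Counter(r.get(field) for r in records)
    unexpected = {v: n for v, n in counts.items() if v not in valid_values}
    return counts, unexpected

# Hypothetical customer records containing an undocumented "Z" code
records = [{"gender": "F"}, {"gender": "M"}, {"gender": "Z"}, {"gender": "F"}]
counts, unexpected = audit_field(records, "gender", {"F", "M"})
print(unexpected)  # {'Z': 1} -> find out what "Z" means before modeling
```

Running an audit like this on every field surfaces the undocumented codes before they silently enter a model.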
10. 9 Clean the data when appropriate. Models are driven by underlying data patterns. Bad patterns lead to bad models. Correct data/variables with: Anomalies Missing values Outliers Errors. 2. Data Checking
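The cleaning steps listed above (missing values, outliers) can be sketched as a simple impute-and-cap pass. This is an illustrative sketch; the default of 0 and the cap of 30 trips are hypothetical thresholds, not a recommendation from the deck.

```python
def clean_numeric(values, default, cap):
    """Fill missing values with a default and cap high-end outliers."""
    cleaned = []
    for v in values:
        if v is None:                    # missing value -> impute a default
            v = default
        cleaned.append(min(v, cap))      # outlier -> winsorize at the cap
    return cleaned

# Shopping trips per year; 100 is the store-employee outlier from the deck
trips = [3, None, 100, 7]
print(clean_numeric(trips, default=0, cap=30))  # [3, 0, 30, 7]
```

In practice the default and cap should come from the field's own distribution and the business rules defined with finance.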
13. 12 Data Checking--Outliers Example The “Michael Jordan” example. Individual credit card holders with $200,000 lines of credit. The department store employee with 100 shopping trips a year. 2. Data Checking
14. 13 Data Checking--Errors pose a tremendous risk for the modeler. Commonly Occurring Errors: Response data from a prior mailing incorrectly matched back to the customer file. Changes in meaning or usage of a particular variable. Alpha characters in supposedly numeric variable fields. 2. Data Checking
15. 14 Variable creation captures the dynamics of the business. Use creativity to create predictor variables. Predictor variables typically come in three classes: Recency—the time elapsed since an action. Frequency—the number of times an event has happened, e.g. orders placed, web pages clicked, etc. Monetary—the amount of money spent purchasing goods and services. Use ratios and cross variables to identify meaningful interactions between variables. 2b. Variable Creation
16. 15 Predictor Variable Creation--Example Monetary Sum of Revenue = $500 Frequency Count Order Dates= 6 Orders Recency (11/14/01 – 8/17/01) = 89 Days or 3 Months! 2b. Variable Creation
17. 16 Predictor Variable Creation--Example Average Order Size = $500 / 6 Orders Total Books = 4 Total DVDs = 1 Total Electronics = 1 Percent Gift Purchases = 2 / 6 = 33% Recency in Books (11/14/01 – 6/1/01) = 166 Days or 5.5 Months! 2b. Variable Creation
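The RFM variables in the two example slides above can be derived mechanically from an order history. The order-level records below are hypothetical, chosen so that they reproduce the slides' totals ($500 over 6 orders, last order 8/17/01, last book order 6/1/01).

```python
from datetime import date

# Hypothetical order history chosen to match the slide's totals
orders = [
    {"date": date(2001, 1, 10), "amount": 50,  "category": "book",        "gift": False},
    {"date": date(2001, 3, 5),  "amount": 60,  "category": "book",        "gift": True},
    {"date": date(2001, 4, 20), "amount": 70,  "category": "book",        "gift": False},
    {"date": date(2001, 6, 1),  "amount": 80,  "category": "book",        "gift": False},
    {"date": date(2001, 7, 15), "amount": 90,  "category": "dvd",         "gift": True},
    {"date": date(2001, 8, 17), "amount": 150, "category": "electronics", "gift": False},
]
as_of = date(2001, 11, 14)  # scoring date used on the slide

monetary  = sum(o["amount"] for o in orders)                   # 500
frequency = len(orders)                                        # 6
recency   = (as_of - max(o["date"] for o in orders)).days      # 89 days
avg_order = monetary / frequency                               # 83.33
book_recency = (as_of - max(o["date"] for o in orders
                            if o["category"] == "book")).days  # 166 days
pct_gift  = sum(o["gift"] for o in orders) / frequency         # 2/6 = 33%
```

Category-level recency (e.g. "Recency in Books") is just the overall recency calculation restricted to one category's orders.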
18. 17 Selecting a Target Variable. Make sure your target variable will give you the type of results you want. Measuring response: you may get a lot of hand-raisers that are not profitable. Measuring profit: by focusing only on the dollars, you may miss a viable low-profit group. Isolate all information gathered during the target period from being included as a predictor variable. 2b. Variable Creation
19. 18 Analysis files have three time frames: Predictor Period—The time before individuals are selected for a marketing contact. All predictor variables must contain only data from this period. Gap Period—The time between the selection date and when the first response is recorded. Target Period—The time between the first and last response date. All target variables must only contain information from this period. [Timeline diagram: Predictor Period | Gap Period | Target Period, bounded by the Selection Date, First Response Date, and Last Response Date.] 3. Create Analysis Files
20. 19 Good models are developed with modeling and validation samples. Before modeling begins, split the analysis file into two random subsets: modeling and validation. Develop the model using only the modeling subset. Test the robustness and accuracy of the model using the validation subset. Techniques exist for handling validation when analysis sample is too small to split. 3. Create Analysis Files
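The modeling/validation split described above can be sketched with a seeded shuffle. This is a minimal stand-alone version; the 50/50 fraction and the seed are illustrative choices, not recommendations from the deck.

```python
import random

def split_analysis_file(rows, modeling_fraction=0.5, seed=42):
    """Randomly split the analysis file into modeling and validation subsets."""
    rows = list(rows)
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    rng.shuffle(rows)
    cut = int(len(rows) * modeling_fraction)
    return rows[:cut], rows[cut:]

modeling, validation = split_analysis_file(range(1000))
print(len(modeling), len(validation))  # 500 500
```

The model is then fit only on `modeling`, and its lift/gains performance is confirmed on `validation` before anything goes to production.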
21. 20 The appropriate modeling technique is driven by several factors. The nature of the target variable. The software that is supported in the production environment. The skills of the analytical team. 4. Model Calibration
22. 21 No modeling technique should operate on autopilot. The analyst developing the model must: Know how to use the modeling technique. Know how to interpret the results. Know a “cringe variable” when they see one. Know how the model will be used by the marketers. Without a pilot, even the most sophisticated plane will crash. 4. Model Calibration
23. 22 Scoring models can be built using many different techniques. Linear regression Logistic regression Discriminant analysis Neural networks Many, many more... All can be used as predictors of future behavior. 4. Model Calibration
24. 23 Model Calibration Rule #1 If you want to get famous, talk about technique. If you want a great model, concentrate on “the other 90 percent.” 4. Model Calibration
25. 24 Corollary to Rule #1 Regardless of your technique of choice, if you short-change “the other 90 percent,” you will probably end up with a lousy model. 4. Model Calibration
26. 25 Construction analogy: throw several power tools onto a pile of lumber, come back in a month, and, presto, you will NOT have a house. 4. Model Calibration
27. 26 Linear Regression is best suited for continuous outcomes, such as sales. Output can be understood by non-statisticians. Each name is assigned an estimated value. Scored population is easily ranked with respect to the target variable (sales, profits, etc.). Does not automatically identify interactions between predictor variables. 4. Model Calibration
28. 27 Linear Regression Example
Scoring Model for Predicting Monthly Revenue
Score = 0.08 + 0.06 * House Value (Estimated in $Thousands) - 0.20 * Number of Children + 0.10 * Average Credit Card Limit (in $Thousands) - 0.30 * Number of Autos

                 John       Jennifer   YOU
House Value?     $150,000   $125,000
No. of Kids?     2          0
Ave Limit?       $15,000    $8,000
No. of Cars?     2          1
Score            $9.58      $8.08
4. Model Calibration
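The slide's linear scoring equation can be checked by plugging in John's and Jennifer's values directly (house value and credit limit entered in $thousands, as the equation specifies):

```python
def revenue_score(house_value_k, n_children, avg_limit_k, n_autos):
    """Slide's linear model; house value and credit limit are in $thousands."""
    return (0.08 + 0.06 * house_value_k
                 - 0.20 * n_children
                 + 0.10 * avg_limit_k
                 - 0.30 * n_autos)

print(round(revenue_score(150, 2, 15, 2), 2))  # John: 9.58
print(round(revenue_score(125, 0, 8, 1), 2))   # Jennifer: 8.08
```

Both results match the slide, and the same function scores every name on the file, which is what makes ranking the population straightforward.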
29. 28 Logistic regression is best suited for binary outcomes, such as buy/no buy. Output can be understood by non-statisticians. Each name is assigned a probability of displaying the expected outcome; this is an estimated likelihood, NOT a guaranteed prediction of future performance. Scored population is easily ranked with respect to likelihood of displaying the targeted behavior. Does not automatically identify interactions between predictor variables. 4. Model Calibration
30. 29 Logistic Regression Example
Scoring Model for Predicting Likelihood to Purchase (Yes/No)
Score = 0.01 + 0.04 * Person Owns Home (1=yes, 0=no) - 0.05 * Number of Credit Cards + 0.01 * Income (Estimated in $Thousands) - 0.02 * Age
Probability = 1 / [1 + exp(-Score)]

                 John               Jennifer           YOU
Owns Home?       No                 Yes
No. of Cards?    6                  3
Income?          $40,000            $25,000
Age?             45                 35
Score            -0.79 (prob=31%)   -0.55 (prob=37%)
4. Model Calibration
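The slide's logistic scores and probabilities can be verified the same way (income in $thousands, home ownership coded 1/0):

```python
import math

def purchase_score(owns_home, n_cards, income_k, age):
    """Slide's logistic model; income in $thousands, owns_home is 1/0."""
    return 0.01 + 0.04 * owns_home - 0.05 * n_cards + 0.01 * income_k - 0.02 * age

def probability(score):
    """Logistic transform: maps any score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

john = purchase_score(0, 6, 40, 45)       # -0.79
jennifer = purchase_score(1, 3, 25, 35)   # -0.55
print(round(probability(john), 2))        # 0.31
print(round(probability(jennifer), 2))    # 0.37
```

Note that the logistic transform preserves the ranking of the raw scores, so ordering names by score or by probability gives the same deciles.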
31. 30 Neural networks can be used with either binary or continuous targets. No restrictions on the type or structure of either the target variable or the historical variables. Can more easily capture interactions between predictor variables. Output is very difficult to explain. Implementation can be difficult. Models don’t always outperform traditional regression. 4. Model Calibration
32. 31 When done well, scoring models are smooth, with few, if any, clumps. Target behaviors of the scored names distribute on a “Gains Table” smoothly from highest to lowest. This makes it easier to target a precise number of names, or to select down to a precise threshold of response or profit. 5. Model Evaluation
33. 32 Understanding the Lift Table Start by ranking all customers by their descending scores and observing the number of responders in each “decile.” 5. Model Evaluation
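The rank-and-count step described above can be sketched directly. The toy data here is fabricated for illustration (only the 10 highest-scoring customers respond), so the first decile captures all the responders.

```python
def lift_table(scored, n_bins=10):
    """Rank (score, responded) pairs by descending score; count responders per decile."""
    ranked = sorted(scored, key=lambda r: -r[0])
    size = len(ranked) // n_bins
    table = []
    for d in range(n_bins):
        chunk = ranked[d * size:(d + 1) * size]
        table.append({"decile": d + 1,
                      "mailed": len(chunk),
                      "responders": sum(resp for _, resp in chunk)})
    return table

# Toy data: only the 10 highest-scoring customers responded
scored = [(i / 100, 1 if i >= 90 else 0) for i in range(100)]
table = lift_table(scored)
print(table[0]["responders"], table[1]["responders"])  # 10 0
```

On real data the responders thin out gradually down the deciles; a smooth, monotonic decline is the "few clumps" pattern the previous slide asks for.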
41. 40 With cost figures, the gains table can be expanded to show profit. 5. Model Evaluation
42. 41 In this example, profit peaks around a mail quantity of 60,000. 5. Model Evaluation
43. 42 The production algorithm translates the model into the production environment. The model is worthless without proper implementation. Goal: create identical production and model algorithms. Involve the production people. Involve the marketers. 6. Model Implementation
44. 43 Quality control procedures ensure the model is applied correctly every time. Develop audit trail reports that highlight potential problems. Look for model degradation over time. Develop mini-profiles of each scoring decile and compare over time. 6. Model Implementation
45. 44 Testing should always be done to continually validate assumptions. The key to determining a direct-marketing model's success is tracking the results of its use in-market. Each cell must be measured as well as the overall results. For scoring models, this means that “cells” must be created, usually deciles or percentiles. Each group is marked and tracked, and performance can be compared across groups and against expectations. 6. Model Implementation
46. 45 Focus not only on overall performance, but also at the margin. If you are losing money at the margin, too many unprofitable names are being contacted. If you are making money at the margin, you may be leaving profits on the table. Common sense and company policy will guide you to a target marginal ROI. 6. Model Implementation
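The marginal view above can be sketched as a per-decile profit calculation. The mail cost, revenue per response, and decile counts below are hypothetical numbers invented for illustration, not figures from the deck's gains table.

```python
def marginal_profit(gains, cost_per_piece, revenue_per_response):
    """Profit each successive decile adds; stop mailing where it turns negative."""
    return [row["responders"] * revenue_per_response - row["mailed"] * cost_per_piece
            for row in gains]

# Hypothetical gains-table deciles: $0.50 mail cost, $30 revenue per response
gains = [{"mailed": 10000, "responders": 300},
         {"mailed": 10000, "responders": 150},
         {"mailed": 10000, "responders": 60}]
print(marginal_profit(gains, 0.50, 30))  # [4000.0, -500.0, -3200.0]
```

Under these assumed numbers, only the first decile adds profit at the margin; the overall mailing could still look profitable even while deciles two and three lose money, which is exactly why the margin has to be examined separately.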