Predictive Analysis

Presentation by Dr. Peter Bruce, Statistics.com. Presented on April 27, 2012 at the MRA Spring Research Symposium hosted by the Mid-Atlantic Chapter of the Marketing Research Association.

Published in: Business, Technology, Education

  1. 1. Predictive Analytics Peter Bruce THE INSTITUTE FOR STATISTICS EDUCATION at Statistics.com peter.bruce@statistics.com
  2. 2. About Statistics.com THE INSTITUTE FOR STATISTICS EDUCATION • 100+ courses, introductory and advanced • Traditional statistics, data mining, machine learning, text mining, clinical trials, optimization, use of R • All online • Typically 4 weeks, scheduled dates • Don’t need to be online particular times/days • Private discussion forum with instructors - noted authors & experts
  3. 3. A man walks into a Target® store…
  4. 4. Predictive Analytics • In marketing, used for model-driven, targeted sales efforts • Also used to predict: will a loan default, what is the diagnosis (given symptoms), is a tax return fraudulent, …
  5. 5. Market Research • Traditionally surveys, analysis, information gathering, strategy • Moving online increases the amount of data, speeds its flow, and makes it more accessible
  6. 6. Washington Post (web) • 35 different reports tracking traffic daily • Midday report “are we on track for visitors?” • # visitors from key domains - .gov, .mil, .senate or .house
  7. 7. Daily Mail (UK web) • A traditional ingredient is stories about animals – tracked on web • “The animals that do best are monkeys, dogs, and cats, in that order…” Martin Clark (editor)
  8. 8. Back to Target
  9. 9. Predictive Analytics • Goes beyond the obvious, capturing complexity • Implemented for real-time behavior and decisions
  10. 10. Pregnant? • Obvious retail clues – maternity clothes, baby food, baby clothes, crib … • These may be too late • Earlier clues not so obvious – lotions, supplements, and, esp., combinations and changes in purchase patterns • Data mining algorithms can capture these less obvious, more complex signals
  11. 11. Training the Model • Bridal registry • Women of similar demographic not on bridal registry • Together, the training set – Known outcome – Purchase data over time
  12. 12. Hypothetical Data
      Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
      1011    1       1       1      1      0         1         0
      1012    1       0       1      0      1         0         1
      1013    1       1       0      1      1         0         1
      1014    0       1       1      0      1         1         0
      1015    1       1       0      1      1         0         0
      1016    0       0       1      0      1         0         1
  13. 13. Classification Algorithms • K-nearest neighbors (involves 3 notions) – Distance measure – Centroid – Majority vote or average
  14. 14. K-NN
      Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
      1       1       1       1      1      0         1         0
      2       1       1       1      1      0         1         0
      3       1       1       0      1      0         1         0
      4       0       1       1      0      1         1         0
      5       1       1       0      1      1         0         1
      6       0       0       1      0      1         0         1
      NEW     1       0       1      1      0         1         ?
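
The K-NN table above can be worked through in a few lines of Python. This is a minimal sketch, not the presenter's implementation: it assumes Hamming distance (the number of differing 0/1 columns) as the distance measure and k = 3 with a simple majority vote, neither of which the slides specify. The names `hamming` and `knn_predict` are made up for illustration.

```python
from collections import Counter

# Rows copied from the slide: six 0/1 purchase indicators, then the known Registry label.
TRAINING = [
    ((1, 1, 1, 1, 0, 1), 0),
    ((1, 1, 1, 1, 0, 1), 0),
    ((1, 1, 0, 1, 0, 1), 0),
    ((0, 1, 1, 0, 1, 1), 0),
    ((1, 1, 0, 1, 1, 0), 1),
    ((0, 0, 1, 0, 1, 0), 1),
]
NEW_CUSTOMER = (1, 0, 1, 1, 0, 1)


def hamming(a, b):
    """Count the columns where two 0/1 records differ (assumed distance measure)."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(training, record, k=3):
    """Return the majority label among the k training rows closest to record."""
    ranked = sorted(training, key=lambda row: hamming(row[0], record))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


predicted = knn_predict(TRAINING, NEW_CUSTOMER, k=3)
print("Predicted Registry value for NEW:", predicted)
```

Under these assumptions the three nearest rows are customers 1, 2, and 3 (distances 1, 1, and 2), all labeled 0, so the vote for NEW is 0. A different distance measure or value of k could change the answer.
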
  15. 15. Classification Algorithms, cont. • Logistic Regression • CART • Discriminant Analysis • Neural Network • Naïve Bayes
  16. 16. The Overfit Problem [Figure: plot with axes labeled Revenue and Expenditure]
  17. 17. Complex function - overfit [Figure: the same Revenue and Expenditure plot, fitted with a complex function]
  18. 18. Therefore: Validate the Model • Partition the original data – Training – Validation • Fit the model to the training data • Assess performance using the validation data
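
The partition, fit, and assess steps can be sketched as follows. Everything here is an illustrative assumption: the data are synthetic (a straight line plus random noise, standing in for the expenditure and revenue plot), the 60/40 split is arbitrary, and the "complex" model is a deliberately extreme one that simply memorizes the training points.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns a predict function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Overly complex model: predict the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Synthetic (hypothetical) data: a linear trend plus noise.
rng = random.Random(0)
points = [(x, 200 + 1.2 * x + rng.gauss(0, 150)) for x in range(0, 1000, 10)]

# Partition: 60 records for training, 40 for validation (arbitrary split).
rng.shuffle(points)
train, valid = points[:60], points[60:]
train_x, train_y = zip(*train)
valid_x, valid_y = zip(*valid)

# Fit each model on the training data only, then score it on both partitions.
results = {}
for name, fit in (("line", fit_line), ("memorizer", fit_memorizer)):
    model = fit(train_x, train_y)
    results[name] = (
        rmse(train_y, [model(x) for x in train_x]),
        rmse(valid_y, [model(x) for x in valid_x]),
    )
    print(name, "train RMSE:", results[name][0], "validation RMSE:", results[name][1])
```

The memorizer reproduces the training data exactly, so its training error is zero, yet its validation error is not. That gap is the signal that validation data exist to expose.
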
  19. 19. Performance Metrics • Continuous – RMSE • Categorical (often binary) – % accurate (confusion matrix) – Lift
  20. 20. Confusion Matrix and Cutoff Control
      Training Data scoring - Summary Report
      Cut off Prob.Val. for Success (Updatable): 0.5
      Classification Confusion Matrix
                      Predicted Class
      Actual Class    1       0
      1               43      8
      0               6       247
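
As a rough illustration of how such a report is read, the sketch below derives accuracy, sensitivity, and specificity from the four counts on the slide, then shows how moving the cutoff changes the counts. The helper names and the six-record scored sample are hypothetical; only the 43 / 8 / 6 / 247 counts come from the slide.

```python
def confusion_counts(actual, probabilities, cutoff=0.5):
    """Tally (TP, FN, FP, TN), classifying a record as 1 when probability >= cutoff."""
    tp = fn = fp = tn = 0
    for y, p in zip(actual, probabilities):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def summarize(tp, fn, fp, tn):
    """Derive common rates from the four cells of a confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual 1s that were caught
        "specificity": tn / (tn + fp),  # share of actual 0s correctly left alone
    }


# Counts from the slide: rows are actual class, columns are predicted class.
slide = summarize(tp=43, fn=8, fp=6, tn=247)
for metric, value in slide.items():
    print(f"{metric}: {value:.3f}")

# Hypothetical scored records, to show the effect of the cutoff.
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.55, 0.2, 0.1]
at_half = confusion_counts(actual, scores, cutoff=0.5)
at_quarter = confusion_counts(actual, scores, cutoff=0.25)
print("cutoff 0.50 (TP, FN, FP, TN):", at_half)
print("cutoff 0.25 (TP, FN, FP, TN):", at_quarter)
```

In the hypothetical sample, lowering the cutoff from 0.5 to 0.25 turns the one missed positive into a hit without adding false positives. On real data a lower cutoff usually trades more hits for more false alarms.
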
  21. 21. Lift • In classifying “pregnant” vs. “not-pregnant” classifying everyone as “not-pregnant” has very high overall accuracy • Need metric that reflects greater importance of the “pregnant” category, which is rare • Lift is the model’s improvement over average random selection
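
The arithmetic behind this point can be sketched with the counts from the confusion matrix on slide 20. Reusing those counts here is purely an assumption for illustration (the slides do not say they come from the pregnancy example), and "lift" is computed in one common way: the hit rate among records the model flags, divided by the overall rate of 1s.

```python
# Counts reused from the slide 20 confusion matrix, for illustration only.
tp, fn, fp, tn = 43, 8, 6, 247
total = tp + fn + fp + tn

base_rate = (tp + fn) / total              # share of records that are actually 1
all_negative_accuracy = (tn + fp) / total  # accuracy if everyone is classified as 0
hit_rate_in_flagged = tp / (tp + fp)       # share of 1s among records the model flags
lift = hit_rate_in_flagged / base_rate     # improvement over selecting at random

print(f"base rate of 1s:        {base_rate:.3f}")
print(f"all-negative accuracy:  {all_negative_accuracy:.3f}")
print(f"hit rate among flagged: {hit_rate_in_flagged:.3f}")
print(f"lift:                   {lift:.2f}")
```

Classifying everyone as 0 is right about 83% of the time on these counts yet finds none of the 1s, while the flagged group contains 1s at roughly five times the base rate. The rarer the class of interest, the more misleading overall accuracy becomes.
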
  22. 22. Decile Lift Chart [Figure: decile-wise lift chart (validation dataset); x-axis: Deciles (1 to 10); y-axis: Decile mean / Global mean (0 to 6)]
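
The quantity on the chart's vertical axis (decile mean divided by global mean) can be computed as in this sketch. The function name `decile_lift` and the 20 scored records are hypothetical; the records stand in for a validation partition already scored by some model.

```python
def decile_lift(actual, scores):
    """Rank records by score (highest first), cut into ten equal groups,
    and return each group's mean outcome divided by the overall mean."""
    ranked = [y for _, y in sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    global_mean = sum(ranked) / n
    lifts = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        lifts.append(sum(chunk) / len(chunk) / global_mean)
    return lifts


# Hypothetical validation records: model scores from 1.00 down to 0.05,
# with the actual 1s concentrated among the highest scores.
scores = [i / 20 for i in range(20, 0, -1)]
actual = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

lifts = decile_lift(actual, scores)
for decile, value in enumerate(lifts, start=1):
    print(f"decile {decile}: lift {value:.1f}")
```

A model that ranks well shows lift well above 1 in the first deciles and below 1 in the last ones. A model no better than random selection would hover around 1 throughout.
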
  23. 23. Validate the Model • Compare one model to another • Avoid overfit • Solution: apply model to hold-out sample – Assess performance of different models – Fine tune parameters of individual models
  24. 24. Partitioning • Randomly split the initial data into 2 or 3 groups – Training – Validation – Test • Repeated use of validation data to compare and fine tune models -> overfit to validation, in addition to training – “Test” partition used only once, at the end
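
A random three-way split might look like the sketch below. The 50/30/20 proportions, the fixed seed, and the `partition` helper are assumptions chosen for illustration; the slides do not prescribe particular fractions.

```python
import random


def partition(records, fractions=(0.5, 0.3, 0.2), seed=0):
    """Shuffle a copy of records and split it into training, validation, and test lists.

    Only the first two fractions set sizes; the test partition takes the remainder.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test


customer_ids = list(range(1000, 1100))  # hypothetical customer numbers
training, validation, test = partition(customer_ids)
print(len(training), len(validation), len(test))
```

Models are fit on `training` and compared or tuned on `validation`, as many times as needed. The `test` list is set aside and scored once, after the final model has been chosen.
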
  25. 25. Software • SAS Enterprise Miner $$$$ • IBM SPSS Modeler (Clementine) $$$$ • XLMiner (Excel add-in) $ • Statistica Data Miner $$ • Salford Systems $$ • RapidMiner $$ (free open-source version available) • R (open source, free)
  26. 26. Data Mining - More • Clustering (segmentation) • Profiling (explanatory models) • Time series • Affinity (recommender systems) • Text analytics (NLP, sentiment analysis)
  27. 27. Skill Shortage • McKinsey “Big Data” report – Supply gap of 140,000-190,000 “deep analytical talent” • Emergence of “Analytics” masters programs (Northwestern, NC State, …)