Predictive Analysis

Presentation by Dr. Peter Bruce, Statistics.com. Presented on April 27, 2012 at the MRA Spring Research Symposium hosted by the Mid-Atlantic Chapter of the Marketing Research Association.

Published in: Business, Technology, Education

Transcript

  • 1. Predictive Analytics Peter Bruce THE INSTITUTE FOR STATISTICS EDUCATION at Statistics.com peter.bruce@statistics.com
  • 2. About Statistics.com THE INSTITUTE FOR STATISTICS EDUCATION • 100+ courses, introductory and advanced • Traditional statistics, data mining, machine learning, text mining, clinical trials, optimization, use of R • All online • Typically 4 weeks, scheduled dates • No need to be online at particular times or days • Private discussion forum with instructors, who are noted authors and experts
  • 3. A man walks into a Target® store…
  • 4. Predictive Analytics • In marketing, used for model-driven, targeted sales efforts • Also: will a loan default, what is the diagnosis (given symptoms), is a tax return fraudulent, …
  • 5. Market Research • Traditionally surveys, analysis, information gathering, strategy • Moving online increases the amount of data, speeds its flow, and makes it more accessible
  • 6. Washington Post (web) • 35 different reports tracking traffic daily • Midday report “are we on track for visitors?” • # visitors from key domains - .gov, .mil, .senate or .house
  • 7. Daily Mail (UK web) • A traditional ingredient is stories about animals – tracked on web • “The animals that do best are monkeys, dogs, and cats, in that order…” Martin Clark (editor)
  • 8. Back to Target
  • 9. Predictive Analytics • Goes beyond the obvious, capturing complexity • Implemented for real-time behavior and decisions
  • 10. Pregnant? • Obvious retail clues – maternity clothes, baby food, baby clothes, crib … • These may be too late • Earlier clues not so obvious – lotions, supplements, and, esp., combinations and changes in purchase patterns • Data mining algorithms can capture these less obvious, more complex signals
  • 11. Training the Model • Bridal registry • Women of similar demographic not on bridal registry • Together, the training set – Known outcome – Purchase data over time
  • 12. Hypothetical Data
      Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
      1011    1       1       1      1      0         1         0
      1012    1       0       1      0      1         0         1
      1013    1       1       0      1      1         0         1
      1014    0       1       1      0      1         1         0
      1015    1       1       0      1      1         0         0
      1016    0       0       1      0      1         0         1
  • 13. Classification Algorithms • K-nearest neighbors (involves 3 notions) – Distance measure – Centroid – Majority vote or average
  • 14. K-NN
      Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
      1       1       1       1      1      0         1         0
      2       1       1       1      1      0         1         0
      3       1       1       0      1      0         1         0
      4       0       1       1      0      1         1         0
      5       1       1       0      1      1         0         1
      6       0       0       1      0      1         0         1
      NEW     1       0       1      1      0         1         ?
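The K-NN table above can be worked through in a few lines of Python. This is only a sketch under stated assumptions: the slides do not say which distance measure or which value of k was used, so the Hamming distance (number of differing features) and k = 3 are illustrative choices, and the function names are hypothetical. The sketch covers the distance measure and the majority vote, not the centroid notion listed on slide 13.

```python
from collections import Counter

# Rows from the slide's K-NN table: six binary purchase features, then the Registry label.
TRAINING = [
    ((1, 1, 1, 1, 0, 1), 0),
    ((1, 1, 1, 1, 0, 1), 0),
    ((1, 1, 0, 1, 0, 1), 0),
    ((0, 1, 1, 0, 1, 1), 0),
    ((1, 1, 0, 1, 1, 0), 1),
    ((0, 0, 1, 0, 1, 0), 1),
]
NEW = (1, 0, 1, 1, 0, 1)  # the unlabeled customer from the last row


def hamming(a, b):
    """Count the features on which two customers differ (assumed distance measure)."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(training, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    # sorted() is stable, so rows tied on distance keep the table's row order.
    nearest = sorted(training, key=lambda row: hamming(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict(TRAINING, NEW, k=3)
print(prediction)  # prints 0: customers 1, 2 and 3 are nearest and all have Registry = 0
```

With these assumptions the new customer sits at distance 1 from customers 1 and 2 and distance 2 from customer 3, so the vote is unanimous. A different distance measure or k could change the result on other data.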
  • 15. Classification Algorithms, cont. • Logistic Regression • CART • Discriminant Analysis • Neural Network • Naïve Bayes
  • 16. The Overfit Problem [Figure: chart with axes labeled Revenue and Expenditure]
  • 17. Complex function - overfit [Figure: chart with axes labeled Revenue and Expenditure]
  • 18. Therefore: Validate the Model • Partition the original data – Training – Validation • Fit the model to the training data • Assess performance using the validation data
  • 19. Performance Metrics • Continuous – RMSE • Categorical (often binary) – % accurate (confusion matrix) – Lift
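For the continuous case, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values. A minimal sketch follows; the function name and the sample revenue figures are made up for illustration and do not come from the presentation.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical validation records: actual revenue vs. the model's predictions.
actual_revenue = [100, 200, 300]
predicted_revenue = [110, 190, 300]
print(round(rmse(actual_revenue, predicted_revenue), 3))
```

RMSE is in the same units as the outcome variable, so it is easiest to interpret when computed on the validation partition and compared across candidate models.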
  • 20. Confusion Matrix and Cutoff Control • Training data scoring, summary report • Cutoff probability value for success (updatable): 0.5 • Classification confusion matrix:
                        Predicted 1   Predicted 0
      Actual class 1    43            8
      Actual class 0    6             247
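The counts in the confusion matrix above translate directly into summary rates. The sketch below first shows how a probability cutoff turns scores into class predictions, then computes accuracy, sensitivity and specificity from the slide's four counts. The function names, the toy probabilities, and the choice to treat a probability equal to the cutoff as class 1 are assumptions made for illustration.

```python
def confusion_counts(actual, probabilities, cutoff=0.5):
    """Tally (tp, fn, fp, tn), predicting class 1 when probability >= cutoff."""
    tp = fn = fp = tn = 0
    for y, p in zip(actual, probabilities):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def summarize(tp, fn, fp, tn):
    """Overall accuracy plus the per-class rates for a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual 1s caught
        "specificity": tn / (tn + fp),  # share of actual 0s caught
    }


# Toy scores (hypothetical) showing how moving the cutoff shifts the counts.
toy_actual = [1, 1, 0, 0, 1]
toy_probs = [0.9, 0.4, 0.6, 0.1, 0.7]
print(confusion_counts(toy_actual, toy_probs, cutoff=0.5))
print(confusion_counts(toy_actual, toy_probs, cutoff=0.3))

# Counts taken from the slide's matrix at cutoff 0.5.
slide_summary = summarize(tp=43, fn=8, fp=6, tn=247)
print({name: round(value, 3) for name, value in slide_summary.items()})
```

Lowering the cutoff catches more actual 1s at the cost of more false alarms, which is why the cutoff is presented as something the analyst can update.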
  • 21. Lift • When classifying “pregnant” vs. “not-pregnant,” labeling everyone “not-pregnant” gives very high overall accuracy • Need a metric that reflects the greater importance of the rare “pregnant” category • Lift is the model’s improvement over average random selection
  • 22. Decile Lift Chart [Figure: decile-wise lift chart (validation dataset); x-axis: Deciles, 1 to 10; y-axis: Decile mean / Global mean, 0 to 6]
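The quantity plotted in a decile-wise lift chart can be computed as follows: rank the validation records by predicted score, cut the ranking into ten equal groups, and divide each group's mean outcome by the global mean. The sketch below is one possible implementation; the function name and the 20 sample records are hypothetical and are not the data behind the chart.

```python
def decile_lift(actual, scores, n_bins=10):
    """Return mean(actual) per score-ranked bin divided by the global mean."""
    if len(actual) != len(scores) or len(actual) < n_bins:
        raise ValueError("need equal-length inputs with at least n_bins records")
    global_mean = sum(actual) / len(actual)
    if global_mean == 0:
        raise ValueError("lift is undefined when no record has a positive outcome")
    # Rank outcomes from the highest predicted score to the lowest.
    ranked = [y for _, y in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    lifts = []
    for i in range(n_bins):
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        lifts.append(sum(chunk) / len(chunk) / global_mean)
    return lifts


# Hypothetical validation set: 20 records, 4 positives, scores already distinct.
sample_scores = list(range(20, 0, -1))
sample_actual = [1, 1, 1, 0, 1, 0] + [0] * 14
sample_lifts = decile_lift(sample_actual, sample_scores)
print([round(value, 2) for value in sample_lifts])
```

A lift of 5 in the first decile means the top-scored tenth of records contains five times as many positives as a random tenth would on average. A model with no predictive power would hover around 1 in every decile.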
  • 23. Validate the Model • Compare one model to another • Avoid overfit • Solution: apply model to hold-out sample – Assess performance of different models – Fine tune parameters of individual models
  • 24. Partitioning • Randomly split the initial data into 2 or 3 groups – Training – Validation – Test • Repeated use of validation data to compare and fine tune models -> overfit to validation, in addition to training – “Test” partition used only once, at the end
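A random three-way partition can be sketched in a few lines. The slides do not specify split proportions, so the 50/30/20 fractions, the fixed seed, and the function name below are assumptions chosen only to make the example reproducible.

```python
import random


def partition(records, fractions=(0.5, 0.3, 0.2), seed=0):
    """Shuffle records and split them into training, validation and test lists."""
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three numbers summing to 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains; used only once, at the end
    return training, validation, test


customer_ids = list(range(1000, 1100))     # 100 hypothetical customer numbers
training, validation, test = partition(customer_ids)
print(len(training), len(validation), len(test))
```

Models are fit on the training list and compared or tuned on the validation list. The test list stays untouched until a final model is chosen, so its performance estimate is not contaminated by the tuning.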
  • 25. Software • SAS Enterprise Miner $$$$ • IBM SPSS Modeler (Clementine) $$$$ • XLMiner (Excel add-in) $ • Statistica Data Miner $$ • Salford Systems $$ • RapidMiner $$ (free open-source version available) • R (open source, free)
  • 26. Data Mining - More • Clustering (segmentation) • Profiling (explanatory models) • Time series • Affinity (recommender systems) • Text analytics (NLP, sentiment analysis)
  • 27. Skill Shortage • McKinsey “Big Data” report – Supply gap of 140,000-190,000 “deep analytical talent” • Emergence of “Analytics” masters programs (Northwestern, NC State, …)
