My presentation on Data Mining, Lessons from Competitions, and Public Data looks at the evolution of Data Mining, Data Science, and Big Data; reviews lessons from KDD Cup 1997, the Netflix Prize, and Kaggle; presents a big list of public and government data APIs, marketplaces, portals, and platforms; and examines Big Data hype. This talk was given at BPDM-2013 (Broadening Participation in Data Mining) on Aug 10, 2013, held at KDD-2013 in Chicago.
8. Analytics surges in 2006, after Google Analytics introduced
(c) KDnuggets 2013
[Chart: Google Trends for "analytics", Jan 2005 - July 2013]
- Google Analytics introduced, Dec 2005
- Slow-down in analytics in 2012?
- "analytics - google" is 50% of "analytics" searches
12. 1st Data Mining Competition: KDD-CUP 1997
- Organized by Ismail Parsa (then at Epsilon)
- Task: given data on past responders to fund-raising, predict most likely responders for new campaign
- Data:
  • Population of 750K prospects, 300+ variables
  • 10K (1.4%) responded to a broad campaign mailing
  • Competition file was a stratified sample of 10K responded, 26K non-resp. (28.7% response rate)
- Big effort on leaker detection (false predictors)
KDD Cup was almost cancelled, several times
Charles Elkan found leakers in training data
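The effect of the stratified sample on the response rate can be checked with a few lines of arithmetic. The sketch below uses only the rounded counts from the slide (750K, 10K, 26K); the exact file sizes are not given, so the results come out near 1.3% and 27.8% rather than matching the slide's 1.4% and 28.7% exactly. The variable and function names are illustrative.

```python
# Rounded counts taken from the slide; the exact file sizes are not given.
population = 750_000      # prospects in the full mailing
responders = 10_000       # responded to the broad campaign (slide: 1.4%)
sample_nonresp = 26_000   # non-responders kept in the competition file

def rate(hits, total):
    """Response rate as a percentage."""
    return 100.0 * hits / total

population_rate = rate(responders, population)
sample_rate = rate(responders, responders + sample_nonresp)

# How much more common responders are in the sample than in the population.
oversampling = sample_rate / population_rate

print(f"population response rate: {population_rate:.1f}%")
print(f"competition file response rate: {sample_rate:.1f}%")
print(f"responders over-represented by a factor of {oversampling:.0f}")
```

Keeping all responders and only a fraction of non-responders makes the rare class much easier to model, but any score fitted on such a file reflects the sample's response rate, not the population's.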
13. Evaluating Targeted List: Cumulative Pct Hits (Gains)
[Chart: Cumulative % Hits (y-axis) vs. Pct list (x-axis), Model curve vs. Random line]
5% of random list have 5% of targets, but 5% of model ranked list have 21% of targets
Cum Pct Hits (5%, model) = 21%.
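A minimal sketch of how Cumulative Pct Hits can be computed: rank the list by model score, take the top slice, and count what share of all targets falls in it. The function name `cumulative_pct_hits` and the synthetic data are assumptions for illustration; this is not the KDD Cup data or the organizers' evaluation code.

```python
import random

def cumulative_pct_hits(scores, targets, pct_list):
    """Percent of all targets captured in the top pct_list% of the ranked list."""
    # Rank records from highest to lowest model score.
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * pct_list / 100)
    hits_in_top = sum(t for _, t in ranked[:cutoff])
    total_hits = sum(targets)
    return 100.0 * hits_in_top / total_hits

# Synthetic, made-up data: 1,000 records, about 10% targets, and a noisy
# score that tends to be higher for targets. Not the KDD Cup data.
rng = random.Random(0)
targets = [1 if rng.random() < 0.10 else 0 for _ in range(1000)]
scores = [t + rng.gauss(0, 1) for t in targets]

for pct in (5, 10, 40):
    model = cumulative_pct_hits(scores, targets, pct)
    print(f"top {pct}% of list: model {model:.0f}% of targets, random about {pct}%")
```

A randomly ordered list captures roughly the same percentage of targets as the percentage of the list taken, which is the diagonal "Random" line; a useful model sits above it.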
14. KDD-CUP Participant Statistics
- 45 companies/institutions participated
  • 23 research prototypes
  • 22 commercial tools
- 16 contestants turned in their results
  • 9 research prototypes
  • 7 commercial tools
- Evaluation: Best Gains (CPH) at 40% and 10%
- Joint winners:
  • Charles Elkan (UCSD) with BNB, Boosted Naive Bayesian Classifier
  • Urban Science Applications, Inc. with commercial Gain, Direct Marketing Selection System
  • 3rd place: MineSet (SGI, Ronny Kohavi)
15. KDD-CUP Results Discussion
- Top finishers very close
- Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and 3rd place MineSet)
- Naïve Bayes tools did little data preprocessing, used small number of variables
- Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results
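To illustrate why Naïve Bayes needs so little preprocessing, here is a minimal sketch of a plain categorical Naïve Bayes classifier with add-one smoothing. It is not the boosted BNB variant nor MineSet's implementation, neither of which is described in these slides; the class name, the two features (past donor, region), and the toy rows are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.n_features = len(rows[0])
        # value_counts[(label, feature index)][value] -> count
        self.value_counts = defaultdict(Counter)
        # Distinct values seen per feature, used for smoothing.
        self.values = [set() for _ in range(self.n_features)]
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.value_counts[(label, j)][value] += 1
                self.values[j].add(value)
        return self

    def log_scores(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for j, value in enumerate(row):
                seen = self.value_counts[(label, j)][value]
                # Smoothed P(value | label), treating features as independent.
                score += math.log((seen + 1) / (count + len(self.values[j])))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)

# Hypothetical toy data: (past_donor, region) -> responded (1) or not (0).
rows = [("yes", "urban"), ("yes", "rural"), ("no", "urban"),
        ("no", "rural"), ("no", "urban"), ("yes", "urban")]
labels = [1, 1, 0, 0, 0, 1]

model = NaiveBayes().fit(rows, labels)
print(model.predict(("yes", "rural")))
print(model.predict(("no", "urban")))
```

Training is just counting, so raw categorical variables can be used directly; the log scores can also serve as ranking scores for a gains evaluation.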
17. KDD Cup 1997 - worst results
Note that the worst result (C6) was actually worse than random.
Competitor names were kept anonymous, apart from top 3 winners
37. Big Data
• 2nd Industrial Revolution
• Do old activities better
• Create new activities/businesses
38. Doing Old Things Better
Application areas
- Direct marketing/Customer modeling
- Churn prediction
- Recommendations
- Fraud detection
- Security/Intelligence
- …
• Improvement will be real, but limited because of human randomness
• Competition will level companies
39. Big Data Enables New Things!
- Google: first big success of big data
- Social networks (Facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data
- Location analytics
- Health-care
  • Personalized medicine
- Semantics and AI?
  • Imagine IBM Watson, Google Now, Siri in 2023?