Public Data and Data Mining Competitions – what are the Lessons?
Gregory Piatetsky-Shapiro
KDnuggets
© KDnuggets 2013

My Data
• PhD (’84) in applying Machine Learning to databases
• Researcher at GTE Labs – started the first project on Knowledge Discovery in Databases in 1989
• Organized first 3 Knowledge Discovery and Data Mining (KDD) workshops (1989-93), cofounded Knowledge Discovery and Data Mining (KDD) conferences (1995)
• Chief Scientist at 2 analytics startups 1998-2001
• Co-founder SIGKDD (1998), Chair, 2005-2009
• Analytics/Data Mining Consultant, 2001-
• Editor, KDnuggets, 1994-, full time 2001-

Patterns – Key Part of Intelligence
• Evolution: Animals better able to find, use patterns – more likely to survive
• People have an ability and desire to find patterns
• People's “pattern intuition” does not scale
• Science is what helps separate valid from invalid patterns (astrology, fake cures, …)
Horoscope for August: The Mars-Jupiter tandem in Cancer seems to indicate a febrile activity related to the accommodation, houses, premises, real estate investments. You'll build, redecorate, move out, change your furniture, refurbish, set up your yard or garden …

Outline
• What do we call it?
• Data competitions – short history
• Government and Public Data
• Big Data Hype and Reality

What do we call it?
• Statistics
• Data mining
• Knowledge Discovery in Data (KDD)
• Business Analytics
• Predictive Analytics
• Data Science
• Big Data
• … ?
Same Core Idea: Finding Useful Patterns in Data
Different Emphasis

20th Century: Statistics dominates
[Chart: Google Ngrams frequency of “statistics”, smoothing=1]
Note: Google Ngrams are case-sensitive. Lower case is used here as more representative.

“Data Mining” surges in 1996, peaks in 2004-5
[Chart: Google Ngrams, smoothing=1; series: analytics, data mining]
Chart annotations:
• KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal
• Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy

Analytics surges in 2006, after Google Analytics introduced
[Chart: Google Trends for “analytics”, Jan 2005 – July 2013]
• Google Analytics introduced, Dec 2005
• “analytics - google” is 50% of “analytics” searches
• Slow-down in analytics in 2012?

In 2013: Big Data > Data Mining > Business Analytics > Predictive Analytics > Data Science
[Chart: Google Trends search, Jan 2008 - July 2013; series: Big Data, Data mining]
• Big Data slowdown?

History
• 1900 - Statistics
• 1960s - Data Mining = bad activity, data “dredging”
• 1990 - “Data Mining” is good, surges in 1996
• 2003 - “Data Mining” peaks, image tarnished (Total Information Awareness, invasion of privacy)
• 2006 - Google Analytics appears
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data surge
• 2013 - Data Science
• 2015 - ??

Data Competitions – Short History

1st Data Mining Competition: KDD-CUP 1997
– Organized by Ismail Parsa (then at Epsilon)
– Task: given data on past responders to fund-raising, predict most likely responders for new campaign
– Data:
• Population of 750K prospects, 300+ variables
• 10K (1.4%) responded to a broad campaign mailing
• Competition file was a stratified sample of 10K responders, 26K non-responders (28.7% response rate)
– Big effort on leaker detection (false predictors)
• KDD Cup was almost cancelled - several times
• Charles Elkan found leakers in training data

Evaluating Targeted List: Cumulative Pct Hits (Gains)
[Chart: Cumulative % Hits (y-axis, 0-100) vs. Pct list (x-axis, 5-95); curves: Model, Random]
5% of random list have 5% of targets, but 5% of model ranked list have 21% of targets.
Cum Pct Hits (5%, model) = 21%.

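The gains measure described on this slide can be computed directly from a scored list. What follows is a minimal Python sketch, not the KDD Cup evaluation code: the function name `cumulative_gains`, the toy scores, and the toy labels are illustrative assumptions.

```python
def cumulative_gains(scores, labels, fractions=(0.05, 0.10, 0.40)):
    """Percent of all responders captured in the top `fraction` of a ranked list.

    scores: model scores, higher means "more likely to respond"
    labels: 1 for an actual responder, 0 otherwise
    """
    # Rank prospects from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_hits = sum(labels)
    gains = {}
    for frac in fractions:
        cutoff = int(round(frac * len(ranked)))  # size of the targeted list
        hits = sum(label for _, label in ranked[:cutoff])
        gains[frac] = 100.0 * hits / total_hits
    return gains


# Toy data (hypothetical): 20 prospects, 4 responders, listed in score order.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [20 - i for i in range(20)]

toy_gains = cumulative_gains(toy_scores, toy_labels, fractions=(0.05, 0.25, 0.50))
print(toy_gains)
```

A randomly ordered list is expected to capture the same percentage of responders as the percentage of the list used (5% of the list, 5% of the targets), which corresponds to the "Random" line on the chart. A useful model lies above that line.
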
KDD-CUP Participant Statistics
– 45 companies/institutions participated
• 23 research prototypes
• 22 commercial tools
– 16 contestants turned in their results
• 9 research prototypes
• 7 commercial tools
– Evaluation: Best Gains (CPH) at 40% and 10%
– Joint winners:
• Charles Elkan (UCSD) with BNB, Boosted Naive Bayesian Classifier
• Urban Science Applications, Inc. with commercial Gain, Direct Marketing Selection System
• 3rd place: MineSet (SGI, Ronny Kohavi)

KDD-CUP Results Discussion
– Top finishers very close
– Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and 3rd place MineSet)
– Naïve Bayes tools did little data preprocessing, used small number of variables
– Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results

KDD Cup 1997: Top 3 results
Top 3 finishers are very close

KDD Cup 1997 – worst results
Note that the worst result (C6) was actually worse than random.
Competitor names were kept anonymous, apart from top 3 winners

KDD Cup Lessons
• Data Preparation is key, especially eliminating “leakers” (false predictors)
• Avoid overfitting the test data
• Simple models work well for predicting human behavior

Big Competition Successes
• Ansari X-Prize 2004: SpaceShipOne went to space twice in 2 weeks
• DARPA Grand Challenge, 2005: 150 mi off-road robotic car navigation

Netflix Prize
• Started in 2006, with 100M ratings, 500K users, 18K movies, $1M prize
• Goal: reduce RMSE of the “star” rating by 10% (was 0.95 for Netflix's own system, Cinematch)
• Public training data, public & secret test sets
[Figure: Predicted vs. Actual ratings]

Netflix Prize Milestones
• In just one week, WXYZ Consulting team beat the Netflix system with RMSE 0.9430
• Progress in 2007-8 was very slow
• In a 2007 KDnuggets Poll, 32% thought the prize would never be won
• Took 3 years to reach 10% improvement

Netflix Prize Winners
• Winning team used a complex ensemble of many algorithms
• Two teams had exactly the same RMSE of 0.8567, but the winner submitted 20 minutes earlier!

Netflix Prize lessons, 1
• Competitions work
• Limits to predicting human behavior – inherent randomness, noisy data
• Privacy concerns
– Researchers found a few people with matching IMDB and Netflix ratings – potential privacy breach
– 4 Netflix users sued
– Netflix Prize Sequel – cancelled

Netflix Prize lessons, 2
• Winning algorithm was too complex, too tailored to specific data set, never used
– Netflix blog, Apr 2012
• A basic SVD algorithm, proposed by Simon Funk (KDnuggets Interview w. Simon Funk), got ~6% improvement
• SVD++ version by Yehuda Koren & winning team reached ~8% improvement, was used by Netflix

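To give a feel for what a "basic SVD" recommender does, here is a simplified Python sketch of latent-factor matrix factorization fit by stochastic gradient descent. It is not Simon Funk's original code nor the implementation used in the Netflix Prize; the function name `funk_svd_sketch`, the hyperparameters, and the toy ratings are all assumptions made for illustration.

```python
import math
import random


def funk_svd_sketch(ratings, n_users, n_items, n_factors=2,
                    lr=0.02, reg=0.02, n_epochs=300, seed=0):
    """Fit user and item factor vectors by SGD on the observed ratings only.

    ratings: list of (user_index, item_index, rating) triples
    Returns a function predict(user_index, item_index).
    """
    rng = random.Random(seed)
    # Small random starting values for the latent factors.
    user_f = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    item_f = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    global_mean = sum(r for _, _, r in ratings) / len(ratings)

    def predict(u, i):
        return global_mean + sum(pu * qi for pu, qi in zip(user_f[u], item_f[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            for f in range(n_factors):
                pu, qi = user_f[u][f], item_f[i][f]
                # Gradient step on squared error with L2 regularization.
                user_f[u][f] += lr * (err * qi - reg * pu)
                item_f[i][f] += lr * (err * pu - reg * qi)
    return predict


def rmse(predict, ratings):
    """Root mean squared error of predict(u, i) over (u, i, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Hypothetical toy data: users 0-1 prefer items 0-1, users 2-3 prefer item 2.
toy_ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1),
               (1, 0, 4), (1, 1, 5), (1, 2, 1),
               (2, 0, 1), (2, 1, 2), (2, 2, 5),
               (3, 0, 2), (3, 2, 4)]

toy_mean = sum(r for _, _, r in toy_ratings) / len(toy_ratings)
baseline_rmse = rmse(lambda u, i: toy_mean, toy_ratings)
fitted_rmse = rmse(funk_svd_sketch(toy_ratings, n_users=4, n_items=3), toy_ratings)
print(baseline_rmse, fitted_rmse)
```

The comparison printed here is training error against a predict-the-global-mean baseline, so it only shows that the factors capture structure in the toy data. The improvements quoted on the slide were measured on held-out ratings.
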
Netflix Prize lessons, 3
• Wrong question was asked! (Minimizing RMSE of predicted vs actual ratings)
• RMSE gives a big penalty for errors > 2 stars, so an algorithm that fails big a few times will be worse than an algorithm that is often off by 1.
• Errors are not equal, but RMSE treats 2 vs 3 stars the same as 4 vs 5 or 1 vs 2.
• Also, Netflix Instant became more popular
• Better question would be “what do you like to watch” (anything on Instant likely to rank > 3)

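The RMSE point can be checked with a few lines of arithmetic. The Python sketch below uses made-up ratings (an assumption, not Netflix data) to compare a model that is always off by one star with a model that is usually exact but occasionally off by three stars.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean absolute error, shown for contrast with RMSE."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: ten movies, all truly rated 4 stars.
actual = [4] * 10
often_off_by_one = [3] * 10              # every prediction is 1 star too low
rarely_off_by_three = [4] * 8 + [1] * 2  # 8 exact predictions, 2 off by 3 stars

for name, predicted in [("often off by 1", often_off_by_one),
                        ("rarely off by 3", rarely_off_by_three)]:
    print(name, "RMSE:", rmse(predicted, actual), "MAE:", mae(predicted, actual))

# RMSE depends only on the size of the error, not on where it occurs on the scale.
same_penalty = rmse([3], [2]) == rmse([5], [4]) == rmse([2], [1])
print(same_penalty)
```

With these assumed ratings, the second model has the lower mean absolute error (0.6 vs. 1.0 stars) but the higher RMSE (about 1.34 vs. 1.0), because squaring magnifies the two large misses.
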
Focus on the right question and the right GOAL

Kaggle Competition Platform
• Launched by Anthony Goldbloom in 2010
• Quickly became the top platform for competitions
– Few people know of the TunedIT competition platform, launched in 2009
• Kaggle in Class – free for Universities
• Reached 100,000 members in July 2013

Kaggle Successes
• Allstate competition: Winning model was 270% more accurate than baseline
• Identified the sound of the endangered North Atlantic right whale in audio recordings
• GE FlightQuest
• Heritage Health Prize - $3M competition, 2011-13
• But … competitions are very time-consuming

Kaggle Business Model
• Initial business model - % of prize
• Kaggle Job Boards (currently free)
• Kaggle Connect: Offers consulting with top 0.5% of Kagglers (at $300/hr? see post), or $30-100K/month (IW, Mar 2013)
• Private competitions (Masters) open to top Kagglers
– Heritage Health Prize 2

Winning on Kaggle
• Kaggle Chief Scientist: Specialist knowledge – useless & unhelpful (Slate, Dec 2012)
• Big-data approaches
• Use good tools: R, Random forests
• Curiosity, Creativeness, Persistence, Team, Luck? (also Quora answer)
• Many (most?) winners – not professional data scientists (physicists, math profs, actuary) (RW, Apr 2012)

“your Ivy League diploma and IBM resume don't matter so much as my Kaggle score”
Almost true

Data: Public, Government, Portals, Marketplaces

Public Data
www.KDnuggets.com/datasets/
• Government, Federal, State, City, Local and public data sites and portals
• Data APIs, Hubs, Marketplaces, Platforms, Portals, and Search Engines
• Data Markets: DataMarket
• Data Platforms: Enigma, InfoChimps (acq. by CSC), Knoema, Exversion, …
• Data Search Engines: Quandl, qunb, Zanran
• Location: Factual
• People and places: Freebase

Public and Government Data
• Datamob.org: tracks government data in developer-friendly format
Data about U.S. state legislative activities, including bill summaries, votes, sponsorships, legislators and committees.

US Project Open Data
• In May 2013, White House announced Project Open Data
• “information is a valuable national asset whose value is multiplied when it is made easily accessible to the public”.
• “The Executive Order requires that, going forward, data generated by the government be made available in open, machine-readable formats, while appropriately safeguarding privacy, confidentiality, and security.”

Using Public Data
• Google – biggest success?
• Data Science for Social Good (Chicago) (Fast Company, Aug 2013)
– predict when bikeshare stations run out of bikes
– forecast local crime
– warn local hospitals about impending heart attacks

Big Data
• 2nd Industrial Revolution
• Do old activities better
• Create new activities/businesses

Doing Old Things Better
Application areas
– Direct marketing/Customer modeling
– Churn prediction
– Recommendations
– Fraud detection
– Security/Intelligence
– …
• Improvement will be real, but limited because of human randomness
• Competition will level companies

Big Data Enables New Things!
– Google – first big success of big data
– Social networks (Facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data
– Location analytics
– Health-care
• Personalized medicine
– Semantics and AI?
• Imagine IBM Watson, Google Now, Siri in 2023?

Big Data Bubble?

Gartner Hype Cycle
[Chart: Gartner Hype Cycle for Big Data, 2012]
Highlighted on the chart:
• Big Data
• Data Scientist, 2-5 yrs
• Social Network Analysis, 5-10
• Social Analytics, 2-5
• Predictive Analytics, <2
• MapReduce & Alternative - Disillusionment

Questions?
KDnuggets: Analytics, Big Data, Data Mining
• News, Jobs, Software, Courses, Data, Meetings, Publications, Webcasts, …
www.KDnuggets.com/news
• Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html
• Twitter: @kdnuggets
• Email to editor1@kdnuggets.com

APIForce Zurich 5 April Automation LPDG
Ā 
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
Ā 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Ā 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
Ā 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
Ā 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Ā 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
Ā 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
Ā 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
Ā 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
Ā 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
Ā 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
Ā 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
Ā 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Ā 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
Ā 
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Ā 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Ā 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
Ā 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
Ā 

Public Data and Data Mining Competitions - What are Lessons?

  • 1. Public Data and Data Mining Competitions – what are the Lessons? © KDnuggets 2013 Gregory Piatetsky-Shapiro, KDnuggets
  • 2. My Data • PhD (’84) in applying Machine Learning to databases • Researcher at GTE Labs – started the first project on Knowledge Discovery in Databases in 1989 • Organized first 3 Knowledge Discovery and Data Mining (KDD) workshops (1989-93), cofounded Knowledge Discovery and Data Mining (KDD) conferences (1995) • Chief Scientist at 2 analytics startups 1998-2001 • Co-founder SIGKDD (1998), Chair, 2005-2009 • Analytics/Data Mining Consultant, 2001- • Editor, KDnuggets, 1994-, full time 2001-
  • 3. Patterns – Key Part of Intelligence • Evolution: Animals better able to find, use patterns – more likely to survive • People have an ability and desire to find patterns • People's “pattern intuition” does not scale • Science is what helps separate valid from invalid patterns (astrology, fake cures, …) Horoscope for August: The Mars-Jupiter tandem in Cancer seems to indicate a febrile activity related to the accommodation, houses, premises, real estate investments. You'll build, redecorate, move out, change your furniture, refurbish, set up your yard or garden …
  • 4. Outline • What do we call it? • Data competitions – short history • Government and Public Data • Big Data Hype and Reality
  • 5. What do we call it? • Statistics • Data mining • Knowledge Discovery in Data (KDD) • Business Analytics • Predictive Analytics • Data Science • Big Data • … ? Same Core Idea: Finding Useful Patterns in Data. Different Emphasis
  • 6. 20th Century: Statistics dominates [Chart: Google Ngrams for “statistics”, smoothing=1] Note: Google Ngrams are case-sensitive. Lower case is used here as more representative
  • 7. “Data Mining” surges in 1996, peaks in 2004-5 [Chart: Google Ngrams for “analytics” and “data mining”, smoothing=1. Annotations: KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal; Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy]
  • 8. Analytics surges in 2006, after Google Analytics introduced. Slow-down in analytics in 2012? [Chart: Google Trends for “analytics”, Jan 2005 – July 2013. Annotation: Google Analytics introduced, Dec 2005] “analytics - google” is 50% of “analytics” searches
  • 9. In 2013: Big Data > Data Mining > Business Analytics > Predictive Analytics > Data Science [Chart: Google Trends search, Jan 2008 - July 2013; labelled series: Big Data, Data mining] Big Data slowdown?
  • 10. History • 1900 - Statistics • 1960s - Data Mining = bad activity, data “dredging” • 1990 - “Data Mining” is good, surges in 1996 • 2003 - “Data Mining” peaks, image tarnished (Total Information Awareness, invasion of privacy) • 2006 - Google Analytics appears • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data surge • 2013 - Data Science • 2015 - ??
  • 11. Data Competitions – Short History
  • 12. 1st Data Mining Competition: KDD-CUP 1997 – Organized by Ismail Parsa (then at Epsilon) – Task: given data on past responders to fund-raising, predict most likely responders for new campaign – Data: • Population of 750K prospects, 300+ variables • 10K (1.4%) responded to a broad campaign mailing • Competition file was a stratified sample of 10K responded, 26K non-resp. (28.7% response rate) – Big effort on leaker detection (false predictors). KDD Cup was almost cancelled - several times. Charles Elkan found leakers in training data
  • 13. Evaluating Targeted List: Cumulative Pct Hits (Gains) [Chart: Cumulative % Hits (y-axis) vs. Pct of list (x-axis), Model vs. Random] 5% of random list have 5% of targets, but 5% of model ranked list have 21% of targets: Cum Pct Hits (5%, model) = 21%.
  • 14. KDD-CUP Participant Statistics – 45 companies/institutions participated • 23 research prototypes • 22 commercial tools – 16 contestants turned in their results • 9 research prototypes • 7 commercial tools – Evaluation: Best Gains (CPH) at 40% and 10% – Joint winners: • Charles Elkan (UCSD) with BNB, Boosted Naive Bayesian Classifier • Urban Science Applications, Inc. with commercial Gain, Direct Marketing Selection System • 3rd place: MineSet (SGI, Ronny Kohavi)
  • 15. KDD-CUP Results Discussion – Top finishers very close – Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and 3rd place MineSet) – Naïve Bayes tools did little data preprocessing, used small number of variables – Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results
  • 16. KDD Cup 1997: Top 3 results. Top 3 finishers are very close
  • 17. KDD Cup 1997 – worst results. Note that the worst result (C6) was actually worse than random. Competitor names were kept anonymous, apart from top 3 winners
  • 18. KDD Cup Lessons • Data Preparation is key, especially eliminating “leakers” (false predictors) • Avoid overfitting the test data • Simple models work well for predicting human behavior
  • 19. Big Competition Successes • Ansari X-Prize 2004: SpaceShipOne went to space twice in 2 weeks • DARPA Grand Challenge, 2005: 150 mi Off-road robotic car navigation
  • 20. Netflix Prize • Started in 2006, with 100M ratings, 500K users, 18K movies, $1M prize • Goal: reduce RMSE error in “star” rating by 10% (was 0.95 for Netflix's own system, Cinematch) • Public training data, public & secret test sets [Figure: predicted vs. actual ratings]
  • 21. Netflix Prize Milestones • In just one week, WXYZ consulting team beat Netflix system with RMSE 0.9430 • Progress in 2007-8 was very slow: • In 2007 KDnuggets Poll 32% thought prize would never be won • Took 3 years to reach 10% improvement
  • 22. Netflix Prize Winners • Winning team used a complex ensemble of many algorithms • Two teams had exactly the same RMSE of 0.8567, but winner submitted 20 minutes earlier!
  • 23. Netflix Prize lessons, 1 • Competitions work • Limits to predicting human behavior – inherent randomness, noisy data • Privacy concerns – Researchers found a few people with matching IMDB and Netflix ratings – potential privacy breach – 4 Netflix users sued – Netflix Prize Sequel – cancelled
  • 24. Netflix Prize lessons, 2 • Winning algorithm was too complex, too tailored to specific data set, never used – Netflix blog, Apr 2012 • A basic SVD algorithm, proposed by Simon Funk (KDnuggets Interview w. Simon Funk) got ~6% improvement • SVD++ version by Yehuda Koren & winning team reached ~8% improvement, was used by Netflix
  • 25. Netflix Prize lessons, 3 • Wrong question was asked! (Minimizing RMSE of predicted vs actual ratings) • RMSE gives big penalty for errors > 2 stars, so an algo. that fails big a few times will be worse than an algo. that is often worse by 1. • Errors are not equal, but RMSE treats 2 vs 3 stars same as 4 vs 5 or 1 vs 2. • Also, Netflix Instant became more popular • Better question would be “what do you like to watch” (anything on Instant likely to rank > 3)
  • 26. Focus on the right question ? and the right GOAL
  • 27. Kaggle Competition Platform • Launched by Anthony Goldbloom in 2010 • Quickly became the top platform for competitions – Few people know of TunedIT competition platform launched in 2009 • Kaggle in Class – free for Universities • Achieved 100,000 members in July 2013
  • 28. Kaggle Successes • Allstate competition: Winner model was 270% more accurate than baseline • Identified sound of the endangered North American Right whale in audio recordings • GE FlightQuest • Heritage Health Prize - $3M competition, 2011-13 • But … Competitions - very time consuming
  • 29. Kaggle Business Model • Initial business model - % of prize • Kaggle Job Boards (currently free) • Kaggle Connect: Offers consulting with top 0.5% of Kagglers (at $300/hr ? see post), or $30-100K/month (IW, Mar 2013) • Private competitions (Masters) open to top Kagglers – Heritage Health Prize 2
  • 30. Winning on Kaggle • Kaggle Chief Scientist: Specialist knowledge – useless & unhelpful (Slate, Dec 2012) • Big-data approaches • Use good tools: R, Random forests • Curiosity, Creativeness, Persistence, Team, Luck? (also Quora answer) • Many (most?) winners – not professional data scientists (physicists, math profs, actuary) (RW, Apr 2012)
  • 31. “your Ivy League diploma and IBM resume don't matter so much as my Kaggle score” Almost true
  • 32. Data: Public, Government, Portals, Marketplaces
  • 33. Public Data www.KDnuggets.com/datasets/ • Government, Federal, State, City, Local and public data sites and portals • Data APIs, Hubs, Marketplaces, Platforms, Portals, and Search Engines. • Data Markets: DataMarket • Data Platforms: Enigma, InfoChimps (acq. by CSC), Knoema, Exversion, … • Data Search Engines: Quandl, qunb, Zanran • Location: Factual • People and places: Freebase
  • 34. Public and Government Data • Datamob.org: tracks government data in developer-friendly format. Example: data about U.S. state legislative activities, including bill summaries, votes, sponsorships, legislators and committees.
  • 35. US Project Open Data • In May 2013, White House announced Project Open Data • “information is a valuable national asset whose value is multiplied when it is made easily accessible to the public”. • “The Executive Order requires that, going forward, data generated by the government be made available in open, machine-readable formats, while appropriately safeguarding privacy, confidentiality, and security.”
  • 36. Using Public Data • Google – biggest success ? • Data Science for Social Good (Chicago) (Fast Company, Aug 2013) – predict when bikeshare stations run out of bikes – forecast local crime – warn local hospitals about impending heart attacks
  • 37. Big Data • 2nd Industrial Revolution • Do old activities better • Create new activities/businesses
  • 38. Doing Old Things Better. Application areas – Direct marketing/Customer modeling – Churn prediction – Recommendations – Fraud detection – Security/Intelligence – … • Improvement will be real, but limited because of human randomness • Competition will level companies
  • 39. Big Data Enables New Things! – Google – first big success of big data – Social networks (Facebook, Twitter, LinkedIn, …) success depends on network size, i.e. big data – Location analytics – Health-care • Personalized medicine – Semantics and AI ? • Imagine IBM Watson, Google Now, Siri in 2023 ?
  • 40. Copyright © 2003 KDnuggets
  • 41. Big Data Bubble? Gartner Hype Cycle: Big Data
  • 42. Gartner Hype Cycle for Big Data, 2012 • Data Scientist, 2-5 yrs • Social Network Analysis, 5-10 • Social Analytics, 2-5 • Predictive Analytics, <2 • MapReduce & Alternative - Disillusionment
  • 43. Questions? KDnuggets: Analytics, Big Data, Data Mining • News, Jobs, Software, Courses, Data, Meetings, Publications, Webcasts, … www.KDnuggets.com/news • Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html • @kdnuggets • Email to editor1@kdnuggets.com
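
The gains measure from slide 13 (Cumulative Pct Hits) can be illustrated with a small sketch. This is not code from the talk: the function name `cumulative_pct_hits` and the toy scores and labels are assumptions made up for illustration. It ranks a list by model score and reports what share of all targets falls in the top slice.

```python
def cumulative_pct_hits(scores, labels, pct_of_list):
    """Percent of all targets found in the top `pct_of_list` percent
    of the list, after ranking by model score (highest first)."""
    if not 0 < pct_of_list <= 100:
        raise ValueError("pct_of_list must be in (0, 100]")
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("labels contain no targets")
    # Rank prospects by descending model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Size of the top slice, at least one prospect.
    cutoff = max(1, round(len(ranked) * pct_of_list / 100))
    hits_in_slice = sum(label for _, label in ranked[:cutoff])
    return 100.0 * hits_in_slice / total_hits


# Hypothetical toy data: 20 prospects, 4 responders (label 1).
scores = list(range(20, 0, -1))  # already in descending order
labels = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

top_quarter = cumulative_pct_hits(scores, labels, 25)  # 3 of 4 targets
top_half = cumulative_pct_hits(scores, labels, 50)     # all 4 targets
```

A randomly ordered list would be expected to capture about 25% of targets in its top 25%, so the gap between that and the model's figure is the gain the slide describes.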

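The RMSE point from slide 25 can be checked with a small worked example. The two error lists below are hypothetical, not Netflix data: one predictor is always off by one star, the other is exact nine times out of ten and off by four stars once. Mean absolute error is included only as a contrast and is not a metric named in the talk.

```python
import math


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error, shown for contrast with RMSE."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical predictors rated on the same 10 movies.
often_off_by_one = [1] * 10          # always wrong by one star
rarely_off_by_four = [0] * 9 + [4]   # exact 9 times, one big miss

rmse_often = rmse(often_off_by_one)     # sqrt(10/10) = 1.0
rmse_rare = rmse(rarely_off_by_four)    # sqrt(16/10), about 1.26
mae_often = mae(often_off_by_one)       # 1.0
mae_rare = mae(rarely_off_by_four)      # 0.4
```

Under RMSE the predictor with a single large miss ranks worse, even though it is exact on nine of ten ratings and has the lower mean absolute error. That is the sense in which the choice of metric shapes which algorithm wins.
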
Editor's Notes

  1. The future is bright for Big Data, but caution is needed when evaluating claims