Beat the benchmark.
Getting started with competitive data mining
By
Maheshakya Wijewardena
What is competitive data mining and why?
● There is a gap between those who have data and those who can analyze it.
● Organizations need to make use of their massive amounts of data, but at lower cost.
● Competitions promote and expand research on applications and data models.
● Challenges are organized by SIGKDD, ICML, PAKDD, ECML, NIPS, etc. to promote the practice.
● Companies use them to find talent and attract skills...
○ e.g. Facebook, Yahoo, Yelp, ...
“I keep saying the sexy job in the next ten years will be statisticians.”
- Hal Varian, Google chief economist, 2009
● Kaggle?
“... is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.”
- Kaggle, verbatim
An outline
● Types of challenges
● Understanding the challenge
● Setting things up
● Analyzing data
● Data preprocessing
● Training models
● Validating models
● ML/Statistics packages
● Conclusion
Types of competitions
The well-known tasks you find in a data mining class...
● Most of them are classification
○ Binary labels or class probabilities
○ Rarely multiclass
● Time series forecasting
○ Predict for some period ahead
○ Seasonal patterns
● Anomaly detection
The majority of competitions focus on the results, not the process.
But some give high priority to the process - scalability, technical feasibility, complexity, etc. (often for recruitment and research)
Before you start...
Be aware of the structure of data mining competitions on Kaggle.
Always remember that the purpose of a predictive model is to predict on data that we have not seen!
Understand what it is about
● Read the problem statement until you understand it thoroughly.
● Always keep an eye on the forum - know how other competitors think.
● Check dataset sizes! - Can you handle them?
● Competitive advantage - try to get some domain knowledge; it helps, but it is not strictly necessary.
● How are submissions evaluated, and on what criterion?
○ Area under the ROC curve
○ MSE
○ False positive/negative rate
○ Precision/recall
○ ...
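The evaluation criteria listed above can be sketched in a few lines. This is a minimal plain-Python illustration, not the scoring code of any particular competition; the function names and the toy labels and scores are assumptions made for the example. In practice a library such as scikit-learn provides equivalent metrics.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]
labels = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

precision, recall = precision_recall(y_true, labels)
auc = roc_auc(y_true, scores)
mse = mean_squared_error(y_true, scores)
```

Note that precision and recall depend on the chosen threshold, while the AUC only depends on how the scores rank the examples.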
Setting things up...
● Break the problem down into parts
● Organize your team - divide the work
● Look at the benchmark code - a good starting point, but it’s not enough!
● Look at the sample submission files
And most importantly,
● Set up an environment in which you can iterate and test new ideas rapidly
Analyzing Data
KNOW THY DATA !!!
● Get to know your data
○ Raw data - images, video, text - do I need to perform feature extraction too?
○ Numerical, categorical
● Visualize! - Histograms, pie charts, cluster diagrams…
○ Advanced - vector quantization, e.g. self-organizing maps (SOM)
● Missing values
● Class imbalance
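A first pass over the last two points, missing values and class imbalance, might look like the sketch below. The rows, column names, and helper names are hypothetical and only illustrate the idea; with real data a library such as pandas would usually do this job.

```python
from collections import Counter

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 34, "city": "Colombo", "label": 0},
    {"age": None, "city": "Kandy", "label": 0},
    {"age": 51, "city": None, "label": 1},
    {"age": 29, "city": "Colombo", "label": 0},
]


def missing_counts(rows):
    """Count missing (None) entries per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def class_balance(rows, target="label"):
    """Fraction of rows belonging to each class of the target column."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


missing = missing_counts(rows)
balance = class_balance(rows)
```

A heavily skewed `balance` is a hint that plain accuracy will be misleading and that resampling or a different metric may be needed.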
Feature engineering and Data Preprocessing
Typical preprocessing techniques:
● Handle missing values - keep, discard, impute
● Resample - up/downsampling
● Encoding
○ Label encoding
○ One hot encoding / bit maps
● For textual - TF-IDF, feature hashing, bag of words, ...
● Dimensionality reduction - PCA, SVD, ...
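Three of the techniques above (imputing missing values, label encoding, and one hot encoding) can be sketched in plain Python as follows. The helper names and toy values are assumptions for illustration. In a competition, the fill value and the category list should be computed on the training data only and then reused for the hold out and test data.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing (None) entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def label_encode(values):
    """Map each category to an integer code (alphabetical order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector with a single 1."""
    categories = sorted(set(values))
    vectors = [[1 if v == cat else 0 for cat in categories] for v in values]
    return vectors, categories


ages = impute_mean([34, None, 51, 29])
codes, mapping = label_encode(["red", "green", "red", "blue"])
vectors, categories = one_hot_encode(["red", "green", "red", "blue"])
```

Label encoding imposes an artificial order on the categories, which some learning algorithms will pick up on; one hot encoding avoids that at the cost of more columns.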
Feature engineering is a bit trickier…
● Identify the most important/influential features.
○ Feature selection
○ Strongly dependent on the learning algorithm
○ Recursive feature elimination
● Eliminate (trivially) irrelevant features - IDs, timestamps (sometimes)
● Derived features?
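As one possible illustration of the last two points, the sketch below drops an ID column and replaces a raw timestamp with derived features. The column names (`id`, `timestamp`, `amount`) and the choice of derived features are hypothetical; which derived features help depends on the problem.

```python
from datetime import datetime


def engineer(row):
    """Drop the ID, keep the numeric field, and derive features from the
    raw timestamp instead of feeding the timestamp itself to the model."""
    ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    return {
        "amount": row["amount"],
        "hour": ts.hour,
        "day_of_week": ts.weekday(),           # Monday = 0, Sunday = 6
        "is_weekend": int(ts.weekday() >= 5),  # Saturday or Sunday
    }


# Hypothetical raw record.
raw = {"id": 1017, "timestamp": "2015-03-14 21:30:00", "amount": 42.5}
features = engineer(raw)
```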
Important !
Make sure you have your own implementation of the evaluation metric.
When evaluating your models:
● A simple training/validation split is not enough.
○ K-fold cross-validation holds out one fold per round, but every fold is still used for training in some round.
● Always keep a separate hold out set that you do not touch at all during the model building process
○ Including preprocessing
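The validation scheme described above can be sketched as index bookkeeping: set the hold out rows aside first, then run K-fold cross-validation on the remaining rows only. The function names, the 20% hold out fraction, and the fixed seed are assumptions for the example.

```python
import random


def train_holdout_split(n_rows, holdout_fraction=0.2, seed=0):
    """Shuffle row indices and set a fraction aside as the hold out set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_rows * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


def k_fold(indices, k=5):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


# The hold out indices are never passed to k_fold, so neither model
# fitting nor preprocessing statistics can see those rows.
train_idx, holdout_idx = train_holdout_split(100)
splits = list(k_fold(train_idx, k=5))
```

Any preprocessing statistics (means for imputation, category lists, scaling factors) would be computed inside each fold's training indices and only applied to the validation and hold out rows.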
Typical model building process
● Split the data into a training set and a hold out set for validation
● Training set: Preprocess → Train model
● Hold out set: Preprocess → Evaluation
○ Good? Implement the model
○ Bad? Be brave and scrap the model!
Training models
● Learning algorithm - select carefully based on the problem
● Hyperparameter tuning
○ Grid search
○ Randomized search
○ Manual?
● Be aware of overfitting!
● Ensemble methods:
○ Bagging
○ Boosting
○ Model ensembling - convex combinations
No matter what models you train, winning solutions are almost always ensembles.
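A convex combination of two models is a weighted average whose weights are non-negative and sum to 1. The sketch below picks the weight by a small grid search on validation predictions; the toy predictions and the helper names are assumptions for illustration, and the squared-error metric stands in for whatever the competition actually uses.

```python
def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def blend(pred_a, pred_b, weight):
    """Convex combination: weight * A + (1 - weight) * B, 0 <= weight <= 1."""
    return [weight * a + (1 - weight) * b for a, b in zip(pred_a, pred_b)]


def best_weight(y_true, pred_a, pred_b, steps=10):
    """Grid search over weights 0.0, 0.1, ..., 1.0 on validation data."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda w: mse(y_true, blend(pred_a, pred_b, w)))


# Hypothetical validation targets and predictions from two models.
y_val = [3.0, 5.0, 7.0, 9.0]
model_a = [4.0, 6.0, 8.0, 10.0]  # consistently overshoots by 1
model_b = [2.0, 4.0, 6.0, 8.0]   # consistently undershoots by 1

w = best_weight(y_val, model_a, model_b)
blended_error = mse(y_val, blend(model_a, model_b, w))
```

Here each model alone has an error of 1.0, yet their blend is exact because their errors cancel. Real models rarely complement each other this neatly, and the weight should be chosen on validation data, not on the public leaderboard.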
Model Validation
● Get the score of your model from your evaluator.
○ Bad? - Set it aside and design a new model
○ Good? - Go ahead and predict for the test set
● Even if an individual model performs poorly, it might fit gracefully into an ensemble
● Confusion matrix
● Try to visualize predicted vs. actual
○ For each feature
○ Gives you insight into which characteristics of the features make the model better or worse
● Keep records.
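A confusion matrix, as mentioned above, counts how often each actual class was predicted as each class. A minimal sketch with hypothetical labels (rows are actual classes, columns are predicted classes):

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]


# Hypothetical actual and predicted classes.
y_true = ["cat", "dog", "dog", "cat", "bird", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "dog", "dog"]
labels = ["bird", "cat", "dog"]

matrix = confusion_matrix(y_true, y_pred, labels)
```

The diagonal holds the correct predictions; the off-diagonal cells show which classes the model confuses, which a single score cannot tell you.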
Final steps...
Submissions:
● Try to submit something every day - know your position
● Keep up to date
● Don’t change your model just to gain slight improvements on the public leaderboard - often a trap!
Don’t forget the forum!
● If you have something interesting, share it with others - but not everything ;)
● Good Kagglers always give something back
About ML/Stat packages...
● Machine learning packages:
○ R
○ scikit-learn
○ pylearn
○ mlpack
○ Shogun
○ Spark/H2O - scalable, distributed processing - but limited functionality.
● Statistics
○ Again R
○ statsmodels
● Data manipulation
○ Again R
○ pandas, NumPy, SciPy
● Visualization
○ Again R
○ Matplotlib
Sometimes,
● Deep learning - Theano
● NLP - NLTK
Emerging - Julia
Conclusion
● First, try out some “getting started” competitions - take advantage of them
● When analyzing data - be patient, be meticulous
● Visualize!
● (Some) domain knowledge would be useful
● Feature engineering is the key (often)
● Have the discipline to keep a proper validation framework
● Be brave!
● Learn from others
● “Right” models
● Use ML/Stat packages effectively
● Follow good coding, data manipulation, and software engineering practices
● Avoid overfitting!
● Luck....
No Free Lunch
Questions?
