SlideShare a Scribd company logo
stories behind kaggle competitions
wendy kan, data scientist
wendy@kaggle.com
@wendykan
5/19/2015 @
kaggle runs public machine learning
competitions
we worked with clients/hosts on various
types of problems and data of different sizes
my job as a data scientist at kaggle
“data science is not just
kaggle competitions”
whyyyy???
machine learning processes
● Business Problem
● Collect Data
● Transform Data
● Dataset Splitting
● Evaluation Metric
● Feature Extraction
● Feature Selection
● Model Training
● Model Ensembling
● Methodology Selection
● Production System
● Ongoing Optimization
not every problem can be
turned into a kaggle
competition
size matters! where bigger is
better (most of the time)
data cleaning/formatting:
● easy to make a quick submission
● boosts participation
● (too) clean data kills creativity
data privacy/anonymization
metric: how do you measure
success?
● Classification - AUC/ Logarithmic Loss/Accuracy
● Regression - RMSE/MAE
● Ranking - MAP/NDCG
● Other / Custom
https://www.kaggle.com/wiki/Metrics
the design of a competition shapes how
people are going to solve a problem
Splitting dataset
● training/test
● public/private
Time series data
data leakage
“Deemed ‘one of the top ten data mining mistakes’, leakage is
essentially the introduction of information about the data mining target,
which should not be legitimately available to mine from”
“the concept of identifying and harnessing leakage has been openly
addressed as one of three key aspects for winning data mining
competitions”
“Leakage in Data Mining: formulation, detection, and avoidance” S Kaufman et al
do you have thousands of
people reviewing your
performance at work 24/7?
I do.
1. people make mistakes.
honesty is the best policy.
2. crowdsourcing is powerful.
anything that can go wrong
will go wrong.
Stories Behind Kaggle Competitions with Wendy Kan from Kaggle

More Related Content

What's hot

Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
Sreenatha Reddy K R
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Ilkay Altintas, Ph.D.
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Sampath Kumar
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Ferdin Joe John Joseph PhD
 
Data Science in Action
Data Science in ActionData Science in Action
Data Science in Action
Jordan Open Source Association
 
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Formulatedby
 
GTU GeekDay Data Science and Applications
GTU GeekDay Data Science and ApplicationsGTU GeekDay Data Science and Applications
GTU GeekDay Data Science and Applications
Kürşat İNCE
 
Adding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' ProblemsAdding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' Problems
Simon Price
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Edureka!
 
Data Science
Data ScienceData Science
Data Science
Prithwis Mukerjee
 
Data science
Data scienceData science
Data science
Sreejith c
 
Unit 3 part 2
Unit  3 part 2Unit  3 part 2
Unit 3 part 2
MohammadAsharAshraf
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
Mark West
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
Srinath Perera
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
ANOOP V S
 
Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9
Dr.Sotarat Thammaboosadee CIMP-Data Governance
 
Programming for data science in python
Programming for data science in pythonProgramming for data science in python
Programming for data science in python
UmmeSalmaM1
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
ShilpaKrishna6
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating results
bodaceacat
 
Data Science applications in business
Data Science applications in businessData Science applications in business
Data Science applications in business
Vladyslav Yakovenko
 

What's hot (20)

Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 
Data Science in Action
Data Science in ActionData Science in Action
Data Science in Action
 
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
 
GTU GeekDay Data Science and Applications
GTU GeekDay Data Science and ApplicationsGTU GeekDay Data Science and Applications
GTU GeekDay Data Science and Applications
 
Adding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' ProblemsAdding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' Problems
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
 
Data Science
Data ScienceData Science
Data Science
 
Data science
Data scienceData science
Data science
 
Unit 3 part 2
Unit  3 part 2Unit  3 part 2
Unit 3 part 2
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9
 
Programming for data science in python
Programming for data science in pythonProgramming for data science in python
Programming for data science in python
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating results
 
Data Science applications in business
Data Science applications in businessData Science applications in business
Data Science applications in business
 

Similar to Stories Behind Kaggle Competitions with Wendy Kan from Kaggle

MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
sundharakumarkb1
 
Why Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted ConfWhy Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted Conf
In Marketing We Trust
 
Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data science
Akira Shibata
 
What Level of "Data" is Right for Your Product by 7 Cups Sr PM
What Level of "Data" is Right for Your Product by 7 Cups Sr PMWhat Level of "Data" is Right for Your Product by 7 Cups Sr PM
What Level of "Data" is Right for Your Product by 7 Cups Sr PM
Product School
 
How to start a career in Data Science_.pptx
How to start a career in Data Science_.pptxHow to start a career in Data Science_.pptx
How to start a career in Data Science_.pptx
Avinash Sharma
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
zekeLabs Technologies
 
Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...
Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...
Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...
Christopher Gutknecht
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
gdgsurrey
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
Mikhail Rozhkov
 
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, ExponeaLive predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
Data Science Club
 
Predictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive IndustryPredictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive Industry
Matouš Havlena
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
Humayun Kabir
 
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Christopher Gutknecht
 
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina StoyData Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
LazarinaStoyanova
 
How to train your product owner
How to train your product ownerHow to train your product owner
How to train your product owner
David Murgatroyd
 
Demystifying Data Science
Demystifying Data ScienceDemystifying Data Science
Demystifying Data Science
Data Science Milan
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
Luminous8
 
Challenges of machine learning in production
Challenges of machine learning in productionChallenges of machine learning in production
Challenges of machine learning in production
Samuel Witherspoon
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
alexisroos
 
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Databricks
 

Similar to Stories Behind Kaggle Competitions with Wendy Kan from Kaggle (20)

MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
Why Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted ConfWhy Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted Conf
 
Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data science
 
What Level of "Data" is Right for Your Product by 7 Cups Sr PM
What Level of "Data" is Right for Your Product by 7 Cups Sr PMWhat Level of "Data" is Right for Your Product by 7 Cups Sr PM
What Level of "Data" is Right for Your Product by 7 Cups Sr PM
 
How to start a career in Data Science_.pptx
How to start a career in Data Science_.pptxHow to start a career in Data Science_.pptx
How to start a career in Data Science_.pptx
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...
Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...
Your Raw Data is Ready - Introduction to Analytics Engineering | SMX Advanced...
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
 
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, ExponeaLive predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
 
Predictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive IndustryPredictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive Industry
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
 
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
 
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina StoyData Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
 
How to train your product owner
How to train your product ownerHow to train your product owner
How to train your product owner
 
Demystifying Data Science
Demystifying Data ScienceDemystifying Data Science
Demystifying Data Science
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
 
Challenges of machine learning in production
Challenges of machine learning in productionChallenges of machine learning in production
Challenges of machine learning in production
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
 

More from Sri Ambati

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Sri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
Sri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
Sri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
Sri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
Sri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
Sri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
Sri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
Sri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
Sri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
Sri Ambati
 

More from Sri Ambati (20)

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 

Recently uploaded

TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 

Recently uploaded (20)

TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 

Stories Behind Kaggle Competitions with Wendy Kan from Kaggle

  • 1. stories behind kaggle competitions wendy kan, data scientist wendy@kaggle.com @wendykan 5/19/2015 @
  • 2. kaggle runs public machine learning competitions
  • 3. we worked with clients/hosts on various types of problems and data of different sizes
  • 4. my job as a data scientist at kaggle
  • 5. “data science is not just kaggle competitions” whyyyy???
  • 6. machine learning processes ● Business Problem ● Collect Data ● Transform Data ● Dataset Splitting ● Evaluation Metric ● Feature Extraction ● Feature Selection ● Model Training ● Model Ensembling ● Methodology Selection ● Production System ● Ongoing Optimization
  • 7. not every problem can be turned into a kaggle competition
  • 8.
  • 9. size matters! where bigger is better (most of the time)
  • 10. data cleaning/formatting: ● easy to make a quick submission ● boosts participation ● (too) clean data kills creativity
  • 12. metric: how do you measure success? ● Classification - AUC/ Logarithmic Loss/Accuracy ● Regression - RMSE/MAE ● Ranking - MAP/NDCG ● Other / Custom https://www.kaggle.com/wiki/Metrics
  • 13. the design of a competition shapes how people are going to solve a problem
  • 16. data leakage “Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from” “the concept of identifying and harnessing leakage has been openly addressed as one of three key aspects for winning data mining competitions” “Leakage in Data Mining: formulation, detection, and avoidance” S Kaufman et al
  • 17. do you have thousands of people reviewing your performance at work 24/7? I do.
  • 18. 1. people make mistakes. honesty is the best policy.
  • 19. 2. crowdsourcing is powerful. anything that can go wrong will go wrong.

Editor's Notes

  1. diabetic-retinopathy-detection
  2. anything you need to do to your own company data to pursuade your boss ot release it to 100,000s of people