How to Win Machine Learning Competitions
By Marios Michailidis
It's not the destination…it's the journey!
What is Kaggle?
• The world's biggest predictive modelling competition platform
• Half a million members
• Companies host data challenges.
• Usual tasks include:
– Predict topic or sentiment from text.
– Predict species/type from image.
– Predict store/product/area sales
– Marketing response
Inspired by Horse races…!
• At the University of Southampton, an
entrepreneur talked to us about how he was
able to predict the horse races with regression!
Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• I became more passionate!
• Picked up programming skills
Built KazAnova
• Generated a couple of algorithms and data
techniques and decided to make them public
so that others could gain from them.
• Released it at http://www.kazanovaforanalytics.com/
• Named it after ANOVA (the statistical method) and
KAZANI, my mom's last name.
Joined Kaggle!
• Was curious about Kaggle.
• Joined a few contests and learned a lot.
• The community was very open to sharing and
collaboration.
Other interesting tasks!
• Predict the correct answer in a science test,
using Wikipedia.
• Predict which virus has infected a file.
• Predict the NCAA tournament!
• Predict the Higgs boson.
• Out of many different numerical series, predict
which one is the cause.
3 Years of modelling competitions
• Over 90 competitions
• Participated with 45 different teams
• 22 top 10 finishes
• 12 times prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists
What's next
• PhD (UCL) on recommender systems.
• More kaggling (but less intense) !
So… what wins competitions?
In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration
• Ensembling
More specifically…
● Understand the problem and the function to optimize
● Choose what seems to be a good algorithm for the problem and
iterate:
• Clean the data
• Scale the data
• Make feature transformations
• Make feature derivations
• Use cross-validation to:
– Make feature selections
– Tune the hyperparameters of the algorithm
● Do this for many other algorithms, always exploiting their
benefits
● Find the best way to combine or ensemble the different algorithms
● Different types of models and when to use them
Understand the metric to optimize
● For example, in this competition it was AUC (Area Under the ROC
Curve).
● This is a ranking metric.
● It shows how consistently your good cases score higher than
your bad cases.
● Not all algorithms optimize the metric you want.
● Common metrics:
– AUC
– Classification accuracy
– Precision
– NDCG
– RMSE
– MAE
– Deviance
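As a rough illustration (a minimal Python sketch with scikit-learn; the labels and scores below are toy values, not from any competition), AUC is perfect when every good case outranks every bad one:

```python
# Minimal sketch of AUC as a ranking metric, using scikit-learn.
# The labels and scores are made-up toy values.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = good case, 0 = bad case
y_score = [0.9, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.5]   # model scores

# AUC is the probability that a randomly chosen good case
# gets a higher score than a randomly chosen bad case.
print(roc_auc_score(y_true, y_score))  # 1.0 here: every good case outranks every bad one
```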
Choose the algorithm
● In most cases many algorithms are tried
before finding the right one(s).
● Those who try more models/parameters have a higher
chance to win contests than others.
● Any algorithm that makes sense to be used should be
used.
In the algorithm: Clean the Data
● The data-cleaning step is not independent of the chosen
algorithm: for different algorithms, different cleaning filters should
be applied.
● Treating missing values is really important; certain algorithms
are more forgiving than others.
● On other occasions it may make more sense to carefully replace a
missing value with a sensible one (like the feature average), treat it
as a separate category, or even remove the whole observation (see the
sketch below).
● Similarly, we search for outliers; again, some models are more
forgiving, while in others the impact of outliers is detrimental.
● How do we decide the best method?
– Try them all, or
– Experience and literature, but mostly the first (bold)
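A minimal pandas sketch of the three missing-value options above (the tiny DataFrame and column names are made up for illustration):

```python
# Minimal sketch of missing-value handling, assuming a pandas DataFrame
# with a numeric column "age" and a categorical column "city" (toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["London", None, "Athens", "London"]})

# Option 1: replace a missing numeric value with the feature average
df["age_imputed"] = df["age"].fillna(df["age"].mean())

# Option 2: treat a missing categorical value as its own category
df["city_filled"] = df["city"].fillna("MISSING")

# Option 3: remove the whole observation
df_dropped = df.dropna(subset=["age", "city"])
```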
In the algorithm: scaling the data
● Certain algorithms cannot deal with unscaled data.
● Scaling techniques:
– Max scaler: divide each feature by its highest absolute value
– Normalization: subtract the mean and divide by the standard
deviation
– Conditional scaling: scale only under certain conditions (e.g.
in medicine we tend to scale per subject to make their features
comparable)
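A minimal scikit-learn sketch of the first two techniques (the toy matrix is made up; conditional scaling would be applied group by group, e.g. per subject):

```python
# Minimal sketch of max scaling and normalization with scikit-learn.
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])  # toy data

X_max = MaxAbsScaler().fit_transform(X)     # divide each feature by its max |value|
X_norm = StandardScaler().fit_transform(X)  # subtract mean, divide by std
```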
In the algorithm: feature transformations
● For certain algorithms there is a benefit in changing the features,
because it helps them converge faster and better.
● Common transformations include:
1. LOG, SQRT(variable): smoothens variables
2. Dummies for categorical variables
3. Sparse matrices: to compress the data
4. 1st derivatives: to smoothen data
5. Weights of evidence (transforming variables while using
information from the target variable)
6. Unsupervised methods that reduce dimensionality (SVD, PCA,
ISOMAP, KD-trees, clustering)
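A minimal sketch of a few of these transformations (log, dummies, sparse matrix, SVD) on made-up data, using pandas and scikit-learn:

```python
# Minimal sketch of common feature transformations (toy data).
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame({"income": [20000.0, 45000.0, 120000.0],
                   "colour": ["red", "blue", "red"]})

df["log_income"] = np.log1p(df["income"])   # 1. LOG to smoothen the variable
dummies = pd.get_dummies(df["colour"])      # 2. dummies for a categorical variable
sparse = csr_matrix(dummies.values.astype(float))  # 3. sparse matrix to compress

# 6. unsupervised dimensionality reduction (SVD) on the dummy matrix
svd = TruncatedSVD(n_components=1, random_state=0)
reduced = svd.fit_transform(sparse)
```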
In the algorithm: feature derivations
● In many problems this is the most important thing. For example:
– Text classification: generate the corpus of words and make TF-IDF
– Sounds: convert sounds to frequencies through Fourier
transformations
– Images: make convolutions, e.g. break an image down to pixels and
extract different parts of the image
– Interactions: really important for some models (for our algorithms
too!), e.g. have variables that show whether an item is popular AND the
customer likes it
– Others that make sense: similarity features, dimensionality-reduction
features, or even predictions from other models as features
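A minimal sketch of two of these derivations, TF-IDF from a toy corpus and an interaction feature (the corpus and column names are made up):

```python
# Minimal sketch of feature derivation: TF-IDF and an interaction feature.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog barked", "the cat barked"]
tfidf = TfidfVectorizer().fit_transform(corpus)   # text -> TF-IDF matrix

df = pd.DataFrame({"item_is_popular": [1, 0, 1],
                   "customer_likes_category": [1, 1, 0]})
# interaction: item is popular AND the customer likes the category
df["popular_and_liked"] = df["item_is_popular"] * df["customer_likes_category"]
```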
In the algorithm: Cross-Validation
● This basically means that from my main set I RANDOMLY create 2
sets. I build (train) my algorithm with the first one (let's call it the training
set) and score the other (let's call it the validation set). I repeat this
process multiple times and always check how my model performs on
the validation set in respect to the metric I want to optimize.
● The process may look like this:
1. For 10 times (you choose how many, X):
1. Split the set into training (50%-90% of the original data)
2. and validation (50%-10% of the original data)
3. Then fit the algorithm on the training set
4. Score the validation set
5. Save the result of that scoring in respect to the chosen metric
2. Calculate the average of these 10 (X) scores. That is how much you
should expect to score in real life, and it is generally a good estimate.
● Remember to use a SEED to be able to replicate these X splits.
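A minimal sketch of this repeated random-split validation (toy data from scikit-learn; AUC stands in as the chosen metric; the seed of 42 is arbitrary):

```python
# Minimal sketch of repeated random-split validation with a fixed SEED.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=500, random_state=1)  # toy data
splitter = ShuffleSplit(n_splits=10, train_size=0.7, random_state=42)  # SEED

scores = []
for train_idx, valid_idx in splitter.split(X):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    preds = model.predict_proba(X[valid_idx])[:, 1]    # score the validation set
    scores.append(roc_auc_score(y[valid_idx], preds))  # save the chosen metric

print(np.mean(scores))  # the average: a good estimate of real-life performance
```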
In Cross-Validation: Do feature selection
● It is unlikely that all features are useful, and some of them may be
damaging your model, because…
– They do not provide new information (collinearity)
– They are just not predictive (just noise)
● There are many ways to do feature selection:
– Run the algorithm and use an internal measure to retrieve the most
important features (no cross-validation)
– Forward selection, with or without cross-validation
– Backward selection, with or without cross-validation
– Noise injection
– A hybrid of all methods
● Normally a forward method is chosen with CV. That is, we add a
feature, then split the data into the exact same X splits as before
and check whether our metric improved (see the sketch after this list):
– If yes, the feature remains
– Else, Wiedersehen!
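A minimal sketch of such a forward pass with cross-validation (toy data; a logistic regression stands in for whatever algorithm is being tuned):

```python
# Minimal sketch of forward feature selection with cross-validation:
# add a feature, re-run the same CV splits, keep it only if the metric improves.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

selected = []
best_score = -np.inf
for feature in range(X.shape[1]):
    candidate = selected + [feature]
    # cv=5 uses the same deterministic splits on every pass
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, candidate], y, cv=5, scoring="roc_auc").mean()
    if score > best_score:    # if yes, the feature remains
        best_score = score
        selected = candidate
    # else, Wiedersehen!

print(selected, best_score)
```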
In Cross-Validation: Hyper Parameter Optimization
● This generally takes lots of time, depending on the algorithm. For
example, in a random forest the best model would need to be
determined based on parameters such as the number of trees, max
depth, minimum cases in a split, features to consider at each split,
etc.
● One way to find the best parameters is to manually change one
(e.g. max_depth=10 instead of 9) while keeping everything else
constant. I found that this method helps you understand more
about what would work in a specific dataset.
● Another way is to try many possible combinations of hyper-
parameters. We normally do that with grid search, where we
provide an array of all possible values to be trialled with cross-
validation (e.g. try max_depth {8,9,10,11} and number of trees
{90,120,200,500}).
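A minimal sketch of that grid search with scikit-learn's GridSearchCV, reusing the example value arrays from above (the data is made up):

```python
# Minimal sketch of hyperparameter tuning via grid search with CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=1)  # toy data

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"max_depth": [8, 9, 10, 11],
                                "n_estimators": [90, 120, 200, 500]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```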
Train Many Algorithms
● Try to exploit their strengths
● For example, focus on using linear features with linear
regression (e.g. the higher the age, the higher the
income) and non-linear ones with a random forest
● Make it so that each model tries to capture something
new, or even focuses on a different part of the data
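A minimal sketch of the idea: fit a linear model and a random forest on the same (toy) data so each can capture different structure:

```python
# Minimal sketch of training diverse algorithms on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=1)  # toy data

linear_model = LogisticRegression(max_iter=1000).fit(X, y)       # linear effects
forest_model = RandomForestClassifier(random_state=0).fit(X, y)  # non-linear effects
```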
Ensemble
● A key part (in winning competitions at least) is to combine the various
models made.
● Remember, even a crappy model can be useful to some small
extent.
● Possible ways to ensemble:
– Simple average: (model1 prediction + model2 prediction)/2
– Average ranks for AUC (simple average after converting predictions to ranks)
– Manually tune weights with cross-validation
– Use a geomean weighted average
– Use meta-modelling (also called stacked generalization, or stacking)
● Check GitHub for a complete example of these methods using the
Amazon competition hosted by Kaggle:
https://github.com/kaz-Anova/ensemble_amazon (top 60 rank).
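A minimal sketch of the first, second and fourth options above, on made-up prediction vectors (the geomean weights are arbitrary, not tuned):

```python
# Minimal sketch of simple-average, rank-average and geomean ensembling.
# The prediction vectors are toy values, not from real models.
import numpy as np
from scipy.stats import rankdata

model1 = np.array([0.90, 0.10, 0.70, 0.40])
model2 = np.array([0.60, 0.20, 0.80, 0.30])

simple_avg = (model1 + model2) / 2                    # simple average
rank_avg = (rankdata(model1) + rankdata(model2)) / 2  # average of ranks (for AUC)

w1, w2 = 0.6, 0.4                     # arbitrary weights; tune with CV in practice
geomean = model1 ** w1 * model2 ** w2  # geomean weighted average
```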
Different models I have experimented vol 1
● Logistic/Linear/Discriminant regression: fast, scalable,
comprehensible, solid under high dimensionality, can be memory-
light. Best when relationships are linear or all features are
categorical. Good for text classification too.
● Random Forests: probably the best one-off overall algorithm out
there (in my experience). Fast, scalable, memory-medium. Best
when all features are numeric-continuous and there are strong non-
linear relationships. Does not cope well with high dimensionality.
● Gradient Boosting (Trees): less memory-intense than forests (as
individual predictors tend to be weaker). Fast, semi-scalable,
memory-medium. Good when forests are good.
● Neural Nets (AKA Deep Learning): good for tasks humans are
good at: image recognition, sound recognition. Good with categorical
variables too (as they replicate on-and-off signals). Medium-speed,
scalable, memory-light. Generally good for linear and non-linear
tasks. May take a long time to train depending on structure. Many
parameters to tune. Very prone to over- and under-fitting.
Different models I have experimented vol 2
● Support Vector Machines (SVMs): medium-speed, not scalable,
memory-intense. Still good at capturing linear and non-linear
relationships. Holding the kernel matrix takes too much memory;
not advisable for datasets bigger than 20k.
● K Nearest Neighbours: slow (depending on the size), not easily
scalable, memory-heavy. Good when deciding whether a case is
good or bad really comes down to how closely it resembles specific
individuals. Also good when there are many target variables, as the
similarity measures remain the same across different
observations. Good for text classification too.
● Naïve Bayes: quick, scalable, memory-ok. Good for quick
classifications on big datasets. Not particularly predictive.
● Factorization Machines: good gateways between linear and
non-linear problems. They stand between regressions, KNNs and
neural networks. Memory-medium, semi-scalable, medium-speed.
Good for predicting the rating a customer will assign to a product.
Tools vol 1
● Languages: Python, R, Java
● Liblinear for linear models:
http://www.csie.ntu.edu.tw/~cjlin/liblinear/
● LibSVM for Support Vector Machines:
www.csie.ntu.edu.tw/~cjlin/libsvm/
● The scikit-learn package in Python for text classification, random forests
and gradient boosting machines: scikit-learn.org/stable/
● XGBoost for fast, scalable gradient boosting:
https://github.com/tqchen/xgboost
● LightGBM: https://github.com/Microsoft/LightGBM
● Vowpal Wabbit for fast, memory-efficient linear models:
hunch.net/~vw/
● Encog for neural nets: http://www.heatonresearch.com/encog
● H2O in R for many models
Tools vol 2
● LibFM: www.libfm.org
● LibFFM: https://www.csie.ntu.edu.tw/~cjlin/libffm/
● Weka in Java (has everything): http://www.cs.waikato.ac.nz/ml/weka/
● GraphChi for factorizations: https://github.com/GraphChi
● GraphLab for lots of stuff: https://dato.com/products/create/open_source.html
● Cxxnet: one of the best implementations of convolutional neural nets out
there. Difficult to install and requires an NVIDIA GPU.
https://github.com/antinucleon/cxxnet
● RankLib: the best library out there, made in Java, suited for ranking algorithms
(e.g. rank products for customers) that support optimization functions like
NDCG: people.cs.umass.edu/~vdang/ranklib.html
● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne)
for nets. These assume you have Theano
(http://deeplearning.net/software/theano/) or TensorFlow
(https://www.tensorflow.org/).
Where to go next to prepare for competitions
● Coursera: https://www.coursera.org/course/ml (Andrew Ng's
class)
● Kaggle.com: many competitions for learning. For instance:
http://www.kaggle.com/c/titanic-gettingStarted . Look for the
"knowledge" flag.
● Very good slides from the University of Utah:
www.cs.utah.edu/~piyush/teaching/cs5350.html
● clopinet.com/challenges/ : many past predictive-modelling
competitions with tutorials.
● Wikipedia. Not to be underestimated; still the best source of
information out there (collectively).