This document provides an overview and strategy for approaching Kaggle competitions using machine learning. It covers preparing data, using versatile libraries such as Scikit-Learn and XGBoost, and model-ensembling techniques such as voting, bagging, and stacking. It also explores algorithms including random forests, gradient boosted machines, and support vector machines, as well as potential issues such as leakage and how to address them. The goal is to create learning algorithms that are data-, problem-, and solution-agnostic to improve generalization.
2. APPROACH TO KAGGLE INCLASS COMPETITIONS
● 1) Get a good score as fast as possible by:
● Getting the raw data into a universal data format.
● Mostly CSV -> NumPy array / LibSVM-light format
● 2) Using versatile libraries:
● Scikit-Learn, Vowpal Wabbit, XGBoost.
● 3) Model ensembling
● Voting, Bagging, Boosting, Binning, Blending, Stacking
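The first step above — raw data into a universal format — can be sketched as follows. This is a minimal illustration, assuming a toy CSV with the label in the last column (the file contents and column layout are assumptions, not part of the deck):

```python
import numpy as np
from io import BytesIO, StringIO
from sklearn.datasets import dump_svmlight_file

# Toy stand-in for a competition train.csv: two features, label last.
csv_text = "0.28,-0.02,1\n-0.37,-0.09,0\n0.15,0.44,1\n"

data = np.loadtxt(StringIO(csv_text), delimiter=",")
X, y = data[:, :-1], data[:, -1]          # features, labels

# Dump to LibSVM-light format, usable by Vowpal Wabbit / XGBoost-style tools.
buf = BytesIO()
dump_svmlight_file(X, y, buf)
print(buf.getvalue().decode().splitlines()[0])
```

In practice `buf` would be a file opened in binary mode; everything downstream (Scikit-Learn, VW, XGBoost) can then consume the same arrays or files.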
3. STRATEGY
● Try to create "machine learning" learning algorithms
and optimized pipelines which are:
● Data-agnostic
● Problem-agnostic
● Solution-agnostic
● Automated
● Memory-friendly
● Robust, with good generalization.
4. FIRST OVERVIEW
● Problem type
● Classification? Regression?
● Evaluation metric
● Description
● Benchmark code
“Predict human activities based on their smartphone usage
pattern. Predict if a person is sitting, walking, etc., using
their smartphone activities.”
https://inclass.kaggle.com/c/smartphone-user-activity-prediction
5. FIRST OVERVIEW
● Data types
● Counts
● Text
● Categorical
● Numerical
● Dates
Quick preview: 0.28309984, -0.025501173, -0.11118051,
-0.37447712, -0.099567756, -0.20296558, -0.37631066,
-0.15016035, -0.18169451, -0.29308661, -0.14946642, …
6. FIRST OVERVIEW
● Data size
● Number of features?
● Number of train samples?
● Number of test samples?
● Online learning or offline learning?
● Linear problem or Non-linear?
7. BRANCH
● If issues with data:
● Clear up the issues (impute missing data, join tables,
parse JSON strings)
● Give up, and join another competition.
● If no issues with data:
● Get the raw data into NumPy arrays, we want:
● X_train (train set), y (labels), X_test (test set)
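The "no issues" branch can be sketched like this — a minimal example of producing the three arrays the rest of the pipeline expects, with missing-value imputation as the cleanup step (the toy arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy raw data with missing values.
raw_train = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
raw_test = np.array([[np.nan, 2.0], [3.0, 1.0]])
labels = np.array([0, 1, 1])

imputer = SimpleImputer(strategy="mean")   # clear up missing data
X_train = imputer.fit_transform(raw_train)  # train set
y = labels                                  # labels
X_test = imputer.transform(raw_test)        # test set, imputed with TRAIN statistics
```

Fitting the imputer on the train set and only transforming the test set keeps test information out of the preprocessing step.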
10. ALGORITHMS
● There is a bias-variance trade-off between simple models
and complex models.
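The trade-off can be made concrete by growing trees of increasing depth on noisy synthetic data (the dataset and depths are illustrative assumptions): a very shallow tree underfits, while an unrestricted tree drives training accuracy to 1.0 but generalizes worse than its training score suggests.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(400, 5)
y = (X[:, 0] + 0.3 * rng.randn(400) > 0.5).astype(int)  # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for depth in (1, 3, None):  # simple -> complex
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print(scores)  # {depth: (train accuracy, test accuracy)}
```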
11. ALGORITHMS
● There is No Free Lunch in machine learning.
● We show that all algorithms that search for an extremum of
a cost function perform exactly the same, when averaged
over all possible cost functions. – Wolpert, Macready, No
free lunch theorems for search
● Solution:
● Let algorithms play to their own strengths on particular
problems,
● remove their weaknesses, then
● combine their predictions.
12. RANDOM FORESTS 1/2
● A Random Forest is an ensemble of decision trees.
● "Bagging predictors is a method for generating multiple
versions of a predictor and using these to get an
aggregated predictor." - "Bagging Predictors". Breiman
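A minimal sketch of the idea: a Random Forest bags many decision trees (bootstrap samples plus random feature subsets) and aggregates their votes. The synthetic dataset is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# An ensemble of 100 bagged decision trees, fitted in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
score = cross_val_score(forest, X, y, cv=5).mean()
print(round(score, 3))
```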
13. RANDOM FORESTS 2/2
● Strength: Relatively fast. Can be fitted in parallel.
● Easy to tune.
● Easy to inspect.
● Easy to explore data with.
● Good to benchmark against.
● One of the most powerful general ML algorithms.
● You can introduce randomness.
● Weakness: Memory-heavy (so use bagging).
● Popular (so use RGF and Extremely Randomized Trees)
14. GBM 1/2
● Gradient Boosted Decision Trees train weak predictors
on samples that previous predictors got wrong.
● "A method is described for converting a weak learning
algorithm [the learner can produce an hypothesis that
performs only slightly better than random guessing] into
one that achieves arbitrarily high accuracy." "The strength
of weak learnability." - Schapire
15. GBM 2/2
● Strength:
● Can achieve very good results
● Can model very complex problems
● Works on a wide variety of problems.
● Weakness:
● Slower to run (use XGBoost).
● Tricky to tune (start with max trees, tune eta, tune depth)
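The tuning order above can be sketched with Scikit-Learn's GBM (the slide recommends XGBoost; the API differs, but the knobs map directly — the dataset and parameter values here are assumptions): fix a generous number of trees, then tune the learning rate (eta), then the tree depth.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

gbm = GradientBoostingClassifier(
    n_estimators=300,    # "max trees": set high, lean on shrinkage
    learning_rate=0.05,  # eta: contribution of each tree
    max_depth=3,         # depth: controls interaction order
    random_state=0,
)
score = cross_val_score(gbm, X, y, cv=5).mean()
print(round(score, 3))
```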
16. SVM
● Classification and regression using support vectors.
● "Nothing is more practical than a good theory." The Nature
of Statistical Learning Theory, Vapnik
● Strength:
● Strong theoretical guarantees
● Tuning regularization parameter can prevent overfit
● Uses the kernel trick. Turn linear solvers into non-linear
solvers. Build custom kernels.
● Weakness:
● Requires a gridsearch. (Develop intuition or new algo!)
● Too slow on large data (use stratified subsampling)
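The grid search the slide warns about typically tunes the regularization parameter C and the RBF kernel width gamma jointly — a minimal sketch, with grid values and dataset chosen as illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small data on purpose: on large data, subsample (stratified) first.
X, y = make_classification(n_samples=300, n_features=10, random_state=2)

grid = GridSearchCV(
    SVC(kernel="rbf"),  # kernel trick: non-linear boundary from a linear solver
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```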
17. KNN
● Look at distance to nearest neighbors
● "The nearest neighbor decision rule assigns to an
unclassified sample point the classification of the nearest of
a set of previously classified points." Nearest neighbor
pattern classification, Cover et al.
● Strength:
● Nonlinear
● Basic
● Easy to tune
● Different / unpopular.
● Weakness: Slow and does not perform well in general. (so
use for stacking or finding near-duplicates)
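The near-duplicate trick mentioned above can be sketched with a nearest-neighbors query: look up each sample's closest *other* sample and flag pairs at (near-)zero distance. The toy data is an assumption; rows 0 and 2 are planted duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 1.0], [5.0, 5.0], [0.0, 1.0], [9.0, 2.0]])

# n_neighbors=2: column 0 of the result is the point itself (distance 0),
# so column 1 is the nearest *other* sample.
nn = NearestNeighbors(n_neighbors=2).fit(X)
dist, idx = nn.kneighbors(X)
dupes = [(i, int(idx[i, 1])) for i in range(len(X)) if dist[i, 1] < 1e-9]
print(dupes)
```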
19. ENSEMBLING
● Ensembling combines multiple models to (hopefully)
outperform any individual members.
● Ensembling (stacked generalization) won the $1 million
Netflix Prize competition.
● Ensembling reduces overfit and improves generalization
performance.
● Tips:
● Use diverse models
● Use many models
● Don't leak any information (use stratified out-of-fold predictions)
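The "diverse models" tip can be sketched with a soft-voting ensemble over three deliberately different learners — linear, tree-based, and instance-based (the model choices and dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=15, random_state=3)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",  # average predicted class probabilities
)
score = cross_val_score(vote, X, y, cv=5).mean()
print(round(score, 3))
```

Diversity matters because the vote only helps where the members make *different* mistakes.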
20. Automatic stacked ensembling
● Combining 100s of automatically created models to
improve accuracy and generalization performance.
● "Hodor!" - Hodor.
● Strength:
● - Won this Kaggle competition :)
● - Robust / good generalization
● - No tuning
● - Incremental accuracy-increasing predictions
● Weakness: Unwieldy, Dim-witted, Slow, Redundant.
21. Automatic stacked ensembling
● Step 1 (Generalization)
● Create out-of-fold predictions for the train set and
predictions for the test set for:
● Different algorithms
● Different parameters
● Different sampling
● Step 2 (Stacking)
● Add preds to original features and train a GBM or RF on
this.
● Step 3 (Model Selection)
● Brute-force averaging of predictors.
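Steps 1 and 2 above can be sketched with Scikit-Learn's `cross_val_predict`, which yields out-of-fold predictions: no row is ever predicted by a model that saw it during training. The base models and dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=400, n_features=15, random_state=4)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 1 (Generalization): out-of-fold class probabilities per base model.
base_models = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=50, random_state=0)]
oof = [cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
       for m in base_models]

# Step 2 (Stacking): add the predictions to the original features, train a GBM.
X_stacked = np.column_stack([X] + oof)
stacker = GradientBoostingClassifier(random_state=0)
score = cross_val_score(stacker, X_stacked, y, cv=cv).mean()
print(round(score, 3))
```

For a real submission, step 1 would also average each base model's test-set predictions across folds to build the matching stacked test features.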
23. LEAKAGE
● 'The introduction of information about the data mining
target, which should not be legitimately available to mine
from.'
● "Leakage in Data Mining: Formulation, Detection, and
Avoidance", Kaufman et al.
● 'one of the top ten data mining mistakes'
● "Handbook of Statistical Analysis and Data Mining
Applications." Nisbet et al.
24. LEAKAGE
● Exploiting Leakage
● In predictive modeling competitions: Allowed and beneficial
for results.
● In science or business: A very big no-no!
● In both: Accidental leakage exploitation (an RF picks up on
leakage automatically, or a KNN classifier finds duplicates).
25. LEAKAGE 1/2
● In this competition
● Look at ordering of training sample labels:
● - Classes (activity) cluster together.
● - Are these the different patients/subjects in the study?
● Exploits: Build better CV. Use subject meta-features.
26. LEAKAGE 2/2
● In this competition
● Look at ordering of test prediction file:
● - Class predictions again cluster together
● - Is the test set not randomized?
● Exploits: Change sequences to be more
uniform and look if that increases public
score consistently.
27. RESOURCES & FURTHER READING
● http://mlwave.com/kaggle-ensembling-guide/
● http://scikit-learn.org
● http://hunch.net/~vw/
● https://github.com/dmlc/xgboost
● https://www.youtube.com/watch?v=djRh0Rkqygw
[Ihler, Linear regression (5): Bias and variance]
● http://www.cs.nyu.edu/~mohri/mls/lecture_8.pdf
[Mohri, Foundations of Machine Learning]
● http://www.researchgate.net/profile/David_Wolpert/publication/2
[Wolpert, Stacked Generalization]