Smartphone User Activity Prediction
HJ van Veen | Triskelion@Kaggle
MLWave.com
APPROACH TO KAGGLE INCLASS COMPETITIONS
● 1) Get a good score as fast as possible by:
● Getting the raw data into a universal data format.
● Mostly CSV -> NumPy array / LibSVM (SVMlight) format
● 2) Using versatile libraries:
● Scikit-Learn, Vowpal Wabbit, XGBoost.
● 3) Model ensembling
● Voting, Bagging, Boosting, Binning, Blending, Stacking
STRATEGY
● Try to create "machine learning" learning algorithms
and optimized pipelines which are:
● Data agnostic,
● Problem agnostic,
● Solution agnostic,
● Automated
● Memory-friendly
● Robust with good generalization.
FIRST OVERVIEW
● Problem type
● Classification? Regression?
● Evaluation metric
● Description
● Benchmark code
“Predict human activities based on their smartphone usage
pattern. Predict if a person is sitting, walking, etc, using
their smartphone activities”
https://inclass.kaggle.com/c/smartphone-user-activity-prediction
FIRST OVERVIEW
● Data types
● Counts
● Text
● Categorical
● Numerical
● Dates
Quick preview:
0.28309984,-0.025501173,-0.11118051,-0.37447712,-0.099567756,-0.20296558,-0.37631066,-0.15016035,-0.18169451,-0.29308661,-0.14946642, …
FIRST OVERVIEW
● Data size
● Number of features?
● Number of train samples?
● Number of test samples?
● Online learning or offline learning?
● Linear problem or Non-linear?
BRANCH
● If issues with data:
● Clear up issues with data (imputing missing data, joining
tables, eval a JSON string)
● Give up, and join another competition.
● If no issues with data:
● Get the raw data into NumPy arrays, we want:
● X_train (train set), y (labels), X_test (test set)
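
A minimal sketch of this step, assuming plain numeric CSV files. The inline strings stand in for the competition files, and placing the label in the last training column is an assumption for illustration, not the actual file layout.

```python
import io
import numpy as np

# Hypothetical stand-ins for the train/test CSV files; the label is
# assumed to sit in the last column of the training file.
train_csv = io.StringIO(
    "0.28,-0.02,-0.11,1\n"
    "0.37,-0.09,-0.20,2\n"
    "0.29,-0.14,-0.18,1\n"
)
test_csv = io.StringIO(
    "0.30,-0.01,-0.12\n"
    "0.35,-0.10,-0.21\n"
)

train = np.loadtxt(train_csv, delimiter=",")
X_train = train[:, :-1]           # every column except the last
y = train[:, -1].astype(int)      # last column holds the class label
X_test = np.loadtxt(test_csv, delimiter=",")
```

With real files, the `io.StringIO` objects would be replaced by file paths.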
TRANSFORMS & PREPROCESSING
● TRANSFORMS & SCALING
● TF-IDF Weighting
● Log scaling
● Minmax and standard-scaling
● PREPROCESSING
● Parse dates
● Concatenate text fields
● Impute missing values
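
The three scaling transforms can be sketched with NumPy alone. The toy matrix is made up; in practice the statistics (min, max, mean, std) would be computed on the train set and reused on the test set.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [3.0, 100.0],
              [5.0, 1000.0]])

# Log scaling: log1p also handles zeros; assumes non-negative values.
X_log = np.log1p(X)

# Min-max scaling: map each column to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard scaling: zero mean and unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```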
ALGORITHMS
● There is a bias-variance trade-off between simple models
and complex models.
ALGORITHMS
● There is No Free Lunch in machine learning.
● "We show that all algorithms that search for an extremum of
a cost function perform exactly the same, when averaged
over all possible cost functions." - Wolpert, Macready, "No
free lunch theorems for search"
● Solution:
● Let algorithms play to their own strengths for particular
problems and
● remove their weaknesses, then
● combine their predictions.
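
Two simple ways to combine predictions are averaging class probabilities and majority voting. The sketch below uses made-up probability outputs from three unnamed models; none of the numbers come from the competition.

```python
import numpy as np

# Hypothetical class probabilities from three models
# (4 samples, 3 classes; each row sums to 1).
p_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]])
p_b = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.2, 0.5, 0.3], [0.4, 0.5, 0.1]])
p_c = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]])

probs = np.stack([p_a, p_b, p_c])        # shape (n_models, n_samples, n_classes)

# Averaging: mean probability per class, then pick the most likely class.
pred_avg = probs.mean(axis=0).argmax(axis=1)

# Voting: each model casts one vote per sample; the most frequent class wins.
votes = probs.argmax(axis=2)             # shape (n_models, n_samples)
n_classes = probs.shape[2]
pred_vote = np.array(
    [np.bincount(col, minlength=n_classes).argmax() for col in votes.T]
)
```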
RANDOM FORESTS 1/2
● A Random Forest is an ensemble of decision trees.
● "Bagging predictors is a method for generating multiple
versions of a predictor and using these to get an
aggregated predictor." - "Bagging Predictors". Breiman
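
Bagging as described in the quote can be sketched in a few lines: fit the same base predictor on bootstrap replicates of the train set and average the results. A least-squares line is used here only to keep the example short; a Random Forest applies the same idea to decision trees. Data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

def fit_line(x_fit, y_fit):
    """Base predictor: least-squares line, returned as (slope, intercept)."""
    slope, intercept = np.polyfit(x_fit, y_fit, deg=1)
    return slope, intercept

def bagged_predict(x_train, y_train, x_new, n_models=25):
    preds = []
    for _ in range(n_models):
        # Bootstrap replicate: draw n points with replacement.
        idx = rng.integers(0, x_train.size, size=x_train.size)
        slope, intercept = fit_line(x_train[idx], y_train[idx])
        preds.append(slope * x_new + intercept)
    # Aggregate the multiple versions of the predictor by averaging.
    return np.mean(preds, axis=0)

x_new = np.array([0.25, 0.75])
y_bagged = bagged_predict(x, y, x_new)
```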
RANDOM FORESTS 2/2
● Strength: Relatively fast. Can be fitted in parallel.
● Easy to tune.
● Easy to inspect.
● Easy to explore data with.
● Good to benchmark against.
● One of the most powerful general ML algorithms.
● You can introduce randomness.
● Weakness: Memory-heavy (so use bagging).
● Popular (So use RGF and Extremely Randomized Trees)
GBM 1/2
● Gradient Boosted Decision Trees train weak predictors
sequentially, each fitted to the errors left by the previous ones.
● "A method is described for converting a weak learning
algorithm [the learner can produce an hypothesis that
performs only slightly better than random guessing] into
one that achieves arbitrarily high accuracy." "The strength
of weak learnability." - Schapire
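
A toy illustration of the boosting idea for squared loss: each weak predictor (a one-split decision stump) is fitted to the residuals of the ensemble so far. This is a sketch on made-up 1-D data, not the implementation used by XGBoost or Scikit-Learn; `eta` follows the name used for the learning rate later in this deck.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single threshold split on 1-D x for the residual (squared error)."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so the right side is non-empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_rounds=50, eta=0.1):
    pred = np.full_like(y, y.mean())     # start from the mean prediction
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred              # negative gradient of the squared loss
        t, left_val, right_val = fit_stump(x, residual)
        pred = pred + eta * np.where(x <= t, left_val, right_val)
        stumps.append((t, left_val, right_val))
    return pred, stumps

x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
pred, stumps = boost(x, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
```

Comparing `mse_start` with `mse_end` shows how the training error shrinks as weak predictors are added.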
GBM 2/2
● Strength:
● Can achieve very good results
● Can model very complex problems
● Works on a wide variety of problems.
● Weakness:
● Slower to run (use XGBoost).
● Tricky to tune (start with max trees, tune eta, tune depth)
SVM
● Classification and regression using support vectors.
● "Nothing is more practical than a good theory." The Nature
of Statistical Learning Theory, Vapnik
● Strength:
● Strong theoretical guarantees
● Tuning regularization parameter can prevent overfit
● Uses the kernel trick. Turn linear solvers into non-linear
solvers. Build custom kernels.
● Weakness:
● Requires a grid search. (Develop intuition or new algo!)
● Too slow on large data (use stratified subsampling)
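
A sketch of both remedies with Scikit-Learn: a small grid search over `C` and `gamma`, run on a stratified subsample. The synthetic data, the grid values and the 50% subsample size are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a real train set.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)

# Stratified subsample: keep half of the rows, preserving class proportions.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.5, stratify=y, random_state=0)

# Grid search over the regularization strength C and the RBF kernel width gamma.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
search.fit(X_sub, y_sub)

best_params = search.best_params_
best_score = search.best_score_
```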
KNN
● Look at distance to nearest neighbors
● "The nearest neighbor decision rule assigns to an
unclassified sample point the classification of the nearest of
a set of previously classified points." Nearest neighbor
pattern classification, Cover and Hart
● Strength:
● Nonlinear
● Basic
● Easy to tune
● Different / unpopular.
● Weakness: Slow and does not perform well in general. (so
use for stacking or finding near-duplicates)
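
The nearest neighbor rule and the near-duplicate use case can be sketched with NumPy. The points, labels and the distance threshold are made up for illustration.

```python
import numpy as np

def nearest_neighbor(X_train, X_query):
    """Index of, and distance to, the nearest training row for each query row."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(X_query)), idx]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
X_query = np.array([[0.9, 1.1], [5.0, 5.0001]])

idx, dist = nearest_neighbor(X_train, X_query)
y_pred = y_train[idx]            # 1-NN rule: copy the label of the nearest point
near_duplicate = dist < 1e-3     # hypothetical threshold for "almost the same row"
```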
OTHERS
● Logistic Regression
● Stochastic Gradient Descent
● Ridge Regression
● Naive Bayes
● Artificial Neural Nets
● Matrix Factorization, SVD
● Quantile Regression
● AdaBoosting
● Genetic Algorithms
● Perceptrons
ENSEMBLING
● Ensembling combines multiple models to (hopefully)
outperform any individual members.
● Ensembling (stacked generalization) won the $1 million
Netflix Prize.
● Ensembling reduces overfit and improves generalization
performance.
● Tips:
● Use diverse models
● Use many models
● Don't leak any information (stratified out-of-fold predictions)
Automatic stacked ensembling
● Combining 100s of automatically created models to
improve accuracy and generalization performance.
● "Hodor!" - Hodor.
● Strength:
● - Won this Kaggle competition :)
● - Robust / good generalization
● - No tuning
● - Incremental accuracy-increasing predictions
● Weakness: Unwieldy, Dim-witted, Slow, Redundant.
Automatic stacked ensembling
● Step 1 (Generalization)
● Create out-of-fold predictions for the train set and
predictions for the test set for:
● Different algorithms
● Different parameters
● Different sampling
● Step 2 (Stacking)
● Add preds to original features and train a GBM or RF on
this.
● Step 3 (Model Selection)
● Brute-force averaging of predictors.
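
The three steps can be sketched with Scikit-Learn on synthetic binary data. This is a small illustration under assumptions, not the pipeline shown in the demo: the base models, fold count and stacker are arbitrary, and Step 3 is reduced to a single plain average because the brute-force search itself is not described on the slide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

base_models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

def out_of_fold(model, X_train, y_train, X_test, n_splits=5):
    """Out-of-fold train predictions plus fold-averaged test predictions."""
    oof = np.zeros(len(X_train))
    test_pred = np.zeros(len(X_test))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, hold_idx in skf.split(X_train, y_train):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Each train row is predicted by a model that never saw its label.
        oof[hold_idx] = model.predict_proba(X_train[hold_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_pred

# Step 1: out-of-fold predictions for every base model.
preds = [out_of_fold(m, X_train, y_train, X_test) for m in base_models]
oof_all = np.column_stack([p[0] for p in preds])
test_all = np.column_stack([p[1] for p in preds])

# Step 2: add the predictions to the original features and train a GBM on them.
stacker = GradientBoostingClassifier(random_state=0)
stacker.fit(np.hstack([X_train, oof_all]), y_train)
stacked_proba = stacker.predict_proba(np.hstack([X_test, test_all]))[:, 1]

# Step 3 (simplified): average the stacker with the mean of the base models.
final_proba = (stacked_proba + test_all.mean(axis=1)) / 2
final_pred = (final_proba >= 0.5).astype(int)
```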
Automatic stacked ensembling
● DEMO
LEAKAGE
● 'The introduction of information about the data mining
target, which should not be legitimately available to mine
from.'
● "Leakage in Data Mining. - Formulation, Detection, and
Avoidance" Kaufman et al.
● 'one of the top ten data mining mistakes'
● "Handbook of Statistical Analysis and Data Mining
Applications." Nisbet et al.
LEAKAGE
● Exploiting Leakage
● In predictive modeling competitions: Allowed and beneficial
for results.
● In science or business: A very big no-no!
● In both: Accidental leakage exploitation. RF finds leakage
automatically or KNN-classifier finds duplicates.
LEAKAGE 1/2
● In this competition
● Look at ordering of training sample labels:
● - Classes (activity) cluster together.
● - These are the different patients/subjects in the study?
● Exploits: Build better CV. Use subject meta-features.
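
One way to "build better CV" when rows belong to subjects is to keep each subject entirely on one side of every split, for example with Scikit-Learn's `GroupKFold`. The subject ids below are invented; whether and how they can be recovered from the label ordering is the open question on this slide.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_samples = 60

# Hypothetical subject ids: 6 made-up subjects with 10 consecutive rows each.
subject = np.repeat(np.arange(6), 10)
X = rng.normal(size=(n_samples, 4))
y = rng.integers(0, 3, size=n_samples)

gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups=subject))

for fit_idx, hold_idx in folds:
    # No subject appears in both the fitting rows and the held-out rows.
    assert not set(subject[fit_idx]) & set(subject[hold_idx])
```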
LEAKAGE 2/2
● In this competition
● Look at ordering of test prediction file:
● - Class predictions again cluster together
● - Is the test set not randomized?
● Exploits: Change sequences to be more
uniform and look if that increases public
score consistently.
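
One possible reading of "change sequences to be more uniform" is a sliding-window majority vote over the ordered test predictions. The sketch below is an assumption about how that could look, with made-up labels and an arbitrary window size; whether it helps would still have to be checked against the public score.

```python
from collections import Counter

def smooth_predictions(preds, window=5):
    """Replace each prediction by the majority label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(preds)):
        neighborhood = preds[max(0, i - half): i + half + 1]
        label, _ = Counter(neighborhood).most_common(1)[0]
        smoothed.append(label)
    return smoothed

# Hypothetical ordered test predictions with a few isolated flips.
preds = ["walk", "walk", "sit", "walk", "walk", "walk",
         "sit", "sit", "walk", "sit", "sit", "sit"]
smoothed = smooth_predictions(preds)
```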
RESOURCES & FURTHER READING
● http://mlwave.com/kaggle-ensembling-guide/
● http://scikit-learn.org
● http://hunch.net/~vw/
● https://github.com/dmlc/xgboost
● https://www.youtube.com/watch?v=djRh0Rkqygw
[Ihler, Linear regression (5): Bias and variance]
● http://www.cs.nyu.edu/~mohri/mls/lecture_8.pdf
[Mohri, Foundations of Machine Learning]
● http://www.researchgate.net/profile/David_Wolpert/publication/2
[Wolpert, Stacked Generalization]
