This document provides an overview and strategy for approaching Kaggle competitions using machine learning. It covers preparing data, using versatile libraries such as Scikit-Learn and XGBoost, and model-ensembling techniques such as voting, bagging, and stacking. It also explores algorithms including random forests, gradient boosted machines, and support vector machines, as well as potential issues such as leakage and how to address them. The goal is to create learning algorithms that are data-, problem-, and solution-agnostic to improve generalization.
2. APPROACH TO KAGGLE INCLASS COMPETITIONS
● 1) Get a good score as fast as possible by:
● Getting the raw data into a universal data format.
● Mostly CSV -> NumPy array / LibSVM-light format
● 2) Using versatile libraries:
● Scikit-Learn, Vowpal Wabbit, XGBoost.
● 3) Model ensembling
● Voting, Bagging, Boosting, Binning, Blending, Stacking
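The first step above — raw data into a universal format — can be sketched as follows. This is a minimal illustration, assuming a toy CSV with the label in the last column (the file contents and column layout are assumptions, not part of the deck):

```python
import numpy as np
from io import BytesIO, StringIO
from sklearn.datasets import dump_svmlight_file

# Toy stand-in for a competition train.csv: two features, label last.
csv_text = "0.28,-0.02,1\n-0.37,-0.09,0\n0.15,0.44,1\n"

data = np.loadtxt(StringIO(csv_text), delimiter=",")
X, y = data[:, :-1], data[:, -1]          # features, labels

# Dump to LibSVM-light format, usable by Vowpal Wabbit / XGBoost-style tools.
buf = BytesIO()
dump_svmlight_file(X, y, buf)
print(buf.getvalue().decode().splitlines()[0])
```

In practice `buf` would be a file opened in binary mode; everything downstream (Scikit-Learn, VW, XGBoost) can then consume the same arrays or files.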
3. STRATEGY
● Try to create "machine learning" learning algorithms
and optimized pipelines which are:
● Data-agnostic
● Problem-agnostic
● Solution-agnostic
● Automated
● Memory-friendly
● Robust, with good generalization.
4. FIRST OVERVIEW
● Problem type
● Classification? Regression?
● Evaluation metric
● Description
● Benchmark code
“Predict human activities based on their smartphone usage
pattern. Predict if a person is sitting, walking, etc., using
their smartphone activities.”
https://inclass.kaggle.com/c/smartphone-user-activity-prediction
5. FIRST OVERVIEW
● Data types
● Counts
● Text
● Categorical
● Numerical
● Dates
Quick preview: 0.28309984, -0.025501173, -0.11118051,
-0.37447712, -0.099567756, -0.20296558, -0.37631066,
-0.15016035, -0.18169451, -0.29308661, -0.14946642, …
6. FIRST OVERVIEW
● Data size
● Number of features?
● Number of train samples?
● Number of test samples?
● Online learning or offline learning?
● Linear problem or Non-linear?
7. BRANCH
● If issues with data:
● Clear up the issues (impute missing data, join tables,
parse JSON strings)
● Give up, and join another competition.
● If no issues with data:
● Get the raw data into NumPy arrays, we want:
● X_train (train set), y (labels), X_test (test set)
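The "no issues" branch can be sketched like this — a minimal example of producing the three arrays the rest of the pipeline expects, with missing-value imputation as the cleanup step (the toy arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy raw data with missing values.
raw_train = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
raw_test = np.array([[np.nan, 2.0], [3.0, 1.0]])
labels = np.array([0, 1, 1])

imputer = SimpleImputer(strategy="mean")   # clear up missing data
X_train = imputer.fit_transform(raw_train)  # train set
y = labels                                  # labels
X_test = imputer.transform(raw_test)        # test set, imputed with TRAIN statistics
```

Fitting the imputer on the train set and only transforming the test set keeps test information out of the preprocessing step.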
10. ALGORITHMS
● There is a bias-variance trade-off between simple models
and complex models.
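The trade-off can be made concrete by growing trees of increasing depth on noisy synthetic data (the dataset and depths are illustrative assumptions): a very shallow tree underfits, while an unrestricted tree drives training accuracy to 1.0 but generalizes worse than its training score suggests.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(400, 5)
y = (X[:, 0] + 0.3 * rng.randn(400) > 0.5).astype(int)  # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for depth in (1, 3, None):  # simple -> complex
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print(scores)  # {depth: (train accuracy, test accuracy)}
```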
11. ALGORITHMS
● There is No Free Lunch in machine learning.
● We show that all algorithms that search for an extremum of
a cost function perform exactly the same, when averaged
over all possible cost functions. – Wolpert, Macready, No
free lunch theorems for search
● Solution:
● Let algorithms play to their own strengths on particular
problems,
● remove their weaknesses, then
● combine their predictions.
12. RANDOM FORESTS 1/2
● A Random Forest is an ensemble of decision trees.
● "Bagging predictors is a method for generating multiple
versions of a predictor and using these to get an
aggregated predictor." - "Bagging Predictors". Breiman
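A minimal sketch of the idea: a Random Forest bags many decision trees (bootstrap samples plus random feature subsets) and aggregates their votes. The synthetic dataset is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# An ensemble of 100 bagged decision trees, fitted in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
score = cross_val_score(forest, X, y, cv=5).mean()
print(round(score, 3))
```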
13. RANDOM FORESTS 2/2
● Strength: Relatively fast. Can be fitted in parallel.
● Easy to tune.
● Easy to inspect.
● Easy to explore data with.
● Good to benchmark against.
● One of the most powerful general ML algorithms.
● You can introduce randomness.
● Weakness: Memory-heavy (so use bagging).
● Popular (so use RGF and Extremely Randomized Trees)
14. GBM 1/2
● Gradient Boosted Decision Trees train weak predictors
on samples that previous predictors got wrong.
● "A method is described for converting a weak learning
algorithm [the learner can produce an hypothesis that
performs only slightly better than random guessing] into
one that achieves arbitrarily high accuracy." "The strength
of weak learnability." - Schapire
15. GBM 2/2
● Strength:
● Can achieve very good results
● Can model very complex problems
● Works on a wide variety of problems.
● Weakness:
● Slower to run (use XGBoost).
● Tricky to tune (start with max trees, tune eta, tune depth)
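The tuning order above can be sketched with Scikit-Learn's GBM (the slide recommends XGBoost; the API differs, but the knobs map directly — the dataset and parameter values here are assumptions): fix a generous number of trees, then tune the learning rate (eta), then the tree depth.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

gbm = GradientBoostingClassifier(
    n_estimators=300,    # "max trees": set high, lean on shrinkage
    learning_rate=0.05,  # eta: contribution of each tree
    max_depth=3,         # depth: controls interaction order
    random_state=0,
)
score = cross_val_score(gbm, X, y, cv=5).mean()
print(round(score, 3))
```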
16. SVM
● Classification and regression using support vectors.
● "Nothing is more practical than a good theory." The Nature
of Statistical Learning Theory, Vapnik
● Strength:
● Strong theoretical guarantees
● Tuning regularization parameter can prevent overfit
● Uses the kernel trick. Turn linear solvers into non-linear
solvers. Build custom kernels.
● Weakness:
● Requires a gridsearch. (Develop intuition or new algo!)
● Too slow on large data (use stratified subsampling)
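The grid search the slide warns about typically tunes the regularization parameter C and the RBF kernel width gamma jointly — a minimal sketch, with grid values and dataset chosen as illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small data on purpose: on large data, subsample (stratified) first.
X, y = make_classification(n_samples=300, n_features=10, random_state=2)

grid = GridSearchCV(
    SVC(kernel="rbf"),  # kernel trick: non-linear boundary from a linear solver
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```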
17. KNN
● Look at distance to nearest neighbors
● "The nearest neighbor decision rule assigns to an
unclassified sample point the classification of the nearest of
a set of previously classified points." Nearest neighbor
pattern classification, Cover et al.
● Strength:
● Nonlinear
● Basic
● Easy to tune
● Different / unpopular.
● Weakness: Slow and does not perform well in general. (so
use for stacking or finding near-duplicates)
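The near-duplicate trick mentioned above can be sketched with a nearest-neighbors query: look up each sample's closest *other* sample and flag pairs at (near-)zero distance. The toy data is an assumption; rows 0 and 2 are planted duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 1.0], [5.0, 5.0], [0.0, 1.0], [9.0, 2.0]])

# n_neighbors=2: column 0 of the result is the point itself (distance 0),
# so column 1 is the nearest *other* sample.
nn = NearestNeighbors(n_neighbors=2).fit(X)
dist, idx = nn.kneighbors(X)
dupes = [(i, int(idx[i, 1])) for i in range(len(X)) if dist[i, 1] < 1e-9]
print(dupes)
```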
19. ENSEMBLING
● Ensembling combines multiple models to (hopefully)
outperform any individual members.
● Ensembling (stacked generalization) won the $1 million
Netflix Prize competition.
● Ensembling reduces overfit and improves generalization
performance.
● Tips:
● Use diverse models
● Use many models
● Don't leak any information (use stratified out-of-fold predictions)
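The "diverse models" tip can be sketched with a soft-voting ensemble over three deliberately different learners — linear, tree-based, and instance-based (the model choices and dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=15, random_state=3)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",  # average predicted class probabilities
)
score = cross_val_score(vote, X, y, cv=5).mean()
print(round(score, 3))
```

Diversity matters because the vote only helps where the members make *different* mistakes.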
20. Automatic stacked ensembling
● Combining 100s of automatically created models to
improve accuracy and generalization performance.
● "Hodor!" - Hodor.
● Strength:
● - Won this Kaggle competition :)
● - Robust / good generalization
● - No tuning
● - Incremental accuracy-increasing predictions
● Weakness: Unwieldy, Dim-witted, Slow, Redundant.
21. Automatic stacked ensembling
● Step 1 (Generalization)
● Create out-of-fold predictions for the train set and
predictions for the test set for:
● Different algorithms
● Different parameters
● Different sampling
● Step 2 (Stacking)
● Add preds to original features and train a GBM or RF on
this.
● Step 3 (Model Selection)
● Brute-force averaging of predictors.
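Steps 1 and 2 above can be sketched with Scikit-Learn's `cross_val_predict`, which yields out-of-fold predictions: no row is ever predicted by a model that saw it during training. The base models and dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=400, n_features=15, random_state=4)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 1 (Generalization): out-of-fold class probabilities per base model.
base_models = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=50, random_state=0)]
oof = [cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
       for m in base_models]

# Step 2 (Stacking): add the predictions to the original features, train a GBM.
X_stacked = np.column_stack([X] + oof)
stacker = GradientBoostingClassifier(random_state=0)
score = cross_val_score(stacker, X_stacked, y, cv=cv).mean()
print(round(score, 3))
```

For a real submission, step 1 would also average each base model's test-set predictions across folds to build the matching stacked test features.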
23. LEAKAGE
● 'The introduction of information about the data mining
target, which should not be legitimately available to mine
from.'
● "Leakage in Data Mining: Formulation, Detection, and
Avoidance", Kaufman et al.
● 'one of the top ten data mining mistakes'
● "Handbook of Statistical Analysis and Data Mining
Applications." Nisbet et al.
24. LEAKAGE
● Exploiting Leakage
● In predictive modeling competitions: Allowed and beneficial
for results.
● In science or business: A very big no-no!
● In both: Accidental leakage exploitation (an RF picks up on
leakage automatically, or a KNN classifier finds duplicates).
25. LEAKAGE 1/2
● In this competition
● Look at ordering of training sample labels:
● - Classes (activity) cluster together.
● - Are these the different patients/subjects in the study?
● Exploits: Build better CV. Use subject meta-features.
26. LEAKAGE 2/2
● In this competition
● Look at ordering of test prediction file:
● - Class predictions again cluster together
● - Is the test set not randomized?
● Exploits: Change sequences to be more
uniform and look if that increases public
score consistently.
27. RESOURCES & FURTHER READING
● http://mlwave.com/kaggle-ensembling-guide/
● http://scikit-learn.org
● http://hunch.net/~vw/
● https://github.com/dmlc/xgboost
● https://www.youtube.com/watch?v=djRh0Rkqygw
[Ihler, Linear regression (5): Bias and variance]
● http://www.cs.nyu.edu/~mohri/mls/lecture_8.pdf
[Mohri, Foundations of Machine Learning]
● http://www.researchgate.net/profile/David_Wolpert/publication/2
[Wolpert, Stacked Generalization]