Spark & Machine Learning Meetup
Hyperparameter Optimization - when scikit-learn meets PySpark
Sven Hafeneger
27.10.2016
©2015 IBM Corporation 1 November 2016
Data Science Workflow
Wikipedia https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Data Science Workflow
knobs to tune !
Wikipedia https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
https://www.okwenclosures.com/en/Potentiometer-Tuning-knobs/Top-Knobs.htm
Data Science Workflow - Modeling
Model
Improves robustness
Influences complexity
Helps with class imbalances
https://www.kvraudio.com/forum/viewtopic.php?t=328938
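As a rough illustration of such knobs: the slide names no estimator, so the sketch below assumes a scikit-learn RandomForestClassifier (plausible, since max_depth and n_estimators come up later in the deck). The mapping of the three effects to specific parameters is an assumption made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed mapping of the slide's three effects to random forest hyperparameters.
model = RandomForestClassifier(
    n_estimators=200,         # more trees: averaging tends to improve robustness
    max_depth=15,             # depth limit: influences model complexity
    class_weight="balanced",  # reweights classes: helps with class imbalances
    random_state=0,
)

# Hyperparameters are fixed before training; they are not learned from the data.
params = model.get_params()
print({k: params[k] for k in ("n_estimators", "max_depth", "class_weight")})
```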
What is Hyperparameter Optimization?
• “… is the problem of choosing a set of hyperparameters for a learning algorithm, …” [1]
• Grid search
• Random search
• …
https://openclipart.org/detail/194603/grid-search-pattern
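The difference between the two strategies can be sketched with scikit-learn's ParameterGrid and ParameterSampler. The search space below is hypothetical and chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space, chosen only for illustration.
space = {"max_depth": [5, 10, 15], "n_estimators": [50, 100, 200]}

# Grid search enumerates every combination: 3 * 3 = 9 candidates.
grid_candidates = list(ParameterGrid(space))

# Random search draws a fixed number of candidates from the same space.
random_candidates = list(ParameterSampler(space, n_iter=4, random_state=0))

print(len(grid_candidates), "grid candidates,", len(random_candidates), "random candidates")
```

Grid search cost grows with the product of all value lists, while random search cost is set directly by n_iter, which is the trade-off discussed in [1].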
http://25.media.tumblr.com/tumblr_lcelmoEfoX1qbl1tko1_400.jpg
Gridsearch with scikit-learn
Build a classification model
We have some data and a classification problem
Gridsearch with scikit-learn
… well ... yes ... overfitted !
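The code shown on these slides is not part of the transcript. A minimal sketch of the situation, assuming synthetic data from make_classification and a RandomForestClassifier with default (unlimited) tree depth, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "some data and a classification problem" (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fully grown trees (max_depth=None by default) can memorize the training set.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

A large gap between training and test accuracy is the usual sign of overfitting, and it is the test score that tuning aims to improve.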
Gridsearch with scikit-learn
Improve test scores !
Gridsearch with scikit-learn
~ 500 jobs
~ 13 mins
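The number of jobs in a grid search is the number of parameter combinations times the number of cross-validation folds. The grid actually used in the talk is not given in the transcript; the one below is a hypothetical grid that happens to produce 500 fits.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 10 depths x 10 forest sizes = 100 candidates.
param_grid = {
    "max_depth": list(range(5, 55, 5)),        # 5, 10, ..., 50
    "n_estimators": list(range(20, 220, 20)),  # 20, 40, ..., 200
}
cv_folds = 5  # assumed number of cross-validation folds

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds  # each candidate is fitted once per fold
print(n_candidates, "candidates x", cv_folds, "folds =", n_fits, "fits")
```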
Gridsearch with scikit-learn
Return (best) model
Accuracy: 0.44 => 0.76
max_depth=15
n_estimators=200
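A minimal sketch of how a grid search returns the best model, using scikit-learn's GridSearchCV. The data, the small grid, and the fold count are assumptions chosen to keep the example fast; the scores will not match the numbers on the slide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data as a stand-in for the talk's dataset (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Small illustrative grid over the two hyperparameters named on the slide.
param_grid = {"max_depth": [3, 15], "n_estimators": [20, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# With refit=True (the default) the best candidate is retrained on all training data.
best_model = search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(search.best_params_, test_score)
```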
Gridsearch with spark-sklearn
What if you have access to a Spark cluster ?
Distribute the workload on the cluster !
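A hedged sketch of what the switch could look like. It assumes the spark-sklearn package, whose GridSearchCV takes a SparkContext as its first argument; the helper make_grid_search is hypothetical and falls back to plain scikit-learn when no SparkContext is passed, so the example also runs without a cluster.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def make_grid_search(estimator, param_grid, sc=None, cv=3):
    """Hypothetical helper: distributed search if a SparkContext is given."""
    if sc is not None:
        # Assumption: the spark-sklearn package is installed on the driver.
        from spark_sklearn import GridSearchCV as SparkGridSearchCV
        return SparkGridSearchCV(sc, estimator, param_grid, cv=cv)
    from sklearn.model_selection import GridSearchCV
    return GridSearchCV(estimator, param_grid, cv=cv)


# Synthetic data and a small grid, both chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
param_grid = {"max_depth": [3, 15], "n_estimators": [10, 20]}

# Without a SparkContext this runs locally with scikit-learn.
search = make_grid_search(RandomForestClassifier(random_state=0), param_grid)
search.fit(X, y)
print(search.best_params_)
```

In a PySpark session, passing the existing context (for example make_grid_search(estimator, param_grid, sc=sc)) would hand the individual fits to the cluster, while the calling code stays the same.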
Save time !
Concentrate on more important problems …
https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html
Data Science Workflow
Faster cycles !
Try it out
Source: [6]
https://pypi.python.org
Try it out
References
[1]: Bergstra, James; Bengio, Yoshua (2012). "Random Search for Hyper-Parameter Optimization". Journal of Machine Learning Research 13: 281–305.
Thanks !
