Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux

PyParis 2017
http://pyparis.org

Published in: Technology

  1. 1. Scikit-learn: an incomplete yearly review. Gaël Varoquaux. scikit-learn: machine learning in Python
  2. 2. Trends with: 1 The library; 2 The community. G Varoquaux 2
  3. 3. 1 The library. scikit-learn: machine learning in Python
  4. 4. 1 In 0.18: oldies but goodies
  5. 5. 1 In 0.18: oldies but goodies. New cross-validation objects (V.R. Rajagopalan):
        from sklearn.cross_validation import StratifiedKFold
        cv = StratifiedKFold(y, n_folds=2)
        for train, test in cv:
            X_train = X[train]
            y_train = y[train]
  6. 6. 1 In 0.18: oldies but goodies. New cross-validation objects (V.R. Rajagopalan):
        from sklearn.model_selection import StratifiedKFold
        cv = StratifiedKFold(n_splits=2)
        for train, test in cv.split(X, y):
            X_train = X[train]
            y_train = y[train]
        ⇒ better nested-CV
  7. 7. 1 In 0.18: oldies but goodies. New cross-validation objects (V.R. Rajagopalan). PCA == Randomized PCA (G. Patrini): heuristic to switch PCA to random linear algebra. Fights global warming. Huge speed gains for biggish data.
  8. 8. 1 Coming soon: merged in master. Memory in pipeline (G. Lemaitre): make_pipeline(PCA(), LinearSVC(), memory='/tmp/joe'). Limits recomputation (e.g. in grid search).
  9. 9. 1 Coming soon: merged in master. Memory in pipeline (G. Lemaitre). New solver for logistic regression, SAGA (A. Mensch): linear_model.LogisticRegression(solver='saga'). Fast linear model on biggish data. [Figure: training objective, SAGA vs. Liblinear, on RCV1]
  10. 10. 1 Coming soon: merged in master. Memory in pipeline (G. Lemaitre). New solver for logistic regression, SAGA (A. Mensch). Quantile transformer (G. Lemaitre). [Figure: number of households vs. median income, with a color mapping for values of y]
  11. 11. 1 Coming soon: merged in master. Memory in pipeline (G. Lemaitre). New solver for logistic regression, SAGA (A. Mensch). Quantile transformer (G. Lemaitre). [Figure: two panels of number of households vs. median income, with a color mapping for values of y; the first on the raw scale, the second with both axes on a 0 to 1 scale]
  12. 12. 1 Coming soon: merged in master. Memory in pipeline (G. Lemaitre). New solver for logistic regression, SAGA (A. Mensch). Quantile transformer (G. Lemaitre). Local outlier factor (N. Goix). [Figure: normal vs. abnormal points]
  13. 13. 1 Coming soon: merged in master. Memory in pipeline (G. Lemaitre). New solver for logistic regression, SAGA (A. Mensch). Quantile transformer (G. Lemaitre). Local outlier factor (N. Goix). Memory savings: avoid casting, work with float32 (J. Massich, A. Imbert); T-SNE, in progress (T. Moreau).
  14. 14. 1 To come, maybe. ColumnsTransformer (J. Van den Bossche): pandas in ... feature engineering ... array out.
          transformer = make_column_transformer({
              StandardScaler(): ['age'],
              OneHotEncoder(): ['company']
          })
          array = transformer.fit_transform(data_frame)
  15. 15. 1 To come, maybe. ColumnsTransformer (J. Van den Bossche). Faster trees, forests & boosting (V.R. Rajagopalan, G. Lemaitre). Teachings from XGBoost, lightgbm: bin features for discrete values; depth-first tree, for access locality.
  16. 16. 1 Scaling out. Infrastructure: using many computers (cloud, elastic computing); orchestration, data distribution; integration in corporate infrastructure (Hadoop, queues, services). joblib backends: parallel computing with Loky (robust single-machine process pool) and Distributed (Yarn, dask, CMFActivity); storage (S3, HDFS).
  17. 17. 1 Continuous integration: testing under numpy & scipy dev (A. Mueller)
  18. 18. 1 Scikit-learn-contrib: scaling the scikit-learn universe quicker. https://github.com/scikit-learn-contrib. py-earth: multivariate adaptive regression splines; imbalanced-learn: under-sampling and over-sampling; lightning: fast linear models; polylearn: factorization machines and polynomial networks; hdbscan: high-performance clustering; forest-confidence-interval: confidence interval for forests; boruta_py: boruta feature selection.
  19. 19. 1 Scikit-learn-contrib: scaling the scikit-learn universe quicker. https://github.com/scikit-learn-contrib. py-earth: multivariate adaptive regression splines; imbalanced-learn: under-sampling and over-sampling; lightning: fast linear models; polylearn: factorization machines and polynomial networks; hdbscan: high-performance clustering; forest-confidence-interval: confidence interval for forests; boruta_py: boruta feature selection. sklearn.utils.estimator_checks.check_estimator
  20. 20. 2 The community: users & developers
  21. 21. 2 User base: 350 000 returning users; 5 000 citations
  22. 22. 2 User base: 350 000 returning users; 5 000 citations. [Figure: two pie charts. OS (Windows, Mac, Linux): 50%, 20%, 30%. Employer (Industry, Academia, Other): 63%, 3%, 34%]
  23. 23. 2 User base. [Figure: number of PyPI downloads over time, for the year up to June 2017]
  24. 24. 2 User base. [Figure: number of PyPI downloads over time, for the year up to June 2017, comparing numpy, pandas, scikit-learn, django, and flask]
  25. 25. 2 In the Python ecosystem. [Figure: number of PyPI downloads vs. package rank, on logarithmic axes]
  26. 26. 2 In the Python ecosystem. [Figure: number of PyPI downloads vs. package rank, on logarithmic axes, with numpy, scikit-learn, joblib, simplejson, six, and setuptools labeled]
  27. 27. 2 Core software is infrastructure. Everybody uses it every day, in industry, education, & research. "Roads and Bridges": Ford Foundation report. Excellent talk by Heather Miller: https://www.youtube.com/watch?v=17yy5BwIiTw
  28. 28. 2 Community-based development in scikit-learn. Active development team. [Figure: monthly contributors, 2010 to 2016] https://www.openhub.net/p/scikit-learn
  29. 29. 2 Funding & spending, 2015 & 2016.
          New York (A. Mueller): $350 000 Moore-Sloan grant. A. Mueller (full time). Students: M. Kumar, V. Birodkar.
          Telecom ParisTech (A. Gramfort): 200 000 € WendelinIA grant + 12 000 € CDS. Programmers: T. Guillemot, T. Dupré. Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix.
          Inria Parietal (G. Varoquaux): 120 000 € Inria + 100 000 € WendelinIA + 50 000 € ANR + 30 000 € CDS. Programmers: O. Grisel, L. Esteve (programmer), G. Lemaitre, J. Van den Bossche. Students: A. Mensch, J. Schreiber, G. Patrini.
          > 400 000 €/yr
  30. 30. 2 Funding & spending, 2015 & 2016.
          New York (A. Mueller): $350 000 Moore-Sloan grant. A. Mueller (full time). Students: M. Kumar, V. Birodkar.
          Telecom ParisTech (A. Gramfort): 200 000 € WendelinIA grant + 12 000 € CDS. Programmers: T. Guillemot, T. Dupré. Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix.
          Inria Parietal (G. Varoquaux): 120 000 € Inria + 100 000 € WendelinIA + 50 000 € ANR + 30 000 € CDS. Programmers: O. Grisel, L. Esteve (programmer), G. Lemaitre, J. Van den Bossche. Students: A. Mensch, J. Schreiber, G. Patrini.
          > 400 000 €/yr
  31. 31. 2 Sustainability
  32. 32. 2 Sustainability. Educating decision makers: not funding your infrastructure is a risk. A foundation. Danger: governance, focus on features for the rich. We need partners, good ones.
  33. 33. @GaelVaroquaux. Scikit-learn: machine learning for everyone, from beginner to expert. Ongoing progress: faster models (algorithmics, float32); easier usage (better pandas integration); coupling to infrastructure (via joblib); thinking about sustainability & partnership.
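
Slides 5 and 6 contrast the old and new cross-validation objects. A minimal sketch of the new style follows, assuming a toy dataset (the arrays `X` and `y` and the list `fold_sizes` are illustrative, not from the talk). In released versions of `sklearn.model_selection`, the number of folds is passed as `n_splits`, and the data is only supplied at `split` time.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption for illustration): 8 samples, 2 balanced classes.
X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The splitter is built without data; X and y are passed to split().
cv = StratifiedKFold(n_splits=2)

fold_sizes = []
for train, test in cv.split(X, y):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
    fold_sizes.append((len(train), len(test)))
```

Because the splitter no longer holds the labels, the same `cv` object can be reused on an inner training subset, which is presumably what makes nested cross-validation easier.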
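
Slide 8 mentions that pipeline memory limits recomputation, e.g. in grid search. The sketch below shows one way this might be used. The synthetic dataset, the temporary cache directory (instead of the slide's '/tmp/joe'), and the parameter grid are all assumptions made for illustration.

```python
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Fitted transformers are cached on disk in this directory.
cache_dir = tempfile.mkdtemp()
pipe = make_pipeline(PCA(n_components=5), LinearSVC(random_state=0),
                     memory=cache_dir)

# Only the classifier's C varies, so the PCA fit for a given training
# fold can be loaded from the cache instead of being recomputed.
search = GridSearchCV(pipe, {"linearsvc__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
best_C = search.best_params_["linearsvc__C"]

# Clean up the cache once it is no longer needed.
shutil.rmtree(cache_dir, ignore_errors=True)
```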
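
Slide 9 introduces the SAGA solver for logistic regression. A small sketch of selecting it, assuming synthetic data rather than the RCV1 benchmark shown on the slide; the scaling step and the `max_iter` value are choices made here to help convergence, not settings from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a larger dataset (illustration only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stochastic solvers such as SAGA tend to converge faster on scaled features.
X = StandardScaler().fit_transform(X)

clf = LogisticRegression(solver="saga", max_iter=1000, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```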
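
Slides 10 and 11 show the quantile transformer on housing data. The sketch below applies `sklearn.preprocessing.QuantileTransformer` to skewed synthetic features instead (the log-normal data and the `n_quantiles` value are assumptions for illustration), mapping each feature onto a roughly uniform distribution on [0, 1].

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Heavy-tailed synthetic features, standing in for the housing data.
rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 2))

# Map each feature through its empirical quantiles to a uniform distribution.
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform",
                         random_state=0)
X_uniform = qt.fit_transform(X)
```

Since the mapping depends only on ranks, extreme values are pulled into the unit interval rather than dominating the scale.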
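
Slide 12 lists the local outlier factor for telling normal from abnormal points. A minimal sketch using `sklearn.neighbors.LocalOutlierFactor`, assuming a synthetic cluster with one far-away point appended (the data and `n_neighbors=20` are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A tight cluster of "normal" points plus one obviously "abnormal" point.
rng = np.random.RandomState(0)
inliers = 0.3 * rng.randn(100, 2)
outlier = np.array([[6.0, 6.0]])
X = np.vstack([inliers, outlier])

# fit_predict labels each sample: 1 for inliers, -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
```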
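
Slide 14 sketches a "pandas in, array out" column transformer using a dict-based call, which was a proposal at the time of the talk. The example below instead uses `sklearn.compose.make_column_transformer` as later released, which takes `(transformer, columns)` tuples; the small data frame is an assumption for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data frame with one numeric and one categorical column.
data_frame = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "company": ["a", "b", "a", "c"],
})

# Each transformer is applied to its own subset of columns, and the
# results are concatenated column-wise.
transformer = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(), ["company"]),
)
array = transformer.fit_transform(data_frame)
```

The output has one scaled column for `age` and one indicator column per distinct `company` value.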
