Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux

Pôle Systematic Paris-Region
Scikit-learn: an incomplete yearly review
Gaël Varoquaux
scikit
machine learning in Python
Trends with
1 The library
2 The community
G Varoquaux 2
1 The library
1 In 0.18 oldies but goodies
New cross-validation objects V.R. Rajagopalan
Old API (sklearn.cross_validation):
    from sklearn.cross_validation import StratifiedKFold
    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]
New API (sklearn.model_selection; the parameter was released as n_splits):
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=2)
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]
⇒ better nested-CV
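One reading of "better nested-CV": the new splitters are built without the data, so the same kind of object can drive both an inner parameter search and an outer evaluation, even though the inner loop only sees a subset of the samples. A minimal sketch under that assumption; the iris data, the linear SVC and the grid of C values are illustrative choices, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Splitters no longer take y at construction, so they are not tied to one dataset size.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: choose C on each outer training fold.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: score the whole search procedure on held-out folds.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```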
PCA == Randomized PCA G. Patrini
Heuristic to switch PCA to randomized linear algebra
Fights global warming
Huge speed gains for biggish data
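A small sketch of what the merge means for user code: `PCA` exposes an `svd_solver` parameter whose default, `'auto'`, picks a solver from the data shape and `n_components`, and the randomized solver can also be requested explicitly. The exact switching rule depends on the scikit-learn version; the random data below is only a stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(2000, 600)  # synthetic data, for illustration only

# Default svd_solver='auto': the solver is chosen by an internal heuristic.
pca_auto = PCA(n_components=10, random_state=0).fit(X)

# The randomized solver can also be requested explicitly.
pca_rand = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)

print(pca_auto.components_.shape, pca_rand.components_.shape)
```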
1 Coming soon Merged in master
Memory in pipeline: G. Lemaitre
    make_pipeline(PCA(), LinearSVC(), memory='/tmp/joe')
Limits recomputation (e.g. in grid search)
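The grid-search case can be sketched as follows. Only the classifier's `C` varies, so the PCA fitted on a given training fold can be read back from the cache rather than recomputed for every candidate. The digits data, `n_components=16`, the grid values and the temporary cache directory are assumptions made for the example.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

cache_dir = mkdtemp()  # hypothetical cache location; the slide uses '/tmp/joe'
pipe = make_pipeline(PCA(n_components=16), LinearSVC(), memory=cache_dir)

# The PCA step is identical across candidates, so its fit is cached per fold.
search = GridSearchCV(pipe, {"linearsvc__C": [0.01, 0.1, 1.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)

rmtree(cache_dir)  # remove the cache when done
```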
New solver for logistic regression: SAGA A. Mensch
    linear_model.LogisticRegression(solver='saga')
Fast linear model on biggish data
[Figure: training objective for SAGA and Liblinear on RCV1]
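A minimal usage sketch of the solver. The synthetic dataset stands in for a large one such as RCV1, and the scaling step and `max_iter=200` are choices made here for the example, not settings from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a large dataset (the slide's benchmark uses RCV1).
X, y = make_classification(
    n_samples=5000, n_features=50, n_clusters_per_class=1, random_state=0
)
X = StandardScaler().fit_transform(X)  # stochastic solvers prefer scaled features

clf = LogisticRegression(solver="saga", max_iter=200, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```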
Quantile transformer: G. Lemaitre
[Figure: number of households against median income, colour-mapped by the values of y. First panel: original scales (median income 0 to 12, number of households 0 to 6). Second panel: both axes on a 0 to 1 scale.]
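The effect in the figure can be reproduced in a few lines. This sketch uses synthetic heavy-tailed features instead of the housing data, and `n_quantiles=100` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
# Synthetic heavy-tailed features standing in for the housing data on the slide.
X = rng.lognormal(size=(1000, 2))

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
X_uniform = qt.fit_transform(X)

# Each feature is mapped through its empirical CDF, so values land in [0, 1]
# and a few extreme points no longer stretch the axis.
print(X.max(axis=0), X_uniform.min(axis=0), X_uniform.max(axis=0))
```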
Local outlier factor: N. Goix
[Figure: data points labelled normal and abnormal]
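A short sketch of the estimator on toy data: a dense cluster plays the "normal" points and a handful of far-away points the "abnormal" ones. The data, `n_neighbors=20` and `contamination=0.05` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_normal = 0.3 * rng.randn(100, 2)                    # dense cluster
X_abnormal = rng.uniform(low=6, high=8, size=(5, 2))  # far-away points
X = np.vstack([X_normal, X_abnormal])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)  # +1 for inliers ("normal"), -1 for outliers ("abnormal")

print(labels[-5:])
```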
Memory savings
Avoid casting (work with float32) J. Massich, A. Imbert
T-SNE (in progress) T. Moreau
1 To come Maybe
ColumnsTransformer: J. Van den Bossche
Pandas in ... feature engineering ... array out
    transformer = make_column_transformer({
        StandardScaler(): ['age'],
        OneHotEncoder(): ['company']
    })
    array = transformer.fit_transform(data_frame)
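The snippet above is the proposal as shown in the talk. For comparison, a sketch with the interface as it exists in recent scikit-learn releases, where `make_column_transformer` lives in `sklearn.compose` and takes one `(transformer, columns)` tuple per group of columns instead of a dict. The column names follow the slide; the data values are made up.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data frame; the column names follow the slide, the values are made up.
data_frame = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "company": ["acme", "initech", "acme", "globex"],
})

# Released API: one (transformer, columns) tuple per group of columns.
transformer = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(), ["company"]),
)
array = transformer.fit_transform(data_frame)
print(array.shape)
```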
Faster trees, forests & boosting:
V.R. Rajagopalan, G. Lemaitre
Lessons from XGBoost, LightGBM:
bin features into discrete values
depth-first tree building, for memory-access locality
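The binning idea can be sketched in NumPy. This is an illustration of the general technique, not scikit-learn's, XGBoost's or LightGBM's implementation; `bin_feature` is a hypothetical helper and quantile-based edges are one simple choice among several.

```python
import numpy as np

def bin_feature(values, n_bins=16):
    """Map a continuous feature to small integer bin codes (quantile bins)."""
    # Interior quantiles serve as bin edges.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    codes = np.searchsorted(edges, values, side="right").astype(np.uint8)
    return codes, edges

rng = np.random.RandomState(0)
x = rng.exponential(size=10_000)
codes, edges = bin_feature(x)

# A split search now scans at most n_bins - 1 thresholds per feature
# instead of one threshold per distinct value.
counts = np.bincount(codes, minlength=16)
print(counts)
```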
1 Scaling out Infrastructure
Using many computers: cloud, elastic computing
Orchestration, data distribution
Integration in corporate infrastructure
Hadoop, queues, services
joblib backends
Parallel computing
Loky (robust single-machine process pool)
Distributed (Yarn, dask, CMFActivity)
Storage (S3, HDFS)
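A sketch of how a joblib backend is selected at call time, leaving the parallel code itself unchanged. It uses the single-machine `loky` backend; distributed backends register under other names and need their own packages installed. The toy workload and `n_jobs=2` are arbitrary.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend

# Same code, different execution backend chosen by the context manager.
with parallel_backend("loky", n_jobs=2):
    roots = Parallel()(delayed(sqrt)(i) for i in range(10))

print(roots)
```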
1 Continuous integration
Testing under numpy & scipy dev
A. Mueller
1 Scikit-learn-contrib
Scaling the scikit-learn universe quicker
https://github.com/scikit-learn-contrib
py-earth: multivariate adaptive regression splines
imbalanced-learn: under-sampling and over-sampling
lightning: fast linear models
polylearn: factorization machines and polynomial networks
hdbscan: high-performance clustering
forest-confidence-interval: confidence interval for forests
boruta_py: boruta feature selection
sklearn.utils.estimator_checks.check_estimator
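A sketch of how the check is called. A built-in estimator stands in for a contrib one here, on the assumption that a contrib project would run the same call on its own estimators; passing an instance rather than a class matches recent scikit-learn versions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.utils.estimator_checks import check_estimator

# Stand-in for a contrib estimator; a real project would pass its own class instance.
estimator = LogisticRegression()

# Runs scikit-learn's API-conformance checks and raises on the first failure.
check_estimator(estimator)
print("all checks passed")
```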
2 The community
Users & developers
2 User base
350 000 returning users, 5 000 citations
[Figure: two pie charts. OS: Windows, Mac, Linux (slice labels 50%, 20%, 30%). Employer: Industry, Academia, Other (slice labels 63%, 3%, 34%).]
[Figure: number of PyPI downloads over time, from June through the following June (2017), on a 0 to 100 000 axis; curves for numpy, pandas, scikit-learn, django and flask]
2 In the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10 000); labelled packages: numpy, scikit-learn, joblib, simplejson, six, setuptools]
2 Core software is infrastructure
Everybody uses it every day
In industry, education, & research
“Roads and Bridges”: Ford Foundation report
Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw
2 Community-based development in scikit-learn
Active development team
[Figure: monthly contributors (0 to 50), 2010 to 2016]
https://www.openhub.net/p/scikit-learn
2 Funding & spending 2015 & 2016
New York: A. Mueller
  $350 000 Moore-Sloan grant
  A. Mueller (full time). Students: M. Kumar, V. Birodkar
Telecom ParisTech: A. Gramfort
  200 000 € WendelinIA grant + 12 000 € CDS
  Programmers: T. Guillemot, T. Dupré
  Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix
Inria Parietal: G. Varoquaux
  120 000 € Inria + 100 000 € WendelinIA + 50 000 € ANR + 30 000 € CDS
  Programmers: O. Grisel, L. Esteve, G. Lemaitre, J. Van den Bossche
  Students: A. Mensch, J. Schreiber, G. Patrini
> 400 000 €/yr
2 Sustainability
Educating decision makers
Not funding your infrastructure is a risk
A foundation
Danger: governance, focus on features for the rich
We need partners, good ones
@GaelVaroquaux
Scikit-learn
Machine learning for everyone
– from beginner to expert
Ongoing progress
Faster models (algorithmics, float32)
Easier usage (better pandas integration)
Coupling to infrastructure (via joblib)
Thinking about sustainability & partnership