Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux

Scikit-learn: an incomplete yearly review
Ga¨el Varoquaux
scikit
machine learning in Python

Trends with
1 The library
2 The community
G Varoquaux 2

1 The library
scikit
machine learning in Python
G Varoquaux 3

1 In 0.18 oldies but goodies
G Varoquaux 4

New cross-validation objects V.R. Rajagopalan
from s k l e a r n . c r o s s v a l i d a t i o n
import S t r a t i f i e d K F o l d
cv = S t r a t i f i e d K F o l d (y , n f o l d s =2)
for t r a i n , t e s t in cv :
X t r a i n = X[ t r a i n ]
y t a i n = y[ t r a i n ]
G Varoquaux 4

from s k l e a r n . m o d e l s e l e c t i o n
import S t r a t i f i e d K F o l d
cv = S t r a t i f i e d K F o l d ( n f o l d s =2)
for t r a i n , t e s t in cv . s p l i t (X, y):
X t r a i n = X[ t r a i n ]
y t a i n = y[ t r a i n ]
⇒ better nested-CV
G Varoquaux 4

PCA == Randomized PCA G. Patrini
Heuristic to switch PCA to random linear algebra
Fights global warming
Huge speed gains for biggish data
G Varoquaux 4

1 Coming soon Merged in master
Memory in pipeline: G. Lemaitre
make pipeline(PCA(), LinearSVC(), memory=’/tmp/joe’)
Limits recomputation (eg in grid search)
G Varoquaux 5

Memory in pipeline G. Lemaitre
New solver for logistic regression: SAGA A. Mensch
linear model.LogisticRegression(solver=’saga’)
Fast linear model on biggish data
Trainingobjective
SAGA
Liblinear
RCV1
G Varoquaux 5

Quantile transformer: G. Lemaitre
0 2 4 6 8 10 12
Median Income
0
1
2
3
4
5
6
Numberofhouseholds
0.6
1.2
1.8
2.4
3.0
3.6
4.2
4.8
Colormappingforvaluesofy
G Varoquaux 5

Quantile transformer: G. Lemaitre
0 2 4 6 8 10 12
Median Income
0
1
2
3
4
5
6
Numberofhouseholds
0.6
1.2
1.8
2.4
3.0
3.6
4.2
4.8
0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2
Median Income
0.2
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Numberofhouseholds
0.6
1.2
1.8
2.4
3.0
3.6
4.2
4.8
G Varoquaux 5

Quantile transformer G. Lemaitre
Local outlier factor: N. Goix
normal
abnormal
G Varoquaux 5

Quantile transformer G. Lemaitre
Local outlier factor N. Goix
Memory savings
Avoid casting (work with float32) J. Massich, A. Imbert
T-SNE (in progress) T. Moreau
G Varoquaux 5

1 To come Maybe
ColumnsTransformer: J. Van den Bossche
Pandas in ... feature engineering ... array out
transformer = make column transformer({
StandardScaler(): [’age’],
OneHotEncoder(): [’company’]
})
array = transformer.fit transform(data frame)
G Varoquaux 6

1 To come Maybe
ColumnsTransformer J. Van den Bossche
Faster trees, forest& boosting:
V.R. Rajagopalan, G. Lemaitre
Teaching from XGBoost, lightgbm:
bin features for discrete values
depth-ﬁrst tree, for access locality
G Varoquaux 6

1 Scaling out Infrastructure
Using many computers: cloud, elastic computing
Orchestration, data distribution
Integration in corporate infrastructure
Hadoop, queues, services
joblib backends
Parallel computing
Loky (robust single-machine process pool)
Distributed (Yarn, dask, CMFActivity)
Storage (S3, HDFS)
G Varoquaux 7

1 Continuous integration
Testing under numpy & scipy dev
A. Mueller
G Varoquaux 8

1 Scikit-learn-contrib
Scaling the scikit-learn universe quicker
https://github.com/scikit-learn-contrib
py-earth multivariate adaptive regression splines
imbalanced-learn under-sampling and over-sampling
lightning fast linear models
polylearn factorization machines and polynomial networks
hdbscan high-performance clustering
forest-conﬁdence-interval conﬁdence interval for forests
boruta py boruta feature selection
G Varoquaux 9

1 Scikit-learn-contrib
Scaling the scikit-learn universe quicker
https://github.com/scikit-learn-contrib
py-earth multivariate adaptive regression splines
imbalanced-learn under-sampling and over-sampling
lightning fast linear models
polylearn factorization machines and polynomial networks
hdbscan high-performance clustering
forest-conﬁdence-interval conﬁdence interval for forests
boruta py boruta feature selection
sklearn.utils.estimator checks.check estimator
G Varoquaux 9

2 The community
Users & developers
G Varoquaux 10

2 User base
350 000 returning users 5 000 citations
G Varoquaux 11

2 User base
350 000 returning users 5 000 citations
OS Employer
Windows
Mac
Linux
Industry Academia
Other
50%
20%
30%
63%
3%
34%
G Varoquaux 11

2 User base
Jun Jul Aug Sep Oct Nov Dec Jan
2017
Feb Mar Apr May Jun
0
20000
40000
NumberofPyPIdownloads
G Varoquaux 12

2 User base
Jun Jul Aug Sep Oct Nov Dec Jan
2017
Feb Mar Apr May Jun
0
20000
40000
60000
80000
100000NumberofPyPIdownloads numpy
pandas
scikit-learn
django
flask
G Varoquaux 12

2 In the Python ecosystem
1 10 100 1000 10000
Package rank
104
105
106
107
108
109
G Varoquaux 13

2 In the Python ecosystem
1 10 100 1000 10000
Package rank
104
105
106
107
108
109
numpy
scikit-learn
joblib
simplejson
sixsetuptools
G Varoquaux 13

2 Core software is infrastructure
Everybody uses it everyday
In industry, education, & research
“Roads and Bridge”: Ford foundation report
Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw
G Varoquaux 14

2 Community-based development in scikit-learn
Active development team
2010 2012 2014 2016
0
25
50Monthly contributors
https://www.openhub.net/p/scikit-learn
G Varoquaux 15

2 Funding & spending 2015 & 2016
New York A. Mueller
$ 350 000 Moore-Sloan grant
A. Mueller (full time). Students: M. Kumar, V. Birodkar
Telecom ParisTech A. Gramfort
200 000e WendelinIA grant + 12 000 e CDS
Programmers: T. Guillemot, T. Dupr´e
Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix
Inria Parietal G. Varoquaux
120 000e Inria + 100 000 e WendelinIA
+ 50 000 e ANR + 30 000 e CDS
Programmers: O. Grisel, L. Esteve (programmer), G.
Lemaitre, J. Van den Boosche
Students: A. Mensch, J. Schreiber, G. Patrini
> 400 000 e/yrG Varoquaux 16

2 Sustainability
G Varoquaux 17

2 Sustainability
Educating decision makers
Not funding your infrastructure is a risk
A fundation
Danger: governance, focus on features for the rich
We need partners, good ones
G Varoquaux 17

@GaelVaroquaux
Scikit-learn
Machine learning for everyone
– from beginner to expert
On going progress
Faster models (algorithmics, ﬂoat32)
Easier usage (better pandas integration)
Coupling to infrastructure (via joblib)
Thinking about sustainability & partnership

Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux

Similar to Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux (20)

More from Pôle Systematic Paris-Region

More from Pôle Systematic Paris-Region (20)

Recently uploaded

Recently uploaded (20)

Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux