5. 1 In 0.18 oldies but goodies
New cross-validation objects V.R. Rajagopalan
# Before 0.18 (deprecated API): the labels go to the constructor
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, n_folds=2)
for train, test in cv:
    X_train = X[train]
    y_train = y[train]
G Varoquaux 4
6. 1 In 0.18 oldies but goodies
New cross-validation objects V.R. Rajagopalan
# From 0.18: the data is passed to split(), not to the constructor
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    X_train = X[train]
    y_train = y[train]
⇒ better nested-CV
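A minimal sketch of why data-independent splitters help nested CV: the same kind of object can drive both the inner (model selection) and the outer (evaluation) loop. The dataset, the SVC estimator, and the grid of C values are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The splitters no longer hold the labels, so they can be created up front
# and handed to any tool that calls split(X, y) itself.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: pick C on each training portion (grid chosen for illustration).
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: score the whole search procedure on held-out folds.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```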
7. 1 In 0.18 oldies but goodies
New cross-validation objects V.R. Rajagopalan
PCA == Randomized PCA G. Patrini
Heuristic to switch PCA to random linear algebra
Fights global warming
Huge speed gains for biggish data
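As a sketch of what the merged PCA looks like in use: the solver is chosen through the `svd_solver` parameter, and `"auto"` lets the heuristic decide. The synthetic low-rank data and its sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: a rank-10 signal plus a little noise.
X = rng.randn(2000, 10) @ rng.randn(10, 300) + 0.01 * rng.randn(2000, 300)

# "auto" applies the heuristic; the other two force a specific solver.
pca_auto = PCA(n_components=10, svd_solver="auto", random_state=0).fit(X)
pca_rand = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)
pca_full = PCA(n_components=10, svd_solver="full").fit(X)

print(pca_rand.explained_variance_ratio_.sum(),
      pca_full.explained_variance_ratio_.sum())
```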
8. 1 Coming soon Merged in master
Memory in pipeline: G. Lemaitre
make_pipeline(PCA(), LinearSVC(), memory='/tmp/joe')
Limits recomputation (e.g. in grid search)
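A hedged sketch of the grid-search case: only the classifier's parameter varies, so the fitted PCA for a given training fold can be read back from the cache. The digits dataset, the temporary cache directory (standing in for '/tmp/joe'), and the grid of C values are assumptions.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values are in [0, 16]; rescaling helps LinearSVC converge

cache_dir = mkdtemp()  # hypothetical cache location
pipeline = make_pipeline(PCA(n_components=16), LinearSVC(), memory=cache_dir)

# Three candidates share the same PCA settings, so for each fold the
# PCA fit is computed once and reused for the other two candidates.
search = GridSearchCV(pipeline, {"linearsvc__C": [0.01, 0.1, 1.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)

rmtree(cache_dir)  # clean up the cache when done
```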
9. 1 Coming soon Merged in master
Memory in pipeline G. Lemaitre
New solver for logistic regression: SAGA A. Mensch
linear_model.LogisticRegression(solver='saga')
Fast linear model on biggish data
[Figure: training objective, SAGA vs. Liblinear, on RCV1]
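A small sketch of switching solvers. It uses a synthetic dataset rather than RCV1, and the sample sizes, `max_iter`, and the scaling step are assumptions for illustration; it does not reproduce the benchmark on the slide.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
# Stochastic solvers such as SAGA tend to converge faster on scaled features.
X = StandardScaler().fit_transform(X)

saga = LogisticRegression(solver="saga", max_iter=200, random_state=0).fit(X, y)
liblinear = LogisticRegression(solver="liblinear").fit(X, y)

# Both fit an L2-penalized logistic regression, so the accuracies should be close.
print(saga.score(X, y), liblinear.score(X, y))
```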
10. 1 Coming soon Merged in master
Memory in pipeline G. Lemaitre
New solver for logistic regression: SAGA A. Mensch
Quantile transformer: G. Lemaitre
[Figure: number of households (0 to 6) against median income (0 to 12); color mapping for values of y (0.6 to 4.8)]
11. 1 Coming soon Merged in master
Memory in pipeline G. Lemaitre
New solver for logistic regression: SAGA A. Mensch
Quantile transformer: G. Lemaitre
[Figure, two panels: number of households against median income; color mapping for values of y (0.6 to 4.8). First panel: median income 0 to 12, households 0 to 6. Second panel: both axes from -0.2 to 1.2]
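A minimal sketch of the quantile transformer on skewed data. The log-normal features are a hypothetical stand-in for the "median income" and "number of households" columns, and `n_quantiles=100` is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
# Hypothetical heavy-tailed features, two columns.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 2))

# Map each feature through its empirical quantiles onto [0, 1].
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
X_uniform = qt.fit_transform(X)

print(X_uniform.min(axis=0), X_uniform.max(axis=0))
```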
12. 1 Coming soon Merged in master
Memory in pipeline G. Lemaitre
New solver for logistic regression: SAGA A. Mensch
Quantile transformer G. Lemaitre
Local outlier factor: N. Goix
[Figure: points labeled normal and abnormal]
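A sketch of the local outlier factor on toy data: a tight cluster of "normal" points plus scattered "abnormal" ones. The data, its sizes, and `n_neighbors=20` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_normal = 0.3 * rng.randn(200, 2)                      # dense cluster
X_abnormal = rng.uniform(low=-4, high=4, size=(20, 2))  # scattered points
X = np.vstack([X_normal, X_abnormal])

lof = LocalOutlierFactor(n_neighbors=20)
# fit_predict returns +1 for inliers ("normal") and -1 for outliers ("abnormal").
labels = lof.fit_predict(X)

print((labels == -1).sum(), "points flagged as abnormal")
```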
13. 1 Coming soon Merged in master
Memory in pipeline G. Lemaitre
New solver for logistic regression: SAGA A. Mensch
Quantile transformer G. Lemaitre
Local outlier factor N. Goix
Memory savings
Avoid casting (work with float32) J. Massich, A. Imbert
T-SNE (in progress) T. Moreau
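To make the float32 point concrete, a tiny NumPy sketch of what a cast costs: upcasting float32 input to float64 doubles the memory of the array. The array shape is arbitrary; which estimators preserve float32 depends on the scikit-learn version and is not shown here.

```python
import numpy as np

X32 = np.random.RandomState(0).rand(1000, 100).astype(np.float32)
X64 = X32.astype(np.float64)  # the cast that float32 support avoids

# Same values up to precision, twice the bytes.
print(X32.nbytes, X64.nbytes)
```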
14. 1 To come Maybe
ColumnsTransformer: J. Van den Bossche
Pandas in ... feature engineering ... array out
transformer = make_column_transformer({
    StandardScaler(): ['age'],
    OneHotEncoder(): ['company']
})
array = transformer.fit_transform(data_frame)
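The snippet above is the API as proposed at the time. As a hedged, runnable counterpart, here is the same idea with the tuple-based signature found in later scikit-learn releases (`sklearn.compose.make_column_transformer`); the toy data frame is hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data frame with the two columns named on the slide.
data_frame = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "company": ["acme", "globex", "acme", "initech"],
})

# Each (transformer, columns) pair is applied to its own columns,
# and the results are concatenated side by side.
transformer = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(), ["company"]),
)
array = transformer.fit_transform(data_frame)

print(array.shape)  # one scaled column plus one column per company
```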
15. 1 To come Maybe
ColumnsTransformer J. Van den Bossche
Faster trees, forests & boosting: V.R. Rajagopalan, G. Lemaitre
Lessons from XGBoost, LightGBM:
bin features into discrete values
depth-first tree, for access locality
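The binning idea can be sketched in plain NumPy: a continuous feature is replaced by a small integer bin index, so a split search only has to consider a handful of thresholds. The helper `bin_feature` and the choice of 8 quantile bins are hypothetical, not scikit-learn code.

```python
import numpy as np


def bin_feature(values, n_bins=8):
    """Map a continuous feature to integer bin indices using quantile edges."""
    # Interior quantiles give bins holding roughly equal numbers of samples.
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.percentile(values, quantiles)
    # For each value, searchsorted gives the index of the bin it falls into.
    binned = np.searchsorted(edges, values, side="right").astype(np.uint8)
    return binned, edges


rng = np.random.RandomState(0)
feature = rng.lognormal(size=10000)
binned, edges = bin_feature(feature, n_bins=8)

print(np.bincount(binned))  # number of samples per bin
```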
16. 1 Scaling out Infrastructure
Using many computers: cloud, elastic computing
Orchestration, data distribution
Integration in corporate infrastructure
Hadoop, queues, services
joblib backends
Parallel computing
Loky (robust single-machine process pool)
Distributed (Yarn, dask, CMFActivity)
Storage (S3, HDFS)
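A minimal sketch of what "joblib backends" means for user code: the computation is written once and the backend is chosen separately. Only backends bundled with joblib are used here; `n_jobs=2` and the toy workload are arbitrary choices, and the distributed backends named above need extra packages that are not shown.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend

# Explicitly request the process-based loky backend.
results = Parallel(n_jobs=2, backend="loky")(
    delayed(sqrt)(i ** 2) for i in range(10)
)

# The same call with the backend selected from the outside via a context manager.
with parallel_backend("threading", n_jobs=2):
    results_threads = Parallel()(delayed(sqrt)(i ** 2) for i in range(10))

print(results)
```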
21. 2 User base
350 000 returning users 5 000 citations
22. 2 User base
350 000 returning users 5 000 citations
[Figure: two pie charts. OS (Windows, Mac, Linux): slices of 50%, 20%, and 30%. Employer (Industry, Academia, Other): slices of 63%, 3%, and 34%]
23. 2 User base
[Figure: number of PyPI downloads (0 to 40 000) over time, Jun to Jun 2017]
24. 2 User base
[Figure: number of PyPI downloads (0 to 100 000) over time, Jun to Jun 2017, for numpy, pandas, scikit-learn, django, and flask]
25. 2 In the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10 000)]
26. 2 In the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10 000), with numpy, scikit-learn, joblib, simplejson, six, and setuptools marked]
27. 2 Core software is infrastructure
Everybody uses it every day
In industry, education, & research
“Roads and Bridges”: Ford Foundation report
Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw
28. 2 Community-based development in scikit-learn
Active development team
[Figure: monthly contributors (0 to 50), 2010 to 2016]
https://www.openhub.net/p/scikit-learn
29. 2 Funding & spending 2015 & 2016
New York A. Mueller
$ 350 000 Moore-Sloan grant
A. Mueller (full time). Students: M. Kumar, V. Birodkar
Telecom ParisTech A. Gramfort
200 000 € WendelinIA grant + 12 000 € CDS
Programmers: T. Guillemot, T. Dupré
Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix
Inria Parietal G. Varoquaux
120 000 € Inria + 100 000 € WendelinIA
+ 50 000 € ANR + 30 000 € CDS
Programmers: O. Grisel, L. Esteve (programmer), G. Lemaitre, J. Van den Bossche
Students: A. Mensch, J. Schreiber, G. Patrini
> 400 000 €/yr
32. 2 Sustainability
Educating decision makers
Not funding your infrastructure is a risk
A foundation
Danger: governance, focus on features for the rich
We need partners, good ones
33. @GaelVaroquaux
Scikit-learn
Machine learning for everyone
– from beginner to expert
Ongoing progress
Faster models (algorithmics, float32)
Easier usage (better pandas integration)
Coupling to infrastructure (via joblib)
Thinking about sustainability & partnership