Scikit-learn: The state of the union
Gaël Varoquaux
Open Source Innovation Spring 2016
Personal point of view, as an opening to scikit-learn days 2016 in Paris
1 Some history
Scikit-learn canal historique
G Varoquaux 2
1 scikit-learn growth: users
Website users (weekly): Google analytics
Debian popcon: ∼ 1% of the Debian users
Web searches: Google trends
1 scikit-learn growth: lines of code
Lines of code:
Huge feature set
https://www.openhub.net/p/scikit-learn
1 scikit-learn growth: contributors
Contributors:
759 contributors
https://www.openhub.net/p/scikit-learn
1 Started as David Cournapeau’s failed PhD project
David then preferred
improving numpy/scipy
That’s David sprinting in 2011
1 2009: We (Inria Parietal) need machine learning
My team takes over the
development
Hire a young guy
(Fabian Pedregosa)
Put post-docs and PhDs
(Alexandre Gramfort, Vincent Michel...)
Work in the open
Pythonic, fast, documented
1 2010: ICML MLOSS workshop
Machine Learning Open Source Software
“The examples in the tutorial are pretty, but not particularly useful for the serious user.”
“For the sustainability of the project it might be better to narrow the focus...”
1 2011: NIPS sprint
People that I didn’t know
were solving my problems
The project took off because of the community...
2 Upcoming cool stuff
Upcoming 0.18 release
2 Less code: Cython no longer embedded
Lines of code:
Generated C no longer embedded in git
⇒ opens the door to fused types (polymorphism)
⇒ support for multiple dtypes in algorithms
= memory savings
Arthur Mensch
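To make the memory argument concrete, here is a minimal sketch, assuming NumPy and a made-up array shape. It only compares the footprint of the same data in float64 and float32. It does not show Cython fused types themselves.

```python
import numpy as np

# Hypothetical data: 1000 samples with 50 features each.
rng = np.random.RandomState(0)
X64 = rng.rand(1000, 50)        # NumPy's default dtype is float64
X32 = X64.astype(np.float32)    # single-precision copy of the same data

# An algorithm that only accepts float64 has to upcast X32 first,
# which doubles the memory used before any computation starts.
bytes64 = X64.nbytes
bytes32 = X32.nbytes
print(bytes64, bytes32)
```

An algorithm compiled for both dtypes can work on X32 directly and skip that copy.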
2 Faster code: better algorithmics
RandomizedPCA → PCA
Automatic choice of solver: randomized linear algebra, power iteration (ARPACK), or full (LAPACK)
For large data: up to 20× speed-up
https://github.com/scikit-learn/scikit-learn/issues/5243
Giorgio Patrini
Elkan’s K-means
For large data: ∼ 2× speed-up.
https://github.com/scikit-learn/scikit-learn/pull/5414
Andreas Müller
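As a rough illustration of the two changes above, the sketch below uses the `svd_solver` parameter of `PCA` and the `algorithm` parameter of `KMeans` as they exist in released scikit-learn versions. The data and sizes are invented, and no speed-up is measured here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 20)   # hypothetical data, far too small to show a speed-up

# svd_solver="auto" lets PCA choose a solver from the data shape and
# n_components; a specific solver can also be requested by name.
pca_auto = PCA(n_components=3, svd_solver="auto").fit(X)
pca_rand = PCA(n_components=3, svd_solver="randomized", random_state=0).fit(X)
pca_full = PCA(n_components=3, svd_solver="full").fit(X)
X_reduced = pca_rand.transform(X)

# Elkan's variant of K-means is selected through the algorithm parameter.
km = KMeans(n_clusters=4, algorithm="elkan", n_init=10, random_state=0).fit(X)
labels = km.labels_
```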
2 New cross-validation objects
# API before 0.18: the CV object is built from the data
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, n_folds=2)
for train, test in cv:
    X_train = X[train]
    y_train = y[train]
2 New cross-validation objects
# API from 0.18 on: the CV object is independent of the data
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=2)  # n_folds was renamed n_splits in sklearn.model_selection
for train, test in cv.split(X, y):
    X_train = X[train]
    y_train = y[train]
Data-independent ⇒ nested-CV possible
https://github.com/scikit-learn/scikit-learn/pull/4294
Raghav R V
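Because the splitter no longer holds the data, the same kind of object can serve as both the inner and the outer loop of a nested cross-validation. A minimal sketch, assuming the iris dataset, an SVC and a small grid on C, all chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Both splitters are created without seeing X or y.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: pick C on each outer training set.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: score the whole "search then fit" procedure on held-out folds.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```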
2 Sequential / Bayesian search CV
See hyper-parameter selection as a Bayesian
optimization / noisy fit problem.
⇒ choose hyper-parameters cleverly, not on a grid
Pull request stalled
https://github.com/scikit-learn/scikit-learn/pull/5491
Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
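The sketch below is not the Bayesian optimization of the pull request. It is a much simpler sequential refinement, written only to illustrate the idea that each new hyper-parameter value is chosen from the scores seen so far instead of from a fixed grid. The dataset, the estimator and the helper `cv_score` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def cv_score(log_C):
    """Mean 3-fold CV accuracy of an SVC with C = 10 ** log_C (hypothetical helper)."""
    return cross_val_score(SVC(C=10.0 ** log_C), X, y, cv=3).mean()

# Start from three coarse points on a log scale.
evaluated = {log_C: cv_score(log_C) for log_C in (-2.0, 0.0, 2.0)}

# Then repeatedly probe both sides of the best point so far,
# halving the step at each round.
step = 1.0
for _ in range(3):
    best = max(evaluated, key=evaluated.get)
    for candidate in (best - step, best + step):
        if candidate not in evaluated:
            evaluated[candidate] = cv_score(candidate)
    step /= 2.0

best_log_C = max(evaluated, key=evaluated.get)
best_score = evaluated[best_log_C]
```

A Bayesian approach would replace the "probe around the best point" rule with a model of the score surface that also accounts for noise in the CV estimate.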
3 Vision(s): the future
Mission statement
Enable progress via data science
Lower the costs,
less technicalities
Machine learning
for everybody and
for everything
Small hardware,
medium data
3 Deep learning
sklearn.neural_network.MLPClassifier
architecture-specification language
GPUs unbound technicality
keras, caffe...
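For reference, a minimal sketch of what `MLPClassifier` looks like in use, assuming the iris dataset and an arbitrary layer size. The whole architecture is given by one tuple of hidden-layer sizes, and training runs on the CPU.

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# One hidden layer of 16 units; there is no richer architecture language.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)

# coefs_ holds one weight matrix per layer transition.
print([w.shape for w in clf.coefs_])
```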
3 AutoML
Automatic model selection
Better hyper-parameter selection
Better description and uniformization of estimators
Integrate feedback from auto-sklearn
3 Better, faster, stronger
Faster models
From lightning, back to sklearn
Inspiration from XGBoost (the paper is out!)
Larger data
More partial_fit; online forests?
Fewer copies
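As an illustration of the `partial_fit` pattern for data that does not fit in memory, here is a minimal sketch with `SGDClassifier`, one of the estimators that already supports it. The batches are synthetic and generated on the fly; a real use would read them from disk or a stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
classes = np.array([0, 1])          # all labels must be declared up front
clf = SGDClassifier(random_state=0)

n_seen = 0
for _ in range(10):
    # Hypothetical mini-batch: the label depends only on the first feature.
    X_batch = rng.randn(100, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)
    n_seen += len(X_batch)

# Evaluate on a fresh batch drawn the same way.
X_test = rng.randn(200, 5)
y_test = (X_test[:, 0] > 0).astype(int)
accuracy = clf.score(X_test, y_test)
```

Only one batch is held in memory at a time, which is what makes the pattern attractive for larger data.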
3 Scaling up (out?)
I don’t want Java/Scala
Less fluid prototyping
Cross-VM debugging is hard
Numerics in Java are slower than LAPACK
Need C somewhere
They have:
Coupling distributed store to computation
Distributed job management
Create new stack? Ride on this one?
Blaze, Ibis, dask: require rewrite of algorithms
dask promising for ETL
New backends for joblib, for parallel computing and storage:
distributed, ssh
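For context, a minimal sketch of how a backend is selected in `joblib.Parallel` today, using the built-in "threading" backend. The distributed and ssh backends mentioned above are not shown, since their names and options would be guesses.

```python
from math import sqrt

from joblib import Parallel, delayed

# The calling code stays the same whatever backend runs the tasks;
# only the backend argument changes.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(sqrt)(i ** 2) for i in range(8)
)
print(results)
```

The appeal of new backends is that code written this way would not need to be rewritten to run elsewhere.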
Sustainable growth
Reviewing is the bottleneck
User support drowns core devs
Users need stability (Airbus)
Coding is not the only thing
sprints, GSoC management, tutorials...
Structure & stability
How to organize funding and governance?
process/meetings/reports/funding proposal...
= work on project
Passionate coders get a lot done
unless they get drowned by meetings
@GaelVaroquaux
Funding: Inria, Nexedi, Paris-Saclay CDS, NYU CDS, GSoC
