Scikit-learn for easy machine learning:
the vision, the tool, and the project
Gaël Varoquaux
scikit
machine learning in Python
1 Scikit-learn: the vision
An enabler
Machine learning
for everybody and
for everything
Machine learning
without learning the
machinery
G Varoquaux 2
Machine learning in a nutshell
Machine learning is about making predictions from data
G Varoquaux 3
1 Machine learning: a historical perspective
Artificial Intelligence The 80s
Building decision rules
Machine learning The 90s
Learn these from observations
Statistical learning 2000s
Model the noise in the observations
Big data today
Many observations,
simple rules
“Big data isn’t actually interesting without machine
learning”
Steve Jurvetson, VC, Silicon Valley
G Varoquaux 4
1 Machine learning in a nutshell: an example
Face recognition
Andrew Bill Charles Dave
?
G Varoquaux 5
1 Machine learning in a nutshell
A simple method:
1 Store all the known (noisy) images and the names
that go with them.
2 For a new (noisy) image, find the stored image that is
most similar.
“Nearest neighbor” method
How many errors on already-known images?
... 0: no errors
Test data ≠ train data
G Varoquaux 6
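The two steps of the nearest-neighbor method can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: random vectors stand in for the face images, and the function name `nearest_neighbor_predict` is made up for the example. It also shows why counting errors on already-known images is uninformative: every stored image is its own nearest neighbor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 4 "images" of 16 pixels each, one per name.
names = np.array(["Andrew", "Bill", "Charles", "Dave"])
known_images = rng.normal(size=(4, 16))


def nearest_neighbor_predict(stored_images, stored_names, new_images):
    """Return, for each new image, the name of the most similar stored image."""
    # Squared Euclidean distance between every new image and every stored one.
    diffs = new_images[:, None, :] - stored_images[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)
    return stored_names[distances.argmin(axis=1)]


# Step 1 is simply keeping `known_images` and `names` around.
# Step 2: look up the closest stored image.
train_predictions = nearest_neighbor_predict(known_images, names, known_images)
train_errors = int((train_predictions != names).sum())
print("errors on already-known images:", train_errors)

# A noisy copy of Bill's image, not stored anywhere.
new_image = known_images[1] + 0.1 * rng.normal(size=16)
new_prediction = nearest_neighbor_predict(known_images, names, new_image[None, :])
print("prediction for the new image:", new_prediction[0])
```

The error count on the stored images says nothing about how the method behaves on images it has never seen, which is why a separate test set is needed.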
1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
[Figure: two panels plotting y against x]
Which model to prefer?
Problem of “over-fitting”
Minimizing error is not always the best strategy
(learning noise)
Test data ≠ train data
Prefer simple models
= concept of “regularization”
Balance the number of parameters to learn
with the amount of data
G Varoquaux 7
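A small numerical sketch of over-fitting, assuming made-up data: noisy samples of a straight line, fitted with a polynomial of degree 1 and of degree 9. The ground truth, noise level, and degrees are arbitrary choices for illustration. Adding parameters can only lower the error on the training points; the error on fresh test points is what tells the two models apart.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def noisy_line(n_points):
    """Sample y = 2x + noise on [-1, 1] (hypothetical ground truth)."""
    x = rng.uniform(-1, 1, size=n_points)
    y = 2 * x + 0.3 * rng.normal(size=n_points)
    return x, y


x_train, y_train = noisy_line(15)
x_test, y_test = noisy_line(200)


def mean_squared_error(model, x, y):
    return float(np.mean((model(x) - y) ** 2))


errors = {}
for degree in (1, 9):
    # Least-squares polynomial fit; a higher degree means more parameters.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    errors[degree] = {
        "train": mean_squared_error(model, x_train, y_train),
        "test": mean_squared_error(model, x_test, y_test),
    }

for degree, err in errors.items():
    print(f"degree {degree}: train MSE {err['train']:.3f}, test MSE {err['test']:.3f}")
```

Limiting the degree is one crude way to prefer simple models; penalized estimators apply the same idea more gradually.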
1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
x
y
Two descriptors:
2 dimensions
X_1
X_2
y
More parameters
⇒ need more data
“curse of dimensionality”
G Varoquaux 7
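One way to see why more descriptors call for more data is to watch how far apart a fixed number of samples end up as dimensions are added. The sketch below assumes points drawn uniformly in a unit cube; the sample size and the list of dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200


def mean_nearest_neighbor_distance(n_features):
    """Average distance from each point to its closest other point."""
    X = rng.uniform(size=(n_samples, n_features))
    # Pairwise squared distances: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    np.fill_diagonal(sq_dists, np.inf)  # ignore the distance of a point to itself
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(np.maximum(sq_dists.min(axis=1), 0)).mean())


distances = {d: mean_nearest_neighbor_distance(d) for d in (1, 2, 10, 100)}
for n_features, dist in distances.items():
    print(f"{n_features:>3} features: mean nearest-neighbor distance {dist:.3f}")
```

With the same 200 samples, neighbors drift apart as features are added, so the data covers the space less and less densely.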
1 Machine learning in a nutshell: classification
X1
X2
Example:
recognizing hand-written digits
Represent with 2 numerical features
G Varoquaux 8
1 Machine learning in a nutshell: unsupervised
[Figure: map of company names, including ConocoPhillips, Apple, Pepsi, Navistar, GlaxoSmithKline, Kimberly-Clark, Ryder, SAP, Sony, Pfizer, Amazon, Marriott, Novartis, Coca Cola, 3M, Comcast, Sanofi-Aventis, IBM, Chevron, DuPont de Nemours, Total, Caterpillar, Canon, Home Depot, Texas Instruments, Valero Energy, Ford, Cablevision, Toyota, Honda, HP, Dell, Mitsubishi, Xerox, Yahoo, Exxon, McDonald's, Cisco, Kraft Foods, Unilever]
Stock market structure
Unlabeled data
more common than labeled data
G Varoquaux 9
Machine learning
Mathematics and algorithms for fitting predictive models
Regression
x
y
Classification
Notions of overfit and test error
G Varoquaux 10
Machine learning is everywhere
Image recognition
Marketing (click-through rate)
Movie / music recommendation
Medical data
Logistics chains (e.g. supermarkets)
Language translation
Detecting industrial failures
G Varoquaux 11
Why another machine learning package?
G Varoquaux 12
Real statisticians use R
And real astronomers use IRAF
Real economists use Gauss
Real coders use C / assembler
Real experiments are controlled in Labview
Real Bayesians use BUGS / Stan
Real text processing is done in Perl
Real deep learning is best done with torch (Lua)
And medical doctors only trust SPSS
G Varoquaux 13
1 My stack
Python, what else?
General purpose
Interactive language
Easy to read / write
G Varoquaux 14
1 My stack
The scientific Python stack
numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared with C / fortran
[Figure: a block of raw digits, illustrating the contents of an array]
G Varoquaux 14
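The points about numpy arrays can be checked directly from Python. The snippet below is a small sketch with an arbitrary 3 × 4 array: the array is a plain block of floats with a shape and a dtype, and its memory layout can match what C or Fortran code expects.

```python
import numpy as np

# A 2D array of doubles: one contiguous block of memory plus shape and dtype.
X = np.arange(12, dtype=np.float64).reshape(3, 4)

print(X.dtype)                  # float64
print(X.shape)                  # (3, 4)
print(X.flags["C_CONTIGUOUS"])  # True: row-major, as C expects

# Fortran routines expect column-major storage; NumPy can provide that layout.
X_fortran = np.asfortranarray(X)
print(X_fortran.flags["F_CONTIGUOUS"])  # True

# Same values, different memory layout.
same_values = bool(np.array_equal(X, X_fortran))
```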
1 My stack
The scientific Python stack
numpy arrays
Connecting to
scipy
scikit-image
pandas
...
It’s about plugging things
together
G Varoquaux 14
Being Pythonic and
SciPythonic
G Varoquaux 14
1 scikit-learn vision
Machine learning for all
No specific application domain
No prior machine learning expertise required
High-quality Pythonic software library
Interfaces designed for users
Community-driven development
BSD licensed, very diverse contributors
http://scikit-learn.org
G Varoquaux 15
1 Between research and applications
Machine learning research
Conceptual complexity is not an issue
New and bleeding edge is better
Simple problems are old science
In the field
Tried and tested (aka boring) is good
Little sophistication from the user
API is more important than maths
Solving simple problems matters
Solving them really well matters a lot
G Varoquaux 16
2 Scikit-learn: the tool
A Python library for machine learning
© Theodore W. Gray
G Varoquaux 17
2 A Python library
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
As easy as py
from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)
G Varoquaux 18
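For readers who want to run the snippet above end to end, here is a self-contained version. It assumes a recent scikit-learn release (where `train_test_split` lives in `sklearn.model_selection`) and uses the bundled iris dataset purely as example data; neither choice comes from the talk.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Example data chosen for illustration: the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_true = train_test_split(X, y, random_state=0)

classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)  # predicted labels for unseen samples

accuracy = float((Y_test == Y_true).mean())
print(f"accuracy on held-out data: {accuracy:.2f}")
```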
2 API: specifying a model
A central concept: the estimator
Instantiated without data
But specifying the parameters
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier(n_neighbors=2)
G Varoquaux 19
2 API: training a model
Training from data
estimator.fit(X_train, Y_train)
with:
X a numpy array with shape
n_samples × n_features
y a numpy 1D array, of ints or floats, with shape
n_samples
G Varoquaux 20
2 API: using a model
Prediction: classification, regression
Y_test = estimator.predict(X_test)
Transforming: dimension reduction, filter
X_new = estimator.transform(X_test)
Test score, density estimation
test_score = estimator.score(X_test)
G Varoquaux 21
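The three ways of using a fitted estimator can be tried on real objects. The sketch below assumes a recent scikit-learn release and picks the digits dataset, a k-nearest-neighbors classifier, and PCA as examples; these choices are illustrative. Note that for a supervised estimator `score` also takes the true labels, whereas the one-argument form shown on the slide suits density estimation.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # illustrative dataset, not from the slides
X_train, X_test, Y_train, Y_true = train_test_split(X, y, random_state=0)

# Prediction: a classifier learns from (X, y) and predicts labels.
classifier = KNeighborsClassifier(n_neighbors=2)
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)

# Transforming: a dimension reduction learns from X alone.
reducer = PCA(n_components=2)
reducer.fit(X_train)
X_new = reducer.transform(X_test)

# Test score: supervised estimators need the true labels to score against.
test_score = classifier.score(X_test, Y_true)
print(f"reduced shape: {X_new.shape}, test score: {test_score:.2f}")
```

The same `fit` call precedes all three uses; only the method called afterwards changes.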
2 Vectorizing
From raw data to a sample matrix X
For text data: counting word occurrences
- Input data: list of documents (string)
- Output data: numerical matrix
from sklearn.feature_extraction.text import HashingVectorizer
hasher = HashingVectorizer()
X = hasher.fit_transform(documents)
G Varoquaux 22
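A runnable version of the vectorizing step, with three made-up documents. The value of `n_features` is set by hand only to keep the output small; it is an arbitrary choice for this sketch.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up documents for illustration.
documents = [
    "machine learning in Python",
    "machine learning without learning the machinery",
    "Python, what else?",
]

# n_features is set explicitly to keep the matrix small; the value is arbitrary.
hasher = HashingVectorizer(n_features=2 ** 10)
X = hasher.fit_transform(documents)

print(X.shape)  # (3, 1024): one row per document, one column per hash bucket
print(X.nnz)    # number of non-zero entries stored in the sparse matrix
```

The result is a SciPy sparse matrix, so only the non-zero counts are kept in memory.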
2 Scikit-learn: very rich feature set
Supervised learning
Decision trees (Random-Forest, Boosted Tree)
Linear models
SVM
Unsupervised Learning
Clustering
Dictionary learning
Outlier detection
Model selection
Built in cross-validation
Parameter optimization
G Varoquaux 23
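The model-selection tools listed above follow the same estimator interface. The sketch below assumes a recent scikit-learn release; the dataset, the estimator, and the grid of `C` values are arbitrary examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # illustrative dataset

# Built-in cross-validation: one score per fold.
scores = cross_val_score(SVC(), X, y, cv=5)
print("fold scores:", scores.round(2))

# Parameter optimization: try each value of C, keep the best by cross-validation.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

`GridSearchCV` is itself an estimator: after `fit`, it can be used for prediction like any other model.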
2 Computational performance
              scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
SVM           5.2            9.47    17.5      11.52    40.48   5.63
LARS          1.17           105.3   -         37.35    -       -
Elastic Net   0.52           73.7    -         1.44     -       -
kNN           0.57           1.41    -         0.56     0.58    1.36
PCA           0.18           -       -         8.93     0.47    0.33
k-Means       1.34           0.79    ∞         -        35.75   0.68
Algorithmic optimizations
Minimizing data copies
Random Forest fit time (s)
Library (implementation)         RF         ETs
Scikit-Learn (Python, Cython)    203.01     211.53
OpenCV (C++)                     4464.65    3342.83
OK3 (C)                          1518.14    1711.94
Weka (Java)                      1027.91    -
randomForest (R, Fortran)        13427.06   -
Orange (Python)                  10941.72   -
Figure: Gilles Louppe
G Varoquaux 24
What if the data does not fit in memory?
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
G Varoquaux 25
2 On-line algorithms
estimator.partial_fit(X_train, Y_train)
Linear models
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier
Clustering
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch (new in 0.16)
PCA (new in 0.16)
sklearn.decomposition.IncrementalPCA
G Varoquaux 26
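An on-line fit can be sketched as a loop over chunks of data. In the example below, the generator `data_chunks` and the synthetic data are made up to stand in for batches read from disk; only `SGDClassifier.partial_fit` and its `classes` argument are scikit-learn's.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_features = 5
true_weights = rng.normal(size=n_features)  # hypothetical ground truth


def data_chunks(n_chunks, chunk_size):
    """Yield (X, y) mini-batches, standing in for data read from disk."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X @ true_weights > 0).astype(int)
        yield X, y


estimator = SGDClassifier(random_state=0)
for X_train, Y_train in data_chunks(n_chunks=20, chunk_size=100):
    # The full list of classes must be given, at least on the first call,
    # because a single chunk may not contain every label.
    estimator.partial_fit(X_train, Y_train, classes=np.array([0, 1]))

X_test, Y_true = next(data_chunks(n_chunks=1, chunk_size=500))
test_score = estimator.score(X_test, Y_true)
print(f"accuracy after 20 chunks: {test_score:.2f}")
```

Only one chunk is held in memory at a time, which is the point of the on-line interface.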
2 On-the-fly data reduction
Many features
⇒ Reduce the data as it is loaded
X_small = estimator.transform(X_big, y)
G Varoquaux 27
2 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
G Varoquaux 27
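As an illustration of reducing data with random projections, the sketch below shrinks a made-up matrix with many features. The data sizes and `n_components=200` are arbitrary choices, and `GaussianRandomProjection` is just one of the transformers in `sklearn.random_projection`.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X_big = rng.normal(size=(100, 2000))  # made-up data: 100 samples, many features

# n_components is chosen by hand here; the value 200 is arbitrary.
estimator = GaussianRandomProjection(n_components=200, random_state=0)
X_small = estimator.fit_transform(X_big)

print(X_big.shape, "->", X_small.shape)  # (100, 2000) -> (100, 200)

# Compare one pairwise distance before and after the projection.
d_big = np.linalg.norm(X_big[0] - X_big[1])
d_small = np.linalg.norm(X_small[0] - X_small[1])
print(f"distance before: {d_big:.1f}, after: {d_small:.1f}")
```

Because the projection matrix does not depend on the data values, the same fitted transformer can be applied to each chunk as it is loaded.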
3 Scikit-learn: the project
G Varoquaux 28
3 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
∼ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
G Varoquaux 29
3 Many eyes make code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
G Varoquaux 30
3 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
G Varoquaux 31
3 Quality assurance
Code review: pull requests
Can include newcomers
We read each other's code
Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are names meaningful?
- Are the numerics stable?
- Could it be faster?
G Varoquaux 32
3 Quality assurance
Unit testing
Everything is tested
Great for numerics
Common tests enforce, on all estimators:
- consistency with the API
- basic invariances
- good handling of various inputs
G Varoquaux 33
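The idea of running the same checks on every estimator can be illustrated with a short loop. This is a toy sketch, not scikit-learn's actual test suite: the list of estimators is hand-picked, the data is synthetic, and the three checks are simplified stand-ins for the categories listed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)

# A hand-picked list for illustration only.
estimators = [SVC(), KNeighborsClassifier(), LogisticRegression()]

checked = []
for estimator in estimators:
    # Consistency with the API: fit returns the estimator itself.
    assert estimator.fit(X, y) is estimator
    # Basic sanity check: predictions have one entry per sample.
    assert estimator.predict(X).shape == (X.shape[0],)
    # Handling of various inputs: plain Python lists are accepted too.
    assert estimator.fit(X.tolist(), y.tolist()) is estimator
    checked.append(type(estimator).__name__)

print("checked:", checked)
```

Writing the checks once and looping over estimators is what keeps a uniform API uniform as the library grows.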
Make it work, make it right, make it boring
G Varoquaux 34
3 The tragedy of the commons
Individuals, acting independently and rationally according
to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
+ It’s so hard to scale
User support
Growing codebase
G Varoquaux 35
@GaelVaroquaux
Scikit-learn
The vision
Machine learning as a means not an end
Versatile library: the “right” level of abstraction
Close to research, but seeking different tradeoffs
The tool
Simple API uniform across learners
Numpy matrices as data containers
Reasonably fast
The project
Many people working together
Tests and discussions for quality
We’re hiring!

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 

PyData Paris 2015 - Opening keynote Gael Varoquaux

  • 1. Scikit-learn for easy machine learning: the vision, the tool, and the project. Gaël Varoquaux. scikit-learn: machine learning in Python
  • 2. 1 Scikit-learn: the vision
  • 3. 1 Scikit-learn: the vision. An enabler
  • 4. 1 Scikit-learn: the vision. An enabler: machine learning for everybody and for everything; machine learning without learning the machinery
  • 5. Machine learning in a nutshell: machine learning is about making predictions from data
  • 6. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules (Eatable? Mobile? Tall?)
  • 7. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations
  • 8. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations. Statistical learning (2000s): model the noise in the observations
  • 9. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations. Statistical learning (2000s): model the noise in the observations. Big data (today): many observations, simple rules
  • 10. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations. Statistical learning (2000s): model the noise in the observations. Big data (today): many observations, simple rules. “Big data isn’t actually interesting without machine learning” (Steve Jurvetson, VC, Silicon Valley)
  • 11. 1 Machine learning in a nutshell: an example. Face recognition: Andrew, Bill, Charles, Dave
  • 12. 1 Machine learning in a nutshell: an example. Face recognition: Andrew, Bill, Charles, Dave, and a new face: ?
  • 13. 1 Machine learning in a nutshell. A simple method: (1) store all the known (noisy) images and the names that go with them; (2) for a new (noisy) image, find the stored image that is most similar. This is the “nearest neighbor” method
  • 14. 1 Machine learning in a nutshell. A simple method: (1) store all the known (noisy) images and the names that go with them; (2) for a new (noisy) image, find the stored image that is most similar. This is the “nearest neighbor” method. How many errors on already-known images? 0: no errors. Test data = train data
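
The two-step nearest-neighbor method on these slides can be sketched in a few lines of pure Python. This is only an illustration: the two-number "images" and the helper names (`squared_distance`, `predict_nearest`) are made up for the example and do not come from the talk.

```python
# Minimal 1-nearest-neighbor sketch in pure Python.
# The tiny 2-feature "images" below are made-up illustration data.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_nearest(train_X, train_names, sample):
    """Return the name attached to the stored sample closest to `sample`."""
    best = min(range(len(train_X)),
               key=lambda i: squared_distance(train_X[i], sample))
    return train_names[best]

# Step 1: store all the known samples and the names that go with them.
train_X = [(0.1, 0.9), (0.8, 0.2), (0.5, 0.5), (0.9, 0.9)]
train_names = ["Andrew", "Bill", "Charles", "Dave"]

# Step 2 applied to the already-known samples: each one is its own
# nearest neighbor, so the method makes no error on them.
train_errors = sum(
    predict_nearest(train_X, train_names, x) != name
    for x, name in zip(train_X, train_names)
)

# Step 2 applied to a new sample.
new_image = (0.75, 0.3)
predicted = predict_nearest(train_X, train_names, new_image)
```

Counting errors on the stored samples always gives zero here, which is why that count says nothing about how the method behaves on new data.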
  • 15. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension (x, y)
  • 16. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension (x, y). Which model to prefer?
  • 17. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension (x, y). Problem of “over-fitting”: minimizing the error is not always the best strategy (it learns the noise). Test data = train data
  • 18. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension (x, y). Prefer simple models: the concept of “regularization”. Balance the number of parameters to learn with the amount of data
  • 19. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension (x, y). Two descriptors: two dimensions (X_1, X_2, y). More parameters
  • 20. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension (x, y). Two descriptors: two dimensions (X_1, X_2, y). More parameters ⇒ need more data: the “curse of dimensionality”
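
The over-fitting point in the regression slides can be illustrated with a small NumPy sketch. The data-generating function, the noise level and the polynomial degrees (1 and 8) are assumptions chosen for the illustration, not values from the talk.

```python
import numpy as np

rng = np.random.RandomState(0)

def make_data(n):
    """Noisy samples of a straight line (an assumed toy ground truth)."""
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coefs, x, y):
    """Mean squared error of the polynomial `coefs` on the points (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

# A simple model (2 parameters) and a flexible one (9 parameters).
simple = np.polyfit(x_train, y_train, deg=1)
flexible = np.polyfit(x_train, y_train, deg=8)

train_err_simple = mse(simple, x_train, y_train)
train_err_flexible = mse(flexible, x_train, y_train)
test_err_simple = mse(simple, x_test, y_test)
test_err_flexible = mse(flexible, x_test, y_test)
```

The flexible model cannot do worse than the simple one on the training points, since a line is a special case of a degree-8 polynomial. Its error on the held-out points is usually larger, though the exact numbers depend on the random draw.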
  • 21. 1 Machine learning in a nutshell: classification. Example: recognizing hand-written digits
  • 22. 1 Machine learning in a nutshell: classification. Example: recognizing hand-written digits. Represent each digit with 2 numerical features (X1, X2)
  • 23. 1 Machine learning in a nutshell: classification (X1, X2)
  • 24. 1 Machine learning in a nutshell: unsupervised. [Figure: stock market structure, a map of companies including ConocoPhillips, Apple, Pepsi, GlaxoSmithKline, Sony, Pfizer, Amazon, IBM, Toyota, Cisco and Unilever]
  • 25. 1 Machine learning in a nutshell: unsupervised. [Figure: stock market structure, a map of companies including ConocoPhillips, Apple, Pepsi, GlaxoSmithKline, Sony, Pfizer, Amazon, IBM, Toyota, Cisco and Unilever] Unlabeled data is more common than labeled data
  • 26. Machine learning: mathematics and algorithms for fitting predictive models. Regression (x, y). Classification. Notions of overfit and test error
  • 27. Machine learning is everywhere: image recognition; marketing (click-through rate); movie / music recommendation; medical data; logistic chains (e.g. supermarkets); language translation; detecting industrial failures
  • 28. Why another machine learning package?
  • 29. Real statisticians use R. And real astronomers use IRAF. Real economists use Gauss. Real coders use C assembler. Real experiments are controlled in Labview. Real Bayesians use BUGS stan. Real text processing is done in Perl. Real deep learning is best done with torch (Lua). And medical doctors only trust SPSS
  • 30. 1 My stack: Python, what else? General purpose. Interactive language. Easy to read / write
  • 31. 1 My stack: the scientific Python stack. numpy arrays: mostly a float**; no annotation / structure; universal across applications; easily shared with C / Fortran
  • 32. 1 My stack: the scientific Python stack. numpy arrays, connecting to scipy, scikit-image, pandas, ... It’s about plugging things together
  • 33. 1 My stack: the scientific Python stack. numpy arrays, connecting to scipy, scikit-image, pandas, ... Being Pythonic and SciPythonic
  • 34. 1 scikit-learn vision. Machine learning for all: no specific application domain; no requirements in machine learning. High-quality Pythonic software library: interfaces designed for users. Community-driven development: BSD licensed, very diverse contributors. http://scikit-learn.org
  • 35. 1 Between research and applications. Machine learning research: conceptual complexity is not an issue; new and bleeding edge is better; simple problems are old science. In the field: tried and tested (aka boring) is good; little sophistication from the user; API is more important than maths; solving simple problems matters; solving them really well matters a lot
  • 36. 2 Scikit-learn: the tool. A Python library for machine learning. © Theodore W. Gray
  • 37. 2 A Python library. A library, not a program: more expressive and flexible; easy to include in an ecosystem. As easy as py: from sklearn import svm; classifier = svm.SVC(); classifier.fit(X_train, Y_train); Y_test = classifier.predict(X_test)
  • 38. 2 API: specifying a model. A central concept: the estimator. Instantiated without data, but specifying the parameters: from sklearn.neighbors import KNeighborsClassifier; estimator = KNeighborsClassifier(n_neighbors=2)
  • 39. 2 API: training a model. Training from data: estimator.fit(X_train, Y_train), with X a numpy array with shape n_samples × n_features, and y a numpy 1D array, of ints or floats, with shape n_samples
  • 40. 2 API: using a model. Prediction (classification, regression): Y_test = estimator.predict(X_test). Transforming (dimension reduction, filter): X_new = estimator.transform(X_test). Test score, density estimation: test_score = estimator.score(X_test)
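
The estimator API described on the three slides above can be put together as follows. This is a sketch assuming a recent scikit-learn release; the digits dataset, the train/test split and the choice of `KNeighborsClassifier` and `PCA` are illustrative choices, not taken from the talk.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X has shape (n_samples, n_features); y has shape (n_samples,).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Specify the model: the estimator is instantiated without data.
estimator = KNeighborsClassifier(n_neighbors=2)

# Train it from data, then predict and score on held-out samples.
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
test_score = estimator.score(X_test, y_test)  # mean accuracy for classifiers

# Transformers follow the same pattern, with transform instead of predict.
reducer = PCA(n_components=2)
reducer.fit(X_train)
X_new = reducer.transform(X_test)
```

For supervised estimators `score` takes both the samples and the true labels; the single-argument form shown on the slide applies to models such as density estimators.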
  • 41. 2 Vectorizing: from raw data to a sample matrix X. For text data: counting word occurrences. Input data: list of documents (strings). Output data: numerical matrix
  • 42. 2 Vectorizing: from raw data to a sample matrix X. For text data: counting word occurrences. Input data: list of documents (strings). Output data: numerical matrix. from sklearn.feature_extraction.text import HashingVectorizer; hasher = HashingVectorizer(); X = hasher.fit_transform(documents)
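
A runnable version of the vectorizing step might look like this. The three documents are made up, and `n_features=2 ** 10` is an arbitrary choice made to keep the matrix small.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Input data: a list of documents (strings). Made-up examples.
documents = [
    "machine learning in Python",
    "learning from data",
    "Python for data analysis",
]

# The hasher keeps no vocabulary, so it needs no fitting pass;
# transform alone is enough.
hasher = HashingVectorizer(n_features=2 ** 10)
X = hasher.transform(documents)

# Output data: a sparse numerical matrix, one row per document.
n_documents, n_columns = X.shape
```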
  • 43. 2 Scikit-learn: very rich feature set. Supervised learning: decision trees (Random-Forest, Boosted Tree), linear models, SVM. Unsupervised learning: clustering, dictionary learning, outlier detection. Model selection: built-in cross-validation, parameter optimization
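
The model selection tools listed on this slide can be sketched as below. This assumes a recent scikit-learn release, where `cross_val_score` and `GridSearchCV` live in `sklearn.model_selection`; the dataset and the parameter grid are arbitrary illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Built-in cross-validation: one score per fold.
scores = cross_val_score(SVC(gamma=0.001), X, y, cv=5)

# Parameter optimization: try every combination in the grid,
# scoring each one by cross-validation.
param_grid = {"C": [1, 10], "gamma": [0.001, 0.0001]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
```

`GridSearchCV` is itself an estimator: after `fit`, it can be used with `predict` like the model it wraps.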
  • 44. 2 Computational performance

                     scikit-learn  mlpy    pybrain  pymvpa  mdp     shogun
        SVM          5.2           9.47    17.5     11.52   40.48   5.63
        LARS         1.17          105.3   -        37.35   -       -
        Elastic Net  0.52          73.7    -        1.44    -       -
        kNN          0.57          1.41    -        0.56    0.58    1.36
        PCA          0.18          -       -        8.93    0.47    0.33
        k-Means      1.34          0.79    ∞        -       35.75   0.68

      Algorithmic optimizations. Minimizing data copies
  • 45. 2 Computational performance

                     scikit-learn  mlpy    pybrain  pymvpa  mdp     shogun
        SVM          5.2           9.47    17.5     11.52   40.48   5.63
        LARS         1.17          105.3   -        37.35   -       -
        Elastic Net  0.52          73.7    -        1.44    -       -
        kNN          0.57          1.41    -        0.56    0.58    1.36
        PCA          0.18          -       -        8.93    0.47    0.33
        k-Means      1.34          0.79    ∞        -       35.75   0.68

      Algorithmic optimizations. Minimizing data copies
      Random Forest fit time (s): Scikit-Learn-RF 203.01; Scikit-Learn-ETs 211.53; OpenCV-RF 4464.65; OpenCV-ETs 3342.83; OK3-RF 1518.14; OK3-ETs 1711.94; Weka-RF 1027.91; R-RF 13427.06; Orange-RF 10941.72
      Implementations: Scikit-Learn (Python, Cython); OpenCV (C++); OK3 (C); Weka (Java); randomForest (R, Fortran); Orange (Python). Figure: Gilles Louppe
  • 46. What if the data does not fit in memory? “Big data”: petabytes... Distributed storage. Computing cluster
  • 47. What if the data does not fit in memory? “Big data”: petabytes... Distributed storage. Computing cluster. Mere mortals: gigabytes... Python programming. Off-the-shelf computers. See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
  • 48. 2 On-line algorithms: estimator.partial_fit(X_train, Y_train)
  • 49. 2 On-line algorithms: estimator.partial_fit(X_train, Y_train). Linear models: sklearn.linear_model.SGDRegressor, sklearn.linear_model.SGDClassifier. Clustering: sklearn.cluster.MiniBatchKMeans, sklearn.cluster.Birch (new in 0.16). PCA (new in 0.16): sklearn.decomposition.IncrementalPCA
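
The `partial_fit` pattern can be sketched as a loop over chunks of data. Here the chunks are slices of an in-memory synthetic dataset; in an out-of-core setting each chunk would be loaded from disk instead. The dataset, the chunk size and the choice of `SGDClassifier` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for data that would normally be read chunk by chunk.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Classifiers need the full list of classes on the first partial_fit call,
# because a single chunk may not contain every class.
classes = np.unique(y)

estimator = SGDClassifier(random_state=0)
chunk_size = 200
for start in range(0, len(X), chunk_size):
    X_chunk = X[start:start + chunk_size]
    y_chunk = y[start:start + chunk_size]
    # Update the model with this chunk only; earlier chunks are not revisited.
    estimator.partial_fit(X_chunk, y_chunk, classes=classes)

accuracy = estimator.score(X, y)
```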
  • 50. 2 On-the-fly data reduction. Many features ⇒ reduce the data as it is loaded: X_small = estimator.transform(X_big, y)
  • 51. 2 On-the-fly data reduction. Random projections (will average features): sklearn.random_projection, random linear combinations of the features. Fast clustering of features: sklearn.cluster.FeatureAgglomeration; on images: super-pixel strategy. Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer; stateless: can be used in parallel
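
Two of the reduction strategies named above, random projections and feature agglomeration, might be used as follows. The random input matrix and the target of 50 components are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X_big = rng.normal(size=(100, 500))  # 100 samples, 500 features (made up)

# Random projection: each new feature is a random linear
# combination of the original ones.
projector = SparseRandomProjection(n_components=50, random_state=0)
X_projected = projector.fit_transform(X_big)

# Feature agglomeration: cluster similar features together and
# replace each cluster by a pooled value (the mean by default).
agglo = FeatureAgglomeration(n_clusters=50)
X_agglomerated = agglo.fit_transform(X_big)
```

Both objects are ordinary transformers, so once fitted they can reduce each new chunk of data with `transform` as it is loaded.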
  • 52. 3 Scikit-learn: the project
  • 53. 3 Community-based development in scikit-learn. Huge feature set: benefits of a large team. Project growth: more than 200 contributors; ∼12 core contributors; 1 full-time INRIA programmer from the start. Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn)
  • 54. 3 Many eyes make code fast. L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
  • 55. 3 6 steps to a community-driven project: (1) focus on quality; (2) build great docs and examples; (3) use GitHub; (4) limit the technicality of your codebase; (5) releasing and packaging matter; (6) focus on your contributors, give them credit and decision power. http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
  • 56. 3 Quality assurance. Code review: pull requests. Can include newcomers. We read each other's code. Everything is discussed: Should the algorithm go in? Are there good defaults? Are names meaningful? Are the numerics stable? Could it be faster?
  • 57. 3 Quality assurance. Unit testing: everything is tested. Great for numerics. Overall tests enforce, on all estimators: consistency with the API; basic invariances; good handling of various inputs
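
The idea of one test applied to every estimator can be sketched by hand. The function below, `check_basic_contract`, is a hypothetical helper written for this illustration; it is not the project's own common test suite, and the three estimators and the random data are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

def check_basic_contract(estimator, X, y):
    """Hypothetical common check: API consistency and one basic invariance."""
    # Consistency with the API: fit returns the estimator itself.
    assert estimator.fit(X, y) is estimator
    # predict returns one label per sample, drawn from the training labels.
    pred = estimator.predict(X)
    assert pred.shape == (X.shape[0],)
    assert set(np.unique(pred)) <= set(np.unique(y))
    # Basic invariance: refitting on the same data gives the same predictions.
    assert np.array_equal(pred, estimator.fit(X, y).predict(X))

estimators = [
    LogisticRegression(),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]

checked = []
for est in estimators:
    check_basic_contract(est, X, y)
    checked.append(type(est).__name__)
```

Because every estimator shares the same `fit` / `predict` interface, a single function like this can be run over the whole list without knowing anything about the individual models.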
  • 58. Make it work, make it right, make it boring
  • 59. 3 The tragedy of the commons. “Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.” (Wikipedia) Make it work, make it right, make it boring. Core projects (boring) are taken for granted ⇒ hard to fund, less excitement. They need citation, in papers and on corporate web pages
  • 60. 3 The tragedy of the commons. “Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.” (Wikipedia) Make it work, make it right, make it boring. Core projects (boring) are taken for granted ⇒ hard to fund, less excitement. They need citation, in papers and on corporate web pages. Plus, it’s so hard to scale: user support, growing codebase
  • 61. @GaelVaroquaux Scikit-learn. The vision: machine learning as a means, not an end. Versatile library: the “right” level of abstraction. Close to research, but seeking different tradeoffs
  • 62. @GaelVaroquaux Scikit-learn. The vision: machine learning as a means, not an end. The tool: simple API, uniform across learners. Numpy matrices as data containers. Reasonably fast
  • 63. @GaelVaroquaux Scikit-learn. The vision: machine learning as a means, not an end. The tool: simple API, uniform across learners. The project: many people working together. Tests and discussions for quality. We’re hiring!