Python for brain mining: (neuro)science with state of the art machine learning and data visualization
1. Python for brain mining:
(neuro)science with state of the art machine learning
and data visualization
Ga¨l Varoquaux
e
1. Data-driven science
“Brain mining”
2. Data mining in Python
Mayavi, scikit-learn, joblib
2. 1 Brain mining
Learning models of brain function
Ga¨l Varoquaux
e 2
3. 1 Imaging neuroscience
Brain Models of
images function
Cognitive tasks
Ga¨l Varoquaux
e 3
4. 1 Imaging neuroscience
Brain Models of
images function
Data-driven science
∂
i Cognitive= HΨ
Ψ tasks
∂t
Ga¨l Varoquaux
e 3
5. 1 Brain functional data
Rich data
50 000 voxels per
frame
Complex underlying
dynamics
Few observations
∼ 100
Drawing scientific conclusions?
Ill-posed statistical problem
Ga¨l Varoquaux
e 4
6. 1 Brain functional data
Rich data
50 000 voxels per
frame
Modern complex system studies:
Complex underlying
from strong hypothesizes to rich data
dynamics
Few observations
∼ 100
Drawing scientific conclusions?
Ill-posed statistical problem
Ga¨l Varoquaux
e 4
7. 1 Statistics: the curse of dimensionality
y function of x1
Ga¨l Varoquaux
e 5
8. 1 Statistics: the curse of dimensionality
y function of x1 y function of x1 and x2
More fit parameters?
⇒ need exponentially more data
Ga¨l Varoquaux
e 5
9. 1 Statistics: the curse of dimensionality
y function of x1 y function of x1 and x2
More fit parameters?
⇒ need exponentially more data
y function of 50 000 voxels?
Expert knowledge
Machine learning
(pick the right ones)
Ga¨l Varoquaux
e 5
10. 1 Brain reading
Predict from brain images the object viewed
Correlation analysis
Ga¨l Varoquaux
e 6
11. 1 Brain reading
Predict from brain images the object viewed
Inverse problem
Inject prior: regularize
Observations Spatial code
Sparse regression
Correlation analysis = compressive sensing
Ga¨l Varoquaux
e 6
12. 1 Brain reading
Predict from brain images the object viewed
Inverse problem
Inject prior: regularize
Extract brain regions
Observations Spatial code
Total variation
Correlation analysis regression
[Michel, Trans Med Imag 2011]
Ga¨l Varoquaux
e 6
13. 1 Brain reading
Predict from brain images the object viewed
Inverse problem
Inject prior: regularize
Cast the problem in Extract brain regions
a prediction task:
Observations supervised Total variation
Spatial code learning.
Correlation analysis a model-selection metric
Prediction is regression
[Michel, Trans Med Imag 2011]
Ga¨l Varoquaux
e 6
15. 1 Learning regions from spontaneous activity
Spatial
Time series map
Multi-subject dictionary
learning
Sparsity + spatial continuity + spatial variability
⇒ Individual maps + functional regions atlas
[Varoquaux, Inf Proc Med Imag 2011]
Ga¨l Varoquaux
e 8
16. 1 Graphical models: interactions between regions
Estimate covariance structure
Many parameters to learn
Regularize: conditional
independence
= sparsity on inverse
covariance
[Varoquaux NIPS 2010]
Ga¨l Varoquaux
e 9
17. 1 Graphical models: interactions between regions
Estimate covariance structure
Many parameters to learn
Regularize: conditional
Find structure via a density estimation:
independence
unsupervised learning.
= sparsity on inverse
Model selection: likelihood of new data
covariance
[Varoquaux NIPS 2010]
Ga¨l Varoquaux
e 9
18. 2 My data-science software stack
Mayavi, scikit-learn, joblib
Ga¨l Varoquaux
e 10
19. 2 Mayavi: 3D data visualization
Requirements Solution
large 3D data VTK: C++ data visualization
interactive visualization UI (traits)
easy scripting + pylab-inspired API
Black-box solutions don’t yield new intuitions
Limitations
hard to install Tragedy of the
clunky & complex commons or
C++ leaking through niche product?
3D visualization doesn’t
pay in academia
Ga¨l Varoquaux
e 11
20. 2 scikit-learn: statistical learning
Vision
Address non-machine-learning experts
Simplify but don’t dumb down
Performance: be state of the art
Ease of installation
Ga¨l Varoquaux
e 12
21. 2 scikit-learn: statistical learning
Technical choices
Prefer Python or Cython, focus on readability
Documentation and examples are paramount
Little object-oriented design. Opt for simplicity
Prefer algorithms to framework
Code quality: consistency and testing
Ga¨l Varoquaux
e 13
22. 2 scikit-learn: statistical learning
API
Inputs are numpy arrays
Learn a model from the data:
estimator.fit(X train, Y train)
Predict using learned model
estimator.predict(X test)
Test goodness of fit
estimator.score(X test, y test)
Apply change of representation
estimator.transform(X, y)
Ga¨l Varoquaux
e 14
24. 2 scikit-learn: statistical learning
Community
35 contributors since 2008, 103 github forks
25 contributors in latest release (3 months span)
Why this success?
Trendy topic?
Low barrier of entry
Friendly and very skilled mailing list
Credit to people
Ga¨l Varoquaux
e 16
25. 2 joblib: Python functions on steroids
We keep recomputing the same things
Nested loops with overlapping sub-problems
Varying parameters I/O
Standard solution:
pipelines
Challenges
Dependencies modeling
Parameter tracking
Ga¨l Varoquaux
e 17
26. 2 joblib: Python functions on steroids
Philosophy
Simple don’t change your code
Minimal no dependencies
Performant big data
Robust never fail
joblib’s solution = lazy recomputation:
Take an MD5 hash of function arguments,
Store outputs to disk
Ga¨l Varoquaux
e 18
27. 2 joblib
Lazy recomputing
>>> from j o b l i b import Memory
>>> mem = Memory ( c a c h e d i r = ’/ tmp / joblib ’)
>>> import numpy a s np
>>> a = np . v a n d e r ( np . a r a n g e (3) )
>>> s q u a r e = mem. c a c h e ( np . s q u a r e )
>>> b = square (a)
[ Memory ] C a l l i n g s q u a r e ...
s q u a r e ( a r r a y ([[0 , 0 , 1] ,
[1 , 1 , 1] ,
[4 , 2 , 1]]) )
s q u a r e - 0.0 s
>>> c = s q u a r e ( a )
>>> # No recomputation
Ga¨l Varoquaux
e 19
28. Conclusion
Data-driven science will need
machine learning because of the
curse of dimensionality
Scikit-learn and joblib: focus on
large-data performance and easy of
use
Cannot develop software and science separately
Ga¨l Varoquaux
e 20