1. Open source scientific software
What, why, & how
Gaël Varoquaux
—
Slides on slideshare
2. Please allow me to introduce myself
I’m a man of wealth and taste
I’ve been around for a long, long year
2005..2007: Experimental-control software
Quantum physics, free-fall airplanes
2006... Open source scientific Python
Mayavi, scikit-learn, joblib, nipy, nilearn...
2008 Consultant, scientific Python
Startup: Enthought, Texas
Scipy/Euroscipy conference chair
G Varoquaux
2
4. 1 Open Source: definitions
Free redistribution
Access to source code
Allow derived work
No discrimination against persons or groups /
against fields of endeavor
FSL, I am looking at you
Universities are commercial entities
(Madey vs Duke)
OSI: Open Source Initiative http://opensource.org
5. 1 Open Source: definitions
Open Community
Access to code repository: read & write
SPM, FreeSurfer... I am looking at you
6. 1 Choice of license
Use it, don’t screw my users
BSD, MIT
Viral by code inclusion
LGPL
CopyLeft
GPL
Do you understand the consequences?
- GPL code cannot be linked to MKL
- LGPL code can only be reused in GPL/LGPL code
- Code with no license cannot be used
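A minimal sketch of one way to catch the "no license" case before a release: scan a source tree for files that lack a license tag near the top. The `SPDX-License-Identifier:` tag convention, the `files_missing_license` helper and the five-line header window are assumptions made for illustration; none of them comes from the slides.

```python
import os

# Assumption: the project marks each file with an SPDX tag such as
# "# SPDX-License-Identifier: BSD-3-Clause" in its first few lines.
SPDX_TAG = "SPDX-License-Identifier:"


def files_missing_license(root, extension=".py", header_lines=5):
    """Return relative paths of files under root with no SPDX tag in their header."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extension):
                continue
            path = os.path.join(dirpath, name)
            # Only the first few lines are inspected: the tag is a header.
            with open(path, encoding="utf-8", errors="replace") as handle:
                head = [handle.readline() for _ in range(header_lines)]
            if not any(SPDX_TAG in line for line in head):
                missing.append(os.path.relpath(path, root))
    return sorted(missing)
```

Calling `files_missing_license("path/to/project")` returns the files that still need a license header; an empty list means every scanned file declares one.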
http://opensource.org/licenses
7. 1 Choice of license
Do you understand the consequences?
Don’t invent licenses
Legalese should be left to lawyers
8. 1 Choice of license
Use BSD
- foster private sector
- avoid legal difficulties
- we need as much reuse as possible
- science should not have strings attached
9. Open source scientific software
2 Why
How do we justify the investment
to our bosses
to the funding agencies
www.phdcomics.com
10. 2 For the Good of Science
“if it’s not open and
verifiable by others, it’s
not science, or engineering,
or whatever it is you call
what we do” Stodden, 2010
“An article about computational science in a scientific
publication is not the scholarship itself, it is merely
advertising of the scholarship. The actual scholarship is
the complete software development environment.”
Buckheit & Donoho, 1995
Reproducible science
11. 2 For the Good of Science
These are high-level conclusions
Need more down-to-earth arguments
12. 2 Lab survival: beyond the oral tradition
Can you run the analysis
of the lab’s former students?
We need basic building blocks
More eyes make bugs shallow
13. 2 The economics
Code maintenance is expensive
scikit-learn ∼ 300 emails/month
nipy ∼ 45 emails/month
joblib ∼ 45 emails/month
mayavi ∼ 30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
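To put the email volumes quoted above in perspective, here is a back-of-the-envelope sum. The five minutes per email is a purely illustrative assumption, not a figure from the slides, and the helper name `monthly_load` is hypothetical.

```python
# Approximate email volumes quoted on the slide (emails per month).
EMAILS_PER_MONTH = {
    "scikit-learn": 300,
    "nipy": 45,
    "joblib": 45,
    "mayavi": 30,
}

# Assumption for illustration only: reading and answering takes 5 minutes.
MINUTES_PER_EMAIL = 5


def monthly_load(volumes, minutes_per_email):
    """Return (total emails per month, total hours per month)."""
    total_emails = sum(volumes.values())
    total_hours = total_emails * minutes_per_email / 60
    return total_emails, total_hours


total, hours = monthly_load(EMAILS_PER_MONTH, MINUTES_PER_EMAIL)
print(f"{total} emails/month, about {hours:.0f} hours/month")
# prints "420 emails/month, about 35 hours/month"
```

Under that assumption, the four projects together would take close to a working week of attention every month if a single person handled all the traffic.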
14. 2 The economics
Your “benefits” come from a fraction of the code
Data loading?
Standard algorithms?
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
15. 2 Having an impact
To reach our target audience
(neuroscientists, MD)
To disseminate our ideas
To facilitate new ideas
Can bring citations
17. 3 Choice of environment
Python, what else?
High-level language
- interactive: ipython
- easy to debug
- general purpose
Scientific computing environment
- array computing: numpy
- rich ecosystem: scipy, scikit-learn, scikit-image...
18. 3 6 steps to a successful project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors, give them credit
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
19. 3 Scikit-learn: a very successful project
General-purpose machine learning in Python
Over 200 contributors
∼ 12 core devs
Huge feature list: benefits of wide team
Success recipe: product vision, great docs, high-level
Documentation: all figures are generated
Crafting simple didactic examples has taught us a lot
⇒ Executable docs
= textbooks of the future
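As a minimal sketch of what "executable docs" can mean in practice, Python's standard `doctest` module runs the examples embedded in a docstring and checks their output. The `moving_average` function and the `check_docs` helper are hypothetical; the slides do not describe scikit-learn's own documentation tooling, which may work differently.

```python
import doctest


def moving_average(values, window):
    """Return the moving average of ``values`` over ``window`` samples.

    The example below is documentation and test at the same time:

    >>> moving_average([1, 2, 3, 4], window=2)
    [1.5, 2.5, 3.5]
    """
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def check_docs(func):
    """Run the docstring examples of func; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Build a test from the docstring, exposing func under its own name.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted
```

If the function's behavior drifts from what the docstring shows, `check_docs(moving_average)` reports a failure, so the documentation cannot silently go stale.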
20. 3 Nilearn: making multivariate analysis routine
Project scope
Very preliminary
Machine learning for neuroimaging:
make using scikit-learn on neuroimaging easy
The target user base is small
Examples in the docs
Run out of the box,
downloading open data
Produce a clear figure
Data from Miyawaki 2008
Routine, simple, reproduction of papers
21. Open source scientific software
It’s worth it
Do it right:
- Liberal licensing (BSD)
- Realistic engineering compromises
- Quality and ease of use (the Apple strategy)
Work with us on nilearn
Examples = open science
@GaelVaroquaux
22. Open source: a tragedy
1/f distribution
Source: Fernando Perez