Coding for science and innovation
Gaël Varoquaux
to change the world!
Science
The process of discovering
knowledge and mechanisms
Computing is a central part of how we do science
Science + Computers = Computational science
Nuclear physics Fluid dynamics Chemistry
Psychology
Marketing
Data science: using data to acquire insights
G Varoquaux 2
Science
The process of discovering
knowledge and mechanisms
“Science is not a political construct or a belief system.
Scientific progress depends on openness, transparency,
and the free flow of ideas and people.”
— Dr. Rush Holt, CEO of AAAS,
testimony to the House Committee on Science, Space, and Technology, Feb 8, 2017
G Varoquaux 3
Science
The process of discovering
knowledge and mechanisms
Science helps shape society
Growth in a time of debt [Reinhart & Rogoff 2010]:
Wrong conclusions due to flawed Excel processing
⇒ Public debt blamed for financial crisis (Osborne UK MP)
Autism and vaccines:
forged study: [Wakefield et al, Lancet 1998]
⇒ Drop in vaccination, measles outbreak
Loss of trust in science is very costly
G Varoquaux 3
Innovation
Putting the right technology to the right use
Light bulb:
Invented ∼ 1835 by Lindsay
Extra progress: vacuum pumps (Swan ∼ 1880)
Economics: availability of electric power
⇒ Edison’s company
Outbox: company digitizing physical mail
But citizens aren’t the USPS customers, junk mailers are
⇒ No cooperation from USPS, Outbox dies
Power balances drive innovation as much as technology
G Varoquaux 4
Coding for science and innovation:
Computing is the new electricity:
a driver for change
With new data sources,
it reaches beyond physics & engineering
G Varoquaux 5
Coding for science and innovation:
1 Coding as a scientist
2 Building software for science
3 An ecosystem
G Varoquaux 6
1 Coding as a scientist
G Varoquaux 7
1 Data in brain sciences
The mental world
cognition, emotions
autism, depression
Historically studied
via verbal interactions
Psychology
The brain
an organ:
neurons, firing
Imaging brain activity
Quantitative data
G Varoquaux 8
1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
Comparing the brain activity of many subjects
Supervised machine learning to discriminate Autism
G Varoquaux 9
1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
1. Extract brain networks
Unsupervised feature learning
complex model fit to 1Tb data
2. Per-subject connections
Information geometry,
Lie algebra...
3. Supervised learning
Scikit-learn
Limits to impact:
Cannot outperform clinicians that define Autism/Control
Psychiatrists unhappy with current blurry definition
But not ready to accept black-box algorithmic definition
Lots of moving parts
Practitioners need to
make the tools theirs
G Varoquaux 9
1 A quest for trust: reproducible research
“if it’s not open and verifiable by others, it’s not science,
or engineering, or whatever it is you call what we do”
— V. Stodden, The scientific method in practice
Computational reproducibility:
Automate everything
Control the environment
G Varoquaux 10
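One possible way to act on "automate everything, control the environment" is to store a description of the software environment next to each scripted result. The sketch below is only an illustration under assumptions: `describe_environment` and `run_analysis` are hypothetical names, and the package list is arbitrary.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


def run_analysis(values):
    """Stand-in for a fully scripted analysis step (here, a plain mean)."""
    return sum(values) / len(values)


if __name__ == "__main__":
    report = {
        "result": run_analysis([1.0, 2.0, 3.0]),
        # The package names are examples; list whatever the analysis imports.
        "environment": describe_environment(["numpy", "scikit-learn"]),
    }
    print(json.dumps(report, indent=2))
```

Saving such a report with the outputs does not pin the environment, but it makes later verification easier.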
1 Automate everything
Just a simple matter of programming
G Varoquaux 11
1 Automate everything...
Some operations work better with a human in the loop
Scientific research is an iterative process
Tension between needs for interaction and replay
Mayavi
Reflexivity between dialogs and objects
Record mode
Jupyter, and its widgets:
Exploring the space between interaction and code
G Varoquaux 11
1 Beyond computational reproducibility
Make every computational step reproducible,
and good science will emerge
Estimating the reproducibility of psychological science
[Science 2015] 36% of effects replicate
Reasons:
Statistical challenges: analysis degrees of freedom
Weak incentives: winner’s curse in publication
Seldom computational reproducibility
I think that reproducibility is a misnomer.
What matters is that operations be
verifiable or reusable.
G Varoquaux 12
1 Managing complexity
In practice, the best way to improve research
is to use the right (conceptual) tools.
The everyday roadblock is cognitive load
Machine learning, brain anatomy, psychology
R, Python, shell scripts
Funding agencies, reviewer 3, courting VCs
G Varoquaux 14
Coding as a scientist
Final code should be auditable,
ideally reusable
Tension between interactive computing
& automating
Main enemy: cognitive overload
In the industry
Reusable
Verifiable? Not for Silicon Valley,
but in insurance, healthcare, banking...
Moving data-scientist code
to production?
Software projects going over budget?
G Varoquaux 15
Code quality in exploratory work
[Arrow alongside the list: increasing cost]
Use pyflakes in your editor seriously
Coding convention, good naming
Version control: use git + GitHub
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
Avoid premature software engineering
Over versus under engineering
Goal is generating insights / moving in new spaces
Experimentation for intuitions and proofs of concepts
⇒ new ideas
As the path becomes clear: consolidation
is great for that
Heavy engineering too early freezes bad ideas
G Varoquaux 16
2 Building software for science
The point of view of the developer
Libraries are what enable us to scale:
Abstractions reduce cognitive load
Code reuse gets us further
G Varoquaux 17
2 Examples of such libraries
scikit-learn
Make research in machine-learning
models and algorithms usable by people
who do not understand them
Challenges:
Variety of that space
Statistical concepts vs. coding concepts
nilearn
Make it easy to answer neuroimaging
problems with them
Challenges: Onboarding technology-averse users
G Varoquaux 18
2 Tools that reduce cognitive overload
It’s a design problem
G Varoquaux 19
2 Tools that reduce cognitive overload
Jonathan Ive, an industrial designer, is #4 at Apple
Code different.
G Varoquaux 20
2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
G Varoquaux 21
2 Some API design principles for the scipy stack
Consistency, consistency, consistency
np.save(file, obj) vs pickle.dump(obj, file)
fmin(..., maxiter=10) vs lsq_linear(..., max_iter=10)
Creates cognitive overload
G Varoquaux 22
2 Some API design principles for the scipy stack
Functions are easier to understand than classes
Objects have hidden states
Objects have no universal interface, entry point, output
G Varoquaux 23
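A small illustration of the "hidden state" point, using made-up names (`RunningMean`, `mean`) that do not come from any library: the same method call returns different values depending on the object's history, while the function's output depends only on its argument.

```python
class RunningMean:
    """Object with hidden state: the result of update() depends on past calls."""

    def __init__(self):
        self._total = 0.0
        self._count = 0

    def update(self, value):
        self._total += value
        self._count += 1
        return self._total / self._count


def mean(values):
    """Function: the output depends only on the argument."""
    values = list(values)
    return sum(values) / len(values)


tracker = RunningMean()
tracker.update(2.0)
after_one_call = tracker.update(4.0)   # mean of [2.0, 4.0]
after_two_calls = tracker.update(4.0)  # same argument, different result

same_every_time = mean([2.0, 4.0])     # always the mean of exactly these values
```

To predict what `tracker.update(4.0)` returns, a reader has to know every call that came before it; for `mean`, reading the call site is enough.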
2 Some API design principles for the scipy stack
A library should hinge on a small number of concepts
How much do usage patterns carry over across the library?
G Varoquaux 24
2 Some API design principles for the scipy stack
Common data containers make the ecosystem stronger
Facilitates working with multiple libraries together
Easier to get up to speed with a given library
G Varoquaux 25
2 Some API design principles for the scipy stack
Each function should have one and only one purpose
Change of behavior depending on input type
G Varoquaux 26
2 Some API design principles for the scipy stack
Code for interfaces, but don’t overdo duck typing
Interfaces define objects
Incompatible behaviors lead to bugs (e.g. np.matrix)
G Varoquaux 27
2 Some API design principles for the scipy stack
Properties are for impedance matching
Properties obfuscate the data model of the object
Properties can create hidden compute costs
G Varoquaux 28
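To make the "hidden compute costs" point concrete, here is a toy sketch. The `Dataset` class and its `n_sorts` counter are invented for the illustration; the counter exists only to show how often the expensive step runs.

```python
class Dataset:
    """Toy container; all names here are illustrative."""

    def __init__(self, values):
        self.values = list(values)
        self.n_sorts = 0  # counts how often the expensive step runs

    @property
    def median(self):
        # Reads like a stored attribute, but sorts the data on every access.
        self.n_sorts += 1
        ordered = sorted(self.values)
        middle = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[middle]
        return (ordered[middle - 1] + ordered[middle]) / 2


data = Dataset([5, 1, 4, 2, 3])
for _ in range(3):
    data.median  # three innocent-looking attribute reads, three full sorts
```

Nothing at the call site hints that `data.median` does work proportional to the size of the data, which is the obfuscation the slide warns about.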
2 Some API design principles for the scipy stack
Shallow is better than deep
Objects are understood by their surface
Composition creates cognitive overload
G Varoquaux 29
2 Some API design principles for the scipy stack
Error messages matter
Explain the problem
Print the offending value
G Varoquaux 30
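A minimal sketch of an error message that explains the problem and prints the offending value. The function and parameter names (`check_n_components`, `n_components`, `n_features`) are hypothetical, chosen only for the example.

```python
def check_n_components(n_components, n_features):
    """Validate a hypothetical `n_components` parameter."""
    if isinstance(n_components, bool) or not isinstance(n_components, int):
        raise TypeError(
            "n_components must be an integer, got "
            f"{n_components!r} of type {type(n_components).__name__}."
        )
    if not 1 <= n_components <= n_features:
        raise ValueError(
            f"n_components={n_components!r} is out of range: it must be "
            f"between 1 and n_features={n_features}."
        )
    return n_components


try:
    check_n_components(12, n_features=10)
except ValueError as error:
    message = str(error)
    print(message)
```

The message names the parameter, states the valid range, and echoes the value that was passed, so the user can fix the call without reading the source.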
2 Some API design principles for the scipy stack
Be Pythonic
Avoid syntax hacks
G Varoquaux 31
2 Scikit-learn API
Scikit-learn cheat sheet
Fit and predict
>>> estimator = Estimator(param1=param1)
>>> estimator.fit(X_train, y_train)
>>> y_test = estimator.predict(X_test)
Transform data
>>> X_red = estimator.transform(X_test)
The estimator is a “contract”
(slightly more elaborate than above)
It has created an ecosystem of packages
Based on duck-typing, not inheritance
G Varoquaux 32
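The "contract, not inheritance" idea can be sketched without importing scikit-learn at all. `MajorityClassifier` and `evaluate` below are hypothetical names written for this illustration. The real scikit-learn contract asks for more (for instance `get_params` and `set_params`), which this sketch deliberately leaves out.

```python
from collections import Counter


class MajorityClassifier:
    """Predicts the most frequent training label.

    Follows the fit/predict convention by duck typing: it neither
    imports nor subclasses anything from scikit-learn.
    """

    def __init__(self, default=None):
        self.default = default

    def fit(self, X, y):
        counts = Counter(y)
        # Trailing underscore: an attribute learned from the data.
        self.majority_ = counts.most_common(1)[0][0] if counts else self.default
        return self  # returning self allows fit(...).predict(...)

    def predict(self, X):
        return [self.majority_ for _ in X]


def evaluate(estimator, X_train, y_train, X_test, y_test):
    """Generic code that relies only on the fit/predict contract."""
    y_pred = estimator.fit(X_train, y_train).predict(X_test)
    n_correct = sum(p == t for p, t in zip(y_pred, y_test))
    return n_correct / len(y_test)


X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = ["a", "b", "b", "b"]
X_test = [[4.0], [5.0]]
y_test = ["b", "a"]
accuracy = evaluate(MajorityClassifier(), X_train, y_train, X_test, y_test)
```

`evaluate` never checks the type of `estimator`; any object with compatible `fit` and `predict` methods works, which is what lets third-party packages plug into the same tooling.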
2 numpy arrays
ndarray
Abstraction over pointers & operations
Contract: the memory layout
IMHO, gone too far in number of methods (163)
The array protocol makes it easy to quack like an array
PS: The ecosystem needs categorical dtypes in numpy
G Varoquaux 33
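One facet of "quacking like an array" is the `__array__` method, which NumPy calls when it needs an `ndarray` from a foreign object. The sketch below assumes NumPy is installed; the `Temperatures` class is hypothetical.

```python
import numpy as np


class Temperatures:
    """Hypothetical container that is not an ndarray but can become one."""

    def __init__(self, celsius):
        self._celsius = list(celsius)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this method when it needs an array from the object.
        # A fresh array is built on every call, so `copy` needs no special care.
        return np.array(self._celsius, dtype=dtype)


readings = Temperatures([18.5, 21.0, 19.5])
as_array = np.asarray(readings)  # triggers Temperatures.__array__
total = as_array.sum()
```

Once converted, the data can flow through any code written against `ndarray`, without `Temperatures` inheriting from it.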
2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
User flow on the scikit-learn website:
Examples
User flow on the nilearn website:
Examples
Sphinx-gallery: compiling scripts in an examples gallery
Restructured text formatting
Capturing outputs
Links to function docs
+ Creates Jupyter notebooks
Insert links to examples
containing a function
G Varoquaux 34
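For readers who have not seen one, an example script in the layout sphinx-gallery works with looks roughly like the sketch below. The content is made up, and the file name (say, `plot_sample_mean.py`) is an assumption. It is an ordinary Python file that also runs on its own.

```python
"""
Mean of a small sample
======================

A short example script: a module docstring written in reStructuredText,
followed by code. The docstring becomes the title and the introduction
of the generated page.
"""

# %%
# Comment blocks that start with ``# %%`` are rendered as text
# between the code cells of the generated page.

sample = [2.0, 4.0, 9.0]
sample_mean = sum(sample) / len(sample)

# %%
# Anything printed by the script is captured and shown in the page.

print(f"Mean of {sample}: {sample_mean}")
```

Keeping such scripts short is the point of the "3-liner" goal: if the example needs many lines, the API probably needs another iteration.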
2 Building great documentation
Focus on explaining concepts (hint: write plain English)
Less is more: prioritize, avoid redundancy
Code examples must be short (link to full tutorial examples)
Links everywhere: users will land at the wrong place
Teach with the docs
Plan for maintenance of docs:
Continuous integration
Check links
Run examples
Doctests
G Varoquaux 35
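As an illustration of the "Doctests" item, the standard-library `doctest` module can execute the examples embedded in a docstring, so the documentation fails loudly when it drifts from the code. The `rescale` function and the `run_doctests` helper are hypothetical names for this sketch.

```python
import doctest


def rescale(values, factor=2):
    """Multiply every value by `factor`.

    >>> rescale([1, 2, 3])
    [2, 4, 6]
    >>> rescale([1, 2], factor=10)
    [10, 20]
    """
    return [value * factor for value in values]


def run_doctests(func):
    """Run the examples in `func`'s docstring; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # `globs` provides the names that the docstring examples may use.
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted


failed, attempted = run_doctests(rescale)
```

In a continuous-integration job, a non-zero `failed` count would be the signal to stop the build.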
2 Reusable science
scikit-learn is the new machine-learning textbook
nilearn is the new neuroimaging review article
Experiments reproduced
at each commit
eg: brain reading
nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html
Resource intensive CI:
Data ⇒ Fight for good open data
Computation ⇒ Find good algorithms and tradeoffs
Forces us to distill the literature (as a review)
Package development consolidates
science and moves it outside the lab
G Varoquaux 36
3 An ecosystem
A bird’s eye view on scientific packages
G Varoquaux 37
3 Packages of the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) versus package rank (1 to 10000), log-log axes.
Labeled packages: simplejson #1, six #2, setuptools #3, numpy #49, scikit-learn #110, joblib #431, nilearn #2877]
A small number of packages
are used by many
1/f distribution, preferential attachment
nilearn relies on scikit-learn & joblib that rely on numpy...
G Varoquaux 38
3 Standing on the shoulders of maintainers
May 31st: pip broken
https://github.com/pypa/setuptools/pull/1043
Left-pad:
How left-padding strings broke
the Internet
A JavaScript package
for left padding strings
was removed from
node’s package manager,
breaking all the websites
that depended on it.
G Varoquaux 39
3 Dependencies
Beyond installation, a challenge is to ensure package
versions play well together: correctness of the code
Breakage of backward compatibility
yields irreconcilable dependencies
G Varoquaux 40
3 Dependencies and their upgrade
It’s a fact: users hate upgrading
If it ain’t broken, don’t fix it
even if it is, apparently
G Varoquaux 41
3 Declaring undependence?
Monolithic packages with no dependencies...
But:
Scaling is hard
Complexity grows as square of codebase size
[Woodfield 1979]
User support grows with userbase size
G Varoquaux 42
3 Core software is infrastructure
Everybody uses it every day
In industry, education, & research
It needs maintenance
Like roads (or OpenSSL, to prevent Heartbleed)
Central infrastructure packages are “boring”
They are understaffed and underfunded
References: “Roads and Bridges”, Ford Foundation report
Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw
G Varoquaux 43
@GaelVaroquaux
Coding for science and innovation
New science
High value of bringing new methods to a field
⇒ Enable domain-specialists
Rapid iteration, but with automation & consolidation
Software tools
Scientists are limited by cognitive load
⇒ Design of API and documentation in libraries
Libraries make science reproducible and reusable
An ecosystem
Central packages hold the ecosystem together
Thanks to: the scipy community

More Related Content

Similar to Coding for science and innovation

Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
dgarijo
 
Ict와 사회과학지식간 학제간 연구동향(23 march2013)
Ict와 사회과학지식간 학제간 연구동향(23 march2013)Ict와 사회과학지식간 학제간 연구동향(23 march2013)
Ict와 사회과학지식간 학제간 연구동향(23 march2013)
Han Woo PARK
 
DevelopingDataScienceProfession
DevelopingDataScienceProfessionDevelopingDataScienceProfession
DevelopingDataScienceProfession
Gary Rector
 
4th_paradigm_book_complete_lr
4th_paradigm_book_complete_lr4th_paradigm_book_complete_lr
4th_paradigm_book_complete_lr
Dominic A Ienco
 

Similar to Coding for science and innovation (20)

Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...
 
Open Science and Executable Papers
Open Science and Executable PapersOpen Science and Executable Papers
Open Science and Executable Papers
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
 
CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...
 
Digital Science: Towards the executable paper
Digital Science: Towards the executable paperDigital Science: Towards the executable paper
Digital Science: Towards the executable paper
 
Research in Computer Science and Engineering
Research in Computer Science and EngineeringResearch in Computer Science and Engineering
Research in Computer Science and Engineering
 
Ict와 사회과학지식간 학제간 연구동향(23 march2013)
Ict와 사회과학지식간 학제간 연구동향(23 march2013)Ict와 사회과학지식간 학제간 연구동향(23 march2013)
Ict와 사회과학지식간 학제간 연구동향(23 march2013)
 
Gridforum David De Roure Newe Science 20080402
Gridforum David De Roure Newe Science 20080402Gridforum David De Roure Newe Science 20080402
Gridforum David De Roure Newe Science 20080402
 
tools for communicating in the computational sciences
tools for communicating in the computational sciencestools for communicating in the computational sciences
tools for communicating in the computational sciences
 
Computational Thinking - a 4 step approach and a new pedagogy
Computational Thinking - a 4 step approach and a new pedagogyComputational Thinking - a 4 step approach and a new pedagogy
Computational Thinking - a 4 step approach and a new pedagogy
 
Big data, Behavioral Change and IOT Architecture
Big data, Behavioral Change and IOT ArchitectureBig data, Behavioral Change and IOT Architecture
Big data, Behavioral Change and IOT Architecture
 
DevelopingDataScienceProfession
DevelopingDataScienceProfessionDevelopingDataScienceProfession
DevelopingDataScienceProfession
 
Jéssica Cohen, José M. Blanco, Yaiza Rubio, Félix Brezo
Jéssica Cohen, José M. Blanco, Yaiza Rubio, Félix BrezoJéssica Cohen, José M. Blanco, Yaiza Rubio, Félix Brezo
Jéssica Cohen, José M. Blanco, Yaiza Rubio, Félix Brezo
 
The Role of Scientific Method in Software Development
The Role of Scientific Method in Software Development The Role of Scientific Method in Software Development
The Role of Scientific Method in Software Development
 
Introduction to AI (Artificial Intelligence).
Introduction to AI (Artificial Intelligence).Introduction to AI (Artificial Intelligence).
Introduction to AI (Artificial Intelligence).
 
From Open Data to Open Science, by Geoffrey Boulton
 From Open Data to Open Science, by Geoffrey Boulton From Open Data to Open Science, by Geoffrey Boulton
From Open Data to Open Science, by Geoffrey Boulton
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science process
 
4th_paradigm_book_complete_lr
4th_paradigm_book_complete_lr4th_paradigm_book_complete_lr
4th_paradigm_book_complete_lr
 
Lecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdfLecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdf
 
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life SciencesIncreasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
 

More from Gael Varoquaux

Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
Gael Varoquaux
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 

More from Gael Varoquaux (20)

Evaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueEvaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic value
 
Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imaging
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing values
 
Dirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated dataDirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated data
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settings
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mapping
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Coding for science and innovation

  • 1. Coding for science and innovation Gaël Varoquaux to change the world!
  • 2. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science G Varoquaux 2
  • 3. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Nuclear physics Fluid dynamics Chemistry G Varoquaux 2
  • 4. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Psychology G Varoquaux 2
  • 5. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Psychology Marketing Data science: using data to acquire insights G Varoquaux 2
  • 6. Science The process of discovering knowledge and mechanisms “Science is not a political construct or a belief system. Scientific progress depends on openness, transparency, and the free flow of ideas and people.” — Dr. Rush Holt, CEO of AAAS, testimony to the House Committee on Science, Space, and Technology, Feb 8, 2017 G Varoquaux 3
  • 7. Science The process of discovering knowledge and mechanisms Science helps shape society Growth in a time of debt [Reinhart & Rogoff 2010]: Wrong conclusions due to flawed Excel processing ⇒ Public debt blamed for financial crisis (Osborne UK MP) Autism and vaccines: forged study: [Wakefield et al, Lancet 1998] ⇒ Drop in vaccination, measles outbreak Loss of trust in science is very costly G Varoquaux 3
  • 8. Innovation Putting the right technology to the right use G Varoquaux 4
  • 9. Innovation Putting the right technology to the right use Light bulb: Invented ∼ 1835 by Lindsay Extra progress: vacuum pumps (Swan ∼ 1880) Economics: availability of electric power ⇒ Edison’s company G Varoquaux 4
  • 10. Innovation Putting the right technology to the right use Light bulb: Invented ∼ 1835 by Lindsay Extra progress: vacuum pumps (Swan ∼ 1880) Economics: availability of electric power ⇒ Edison’s company Outbox: company digitizing physical mail But citizens aren’t the USPS customers, junk mailers are ⇒ No cooperation from USPS, Outbox dies Power balances drive innovation as much as technology G Varoquaux 4
  • 11. Coding for science and innovation: Computing is the new electricity: a driver for change With new data sources, it reaches beyond physics & engineering G Varoquaux 5
  • 12. Coding for science and innovation: 1 Coding as a scientist 2 Building software for science 3 An ecosystem G Varoquaux 6
  • 13. 1 Coding as a scientist G Varoquaux 7
  • 14. 1 Data in brain sciences The mental world cognition, emotions autism, depression Historically studied via verbal interactions Psychology G Varoquaux 8
  • 15. 1 Data in brain sciences The mental world cognition, emotions autism, depression Historically studied via verbal interactions The brain an organ: neurons, firing Imaging brain activity Quantitative data G Varoquaux 8
  • 16. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] Comparing the brain activity of many subjects Supervised machine learning to discriminate Autism G Varoquaux 9
  • 17. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks Unsupervised feature learning complex model fit to 1Tb data G Varoquaux 9
  • 18. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections Information geometry, Lie algebra... G Varoquaux 9
  • 19. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn G Varoquaux 9
  • 20. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn Limits to impact: Cannot outperform clinicians that define Autism/Control Psychiatrists unhappy with current blurry definition But not ready to accept black-box algorithmic definition G Varoquaux 9
  • 21. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn Limits to impact: Cannot outperform clinicians that define Autism/Control Psychiatrists unhappy with current blurry definition But not ready to accept black-box algorithmic definition Lots of moving parts Practitioners need to make the tools theirs G Varoquaux 9
  • 22. 1 A quest for trust: reproducible research “if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do” — V. Stodden, The scientific method in practice Computational reproducibility: Automate everything Control the environment G Varoquaux 10
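The two practices named on this slide (automate everything, control the environment) can be sketched in a few lines. This is only an illustration under stated assumptions: `describe_environment`, `run_analysis` and `main` are hypothetical names, and the "analysis" is a stand-in that averages seeded random draws. The point is that a single entry point replays the run and records what is needed to audit it.

```python
import json
import platform
import random
import sys


def describe_environment():
    """Collect basic facts about the machine and interpreter used for a run."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "argv": sys.argv,
    }


def run_analysis(seed):
    """Hypothetical analysis step: every source of randomness is seeded."""
    rng = random.Random(seed)  # local generator, no hidden global state
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(sample) / len(sample)


def main(seed=0):
    # One entry point replays the whole pipeline without manual steps,
    # and stores the environment next to the result.
    record = {
        "environment": describe_environment(),
        "seed": seed,
        "result": run_analysis(seed),
    }
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(main())
```

A real project would also pin package versions (for instance in a requirements file); that part is left out of this sketch.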
  • 23. 1 Automate everything Just a simple matter of programming G Varoquaux 11
  • 24. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay G Varoquaux 11
  • 25. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay Mayavi Reflexivity between dialogs and objects Record mode G Varoquaux 11
  • 26. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay Jupyter, and its widgets: Exploring the space between interaction and code G Varoquaux 11
  • 27. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge G Varoquaux 12
  • 28. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge Estimating the reproducibility of psychological science [Science 2015] 36% of effects replicate Reasons: Statistical challenges — analysis degrees of freedom Weak incentives — winner’s curse in publication Seldom computational reproducibility G Varoquaux 12
  • 29. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge Estimating the reproducibility of psychological science [Science 2015] 36% of effects replicate Reasons: Statistical challenges — analysis degrees of freedom Weak incentives — winner’s curse in publication Seldom computational reproducibility I think that reproducibility is a misnomer. What matters is that operations be verifiable or reusable. G Varoquaux 12
  • 30. In practice, the best way to improve research is to use the right (conceptual) tools. G Varoquaux 13
  • 31. 1 Managing complexity In practice, the best way to improve research is to use the right (conceptual) tools. The everyday roadblock is cognitive load Machine learning, brain anatomy, psychology R, Python, shell scripts Funding agencies, reviewer 3, courting VCs G Varoquaux 14
  • 32. Coding as a scientist Final code should be auditable, ideally reusable Tension between interactive computing & automating Main enemy: cognitive overload G Varoquaux 15
  • 33. Coding as a scientist Final code should be auditable, ideally reusable Tension between interactive computing & automating Main enemy: cognitive overload In the industry Reusable Verifiable? Not for Silicon Valley, but in insurance, healthcare, banking... Moving data-scientist code to production? Software projects going over budget? G Varoquaux 15
  • 34. Code quality in exploratory work Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... G Varoquaux 16
  • 35. Code quality in exploratory work (increasing cost?) Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering G Varoquaux 16
  • 36. Code quality in exploratory work (increasing cost?) Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering Over versus under engineering Goal is generating insights / moving in new spaces Experimentation for intuitions and proofs of concepts ⇒ new ideas As the path becomes clear: consolidation is great for that Heavy engineering too early freezes bad ideas G Varoquaux 16
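As an illustration of the "unit testing" step in the list above, here is a minimal sketch of a helper and its test. The function `zscore` and the test are hypothetical examples, not code from the talk; the test uses plain `assert` statements so that it can be run directly or picked up by a test runner such as pytest.

```python
import math


def zscore(values):
    """Center values to zero mean and scale them to unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def test_zscore():
    out = zscore([1.0, 2.0, 3.0, 4.0])
    # The output must have zero mean and unit variance.
    assert math.isclose(sum(out), 0.0, abs_tol=1e-12)
    assert math.isclose(sum(v * v for v in out) / len(out), 1.0)
```

A test this small costs little during exploration, which is in line with the slide's warning against heavy engineering too early.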
  • 37. 2 Building software for science The point of view of the developer Libraries are what enables us to scale: Abstractions reduce cognitive load Code reuse gets us further G Varoquaux 17
  • 38. 2 Examples of such libraries scikit-learn Make research in machine-learning models and algorithms usable to people who do not understand them nilearn Make it easy to answer neuroimaging problems with them G Varoquaux 18
  • 39. 2 Examples of such libraries scikit-learn Make research in machine-learning models and algorithms usable to people who do not understand them Challenges: Variety of that space Statistical concepts coding concepts nilearn Make it easy to answer neuroimaging problems with them Challenges: Onboarding technology-averse users G Varoquaux 18
  • 40. 2 Tools that reduce cognitive overload It’s a design problem G Varoquaux 19
  • 41. 2 Tools that reduce cognitive overload Jonathan Ive, an industrial designer, is #4 at Apple Code different. G Varoquaux 20
  • 42. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 21
  • 43. 2 Some API design principles for the scipy stack Consistency, consistency, consistency np.save(file, obj) pickle.dump(obj, file) fmin(...maxiter=10) lsq_linear(...max_iter=10) Creates cognitive overload Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 22
  • 44. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes Objects have hidden states, Objects have no universal interface, entry point, output A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 23
  • 45. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts How much do usage patterns carry over across the library? Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 24
  • 46. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Facilitates working with multiple libraries together Easier to get up to speed with a given library Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 25
  • 47. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Change of behavior depending on input type Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 26
  • 48. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Interfaces define objects Incompatible behaviors lead to bugs (eg np.matrix) Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 27
  • 49. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Properties obfuscate the data model of the object Properties can create hidden compute costs Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 28
  • 50. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Objects are understood by their surface Composition creates cognitive overload Error messages matter Be Pythonic G Varoquaux 29
  • 51. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Explain the problem Print the offending value Be Pythonic G Varoquaux 30
  • 52. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic Avoid syntax hacks G Varoquaux 31
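The "Error messages matter" principle (explain the problem, print the offending value) can be sketched as follows. The validator `check_n_components` and its parameter names are hypothetical, chosen only for illustration; the pattern is what matters: the message states the constraint and echoes the value that violated it.

```python
def check_n_components(n_components, n_features):
    """Validate a hypothetical `n_components` parameter."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if not isinstance(n_components, int) or isinstance(n_components, bool):
        raise TypeError(
            "n_components must be an integer, got "
            f"{n_components!r} of type {type(n_components).__name__}"
        )
    if not 1 <= n_components <= n_features:
        raise ValueError(
            f"n_components must be between 1 and n_features={n_features}, "
            f"got {n_components}"
        )
    return n_components
```

The function also follows the "one and only one purpose" principle: it validates and returns its input unchanged, and does nothing else.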
  • 53. 2 Scikit-learn API Scikit-learn cheat sheet Scikit-learn Fit and predict >>> estimator = Estimator(param1=param1) >>> estimator.fit(X_train, y_train) >>> y_test = estimator.predict(X_test) Transform data >>> X_red = estimator.transform(X_test) G Varoquaux 32
  • 54. 2 Scikit-learn API Scikit-learn cheat sheet Scikit-learn Fit and predict >>> estimator = Estimator(param1=param1) >>> estimator.fit(X_train, y_train) >>> y_test = estimator.predict(X_test) Transform data >>> X_red = estimator.transform(X_test) The estimator is a “contract” (slightly more elaborate than above) It has created an ecosystem of packages Based on duck-typing, not inheritance G Varoquaux 32
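To make the "contract based on duck typing, not inheritance" point concrete, here is a minimal sketch in pure Python. `MeanRegressor` and `mean_absolute_error` are toy, hypothetical names written for this illustration (they are not scikit-learn objects, and the real contract is more elaborate, as the slide notes). The class inherits from nothing; the scoring helper only assumes that its argument exposes `fit` and `predict`.

```python
class MeanRegressor:
    """Toy estimator: predicts the mean of the training targets.

    It does not inherit from any base class; it only honors the
    fit/predict contract.
    """

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage  # the constructor only stores parameters

    def fit(self, X, y):
        if len(X) != len(y):
            raise ValueError(f"X and y differ in length: {len(X)} != {len(y)}")
        # Learned state gets a trailing underscore, as in scikit-learn.
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / len(y)
        return self  # returning self allows fit(...).predict(...)

    def predict(self, X):
        return [self.mean_ for _ in X]


def mean_absolute_error(estimator, X_train, y_train, X_test, y_test):
    """Works with any object exposing fit and predict (duck typing)."""
    y_pred = estimator.fit(X_train, y_train).predict(X_test)
    return sum(abs(p - t) for p, t in zip(y_pred, y_test)) / len(y_test)
```

Because the helper depends only on the two method names, any third-party object with the same interface could be passed in, which is how such a contract can let an ecosystem of packages grow around it.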
  • 55. 2 numpy arrays ndarray Abstraction over pointers & operation Contract: the memory layout IMHO, gone too far in number of methods (163) The array protocol makes it easy to quack like an array PS: The ecosystem needs categorical dtypes in numpy G Varoquaux 33
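The idea that the contract is the memory layout can be illustrated without numpy, using Python's built-in buffer protocol. This is a stand-in chosen to keep the sketch dependency-free, not numpy's own array protocol: a flat buffer of doubles is exposed as a 2 x 3 grid through `memoryview`, and both names describe the same memory.

```python
from array import array

# A flat, contiguous buffer of six C doubles.
flat = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# View the same memory as a 2 x 3 grid; no data is copied.
# cast() must go through a byte view before applying a new shape.
grid = memoryview(flat).cast("B").cast("d", shape=[2, 3])

# Writing through the flat buffer is visible through the 2D view,
# because shape and strides only describe how to walk the memory.
flat[5] = 50.0

print(grid.shape)    # number of rows and columns
print(grid.strides)  # bytes to skip to move one step along each axis
print(grid[1, 2])    # last element, read through the 2D view
```

Any consumer that understands shape, strides and item format can work on this buffer, which is the sense in which a shared layout acts as a contract between libraries.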
  • 56. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn G Varoquaux 34
  • 57. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn User flow on the scikit-learn website: Examples G Varoquaux 34
  • 58. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn User flow on the nilearn website: Examples G Varoquaux 34
  • 59. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery G Varoquaux 34
  • 60. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery Restructured text formatting Capturing outputs Links to function docs +Creates Jupyter notebooks G Varoquaux 34
  • 61. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery Insert links to examples containing a function G Varoquaux 34
  • 62. 2 Building great documentation Focus on explaining concepts (hint: write plain English) Less is more: prioritize, avoid redundancy Code examples must be short (link to full tutorial examples) Links everywhere: users will land at the wrong place Teach with the docs Plan for maintenance of docs: Continuous integration Check links Runs examples Doctests G Varoquaux 35
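The "Doctests" item in the maintenance plan above can be sketched with the standard-library `doctest` module: the short example in the docstring doubles as a test, so documentation that drifts from the code is caught automatically. The function `rescale` and the helper `run_doc_examples` are hypothetical names used only for this illustration.

```python
import doctest


def rescale(values, low=0.0, high=1.0):
    """Map values linearly onto the interval [low, high].

    >>> rescale([10, 20, 30])
    [0.0, 0.5, 1.0]
    >>> rescale([1, 3], low=-1.0, high=1.0)
    [-1.0, 1.0]
    """
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [low + (high - low) * (v - vmin) / span for v in values]


def run_doc_examples():
    """Run the docstring examples above; returns (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(rescale, name="rescale", globs={"rescale": rescale}):
        runner.run(test)
    return runner.failures, runner.tries
```

In a continuous-integration job, a non-zero failure count from such a run would typically be turned into a failed build.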
  • 63. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html G Varoquaux 36
  • 64. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html Resource intensive CI: Data ⇒ Fight for good open data Computation ⇒ Find good algorithms and tradeoffs Forces us to distill the literature (as a review) G Varoquaux 36
  • 65. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html Package development consolidates science and moves it outside the lab G Varoquaux 36
  • 66. 3 An ecosystem A bird’s eye view on scientific packages G Varoquaux 37
  • 67. 3 Packages of the Python ecosystem [Plot: number of PyPI downloads versus package rank, logarithmic axes] A small number of packages are used by many 1/f distribution, preferential attachment G Varoquaux 38
  • 68. 3 Packages of the Python ecosystem [Plot: number of PyPI downloads versus package rank, logarithmic axes; labeled packages: simplejson #1, six #2, setuptools #3, numpy #49, scikit-learn #110, joblib #431, nilearn #2877] A small number of packages are used by many 1/f distribution, preferential attachment nilearn relies on scikit-learn & joblib that rely on numpy... G Varoquaux 38
  • 69. 3 Standing on the shoulders of maintainers May 31st: pip broken https://github.com/pypa/setuptools/pull/1043 Left-pad: How left-padding strings broke the Internet A JavaScript package for left padding strings was removed from node’s package manager, breaking all the websites that depended on it. G Varoquaux 39
  • 70. 3 Dependencies Beyond installation, a challenge is to ensure package versions play well together: correctness of the code Breakage of backward compatibility yields irreconcilable dependencies G Varoquaux 40
  • 71. 3 Dependencies and their upgrade It’s a fact: users hate upgrading If it ain’t broken, don’t fix it even if it is, apparently G Varoquaux 41
  • 72. 3 Declaring undependence? Monolithic packages with no dependencies... But: Scaling is hard Complexity grows as square of codebase size [Woodfield 1979] User support grows with userbase size G Varoquaux 42
  • 73. 3 Core software is infrastructure Everybody uses it everyday In industry, education, & research G Varoquaux 43
  • 74. 3 Core software is infrastructure Everybody uses it everyday In industry, education, & research It needs maintenance Like roads (or OpenSSL, to prevent Heartbleed) Central infrastructure packages are “boring” They are understaffed and underfunded References: “Roads and Bridges” Ford Foundation report Excellent talk by Heather Miller https://www.youtube.com/watch?v=17yy5BwIiTw G Varoquaux 43
  • 75. @GaelVaroquaux Coding for science and innovation New science High value of bringing new methods to a field ⇒ Enable domain-specialists Rapid iteration, but with automation & consolidation Software tools Scientists are limited by cognitive load ⇒ Design of API and documentation in libraries Libraries make science reproducible and reusable An ecosystem Central packages hold the ecosystem together Thanks to: the scipy community