SlideShare a Scribd company logo
Open-Source Software’s
Responsibility to Science
Joel Nothman
20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 2
Me in open source
Mostly contributed to popular Scientific Python libraries:
scikit-learn, nltk, scipy.sparse, pandas, ipython, numpydoc
Also information extraction evaluation (neleval), etc.
Community service
“Volunteer software development”
With thanks to our financial sponsors
Caretakers aren’t always founders
Founders aren’t always caretakers
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 3
Overheard at ICML
Don’t worry about how tricky it is to implement . . .
Someone will put it in Scikit-learn and you can just use it.
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 4
Thoughts on an arrogant ML researcher
Scientists think software maintenance is no big deal
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 4
Thoughts on an arrogant ML researcher
Scientists think software maintenance is no big deal
Science and engineering rely heavily on open-source infrastructure
Popular tools become de-facto standards
Most users are uncomfortable building their own tools
Many will only use what’s provided in a popular library
Many will not inspect how it works on the inside
Volunteer maintainers act as gatekeepers
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 5
The power of the gatekeeper
decides which algorithms are available
decides how to ensure correctness and stability
decides how to name or describe the algorithm
decides whether to be faithful to a published description
decides on an API that may facilitate good science/engineering
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 5
The power of the gatekeeper
decides which algorithms are available
decides how to ensure correctness and stability
decides how to name or describe the algorithm
decides whether to be faithful to a published description
decides on an API that may facilitate good science/engineering
OSS maintainers can enable or inhibit scientific best practices
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 6
But you can’t blame the gatekeeper
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 7
This presentation is a series of examples
Risks to good science and engineering related to software design
Some things we do to help science
Some things we have changed to help science
Some things we have yet to solve
There’s not a great deal of NLP in here
but there’s a lot of ML and software engineering in NLP
(so I hope it is relevant, interesting and accessible)
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 8
Scikit-learn preliminaries
An ecosystem of estimators
Fit an estimator on some data, so that it can:
describe the training data
transform unseen data
predict a target for unseen data
Data is usually a numeric matrix X (samples × features)
May provide a target vector or matrix y at training time
real valued for regression
categories for multiclass classification
multiple columns of binary targets for multilabel classification
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 9
Methods and results and are indecipherable if researchers publish an
inappropriate, or underspecified, algorithm name
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 10
A simple example of a bad name
sklearn.covariance.GraphLasso for sparse inverse covariance estimation
but Graph Lasso is sparse regression where the features lie on a graph
the paper for covariance estimation named it Graphical Lasso
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 10
A simple example of a bad name
sklearn.covariance.GraphLasso for sparse inverse covariance estimation
but Graph Lasso is sparse regression where the features lie on a graph
the paper for covariance estimation named it Graphical Lasso
Solution deprecate GraphLasso and rename it GraphicalLasso
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 11
Tripping over hidden parameters
precision_score([‘a’,‘a’,‘b’,‘b’,‘c’], [‘a’,‘b’,‘b’,‘c’,‘c’])
For multi-class or multi-label, how should you average across classes?
Pa = 1
1 , Pb = 1
2 , Pc = 1
2
average=’micro’ ⇒ 3
5
average=’macro’ ⇒ (1 + 1
2 + 1
2 )/3 = 2
3
average=’weighted’ ⇒ (2 × 1 + 2 × 1
2 + 1 × 1
2 )/5 = 7
10
for a long time, prevalence-weighted macro average was the default
∴ papers say “We achieved a precision of . . . ”
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 11
Tripping over hidden parameters
precision_score([‘a’,‘a’,‘b’,‘b’,‘c’], [‘a’,‘b’,‘b’,‘c’,‘c’])
For multi-class or multi-label, how should you average across classes?
Pa = 1
1 , Pb = 1
2 , Pc = 1
2
average=’micro’ ⇒ 3
5
average=’macro’ ⇒ (1 + 1
2 + 1
2 )/3 = 2
3
average=’weighted’ ⇒ (2 × 1 + 2 × 1
2 + 1 × 1
2 )/5 = 7
10
for a long time, prevalence-weighted macro average was the default
∴ papers say “We achieved a precision of . . . ”
Solution precision_score raises an error if the data is not binary,
unless the user specifies average.
Use the API to force literacy & awareness
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 12
What’s in a name?
What makes an implementation of some named algorithm correct?
Faithful to a published research paper?
Faithful to a reference implementation?
Faithful to some community of practice?
Consistent with other components of our software library?
Consistent with previous versions of the library?
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 13
Experimenters report sub-optimal results
because they assume our implementation is nicely behaved
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 14
The fit you thought was finished
Many optimisations are iterative
have criteria to test if it has converged on an optimum
Predictions and inferences may be poor if parameters did not converge
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 14
The fit you thought was finished
Many optimisations are iterative
have criteria to test if it has converged on an optimum
Predictions and inferences may be poor if parameters did not converge
Solution Warn if we did not detect convergence
but if we have too many warnings, users ignore them. . .
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 15
The words you didn’t mean to stop
CountVectorizer turns text into a term-document matrix
can choose stop words: None, ‘english’ or BYO
‘english’ will remove system (and used to remove computer)
‘english’ will remove five, six, eight but not seven
‘english’ will remove we have but treat we’ve as ve
This is documented nowhere.
See my NLP-OSS paper with Hanmin Qin and Roman Yurchak
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 15
The words you didn’t mean to stop
CountVectorizer turns text into a term-document matrix
can choose stop words: None, ‘english’ or BYO
‘english’ will remove system (and used to remove computer)
‘english’ will remove five, six, eight but not seven
‘english’ will remove we have but treat we’ve as ve
This is documented nowhere.
See my NLP-OSS paper with Hanmin Qin and Roman Yurchak
Solution Deprecate ‘english’
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 15
The words you didn’t mean to stop
CountVectorizer turns text into a term-document matrix
can choose stop words: None, ‘english’ or BYO
‘english’ will remove system (and used to remove computer)
‘english’ will remove five, six, eight but not seven
‘english’ will remove we have but treat we’ve as ve
This is documented nowhere.
See my NLP-OSS paper with Hanmin Qin and Roman Yurchak
Solution Deprecate ‘english’ and add another perfect stop list . . .
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 16
The intercept you didn’t mean to regularise
In Logistic Regression, we learn a weight vector β
and a bias term β0 which corresponds to a feature x0 of all-1s
Regularisation: minimise i β2
i to ensure small weights as well as small loss
liblinear regularises β0. You probably never want to do this.
All our other linear estimators do not regularise the intercept.
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 16
The intercept you didn’t mean to regularise
In Logistic Regression, we learn a weight vector β
and a bias term β0 which corresponds to a feature x0 of all-1s
Regularisation: minimise i β2
i to ensure small weights as well as small loss
liblinear regularises β0. You probably never want to do this.
All our other linear estimators do not regularise the intercept.
Sol’n 1 intercept_scaling: also need to optimise x0’s fill value
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 16
The intercept you didn’t mean to regularise
In Logistic Regression, we learn a weight vector β
and a bias term β0 which corresponds to a feature x0 of all-1s
Regularisation: minimise i β2
i to ensure small weights as well as small loss
liblinear regularises β0. You probably never want to do this.
All our other linear estimators do not regularise the intercept.
Sol’n 1 intercept_scaling: also need to optimise x0’s fill value
. . . but most users don’t see/do this
Sol’n 2 Implement alternative optimisers, and deprecate liblinear as default
LogisticRegression solver
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 17
Analysis of code on GitHub shows that people use default
parameters when they shouldn’t
Andreas M¨uller
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 18
Most users are lazy
Users don’t explore alternatives
alternative parameters values
alternative software libraries
My students tend to use a CountVectorizer even
when they’re counting non-words (e.g. synsets)
We try to provide sensible default parameters
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 19
Sensible default fail
Ten-tree forests
Three-fold cross validation
?? A tokeniser that splits on word-internal punctuation
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 20
What makes a default value good?
Good defaults should give good predictive models and reliable statistics
? Good defaults should behave how users expect
but different communities of practice
Good defaults should be invariant to:
sample size (for stability in cross validation)
number of features (for stability in model selection)
?feature scaling (for stability in different tasks/datasets)
Example: finding a good default γ for an RBF kernel (#779, #10331)
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 21
Good parametrisation
We choose the defaults, but also how parameters are expressed
(and whether they can be changed at all)
Should number of nearest neighbors be specified as:
an absolute value (e.g. 10)?
a proportion of training samples (e.g. 2%)?
an arbitrary function of training data shape?
Algorithms and optimisation research often don’t report on this
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 22
Scientific software should make it easy for users to do good science.
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 23
Have you ever tried to de-tokenise the Penn TreeBank?
PTB is delivered with each token POS tagged or bracketed
We wanted to know how paragraph structure, etc. informed parsing
The source text before tokenisation is available
An easy case:
[ Imports/NNS ]
were/VBD at/IN
[ $/$ 50.38/CD billion/CD ]
,/, up/RB
[ 19/CD %/NN ]
./.
Imports were at $50.38
billion, up 19%.
Imports 1–7
were 9–12
at 14–15
$ 16–16
50.38 17–21
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 24
Have you ever tried to de-tokenise the Penn TreeBank?
PTB is delivered with each token POS tagged or bracketed
We wanted to know how paragraph structure, etc. informed parsing
The source text before tokenisation is available
⇒ alignment hell due to typos corrected/inserted, reorderings, missed text, etc.
Hindsight: delivering annotations on tokenised text is a bad idea
It restricts what you can do with it later
NLP software should always provide stand-off markup
Or store the whitespace/non-token data as spaCy does
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 25
Avoiding leakage in cross validation
Bad X_preprocessed = preprocessor.fit_transform(X)
result = cross_validate(classifier, X_preprocessed, y)
Test data statistics leak into preprocessing
⇒ inflated cross validation results
Good pipeline = make_pipeline(preprocessor, classifier)
result = cross_validate(pipeline, X, y)
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 25
Avoiding leakage in cross validation
Bad X_preprocessed = preprocessor.fit_transform(X)
result = cross_validate(classifier, X_preprocessed, y)
Test data statistics leak into preprocessing
⇒ inflated cross validation results
Good pipeline = make_pipeline(preprocessor, classifier)
result = cross_validate(pipeline, X, y)
Solution De-emphasise fit_transform.
And make sure Pipeline works with everything;
and make sure cross_validate works with everything.
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 26
Maintainers of large projects
can’t be experts in all the things they maintain
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
Gatekeepers Names Surprises Sensibility Misdirection Collaborate 27
Scientists can (and do) help us:
make sure the implementation matches the name
make users aware of or avoid unexpected behaviour
parametrise algorithms and set defaults helpfully
understand how our design choices lead to flawed experiments
Users trust popular OSS.
Thank you for helping us make OSS trustworthy.
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018

More Related Content

Similar to Open-Source Software's Responsibility to Science

Data excellence: Better data for better AI
Data excellence: Better data for better AIData excellence: Better data for better AI
Data excellence: Better data for better AI
Lora Aroyo
 
The Role of Normware in Trustworthy and Explainable AI
The Role of Normware in Trustworthy and Explainable AIThe Role of Normware in Trustworthy and Explainable AI
The Role of Normware in Trustworthy and Explainable AI
Giovanni Sileno
 
Open IoT Made Easy - Introduction to OGC SensorThings API
Open IoT Made Easy - Introduction to OGC SensorThings APIOpen IoT Made Easy - Introduction to OGC SensorThings API
Open IoT Made Easy - Introduction to OGC SensorThings API
SensorUp
 
How to think smarter about software development
How to think smarter about software developmentHow to think smarter about software development
How to think smarter about software development
Nilanjan Bhattacharya
 
An Ontology Engineering Approach to Gamify Collaborative Learning Scenarios
An Ontology Engineering Approach to Gamify Collaborative Learning ScenariosAn Ontology Engineering Approach to Gamify Collaborative Learning Scenarios
An Ontology Engineering Approach to Gamify Collaborative Learning Scenarios
Geiser Chalco
 
Artificial Intelligence power point presentation document
Artificial Intelligence power point presentation documentArtificial Intelligence power point presentation document
Artificial Intelligence power point presentation document
David Raj Kanthi
 
Trusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open SourceTrusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open Source
Animesh Singh
 
Splunk for ITOps
Splunk for ITOpsSplunk for ITOps
Splunk for ITOps
Splunk
 
Solution challenge 2024 GDSC-GHRCE AIML.pptx
Solution challenge 2024 GDSC-GHRCE AIML.pptxSolution challenge 2024 GDSC-GHRCE AIML.pptx
Solution challenge 2024 GDSC-GHRCE AIML.pptx
GoogleDeveloperStude22
 
How Four Statistical Rules Forecast Who Wins a Competitive Bid
How Four Statistical Rules Forecast Who Wins a Competitive BidHow Four Statistical Rules Forecast Who Wins a Competitive Bid
How Four Statistical Rules Forecast Who Wins a Competitive Bid
IntelCollab.com
 
Mobile devices, games and education: exploring an untapped potential through ...
Mobile devices, games and education: exploring an untapped potential through ...Mobile devices, games and education: exploring an untapped potential through ...
Mobile devices, games and education: exploring an untapped potential through ...
Ruth S. Contreras Espinosa
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
Marta Fajlhauer
 
2015 data-science-salary-survey
2015 data-science-salary-survey2015 data-science-salary-survey
2015 data-science-salary-survey
Adam Rabinovitch
 
Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
mark madsen
 
UX STRAT Online 2020: Dr. Martin Tingley, Netflix
UX STRAT Online 2020: Dr. Martin Tingley, NetflixUX STRAT Online 2020: Dr. Martin Tingley, Netflix
UX STRAT Online 2020: Dr. Martin Tingley, Netflix
UX STRAT
 
What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?
SofiaCarter4
 
The State of AI in Insights and Research 2024: Results and Findings
The State of AI in Insights and Research 2024: Results and FindingsThe State of AI in Insights and Research 2024: Results and Findings
The State of AI in Insights and Research 2024: Results and Findings
Ray Poynter
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
TrainerAnalogicx
 
Python webinar 4th june
Python webinar 4th junePython webinar 4th june
Python webinar 4th june
Edureka!
 
6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your Interviewer6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your Interviewer
PrachiVarshney7
 

Similar to Open-Source Software's Responsibility to Science (20)

Data excellence: Better data for better AI
Data excellence: Better data for better AIData excellence: Better data for better AI
Data excellence: Better data for better AI
 
The Role of Normware in Trustworthy and Explainable AI
The Role of Normware in Trustworthy and Explainable AIThe Role of Normware in Trustworthy and Explainable AI
The Role of Normware in Trustworthy and Explainable AI
 
Open IoT Made Easy - Introduction to OGC SensorThings API
Open IoT Made Easy - Introduction to OGC SensorThings APIOpen IoT Made Easy - Introduction to OGC SensorThings API
Open IoT Made Easy - Introduction to OGC SensorThings API
 
How to think smarter about software development
How to think smarter about software developmentHow to think smarter about software development
How to think smarter about software development
 
An Ontology Engineering Approach to Gamify Collaborative Learning Scenarios
An Ontology Engineering Approach to Gamify Collaborative Learning ScenariosAn Ontology Engineering Approach to Gamify Collaborative Learning Scenarios
An Ontology Engineering Approach to Gamify Collaborative Learning Scenarios
 
Artificial Intelligence power point presentation document
Artificial Intelligence power point presentation documentArtificial Intelligence power point presentation document
Artificial Intelligence power point presentation document
 
Trusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open SourceTrusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open Source
 
Splunk for ITOps
Splunk for ITOpsSplunk for ITOps
Splunk for ITOps
 
Solution challenge 2024 GDSC-GHRCE AIML.pptx
Solution challenge 2024 GDSC-GHRCE AIML.pptxSolution challenge 2024 GDSC-GHRCE AIML.pptx
Solution challenge 2024 GDSC-GHRCE AIML.pptx
 
How Four Statistical Rules Forecast Who Wins a Competitive Bid
How Four Statistical Rules Forecast Who Wins a Competitive BidHow Four Statistical Rules Forecast Who Wins a Competitive Bid
How Four Statistical Rules Forecast Who Wins a Competitive Bid
 
Mobile devices, games and education: exploring an untapped potential through ...
Mobile devices, games and education: exploring an untapped potential through ...Mobile devices, games and education: exploring an untapped potential through ...
Mobile devices, games and education: exploring an untapped potential through ...
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
2015 data-science-salary-survey
2015 data-science-salary-survey2015 data-science-salary-survey
2015 data-science-salary-survey
 
Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
 
UX STRAT Online 2020: Dr. Martin Tingley, Netflix
UX STRAT Online 2020: Dr. Martin Tingley, NetflixUX STRAT Online 2020: Dr. Martin Tingley, Netflix
UX STRAT Online 2020: Dr. Martin Tingley, Netflix
 
What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?
 
The State of AI in Insights and Research 2024: Results and Findings
The State of AI in Insights and Research 2024: Results and FindingsThe State of AI in Insights and Research 2024: Results and Findings
The State of AI in Insights and Research 2024: Results and Findings
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
 
Python webinar 4th june
Python webinar 4th junePython webinar 4th june
Python webinar 4th june
 
6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your Interviewer6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your Interviewer
 

Recently uploaded

LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
SOCRadar
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
Gerardo Pardo-Castellote
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 

Recently uploaded (20)

LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 

Open-Source Software's Responsibility to Science

  • 1. Open-Source Software’s Responsibility to Science Joel Nothman 20th July 2018
  • 2. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 2 Me in open source Mostly contributed to popular Scientific Python libraries: scikit-learn, nltk, scipy.sparse, pandas, ipython, numpydoc Also information extraction evaluation (neleval), etc. Community service “Volunteer software development” With thanks to our financial sponsors Caretakers aren’t always founders Founders aren’t always caretakers Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 3. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 3 Overheard at ICML Don’t worry about how tricky it is to implement . . . Someone will put it in Scikit-learn and you can just use it. Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 4. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 4 Thoughts on an arrogant ML researcher Scientists think software maintenance is no big deal Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 5. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 4 Thoughts on an arrogant ML researcher Scientists think software maintenance is no big deal Science and engineering rely heavily on open-source infrastructure Popular tools become de-facto standards Most users are uncomfortable building their own tools Many will only use what’s provided in a popular library Many will not inspect how it works on the inside Volunteer maintainers act as gatekeepers Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 6. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 5 The power of the gatekeeper decides which algorithms are available decides how to ensure correctness and stability decides how to name or describe the algorithm decides whether to be faithful to a published description decides on an API that may facilitate good science/engineering Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 7. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 5 The power of the gatekeeper decides which algorithms are available decides how to ensure correctness and stability decides how to name or describe the algorithm decides whether to be faithful to a published description decides on an API that may facilitate good science/engineering OSS maintainers can enable or inhibit scientific best practices Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 8. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 6 But you can’t blame the gatekeeper THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 9. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 7 This presentation is a series of examples Risks to good science and engineering related to software design Some things we do to help science Some things we have changed to help science Some things we have yet to solve There’s not a great deal of NLP in here but there’s a lot of ML and software engineering in NLP (so I hope it is relevant, interesting and accessible) Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 10. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 8 Scikit-learn preliminaries An ecosystem of estimators Fit an estimator on some data, so that it can: describe the training data transform unseen data predict a target for unseen data Data is usually a numeric matrix X (samples × features) May provide a target vector or matrix y at training time real valued for regression categories for multiclass classification multiple columns of binary targets for multilabel classification Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 11. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 9 Methods and results and are indecipherable if researchers publish an inappropriate, or underspecified, algorithm name Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 12. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 10 A simple example of a bad name sklearn.covariance.GraphLasso for sparse inverse covariance estimation but Graph Lasso is sparse regression where the features lie on a graph the paper for covariance estimation named it Graphical Lasso Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 13. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 10 A simple example of a bad name sklearn.covariance.GraphLasso for sparse inverse covariance estimation but Graph Lasso is sparse regression where the features lie on a graph the paper for covariance estimation named it Graphical Lasso Solution deprecate GraphLasso and rename it GraphicalLasso Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 14. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 11 Tripping over hidden parameters precision_score([‘a’,‘a’,‘b’,‘b’,‘c’], [‘a’,‘b’,‘b’,‘c’,‘c’]) For multi-class or multi-label, how should you average across classes? Pa = 1 1 , Pb = 1 2 , Pc = 1 2 average=’micro’ ⇒ 3 5 average=’macro’ ⇒ (1 + 1 2 + 1 2 )/3 = 2 3 average=’weighted’ ⇒ (2 × 1 + 2 × 1 2 + 1 × 1 2 )/5 = 7 10 for a long time, prevalence-weighted macro average was the default ∴ papers say “We achieved a precision of . . . ” Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 15. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 11 Tripping over hidden parameters precision_score([‘a’,‘a’,‘b’,‘b’,‘c’], [‘a’,‘b’,‘b’,‘c’,‘c’]) For multi-class or multi-label, how should you average across classes? Pa = 1 1 , Pb = 1 2 , Pc = 1 2 average=’micro’ ⇒ 3 5 average=’macro’ ⇒ (1 + 1 2 + 1 2 )/3 = 2 3 average=’weighted’ ⇒ (2 × 1 + 2 × 1 2 + 1 × 1 2 )/5 = 7 10 for a long time, prevalence-weighted macro average was the default ∴ papers say “We achieved a precision of . . . ” Solution precision_score raises an error if the data is not binary, unless the user specifies average. Use the API to force literacy & awareness Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 16. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 12 What’s in a name? What makes an implementation of some named algorithm correct? Faithful to a published research paper? Faithful to a reference implementation? Faithful to some community of practice? Consistent with other components of our software library? Consistent with previous versions of the library? Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 17. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 13 Experimenters report sub-optimal results because they assume our implementation is nicely behaved Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 18. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 14 The fit you thought was finished Many optimisations are iterative have criteria to test if it has converged on an optimum Predictions and inferences may be poor if parameters did not converge Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 19. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 14 The fit you thought was finished Many optimisations are iterative have criteria to test if it has converged on an optimum Predictions and inferences may be poor if parameters did not converge Solution Warn if we did not detect convergence but if we have too many warnings, users ignore them. . . Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 20. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 15 The words you didn’t mean to stop CountVectorizer turns text into a term-document matrix can choose stop words: None, ‘english’ or BYO ‘english’ will remove system (and used to remove computer) ‘english’ will remove five, six, eight but not seven ‘english’ will remove we have but treat we’ve as ve This is documented nowhere. See my NLP-OSS paper with Hanmin Qin and Roman Yurchak Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 21. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 15 The words you didn’t mean to stop CountVectorizer turns text into a term-document matrix can choose stop words: None, ‘english’ or BYO ‘english’ will remove system (and used to remove computer) ‘english’ will remove five, six, eight but not seven ‘english’ will remove we have but treat we’ve as ve This is documented nowhere. See my NLP-OSS paper with Hanmin Qin and Roman Yurchak Solution Deprecate ‘english’ Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 22. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 15 The words you didn’t mean to stop CountVectorizer turns text into a term-document matrix can choose stop words: None, ‘english’ or BYO ‘english’ will remove system (and used to remove computer) ‘english’ will remove five, six, eight but not seven ‘english’ will remove we have but treat we’ve as ve This is documented nowhere. See my NLP-OSS paper with Hanmin Qin and Roman Yurchak Solution Deprecate ‘english’ and add another perfect stop list . . . Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 23. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 16 The intercept you didn’t mean to regularise In Logistic Regression, we learn a weight vector β and a bias term β0 which corresponds to a feature x0 of all-1s Regularisation: minimise i β2 i to ensure small weights as well as small loss liblinear regularises β0. You probably never want to do this. All our other linear estimators do not regularise the intercept. Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 24. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 16 The intercept you didn’t mean to regularise In Logistic Regression, we learn a weight vector β and a bias term β0 which corresponds to a feature x0 of all-1s Regularisation: minimise i β2 i to ensure small weights as well as small loss liblinear regularises β0. You probably never want to do this. All our other linear estimators do not regularise the intercept. Sol’n 1 intercept_scaling: also need to optimise x0’s fill value Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 25. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 16 The intercept you didn’t mean to regularise In Logistic Regression, we learn a weight vector β and a bias term β0 which corresponds to a feature x0 of all-1s Regularisation: minimise i β2 i to ensure small weights as well as small loss liblinear regularises β0. You probably never want to do this. All our other linear estimators do not regularise the intercept. Sol’n 1 intercept_scaling: also need to optimise x0’s fill value . . . but most users don’t see/do this Sol’n 2 Implement alternative optimisers, and deprecate liblinear as default LogisticRegression solver Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 26. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 17 Analysis of code on GitHub shows that people use default parameters when they shouldn’t Andreas M¨uller Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 27. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 18 Most users are lazy Users don’t explore alternatives alternative parameters values alternative software libraries My students tend to use a CountVectorizer even when they’re counting non-words (e.g. synsets) We try to provide sensible default parameters Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 28. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 19 Sensible default fail Ten-tree forests Three-fold cross validation ?? A tokeniser that splits on word-internal punctuation Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 29. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 20 What makes a default value good? Good defaults should give good predictive models and reliable statistics ? Good defaults should behave how users expect but different communities of practice Good defaults should be invariant to: sample size (for stability in cross validation) number of features (for stability in model selection) ?feature scaling (for stability in different tasks/datasets) Example: finding a good default γ for an RBF kernel (#779, #10331) Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 30. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 21 Good parametrisation We choose the defaults, but also how parameters are expressed (and whether they can be changed at all) Should number of nearest neighbors be specified as: an absolute value (e.g. 10)? a proportion of training samples (e.g. 2%)? an arbitrary function of training data shape? Algorithms and optimisation research often don’t report on this Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 31. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 22 Scientific software should make it easy for users to do good science. Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 32. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 23 Have you ever tried to de-tokenise the Penn TreeBank? PTB is delivered with each token POS tagged or bracketed We wanted to know how paragraph structure, etc. informed parsing The source text before tokenisation is available An easy case: [ Imports/NNS ] were/VBD at/IN [ $/$ 50.38/CD billion/CD ] ,/, up/RB [ 19/CD %/NN ] ./. Imports were at $50.38 billion, up 19%. Imports 1–7 were 9–12 at 14–15 $ 16–16 50.38 17–21 Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 33. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 24 Have you ever tried to de-tokenise the Penn TreeBank? PTB is delivered with each token POS tagged or bracketed We wanted to know how paragraph structure, etc. informed parsing The source text before tokenisation is available ⇒ alignment hell due to typos corrected/inserted, reorderings, missed text, etc. Hindsight: delivering annotations on tokenised text is a bad idea It restricts what you can do with it later NLP software should always provide stand-off markup Or store the whitespace/non-token data as spaCy does Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 34. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 25 Avoiding leakage in cross validation Bad X_preprocessed = preprocessor.fit_transform(X) result = cross_validate(classifier, X_preprocessed, y) Test data statistics leak into preprocessing ⇒ inflated cross validation results Good pipeline = make_pipeline(preprocessor, classifier) result = cross_validate(pipeline, X, y) Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 35. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 25 Avoiding leakage in cross validation Bad X_preprocessed = preprocessor.fit_transform(X) result = cross_validate(classifier, X_preprocessed, y) Test data statistics leak into preprocessing ⇒ inflated cross validation results Good pipeline = make_pipeline(preprocessor, classifier) result = cross_validate(pipeline, X, y) Solution De-emphasise fit_transform. And make sure Pipeline works with everything; and make sure cross_validate works with everything. Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 36. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 26 Maintainers of large projects can’t be experts in all the things they maintain Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
  • 37. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 27 Scientists can (and do) help us: make sure the implementation matches the name make users aware of or avoid unexpected behaviour parametrise algorithms and set defaults helpfully understand how our design choices lead to flawed experiments Users trust popular OSS. Thank you for helping us make OSS trustworthy. Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018