University of Southampton. Electronics & Computer Science. Research Seminar (Invited Talk).
TITLE: From Research Objects to Reproducible Science Tales
ABSTRACT. Rumor has it that there is a reproducibility crisis in science. Or maybe there are multiple crises? What do we mean by reproducibility and replicability anyways? In this talk I will first make an attempt at sorting out some of the terminological confusion in this area, focusing on computational aspects. The PRIMAD model is another attempt to describe different aspects of reproducibility studies by focusing on the "delta" between those studies and the original study. In addition to these more theoretical investigations, I will discuss practical efforts to create more reproducible and more transparent computational platforms such as the one developed by the Whole-Tale project: here 'tales' are executable research objects that may combine data, code, runtime environments, and narratives (i.e., the traditional "science story"). I will conclude with some thoughts about the remaining challenges and opportunities to bridge the large conceptual gaps that continue to exist despite the recognition of problems of reproducibility and transparency in science.
ABOUT the Speaker. Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the Department of Computer Science at Illinois. Until 2014 he was a professor at the Department of Computer Science at the University of California, Davis. His research interests range from practical questions in scientific data and workflow management, to database theory and knowledge representation and reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the University of Karlsruhe (now K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, in Germany.
1. From Research Objects to
Reproducible Science Tales
Bertram Ludäscher
ludaesch@illinois.edu
Director, Center for Informatics Research in Science & Scholarship (CIRSS)
School of Information Sciences (iSchool@Illinois)
& National Center for Supercomputing Applications (NCSA)
& Department of Computer Science (CS@Illinois)
Southampton
UK
2019-11-12
2. Outline
• Crisis Time? Manifesto Time!
• Terminology
• Research Objects - A Long March
• ROs & Reproducibility: Cui Bono?
• A call to action: Transparency Action ..
From ROs to Reproducible Science 2
16. Tool Envy Syndrome (TES)
Maybe we should be working on
conceptual foundations and ask:
I can haz some
Reproducibility Platform?
17. How Perl Saved the Human Genome Project
• The Perl Journal, http://www.tpj.com
• By Lincoln Stein
• DATE: Early February, 1996
• LOCATION: Cambridge, England, in the conference room of the largest DNA
sequencing center in Europe.
• OCCASION: A high level meeting between the computer scientists of this center
and the largest DNA sequencing center in the United States.
• THE PROBLEM: Although the two centers use almost identical laboratory
techniques, almost identical databases, and almost identical data analysis tools,
they still can't interchange data or meaningfully compare results.
• THE SOLUTION: Perl
• … Most groups, however, learned to build modular, loosely-coupled systems whose parts could be
swapped in and out without retooling the whole system:
• First there's a basic quality check on the sequence: is it long enough? Are the number of ambiguous
letters below the maximum limit? Then the vector check ensures that only human DNA gets into the
database. Next there's a check for repetitive sequences … penultimate step is to attempt to match
the new sequence against other sequences in a large community database of DNA sequences ….
After performing all these checks, the sequence along with the information that's been gathered
about it [provenance!] along the way is loaded into the local laboratory database.
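The modular, loosely-coupled pipeline quoted above can be sketched in Python as a list of swappable check functions that accumulate provenance as they run. This is only an illustration: the function names, thresholds, vector fragment, and toy community database are all assumptions and are not taken from the article.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical thresholds and reference data (illustration only).
MIN_LENGTH = 20
MAX_AMBIGUOUS = 2
VECTOR_FRAGMENT = "GGATCC"
COMMUNITY_DB = {"ACGTACGTACGTACGTACGTACGT": "known-clone-1"}

CheckResult = Tuple[bool, str]
Check = Callable[[str], CheckResult]


def quality_check(seq: str) -> CheckResult:
    # Long enough, and not too many ambiguous letters ("N")?
    ambiguous = seq.count("N")
    passed = len(seq) >= MIN_LENGTH and ambiguous <= MAX_AMBIGUOUS
    return passed, f"length={len(seq)}, ambiguous={ambiguous}"


def vector_check(seq: str) -> CheckResult:
    # Reject sequences that contain the (assumed) vector fragment.
    passed = VECTOR_FRAGMENT not in seq
    return passed, ("no vector fragment" if passed else "vector fragment found")


def repeat_check(seq: str) -> CheckResult:
    # Crude stand-in for a repeat detector: one base must not dominate.
    dominant = max(seq.count(base) for base in "ACGT")
    passed = dominant <= 0.8 * len(seq)
    return passed, f"dominant_base_count={dominant}"


def community_match(seq: str) -> CheckResult:
    # Exact lookup in a toy community database; never rejects.
    return True, f"match={COMMUNITY_DB.get(seq)}"


# Steps can be swapped in and out without touching run_pipeline.
PIPELINE: List[Tuple[str, Check]] = [
    ("quality", quality_check),
    ("vector", vector_check),
    ("repeats", repeat_check),
    ("community", community_match),
]


def run_pipeline(seq: str, pipeline=PIPELINE) -> Tuple[bool, List[Dict]]:
    """Run the checks in order; return (accepted, provenance records)."""
    provenance: List[Dict] = []
    for name, check in pipeline:
        passed, note = check(seq)
        provenance.append({"step": name, "passed": passed, "note": note})
        if not passed:
            return False, provenance
    return True, provenance
```

The provenance list is what would be loaded into the local database together with the sequence; replacing one entry of `PIPELINE` changes a single step without retooling the rest.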
18. Tool Envy Syndrome (TES)
Maybe we should be working on
conceptual foundations and ask:
I can haz some
reduce-sillity-platforms?
19. A Reproducibility Platform …
errors galore
• Reproducibility platform
• … Reproducility platform
• … Reproduce-sillity platform
• → Debuggers!
• … Reduce-sillity platform
• → The Vision
• Helmut Schmidt: “Wer Visionen hat soll zum Arzt
gehen!” (If you have visions, go see a doctor!)
20. Enter the tool makers
• … NSF SKOPE: system and tools to discover,
access, analyze, visualize paleoenvironmental
data
– unprecedented ability to explore provenance
(detailed, comprehensible record of computational
derivation of results)
– for researchers, tinkerers, and modelers
• … NSF Whole Tale:
– leverage & contribute to existing CI to support the
whole tale (“living paper”), from workflow run to
scholarly publication
– integrate tools & CI (DataONE, Globus, iRODS,
NDS, ...) to simplify use and promote best
practices.
– driven by science WGs (Archaeology/SKOPE,
materials science, astro, biodiversity informatics ..)
21. Whole Tale: The next step in the evolution of the
scholarly article: The “Living [Frozen?] Paper”
• 1st Generation:
– narrative (prose)
• 2nd Generation: plus …
– name .. identify .. include (access to) data
• 3rd Generation: plus …
– name .. reference .. include code (software) ..
– and provenance … and exec environment (containers)
Ludäscher: Why-Not Provenance 21
Whole Tale
Whole Tale Dashboard
26. ROs ... Whole Tale
… are we done here?
27. The return of the R* brouhaha
28. In a nutshell
• Computational reproducibility
=/=>
• Scientific reproducibility
• Transparency to the rescue!
• What’s the goal again?
• And what information gain is implied by
– a successful reproducibility study (alright … )
– a failed reproducibility study (K Popper says ‘Hi’!)
– a non-conclusive reproducibility study
29. Reproducibility Crisis (reprised)
• Successful reproducibility study:
• increases trust in prior study :-)
• … but no surprises :-(
• Failed reproducibility study:
• decreases trust in (or falsifies) prior study :-(
• … but surprising failure yields new info/knowledge :-)
• Learning from failures!
– Not really a new, revolutionary idea..
– What is a positive vs negative result anyways?
– ... fail early, fail often ...
On Provenance 29
30. PRIMAD (what have you “primed”?)
Dagstuhl Seminar #16041 Report
Outputs = Exec(M,I,P,D) | RO, A
- M (Method) = parsimony/bootstrap/..
- I (Implementation) = package XYZ
- P (Platform) = MacOS ..
- D (Data) = (Params, Files)
- RO = Research Objective, A = Actor
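One way to make the PRIMAD "delta" concrete is to record each study as six components and report which of them were changed ("primed") in the reproduction. The sketch below assumes the usual PRIMAD reading (Platform, Research objective, Implementation, Method, Actor, Data); the class name, function name, and example values are hypothetical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Study:
    # Field order follows the acronym P-R-I-M-A-D.
    platform: str
    research_objective: str
    implementation: str
    method: str
    actor: str
    data: str


LETTER = {
    "platform": "P",
    "research_objective": "R",
    "implementation": "I",
    "method": "M",
    "actor": "A",
    "data": "D",
}


def primed(original: Study, reproduction: Study) -> str:
    """Return the PRIMAD letters of the components that differ."""
    return "".join(
        LETTER[f.name]
        for f in fields(Study)
        if getattr(original, f.name) != getattr(reproduction, f.name)
    )


# Hypothetical example values, loosely based on the slide.
original = Study(
    platform="MacOS",
    research_objective="tree inference",
    implementation="package XYZ",
    method="parsimony",
    actor="alice",
    data="params+files",
)
# Same study rerun by someone else on another operating system.
reproduction = replace(original, platform="Linux", actor="bob")

print(primed(original, reproduction))  # PA
```

An empty result would mean a pure repeat; each additional letter says which kind of information a successful or failed rerun can actually provide.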
31. PRIMAD (what have you “primed”?)
Dagstuhl Seminar #16041 Report
34. A rant …
Didn’t you come here for this?
35. The Evolution of Language
– Peter Buneman for Phil Wadler
2x (Descartes)
λx. 2x (Church)
(LAMBDA (X) (* 2 X)) (McCarthy)
<?xml version="1.0"?>
<LAMBDA-TERM>
  <VAR-LIST>
    <VAR>X</VAR>
  </VAR-LIST>
  <EXPR>
    <APPLICATION>
      <EXPR><CONST>*</CONST></EXPR>
      <ARGUMENT-LIST>
        <EXPR><CONST>2</CONST></EXPR>
        <EXPR><VAR>X</VAR></EXPR>
      </ARGUMENT-LIST>
    </APPLICATION>
  </EXPR>
</LAMBDA-TERM>
(W3C)
Thesis:
• There’s no problem that can’t be
tackled by another level of
indirection.
Antithesis:
• Adding levels of indirection gets you
further away from solving your
problem.
• ... or worse:
Beware of the Turing tar-pit in which
everything is possible but nothing of
interest is easy.
-- Alan Perlis in Epigrams on Programming
36. Beware of Techno(re)ligion:
Great ideas are simple; frozen accidents aren’t …
• Geo-/Helio-centric
model
• Evolution by Natural
Selection
• Structure of DNA
• Genetic Code
• Relativity
• …
• Logic
F = A | F/F | -F | (ex x) F
vs
37. Thinking Tools
You can’t do much carpentry with your bare hands,
and you can’t do much thinking with your bare brain
– Bo Dahlbom (via D. Dennett)
38. Why we need Thinking Tools
• How do we analyze metadata models, schemas,
integrity constraints, taxonomies, ontologies, …
• … or the big picture: what do we mean by …
provenance? Reproducibility in science?
• From Thinking Tools to …. “Tool Tools”?
39. Provenance as an Intuition Pump for
Understanding what happened!
(Frozen Accidents Edition)
Zrzavý, Jan, David Storch, and Stanislav
Mihulka. Evolution: Ein Lese-Lehrbuch.
Springer-Verlag, 2009.
Author: Jkwchui (Based on
drawing by Truth-seeker2004)
43. Computational Provenance …
• Origin, processing history of artifacts
– data products, figures, ...
– also: underlying workflow
→ understand methods, dataflow, and dependencies
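A lineage query over such a processing history can be sketched as a graph traversal. The example below is a minimal illustration with a hypothetical "derived from" graph; the file and step names are invented and the dictionary layout is an assumption, not the format of any particular provenance system.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical provenance edges: each artifact or step maps to
# the things it was directly derived from.
DERIVED_FROM: Dict[str, List[str]] = {
    "figure.png": ["plot_step"],
    "plot_step": ["cleaned.csv", "plot.py"],
    "cleaned.csv": ["clean_step"],
    "clean_step": ["raw.csv", "clean.py"],
}


def lineage(artifact: str, derived_from=DERIVED_FROM) -> Set[str]:
    """Return everything the artifact transitively depends on."""
    seen: Set[str] = set()
    queue = deque([artifact])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Which inputs, scripts, and steps contributed to the figure?
print(sorted(lineage("figure.png")))
```

The answer contains both data (`raw.csv`) and code (`clean.py`, `plot.py`), which is the point of the slide: the origin of a figure includes the underlying workflow, not just the input files.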
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
44. João F. Pimentel, Saumen Dey, Timothy McPhillips,
Khalid Belhajjame, David Koop, Leonardo Murta,
Vanessa Braganholo, Bertram Ludäscher
Yin & Yang: Demonstrating complementary
provenance from noWorkflow &
YesWorkflow
49. Habemus Pons!
We’ve got the Bridge!
The bridge is the journey..
(The journey is the destination)
Lineage of image file
in terms of YW
model, with details
from NW provenance
Provenance @ SBBD'16
50. Computational Thinking: Die Grenzen meiner
Sprache bedeuten die Grenzen meiner Welt …
(The limits of my language mean the limits of my world)
• Vanilla Process Network
• Functional Programming
Dataflow Network
• XML Transformation
Network
• Collection-oriented
Modeling & Design
framework (COMAD)
– Look Ma: No Shims!
51. Tool Envy Syndrome (TES)
Maybe we should be working on
conceptual foundations and ask:
I can haz some
Terminology Tools?
52. Taxonomy Alignment Problem
RCC-5 relations between two regions X and Y:
• Congruence: X == Y
• Inclusion: X > Y
• Inverse Inclusion: X < Y
• Overlap: X >< Y
• Disjointness: X ! Y
Origins:
Euler diagrams ...
... limited FO reasoning
... RCC-5++ reasoning
Application: Geo-Taxonomy Alignment
The secret sauce inside: Moved from FO reasoner to … qualitative reasoning
(RCC-5) to … Answer Set Programming (ASP) + some more secret sauce
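The five RCC-5 relations can be illustrated by modelling each region (or taxon) as a non-empty Python set of members. This is a sketch under that assumption only; it reads `X > Y` as "X properly includes Y", and it is not how Euler/X computes alignments, which the slides attribute to ASP-based reasoning.

```python
def rcc5(x: set, y: set) -> str:
    """Return the RCC-5 relation symbol that holds between sets x and y."""
    if not x or not y:
        raise ValueError("regions must be non-empty")
    if x == y:
        return "=="   # congruence
    if x > y:
        return ">"    # inclusion: x properly includes y
    if x < y:
        return "<"    # inverse inclusion: x is properly included in y
    if x & y:
        return "><"   # overlap: some shared and some unshared members
    return "!"        # disjointness


# Hypothetical taxa, given as sets of specimen identifiers.
print(rcc5({"s1", "s2", "s3"}, {"s1"}))   # >
print(rcc5({"s1", "s2"}, {"s2", "s3"}))   # ><
```

Exactly one of the five relations holds for any pair of non-empty sets, which is why uncertainty about an alignment can be expressed as a disjunction over these five symbols.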
53. Reasoning with Incomplete Knowledge:
Exploring Possible Worlds
• Euler/X project employs
qualitative reasoning (RCC-5),
implemented in ASP to align,
merge taxonomies, debug
alignments, etc.
54. The long march to ROs & Reproducible Science
We're off to see the Wizard,
The wonderful Wizard of Prov!
--
We hear he is a wiz of a wiz
If ever a wiz there was.
--
If ever, oh ever, a wiz there was,
The Wizard of Prov is one because,
Because, because, because, because, because,
Because of the wonderful things he does.
55. Meanwhile in a galaxy far far away…
Semantic Web Stuff
W3C Activities in Developing New Query Languages
[Man15] R. Manthey. Back to the Future – Should SQL Surrender to SPARQL? SOFSEM, LNCS, 2015.
56. Are we caught in a strange loop?
[Man15] R. Manthey. Back to the Future – Should SQL Surrender to SPARQL? SOFSEM, LNCS, 2015.
57. The long march begins …
60. Actionable Transparency
• Transparency vs Re-executability
• In the beginning was the Question!
– … then came the (logic) rule
– ... in the form of a query!
• Semantics anyone?
62. What (& where) is Semantics?
How (& what) to do (with) Semantics?
• The Meaning Triangle
• Controlled Vocabularies ..
Terminological Logics (DLs) ..
Ontologies
• (Relational) Structures
• A query is a question about a
concept!
– Is this graph bipartite?
– Demonstrate, show, prove
• .. that it is!
• .. that it isn’t!
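The bipartiteness question is a good example of an answer that can be demonstrated either way: a 2-coloring proves that the graph is bipartite, and an odd cycle proves that it is not. The sketch below uses plain breadth-first search in Python rather than a logic rule; the function names and the adjacency-dictionary input format are assumptions made for illustration.

```python
from collections import deque


def odd_cycle(u, v, parent):
    """Build an odd cycle from a same-colored edge (u, v) in a BFS tree."""
    path_u = []
    node = u
    while node is not None:          # walk from u up to the BFS root
        path_u.append(node)
        node = parent[node]
    ancestors = set(path_u)
    path_v = []
    node = v
    while node not in ancestors:     # walk from v up to the common ancestor
        path_v.append(node)
        node = parent[node]
    return path_u[: path_u.index(node) + 1] + path_v[::-1]


def bipartite_witness(graph):
    """graph: dict mapping each node to a list of neighbours (undirected).

    Returns ("bipartite", coloring) or ("not bipartite", odd_cycle_nodes).
    """
    color, parent = {}, {}
    for start in graph:
        if start in color:
            continue
        color[start], parent[start] = 0, None
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in color:
                    color[v], parent[v] = 1 - color[u], u
                    queue.append(v)
                elif color[v] == color[u]:
                    return "not bipartite", odd_cycle(u, v, parent)
    return "bipartite", color


square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(bipartite_witness(square)[0])    # bipartite
print(bipartite_witness(triangle)[0])  # not bipartite
```

In both cases the second element of the result is a certificate that can be checked independently of the search that produced it.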
63. Answer Set Programming: a superpower for “doing semantics”
• ASP = DB+LP+KR+SAT
• Reasoning spectrum: …queries … constraint solving
• … OWL/DL, FO, SQL, Datalog, ..., ASP, ...
• ASP occupies a “sweet spot”
• ... but needs GTD extensions:
• PWE = ASP
+ Python
+ Jupyter
https://github.com/idaks/PWE-demos
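The "possible worlds" idea behind ASP can be illustrated without a solver: guess every candidate assignment, discard those that violate a constraint, and keep the survivors as answer sets. The sketch below does this by brute force in plain Python for a toy graph-coloring problem. It is a stand-in for illustration only; it does not call clingo, DLV, or the PWE library, and the function name is hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def possible_worlds(
    nodes: Sequence[str],
    edges: Sequence[Tuple[str, str]],
    colors: Sequence[str],
) -> List[Dict[str, str]]:
    """Enumerate all colorings in which adjacent nodes differ."""
    worlds = []
    # "Guess": every assignment of a color to every node.
    for assignment in product(colors, repeat=len(nodes)):
        world = dict(zip(nodes, assignment))
        # "Check": the integrity constraint on edges.
        if all(world[a] != world[b] for a, b in edges):
            worlds.append(world)
    return worlds


nodes = ["a", "b", "c"]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]

print(len(possible_worlds(nodes, triangle, ["red", "green", "blue"])))  # 6
print(len(possible_worlds(nodes, triangle, ["red", "green"])))          # 0
```

Each returned dictionary is one possible world. Once the worlds are ordinary Python objects, they can be handed to whatever analysis or visualization code is convenient, which is the motivation the slide gives for combining ASP with Python and Jupyter.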
64. ASP + PWE: Possible Worlds Explorer
https://github.com/idaks/PW-explorer
https://github.com/idaks/PWE-demos
67. Visualized in PWE via Python under the hood!
68. … for a few Python LOCs more …
(growing the target audience)
69. … we get highlighting of the LCAs!
70. “Boring” (ASCII) answer sets become
informative Timeline Visualization
(Here: IC Checking & Repair rules!)
71. … visualizing clusters of PWs (answer sets) …
… easily plug in different
ranking/distance/similarity functions!
72. … to discover additional structure!
• … discover similar (here:
isomorphic) solutions
• … and display them!
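Grouping solutions that are "the same up to renaming" can be sketched with a canonical form used as a grouping key. The example below treats two colorings as isomorphic when one is obtained from the other by renaming colors. That notion of isomorphism, the function names, and the toy problem are assumptions for illustration and do not reproduce the PWE implementation.

```python
from collections import defaultdict
from itertools import product


def canonical(world, nodes):
    """Rename colors by order of first appearance along `nodes`."""
    rename = {}
    return tuple(rename.setdefault(world[n], len(rename)) for n in nodes)


def cluster(worlds, nodes, key=canonical):
    """Group worlds by a pluggable key function."""
    groups = defaultdict(list)
    for world in worlds:
        groups[key(world, nodes)].append(world)
    return dict(groups)


# Toy possible worlds: proper 3-colorings of the path a - b - c.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
worlds = [
    dict(zip(nodes, assignment))
    for assignment in product(["red", "green", "blue"], repeat=len(nodes))
    if all(assignment[nodes.index(x)] != assignment[nodes.index(y)] for x, y in edges)
]

groups = cluster(worlds, nodes)
for shape, members in sorted(groups.items()):
    print(shape, len(members))
```

The twelve colorings collapse into two shapes: one where the end nodes share a color and one where all three nodes differ. Passing a different `key` function changes the notion of similarity without touching the grouping code.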
73. Conclusion I
• Clarifying what we mean by reproducibility
• Identifying tool & thinking gaps
• Bridging gaps
• Empowering the many (long tail)
• Turbocharging the specialists
74. Conclusion II: Actionable Thinking Tools!
• Possible Worlds Explorer (PWE):
– loosely coupling (= wrapping) Datalog & ASP systems
• DLV, clingo, …, XSB, … , <you-name-it>
– … with Python
– … and Jupyter notebooks
=> where the users are!
=> leveraging Python, Pandas, … analytics and visualization!
• Datalog & ASP for the rest of us!
– … and for LP / DB-Theory gurus :-)
• Work in progress
– join or fork: https://github.com/idaks/PW-explorer
– or talk, to get started: ludaesch@Illinois.edu