Teaching Reproducibility and Embracing Variability: From Floating-Point Experiments to Replicating Research

Mathieu Acher, Arnaud Gotlieb, Helge Spieker, Gauthier Le Bartz Lyan
https://hal.science/hal-05190848v1
@acherm https://mathieuacher.com/
https://diverse-project.github.io/ripost_ea/

On the one hand, the Resilient and Reproducible Software team (aka RIPOST, https://diverse-project.github.io/ripost_ea/) had many discussions on how to teach reproducibility (our research topic).
On the other hand, 4th- and 5th-year students at INSA Rennes: Master's level in engineering education; strong computer science background; by default, not interested in doing a PhD (by extension, “research” is not their primary goal).

How to integrate reproducibility in the curriculum?
Reproducibility is increasingly taught through MOOCs, dedicated courses (Fund, 2023), collaborative formats like
Reprohackathons (Cokelaer et al., 2023) or ReproducedPapers.org (Yildiz et al., 2021), and
integration into existing curricula (Vilhuber et al., 2022), but efforts remain scattered.

Integrating reproducibility in the curriculum?
Three key motivations:
● Reproducibility as a way to concretely practice science (scientific
methods and thinking; reading a paper; design and conduct experiments; analyze data, etc.)
● Reproducibility is directly related to modern software
engineering practices (including version control, continuous integration,
containerization, and infrastructure automation)
● Reproducibility and variability in software systems and
computational experiments (input data, implementation details, environment,
randomness, etc.), which create threats but also opportunities

Design of a reproducibility course, two complementary parts
From Floating-Point Experiments…
Students first explored the non-associativity of floating-point arithmetic as a reproducibility
"Hello World" using Docker, GitHub Actions, and templated experimentation to analyze
sources of variability across programming languages, compiler flags, and numerical
precision.
…to Replicating Research
The second half of the course focused on reproducing and replicating actual research
papers, including studies on large language models playing chess, home advantage in
football during COVID-19, and energy efficiency across programming languages.

From Floating-Point Experiments…
to Replicating Research
Mottos:
● software engineering (SE) as an enabler of reproducible thinking
● students worked in pairs to encourage discussion, collaboration, and sharing of
the work; groups also reproduced each other’s results
● critical engagement with scientific work
● learning by doing (practicing SE and science)
● Reproducibility vs Replicability and Variability
○ reproducibility = finding one variant (and fixing some variability) to retrieve
exact/similar results
○ replicability = exploring variants and embracing variability


Is (x+y)+z == x+(y+z)?
How often is (x+y)+z == x+(y+z)?
Write a program that returns a percentage.
Again: you choose the programming language, the
compiler/interpreter, the library, the computing environment,
the computer you want…
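One possible answer, as a minimal Python sketch. It assumes uniform draws in [0, 1) and a fixed seed; the function name, the number of trials, and the value range are illustrative choices, not the course's reference solution.

```python
import random


def associativity_rate(trials=100_000, low=0.0, high=1.0, seed=42):
    """Return the percentage of random triples for which (x+y)+z == x+(y+z)."""
    rng = random.Random(seed)  # private generator: a fixed seed makes the run repeatable
    equal = 0
    for _ in range(trials):
        x, y, z = (rng.uniform(low, high) for _ in range(3))
        if (x + y) + z == x + (y + z):
            equal += 1
    return 100.0 * equal / trials


if __name__ == "__main__":
    print(f"{associativity_rate():.2f}% of the triples satisfy associativity")
```

Every default argument (seed, range, number of trials) is already a variability factor: changing any of them may change the reported percentage.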

Each group shared their solutions/results through a Git repository
(with Dockerfile, README.md, GitHub Actions, notebooks, etc.)

Exploring and analyzing variability factors with
a template-based approach
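A template-based exploration could look like the following sketch. The C template, the chosen types and gcc flags, and the helper name generate_variants are assumptions made for illustration; they are not taken from the course material. The script only generates sources and compile commands, it does not run the compiler.

```python
from itertools import product
from string import Template

# Hypothetical C template: $real is the floating-point type under study.
C_TEMPLATE = Template("""\
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int equal = 0, n = $trials;
    srand($seed);
    for (int i = 0; i < n; i++) {
        $real x = ($real)rand() / RAND_MAX;
        $real y = ($real)rand() / RAND_MAX;
        $real z = ($real)rand() / RAND_MAX;
        if ((x + y) + z == x + (y + z)) equal++;
    }
    printf("%f\\n", 100.0 * equal / n);
    return 0;
}
""")


def generate_variants(types=("float", "double"),
                      flags=("-O0", "-O3", "-ffast-math"),
                      trials=100000, seed=42):
    """Yield (file name, C source, compile command) for each type/flag combination."""
    for real, flag in product(types, flags):
        name = f"assoc_{real}_{flag.lstrip('-')}"
        source = C_TEMPLATE.substitute(real=real, trials=trials, seed=seed)
        command = ["gcc", flag, f"{name}.c", "-o", name]
        yield f"{name}.c", source, command


if __name__ == "__main__":
    for filename, source, command in generate_variants():
        print(filename, "->", " ".join(command))
```

Two factors with two and three values already give six variants; each additional factor multiplies that number.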

Exploring and analyzing variability factors with a CLI
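A command-line interface exposes each variability factor as an explicit option. The sketch below uses Python's argparse; the option names (--trials, --seed, --low, --high, --op) are hypothetical and were chosen to mirror the factors students explored (random seed, data range, operation type).

```python
import argparse
import random


def build_parser():
    parser = argparse.ArgumentParser(
        description="Measure how often (x op y) op z == x op (y op z).")
    parser.add_argument("--trials", type=int, default=10_000, help="number of random triples")
    parser.add_argument("--seed", type=int, default=None, help="random seed (unset: not repeatable)")
    parser.add_argument("--low", type=float, default=0.0, help="lower bound of the value range")
    parser.add_argument("--high", type=float, default=1.0, help="upper bound of the value range")
    parser.add_argument("--op", choices=["add", "mul"], default="add", help="operation under test")
    return parser


def run(args):
    """Return the percentage of triples for which the chosen operation is associative."""
    rng = random.Random(args.seed)
    combine = (lambda a, b: a + b) if args.op == "add" else (lambda a, b: a * b)
    equal = 0
    for _ in range(args.trials):
        x, y, z = (rng.uniform(args.low, args.high) for _ in range(3))
        equal += combine(combine(x, y), z) == combine(x, combine(y, z))
    return 100.0 * equal / args.trials


if __name__ == "__main__":
    print(f"{run(build_parser().parse_args()):.2f}")
```

If the file were saved as assoc.py (a hypothetical name), a run such as `python assoc.py --seed 7 --op mul --high 1000` would pin three factors at once and can be recorded verbatim in a README or a CI workflow.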

Results
Students explored diverse variability factors (e.g., random seed, data range,
programming language, operation type). Most focused on factors like number of
repetitions and value ranges, occasionally testing beyond associativity (e.g.,
commutativity).
Results varied across setups but tended to converge as experiments became more
refined and controlled.
The "Hello World" task, while rich in variability, has limits:
● it doesn’t fully capture the complexity of real-world reproducibility scenarios.
● it often requires digging into floating-point internals to explain subtle behaviors.
We believe the x+(y+z) “Hello World” can be reused and is a good introduction
to reproducibility, variability, and software engineering: a prerequisite
before practicing reproducibility and science in a more realistic scenario.
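To give a flavor of what "digging into floating-point internals" means, here is a small Python sketch on the classic triple 0.1, 0.2, 0.3 (an example chosen for illustration, assuming IEEE 754 double precision, which CPython's float uses on common platforms).

```python
from decimal import Decimal

x, y, z = 0.1, 0.2, 0.3
left = (x + y) + z
right = x + (y + z)

print(left == right)             # False
print(repr(left), repr(right))   # 0.6000000000000001 0.6
print(left.hex())                # 0x1.3333333333334p-1
print(right.hex())               # 0x1.3333333333333p-1
print(Decimal(left))             # exact decimal value stored for the left-hand result
print(Decimal(right))            # exact decimal value stored for the right-hand result
print(x + y == y + x)            # True: addition stays commutative here
```

The two results differ by one unit in the last place: each intermediate sum is rounded to 53 significant bits, and the rounding happens at different points depending on the grouping.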


“Revisiting a scientific paper (50%)”
● Read a paper (choose 1 out of 3, see next slide)
○ read/study ASAP
● Reproduce
○ reuse code/data
○ understand; re-run
○ conclusions: is it reproducible? do you confirm original results?
● Replicate
○ identify variability factors and threats
○ make a deviation/variation
○ analyse
○ conclude: do you confirm original results?
● Present key results and lessons learned

Choose 1 out of 3
● "Debunking the Chessboard: Confronting GPTs Against Chess
Engines to Estimate Elo Ratings and Assess Legal Move Abilities"
○ https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
● "COVID and Home Advantage in Football: An Analysis of Results
and xG Data in European Leagues"
○ https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/
● "Ranking Programming Languages by Energy Efficiency" SCP 2021
○ https://github.com/greensoftwarelab/RosettaExamples
=> XXX (link)

Debunking the Chessboard: Confronting GPTs Against Chess
Engines to Estimate Elo Ratings and Assess Legal Move Abilities
https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
https://www.youtube.com/watch?v=6D1XIbkm4JE (video with MonsieurPhi)
● what about GPT-4o or Claude or Llama?
● prompt sensitivity?
● temperature?
● what about confronting other chess engines?
● …

COVID and Home Advantage in Football: An Analysis
of Results and xG Data in European Leagues
https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/
● what about 2023/2024?
● xG: another source of data instead of Understat
● other leagues (amateur)
● other sports
● …

Ranking Programming Languages by Energy
Efficiency SCP 2021
https://github.com/greensoftwarelab/RosettaExamples
● other programs/workloads?
● measurements?
● other programming languages?
● other compilers or interpreters?
● container effect?
● other hardware? ...
● …

Home advantage in football (soccer) before and during
COVID https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/
xPoints vs xG vs result; inter-season comparison; single-season comparison;
variability in COVID management across leagues
TL;DR: The results show a significant decrease in home advantage during the
COVID period, particularly in Ligue 1 and the Premier League.

My students reproduced results and found… differences!
Key finding #1: Bug or feature? scipy 1.6.0 changes the p-values of the Mann-Whitney U test
○ The importance of reproducing results (with other teams/people!)
○ The difficulty of reproducing results: the version of a library can change everything!
○ Be careful about configuration options left at their defaults (here the alternative
hypothesis: two-sided vs one-sided)
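Why the alternative hypothesis matters can be illustrated without SciPy. The sketch below is not SciPy's implementation: it is a plain normal approximation of the Mann-Whitney U test (no continuity or tie correction), and the points-per-match samples are made up for the example.

```python
import math


def mann_whitney_u(x, y, alternative="two-sided"):
    """Return (U statistic of x, p-value) using a plain normal approximation."""
    if alternative not in ("two-sided", "greater", "less"):
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    n1, n2 = len(x), len(y)
    # U counts the pairs where an x value beats a y value; a tie counts for one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean = n1 * n2 / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / std
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    if alternative == "greater":
        return u, upper_tail
    if alternative == "less":
        return u, 1.0 - upper_tail
    return u, math.erfc(abs(z) / math.sqrt(2))  # both tails


if __name__ == "__main__":
    home = [3, 3, 1, 3, 0, 3, 1, 3]  # made-up points per match
    away = [0, 1, 1, 0, 3, 0, 1, 0]  # made-up points per match
    for alt in ("two-sided", "greater", "less"):
        u, p = mann_whitney_u(home, away, alternative=alt)
        print(f"{alt:>9}: U={u}, p={p:.4f}")
```

On the same data the two-sided p-value is twice the smaller one-sided p-value, so a library whose default alternative changes between versions can move a result across a significance threshold. Passing the alternative explicitly removes that hidden factor.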

My students replicated and found new insights!
Key finding #2
○ The original study was conducted in 2021... opportunities to extend it to 2022,
2023, and 2024
○ and ask new questions: has the post-COVID period restored the home
advantage?
○ Answer/teaser: yes!
○ New findings on other national championships and European cups
(Champions League, the Turkish Süper Lig) on post-COVID seasons
(2022–2024)

Discussion
Scaling the course with respect to the number of students. Several threats apply to our
experience:
● ideal setting: only 1 instructor (Mathieu Acher) and only 20 students (10 groups), which eases
seamless and consistent communication; this can be more challenging with more
instructors and students!
● kinds of papers to replicate: it is very important to know the details deeply, see also (Yildiz
et al., 2021); I (the instructor) was an author of two of the papers; it was more difficult for the third one ;)
● tension between detailed instructions and open-ended investigation;
importance of fine-grained interactions at some points
Teaching variability: we can go further (variability modelling with a more
principled exploration of variants and interactions between variability factors).
Should we?
Dissemination: how to promote high-quality student work and replication projects:
co-authored publications with students?

Conclusion
Self-contained reproducibility course with
● (1) advanced software engineering;
● (2) practice of science, including critical thinking and replication;
● (3) variability at the heart (identification and combination of variability factors; exploration of
variants; analysis of results) for both reproducing and replicating
Two-step approach:
● (1) a “hello world” for reproducibility providing the motivation, technical foundations and
methods; a mandatory (?) prerequisite to address real-world scenarios
● (2) a reproduction and replication of actual research content through variability management
Success: high-quality work delivered through Git repositories; students were able to pinpoint potential
reproducibility issues and propose replications (variations) to explore new hypotheses,
provide new insights, and test generality. Possible co-authorship.
We learn from students and from teaching. Can’t wait to… replicate the
course in fall 2025!

preprint: https://hal.science/hal-05190848v1
Course material can be reused:
https://archive.softwareheritage.org/browse/directory/89ad4e83c94a17411027ee891c9cb297a4df02c5/
Questions? If you don’t have any, I do ;)
What’s your hello world for reproducibility?
What are the success criteria of a reproducibility course?

Reproducible science with variability
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Yet, despite the availability of data and code, several studies report that unexplored variability
in software can lead to varying results, to the point that discrepancies can radically change the
conclusions and contradict established knowledge.
From a set of scripts that automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses.

Computational science with deep variability
● Layers and sources of variability: hardware variability; operating system (25,000+ options, 10^6000 variants); thousands of compiler flags; dozens of library versions; dozens of command-line parameters; container; configuration files; distributed environment; hyperparameters; application code; variability in data
● Affected outputs: energy consumption; execution time; binary; computed result (42); accuracy

Deep Software Variability
“refers to the interaction of all external ‘factors’ modifying the behavior (including both functional and
nonfunctional properties) of a software system” (Lesoil et al., VaMoS 2020)
Combinatorial explosion of the epistemic and ontological variability, with impacts on computational
results and non-functional properties
Always 42?

https://github.com/FAMILIAR-project/reproducibility-associativity/

Our Vision: Embrace deep variability!
Explicit modeling of the variability points and their relationships, in order to:
1. Get insights into the variability “factors” and their possible interactions
2. Capture and document configurations for the sake of reproducibility
3. Explore diverse configurations to replicate, and hence optimize, validate,
increase the robustness, or provide better resilience
⇒ We aim to address the complexities associated with reproducibility and replicability in modern
software systems and environments, facilitating a more comprehensive and nuanced perspective on these
critical “factors”.
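Capturing and documenting a configuration (point 2 above) can start very small. The following Python sketch records a few environment factors next to the experiment parameters; the function name capture_configuration and the selected fields are assumptions for illustration, and a real study would likely add library versions, compiler flags, or the container image.

```python
import json
import platform
import sys


def capture_configuration(seed, extra=None):
    """Collect variability factors of the current run into a JSON-serialisable dict."""
    config = {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
        "float_mantissa_bits": sys.float_info.mant_dig,
        "seed": seed,
    }
    config.update(extra or {})  # experiment-specific factors, e.g. number of trials
    return config


if __name__ == "__main__":
    config = capture_configuration(seed=42, extra={"trials": 100000})
    print(json.dumps(config, indent=2, sort_keys=True))
```

Storing this record with each result makes it possible to tell afterwards which factors were fixed and which were left to vary.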
https://hal.science/hal-04582287

https://github.com/FAMILIAR-project/reproducibility-associativity/
Feature model (excerpt) in textual notation (UVL)

Feature model: widely studied and used formalism in software engineering (proposed in 1990!)
● Formal abstractions are definitely needed to encode variability knowledge
and steer the exploration of computational experiments
● Numerous works/techniques to specify and reverse engineer (out of
spreadsheet, command-line parameters, source code, doc., configurations, etc.) feature models
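The idea of a feature model steering the exploration can be sketched in Python. The features, their values, and the two constraints below are hypothetical; they are not the UVL model from the repository, and plain enumeration is used here instead of a dedicated solver.

```python
from itertools import product

# Hypothetical variability factors and their alternatives.
FEATURES = {
    "language": ["C", "Python", "Java"],
    "precision": ["float", "double"],
    "optimization": ["none", "O3", "fast-math"],
}


def is_valid(config):
    """Cross-tree constraints of this illustrative model."""
    # Assumption: compiler optimization levels are only offered for C.
    if config["language"] != "C" and config["optimization"] != "none":
        return False
    # Python's built-in float is double precision, so single precision is excluded.
    if config["language"] == "Python" and config["precision"] == "float":
        return False
    return True


def valid_configurations():
    """Enumerate every combination of feature values that satisfies the constraints."""
    names = list(FEATURES)
    for values in product(*FEATURES.values()):
        config = dict(zip(names, values))
        if is_valid(config):
            yield config


if __name__ == "__main__":
    configs = list(valid_configurations())
    print(f"{len(configs)} valid configurations out of 18 combinations")  # 9 valid
    for config in configs:
        print(config)
```

Constraints cut the 18 raw combinations down to 9 experiments worth running. For realistic models with thousands of options, enumeration does not scale and solver-based or sampling techniques take over.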
