Teaching Reproducibility and Embracing Variability:
From Floating-Point Experiments to Replicating Research
Mathieu Acher, Arnaud Gotlieb, Helge Spieker, Gauthier Le Bartz Lyan
https://hal.science/hal-05190848v1
@acherm https://mathieuacher.com/
https://diverse-project.github.io/ripost_ea/
On the one hand, the Resilient and Reproducible Software team (aka RIPOST)
https://diverse-project.github.io/ripost_ea/ had many discussions on how to
teach reproducibility (our research topic).
On the other hand, 4th- and 5th-year students at INSA Rennes: Master's level
in engineering education; strong computer science background; by default, not
interested in doing a PhD (by extension, “research” is not their primary goal).
How to integrate reproducibility in the curriculum?
Reproducibility is increasingly taught through MOOCs, dedicated courses (Fund, 2023), collaborative formats like
Reprohackathons (Cokelaer et al., 2023) or ReproducedPapers.org (Yildiz et al., 2021), and
integration into existing curricula (Vilhuber et al., 2022), but efforts remain scattered.
Integrating reproducibility in the curriculum?
Three key motivations:
● Reproducibility as a way to concretely practice science (scientific
methods and thinking; reading a paper; designing and conducting experiments; analyzing data, etc.)
● Reproducibility is directly related to modern software
engineering practices (including version control, continuous integration,
containerization, and infrastructure automation)
● Reproducibility and variability in software systems and
computational experiments (input data, implementation details, environment,
randomness, etc.): variability creates threats but also opportunities
Design of a reproducibility course, two complementary parts
From Floating-Point Experiments…
Students first explored the non-associativity of floating-point arithmetic as a reproducibility
"Hello World" using Docker, GitHub Actions, and templated experimentation to analyze
sources of variability across programming languages, compiler flags, and numerical
precision.
…to Replicating Research
The second half of the course focused on reproducing and replicating actual research
papers, including studies on large language models playing chess, home advantage in
football during COVID-19, and energy efficiency across programming languages.
Design of a reproducibility course, two complementary parts
From Floating-Point Experiments…
to Replicating Research
Mottos:
● software engineering (SE) as an enabler of reproducible thinking
● students worked in pairs to encourage discussion, collaboration, and sharing of
the work; groups also reproduced each other’s results
● critical engagement with scientific work
● learning by doing (practicing SE and science)
● Reproducibility vs Replicability and Variability
○ reproducibility = finding one variant (and fixing some variability) to retrieve
exact/similar results
○ replicability = exploring variants and embracing variability
Design of a reproducibility course, two complementary parts
Students first explored the non-associativity of floating-point
arithmetic as a reproducibility "Hello World" using Docker,
GitHub Actions, and templated experimentation to analyze
sources of variability across programming languages, compiler
flags, and numerical precision.
Is (x+y)+z == x+(y+z)?
How often is (x+y)+z == x+(y+z)?
Write a program that returns a percentage.
Again: choose the programming language, the
compiler/interpreter, the library, the computing environment,
the computer you want…
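One possible answer to the exercise, as a minimal Python sketch. The function name, the default number of trials, the seed, and the sampling range are illustrative assumptions; they are not taken from the students' solutions.

```python
import random


def associativity_rate(n_trials: int = 10_000, seed: int = 42,
                       low: float = 0.0, high: float = 1.0) -> float:
    """Return the percentage of random triples with (x+y)+z == x+(y+z)."""
    # A private generator makes the seed an explicit, reproducible factor.
    rng = random.Random(seed)
    equal = 0
    for _ in range(n_trials):
        x, y, z = (rng.uniform(low, high) for _ in range(3))
        if (x + y) + z == x + (y + z):
            equal += 1
    return 100.0 * equal / n_trials


if __name__ == "__main__":
    # Classic counter-example: the two groupings round differently.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))
    print(f"{associativity_rate():.2f}% of triples are associative")
```

Every default here (seed, range, number of trials, language, interpreter) is already a variability factor, which is the point of the exercise.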
Each group shared their solutions/results through a Git repository
(with Dockerfile, README.md, GitHub Actions, notebooks, etc.)
Exploring and analyzing variability factors with a
template-based approach
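The slide's illustration is not recoverable from the text, so the following is only a sketch of what a template-based approach could look like: one C program template with variation points, instantiated once per combination of factors. The template, the factor names ($ftype, $trials, $seed), and the file naming scheme are all hypothetical.

```python
from itertools import product
from string import Template

# Hypothetical C template; $ftype, $trials and $seed are the variation points.
C_TEMPLATE = Template("""\
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int equal = 0;
    srand($seed);
    for (int i = 0; i < $trials; i++) {
        $ftype x = ($ftype)rand() / RAND_MAX;
        $ftype y = ($ftype)rand() / RAND_MAX;
        $ftype z = ($ftype)rand() / RAND_MAX;
        if ((x + y) + z == x + (y + z)) equal++;
    }
    printf("%.2f\\n", 100.0 * equal / $trials);
    return 0;
}
""")


def render_variants(ftypes=("float", "double"), trials=(1000,), seeds=(42,)):
    """Map a file name to generated C source, one per factor combination."""
    variants = {}
    for ftype, n, seed in product(ftypes, trials, seeds):
        name = f"assoc_{ftype}_{n}_{seed}.c"
        variants[name] = C_TEMPLATE.substitute(ftype=ftype, trials=n, seed=seed)
    return variants


if __name__ == "__main__":
    for name in render_variants():
        print(name)
```

The generated sources would then be compiled and run (for example with different compiler flags) inside the container or CI job; that step is left out of this sketch.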
Exploring and analyzing variability factors with a CLI
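As with the template-based slide, the original illustration is missing. A command-line interface that exposes variability factors as options might look like the sketch below; the option names (--seed, --trials, --low, --high, --operation) are assumptions chosen for illustration.

```python
import argparse
import random


def build_parser() -> argparse.ArgumentParser:
    """Expose each variability factor as an explicit command-line option."""
    parser = argparse.ArgumentParser(
        description="Floating-point associativity experiment")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--trials", type=int, default=10_000)
    parser.add_argument("--low", type=float, default=0.0)
    parser.add_argument("--high", type=float, default=1.0)
    parser.add_argument("--operation", choices=["add", "mul"], default="add")
    return parser


def run(args: argparse.Namespace) -> float:
    """Return the percentage of triples for which the operation is associative."""
    rng = random.Random(args.seed)
    if args.operation == "add":
        op = lambda a, b: a + b
    else:
        op = lambda a, b: a * b
    equal = 0
    for _ in range(args.trials):
        x, y, z = (rng.uniform(args.low, args.high) for _ in range(3))
        equal += op(op(x, y), z) == op(x, op(y, z))
    return 100.0 * equal / args.trials


if __name__ == "__main__":
    print(f"{run(build_parser().parse_args()):.2f}")
```

A run such as `python assoc.py --seed 1 --trials 500 --operation mul` (the script name is hypothetical) then documents its own configuration in the command itself.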
Results
Students explored diverse variability factors (e.g., random seed, data range,
programming language, operation type). Most focused on factors like number of
repetitions and value ranges, occasionally testing beyond associativity (e.g.,
commutativity).
Results varied across setups but tended to converge as experiments became more
refined and controlled.
The "Hello World" task, while rich in variability, has limits:
● it doesn’t fully capture the complexity of real-world reproducibility scenarios;
● it often requires digging into floating-point internals to explain subtle behaviors.
We believe the x+(y+z) “Hello World” can be reused and is a good introduction
to reproducibility, variability, and software engineering: a prerequisite
before practicing reproducibility and science in a more realistic scenario.
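The kind of exploration described above (varying the seed and the value range, and testing properties other than associativity) can be sketched as a small sweep. The factor values below are arbitrary assumptions, not the ones used by the students.

```python
import random


def rate(check, n_trials, seed, high):
    """Percentage of random triples in [0, high] that satisfy `check`."""
    rng = random.Random(seed)
    hits = sum(check(*(rng.uniform(0.0, high) for _ in range(3)))
               for _ in range(n_trials))
    return 100.0 * hits / n_trials


def associative(x, y, z):
    return (x + y) + z == x + (y + z)


def commutative(x, y, z):
    # z is unused: addition of two finite floats does not depend on order.
    return x + y == y + x


def sweep(seeds=(1, 2, 3), highs=(1.0, 1e3, 1e6), n_trials=2_000):
    """Associativity rate for every (seed, range) combination."""
    return {(seed, high): rate(associative, n_trials, seed, high)
            for seed in seeds for high in highs}


if __name__ == "__main__":
    for (seed, high), value in sweep().items():
        print(f"seed={seed} high={high:g}: {value:.2f}%")
    print(f"commutativity: {rate(commutative, 2_000, 1, 1.0):.2f}%")
```

Increasing `n_trials` and comparing the rows across seeds is one way to observe whether results tend to converge as the experiment becomes more controlled.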
Design of a reproducibility course, two complementary parts
The second half of the course focused on reproducing and
replicating actual research papers, including studies on large
language models playing chess, home advantage in football
during COVID-19, and energy efficiency across programming
languages.
“Revisiting a scientific paper (50%)”
● Read a paper (choose 1 out of 3, see next slide)
○ read/study ASAP
● Reproduce
○ reuse code/data
○ understand; re-run
○ conclusions: is it reproducible? do you confirm original results?
● Replicate
○ identify variability factors and threats
○ make a deviation/variation
○ analyse
○ conclude: do you confirm original results?
● Present key results and lessons learned
Choose 1 out of 3
● "Debunking the Chessboard: Confronting GPTs Against Chess
Engines to Estimate Elo Ratings and Assess Legal Move Abilities"
○ https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
● "COVID and Home Advantage in Football: An Analysis of Results
and xG Data in European Leagues"
○ https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/
● “Ranking Programming Languages by Energy Efficiency” SCP 2021
○ https://github.com/greensoftwarelab/RosettaExamples
=> XXX (link)
Debunking the Chessboard: Confronting GPTs Against Chess
Engines to Estimate Elo Ratings and Assess Legal Move Abilities
https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
https://www.youtube.com/watch?v=6D1XIbkm4JE (video with MonsieurPhi)
● what about GPT-4o, Claude, or Llama?
● prompt sensitivity?
● temperature?
● what about playing against other chess engines?
● …
COVID and Home Advantage in Football: An Analysis
of Results and xG Data in European Leagues
https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/
● what about 2023/2024?
● xG: instead of understat, another source of data
● other leagues (amateur)
● other sports
● …
Ranking Programming Languages by Energy
Efficiency SCP 2021
https://github.com/greensoftwarelab/RosettaExamples
● other programs/workloads?
● measurements?
● other programming languages?
● other compilers or interpreters?
● container effect?
● other hardware? ...
● …
Home advantage in football (soccer) before and during
COVID https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/
xPoints vs xG vs result; inter-season comparison; single-season comparison;
variability in COVID management across leagues
TL;DR: The results show a significant decrease in home advantage during the
COVID period, particularly in Ligue 1 and the Premier League.
My students reproduced results and found… differences!
Key finding #1: Bug or feature? scipy 1.6.0 changes the p-values of the Mann-Whitney U test
○ The importance of reproducing results (with other teams/people!)
○ The difficulty of reproducing results: the version of a library can change everything!
○ Be careful about configuration options left at their default values (here, the alternative hypothesis:
two-sided vs one-sided)
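To see why the choice of alternative hypothesis matters, here is a self-contained sketch of the Mann-Whitney U test using a plain normal approximation (no tie or continuity correction). It is not SciPy's implementation, and the two samples are made-up numbers, not data from the study. In SciPy, the same choice is exposed through the `alternative` parameter of `scipy.stats.mannwhitneyu`, so setting it explicitly avoids depending on a version-specific default.

```python
import math

# Hypothetical samples (e.g., points per match); NOT data from the study.
HOME = [3, 1, 3, 3, 0, 1, 3, 3, 1, 3]
AWAY = [0, 1, 0, 3, 0, 1, 1, 0, 3, 0]


def mann_whitney_u(x, y, alternative="two-sided"):
    """Return (U, p) with a normal approximation; illustration only."""
    n1, n2 = len(x), len(y)
    # U counts the pairs where x beats y; a tie counts for one half.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    if alternative == "greater":
        p = upper_tail
    elif alternative == "less":
        p = 1.0 - upper_tail
    elif alternative == "two-sided":
        p = min(1.0, 2.0 * min(upper_tail, 1.0 - upper_tail))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return u, p


if __name__ == "__main__":
    for alt in ("two-sided", "greater"):
        u, p = mann_whitney_u(HOME, AWAY, alternative=alt)
        print(f"{alt}: U={u}, p={p:.4f}")
```

On the same data, the two-sided p-value is twice the one-sided one whenever the observed effect goes in the tested direction, which is enough to move a result across a significance threshold.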
My students replicated and found new insights!
Key finding #2
○ The original study was conducted in 2021, which leaves room to extend it to 2022,
2023, and 2024
○ and to ask new questions: has the home advantage been restored in the post-COVID
period?
○ Answer/teaser: yes!
○ New findings on other national championships and European cups
(the Turkish Süper Lig, the Champions League) for the post-COVID seasons
(2022–2024)
Discussion
Scaling the course with respect to the number of students. There are many threats to the generality of our
experience:
● ideal setting: only 1 instructor (Mathieu Acher) and only 20 students (10 groups), which eases
seamless and consistent communication; this can be more challenging with more
instructors and students!
● kinds of papers to replicate: it is very important to know the details deeply – see also (Yildiz
et al., 2021); I (the instructor) was an author of two of them; it was more difficult for the third one ;)
● tension between detailed instructions and open-ended investigation;
importance of fine-grained interactions at some points
Teaching variability: We can go further (variability modelling with a more
principled exploration of variants and interactions between variability factors);
Should we?
Dissemination: how to showcase high-quality student work and replication projects?
Co-authored publications with students?
Conclusion
Self-contained reproducibility course with
● (1) advanced software engineering;
● (2) practice of science, including critical thinking and replication;
● (3) variability at the heart (identification and combination of variability factors; exploration of
variants; analysis of results) for both reproducing and replicating
Two-step approach:
● (1) a “hello world” for reproducibility providing the motivation, technical foundations, and
methods; a mandatory (?) prerequisite to address real-world scenarios
● (2) a reproduction and replication of actual research content through variability management
Success: high-quality work delivered through Git repositories; students were able to pinpoint potential
reproducibility issues and propose replications (variations) to further explore new hypotheses,
provide new insights, and test generality. Possible co-authorship.
We learn from students and from teaching. Can’t wait to… replicate the
course in fall 2025!
Teaching Reproducibility and Embracing
Variability: From Floating-Point Experiments to Replicating Research
preprint: https://hal.science/hal-05190848v1
Course material can be reused:
https://archive.softwareheritage.org/browse/directory/89ad4e83c94a17411027ee891c9cb297a4df02c5/
Questions? If you don’t have any, I do ;)
What’s your hello world for reproducibility?
What are the success criteria of a reproducibility course?
Reproducible science with variability
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Yet, despite the availability of data and code, several studies report that unexplored variability
in software can lead to varying results, to the point that discrepancies can radically change the
conclusions and contradict established knowledge.
From a set of scripts that automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
Computational science with deep variability
[Figure: layers of variability around a computation.
Variability factors: hardware variability; 25,000+ options, 10^6000 variants (operating system); thousands of compiler flags; dozens of library versions; dozens of command-line parameters; (container) configuration files; (distributed environment); hyperparameters; (application code); variability in data.
Observed outputs: energy consumption, execution time, binary, accuracy, and the result "42".]
Deep Software Variability
“refers to the interaction of all external “factors” modifying the behavior (including both functional and
nonfunctional properties) of a software system” Lesoil et al. VaMoS 2020
Combinatorial explosion of the epistemic and ontological variability, with impacts on computational
results and non-functional properties
Always 42?
Deep Software Variability
https://github.com/FAMILIAR-project/reproducibility-associativity/
Deep Software Variability
Our Vision: Embrace deep variability!
Explicit modeling of the variability points and their relationships, in order to:
1. Get insights into the variability “factors” and their possible interactions
2. Capture and document configurations for the sake of reproducibility
3. Explore diverse configurations to replicate, and hence optimize, validate,
increase the robustness, or provide better resilience
⇒ We aim to address the complexities associated with reproducibility and
replicability in modern software systems and environments, facilitating a
more comprehensive and nuanced perspective on these critical “factors”.
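Capturing and documenting a configuration (point 2 of the vision) can start very small. The sketch below records a few environment facts next to the experiment's own factors; the function name and the chosen fields are assumptions, and a real manifest would likely also record library versions, compiler flags, and the container image.

```python
import json
import platform
import sys


def capture_environment(**experiment_factors) -> dict:
    """Record the execution environment together with the chosen factors."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        # Number of significand bits of the platform's float type.
        "float_mantissa_bits": sys.float_info.mant_dig,
        "factors": experiment_factors,
    }


if __name__ == "__main__":
    manifest = capture_environment(seed=42, trials=10_000)
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing such a manifest alongside each result makes it possible to tell later which variant produced which number.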
https://hal.science/hal-04582287
https://github.com/FAMILIAR-project/reproducibility-associativity/
(excerpt)
textual notation (UVL)
Feature model: a widely studied and used formalism in software engineering (proposed in 1990!)
● Formal abstractions are definitely needed to encode variability knowledge
and steer the exploration of computational experiments
● Numerous techniques exist to specify and reverse-engineer feature models (from
spreadsheets, command-line parameters, source code, documentation, configurations, etc.)
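The UVL excerpt from the slide is not recoverable here. As a rough stand-in, the sketch below encodes a tiny feature model in Python (alternative groups plus one cross-tree constraint) and enumerates its valid configurations. The features and the constraint are hypothetical; a real model would be written in UVL and analyzed with dedicated tooling rather than brute-force enumeration.

```python
from itertools import product

# Hypothetical alternative groups: exactly one value is chosen per feature.
FEATURES = {
    "language": ["python", "c", "java"],
    "precision": ["single", "double"],
    "compiler_flag": ["none", "-O2", "-ffast-math"],
}


def is_valid(config: dict) -> bool:
    """Cross-tree constraint: compiler flags only apply to the C variant."""
    if config["language"] != "c" and config["compiler_flag"] != "none":
        return False
    return True


def valid_configurations():
    """Yield every combination of feature values that satisfies the constraint."""
    names = list(FEATURES)
    for values in product(*(FEATURES[name] for name in names)):
        config = dict(zip(names, values))
        if is_valid(config):
            yield config


if __name__ == "__main__":
    configs = list(valid_configurations())
    print(f"{len(configs)} valid configurations out of "
          f"{len(list(product(*FEATURES.values())))} combinations")
```

Even in this toy model, the constraint removes 8 of the 18 raw combinations, which hints at why an explicit model helps to steer exploration once the number of factors grows.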