Machine learning suffers from a reproducibility crisis. Deterministic machine learning is essential for academia to verify published results, and for developers to debug, audit, and regression-test models.
Because non-determinism in machine learning has many causes, especially when GPUs are in play, I conducted several experiments to identify these causes and their solutions (where available).
2. About me
● Bioinformatics MSc from the University of Tübingen
● Research Software Engineer at the Quantitative Biology Center Tübingen
● Expert in reproducible research
3. About the Quantitative Biology Center (QBiC)
● Bioinformatics core facility at the University of Tübingen
○ Data management and data analysis
● Strong contributor to reproducible research
● Job opening for a Scientific Data Steward
5. Why do we even care?
● 400 papers in machine learning evaluated [1]
○ Only 24% reproducible
● Determinism matters for science, auditing, experimentation, debugging, and regression testing
[1] Gundersen, Odd Erik & Kjensmo, Sigbjørn. (2018). State of the Art: Reproducibility in Artificial Intelligence.
6. Primary reasons for non-reproducible machine learning [1]
● Data and code not shared
● Insufficient documentation
○ Hyperparameters, metrics, ...
○ Hardware used
● Irreproducible environment
● Usage of GPUs
○ Non-deterministic operations
[1] Gundersen, Odd Erik & Kjensmo, Sigbjørn. (2018). State of the Art: Reproducibility in Artificial Intelligence.
7. The elephant in the deterministic machine learning room
● Sum-reduce algorithm
○ Based on CUDA atomic add operations
○ GPUs operate massively in parallel
○ Summing up requires synchronization
8. On highly parallel floating point addition
● The order of thread synchronization leads to different floating-point rounding errors
● Floating-point summation is not associative
● Repeated applications of the algorithm amplify the differences
● Most machine learning libraries are based on atomic operations
● There are plenty more reasons for non-deterministic behavior
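The non-associativity above is easy to demonstrate in plain Python, with double-precision floats and no frameworks involved:

```python
# Two groupings of the same three summands give different results:
# 1.0 is absorbed when added to a value as large as 1e16, so the
# result depends on the order in which partial sums are combined.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # exact cancellation first, then + 1.0
right = a + (b + c)   # 1.0 is lost to rounding before cancellation

print(left, right)  # 1.0 0.0
```

On a GPU, the grouping is decided by whichever thread's atomic add lands first, so the rounding error changes from run to run.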
9. Recent developments
● Open questions:
○ Do deterministic algorithms work as expected?
○ Are options available for all algorithms?
○ What is the effect on the run time?
● (Optional) deterministic algorithms are now offered
○ Implemented without atomic operations
○ PyTorch: since v0.4.0 (2017)
○ TensorFlow: since v2.1.0 (2020)
○ XGBoost: since v1.1.0 (2020)
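For PyTorch and TensorFlow the deterministic algorithms are an explicit opt-in; a sketch of the switches (API names from recent releases, versions noted in comments; not runnable without the frameworks installed):

```python
# PyTorch (>= 1.8): a single global switch. Ops without a
# deterministic implementation raise an error instead of
# silently running non-deterministically.
import torch
torch.use_deterministic_algorithms(True)

# TensorFlow >= 2.9 bundles the determinism flags into one call;
# on TF 2.1-2.8 the environment variable TF_DETERMINISTIC_OPS="1"
# was used instead.
import tensorflow as tf
tf.config.experimental.enable_op_determinism()
```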
10. Evaluating determinism - the setup
● Containerized projects
○ PyTorch, TensorFlow: MNIST
○ XGBoost: Covertype
● Three settings for CPU, single GPU and multiple GPUs
○ No random seeds
○ All possible random seeds
○ Deterministic algorithms and random seeds
● 5 runs per setting
System | Hardware
1 - Laptop | Intel i5-7300HQ and NVIDIA 1050M
2 - deNBI K80 | 12-core Intel and 2× NVIDIA Tesla K80
3/4 - deNBI V100 | 24-core Intel and 2× NVIDIA Tesla V100
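The "all possible random seeds" setting above amounts to pinning every random number generator a project touches. A minimal sketch using only the standard library (the framework-specific calls, which are outside the stdlib, are indicated as comments):

```python
import os
import random

def set_seeds(seed: int) -> None:
    """Pin the stdlib sources of randomness to a fixed seed."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # Framework RNGs would be pinned here as well, e.g.
    # numpy.random.seed(seed), torch.manual_seed(seed),
    # tf.random.set_seed(seed) -- omitted to keep this stdlib-only.

set_seeds(42)
first = [random.random() for _ in range(3)]
set_seeds(42)
second = [random.random() for _ in range(3)]
print(first == second)  # the same seed reproduces the same draws
```

On CPU this is usually enough for bit-identical runs; on GPU it is not, which is exactly what the next slides show.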
11. GPU with just seeds is non-deterministic
(Figure: run-to-run results on Intel i5-7300HQ with NVIDIA 1050M)
15. Primary takeaways
● Deterministic algorithms work
○ But are badly tested
○ And need to be forced
● Not every algorithm has a deterministic option
○ Difficult to get complete lists
○ Even harder to keep track
● Determinism depends on the hardware architecture
● Negligible effect on the runtime
○ Duncan Riach (NVIDIA): ~6% [1]
[1] https://github.com/NVIDIA/framework-determinism
16. Requirements for Deterministic Machine Learning
Complex requirements demand an intuitive software solution
18. Enabling deterministic machine learning with mlf-core [1][2]
[1] https://mlf-core.com
[2] https://pypi.org/project/mlf-core/
Inspired by nf-core
30. Acknowledgements
● Sven Nahnsen
● Philipp Hennig
● Gisela Gabernet
● Duncan Riach
● nf-core
● deNBI
This work was supported by the BMBF-funded de.NBI Cloud within the German Network
for Bioinformatics Infrastructure (de.NBI)(031A537B, 031A533A, 031A538A, 031A533B,
031A535A, 031A537C, 031A534A, 031A532B).
33. Further reasons for non-determinism
● NVIDIA cuDNN benchmark
○ Disable benchmarking
● Bias additions, max-pooling, batch normalization
○ Usually based on atomic add
○ Circumvent with deterministic algorithms
● Many more non-obvious functions
○ index_add
○ gate_gradients
○ …
○ Usually no solution available
● GPU batch distribution
○ Library specific
● CUDA version
○ Must be compiled with the correct version
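The cuDNN benchmark point above maps to a pair of flags in PyTorch; a sketch (assumes PyTorch with CUDA installed, so it is not runnable standalone):

```python
import torch

# Benchmarking autotunes the fastest convolution algorithm per input
# shape, which can pick a different (non-deterministic) algorithm on
# each run; disabling it removes that source of variation.
torch.backends.cudnn.benchmark = False
# Additionally ask cuDNN to use only deterministic algorithms.
torch.backends.cudnn.deterministic = True
```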
34. Deterministic sum_reduce
● One of the easier algorithms in which to replace atomic add
● Reshape the input into a row vector and multiply it with a column of ones
import tensorflow as tf

def reduce_sum_det(x):
    # Flatten x into a 1 x n row vector.
    v = tf.reshape(x, [1, -1])
    # v @ ones^T yields the sum as a single 1 x 1 matrix multiplication.
    return tf.reshape(tf.matmul(v, tf.ones_like(v), transpose_b=True), [])
● GPUs are good at matrix multiplication