On the Reliability of Coverage-Based Fuzzer Benchmarking
Marcel Böhme, MPI-SP & Monash
László Szekeres, Google
Jonathan Metzman, Google
whoami
Marcel Böhme
Foundations of Software Security @ MPI-SP (Max Planck Institute for Security and Privacy)
Looking for PhDs & PostDocs at the Max Planck Institute, Bochum, Germany
• Fuzzing for Automatic Vulnerability Discovery
  • Making machines attack other machines.
  • Focus on scalability, efficiency, and effectiveness.
• Foundations of Software Security
  • Assurances in Software Security
  • Fundamental limitations of existing approaches
  • Drawing from multiple disciplines (information theory, biostatistics)
10 yrs Singapore · 3 yrs Melbourne · since Aug’22 Bochum
Motivation
Suppose none of our fuzzers finds any bugs in our program.
How do we know which fuzzer is better?
We measure code coverage!
Motivation: Coverage (ICSE’14, ASE’20)
• Key Idea: You cannot find bugs in code that is not covered.
• Question: How strong is the relationship between coverage and bug finding?
  This is called “correlation”.
• Observation: Test suites with more coverage find more bugs only because they are bigger.
• Observation: Test suites with more coverage find more bugs irrespective of whether they are bigger.
  This is called “contradiction”.
Correlation: Very strong
• Our experiments confirm a very strong correlation for fuzzer-generated test suites!
• As a fuzzer covers more code, it also finds more bugs.
• Problem: Fuzzing folks are not convinced.
  “It does not make sense 🤔” (paraphrasing Klees et al., CCS’18)
  We cannot compare two or more fuzzers in terms of coverage in order to establish one as the best fuzzer in terms of bug finding.
• Why?
Agreement: That’s why.
• Suppose we have two instruments to measure acidity.
• Strong correlation: more acidity means both instruments consistently indicate lower pH readings.
• Weak agreement: the two instruments might still rank 2+ tubes differently.
• Moderate agreement means we cannot reliably substitute one instrument for the other.
• Ranking 10 fuzzers in terms of code coverage and in terms of #bugs found:
  the worst fuzzer in terms of coverage is the best fuzzer in terms of bug finding.
Experimental Setup
• Experimental design (post hoc ground truth)
  • To minimize threats to validity, we use post hoc bug identification instead of a pre-determined ground-truth benchmark (more on that later).
  • Automatic and manual deduplication of bugs found during fuzzing (a simplified sketch follows below).
  • 341,595 generated bug reports across all campaigns.
  • 409 unique bugs after automatic deduplication (via a variant of ClusterFuzz).
  • 235 unique bugs after manual deduplication (by two professional software engineers).
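For illustration only: a minimal sketch of automatic crash deduplication by stack-trace signature. This is a common heuristic and a simplification; it is not the ClusterFuzz variant actually used in the study, and all names in it are hypothetical.

```python
import hashlib

def crash_signature(stack_trace: str, top_frames: int = 3) -> str:
    """Reduce a symbolized crash stack trace to a signature built from its
    top frames; reports with the same signature are treated as duplicates."""
    frames = [line.strip() for line in stack_trace.splitlines() if line.strip()]
    return hashlib.sha1("|".join(frames[:top_frames]).encode()).hexdigest()

def deduplicate(reports: list[str]) -> list[str]:
    """Keep one representative crash report per signature."""
    unique: dict[str, str] = {}
    for report in reports:
        unique.setdefault(crash_signature(report), report)
    return list(unique.values())
```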
Experimental Setup
• Fuzzers and programs
  • FuzzBench infrastructure
  • 10 fuzzers + 24 programs
• Benchmark selection
  • Randomly selected from OSS-Fuzz (500+ programs).
  • Higher selection probability for programs with historically more bugs (for economic reasons).
• Reproducibility
  • 10 fuzzers × 24 programs × 20 campaigns × 23 hours: >13 CPU years
Agreement: Coverage vs Bug Finding
• Agreement on superiority when comparing 2 fuzzers in terms of code coverage and #bugs found, when the difference is statistically significant at level p.
• Strong agreement for p ≤ 0.0001 for both coverage and bug finding.
• However, if we only require the difference in terms of coverage (which we can observe) to be statistically significant: weak agreement.
• We also provide two other measures of agreement on superiority: the disagreement proportion d and Spearman’s ρ over superiority outcomes (+1 superior, 0 not significant, −1 inferior). A computational sketch follows below.
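A minimal sketch of how such superiority outcomes and the two agreement measures could be computed, assuming per-trial coverage and bug counts are available as lists. The Mann-Whitney U test and the mean-based direction check are assumptions chosen for illustration, not necessarily the exact procedure used in the paper.

```python
from statistics import mean
from scipy.stats import mannwhitneyu, spearmanr

def superiority(a_trials, b_trials, alpha=0.05):
    """+1 if fuzzer A is significantly better than fuzzer B, -1 if significantly
    worse, 0 if the difference is not significant at level alpha."""
    _, p = mannwhitneyu(a_trials, b_trials, alternative="two-sided")
    if p >= alpha:
        return 0
    return +1 if mean(a_trials) > mean(b_trials) else -1

def agreement(coverage_pairs, bug_pairs, alpha=0.05):
    """Each list holds one (A_trials, B_trials) tuple per fuzzer pair and
    benchmark. Returns the disagreement proportion d and Spearman's rho
    between the coverage-based and bug-based superiority outcomes."""
    cov = [superiority(a, b, alpha) for a, b in coverage_pairs]
    bug = [superiority(a, b, alpha) for a, b in bug_pairs]
    d = sum(c != g for c, g in zip(cov, bug)) / len(cov)
    rho, _ = spearmanr(cov, bug)
    return d, rho
```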
Threats to Validity
Threats to Validity: Campaign Length
• Maybe 23 hours are too short (or too long) to expect an agreement.
• Does agreement increase as campaign length increases? Not really.
Threats to Validity: Number of Trials
• Maybe 20 trials are too few to expect an agreement.
• Does agreement increase as the number of trials increases? Looking good: 20 trials are fine.
Threats to Validity: Generality (#Subjects)
Result Summary
• We observe a moderate agreement on superiority or ranking.
• Only if we require differences in coverage *and* bug finding to be highly statistically significant do we observe a strong agreement.
• You can substitute coverage for bug finding only with moderate reliability.
Benchmarking: Challenges of Bug-Based Benchmarking
Reviewers B & C: Why not use a ground truth for your bug-based benchmarking?
• Golden evaluation
  • Choose a random, representative sample of programs and fuzz them.
  • Problem: (Un)fortunately, bugs are very sparse. No statistical power.
• Mutation-based evaluation
  • Inject synthetic bugs into a random, representative sample of programs.
  • More economical. We know many bugs can be found.
  • Problem: Are synthetic bugs representative of real bugs?
• Ground-truth-based evaluation
  • Curate real bugs in a random, representative sample of programs.
  • Economical, realistic bugs, objective ground truth.
  • Problem: many potential sources of bias.
    1. Survivorship bias
       • Fuzzers that are better at finding previously undiscovered bugs appear worse.
       • Fuzzers that contributed to the original discovery appear better.
    2. Experimenter bias
       • Porting bugs to one version makes the benchmark more economical, but also potentially introduces bug masking and interaction effects.
    3. Observer-expectancy bias
       • Manually translating each bug into an if-statement representing the bug-trigger condition simplifies bug counting and provides the same bug oracle to all fuzzers, but it enforces a relationship between coverage (of the if-body) and bug finding.
    4. Confirmation bias
       • Given a ground-truth benchmark, researchers might be enticed to iteratively and unknowingly tune their fuzzer to the benchmark.
• Post hoc bug-based evaluation
  • Maximize bug probability in a random, representative sample of programs.
  • Identify and deduplicate bugs *after* the fuzzing campaign. Minimizes bias.
  • Problem: less economical (we did not find bugs in 7/24 [~30%] of the programs).
Discussion Summary
• Bug-based benchmarking is not easy to get right, either!
• There are many pitfalls in sound fuzzer benchmarking.
Benchmarking: Recommendations for Fuzzer Benchmarking
Reviewer C (meta-review, paraphrasing): “In this role of informing the experimental design of future fuzzing research, it is important to describe other efforts that call for a more holistic fuzzer evaluation.”
So, we synthesized a set of recommendations from previous work, our results, and our own experience.
• Select ≥10 representative programs. Repeat each experiment ≥10×.
  Increasing these values improves generality and statistical power.
• Select “real-world programs” that are typically fuzzed in practice.
  Increasing representativeness improves generality.
  If experiment costs are a concern, prioritize likely more buggy programs.
• Select a baseline that was extended to implement the technique.
  Ensure equivalent conditions (CLI parameters, initial seeds, ...).
  Improves construct validity and allows improvements to be attributed precisely.
  (Optional) Compare against the state of the art. Note improvements due to engineering differences.
• Consider using a “training set” during fuzzer development and a “validation set” (e.g., a benchmarking platform) for the evaluation.
  Reduces overfitting and mitigates confirmation bias.
• Perform a statistical analysis of effect size and significance (see the sketch after this list).
  Allows us to assess the magnitude of the difference and the degree to which the differences are explained by randomness.
• Measure & report coverage- and bug-based metrics.
  Use the same measurement tooling & procedure across all fuzzers.
  Improves construct validity.
• Discuss potential threats to validity and your strategy to mitigate them.
  Helps the reader assess the validity of the key claims and results.
• Ensure experiments are reproducible.
  Publish tool, benchmark, data, and analysis. Report specific experiment parameters.
  Reproducibility is the foundation of sound scientific progress.
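To make the effect-size recommendation concrete, here is a minimal sketch. The Vargha-Delaney Â12 statistic and the Mann-Whitney U test are common choices in fuzzer evaluations (not a prescription from this talk), and the per-trial coverage numbers are hypothetical.

```python
from scipy.stats import mannwhitneyu

def vargha_delaney_a12(a, b):
    """A12: probability that a random trial of fuzzer A yields a larger
    measurement than a random trial of fuzzer B (0.5 = no difference)."""
    greater = sum(x > y for x in a for y in b)
    ties = sum(x == y for x in a for y in b)
    return (greater + 0.5 * ties) / (len(a) * len(b))

# Hypothetical per-trial branch coverage of two fuzzers on one program.
cov_a = [14210, 14388, 14102, 14501, 14399]
cov_b = [13980, 14050, 13899, 14120, 14011]

a12 = vargha_delaney_a12(cov_a, cov_b)                      # effect size
_, p = mannwhitneyu(cov_a, cov_b, alternative="two-sided")  # significance
print(f"A12 = {a12:.2f}, p = {p:.4f}")
```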
Big Picture Conclusion
• CS graduates need better training in statistical and empirical methods.
  • Learn about different statistical instruments to investigate empirical questions, different sources of bias and threats to validity (what can go wrong), and sound experiment design (how to do it right).
  • In research, we focus on a paper’s claim, and not enough on the claim’s validation.
  • In practice, we also make claims about our systems that need validation.
• The CS research community needs more focus on evaluation standards.
  • Publication bias & author bias: too much focus on the results.
  • Investigate the soundness of our experimental designs.
Benchmarking: Recommendations for Fuzzer Benchmarking
1. Select ≥10 representative programs. Repeat each experiment ≥10×.
2. Select “real-world programs”.
3. Select a fair baseline.
4. Use a training and a validation set.
5. Use a statistical analysis of effect size and significance.
6. Measure & report coverage- and bug-based metrics.
7. Discuss potential threats to validity.
8. Ensure reproducibility.
Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking
• We observe a moderate agreement on superiority or ranking.

• Only if we require differences in coverage *and* bug finding to 

be highly statistically significant, we observe a strong agreement.
You can substitute

coverage for bug finding

only with moderate reliability.
Result Summary

More Related Content

What's hot

Développement sécurisé
Développement sécuriséDéveloppement sécurisé
Développement sécurisé
facemeshfacemesh
 
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure SuccessAppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
Robert Grupe, CSSLP CISSP PE PMP
 
Antivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methodsAntivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methods
Somanath Kavalase
 
DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019
NotSoSecure Global Services
 
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Matt Tesauro
 
Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1
Mohammed A. Imran
 
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation ExercisesPurple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Jorge Orchilles
 
Threat modelling(system + enterprise)
Threat modelling(system + enterprise)Threat modelling(system + enterprise)
Threat modelling(system + enterprise)
abhimanyubhogwan
 
DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.
Matt Tesauro
 
Building an InfoSec RedTeam
Building an InfoSec RedTeamBuilding an InfoSec RedTeam
Building an InfoSec RedTeam
Dan Vasile
 
Meet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programsMeet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programs
HackerOne
 
SC conference - Building AppSec Teams
SC conference  - Building AppSec TeamsSC conference  - Building AppSec Teams
SC conference - Building AppSec Teams
Dinis Cruz
 
Guided Path to DevOps Career.
Guided Path to DevOps Career.Guided Path to DevOps Career.
Guided Path to DevOps Career.
wahabwelcome
 
Threat Modeling Everything
Threat Modeling EverythingThreat Modeling Everything
Threat Modeling Everything
Anne Oikarinen
 
Saying Hello to Bug Bounty
Saying Hello to Bug BountySaying Hello to Bug Bounty
Saying Hello to Bug Bounty
Null Bhubaneswar
 
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
EuFMD
 
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014 Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014 Patrick Leclerc
 
Basic Security Concepts of Computer
Basic Security Concepts of ComputerBasic Security Concepts of Computer
Basic Security Concepts of Computer
Faizan Janjua
 
Computer security
Computer securityComputer security
Computer security
fiza1975
 
Cyber security and demonstration of security tools
Cyber security and demonstration of security toolsCyber security and demonstration of security tools
Cyber security and demonstration of security tools
Vicky Fernandes
 

What's hot (20)

Développement sécurisé
Développement sécuriséDéveloppement sécurisé
Développement sécurisé
 
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure SuccessAppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
 
Antivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methodsAntivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methods
 
DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019
 
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
 
Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1
 
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation ExercisesPurple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
 
Threat modelling(system + enterprise)
Threat modelling(system + enterprise)Threat modelling(system + enterprise)
Threat modelling(system + enterprise)
 
DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.
 
Building an InfoSec RedTeam
Building an InfoSec RedTeamBuilding an InfoSec RedTeam
Building an InfoSec RedTeam
 
Meet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programsMeet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programs
 
SC conference - Building AppSec Teams
SC conference  - Building AppSec TeamsSC conference  - Building AppSec Teams
SC conference - Building AppSec Teams
 
Guided Path to DevOps Career.
Guided Path to DevOps Career.Guided Path to DevOps Career.
Guided Path to DevOps Career.
 
Threat Modeling Everything
Threat Modeling EverythingThreat Modeling Everything
Threat Modeling Everything
 
Saying Hello to Bug Bounty
Saying Hello to Bug BountySaying Hello to Bug Bounty
Saying Hello to Bug Bounty
 
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
 
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014 Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
 
Basic Security Concepts of Computer
Basic Security Concepts of ComputerBasic Security Concepts of Computer
Basic Security Concepts of Computer
 
Computer security
Computer securityComputer security
Computer security
 
Cyber security and demonstration of security tools
Cyber security and demonstration of security toolsCyber security and demonstration of security tools
Cyber security and demonstration of security tools
 

More from mboehme

An Implementation of Preregistration
An Implementation of PreregistrationAn Implementation of Preregistration
An Implementation of Preregistration
mboehme
 
Statistical Reasoning About Programs
Statistical Reasoning About ProgramsStatistical Reasoning About Programs
Statistical Reasoning About Programs
mboehme
 
The Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software TestingThe Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software Testing
mboehme
 
On the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of FuzzingOn the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of Fuzzing
mboehme
 
Foundations Of Software Testing
Foundations Of Software TestingFoundations Of Software Testing
Foundations Of Software Testing
mboehme
 
DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)
mboehme
 
Fuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability DiscoveryFuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability Discovery
mboehme
 
Fuzzing: Challenges and Reflections
Fuzzing: Challenges and ReflectionsFuzzing: Challenges and Reflections
Fuzzing: Challenges and Reflections
mboehme
 
AFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox FuzzingAFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox Fuzzing
mboehme
 
NUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU DresdenNUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU Dresden
mboehme
 

More from mboehme (10)

An Implementation of Preregistration
An Implementation of PreregistrationAn Implementation of Preregistration
An Implementation of Preregistration
 
Statistical Reasoning About Programs
Statistical Reasoning About ProgramsStatistical Reasoning About Programs
Statistical Reasoning About Programs
 
The Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software TestingThe Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software Testing
 
On the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of FuzzingOn the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of Fuzzing
 
Foundations Of Software Testing
Foundations Of Software TestingFoundations Of Software Testing
Foundations Of Software Testing
 
DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)
 
Fuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability DiscoveryFuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability Discovery
 
Fuzzing: Challenges and Reflections
Fuzzing: Challenges and ReflectionsFuzzing: Challenges and Reflections
Fuzzing: Challenges and Reflections
 
AFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox FuzzingAFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox Fuzzing
 
NUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU DresdenNUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU Dresden
 

Recently uploaded

Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
20240609 QFM020 Irresponsible AI Reading List May 2024
On the Reliability of Coverage-based Fuzzer Benchmarking

  • 1. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google
  • 2. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking whoami Marcel Böhme Foundations of Software Security @ MPI-SP (Max Planck Institute for Security and Privacy) Looking for PhD & PostDocs
 at Max Planck Institute
 Bochum, Germany • Fuzzing for Automatic Vulnerability Discovery • Making machines attack other machines. • Focus on scalability, efficiency, and effectiveness. • Foundations of Software Security • Assurances in Software Security • Fundamental limitations of existing approaches • Drawing from multiple disciplines (information theory, biostatistics) 10 yrs 
 Singapore 3 yrs 
 Melbourne since Aug’22
 Bochum
  • 3. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better?
  • 4. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage!
  • 5. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage We measure code coverage!
  • 6. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14
  • 7. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 This is called “correlation”.
  • 8. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 This is called “correlation”. • Observation: Test suites with 
 more coverage find more bugs 
 only because they are bigger.
  • 9. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 This is called “correlation”.
  • 10. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 • Observation: Test suites with 
 more coverage find more bugs 
 irrespective of whether they are bigger. This is called “correlation”.
  • 11. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 • Observation: Test suites with 
 more coverage find more bugs 
 irrespective of whether they are bigger. This is called “correlation”. This is called “contradiction”.
  • 12. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ASE’20 This is called “correlation”.
  • 13. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship? Motivation: Coverage ASE’20 This is called “correlation”.
  • 14. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
  • 15. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
  • 16. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
  • 17. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
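The correlation claim can be made concrete with a small computation. Below is a minimal sketch, not the paper's actual analysis code; the data values and variable names are purely illustrative. Given, for each campaign, the final coverage and the number of distinct bugs found, the association can be quantified with Pearson's r and Spearman's ρ:

    from scipy.stats import pearsonr, spearmanr

    # Illustrative per-campaign results: final branch coverage and distinct bugs found.
    coverage = [1200, 1350, 1500, 1650, 1800, 2100]
    bugs     = [   1,    2,    2,    4,    5,    7]

    r, _   = pearsonr(coverage, bugs)    # strength of the linear association
    rho, _ = spearmanr(coverage, bugs)   # strength of the monotonic (rank) association
    print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")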
  • 18. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong “It does not make sense 🤔”
 paraphrasing Klees et al., CCS’18
  • 19. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong We cannot compare 
 two or more fuzzers 
 in terms of coverage 
 in order to establish 
 one as the best fuzzer
 in terms of bug finding. “It does not make sense 🤔”
 paraphrasing Klees et al., CCS’18
  • 20. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong CCS’18 “It does not make sense 🤔”
 paraphrasing Klees et al., CCS’18
  • 21. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong CCS’18
  • 22. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong Why?
  • 23. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why.
  • 24. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why.
  • 25. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why.
  • 26. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Suppose we have 
 • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values.
  • 27. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values.
  • 28. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values. • Weak agreement: • Both instruments might rank 2+ tubes differently.
  • 29. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values. • Weak agreement: • Both instruments might rank 2+ tubes differently. Moderate agreement means 
 we cannot reliably substitute 
 one instrument for the other.
  • 30. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. Moderate agreement means 
 we cannot reliably substitute 
 one instrument for the other. Ranking 10 fuzzers 
 in terms of code coverage and
 in terms of #bugs found.

  • 31. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. The worst fuzzer in terms of coverage is 
 the best fuzzer in terms of bug finding. Ranking 10 fuzzers 
 in terms of code coverage and
 in terms of #bugs found.
 Moderate agreement means 
 we cannot reliably substitute 
 one instrument for the other.
  • 32. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Experimental Design (post hoc ground truth) • To minimize threats to validity, we use post hoc bug identification instead of a pre-determined ground truth benchmark (more on that later). • Automatic and manual deduplication of bugs found during fuzzing. • 341,595 generated bug reports across all campaigns. • 409 unique bugs after automatic deduplication (via a variant of ClusterFuzz). • 235 unique bugs after manual deduplication (via two professional software engineers). Experimental Setup
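For readers unfamiliar with crash deduplication: tools in the ClusterFuzz family typically bucket crash reports by a signature derived from the crash type and the top stack frames. The sketch below is only a simplified, hypothetical illustration of that idea; the frame count, frame filtering, and report format are assumptions, not the variant actually used in the study.

    import hashlib
    import re
    from collections import defaultdict

    def crash_signature(crash_type, frames, top_n=3):
        """Build a bucket key from the crash type and the top N 'interesting' frames."""
        # Heuristics: drop sanitizer/runtime frames, strip raw addresses and offsets.
        cleaned = [re.sub(r"0x[0-9a-f]+|\+\d+", "", f) for f in frames
                   if not f.startswith(("__asan", "__sanitizer", "libc"))]
        key = crash_type + "|" + "|".join(cleaned[:top_n])
        return hashlib.sha1(key.encode()).hexdigest()[:12]

    def deduplicate(reports):
        """reports: iterable of (crash_type, [stack frame, ...]); returns signature -> reports."""
        buckets = defaultdict(list)
        for crash_type, frames in reports:
            buckets[crash_signature(crash_type, frames)].append((crash_type, frames))
        return buckets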
  • 33. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure
  • 34. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs
  • 35. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs • Benchmark selection • Randomly selected from 
 OSS-Fuzz (500+ programs). • Higher selection probability for
 programs with historically more
 bugs (for economic reasons).
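The selection procedure described here, random but with selection probability weighted by historical bug counts, amounts to weighted sampling without replacement. A minimal sketch under assumed data; the project names and bug counts below are made up for illustration:

    import numpy as np

    # Hypothetical OSS-Fuzz projects and their (made-up) historical bug counts.
    programs   = ["libxml2", "openssl", "sqlite3", "freetype2", "zlib"]
    bug_counts = np.array([120, 300, 80, 150, 10], dtype=float)

    rng = np.random.default_rng(seed=0)
    weights = bug_counts / bug_counts.sum()   # selection probability proportional to history
    benchmark = rng.choice(programs, size=3, replace=False, p=weights)
    print(benchmark)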
  • 36. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs • Benchmark selection • Randomly selected from 
 OSS-Fuzz (500+ programs). • Higher selection probability for
 programs with historically more
 bugs (for economic reasons). • Reproducibility
  • 37. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs • Benchmark selection • Randomly selected from 
 OSS-Fuzz (500+ programs). • Higher selection probability for
 programs with historically more
 bugs (for economic reasons). • Reproducibility 10 fuzzers x 24 programs 
 x 20 campaigns x 23 hours 
 >13 CPU years
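For scale, a back-of-the-envelope check: 10 fuzzers × 24 programs × 20 campaigns × 23 hours = 110,400 CPU-hours, i.e., roughly 12.6 CPU-years of pure fuzzing time before any per-trial setup or measurement overhead.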
  • 38. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding
  • 39. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding
  • 40. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding Agreement on superiority comparing 2 fuzzers
 in terms of code coverage and #bugs found,
 when the difference is statistically significant at p.
  • 41. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding Agreement on superiority comparing 2 fuzzers
 in terms of code coverage and #bugs found,
 when the difference is statistically significant at p. Strong agreement for p <= 0.0001
 for both: coverage and bug finding.
  • 42. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding However, if we only require the difference in terms of coverage 
 (which we can observe) to be statistically significant: Weak agreement.
  • 43. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding We also provide two other measures of agreement on superiority:
 disagreement proportion d and Spearman’s ρ (+1 superior, 0 not significant, -1 inferior)
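To make these agreement measures concrete, here is a minimal sketch of how such an analysis could be set up; it is illustrative only, and the significance threshold, the direction test, and the data layout are assumptions, not the paper's analysis code. For each benchmark, a pair of fuzzers is labelled +1/0/-1 once from the coverage samples and once from the bug-count samples, and the two label vectors are then compared:

    from statistics import median
    from scipy.stats import mannwhitneyu, spearmanr

    def superiority(a, b, alpha=0.05):
        """+1 if fuzzer A is significantly better than B, -1 if worse, 0 if not significant."""
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        if p >= alpha:
            return 0
        return 1 if median(a) > median(b) else -1

    def agreement(cov_pairs, bug_pairs, alpha=0.05):
        """cov_pairs/bug_pairs: per-benchmark (samples of fuzzer A, samples of fuzzer B)."""
        cov = [superiority(a, b, alpha) for a, b in cov_pairs]
        bug = [superiority(a, b, alpha) for a, b in bug_pairs]
        disagreement = sum(c != g for c, g in zip(cov, bug)) / len(cov)
        rho, _ = spearmanr(cov, bug)   # rank correlation of the two superiority labels
        return disagreement, rho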
  • 44. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity
  • 45. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Campaign Length Does agreement increase 
 as campaign length increases? Maybe 23 hours are too short
 to expect an agreement.
  • 46. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Campaign Length Does agreement increase 
 as campaign length increases? Not really.
  • 47. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Maybe 23 hours are too short or too long 
 to expect an agreement. 
 How does agreement change 
 as campaign length increases? Threats to Validity: Campaign Length
  • 48. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Does agreement increase
 as the number of trials increases? Maybe 20 trials are too few
 to expect an agreement. Threats to Validity: Number of Trials
  • 49. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Looking good! Threats to Validity: Number of Trials
  • 50. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Number of Trials 20 trials
 are fine.
  • 51. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Generality (#Subjects)
  • 52. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Generality (#Subjects)
  • 53. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Result Summary
  • 54. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 55. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking Reviewers B&C: Why not use a ground-truth 
 for your bug-based benchmarking?
  • 56. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Golden evaluation • Choose a random, representative sample of programs and fuzz them. • Problem: (Un)fortunately, bugs are very sparse. No statistical power.
  • 57. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Golden evaluation • Choose a random, representative sample of programs and fuzz them. • Problem: (Un)fortunately, bugs are very sparse. No statistical power. • Mutation-based evaluation • Inject synthetic bugs into a random, representative sample of programs • More economical. We know many bugs can be found. • Problem: Are synthetic bugs representative of real bugs?
  • 58. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth.
  • 59. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias • Fuzzers that are better at finding previously
 undiscovered bugs appear worse • Fuzzers that contributed to the original
 discovery appear better
  • 60. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias • Porting bugs to one version makes it more economical, 
 but also potentially introduces bug masking and interaction effects.
  • 61. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias • Manual translation to an if-statement representing the bug-trigger condition
 simplifies bug counting and provides the same bug oracle to all fuzzers,
 but it enforces a relationship between coverage (of the if-body) and bug finding.
  • 62. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark.
  • 63. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: Many potential sources of bias. • Post hoc bug based evaluation • Maximize bug probability in a random, representative sample of programs. • Identify and deduplicate bugs *after* the fuzzing campaign. Minimizes bias. • Problem: Less economical (we did not find bugs in 7/24 [30%] programs).
  • 64. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Discussion Summary Bug-based benchmarking
 is not easy to get
 right, either!
  • 65. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking There are many pitfalls in 
 sound fuzzer benchmarking. Discussion Summary
  • 66. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Reviewer C (meta-review): In this role of informing the experimental design of 
 future fuzzing research, it is important to describe other
 efforts that call for a more holistic fuzzer evaluation. Benchmarking: Recommendations for Fuzzer Benchmarking *paraphrasing*
  • 67. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Reviewer C (meta-review): In this role of informing the experimental design of 
 future fuzzing research, it is important to describe other
 efforts that call for a more holistic fuzzer evaluation. Benchmarking: Recommendations for Fuzzer Benchmarking *paraphrasing* So, we synthesized a set of recommendations 
 from previous work, our results, and our own experience.
  • 68. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Recommendations for Fuzzer Benchmarking • Select ≥10 representative programs. Repeat each experiment ≥ 10x.
 Increasing these values improves generality and statistical power. • 

  • 69. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Recommendations for Fuzzer Benchmarking • Select ≥10 representative programs. Repeat each experiment ≥ 10x.
 Increasing these values improves generality and statistical power. • Select “real-world programs” that are typically fuzzed in practice.
 Increasing representativeness improves generality.
 If experiment costs are a concern, prioritize programs likely to contain more bugs.

  • 70. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Recommendations for Fuzzer Benchmarking • Select ≥10 representative programs. Repeat each experiment ≥ 10x.
 Increasing these values improves generality and statistical power. • Select “real-world programs” that are typically fuzzed in practice.
 Increasing representativeness improves generality.
 If experiment costs are a concern, prioritize programs likely to contain more bugs. • Select a baseline that was extended to implement the technique.
 Ensure equivalent conditions (CLI parameters, initial seeds, ...).
 Improves construct validity and allows improvements to be attributed precisely.
 (Optional) Comparison to SOTA. Note improvements due to engineering differences.

  • 71. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Consider using a “training set” during fuzzer development
 and a “validation set” (e.g., benchmarking platform) for evaluation.
 Reduces overfitting and mitigates confirmation bias. 
 Benchmarking: Recommendations for Fuzzer Benchmarking
  • 72. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Consider using a “training set” during fuzzer development
 and a “validation set” (e.g., benchmarking platform) for evaluation.
 Reduces overfitting and mitigates confirmation bias. • Statistical analysis of effect size and significance
 Allows assessing the magnitude of the difference and the degree to which
 the differences are explained by randomness.
 
 Benchmarking: Recommendations for Fuzzer Benchmarking
  • 73. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Consider using a “training set” during fuzzer development
 and a “validation set” (e.g., benchmarking platform) for evaluation.
 Reduces overfitting and mitigates confirmation bias. • Statistical analysis of effect size and significance
 Allows assessing the magnitude of the difference and the degree to which
 the differences are explained by randomness. • Measure & report coverage and bug-based metrics.
 Use the same measurement tooling & procedure across all fuzzers.
 Improves construct validity.
 
 Benchmarking: Recommendations for Fuzzer Benchmarking
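One common way to follow the effect-size-plus-significance recommendation is to pair a Mann-Whitney U test with the Vargha-Delaney Â12 effect size. The sketch below uses made-up per-trial coverage numbers, and this particular pairing is one conventional choice rather than something the slides prescribe:

    from itertools import product
    from scipy.stats import mannwhitneyu

    def vargha_delaney_a12(xs, ys):
        """P(random trial of X beats random trial of Y); 0.5 means no effect."""
        wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
                   for x, y in product(xs, ys))
        return wins / (len(xs) * len(ys))

    # Illustrative per-trial branch coverage for two fuzzers (e.g., 20 trials each in practice).
    fuzzer_x = [1510, 1490, 1532, 1505, 1498]
    fuzzer_y = [1470, 1465, 1480, 1475, 1460]

    a12 = vargha_delaney_a12(fuzzer_x, fuzzer_y)
    _, p = mannwhitneyu(fuzzer_x, fuzzer_y, alternative="two-sided")
    print(f"A12 = {a12:.2f} (effect size), p = {p:.4f} (significance)")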
  • 74. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Discuss potential threats to validity and your strategy to mitigate them.
 Helps the reader assess the validity of the key claims and results. Benchmarking: Recommendations for Fuzzer Benchmarking
  • 75. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Discuss potential threats to validity and your strategy to mitigate them.
 Helps the reader assess the validity of the key claims and results. • Ensure experiments are reproducible
 Publish tool, benchmark, data, and analysis. Report specific experiment parameters.
 Reproducibility is the foundation of sound scientific progress. Benchmarking: Recommendations for Fuzzer Benchmarking
  • 76. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google
  • 77. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage!
  • 78. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 79. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark.
  • 80. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Big Picture Conclusion: • CS graduates need better training in statistical and empirical methods. • Learn about different statistical instruments to investigate empirical questions,
 different sources of bias and threats to validity (what can go wrong), and
 sound experiment design (how to do it right) • In research, we focus on a paper’s claim, and not enough on the claim’s validation. • In practice, we also make claims about our system that need validation. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 81. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Big Picture Conclusion: • CS graduates need better training in statistical and empirical methods. • Learn about different statistical instruments to investigate empirical questions,
 different sources of bias and threats to validity (what can go wrong), and
 sound experiment design (how to do it right) • In research, we focus on a paper’s claim, and not enough on the claim’s validation. • In practice, we also make claims about our system that need validation. • CS research community needs more focus on evaluation standards. • Publication bias & Author bias: Too much focus on the results • Investigate soundness of our experimental designs 
 Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 82. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Big Picture Conclusion: • CS graduates need better training in statistical and empirical methods. • Learn about different statistical instruments to investigate empirical questions,
 different sources of bias and threats to validity (what can go wrong), and
 sound experiment design (how to do it right) • In research, we focus on a paper’s claim, and not enough on the claim’s validation. • In practice, we also make claims about our system that need validation. • CS research community needs more focus on evaluation standards. • Publication bias & Author bias: Too much focus on the results • Investigate soundness of our experimental designs 
 Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking 1. Select ≥10 representative programs. Repeat each experiment ≥ 10x. 2. Select “real-world programs”. 3. Select a fair baseline. 4. Use a training and a validation set. 5. Use a statistical analysis of effect size and significance. 6. Measure & report coverage and bug-based metrics. 7. Discuss potential threats to validity. 8. Ensure reproducibility. Benchmarking: Recommendations for Fuzzer Benchmarking Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary