On the Reliability of Coverage-Based Fuzzer Benchmarking
Marcel Böhme, MPI-SP & Monash
László Szekeres, Google
Jonathan Metzman, Google
whoami
Marcel Böhme
Foundations of Software Security @ MPI-SP (Max Planck Institute for Security and Privacy)
Looking for PhDs & PostDocs at the Max Planck Institute, Bochum, Germany
• Fuzzing for Automatic Vulnerability Discovery
  • Making machines attack other machines.
  • Focus on scalability, efficiency, and effectiveness.
• Foundations of Software Security
  • Assurances in Software Security
  • Fundamental limitations of existing approaches
  • Drawing from multiple disciplines (information theory, biostatistics)
10 yrs Singapore · 3 yrs Melbourne · since Aug’22 Bochum
Motivation
Suppose none of our fuzzers finds any bugs in our program.
How do we know which fuzzer is better?
We measure code coverage!
Motivation: Coverage (ICSE’14, ASE’20)
• Key Idea: You cannot find bugs in code that is not covered.
• Question: How strong is the relationship between coverage and bug finding?
  This is called “correlation”.
• Observation: Test suites with more coverage find more bugs only because they are bigger.
• Observation: Test suites with more coverage find more bugs irrespective of whether they are bigger.
  This is called “contradiction”.
Correlation: Very strong
• Our experiments confirm a very strong correlation for fuzzer-generated test suites!
• As a fuzzer covers more code, it also finds more bugs.
• Problem: Fuzzing folks are not convinced.
  “It does not make sense 🤔” (paraphrasing Klees et al., CCS’18)
  We cannot compare two or more fuzzers in terms of coverage in order to establish one as the best fuzzer in terms of bug finding.
• Why?
Agreement: That’s why.
• Suppose we have two instruments to measure acidity.
• Strong correlation: more acidity means both instruments consistently indicate lower pH readings.
• Weak agreement: the two instruments might still rank 2+ tubes differently.
• Moderate agreement means we cannot reliably substitute one instrument for the other.
• Ranking 10 fuzzers in terms of code coverage and in terms of #bugs found:
  the worst fuzzer in terms of coverage is the best fuzzer in terms of bug finding.
Experimental Setup
• Experimental design (post hoc ground truth)
  • To minimize threats to validity, we use post hoc bug identification instead of a pre-determined ground-truth benchmark (more on that later).
  • Automatic and manual deduplication of bugs found during fuzzing (a simplified sketch follows below).
  • 341,595 generated bug reports across all campaigns.
  • 409 unique bugs after automatic deduplication (via a variant of ClusterFuzz).
  • 235 unique bugs after manual deduplication (by two professional software engineers).
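For illustration only: a minimal sketch of automatic crash deduplication by stack-trace signature. This is a common heuristic and a simplification; it is not the ClusterFuzz variant actually used in the study, and all names in it are hypothetical.

```python
import hashlib

def crash_signature(stack_trace: str, top_frames: int = 3) -> str:
    """Reduce a symbolized crash stack trace to a signature built from its
    top frames; reports with the same signature are treated as duplicates."""
    frames = [line.strip() for line in stack_trace.splitlines() if line.strip()]
    return hashlib.sha1("|".join(frames[:top_frames]).encode()).hexdigest()

def deduplicate(reports: list[str]) -> list[str]:
    """Keep one representative crash report per signature."""
    unique: dict[str, str] = {}
    for report in reports:
        unique.setdefault(crash_signature(report), report)
    return list(unique.values())
```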
Experimental Setup
• Fuzzers and programs
  • FuzzBench infrastructure
  • 10 fuzzers + 24 programs
• Benchmark selection
  • Randomly selected from OSS-Fuzz (500+ programs).
  • Higher selection probability for programs with historically more bugs (for economic reasons).
• Reproducibility
  • 10 fuzzers × 24 programs × 20 campaigns × 23 hours: >13 CPU years
Agreement: Coverage vs Bug Finding
• Agreement on superiority when comparing 2 fuzzers in terms of code coverage and #bugs found, when the difference is statistically significant at level p.
• Strong agreement for p ≤ 0.0001 for both coverage and bug finding.
• However, if we only require the difference in terms of coverage (which we can observe) to be statistically significant: weak agreement.
• We also provide two other measures of agreement on superiority: the disagreement proportion d and Spearman’s ρ over superiority outcomes (+1 superior, 0 not significant, −1 inferior). A computational sketch follows below.
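A minimal sketch of how such superiority outcomes and the two agreement measures could be computed, assuming per-trial coverage and bug counts are available as lists. The Mann-Whitney U test and the mean-based direction check are assumptions chosen for illustration, not necessarily the exact procedure used in the paper.

```python
from statistics import mean
from scipy.stats import mannwhitneyu, spearmanr

def superiority(a_trials, b_trials, alpha=0.05):
    """+1 if fuzzer A is significantly better than fuzzer B, -1 if significantly
    worse, 0 if the difference is not significant at level alpha."""
    _, p = mannwhitneyu(a_trials, b_trials, alternative="two-sided")
    if p >= alpha:
        return 0
    return +1 if mean(a_trials) > mean(b_trials) else -1

def agreement(coverage_pairs, bug_pairs, alpha=0.05):
    """Each list holds one (A_trials, B_trials) tuple per fuzzer pair and
    benchmark. Returns the disagreement proportion d and Spearman's rho
    between the coverage-based and bug-based superiority outcomes."""
    cov = [superiority(a, b, alpha) for a, b in coverage_pairs]
    bug = [superiority(a, b, alpha) for a, b in bug_pairs]
    d = sum(c != g for c, g in zip(cov, bug)) / len(cov)
    rho, _ = spearmanr(cov, bug)
    return d, rho
```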
Threats to Validity
Threats to Validity: Campaign Length
• Maybe 23 hours are too short (or too long) to expect an agreement.
• Does agreement increase as campaign length increases? Not really.
Threats to Validity: Number of Trials
• Maybe 20 trials are too few to expect an agreement.
• Does agreement increase as the number of trials increases? Looking good: 20 trials are fine.
Threats to Validity: Generality (#Subjects)
Result Summary
• We observe a moderate agreement on superiority or ranking.
• Only if we require differences in coverage *and* bug finding to be highly statistically significant do we observe a strong agreement.
• You can substitute coverage for bug finding only with moderate reliability.
Benchmarking: Challenges of Bug-Based Benchmarking
Reviewers B & C: Why not use a ground truth for your bug-based benchmarking?
• Golden evaluation
  • Choose a random, representative sample of programs and fuzz them.
  • Problem: (Un)fortunately, bugs are very sparse. No statistical power.
• Mutation-based evaluation
  • Inject synthetic bugs into a random, representative sample of programs.
  • More economical. We know many bugs can be found.
  • Problem: Are synthetic bugs representative of real bugs?
• Ground-truth-based evaluation
  • Curate real bugs in a random, representative sample of programs.
  • Economical, realistic bugs, objective ground truth.
  • Problem: many potential sources of bias.
    1. Survivorship bias
       • Fuzzers that are better at finding previously undiscovered bugs appear worse.
       • Fuzzers that contributed to the original discovery appear better.
    2. Experimenter bias
       • Porting bugs to one version makes the benchmark more economical, but also potentially introduces bug masking and interaction effects.
    3. Observer-expectancy bias
       • Manually translating each bug into an if-statement representing the bug-trigger condition simplifies bug counting and provides the same bug oracle to all fuzzers, but it enforces a relationship between coverage (of the if-body) and bug finding.
    4. Confirmation bias
       • Given a ground-truth benchmark, researchers might be enticed to iteratively and unknowingly tune their fuzzer to the benchmark.
• Post hoc bug-based evaluation
  • Maximize bug probability in a random, representative sample of programs.
  • Identify and deduplicate bugs *after* the fuzzing campaign. Minimizes bias.
  • Problem: less economical (we did not find bugs in 7/24 [~30%] of the programs).
Discussion Summary
• Bug-based benchmarking is not easy to get right, either!
• There are many pitfalls in sound fuzzer benchmarking.
Benchmarking: Recommendations for Fuzzer Benchmarking
Reviewer C (meta-review, paraphrasing): “In this role of informing the experimental design of future fuzzing research, it is important to describe other efforts that call for a more holistic fuzzer evaluation.”
So, we synthesized a set of recommendations from previous work, our results, and our own experience.
• Select ≥10 representative programs. Repeat each experiment ≥10×.
  Increasing these values improves generality and statistical power.
• Select “real-world programs” that are typically fuzzed in practice.
  Increasing representativeness improves generality.
  If experiment costs are a concern, prioritize likely more buggy programs.
• Select a baseline that was extended to implement the technique.
  Ensure equivalent conditions (CLI parameters, initial seeds, ...).
  Improves construct validity and allows improvements to be attributed precisely.
  (Optional) Compare against the state of the art. Note improvements due to engineering differences.
• Consider using a “training set” during fuzzer development and a “validation set” (e.g., a benchmarking platform) for the evaluation.
  Reduces overfitting and mitigates confirmation bias.
• Perform a statistical analysis of effect size and significance (see the sketch after this list).
  Allows us to assess the magnitude of the difference and the degree to which the differences are explained by randomness.
• Measure & report coverage- and bug-based metrics.
  Use the same measurement tooling & procedure across all fuzzers.
  Improves construct validity.
• Discuss potential threats to validity and your strategy to mitigate them.
  Helps the reader assess the validity of the key claims and results.
• Ensure experiments are reproducible.
  Publish tool, benchmark, data, and analysis. Report specific experiment parameters.
  Reproducibility is the foundation of sound scientific progress.
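To make the effect-size recommendation concrete, here is a minimal sketch. The Vargha-Delaney Â12 statistic and the Mann-Whitney U test are common choices in fuzzer evaluations (not a prescription from this talk), and the per-trial coverage numbers are hypothetical.

```python
from scipy.stats import mannwhitneyu

def vargha_delaney_a12(a, b):
    """A12: probability that a random trial of fuzzer A yields a larger
    measurement than a random trial of fuzzer B (0.5 = no difference)."""
    greater = sum(x > y for x in a for y in b)
    ties = sum(x == y for x in a for y in b)
    return (greater + 0.5 * ties) / (len(a) * len(b))

# Hypothetical per-trial branch coverage of two fuzzers on one program.
cov_a = [14210, 14388, 14102, 14501, 14399]
cov_b = [13980, 14050, 13899, 14120, 14011]

a12 = vargha_delaney_a12(cov_a, cov_b)                      # effect size
_, p = mannwhitneyu(cov_a, cov_b, alternative="two-sided")  # significance
print(f"A12 = {a12:.2f}, p = {p:.4f}")
```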
Big Picture Conclusion
• CS graduates need better training in statistical and empirical methods.
  • Learn about different statistical instruments to investigate empirical questions, different sources of bias and threats to validity (what can go wrong), and sound experiment design (how to do it right).
  • In research, we focus on a paper’s claim, and not enough on the claim’s validation.
  • In practice, we also make claims about our systems that need validation.
• The CS research community needs more focus on evaluation standards.
  • Publication bias & author bias: too much focus on the results.
  • Investigate the soundness of our experimental designs.
Benchmarking: Recommendations for Fuzzer Benchmarking
1. Select ≥10 representative programs. Repeat each experiment ≥10×.
2. Select “real-world programs”.
3. Select a fair baseline.
4. Use a training and a validation set.
5. Use a statistical analysis of effect size and significance.
6. Measure & report coverage- and bug-based metrics.
7. Discuss potential threats to validity.
8. Ensure reproducibility.
Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking
• We observe a moderate agreement on superiority or ranking.

• Only if we require differences in coverage *and* bug finding to 

be highly statistically significant, we observe a strong agreement.
You can substitute

coverage for bug finding

only with moderate reliability.
Result Summary

More Related Content

What's hot

Développement sécurisé
Développement sécuriséDéveloppement sécurisé
Développement sécurisé
facemeshfacemesh
 
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure SuccessAppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
Robert Grupe, CSSLP CISSP PE PMP
 
Antivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methodsAntivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methods
Somanath Kavalase
 
DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019
NotSoSecure Global Services
 
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Matt Tesauro
 
Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1
Mohammed A. Imran
 
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation ExercisesPurple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Jorge Orchilles
 
Threat modelling(system + enterprise)
Threat modelling(system + enterprise)Threat modelling(system + enterprise)
Threat modelling(system + enterprise)
abhimanyubhogwan
 
DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.
Matt Tesauro
 
Building an InfoSec RedTeam
Building an InfoSec RedTeamBuilding an InfoSec RedTeam
Building an InfoSec RedTeam
Dan Vasile
 
Meet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programsMeet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programs
HackerOne
 
SC conference - Building AppSec Teams
SC conference  - Building AppSec TeamsSC conference  - Building AppSec Teams
SC conference - Building AppSec Teams
Dinis Cruz
 
Guided Path to DevOps Career.
Guided Path to DevOps Career.Guided Path to DevOps Career.
Guided Path to DevOps Career.
wahabwelcome
 
Threat Modeling Everything
Threat Modeling EverythingThreat Modeling Everything
Threat Modeling Everything
Anne Oikarinen
 
Saying Hello to Bug Bounty
Saying Hello to Bug BountySaying Hello to Bug Bounty
Saying Hello to Bug Bounty
Null Bhubaneswar
 
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
EuFMD
 
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014 Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014 Patrick Leclerc
 
Basic Security Concepts of Computer
Basic Security Concepts of ComputerBasic Security Concepts of Computer
Basic Security Concepts of Computer
Faizan Janjua
 
Computer security
Computer securityComputer security
Computer security
fiza1975
 
Cyber security and demonstration of security tools
Cyber security and demonstration of security toolsCyber security and demonstration of security tools
Cyber security and demonstration of security tools
Vicky Fernandes
 

What's hot (20)

Développement sécurisé
Développement sécuriséDéveloppement sécurisé
Développement sécurisé
 
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure SuccessAppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
AppSec & DevSecOps Metrics: Key Performance Indicators (KPIs) to Measure Success
 
Antivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methodsAntivirus - Virus detection and removal methods
Antivirus - Virus detection and removal methods
 
DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019
 
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
Making Continuous Security a Reality with OWASP’s AppSec Pipeline - Matt Tesa...
 
Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1
 
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation ExercisesPurple Team - Work it out: Organizing Effective Adversary Emulation Exercises
Purple Team - Work it out: Organizing Effective Adversary Emulation Exercises
 
Threat modelling(system + enterprise)
Threat modelling(system + enterprise)Threat modelling(system + enterprise)
Threat modelling(system + enterprise)
 
DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.DevSecOps Fundamentals and the Scars to Prove it.
DevSecOps Fundamentals and the Scars to Prove it.
 
Building an InfoSec RedTeam
Building an InfoSec RedTeamBuilding an InfoSec RedTeam
Building an InfoSec RedTeam
 
Meet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programsMeet the hackers powering the world's best bug bounty programs
Meet the hackers powering the world's best bug bounty programs
 
SC conference - Building AppSec Teams
SC conference  - Building AppSec TeamsSC conference  - Building AppSec Teams
SC conference - Building AppSec Teams
 
Guided Path to DevOps Career.
Guided Path to DevOps Career.Guided Path to DevOps Career.
Guided Path to DevOps Career.
 
Threat Modeling Everything
Threat Modeling EverythingThreat Modeling Everything
Threat Modeling Everything
 
Saying Hello to Bug Bounty
Saying Hello to Bug BountySaying Hello to Bug Bounty
Saying Hello to Bug Bounty
 
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
OS20 - Enhanced complete genome sequencing of foot-and-mouth disease virus us...
 
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014 Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
Securite applicative et SDLC - OWASP Quebec - 15 avril 2014
 
Basic Security Concepts of Computer
Basic Security Concepts of ComputerBasic Security Concepts of Computer
Basic Security Concepts of Computer
 
Computer security
Computer securityComputer security
Computer security
 
Cyber security and demonstration of security tools
Cyber security and demonstration of security toolsCyber security and demonstration of security tools
Cyber security and demonstration of security tools
 

More from mboehme

An Implementation of Preregistration
An Implementation of PreregistrationAn Implementation of Preregistration
An Implementation of Preregistration
mboehme
 
Statistical Reasoning About Programs
Statistical Reasoning About ProgramsStatistical Reasoning About Programs
Statistical Reasoning About Programs
mboehme
 
The Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software TestingThe Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software Testing
mboehme
 
On the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of FuzzingOn the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of Fuzzing
mboehme
 
Foundations Of Software Testing
Foundations Of Software TestingFoundations Of Software Testing
Foundations Of Software Testing
mboehme
 
DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)
mboehme
 
Fuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability DiscoveryFuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability Discovery
mboehme
 
Fuzzing: Challenges and Reflections
Fuzzing: Challenges and ReflectionsFuzzing: Challenges and Reflections
Fuzzing: Challenges and Reflections
mboehme
 
AFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox FuzzingAFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox Fuzzing
mboehme
 
NUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU DresdenNUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU Dresden
mboehme
 

More from mboehme (10)

An Implementation of Preregistration
An Implementation of PreregistrationAn Implementation of Preregistration
An Implementation of Preregistration
 
Statistical Reasoning About Programs
Statistical Reasoning About ProgramsStatistical Reasoning About Programs
Statistical Reasoning About Programs
 
The Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software TestingThe Curious Case of Fuzzing for Automated Software Testing
The Curious Case of Fuzzing for Automated Software Testing
 
On the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of FuzzingOn the Surprising Efficiency and Exponential Cost of Fuzzing
On the Surprising Efficiency and Exponential Cost of Fuzzing
 
Foundations Of Software Testing
Foundations Of Software TestingFoundations Of Software Testing
Foundations Of Software Testing
 
DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)DS3 Fuzzing Panel (M. Boehme)
DS3 Fuzzing Panel (M. Boehme)
 
Fuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability DiscoveryFuzzing: On the Exponential Cost of Vulnerability Discovery
Fuzzing: On the Exponential Cost of Vulnerability Discovery
 
Fuzzing: Challenges and Reflections
Fuzzing: Challenges and ReflectionsFuzzing: Challenges and Reflections
Fuzzing: Challenges and Reflections
 
AFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox FuzzingAFLGo: Directed Greybox Fuzzing
AFLGo: Directed Greybox Fuzzing
 
NUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU DresdenNUS SoC Graduate Outreach @ TU Dresden
NUS SoC Graduate Outreach @ TU Dresden
 

Recently uploaded

Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
20240609 QFM020 Irresponsible AI Reading List May 2024
On the Reliability of Coverage-based Fuzzer Benchmarking

  • 1. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google
  • 2. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking whoami Marcel Böhme Foundations of Software Security @ MPI-SP (Max Planck Institute for Security and Privacy) Looking for PhD & PostDocs
 at Max Planck Institute
 Bochum, Germany • Fuzzing for Automatic Vulnerability Discovery • Making machines attack other machines. • Focus on scalability, efficiency, and effectiveness. • Foundations of Software Security • Assurances in Software Security • Fundamental limitations of existing approaches • Drawing from multiple disciplines (information theory, biostatistics) 10 yrs 
 Singapore 3 yrs 
 Melbourne since Aug’22
 Bochum
  • 3. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better?
  • 4. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage!
  • 5. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage We measure code coverage!
  • 6. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14
  • 7. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 This is called “correlation”.
  • 8. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 This is called “correlation”. • Observation: Test suites with 
 more coverage find more bugs 
 only because they are bigger.
  • 9. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 This is called “correlation”.
  • 10. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 • Observation: Test suites with 
 more coverage find more bugs 
 irrespective of whether they are bigger. This is called “correlation”.
  • 11. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ICSE’14 • Observation: Test suites with 
 more coverage find more bugs 
 irrespective of whether they are bigger. This is called “correlation”. This is called “contradiction”.
  • 12. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship between coverage and bug finding? Motivation: Coverage ASE’20 This is called “correlation”.
  • 13. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Key Idea: • You cannot find bugs in code that is not covered. • Question: • How strong is the relationship? Motivation: Coverage ASE’20 This is called “correlation”.
  • 14. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
  • 15. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
  • 16. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
  • 17. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Our experiments confirm a
 very strong correlation for 
 fuzzer-generated test suites! • As a fuzzer covers more code, 
 it also finds more bugs. Correlation: Very strong
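The correlation claim can be made concrete with a small computation. Below is a minimal sketch, not the paper's actual analysis code; the data values and variable names are purely illustrative. Given, for each campaign, the final coverage and the number of distinct bugs found, the association can be quantified with Pearson's r and Spearman's ρ:

    from scipy.stats import pearsonr, spearmanr

    # Illustrative per-campaign results: final branch coverage and distinct bugs found.
    coverage = [1200, 1350, 1500, 1650, 1800, 2100]
    bugs     = [   1,    2,    2,    4,    5,    7]

    r, _   = pearsonr(coverage, bugs)    # strength of the linear association
    rho, _ = spearmanr(coverage, bugs)   # strength of the monotonic (rank) association
    print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")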
  • 18. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong “It does not make sense 🤔”
 paraphrasing Klees et al., CCS’18
  • 19. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong We cannot compare 
 two or more fuzzers 
 in terms of coverage 
 in order to establish 
 one as the best fuzzer
 in terms of bug finding. “It does not make sense 🤔”
 paraphrasing Klees et al., CCS’18
  • 20. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong CCS’18 “It does not make sense 🤔”
 paraphrasing Klees et al., CCS’18
  • 21. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong CCS’18
  • 22. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking •Problem: • Fuzzing folks are not convinced. Correlation: Very strong Why?
  • 23. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why.
  • 24. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why.
  • 25. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why.
  • 26. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Suppose we have 
 • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values.
  • 27. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values.
  • 28. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values. • Weak agreement: • Both instruments might rank 2+ tubes differently.
  • 29. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. • Two instruments to measure acidity. • Strong correlation: • More acidity = both indicate lower pH values. • Weak agreement: • Both instruments might rank 2+ tubes differently. Moderate agreement means 
 we cannot reliably substitute 
 one instrument for the other.
  • 30. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. Moderate agreement means 
 we cannot reliably substitute 
 one instrument for the other. Ranking 10 fuzzers 
 in terms of code coverage and
 in terms of #bugs found.

  • 31. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: That’s why. The worst fuzzer in terms of coverage is 
 the best fuzzer in terms of bug finding. Ranking 10 fuzzers 
 in terms of code coverage and
 in terms of #bugs found.
 Moderate agreement means 
 we cannot reliably substitute 
 one instrument for the other.
  • 32. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Experimental Design (post hoc ground truth) • To minimize threats to validity, we use post hoc bug identification instead of a pre-determined ground truth benchmark (more on that later). • Automatic and manual deduplication of bugs found during fuzzing. • 341,595 generated bug reports across all campaigns. • 409 unique bugs after automatic deduplication (via a variant of ClusterFuzz). • 235 unique bugs after manual deduplication (via two professional software engineers). Experimental Setup
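For readers unfamiliar with crash deduplication: tools in the ClusterFuzz family typically bucket crash reports by a signature derived from the crash type and the top stack frames. The sketch below is only a simplified, hypothetical illustration of that idea; the frame count, frame filtering, and report format are assumptions, not the variant actually used in the study.

    import hashlib
    import re
    from collections import defaultdict

    def crash_signature(crash_type, frames, top_n=3):
        """Build a bucket key from the crash type and the top N 'interesting' frames."""
        # Heuristics: drop sanitizer/runtime frames, strip raw addresses and offsets.
        cleaned = [re.sub(r"0x[0-9a-f]+|\+\d+", "", f) for f in frames
                   if not f.startswith(("__asan", "__sanitizer", "libc"))]
        key = crash_type + "|" + "|".join(cleaned[:top_n])
        return hashlib.sha1(key.encode()).hexdigest()[:12]

    def deduplicate(reports):
        """reports: iterable of (crash_type, [stack frame, ...]); returns signature -> reports."""
        buckets = defaultdict(list)
        for crash_type, frames in reports:
            buckets[crash_signature(crash_type, frames)].append((crash_type, frames))
        return buckets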
  • 33. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure
  • 34. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs
  • 35. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs • Benchmark selection • Randomly selected from 
 OSS-Fuzz (500+ programs). • Higher selection probability for
 programs with historically more
 bugs (for economic reasons).
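The selection procedure described here, random but with selection probability weighted by historical bug counts, amounts to weighted sampling without replacement. A minimal sketch under assumed data; the project names and bug counts below are made up for illustration:

    import numpy as np

    # Hypothetical OSS-Fuzz projects and their (made-up) historical bug counts.
    programs   = ["libxml2", "openssl", "sqlite3", "freetype2", "zlib"]
    bug_counts = np.array([120, 300, 80, 150, 10], dtype=float)

    rng = np.random.default_rng(seed=0)
    weights = bug_counts / bug_counts.sum()   # selection probability proportional to history
    benchmark = rng.choice(programs, size=3, replace=False, p=weights)
    print(benchmark)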
  • 36. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs • Benchmark selection • Randomly selected from 
 OSS-Fuzz (500+ programs). • Higher selection probability for
 programs with historically more
 bugs (for economic reasons). • Reproducibility
  • 37. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Experimental Setup • Fuzzers and programs • FuzzBench infrastructure • 10 fuzzers + 24 programs • Benchmark selection • Randomly selected from 
 OSS-Fuzz (500+ programs). • Higher selection probability for
 programs with historically more
 bugs (for economic reasons). • Reproducibility 10 fuzzers x 24 programs 
 x 20 campaigns x 23 hours 
 >13 CPU years
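For scale, a back-of-the-envelope check: 10 fuzzers × 24 programs × 20 campaigns × 23 hours = 110,400 CPU-hours, i.e., roughly 12.6 CPU-years of pure fuzzing time before any per-trial setup or measurement overhead.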
  • 38. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding
  • 39. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding
  • 40. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding Agreement on superiority comparing 2 fuzzers
 in terms of code coverage and #bugs found,
 when the difference is statistically significant at p.
  • 41. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding Agreement on superiority comparing 2 fuzzers
 in terms of code coverage and #bugs found,
 when the difference is statistically significant at p. Strong agreement for p <= 0.0001
 for both: coverage and bug finding.
  • 42. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding However, if we only require the difference in terms of coverage 
 (which we can observe) to be statistically significant: Weak agreement.
  • 43. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Agreement: Coverage vs Bug Finding We also provide two other measures of agreement on superiority:
 disagreement proportion d and Spearman’s ρ (+1 superior, 0 not significant, -1 inferior)
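To make these agreement measures concrete, here is a minimal sketch of how such an analysis could be set up; it is illustrative only, and the significance threshold, the direction test, and the data layout are assumptions, not the paper's analysis code. For each benchmark, a pair of fuzzers is labelled +1/0/-1 once from the coverage samples and once from the bug-count samples, and the two label vectors are then compared:

    from statistics import median
    from scipy.stats import mannwhitneyu, spearmanr

    def superiority(a, b, alpha=0.05):
        """+1 if fuzzer A is significantly better than B, -1 if worse, 0 if not significant."""
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        if p >= alpha:
            return 0
        return 1 if median(a) > median(b) else -1

    def agreement(cov_pairs, bug_pairs, alpha=0.05):
        """cov_pairs/bug_pairs: per-benchmark (samples of fuzzer A, samples of fuzzer B)."""
        cov = [superiority(a, b, alpha) for a, b in cov_pairs]
        bug = [superiority(a, b, alpha) for a, b in bug_pairs]
        disagreement = sum(c != g for c, g in zip(cov, bug)) / len(cov)
        rho, _ = spearmanr(cov, bug)   # rank correlation of the two superiority labels
        return disagreement, rho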
  • 44. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity
  • 45. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Campaign Length Does agreement increase 
 as campaign length increases? Maybe 23 hours are too short
 to expect an agreement.
  • 46. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Campaign Length Does agreement increase 
 as campaign length increases? Not really.
  • 47. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Maybe 23 hours are too short or too long 
 to expect an agreement. 
 How does agreement change 
 as campaign length increases? Threats to Validity: Campaign Length
  • 48. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Does agreement increase
 as the number of trials increases? Maybe 20 trials are too few
 to expect an agreement. Threats to Validity: Number of Trials
  • 49. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Looking good! Threats to Validity: Number of Trials
  • 50. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Number of Trials 20 trials
 are fine.
  • 51. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Generality (#Subjects)
  • 52. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Threats to Validity: Generality (#Subjects)
  • 53. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Result Summary
  • 54. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 55. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking Reviewers B&C: Why not use a ground-truth 
 for your bug-based benchmarking?
  • 56. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Golden evaluation • Choose a random, representative sample of programs and fuzz them. • Problem: (Un)fortunately, bugs are very sparse. No statistical power.
  • 57. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Golden evaluation • Choose a random, representative sample of programs and fuzz them. • Problem: (Un)fortunately, bugs are very sparse. No statistical power. • Mutation-based evaluation • Inject synthetic bugs into a random, representative sample of programs • More economical. We know many bugs can be found. • Problem: Are synthetic bugs representative of real bugs?
  • 58. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth.
  • 59. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias • Fuzzers that are better at finding previously
 undiscovered bugs appear worse • Fuzzers that contributed to the original
 discovery appear better
  • 60. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias • Porting bugs to one version makes it more economical, 
 but also potentially introduces bug masking and interaction effects.
  • 61. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias • Manual translation to an if-statement representing the bug-trigger condition
 simplifies bug counting and provides the same bug oracle to all fuzzers,
 but it enforces a relationship between coverage (of the if-body) and bug finding.
  • 62. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark.
  • 63. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: Many potential sources of bias. • Post hoc bug based evaluation • Maximize bug probability in a random, representative sample of programs. • Identify and deduplicate bugs *after* the fuzzing campaign. Minimizes bias. • Problem: Less economical (we did not find bugs in 7/24 [30%] programs).
  • 64. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Discussion Summary Bug-based benchmarking
 is not easy to get
 right, either!
  • 65. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking There are many pitfalls in 
 sound fuzzer benchmarking. Discussion Summary
  • 66. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Reviewer C (meta-review): In this role of informing the experimental design of 
 future fuzzing research, it is important to describe other
 efforts that call for a more holistic fuzzer evaluation. Benchmarking: Recommendations for Fuzzer Benchmarking *paraphrasing*
  • 67. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Reviewer C (meta-review): In this role of informing the experimental design of 
 future fuzzing research, it is important to describe other
 efforts that call for a more holistic fuzzer evaluation. Benchmarking: Recommendations for Fuzzer Benchmarking *paraphrasing* So, we synthesized a set of recommendations 
 from previous work, our results, and our own experience.
  • 68. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Recommendations for Fuzzer Benchmarking • Select ≥10 representative programs. Repeat each experiment ≥ 10x.
 Increasing these values improves generality and statistical power. • 

  • 69. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Recommendations for Fuzzer Benchmarking • Select ≥10 representative programs. Repeat each experiment ≥ 10x.
 Increasing these values improves generality and statistical power. • Select “real-world programs” that are typically fuzzed in practice.
 Increasing representativeness improves generality.
 If experiment costs are a concern, prioritize programs likely to contain more bugs.

  • 70. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Recommendations for Fuzzer Benchmarking • Select ≥10 representative programs. Repeat each experiment ≥ 10x.
 Increasing these values improves generality and statistical power. • Select “real-world programs” that are typically fuzzed in practice.
 Increasing representativeness improves generality.
 If experiment costs are a concern, prioritize programs likely to contain more bugs. • Select a baseline that was extended to implement the technique.
 Ensure equivalent conditions (CLI parameters, initial seeds, ...).
 Improves construct validity and allows improvements to be attributed precisely.
 (Optional) Comparison to SOTA. Note improvements due to engineering differences.

  • 71. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Consider using a “training set” during fuzzer development
 and a “validation set” (e.g., benchmarking platform) for evaluation.
 Reduces overfitting and mitigates confirmation bias. 
 Benchmarking: Recommendations for Fuzzer Benchmarking
  • 72. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Consider using a “training set” during fuzzer development
 and a “validation set” (e.g., benchmarking platform) for evaluation.
 Reduces overfitting and mitigates confirmation bias. • Statistical analysis of effect size and significance
 Allows assessing the magnitude of the difference and the degree to which
 the differences are explained by randomness.
 
 Benchmarking: Recommendations for Fuzzer Benchmarking
  • 73. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Consider using a “training set” during fuzzer development
 and a “validation set” (e.g., benchmarking platform) for evaluation.
 Reduces overfitting and mitigates confirmation bias. • Statistical analysis of effect size and significance
 Allows assessing the magnitude of the difference and the degree to which
 the differences are explained by randomness. • Measure & report coverage and bug-based metrics.
 Use the same measurement tooling & procedure across all fuzzers.
 Improves construct validity.
 
 Benchmarking: Recommendations for Fuzzer Benchmarking
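One common way to follow the effect-size-plus-significance recommendation is to pair a Mann-Whitney U test with the Vargha-Delaney Â12 effect size. The sketch below uses made-up per-trial coverage numbers, and this particular pairing is one conventional choice rather than something the slides prescribe:

    from itertools import product
    from scipy.stats import mannwhitneyu

    def vargha_delaney_a12(xs, ys):
        """P(random trial of X beats random trial of Y); 0.5 means no effect."""
        wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
                   for x, y in product(xs, ys))
        return wins / (len(xs) * len(ys))

    # Illustrative per-trial branch coverage for two fuzzers (e.g., 20 trials each in practice).
    fuzzer_x = [1510, 1490, 1532, 1505, 1498]
    fuzzer_y = [1470, 1465, 1480, 1475, 1460]

    a12 = vargha_delaney_a12(fuzzer_x, fuzzer_y)
    _, p = mannwhitneyu(fuzzer_x, fuzzer_y, alternative="two-sided")
    print(f"A12 = {a12:.2f} (effect size), p = {p:.4f} (significance)")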
  • 74. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Discuss potential threats to validity and your strategy to mitigate them.
 Helps the reader assess the validity of the key claims and results. Benchmarking: Recommendations for Fuzzer Benchmarking
  • 75. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · UZH IFI Colloquium’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • Discuss potential threats to validity and your strategy to mitigate them.
 Helps the reader assess the validity of the key claims and results. • Ensure experiments are reproducible
 Publish tool, benchmark, data, and analysis. Report specific experiment parameters.
 Reproducibility is the foundation of sound scientific progress. Benchmarking: Recommendations for Fuzzer Benchmarking
  • 76. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google
  • 77. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage!
  • 78. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 79. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark.
  • 80. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Big Picture Conclusion: • CS graduates need better training in statistical and empirical methods. • Learn about different statistical instruments to investigate empirical questions,
 different sources of bias and threats to validity (what can go wrong), and
 sound experiment design (how to do it right) • In research, we focus on a paper’s claim, and not enough on the claim’s validation. • In practice, we also make claims about our system that need validation. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 81. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Big Picture Conclusion: • CS graduates need better training in statistical and empirical methods. • Learn about different statistical instruments to investigate empirical questions,
 different sources of bias and threats to validity (what can go wrong), and
 sound experiment design (how to do it right) • In research, we focus on a paper’s claim, and not enough on the claim’s validation. • In practice, we also make claims about our system that need validation. • CS research community needs more focus on evaluation standards. • Publication bias & Author bias: Too much focus on the results • Investigate soundness of our experimental designs 
 Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary
  • 82. On the Reliability of Coverage-Based
 Fuzzer Benchmarking Marcel Böhme
 MPI-SP & Monash László Szekeres
 Google Jonathan Metzman
 Google Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Benchmarking: Challenges of Bug-Based Benchmarking • Ground-truth-based evaluation • Curate real bugs in a random, representative sample of programs. • Economical, realistic bugs, objective ground truth. • Problem: 1. Survivorship bias 2. Experimenter bias 3. Observer-expectancy bias 4. Confirmation bias • Given a ground truth benchmark, 
 researchers might be enticed 
 to iteratively and unknowingly 
 tune their fuzzer to the benchmark. Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking Motivation Suppose none of our fuzzers
 finds any bugs in our program. How do we know which fuzzer is better? We measure code coverage! Big Picture Conclusion: • CS graduates need better training in statistical and empirical methods. • Learn about different statistical instruments to investigate empirical questions,
 different sources of bias and threats to validity (what can go wrong), and
 sound experiment design (how to do it right) • In research, we focus on a paper’s claim, and not enough on the claim’s validation. • In practice, we also make claims about our system that need validation. • CS research community needs more focus on evaluation standards. • Publication bias & Author bias: Too much focus on the results • Investigate soundness of our experimental designs 
 Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking 1. Select ≥10 representative programs. Repeat each experiment ≥ 10x. 2. Select “real-world programs”. 3. Select a fair baseline. 4. Use a training and a validation set. 5. Use a statistical analysis of effect size and significance. 6. Measure & report coverage and bug-based metrics. 7. Discuss potential threats to validity. 8. Ensure reproducibility. Benchmarking: Recommendations for Fuzzer Benchmarking Marcel Böhme, Max Planck Institute for Security and Privacy & Monash University · ICSE’22 · On the Reliability of Coverage-based Fuzzer Benchmarking • We observe a moderate agreement on superiority or ranking. • Only if we require differences in coverage *and* bug finding to 
 be highly statistically significant, we observe a strong agreement. You can substitute
 coverage for bug finding
 only with moderate reliability. Result Summary