We analyse data from the final two years of a long-running and influential annual Dutch survey of the quality of Dutch New Herring served in large samples of consumer outlets. The data was compiled and analysed by a university econometrician whose findings were publicized in national and international media. This led to the cessation of the survey amid allegations of bias due to a conflict of interest on the part of the leader of the herring tasting team. The survey organizers responded with accusations of failure of scientific integrity. The econometrician was acquitted of wrong-doing by the Dutch authority, whose inquiry nonetheless concluded that further research was needed. We reconstitute the data and uncover its important features which throw new light on the econometrician's findings, focussing on the issue of correlation versus causality: the sample is definitely not a random sample. Taking account both of newly discovered data features and of the sampling mechanism, we conclude that there is no evidence of biased evaluation, despite the econometrician's renewed insistence on his claim.
1. Richard Gill, Mathematical Institute, Leiden University. 16 November, 2022
The Dutch new herring scandals
Repeated measurements with unintended feedback
3. At some point in the latter half of the 14th century Willem Beukelszoon began to go about things differently to other fishermen or fish curers. He is credited with discovering something called kaken, or “gibbing”: a method of gutting and deboning herring that left parts of its stomach and another of its internal organs, the pyloric caecae, intact. Removing the guts and bones took away the bits that would begin to rot first, whilst the remaining pyloric caecae would continue to emit an enzyme called trypsin. The herring would then be thrown into brine and basically, pickle in its own juices.
4. Smell them and weep
Netherlands fishmongers accuse herring-tasters of erring
Can the Dutch still trust their herring-tasters?
Print edition | Europe | Nov 23rd 2017 | AMSTERDAM
HERRING (genus Clupea, with four species found in the Baltic and North Seas) have been vital to northern Europe’s economy since the Middle Ages, when fishermen worked out how to preserve them in brine. Every north European country maintains that there is a right way to eat the fish, but they differ as to what it is. In Sweden Baltic surströmming are fermented until slightly rancid. In Denmark the sill are pickled, or cooked and eaten in long strips. In the Netherlands haring must be lightly salted for preservation but otherwise raw, dipped in minced onion and accompanied with a pickle. No food is more loved.
So the Dutch were shocked when accusations surfaced in November that there was something rotten about the national herring test. The test, sponsored by the Algemeen Dagblad, a newspaper, is carried out by two expert tasters, who each year rate the herring at over a hundred shops and stands across the country. Ben Vollaard, an economist at Tilburg University, was surprised when his respected local fishmonger scored zero. The merchant told Mr Vollaard that one judge routinely tipped the scales, giving higher scores to stores that get their fish from the Atlantic Group, a distributor in Scheveningen. The judge happened to be a consultant for Atlantic, giving courses on how to slice and serve herring.
“I saw how much damage a low rating could do. The judges act like God,” says Mr Vollaard, who specialises in using statistics to detect crime. He decided to run the numbers. The ratings include objective criteria, like weight and fattiness, and subjective ones such as taste and appearance. The economist contacted 85% of the shops surveyed in the past two years and asked who their distributors were. He found that whereas the overall average score was 5.5, the average for those supplied by Atlantic was 8.7. The extra boost for the Atlantic stores came mainly from the subjective scores.
Mr Vollaard’s study has blown the lid off the sealed world of Dutch herring. Fishmongers who long suspected the judge of bias towards Atlantic now say the test is rotten. Two who received low ratings have vowed to sue the Algemeen Dagblad for defamation.
The judge and Atlantic say they have been smeared, and that the statistical evidence is a red herring. They say Mr Vollaard’s figures are off, and that their high scores are due to their superior fish. But the charges of belangenverstrengeling (conflict of interest) have left the test’s reputation for impartiality gutted.
This article appeared in the Europe section of the print edition under the headline "Failing the smell test"
5. • Celebrating many years of collaboration with Per Kragh Andersen
• and Ørnulf Borgan, whose retirement party I had to miss (pandemic)
• In loving memory of the leader of the pack, Niels Keiding
• May the royalty checks for “the weightlifter’s guide to counting processes” (ABGK) continue to come in for many years to come!
ABGK for ever!
11. Dutch New Herring
Longitudinal and spatial data with enormous societal impact
• Pitfalls of amateur regression: The Dutch New Herring controversies
• Fengnan Gao & Richard D. Gill
• https://arxiv.org/abs/2104.00333
• Submitted to SJS
• Previously rejected by Statistica Neerlandica, Lifetime Data Analysis, and JRSS(A)
• SJS: “The journal specializes in statistical modelling showing particular appreciation of the underlying substantive research problems”
• “The emergence of specialized methods for analysing longitudinal and spatial data is just one example of an area of important methodological development in which the Scandinavian Journal of Statistics has a particular niche”
12. The substantive research problem
Was the AD Herring Test biased?
• AD: Algemeen Dagblad, a popular Dutch daily newspaper, based in Rotterdam
• The Herring Test: a yearly ranking of consumer outlets of “Dutch New Herring”
• Dutch New Herring (“Hollandse maatjes”): an EU protected designation for herring prepared according to ancient Dutch tradition
• Ordinary consumers (readers of AD) nominated their local favourite food market stalls, supermarkets, fish shops; highly ranked outlets were automatically included in the next year’s test
• A high ranking was like a Michelin star. A low ranking was the kiss of death.
15. The person who started this all
“I heard something strange about the AD Herring Test. I pulled the data together to see how it all worked. Things came out of that that I thought, ‘this can't be true’. The test panel had a business interest in a herring supplier, Atlantic. Coincidence or not, whoever got the fish from Atlantic, scored significantly higher. I am proud of the fact that my research has helped put an end to this.”
Ben Vollaard (“BV”), Tilburg
16. How did BV’s work reach The Economist?
• BV published statistical analyses as two “working papers” on his university web pages
• Tilburg University put out a press release, both times
• BV appeared on current affairs chat shows on national TV, and the results were reported in the newspapers
• The AD cancelled the “AD Herring Test”, brought in lawyers, and made complaints to BV and to his university. Aggrieved herring outlets started legal action against AD.
• The AD’s lawyer hired me …
Answer: PR activities of his university
17. • Atlantic’s “Dutch New Herring” is objectively the best, judged by the well-known criteria that the AD Herring Testers use to evaluate Dutch new herring:
• Well-cleaned
• Prepared on site
• Neither too “ripe” [matured] nor too “unripe” (cf. venison, whisky, cheese)
• Temperature not too cold, not too warm (cf. hygiene regulations)
• No microbiological contamination
• High fat percentage, high weight (per fish)
• Atlantic’s Dutch New Herring is also the most expensive! (Euros per gram).
Coincidence?
No: Atlantic’s Dutch New Herring is the best!
18. No: Atlantic’s Dutch New Herring is the best!
Coincidence?
• Acknowledgement of conflict of interest: I was paid by AD to help them in their complaint of violation of scientific integrity against Ben Vollaard
• I claim that I gave them my independent scientific opinion on their case
• I argued that econometrician Vollaard was incompetent. It is not for me to judge his integrity
• The Netherlands national organ for evaluation of complaints of violation of scientific integrity ruled that Vollaard was innocent, but that the scientific issues needed further research
• Hence I continued my research with essential collaboration of Fengnan Gao (replication of entire research project, including data extraction from original sources)
20. Vollaard’s first paper
Add indicator variable “< 30 km from Rotterdam”
• Positive effect, significant at 5% level, size of effect ca. 0.5
• Strangely, he also had a covariate “among top 10”
• It was also rather significant, positive
21. Vollaard’s second paper
New variable “outlet supplied by Atlantic”
• He did not add it as a covariate
• Instead, he showed that, according to his fitted model, the difference between “Atlantic” outlets and the rest is mainly due to the subjective variables “ripeness” and “cleaning”. Two objective variables, “temperature” and “being freshly served”, have some importance, but less; the other objective variables are not important at all.
22. From report 2
[...] with Atlantic as supplier score 9.1 on average. The other outlets score a fail, on average just under 5.5. We see little difference in average score between the remaining suppliers.
Figure 1. Average score of outlets on the herring test 2016/17, by supplier
This difference in final score may be amplified if the group that does not have Atlantic as supplier contains relatively many rotten apples: businesses that really sell herring that should not be sold. These rotten apples could pull the average down sharply. To investigate this, we use the microbiological condition of the herring as determined by the laboratory and reported by the [...]
[Figure: two bar charts of score on the herring test (0 to 10) for “Atlantic as supplier” versus “Other supplier”. All outlets (n1 + n2 = 292): 3.6 points difference. Only outlets with herring of sufficient microbiological condition (n1 + n2 = 292 - 37): 3.3 points difference.]
23. Table 1. Why outlets that do not have Atlantic as supplier score lower on the herring test
(Difference = difference between outlets with and without Atlantic as supplier; Effect = effect on final score, 0-10)

Explanatory factor                        Difference  Unit   Effect
Weight                                    -2.17       gram   -0.09
Microbiological condition
  (very) good                             reference category
  sufficient                               0.07       share  -0.01
  poor                                     0.03       share  -0.02
  warning phase                            0.09       share  -0.01
  rejected                                 0.02       share  -0.04
  jointly                                                    -0.07
Fat percentage
  below 10%                               reference category
  between 10 and 14%                      -0.21       share  -0.04
  above 14%                               -0.02       share  -0.01
  jointly                                                    -0.05
Temperature
  below 7 °C                              reference category
  between 7 and 10 °C                      0.29       share  -0.17
  above 10 °C                              0.21       share  -0.36
  jointly                                                    -0.52
Freshly prepared (“fresh from the knife”)
  no                                      reference category
  yes                                     -0.29       share  -0.51
Cleaning
  very good                               reference category
  good                                     0.27       share  -0.22
  moderate                                 0.25       share  -0.40
  poor                                     0.04       share  -0.12
  jointly                                                    -0.73
Ripening
  light                                   reference category
  medium                                  -0.24       share   0.09
  strong                                   0.34       share  -0.63
  rotten                                   0.06       share  -0.25
  jointly                                                    -0.80
Region
  within 30 km radius of Rotterdam        reference category
  more than 30 km from Rotterdam           0.60       share  -0.18
Top 10
  top 10                                  reference category
25. [Figure: histogram of final test score (0 to 10) against number of outlets (0 to 35).]
26. The model was wrong
• The testers gave a score “zero” if the fish should not have been sold:
• temperature > 10 °C
• microbiological contamination = health danger
• maturity = rotten
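The zero-score rule on this slide can be written down as a small predicate. The following is only a sketch in Python: the record type `Inspection`, its field names, and the reading that any single one of the three conditions is enough for a zero are assumptions made for illustration, not taken from the AD test protocol.

```python
from dataclasses import dataclass


@dataclass
class Inspection:
    # Hypothetical record of one outlet visit; field names are illustrative.
    temperature_c: float        # temperature of the fish in degrees Celsius
    micro_health_danger: bool   # lab found contamination that is a health danger
    maturity: str               # e.g. "light", "medium", "strong", "rotten"


def is_disqualified(x: Inspection) -> bool:
    """True if the fish should not have been sold (assumed: any one condition suffices)."""
    return (
        x.temperature_c > 10
        or x.micro_health_danger
        or x.maturity == "rotten"
    )


warm = Inspection(temperature_c=11.0, micro_health_danger=False, maturity="light")
fine = Inspection(temperature_c=7.0, micro_health_danger=False, maturity="medium")
```

The point of separating this rule out is that a zero then marks a different mechanism (disqualification) from the graded scores, which is why a single linear model for all scores fits badly.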
27. More difficulties
• Two years’ data combined; some (many?) outlets appear in both years’ data
• Misclassification of “Atlantic”-supplied outlets (deduced from inconsistency of summary statistics: one outlet got a “6” in 2016)
• Vollaard tested the effect of “near to Rotterdam” with a dummy variable, but the effect of “supplied by Atlantic” with a faulty subject-matter argument (according to him, maturity and microbiology should have been the same). But: “ripening” is a chemical process, “contamination” is biological
• Dummy variable for “Atlantic” was not significant!
28. • Vollaard would not give us the un-anonymised data
• AD gave us their classification of “Atlantic”
• We went back to the original data
29. Better model
• Score zero = “disqualified”
• We study outlets with score ≥ 1
• We omit an outlet’s second observation if it was tested in both years
• Result: much better fit, much smaller standard error of residuals
• Overall picture unchanged
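The two selection steps of the better model can be sketched as a filter over outlet records. This is an illustration in Python, not the code used for the analysis: the keys `outlet`, `year` and `score` are hypothetical, and the order of the two steps (first drop the later observation of a repeated outlet, then drop zero scores) is an assumption, since the slide does not say which comes first.

```python
def select_for_regression(records):
    """Keep each outlet's first observation, then keep only scores >= 1.

    `records` is a list of dicts with hypothetical keys
    'outlet', 'year' and 'score'.
    """
    kept = []
    seen = set()
    # sorted() is stable, so records within a year keep their input order.
    for r in sorted(records, key=lambda r: r["year"]):
        if r["outlet"] in seen:
            continue  # second observation of an outlet tested in both years
        seen.add(r["outlet"])
        if r["score"] >= 1:  # score zero means "disqualified"
            kept.append(r)
    return kept


records = [
    {"outlet": "A", "year": 2016, "score": 8.0},
    {"outlet": "A", "year": 2017, "score": 9.0},
    {"outlet": "B", "year": 2016, "score": 0.0},
    {"outlet": "C", "year": 2017, "score": 5.5},
]
kept = select_for_regression(records)
```

Dropping the repeat visit keeps the observations closer to independent, at the cost of discarding some data.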
30. More discoveries
• BV had incorrectly merged the lowest (“green”) and the highest (“rotten”) category of “ripening”
• BV had mapped scores like “7–” and “7+” to 7. We tried “7 +/– epsilon” for various choices of epsilon (0.01, 0.1). Tiny changes had dramatic effects on the statistical significance of “distance from Rotterdam” and “Atlantic”
• “Rotterdam” went out, “Atlantic” came in!
• Attempts to improve the model (e.g., interaction between most important variables) failed due to multicollinearity
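The “7 +/– epsilon” recoding can be made concrete with a small parser. This is a sketch in Python under stated assumptions: the function name `parse_score` and the exact text format of the published scores (a number optionally followed by “+” or a dash) are illustrative, not taken from the original data files.

```python
def parse_score(text: str, epsilon: float = 0.0) -> float:
    """Map '7', '7+' or '7-' to 7, 7 + epsilon or 7 - epsilon.

    With epsilon = 0.0 this is the mapping attributed to BV on the slide
    (the plus and minus are ignored).
    """
    # Accept an en dash or a minus sign as well as an ASCII hyphen.
    text = text.strip().replace("–", "-").replace("−", "-")
    if text.endswith("+"):
        return float(text[:-1]) + epsilon
    if text.endswith("-"):
        return float(text[:-1]) - epsilon
    return float(text)


scores_bv = [parse_score(s) for s in ["7-", "7", "7+"]]
scores_eps = [parse_score(s, epsilon=0.1) for s in ["7-", "7", "7+"]]
```

Re-running the same regression with `epsilon` set to 0, 0.01 and 0.1 is then a one-parameter sensitivity check on the response coding.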
31. Conclusions for Dutch new herring
• The data was not a random sample. It was self-selected. Most participants were new participants and often came from areas where the AD herring test was little known
• These newcomers often performed poorly
• There is absolutely no evidence of bias toward Atlantic outlets
• There is little evidence for a regional bias
• “Atlantic” is the best!
32. Take home messages
• In linear regression analysis where multicollinearity is present, regression estimates are highly sensitive to small perturbations in model specifications
• Too much applied statistics applies a conventional method-of-choice to a data-set without any thought about the data-generation mechanism
• What I learned from ABGK: doing science is fun and it’s a social activity
• Scandinavian statistics has found the right balance between theory and
application
• Thanks Per (and also Niels and Ørnulf!)
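The first take-home message can be seen in a toy calculation, written here in Python. The numbers are invented for illustration and have nothing to do with the herring data: two predictors that differ only in one observation (by 0.1), a response that is exactly their sum, and a least-squares fit without intercept via the 2x2 normal equations. Shifting that one response value by 0.1 up or down moves the fitted coefficients from about (1, 1) to about (0, 2) or about (2, 0).

```python
def ols_two_predictors(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 (no intercept), 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are nearly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Toy data (assumed, not from the herring test): x2 equals x1 except in the last entry.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 4.1]
y = [a + b for a, b in zip(x1, x2)]      # exactly x1 + x2
y_up = y[:3] + [y[3] + 0.1]              # one response nudged up by 0.1
y_down = y[:3] + [y[3] - 0.1]            # one response nudged down by 0.1

base = ols_two_predictors(x1, x2, y)
up = ols_two_predictors(x1, x2, y_up)
down = ols_two_predictors(x1, x2, y_down)
```

The fitted values barely change between the three fits; only the split of the effect between the two nearly identical predictors does, which is the pattern behind “Rotterdam went out, Atlantic came in”.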
37. Data sources: Vollaard’s anonymised & combined data set; AD Herring Test websites 2016, 2017; AD identification of Vollaard’s vendors
R packages: tidyverse, dplyr, ggplot2, cbsodataR, ggmap
[Figure: map “Vendors of 2016 and 2017, Dutch New Herring”. Legends: year (2016, 2017); pop. density by municipality (legend values 54.59815, 148.41316, 403.42879, 1096.63316, 2980.95799).]