How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized.
In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated.
We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE.
On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.
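As an illustration of the idea of scoring a generator by how well a classifier can tell human text from machine text, the sketch below computes a leave-one-out error rate for a k-nearest-neighbor classifier over two features per sentence. The choice of classifier, the function name `leave_one_out_knn_error`, the feature pair (a human typicality score and a model log-probability), and the toy numbers are all assumptions made for this example, not details stated above. An error near 0 means the two sources are easy to separate; an error near 0.5 means they are hard to tell apart.

```python
import math


def leave_one_out_knn_error(features, labels, k=3):
    """Fraction of points misclassified by a majority vote of their
    k nearest *other* points. Labels: 1 = human, 0 = machine.
    Use an odd k so the vote cannot tie."""
    n = len(features)
    errors = 0
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        neighbors = sorted(
            (math.dist(features[i], features[j]), labels[j])
            for j in range(n)
            if j != i
        )
        votes_for_human = sum(label for _, label in neighbors[:k])
        predicted = 1 if 2 * votes_for_human > k else 0
        errors += predicted != labels[i]
    return errors / n


# Hypothetical (typicality score, log-probability) pairs.
features = [
    (4.5, -10.0), (4.4, -11.0), (4.6, -9.0),   # human-written
    (2.0, -30.0), (2.1, -31.0), (1.9, -29.0),  # machine-generated
]
labels = [1, 1, 1, 0, 0, 0]
error_rate = leave_one_out_knn_error(features, labels, k=3)
```

In this toy data the two groups are far apart, so the classifier separates them easily; a generator whose samples overlapped the human points would push the error rate up.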
Unifying Human and Statistical Evaluation for Natural Language Generation
1. Unifying Human and Statistical Evaluation for Natural Language Generation
Tatsunori Hashimoto*, Hugh Zhang*, Percy Liang
2. What are the goals of natural language generation?
3. Storytelling
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
[Radford+ 2019]
A high quality story?
5. Storytelling
Atticus said to Jem one day, “I’d rather you shot at tin cans in the back yard, but I know you’ll go after birds. Shoot all the bluejays you want, if you can hit ’em, but remember it’s a sin to kill a mockingbird.”
From Harper Lee’s “To Kill A Mockingbird”
Good story, but not a good model
7. Storytelling
Peter said to James one afternoon, “I’d rather you fired at aluminum cans in the garage, but I know you will go after birds. Hit all the ravens you can, if you want, but remember it is a sin to murder a hummingbird.”
Diversity is important and hard to quantify!
14. Context: Political leaders in Israel united in prayers for Ariel Sharon as the prime minister underwent surgery after suffering a stroke.
Output: Sharon has stroke for stroke.
Machine generated (obvious quality failure)
16. Context: The Buffalo Bills sacked Tom Donahoe as president and general manager on Wednesday, fulfilling expectations of a shake-up.
Output: Bills sack Donahoe as president and gm.
Machine generated (hard to detect diversity issue)
Reference: NFL’s Bills shake up front office.
20. Existing Evaluations
Human Evaluation
Pros: Gold standard for quality
Cons: Can be cheated by under diversity
Statistical (e.g., perplexity)
Reference Based (e.g., BLEU, ROUGE)
Learned Metrics (e.g., ADEM)
44. Human Judgement Score
1. 20 crowdworkers rate a sentence from 1 (rare) to 5 (common)
2. Define HJ as the average of their “typicality” judgements
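The two steps above amount to a simple average. The sketch below shows one way to compute it; the function name `human_judgement_score`, the range check, and the 20 example ratings are hypothetical and only serve to illustrate the definition.

```python
from statistics import mean


def human_judgement_score(ratings):
    """Average the crowdworkers' typicality ratings for one sentence.
    Each rating is expected to lie on the 1 (rare) to 5 (common) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings)


# Hypothetical ratings from 20 crowdworkers for a single sentence.
ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 4]
hj = human_judgement_score(ratings)
```

Averaging over many raters smooths out individual disagreement, so HJ can be read as a per-sentence estimate of how typical the sentence looks to humans.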