
Unifying Human and Statistical Evaluation for Natural Language Generation


How can we measure whether a natural language generation system produces outputs that are both high-quality and diverse? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low-quality samples would be insufficiently penalized.

In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated.
We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE.
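
As a hedged illustration of the "optimal error rate" mentioned above, suppose a sentence x is drawn with equal probability from the human (reference) distribution or from the model, and a classifier f must guess the source z. The balanced prior and the symbols L*, f, z, and p_model are assumptions of this sketch (p_ref is the notation used on the slides). Under those assumptions, the standard Bayes-error identity gives:

```latex
L^{*} \;=\; \min_{f}\, \Pr\big[f(x) \neq z\big]
      \;=\; \tfrac{1}{2}\sum_{x} \min\{\,p_{\text{ref}}(x),\; p_{\text{model}}(x)\,\}
      \;=\; \tfrac{1}{2}\Big(1 - \mathrm{TV}\big(p_{\text{ref}},\, p_{\text{model}}\big)\Big),
\qquad
\mathrm{TV}(p, q) \;=\; \tfrac{1}{2}\sum_{x} \big|p(x) - q(x)\big| .
```

In this form, L* reaches 1/2 only when the model matches the human distribution exactly, and falls to 0 when the two distributions never overlap, so both low-quality samples and missing diversity pull the error down.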

On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.
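
A minimal sketch of how such an error rate might be estimated from two numbers per sentence: a model log-probability and a crowdsourced human score. The k-nearest-neighbor classifier, the leave-one-out protocol, the value of k, the function name `loo_knn_error`, and all data below are assumptions made for illustration; this is not the authors' released implementation.

```python
import math


def loo_knn_error(features, labels, k=3):
    """Leave-one-out error of a k-nearest-neighbor classifier.

    features: list of (model_log_prob, human_score) pairs (hypothetical layout)
    labels:   1 for human-written, 0 for machine-generated
    """
    n = len(features)
    errors = 0
    for i in range(n):
        # Distance from the held-out point i to every other point.
        neighbours = sorted(
            (math.dist(features[i], features[j]), labels[j])
            for j in range(n)
            if j != i
        )
        # Majority vote among the k closest points (use odd k to avoid ties).
        votes = sum(label for _, label in neighbours[:k])
        predicted = 1 if 2 * votes > k else 0
        errors += predicted != labels[i]
    return errors / n


# Made-up toy data: human sentences score high on both features,
# machine sentences score low, so the two groups are easy to separate.
human = [(-1.0, 4.5), (-1.1, 4.2), (-0.9, 4.8), (-1.2, 4.4)]
machine = [(-3.0, 2.0), (-3.2, 1.8), (-2.9, 2.2), (-3.1, 1.9)]
features = human + machine
labels = [1] * len(human) + [0] * len(machine)

error = loo_knn_error(features, labels, k=3)
print(error)  # 0.0 for this well-separated toy data
```

An error near 0 means the samples are easy to tell apart from human text; an error near 0.5 means the classifier does no better than chance, which is the desirable outcome for a generation system.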

Published in: Technology


  1. Unifying Human and Statistical Evaluation for Natural Language Generation. Tatsunori Hashimoto*, Hugh Zhang*, Percy Liang
  2. What are the goals of natural language generation?
  3. Storytelling. A high quality story? "The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved." [Radford+ 2019]
  4. Storytelling. "The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved." A high quality story?
  5. Storytelling. Atticus said to Jem one day, “I’d rather you shot at tin cans in the back yard, but I know you’ll go after birds. Shoot all the bluejays you want, if you can hit ‘em, but remember it’s a sin to kill a mockingbird.” (From Harper Lee’s “To Kill A Mockingbird”)
  6. Storytelling. Atticus said to Jem one day, “I’d rather you shot at tin cans in the back yard, but I know you’ll go after birds. Shoot all the bluejays you want, if you can hit ‘em, but remember it’s a sin to kill a mockingbird.” (From Harper Lee’s “To Kill A Mockingbird”) Good story, but not a good model.
  7. Storytelling. Peter said to James one afternoon, “I’d rather you fired at aluminum cans in the garage, but I know you will go after birds. Hit all the ravens you can, if you want, but remember it is a sin to murder a hummingbird.”
  8. Storytelling. Peter said to James one afternoon, “I’d rather you fired at aluminum cans in the garage, but I know you will go after birds. Hit all the ravens you can, if you want, but remember it is a sin to murder a hummingbird.”
  9. Storytelling. Peter said to James one afternoon, “I’d rather you fired at aluminum cans in the garage, but I know you will go after birds. Hit all the ravens you can, if you want, but remember it is a sin to murder a hummingbird.” Diversity is important and hard to quantify!
  10. Goal
  11. Evaluation should measure both quality and diversity
  12. Claim: Human evaluation has difficulty catching diversity defects.
  13. Try It Yourself! Task: Headline short news articles
  14. Context: Political leaders in Israel united in prayers for Ariel Sharon as the prime minister underwent surgery after suffering a stroke. Output: Sharon has stroke for stroke.
  15. Context: Political leaders in Israel united in prayers for Ariel Sharon as the prime minister underwent surgery after suffering a stroke. Output: Sharon has stroke for stroke. Machine generated (obvious quality failure).
  16. Context: The Buffalo Bills sacked Tom Donahoe as president and general manager on Wednesday, fulfilling expectations of a shake-up. Output: Bills sack Donahoe as president and gm.
  17. Context: The Buffalo Bills sacked Tom Donahoe as president and general manager on Wednesday, fulfilling expectations of a shake-up. Output: Bills sack Donahoe as president and gm. Machine generated (hard to detect diversity issue).
  18. Context: The Buffalo Bills sacked Tom Donahoe as president and general manager on Wednesday, fulfilling expectations of a shake-up. Output: Bills sack Donahoe as president and gm. Machine generated (hard to detect diversity issue).
  19. Context: The Buffalo Bills sacked Tom Donahoe as president and general manager on Wednesday, fulfilling expectations of a shake-up. Output: Bills sack Donahoe as president and gm. Reference: NFL’s Bills shake up front office.
  20. Existing Evaluations (Human Evaluation; Statistical, e.g., perplexity; Reference Based, e.g., BLEU, ROUGE; Learned Metrics, e.g., ADEM). Human Evaluation: Pros: gold standard for quality. Cons: can be cheated by under-diversity.
  21. Existing Evaluations. Statistical (e.g., perplexity): Pros: measures diversity. Cons: does not measure sample quality. [Theis+ 2015]
  22. Existing Evaluations. Reference Based (e.g., BLEU, ROUGE): Pros: quick and easy to calculate. Cons: inaccurate measures of both quality and diversity. [Papineni+ 2002], [Lin+ 2004], [Banerjee+ 2005]
  23. Existing Evaluations. Learned Metrics (e.g., ADEM): Pros: quick and easy to calculate. Cons: still unreliable; often still can’t catch diversity. [Lowe+ 2017], [Olsson+ 2018]
  24. Existing Evaluations. [Figure: axes Quality and Diversity; labels Human, Statistical, Learned, Reference, Dist]
  25. Our work: unifying human and statistical evaluation to measure both quality and diversity.
  30. Solution: Optimal Classification
  31. Optimal Classifier Is ... (Captures Quality; Captures Diversity; Intuitive). Captures Quality: low-quality samples are easily distinguished.
  32. Optimal Classifier Is ... (Captures Quality; Captures Diversity; Intuitive). Captures Diversity: under-diversity and plagiarism are also recognizable.
  33. Optimal Classifier Is ... (Captures Quality; Captures Diversity; Intuitive). Intuitive: optimal classification error makes intuitive sense to humans.
  34. Can we reliably estimate the optimal classification error?
  35. Learned Classifier [Chaganty+ 2017], [Novikova+ 2017]
  36. Learned Classifier [Chaganty+ 2017], [Novikova+ 2017]. Good model or weak classifier?
  37. Key Insight: Only need access to two features to optimally classify sentences
  43. Use Humans! Crowdsource estimates of “typicality” as a substitute for p_ref
  44. Human Judgement Score: (1) 20 crowdworkers rate a sentence from 1 (rare) to 5 (common); (2) define HJ as the average of their “typicality” judgements.
  48. Human Unified with Statistical Evaluation
  49. HUSE
  50. HUSE
  51. HUSE
  52. Learning a classifier in high dimensions is hard. In two dimensions, it is easy.
  53. HUSE Guarantees: Optimal ≤ HUSE ≤ Humans (lower is better).
  54. HUSE Guarantees: Optimal ≤ HUSE ≤ Humans (lower is better). Always outperforms humans.
  55. HUSE Guarantees: Optimal ≤ HUSE ≤ Humans (lower is better). Gap from model underdiversity.
  56. HUSE Guarantees: Optimal ≤ HUSE ≤ Humans (lower is better). Always outperforms humans. Zero false negatives.
  57. HUSE Guarantees: Optimal ≈ HUSE ≤ Humans
  58. Case Study: Summarization
  71. Quality-Diversity Tradeoffs
  74. (1 - HUSE_Q) + (1 - HUSE_D) = (1 - HUSE), that is, quality + diversity = total error.
  80. Use HUSE: captures quality and diversity; statistically principled.
  81. Use our system! https://github.com/hughbzhang/HUSE
  82. Questions?
  83. Appendix
  86. Turk Prompt
  87. Mutual Information Theorem
  88. Holder’s Bound
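
Slide 44 defines HJ as the average of 20 crowdworker ratings, and slide 74 relates HUSE_Q, HUSE_D, and HUSE. The arithmetic can be sketched as below. The function names, the rating list, and the HUSE values are hypothetical and chosen only for illustration; `huse_d` simply rearranges the slide-74 identity and is not taken from the authors' code.

```python
def human_judgement(ratings):
    """Average of typicality ratings on the 1 (rare) to 5 (common) scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


def huse_d(huse, huse_q):
    """Solve (1 - HUSE_Q) + (1 - HUSE_D) = (1 - HUSE) for HUSE_D."""
    return 1 + huse - huse_q


# Hypothetical ratings from 20 crowdworkers for a single sentence.
ratings = [4] * 10 + [5] * 5 + [3] * 5
hj = human_judgement(ratings)
print(hj)  # 4.0

# Hypothetical scores, used only to check the identity numerically.
huse, huse_q = 0.5, 0.75
d = huse_d(huse, huse_q)
print(d)  # 0.75
assert (1 - huse_q) + (1 - d) == 1 - huse
```

Reading the identity this way, whatever part of the total error (1 - HUSE) is not explained by the quality term (1 - HUSE_Q) is attributed to diversity.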
