Veronika Snizhko: Evaluating the quality of an NLP project: why automatic metrics may not be enough

  1. Evaluation of NLP projects: Why automatic metrics are not always enough. Veronika Snizhko
  2. Automatic Evaluation Metrics ● Objective and consistent ● Time-saving ● Cost-effective ● Repeatable and scalable ● Benchmarking ● Standardization ● Transparency
  3. Categories of automatic evaluation metrics Generic metrics can be applied to a variety of situations and datasets, such as precision, accuracy, and perplexity. Task-specific metrics are limited to a given task, such as Machine Translation (often evaluated using BLEU or ROUGE) or Named Entity Recognition (often evaluated with seqeval). Dataset-specific metrics aim to measure model performance on specific benchmarks: for instance, the GLUE and SQuAD benchmarks each have a dedicated evaluation metric. ● Linguistic ● Semantic ● Diversity ● Factual correctness ● Engagement
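The generic metrics named on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy labels, and the per-token probabilities are assumptions made for the example, not material from the talk. Perplexity is computed here as the exponential of the average negative log-probability per token.

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the gold labels.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred, positive=1):
    # Of everything predicted as `positive`, how much really is positive.
    gold_for_predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not gold_for_predicted:
        return 0.0
    return sum(t == positive for t in gold_for_predicted) / len(gold_for_predicted)

def perplexity(token_probs):
    # exp of the average negative log-probability the model gave each token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Toy data (hypothetical): binary labels and model-assigned token probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)                # 4 of 6 labels match
prec = precision(y_true, y_pred)              # 2 of 3 positive predictions are right
ppl = perplexity([0.25, 0.25, 0.25, 0.25])    # uniform over 4 choices gives about 4

print(f"accuracy={acc:.3f} precision={prec:.3f} perplexity={ppl:.3f}")
```

In practice these would usually come from an established library rather than hand-written functions; the sketch only shows what each number measures.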
  4. Limitations of Automatic Evaluation Metrics ● Lack of Coverage: Automatic evaluation metrics may not capture the full range of nuances and complexities in natural language. ● Divergence from Human Quality: Automatic evaluation metrics are often based on human judgments of quality, but these judgments are not always consistent or reliable. There may be cases where the metric gives a high score, but humans perceive the output as poor or low quality. ● Lack of Domain Specificity: Automatic evaluation metrics may not capture the domain-specific nuances or knowledge required for certain NLP tasks, such as medical or legal text, where the language is highly technical and specialized. ● Lack of Context: Automatic evaluation metrics may not be able to capture the context in which the generated text is used, which can be critical for assessing the quality of the output. For example, a generated sentence may be grammatically correct and semantically coherent, but it may not be appropriate in the context of a larger document or conversation. ● Lack of Creativity: Automatic evaluation metrics may not capture the creativity and novelty of the generated text, which can be important for certain applications such as creative writing or advertising.
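The divergence from human judgment can be illustrated with a deliberately simplified overlap metric. The sketch below uses clipped unigram precision, which is only a stand-in for the word-overlap idea behind metrics such as BLEU, not BLEU itself; the function name and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    # Share of candidate words that also occur in the reference,
    # with each word counted at most as often as it appears in the reference.
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return overlap / len(cand_tokens)

reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"   # same meaning, different words
scrambled = "mat the on sat cat the"          # same words, no meaning

score_paraphrase = unigram_precision(paraphrase, reference)
score_scrambled = unigram_precision(scrambled, reference)

print(f"paraphrase: {score_paraphrase:.2f}")
print(f"scrambled:  {score_scrambled:.2f}")
```

A human reader would likely prefer the paraphrase, yet this metric scores the scrambled sentence far higher, because it looks only at surface word overlap.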
  5. Human Evaluation ● Human evaluation provides a more comprehensive understanding of the quality of NLP models. By having humans assess the output, we can capture aspects of language that are difficult for machines to measure, such as humor, sarcasm, and irony. ● Human evaluation can identify errors or mistakes in the output that may be missed by automatic metrics. Humans are better able to understand the context and can pick up on nuances that may not be captured by machines. ● Human evaluation can help to improve the usability and acceptability of NLP models. By assessing the naturalness and coherence of the generated text, we can ensure that NLP models produce outputs that meet the needs of users. ● Human evaluation can help to address issues of bias and fairness in NLP models. By having diverse groups of humans evaluate the output, we can identify biases that may be present in the model and work to address them.
  6. Human Evaluation
  7. A language model may have a good perplexity score, but if the generated text is repetitive or fails to provide accurate or relevant information, it may be considered poor by human evaluators.
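Repetition of the kind described here is one of the few such problems that is easy to flag automatically. A minimal sketch, assuming a simple whitespace tokenizer and a distinct-n style ratio (unique n-grams divided by total n-grams); the function name and example texts are hypothetical.

```python
def distinct_n(text, n=2):
    # Ratio of unique n-grams to all n-grams; values near 0 mean heavy repetition.
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

repetitive = "i do not know i do not know i do not know"
varied = "the model answered the question clearly and briefly"

ratio_repetitive = distinct_n(repetitive)
ratio_varied = distinct_n(varied)

print(f"repetitive: {ratio_repetitive:.2f}")
print(f"varied:     {ratio_varied:.2f}")
```

A check like this catches repetition only. Whether the text is accurate or relevant still needs a human reader or a task-specific test.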
  9. Limitations of Human Evaluation ● Cost and time: Human evaluation can be expensive and time-consuming, especially when large amounts of data need to be evaluated. ● Subjectivity: Human evaluators may have different interpretations and opinions, which can lead to inconsistencies in the evaluation process. ● Biases: Human evaluators may have biases based on factors like cultural background, gender, or personal preferences. These biases can influence their evaluations and lead to unfair assessments of NLP models. ● Scale: It can be difficult to scale human evaluation to large datasets or applications, which can limit its usefulness in some contexts. ● Reproducibility: Human evaluation can be difficult to reproduce, as different evaluators may have different interpretations and opinions. This can make it difficult to compare results across different studies or evaluations.
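Subjectivity can be quantified by measuring how often evaluators agree beyond what chance would produce. The sketch below computes Cohen's kappa for two raters; this is one common agreement statistic and is offered as an illustration, not as the method used in the talk. The ratings are made-up example data.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments from two evaluators on the same eight model outputs.
rater_a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
rater_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa={kappa:.2f}")
```

Here the raters agree on 6 of 8 items, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate. Reporting such a statistic alongside human scores makes results easier to compare across studies.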
  10. Conclusion To overcome the limitations of automatic evaluation metrics, it is essential to combine them with human evaluation. Human evaluation can provide valuable insights into the NLP system's performance, such as the naturalness and fluency of its output. Additionally, human evaluation can help to identify the areas where the NLP system needs improvement.
  11. Thank you!