Evaluating MT Systems with
Second Language Proficiency Tests
Takuya Matsuzaki, Akira Fujita, Naoya Todo, Noriko H. Arai
ACL 2015
2015/09/24
AHCLab M1 Makoto Morishita
Abstract
• BLEU has several weaknesses when used to evaluate
a system in a real-world setting.
• This paper evaluates MT systems by using
second-language proficiency tests (TOEIC, etc.).
• It revealed that the context-unawareness of the
current MT systems severely damages human
performance when solving the test problems.
Weak Points of BLEU
1. Unreliability in evaluating short translations
2. Non-interpretability of the scores beyond
numerical comparison
3. Bias towards SMT systems
Weak Points of Manual Evaluation
1. It is costly.
2. It is not easy to analyze the characteristics of
MT systems based solely on the evaluation
results.
Solution
• Task-based evaluation of MT systems

- Measures human performance on a task
• Human subjects perform a task, such as information
extraction, on machine-translated text.
Weak Points of Task-Based Evaluation
• It is costly.
- We have to create test materials and gather appropriate human subjects.
• This paper uses second-language proficiency tests
(SLPTs), such as TOEIC, as the source of test materials.
• Human subjects solve the machine-translated problems,
and the system is evaluated by their test scores.
Second-Language Proficiency Tests (SLPTs)
• There are a lot of SLPTs in many languages.
• They are carefully designed to evaluate
various aspects of language ability.
• SLPTs are designed to assess language
ability, not general intelligence.
- Can be robust against the heterogeneity (diversity) of
the subjects.
Materials
• We randomly chose 40 problems from the
National Center Test for University
Admissions (センター試験).
• Each problem consisted of a short
conversation between two people.
Materials
• In this paper, we use multiple-choice
dialogue completion problems.
Experiment
• The original problems were in English, and we
translated them into Japanese.
• The human subjects solved the translated
problems.
• The translation quality was evaluated based
on the rate of correct answers given by the
human subjects.
Experiment
• We evaluated 4 systems.
- G: Google Translate
- Y: Yahoo Translate
- Hs: Human translation that does not consider context
- Ho: Human translation that considers context
Participants
• 320 Japanese junior high school students
- School A: 1st: 80, 2nd: 80, 3rd: 78
- School B: 1st: 82
Extrinsic Evaluation Metric
• CAR: Correct Answer Rate
\[
\mathrm{CAR}_M(p) = \frac{\#\text{ of subjects who correctly answered } M(p)}{\#\text{ of subjects who solved } M(p)}
\]
\[
\mathrm{AvgCAR}_M = \frac{1}{|P|} \sum_{p \in P} \mathrm{CAR}_M(p)
\]
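The CAR definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes that M(p) denotes problem p as translated by system M (the slide does not spell this out) and that each subject's response is recorded as a boolean. The function names and the toy data are hypothetical.

```python
def car(answers):
    """Correct Answer Rate for one translated problem M(p).

    `answers` holds one boolean per subject who solved the problem:
    True if the subject answered correctly, False otherwise.
    """
    if not answers:
        raise ValueError("no subjects solved this problem")
    return sum(answers) / len(answers)


def avg_car(answers_by_problem):
    """Average CAR of one system over the problem set P.

    `answers_by_problem` maps a problem id to that problem's list of
    per-subject booleans (hypothetical data layout).
    """
    if not answers_by_problem:
        raise ValueError("empty problem set")
    rates = [car(answers) for answers in answers_by_problem.values()]
    return sum(rates) / len(rates)


# Toy data, invented for illustration only (not results from the paper).
toy_answers = {
    "p1": [True, True, False, True],  # CAR = 0.75
    "p2": [False, True],              # CAR = 0.5
}
print(avg_car(toy_answers))  # 0.625
```

Note that AvgCAR averages the per-problem rates, so every problem carries equal weight even if different numbers of subjects solved it.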
Robustness against the
Heterogeneity of the Human Subjects
• Across grades within School A (1st: 80, 2nd: 80, 3rd: 78): no difference
• Across schools (School A 1st: 80 vs. School B 1st: 82): no difference
→ The participants' heterogeneity did not affect the test results.
System-level Evaluation
• We could not find a significant difference
between Y and Hs.
• Refo: Do not consider context
• Refs: Consider context
Agreement
• For a pair of systems A and B, an intrinsic measure M and CAR agree if
- M scores system A's translation higher than system B's, and
- CAR also scores system A's translation higher than system B's.
• Check the agreement rate across the individual problems.
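The agreement check above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the procedure from the paper: the comparison is applied to every pair of systems on every problem, the two measures are counted as agreeing when they order the pair the same way in either direction, and pairs tied under either measure are skipped. The function name, the data layout, and the toy scores are hypothetical.

```python
from itertools import combinations


def agreement_rate(metric_scores, car_scores):
    """Fraction of (problem, system pair) comparisons on which an
    intrinsic measure and CAR order the two systems the same way.

    Both arguments map problem id -> {system name -> score}.
    """
    agree = 0
    total = 0
    for problem, metric_by_system in metric_scores.items():
        car_by_system = car_scores[problem]
        for a, b in combinations(sorted(metric_by_system), 2):
            metric_diff = metric_by_system[a] - metric_by_system[b]
            car_diff = car_by_system[a] - car_by_system[b]
            if metric_diff == 0 or car_diff == 0:
                continue  # assumption: ties are left out of the count
            total += 1
            if (metric_diff > 0) == (car_diff > 0):
                agree += 1
    return agree / total if total else float("nan")


# Toy scores, invented for illustration only (not results from the paper).
toy_metric = {"p1": {"G": 0.3, "Y": 0.2}, "p2": {"G": 0.1, "Y": 0.4}}
toy_car = {"p1": {"G": 0.8, "Y": 0.5}, "p2": {"G": 0.9, "Y": 0.6}}
print(agreement_rate(toy_metric, toy_car))  # 0.5
```

In the toy data the two measures order G and Y the same way on p1 and in opposite ways on p2, so half of the comparisons agree.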
Agreement Rate
• Agreement Rates between Automatic
Evaluation Metrics and Human Evaluation
Agreement Rate
• Agreement Rates between Intrinsic
Evaluation Metrics and Correct Answer Rate
Agreement Rate
• The human evaluation agrees with the CAR
slightly better than the automatic metrics do.
• But the agreement rate is still below 0.7.
• CAR can be critically damaged by a subtle
translation mistake.
Conclusion
• The comparison of the 4 systems shows that it is important to
consider the context of individual sentences when
translating dialogues.
• SLPT can evaluate a different dimension of
translation quality.
• SLPT can be robust against the heterogeneity
of human subjects.
Questions & Comments

Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 

[Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

  • 1. Evaluating MT Systems with Second Language Proficiency Tests. Takuya Matsuzaki, Akira Fujita, Naoya Todo, Noriko H. Arai. ACL 2015. 2015/09/24, AHCLab M1 Makoto Morishita
  • 2. Abstract • BLEU has several weaknesses when evaluating a system in a real-world setting. • This paper evaluates MT systems using second-language proficiency tests (TOEIC, etc.). • The results revealed that the context-unawareness of current MT systems severely damages human performance when solving the test problems.
  • 3. Weak Points of BLEU 1. Unreliability in evaluating short translations 2. Non-interpretability of the scores beyond numerical comparison 3. Bias towards SMT systems
  • 4. Weak Points of Manual Evaluation 1. It is costly. 2. It is not easy to analyze the characteristics of MT systems based solely on the evaluation results.
  • 5. Solution • Task-based evaluation of MT systems: measures human performance in a task. • Humans perform a task, such as information extraction, on a machine-translated text.
  • 6. Weak Points of Task-Based Evaluation • It is costly: test materials must be created and appropriate human subjects gathered. • This paper uses second-language proficiency tests (SLPTs), such as TOEIC, as the source of test materials. • Human subjects solve the translated problems, and the system is evaluated by their test scores.
  • 7. Second-Language Proficiency Tests (SLPT) • There are many SLPTs in many languages. • They are carefully designed to evaluate various aspects of language ability. • SLPTs are designed to assess language ability, not general intelligence, so they can be robust against the heterogeneity (diversity) of the subjects.
  • 8. Materials • We chose 40 problems at random from the National Center Test for University Admissions (センター試験). • Every problem consisted of a short conversation between two people.
  • 9. Materials • This paper uses multiple-choice dialogue completion problems.
  • 10. Experiment • The original problems were in English and were translated into Japanese. • The human subjects solved the translated problems. • Translation quality was evaluated based on the rate of correct answers given by the human subjects.
  • 11. Experiment • Four systems were evaluated.
 - G: Google Translate
 - Y: Yahoo Translate
 - Hs: Human translation that does not consider context
 - Ho: Human translation that considers context
  • 12. Participants • 320 Japanese junior high school students
 - School A: 1st: 80, 2nd: 80, 3rd: 78
 - School B: 1st: 82
  • 13. Extrinsic Evaluation Metric • CAR: Correct Answer Rate • $\mathrm{CAR}_M(p) = \dfrac{\#\text{ of subjects who correctly answered } M(p)}{\#\text{ of subjects who solved } M(p)}$ • $\mathrm{AvgCAR}_M = \dfrac{1}{|P|} \sum_{p \in P} \mathrm{CAR}_M(p)$
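The correct answer rate defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `car` and `avg_car`, the dictionary layout, and the toy answers are hypothetical and do not come from the paper or the slides.

```python
from typing import Dict, List


def car(answers: List[bool]) -> float:
    """CAR_M(p): share of subjects who answered the translated problem M(p) correctly.

    `answers` holds one boolean per subject who solved M(p).
    """
    if not answers:
        raise ValueError("no subject solved this problem")
    return sum(answers) / len(answers)


def avg_car(results: Dict[str, List[bool]]) -> float:
    """AvgCAR_M: mean of CAR_M(p) over all problems p in P for one system M."""
    if not results:
        raise ValueError("no problems given")
    return sum(car(a) for a in results.values()) / len(results)


# Hypothetical toy data for one system: problem id -> per-subject correctness.
toy = {
    "p1": [True, True, False, True],  # 3 of 4 subjects correct
    "p2": [False, True],              # 1 of 2 subjects correct
}
print(car(toy["p1"]), avg_car(toy))
```

Note that the average is taken over problems, not over subjects, so a problem solved by few subjects weighs as much as one solved by many.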
  • 14. Robustness against the Heterogeneity of the Human Subjects • School A, 1st (80) vs. 2nd (80) vs. 3rd (78): no difference. • School A 1st (80) vs. School B 1st (82): no difference. → The participants' heterogeneity did not affect the test results.
  • 15. System-level Evaluation • No significant difference was found between Y and Hs.
  • 19. System-level Evaluation • Refo: Does not consider context • Refs: Considers context
  • 20. Agreement • An intrinsic measure M and CAR agree on a problem if M scores system A's translation higher than system B's translation and CAR also scores system A's translation higher than system B's translation. • The agreement rate is checked for each problem.
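The agreement check above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the name `agreement_rate`, the nested-dictionary layout, and the toy scores are hypothetical, and two choices are assumptions of this sketch: agreement is counted whenever both measures rank a pair of systems in the same order, and pairs tied under either measure are skipped.

```python
from itertools import combinations
from typing import Dict

# problem id -> system name -> score (hypothetical layout)
Scores = Dict[str, Dict[str, float]]


def agreement_rate(intrinsic: Scores, car: Scores) -> float:
    """Fraction of (problem, system pair) cases where both measures rank the pair alike."""
    agree = 0
    total = 0
    for problem, by_system in intrinsic.items():
        for a, b in combinations(sorted(by_system), 2):
            d_intrinsic = by_system[a] - by_system[b]
            d_car = car[problem][a] - car[problem][b]
            if d_intrinsic == 0 or d_car == 0:
                continue  # assumption of this sketch: ties are skipped
            total += 1
            if (d_intrinsic > 0) == (d_car > 0):
                agree += 1
    if total == 0:
        raise ValueError("no comparable system pairs")
    return agree / total


# Hypothetical toy scores for two systems on two problems.
intrinsic_scores = {"p1": {"G": 0.3, "Y": 0.5}, "p2": {"G": 0.6, "Y": 0.2}}
car_scores = {"p1": {"G": 0.4, "Y": 0.7}, "p2": {"G": 0.3, "Y": 0.8}}
print(agreement_rate(intrinsic_scores, car_scores))
```

In the toy data both measures prefer Y on p1 but disagree on p2, so half of the comparisons agree.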
  • 21. Agreement Rate • Agreement rates between automatic evaluation metrics and human evaluation
  • 22. Agreement Rate • Agreement rates between intrinsic evaluation metrics and correct answer rate (CAR)
  • 23. Agreement Rate • Human evaluation agrees with CAR slightly better than the automatic metrics do. • The agreement rate is still below 0.7. • CAR can be critically damaged by a subtle translation mistake.
  • 24. Conclusion • The comparison of the 4 systems shows that it is important to consider the context of individual sentences when translating dialogues. • SLPTs can evaluate a different dimension of translation quality. • SLPTs can be robust against the heterogeneity of human subjects.