SlideShare a Scribd company logo
A Methodology and Tool Suite for
Evaluating the Accuracy of
Interoperating Statistical Natural
Language Processing Engines
Uma Murthy
Virginia Tech

John Pitrelli, Ganesh Ramaswamy,
Martin Franz, and Burn Lewis
IBM T.J. Watson Research Center


Interspeech
22-26 September 2008
Brisbane, Australia
Outline
•    Motivation
•    Context
•    Issues
•    Evaluation methodology
•    Example evaluation modules
•    Future directions


                                  2
Motivation
•  Combining Natural Language Processing
   (NLP) engines for information processing in
   complex tasks
•  Evaluation of accuracy of output of individual
   NLP engines exists
   –  sliding window, BLEU score, word-error rate, etc.
•  No work on evaluation methods for large
   combinations, or aggregates, of NLP engines
   –  Foreign language videos  transcription 
      translation  story segmentation  topic
      clustering


                                                          3
Project Goal

To develop a methodology and tool suite for
   evaluating the accuracy (of output) of
 interoperating statistical natural language
            processing engines


           in the context of IOD


                                          4
Interoperability Demonstration
System (IOD)




                       Built upon UIMA

                                         5
Issues
1.  How is the accuracy of one engine or a set
    of engines evaluated, in the context of being
    present in an aggregate?
2.  What is the measure of accuracy of an
    aggregate and how can it be computed?
3.  How can the mechanics of this evaluation
    methodology be validated and tested?




                                                6
“Evaluation Space”
•  Core of the evaluation methodology
•  Various options of comparison of
   evaluation space of ground truth options
   based on human-generated and
   machine-generated outputs at every
   stage in the pipeline



                                          7
8
1.  Comparison between M-
    M-M… and H-H-H…
    evaluates the accuracy of
    the entire aggregate


2.  Emerging pattern

3.  Comparison of adjacent
    evaluations determines
    how much one engine
    (TC) degrades accuracy
    of the aggregate
4.  Do not consider H-M
    sequences

5.  Comparing two engines of
    the same function

6.  Assembling ground truths
    is the most expensive
    task

                          9
Evaluation Modules
•  Uses evaluation space as a template to automatically
   evaluate the performance of an aggregate
•  Development
    –  Explore methods that are used to evaluate the last
       engine in the aggregate
    –  If required, modify these methods, considering
       •  Preceding engines and, their input and output
       •  Different ground truth formats
•  Testing:
    –  Focus on validating the mechanics of evaluation and
       not the engines in question


                                                          10
Example Evaluation Modules
•  STTSBD
 – Sliding-window scheme
 – Automatically generated comparable
   ROC curves
   •  Validated module with six 30-minute Arabic
      news shows
•  STTMT
 – BLEU metric
 – Automatically generated BLEU scores
   •  Validated module with two Arabic-English MT
      engines on 38 minutes of audio
                                                    11
Future Directions
•  Develop more evaluation modules and
   validate them
    –  Test with actual ground truths
    –  Test with more data-sets
    –  Test on different engines (of the same
       kind)
•  Methodology
    –  Identify points of error
    –  How much does an engine impact the
       performance of the aggregate?


                                                12
Summary
•  Presented a methodology for automatic
   evaluation of accuracy of aggregates of
   interoperating statistical NLP engines
   –  Evaluation space and evaluation modules
•  Developed and validated evaluation modules
   for two aggregates

•  Miles to go!
   –  Small portion of a vast research area

                                                13
Thank You



      ?
            ?

                14
Back-up Slides




                 15
Evaluation Module Implementation
•  Each module was implemented as a
   UIMA CAS consumer
•  Ground truth and other evaluation
   parameters were input as CAS
   Consumer parameters




                                       16
Measuring the performance of
story boundary detection
TDT-style sliding window approach:
       partial credit for slightly misplaced segment boundaries




• True and system agree within the window t correct.
• No system boundary in a window containing a true boundary t Miss
• System boundary in a window containing no true boundary t False
Alarm

• Window length: 15 seconds
                                     Source: Franz, et al. “Breaking Translation Symmetry”


                                                                                  17
STTSBD Test Constraints
•  Ground truth availability: word-position-
   based story boundaries on ASR
   transcripts
  –  Transcripts were already segmented into
     sentences
•  For the pipeline (STTSBD) output, we
   needed to compare time-based story
   boundaries on Arabic speech

                                               18

More Related Content

Similar to A Methodology and Tool Suite for Evaluating the Accuracy of Interoperating NLP Engines

Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Dr. Cornelius Ludmann
 
Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...
Lionel Briand
 
An empirical evaluation of cost-based federated SPARQL query Processing Engines
An empirical evaluation of cost-based federated SPARQL query Processing EnginesAn empirical evaluation of cost-based federated SPARQL query Processing Engines
An empirical evaluation of cost-based federated SPARQL query Processing Engines
Umair Qudus
 
The DEBS Grand Challenge 2017
The DEBS Grand Challenge 2017The DEBS Grand Challenge 2017
The DEBS Grand Challenge 2017
Roman Katerinenko
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram
Praveen Penumathsa
 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
Boris Glavic
 
Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...
Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...
Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...
Lionel Briand
 
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Lionel Briand
 
Artificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software TestingArtificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software Testing
Lionel Briand
 
Performance Testing Java Applications
Performance Testing Java ApplicationsPerformance Testing Java Applications
Performance Testing Java Applications
C4Media
 
SSBSE 2020 keynote
SSBSE 2020 keynoteSSBSE 2020 keynote
SSBSE 2020 keynote
Shiva Nejati
 
Combinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learningCombinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learning
민재 정
 
techniques.ppt
techniques.ppttechniques.ppt
techniques.ppt
veeruyadav9
 
A practical guide for using Statistical Tests to assess Randomized Algorithms...
A practical guide for using Statistical Tests to assess Randomized Algorithms...A practical guide for using Statistical Tests to assess Randomized Algorithms...
A practical guide for using Statistical Tests to assess Randomized Algorithms...
Lionel Briand
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Lucidworks
 
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Lionel Briand
 
Pro smartbooksquestions
Pro smartbooksquestionsPro smartbooksquestions
Pro smartbooksquestions
yoummr
 
Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...
Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...
Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...
Chromatography & Mass Spectrometry Solutions
 
Applications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingApplications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security Testing
Lionel Briand
 
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and TacticalTLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
Anna Royzman
 

Similar to A Methodology and Tool Suite for Evaluating the Accuracy of Interoperating NLP Engines (20)

Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
 
Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...
 
An empirical evaluation of cost-based federated SPARQL query Processing Engines
An empirical evaluation of cost-based federated SPARQL query Processing EnginesAn empirical evaluation of cost-based federated SPARQL query Processing Engines
An empirical evaluation of cost-based federated SPARQL query Processing Engines
 
The DEBS Grand Challenge 2017
The DEBS Grand Challenge 2017The DEBS Grand Challenge 2017
The DEBS Grand Challenge 2017
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram
 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
 
Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...
Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...
Testing Autonomous Cars for Feature Interaction Failures using Many-Objective...
 
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
 
Artificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software TestingArtificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software Testing
 
Performance Testing Java Applications
Performance Testing Java ApplicationsPerformance Testing Java Applications
Performance Testing Java Applications
 
SSBSE 2020 keynote
SSBSE 2020 keynoteSSBSE 2020 keynote
SSBSE 2020 keynote
 
Combinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learningCombinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learning
 
techniques.ppt
techniques.ppttechniques.ppt
techniques.ppt
 
A practical guide for using Statistical Tests to assess Randomized Algorithms...
A practical guide for using Statistical Tests to assess Randomized Algorithms...A practical guide for using Statistical Tests to assess Randomized Algorithms...
A practical guide for using Statistical Tests to assess Randomized Algorithms...
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
 
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
 
Pro smartbooksquestions
Pro smartbooksquestionsPro smartbooksquestions
Pro smartbooksquestions
 
Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...
Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...
Chromatography Data System: Getting It “Right First Time” Seminar Series – Pa...
 
Applications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingApplications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security Testing
 
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and TacticalTLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 

A Methodology and Tool Suite for Evaluating the Accuracy of Interoperating NLP Engines

  • 1. A Methodology and Tool Suite for Evaluating the Accuracy of Interoperating Statistical Natural Language Processing Engines Uma Murthy Virginia Tech John Pitrelli, Ganesh Ramaswamy, Martin Franz, and Burn Lewis IBM T.J. Watson Research Center Interspeech 22-26 September 2008 Brisbane, Australia
  • 2. Outline •  Motivation •  Context •  Issues •  Evaluation methodology •  Example evaluation modules •  Future directions 2
  • 3. Motivation •  Combining Natural Language Processing (NLP) engines for information processing in complex tasks •  Evaluation of accuracy of output of individual NLP engines exists –  sliding window, BLEU score, word-error rate, etc. •  No work on evaluation methods for large combinations, or aggregates, of NLP engines –  Foreign language videos  transcription  translation  story segmentation  topic clustering 3
  • 4. Project Goal To develop a methodology and tool suite for evaluating the accuracy (of output) of interoperating statistical natural language processing engines in the context of IOD 4
  • 6. Issues 1.  How is the accuracy of one engine or a set of engines evaluated, in the context of being present in an aggregate? 2.  What is the measure of accuracy of an aggregate and how can it be computed? 3.  How can the mechanics of this evaluation methodology be validated and tested? 6
  • 7. “Evaluation Space” •  Core of the evaluation methodology •  Various options of comparison of evaluation space of ground truth options based on human-generated and machine-generated outputs at every stage in the pipeline 7
  • 8. 8
  • 9. 1.  Comparison between M- M-M… and H-H-H… evaluates the accuracy of the entire aggregate 2.  Emerging pattern 3.  Comparison of adjacent evaluations determines how much one engine (TC) degrades accuracy of the aggregate 4.  Do not consider H-M sequences 5.  Comparing two engines of the same function 6.  Assembling ground truths is the most expensive task 9
  • 10. Evaluation Modules •  Uses evaluation space as a template to automatically evaluate the performance of an aggregate •  Development –  Explore methods that are used to evaluate the last engine in the aggregate –  If required, modify these methods, considering •  Preceding engines and, their input and output •  Different ground truth formats •  Testing: –  Focus on validating the mechanics of evaluation and not the engines in question 10
  • 11. Example Evaluation Modules •  STTSBD – Sliding-window scheme – Automatically generated comparable ROC curves •  Validated module with six 30-minute Arabic news shows •  STTMT – BLEU metric – Automatically generated BLEU scores •  Validated module with two Arabic-English MT engines on 38 minutes of audio 11
  • 12. Future Directions •  Develop more evaluation modules and validate them –  Test with actual ground truths –  Test with more data-sets –  Test on different engines (of the same kind) •  Methodology –  Identify points of error –  How much does an engine impact the performance of the aggregate? 12
  • 13. Summary •  Presented a methodology for automatic evaluation of accuracy of aggregates of interoperating statistical NLP engines –  Evaluation space and evaluation modules •  Developed and validated evaluation modules for two aggregates •  Miles to go! –  Small portion of a vast research area 13
  • 14. Thank You ? ? 14
  • 16. Evaluation Module Implementation •  Each module was implemented as a UIMA CAS consumer •  Ground truth and other evaluation parameters were input as CAS Consumer parameters 16
  • 17. Measuring the performance of story boundary detection TDT-style sliding window approach: partial credit for slightly misplaced segment boundaries • True and system agree within the window t correct. • No system boundary in a window containing a true boundary t Miss • System boundary in a window containing no true boundary t False Alarm • Window length: 15 seconds Source: Franz, et al. “Breaking Translation Symmetry” 17
  • 18. STTSBD Test Constraints •  Ground truth availability: word-position- based story boundaries on ASR transcripts –  Transcripts were already segmented into sentences •  For the pipeline (STTSBD) output, we needed to compare time-based story boundaries on Arabic speech 18