A Methodology and Tool Suite for
Evaluating the Accuracy of
Interoperating Statistical Natural
Language Processing Engines
Uma Murthy
Virginia Tech

John Pitrelli, Ganesh Ramaswamy,
Martin Franz, and Burn Lewis
IBM T.J. Watson Research Center


Interspeech
22-26 September 2008
Brisbane, Australia
Outline
•    Motivation
•    Context
•    Issues
•    Evaluation methodology
•    Example evaluation modules
•    Future directions


Motivation
•  Combining Natural Language Processing (NLP) engines for information
   processing in complex tasks
•  Evaluation of the output accuracy of individual NLP engines exists
   –  sliding window, BLEU score, word-error rate, etc.
•  No work on evaluation methods for large combinations, or aggregates,
   of NLP engines
   –  Foreign-language videos → transcription → translation →
      story segmentation → topic clustering


Project Goal

To develop a methodology and tool suite for evaluating the output
accuracy of interoperating statistical natural language processing
engines, in the context of IOD
Interoperability Demonstration System (IOD)

[Figure: IOD system architecture]

Built upon UIMA (Unstructured Information Management Architecture)
Issues
1.  How is the accuracy of one engine, or a set of engines,
    evaluated when it is part of an aggregate?
2.  What is the measure of accuracy of an
    aggregate and how can it be computed?
3.  How can the mechanics of this evaluation
    methodology be validated and tested?




“Evaluation Space”
•  Core of the evaluation methodology
•  Enumerates the options for comparing against ground truth, based on
   human-generated (H) and machine-generated (M) outputs at every stage
   in the pipeline (enumerated in the sketch below)



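To make the evaluation space concrete, here is a minimal Python sketch that enumerates the H/M labelings for the deck's four-stage example pipeline. The stage names come from the Motivation slide; restricting the space to labelings where ground truth feeds an unbroken prefix of the pipeline is my reading of the exclusion rule on the observations slide, not something the deck states in these terms.

```python
from itertools import product

STAGES = ("STT", "MT", "SBD", "TC")  # the deck's example pipeline

def evaluation_space(stages=STAGES):
    """All H/M labelings of the pipeline ('H' = human/ground-truth
    output at a stage, 'M' = machine output), restricted to labelings
    in which no machine stage is followed by a human stage. The
    restriction is an assumed reading of the deck's exclusion rule."""
    space = []
    for labels in product("HM", repeat=len(stages)):
        if "MH" not in "".join(labels):  # humans never post-process machine output
            space.append(labels)
    return space

# Five configurations for the four-stage pipeline, from all-human
# (the ground truth itself) to all-machine (the deployed aggregate):
for labels in evaluation_space():
    print("-".join(labels))
```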
[Figure: the evaluation space of H/M output combinations across the pipeline stages]
1.  Comparison between M-M-M… and H-H-H… evaluates the accuracy of
    the entire aggregate

2.  Emerging pattern

3.  Comparison of adjacent evaluations determines how much one engine
    (TC, topic clustering) degrades the accuracy of the aggregate
    (formalized in the sketch below)

4.  Do not consider H-M sequences

5.  Comparing two engines of the same function

6.  Assembling ground truths is the most expensive task
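Observation 3 admits a simple formalization: walk the prefix-H chain of configurations and attribute each adjacent score difference to the single engine that changed. The sketch below assumes end-to-end accuracy scores in [0, 1] are already available per configuration; the dict layout and the numbers in the example are hypothetical.

```python
def attribute_degradation(scores):
    """scores maps an H/M labeling (tuple, first pipeline stage
    leftmost) to the aggregate's end-to-end accuracy in [0, 1],
    measured against the all-human ground truth."""
    n = len(next(iter(scores)))
    # chain[k] runs ground truth through the first k stages and
    # machine engines thereafter.
    chain = [tuple(["H"] * k + ["M"] * (n - k)) for k in range(n + 1)]
    deltas = {}
    for k in range(n):
        # Swapping stage k's machine output for ground truth recovers
        # exactly the accuracy that engine k costs the aggregate.
        deltas[k] = scores[chain[k + 1]] - scores[chain[k]]
    return deltas

# Hypothetical scores for an STT -> MT -> SBD -> TC pipeline:
example = {
    ("M", "M", "M", "M"): 0.58, ("H", "M", "M", "M"): 0.66,
    ("H", "H", "M", "M"): 0.74, ("H", "H", "H", "M"): 0.81,
    ("H", "H", "H", "H"): 1.00,
}
print(attribute_degradation(example))  # per-stage accuracy deltas
```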
Evaluation Modules
•  Use the evaluation space as a template to automatically evaluate
   the performance of an aggregate
•  Development
    –  Explore the methods used to evaluate the last engine in the
       aggregate
    –  If required, modify these methods, considering
       •  Preceding engines and their inputs and outputs
       •  Different ground-truth formats
•  Testing
    –  Focus on validating the mechanics of evaluation, not the
       engines in question


Example Evaluation Modules
•  STTSBD
 – Sliding-window scheme
 – Automatically generated comparable
   ROC curves
   •  Validated module with six 30-minute Arabic
      news shows
•  STTMT
 – BLEU metric
 – Automatically generated BLEU scores
   •  Validated module with two Arabic-English MT
      engines on 38 minutes of audio
                                                    11
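As a reference for the STT → MT module, a minimal sentence-level, single-reference BLEU sketch follows (the module itself presumably computed corpus-level BLEU, possibly with multiple references, following Papineni et al., 2002; this only illustrates the metric's shape):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one tokenized candidate against one
    tokenized reference: clipped n-gram precisions combined by a
    geometric mean, times a brevity penalty."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped counts: credit each n-gram at most as often as it
        # occurs in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision drives the geometric mean to 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) > len(reference) else \
         math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

cand = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(round(bleu(cand, ref), 3))  # ~0.54
```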
Future Directions
•  Develop more evaluation modules and
   validate them
    –  Test with actual ground truths
    –  Test with more datasets
    –  Test on different engines (of the same
       kind)
•  Methodology
    –  Identify points of error
    –  How much does an engine impact the
       performance of the aggregate?


Summary
•  Presented a methodology for automatically evaluating the accuracy
   of aggregates of interoperating statistical NLP engines
   –  Evaluation space and evaluation modules
•  Developed and validated evaluation modules
   for two aggregates

•  Miles to go!
   –  Small portion of a vast research area

Thank You



Back-up Slides




Evaluation Module Implementation
•  Each module was implemented as a UIMA CAS consumer (shape sketched
   below)
•  Ground truth and other evaluation parameters were supplied as CAS
   consumer parameters




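The modules share a simple shape: configuration (including ground truth) supplied up front, one comparison per processed document, and a roll-up at the end. The Python analogue below is a sketch of that shape only; the real modules are Java UIMA CAS consumers, so all names here are illustrative rather than UIMA API.

```python
class EvaluationModule:
    """Python analogue of an evaluation module's logical shape."""

    def __init__(self, ground_truth):
        # Ground truth and other evaluation parameters arrive at
        # construction time, mirroring CAS-consumer configuration
        # parameters. Assumed layout: {doc_id: reference}.
        self.ground_truth = ground_truth
        self.scores = []

    def process(self, doc_id, system_output):
        # Called once per document (per CAS, in UIMA terms): compare
        # the aggregate's output for this document to ground truth.
        reference = self.ground_truth[doc_id]
        self.scores.append(self.score(system_output, reference))

    def score(self, system_output, reference):
        # Placeholder metric (exact-match rate); real modules plug in
        # the sliding-window or BLEU scoring described earlier.
        return float(system_output == reference)

    def collection_complete(self):
        # Aggregate per-document scores once the collection is done.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

module = EvaluationModule({"doc1": "reference text"})
module.process("doc1", "reference text")
print(module.collection_complete())  # 1.0
```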
Measuring the Performance of Story Boundary Detection
TDT-style sliding-window approach: partial credit for slightly
misplaced segment boundaries

[Figure: sliding windows around true and system boundaries]
• True and system boundaries agree within the window → Correct
• No system boundary in a window containing a true boundary → Miss
• System boundary in a window containing no true boundary → False Alarm

• Window length: 15 seconds
Source: Franz et al., “Breaking Translation Symmetry”


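A minimal Python sketch of this scoring, assuming boundaries are given as times in seconds, the 15-second window is centered on the true boundary, and matching is greedy one-to-one (the TDT tooling's exact conventions may differ):

```python
def score_boundaries(true_bounds, sys_bounds, window=15.0):
    """Windowed boundary scoring: a system boundary within window/2
    of a true boundary is Correct; unmatched true boundaries are
    Misses; unmatched system boundaries are False Alarms."""
    half = window / 2.0
    unmatched = sorted(sys_bounds)
    correct = misses = 0
    for t in sorted(true_bounds):
        # First still-unmatched system boundary inside the window.
        hit = next((s for s in unmatched if abs(s - t) <= half), None)
        if hit is None:
            misses += 1
        else:
            correct += 1
            unmatched.remove(hit)
    return {"correct": correct, "miss": misses,
            "false_alarm": len(unmatched)}

# True boundaries at 0, 300, 610 s; system says 2, 320, 608, 900 s:
print(score_boundaries([0.0, 300.0, 610.0], [2.0, 320.0, 608.0, 900.0]))
# -> {'correct': 2, 'miss': 1, 'false_alarm': 2}
```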
STTSBD Test Constraints
•  Ground truth availability: word-position-
   based story boundaries on ASR
   transcripts
  –  Transcripts were already segmented into
     sentences
•  For the pipeline (STTSBD) output, we
   needed to compare time-based story
   boundaries on Arabic speech

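Bridging the two representations might look like the sketch below; the per-word timing layout is an assumption, since the deck does not show the transcript format:

```python
def boundaries_to_times(boundary_word_indices, word_start_times):
    """Map word-position-based story boundaries (indices into the
    ASR transcript) to time-based boundaries on the audio. Assumed
    layout: word_start_times[i] is the start time in seconds of
    word i, as emitted by the recognizer."""
    return [word_start_times[i] for i in boundary_word_indices]

# Boundaries before words 0 and 5 of a toy seven-word transcript:
print(boundaries_to_times([0, 5], [0.0, 0.4, 0.9, 1.3, 1.8, 2.5, 3.1]))
# -> [0.0, 2.5]
```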
