A Methodology and Tool Suite for
Evaluating the Accuracy of
Interoperating Statistical Natural
Language Processing Engines
Uma Murthy
Virginia Tech

John Pitrelli, Ganesh Ramaswamy,
Martin Franz, and Burn Lewis
IBM T.J. Watson Research Center


Interspeech
22-26 September 2008
Brisbane, Australia
Outline
•  Motivation
•  Context
•  Issues
•  Evaluation methodology
•  Example evaluation modules
•  Future directions


Motivation
•  Combining Natural Language Processing
   (NLP) engines for information processing in
   complex tasks
•  Evaluation of accuracy of output of individual
   NLP engines exists
   –  sliding window, BLEU score, word-error rate, etc.
•  No work on evaluation methods for large
   combinations, or aggregates, of NLP engines
   –  Foreign language videos → transcription →
      translation → story segmentation → topic
      clustering


Project Goal

To develop a methodology and tool suite for
  evaluating the accuracy of the output of
 interoperating statistical natural language
            processing engines


           in the context of IOD


Interoperability Demonstration
System (IOD)




Built upon UIMA
Issues
1.  How is the accuracy of one engine or a set
    of engines evaluated, in the context of being
    present in an aggregate?
2.  What is the measure of accuracy of an
    aggregate and how can it be computed?
3.  How can the mechanics of this evaluation
    methodology be validated and tested?




“Evaluation Space”
•  Core of the evaluation methodology
•  Enumerates the ways ground-truth
   (human-generated) and machine-generated
   outputs can be compared at every stage in
   the pipeline



1.  Comparison between M-M-M… and H-H-H…
    evaluates the accuracy of the entire
    aggregate

2.  Emerging pattern

3.  Comparison of adjacent evaluations
    determines how much one engine (TC,
    topic clustering) degrades accuracy of
    the aggregate

4.  Do not consider H-M sequences

5.  Allows comparing two engines of the
    same function

6.  Assembling ground truths is the most
    expensive task

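The constraint in point 4 shrinks the space considerably: if a machine engine is never run on human-generated input, only sequences of the form M…M-H…H remain. A minimal Python sketch of this enumeration (stage names are illustrative, taken from the pipeline on the Motivation slide):

```python
from itertools import product

# Illustrative 4-stage pipeline: STT -> MT -> story segmentation -> TC
STAGES = ["STT", "MT", "SS", "TC"]

def evaluation_space(n_stages):
    """All H/M ground-truth sequences, excluding any that place a
    machine stage (M) after a human stage (H), per observation 4."""
    valid = []
    for seq in product("MH", repeat=n_stages):
        # reject sequences where a machine engine consumes human output
        if any(a == "H" and b == "M" for a, b in zip(seq, seq[1:])):
            continue
        valid.append("-".join(seq))
    return valid

print(evaluation_space(len(STAGES)))
# Comparing adjacent points, e.g. M-M-M-M vs M-M-M-H, isolates the
# contribution of the last engine (TC), as in observation 3.
```

For n stages this leaves n+1 sequences, so the pairwise comparisons stay tractable even for longer pipelines.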
Evaluation Modules
•  An evaluation module uses the evaluation space as a
   template to automatically evaluate the performance of
   an aggregate
•  Development
    –  Explore methods that are used to evaluate the last
       engine in the aggregate
    –  If required, modify these methods, considering
       •  Preceding engines and their inputs and outputs
       •  Different ground truth formats
•  Testing:
    –  Focus on validating the mechanics of evaluation and
       not the engines in question


Example Evaluation Modules
•  STTSBD
 – Sliding-window scheme
 – Automatically generated comparable
   ROC curves
   •  Validated module with six 30-minute Arabic
      news shows
•  STTMT
 – BLEU metric
 – Automatically generated BLEU scores
   •  Validated module with two Arabic-English MT
      engines on 38 minutes of audio
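For illustration of the metric used in the second module, a minimal single-reference BLEU can be sketched as follows. This is a simplified scorer (clipped n-gram precision with a brevity penalty), not the exact implementation used in the experiments:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # a zero precision collapses the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

hyp = "the cat sat on the mat".split()
print(bleu(hyp, hyp))  # identical sentences score 1.0
```

Production evaluations typically use a toolkit scorer (e.g. NIST mteval or NLTK's `sentence_bleu`) rather than a hand-rolled one, since smoothing and tokenization details matter.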
Future Directions
•  Develop more evaluation modules and
   validate them
    –  Test with actual ground truths
    –  Test with more datasets
    –  Test on different engines (of the same
       kind)
•  Methodology
    –  Identify points of error
    –  How much does an engine impact the
       performance of the aggregate?


Summary
•  Presented a methodology for automatic
   evaluation of accuracy of aggregates of
   interoperating statistical NLP engines
   –  Evaluation space and evaluation modules
•  Developed and validated evaluation modules
   for two aggregates

•  Miles to go!
   –  Small portion of a vast research area

Thank You



Back-up Slides




Evaluation Module Implementation
•  Each module was implemented as a
   UIMA CAS consumer
•  Ground truth and other evaluation
   parameters were supplied as CAS-consumer
   parameters




Measuring the performance of
story boundary detection
TDT-style sliding window approach:
       partial credit for slightly misplaced segment boundaries




• True and system boundaries agree within the window → Correct
• No system boundary in a window containing a true boundary → Miss
• System boundary in a window containing no true boundary → False Alarm

• Window length: 15 seconds
                                     Source: Franz et al., “Breaking Translation Symmetry”
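The decision rules above can be sketched roughly as follows. This assumes the 15-second window is centred on each true boundary (a tolerance of half a window on either side) and uses greedy one-to-one matching; both are plausible simplifications of the TDT scheme, not the exact scorer:

```python
WINDOW = 15.0  # window length in seconds, from the slide above

def score_boundaries(true_bounds, sys_bounds, window=WINDOW):
    """Count Correct / Miss / False-Alarm boundary decisions.
    A system boundary within window/2 seconds of a true boundary is
    Correct; each boundary is matched at most once (assumption)."""
    sys_left = list(sys_bounds)
    correct = miss = 0
    for t in true_bounds:
        match = next((s for s in sys_left if abs(s - t) <= window / 2), None)
        if match is None:
            miss += 1
        else:
            correct += 1
            sys_left.remove(match)  # consume the matched system boundary
    false_alarm = len(sys_left)     # unmatched system boundaries
    return correct, miss, false_alarm

print(score_boundaries([10.0, 100.0], [12.0, 50.0]))  # (1, 1, 1)
```

Sweeping a decision threshold in the boundary detector and re-scoring yields the miss/false-alarm trade-off behind the ROC curves mentioned earlier.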


STTSBD Test Constraints
•  Ground truth availability: word-position-based
   story boundaries on ASR transcripts
  –  Transcripts were already segmented into
     sentences
•  For the pipeline (STT → SBD) output, we
   needed to compare time-based story
   boundaries on Arabic speech

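One way to reconcile the two representations is to map each word-position boundary to the start time of that word in the recognizer's time-aligned output. A hypothetical sketch (the words and timestamps below are invented for illustration; a real system would take them from the ASR alignment):

```python
def boundaries_to_times(word_times, boundary_word_indices):
    """word_times: list of (word, start_seconds) in transcript order.
    Returns the start time of the first word of each new story."""
    return [word_times[i][1] for i in boundary_word_indices]

# Illustrative time-aligned transcript fragment
word_times = [("wa", 0.0), ("qala", 0.4), ("inna", 7.2), ("hatha", 7.6)]
print(boundaries_to_times(word_times, [2]))  # [7.2]
```

With both ground-truth and system boundaries expressed in seconds, the sliding-window scorer can be applied directly.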
