A Methodology and Tool Suite for Evaluating the Accuracy of Interoperating NLP Engines

Paper presented at Interspeech 2008.

Murthy, U., Pitrelli, J., Ramaswamy, G., Franz, M. and Lewis, B. A Methodology and Tool Suite for Evaluation of Accuracy of Interoperating Statistical Natural Language Processing Engines. In Proc. of Interspeech 2008, Brisbane, Australia, 2008, pp 2066-2069.



Presentation Transcript

  • A Methodology and Tool Suite for Evaluating the Accuracy of Interoperating Statistical Natural Language Processing Engines
    Uma Murthy, Virginia Tech
    John Pitrelli, Ganesh Ramaswamy, Martin Franz, and Burn Lewis, IBM T.J. Watson Research Center
    Interspeech, 22-26 September 2008, Brisbane, Australia
  • Outline
    • Motivation
    • Context
    • Issues
    • Evaluation methodology
    • Example evaluation modules
    • Future directions
  • Motivation
    • Natural Language Processing (NLP) engines are combined for information processing in complex tasks
    • Methods exist for evaluating the output accuracy of individual NLP engines
      – sliding window, BLEU score, word-error rate, etc.
    • No work exists on evaluation methods for large combinations, or aggregates, of NLP engines
      – Foreign language videos → transcription → translation → story segmentation → topic clustering
  • Project Goal
    To develop a methodology and tool suite for evaluating the accuracy (of output) of interoperating statistical natural language processing engines in the context of IOD
  • Interoperability Demonstration System (IOD)
    Built upon UIMA
  • Issues
    1. How is the accuracy of one engine or a set of engines evaluated, in the context of being present in an aggregate?
    2. What is the measure of accuracy of an aggregate and how can it be computed?
    3. How can the mechanics of this evaluation methodology be validated and tested?
  • “Evaluation Space”
    • Core of the evaluation methodology
    • The space of possible comparisons between ground truth options, which are based on human-generated and machine-generated outputs at every stage in the pipeline
  • 1. Comparison between M-M-M… and H-H-H… evaluates the accuracy of the entire aggregate
    2. Emerging pattern
    3. Comparison of adjacent evaluations determines how much one engine (TC) degrades accuracy of the aggregate
    4. Do not consider H-M sequences
    5. Comparing two engines of the same function
    6. Assembling ground truths is the most expensive task
  • Evaluation Modules
    • Use the evaluation space as a template to automatically evaluate the performance of an aggregate
    • Development
      – Explore methods that are used to evaluate the last engine in the aggregate
      – If required, modify these methods, considering
        • Preceding engines and their input and output
        • Different ground truth formats
    • Testing
      – Focus on validating the mechanics of evaluation and not the engines in question
  • Example Evaluation Modules
    • STT→SBD
      – Sliding-window scheme
      – Automatically generated comparable ROC curves
        • Validated module with six 30-minute Arabic news shows
    • STT→MT
      – BLEU metric
      – Automatically generated BLEU scores
        • Validated module with two Arabic-English MT engines on 38 minutes of audio
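As an illustration of the BLEU metric named on this slide, here is a minimal Python sketch of corpus-level BLEU. It assumes one reference per segment, pre-tokenized input, uniform weights over 1- to 4-grams, and no smoothing. The function names are hypothetical, and this is not the scoring tool used in the presented work.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis.

    hypotheses, references: parallel lists of token lists.
    Returns a score in [0, 1]; no smoothing is applied, so any n-gram
    order with zero matches yields 0.0.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clipped counts: each hypothesis n-gram is credited at most
            # as many times as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(log_precision)
```

In an aggregate such as STT→MT, the hypotheses would be the translations of machine transcripts, so the score reflects errors from both engines rather than from the MT engine alone.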
  • Future Directions
    • Develop more evaluation modules and validate them
      – Test with actual ground truths
      – Test with more data-sets
      – Test on different engines (of the same kind)
    • Methodology
      – Identify points of error
      – How much does an engine impact the performance of the aggregate?
  • Summary
    • Presented a methodology for automatic evaluation of accuracy of aggregates of interoperating statistical NLP engines
      – Evaluation space and evaluation modules
    • Developed and validated evaluation modules for two aggregates
    • Miles to go!
      – Small portion of a vast research area
  • Thank You
  • Back-up Slides
  • Evaluation Module Implementation
    • Each module was implemented as a UIMA CAS consumer
    • Ground truth and other evaluation parameters were input as CAS Consumer parameters
  • Measuring the performance of story boundary detection
    TDT-style sliding window approach: partial credit for slightly misplaced segment boundaries
    • True and system boundaries agree within the window → Correct
    • No system boundary in a window containing a true boundary → Miss
    • System boundary in a window containing no true boundary → False Alarm
    • Window length: 15 seconds
    Source: Franz, et al. “Breaking Translation Symmetry”
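The sliding-window rules above can be sketched in Python as follows. This is a minimal illustration, not the module described in the talk: it assumes boundaries are given as times in seconds, the window is half-open, and the window advances in fixed steps. The `step` parameter and the function names are hypothetical; only the 15-second window length comes from the slide.

```python
from bisect import bisect_left


def has_boundary(sorted_times, start, end):
    """True if some boundary time t satisfies start <= t < end."""
    i = bisect_left(sorted_times, start)
    return i < len(sorted_times) and sorted_times[i] < end


def sliding_window_counts(true_times, system_times, duration, window=15.0, step=1.0):
    """Slide a window over [0, duration] and classify each position.

    correct:     window contains both a true and a system boundary
    miss:        window contains a true boundary but no system boundary
    false_alarm: window contains a system boundary but no true boundary
    Windows containing neither boundary only add to the window total.
    """
    true_times = sorted(true_times)
    system_times = sorted(system_times)
    counts = {"correct": 0, "miss": 0, "false_alarm": 0, "windows": 0}
    if duration < window:
        return counts
    n_positions = int((duration - window) // step) + 1
    for i in range(n_positions):
        start = i * step
        end = start + window
        has_true = has_boundary(true_times, start, end)
        has_system = has_boundary(system_times, start, end)
        if has_true and has_system:
            counts["correct"] += 1
        elif has_true:
            counts["miss"] += 1
        elif has_system:
            counts["false_alarm"] += 1
        counts["windows"] += 1
    return counts
```

For example, a true boundary at 20 s and a system boundary at 22 s in a 60-second show mostly fall in the same windows, so the system earns partial credit despite the 2-second offset. Miss and false-alarm rates computed from such counts at different detector thresholds are the kind of points an ROC curve would plot.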
  • STTSBD Test Constraints•  Ground truth availability: word-position- based story boundaries on ASR transcripts –  Transcripts were already segmented into sentences•  For the pipeline (STTSBD) output, we needed to compare time-based story boundaries on Arabic speech 18