Towards Standardizing Evaluation Test Sets for Compound Analysers

© Liaan L. Fourie, Martin Puttkammer & Menno Van Zaanen

Published in: Technology, Business

  1. 1. Towards Standardizing Evaluation Test Sets for Compound Analysers
  2. 2. Contents <ul><li>Introduction </li></ul><ul><li>Related work </li></ul><ul><ul><li>Evaluation of compound analysers </li></ul></ul><ul><ul><li>Parameters of standard test sets </li></ul></ul><ul><ul><li>Metrics for evaluation </li></ul></ul><ul><li>Experimental set-up </li></ul><ul><ul><li>Data </li></ul></ul><ul><ul><li>Experiments </li></ul></ul><ul><li>Results and discussion </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Introduction <ul><li>Importance of compound analysers </li></ul><ul><li>Compounding is a productive process </li></ul><ul><li>Evaluations of compound analysers differ </li></ul><ul><li>Results cannot be compared </li></ul><ul><li>Propose to standardize: </li></ul><ul><ul><li>Test set </li></ul></ul><ul><ul><li>Set of metrics </li></ul></ul><ul><li>Standardization will enable comparison </li></ul><ul><li>Problem: no standard test set </li></ul>
  4. 4. Evaluation of Compound Analysers <ul><li>Test set contains test data </li></ul><ul><li>Used to test compound analyser </li></ul><ul><li>Choice of test set can change outcome of evaluation </li></ul><ul><li>Should be representative of task </li></ul><ul><li>Should ensure reliability and validity </li></ul>
  5. 5. Parameters of Standard Test Sets <ul><li>Ensure set can be rebuilt </li></ul><ul><li>Include total number of data types </li></ul><ul><li>Represent properties of language </li></ul><ul><li>Well-balanced test set </li></ul>
  6. 6. Parameters of Standard Test Sets II <ul><li>Size is most important </li></ul><ul><li>Assume that larger test set = stable results (El-Emam, 2000) </li></ul><ul><li>Results will normalize </li></ul><ul><li>Minimum size is where results normalize </li></ul><ul><li>Optimum test set: </li></ul><ul><ul><li>Smallest set whose results are not statistically significantly different from those of larger sets </li></ul></ul>
  7. 7. Metrics for Evaluation <ul><li>There are three commonly used metrics: </li></ul><ul><ul><li>Accuracy (Schiller, 2006) </li></ul></ul><ul><ul><li>Error rate (Van Huyssteen & Van Zaanen, 2004) </li></ul></ul><ul><ul><li>Precision, Recall and F-score (Pilon et al., 2008) </li></ul></ul>
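The three metric families above can be sketched in a few lines. This is a minimal illustration only: it assumes that each analysis is a string with `+` marking compound boundaries, that accuracy and error rate are computed per word, and that precision and recall are computed over predicted boundaries. The function names and the boundary representation are assumptions made for this sketch, not definitions taken from the cited papers.

```python
def accuracy(gold, predicted):
    """Fraction of words whose predicted analysis equals the gold analysis."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def error_rate(gold, predicted):
    """Complement of accuracy: fraction of words analysed incorrectly."""
    return 1.0 - accuracy(gold, predicted)


def boundaries(analysis):
    """Character offsets of the '+' boundaries in an analysis string."""
    offsets = set()
    position = 0
    # Every part except the last one is followed by a boundary.
    for part in analysis.split("+")[:-1]:
        position += len(part.strip())
        offsets.add(position)
    return offsets


def precision_recall_f(gold, predicted):
    """Boundary-level precision, recall and F-score (assumed definition)."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        gold_b, pred_b = boundaries(g), boundaries(p)
        tp += len(gold_b & pred_b)   # boundaries found correctly
        fp += len(pred_b - gold_b)   # boundaries predicted but not in gold
        fn += len(gold_b - pred_b)   # gold boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical analyser output: one correct compound, one over-split non-compound.
gold = ["sjokolade + koek", "balkonnetjie"]
predicted = ["sjokolade + koek", "bal + kon + netjie"]
print(accuracy(gold, predicted), error_rate(gold, predicted))
print(precision_recall_f(gold, predicted))
```

In this toy case the analyser finds the one real boundary but adds two spurious ones, so recall is perfect while precision is low, which a single accuracy figure would not show.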
  8. 8. Data Categories <ul><li>Focus: determine what a standard test set looks like </li></ul><ul><li>Test set has three different categories of data: </li></ul><ul><ul><li>Compound words </li></ul></ul><ul><ul><li>Non-compound words </li></ul></ul><ul><ul><li>Error words: </li></ul></ul><ul><ul><ul><li>Real-world errors </li></ul></ul></ul><ul><ul><ul><li>Generated errors </li></ul></ul></ul>
  9. 9. Compound Words <ul><li>sjokoladekoek </li></ul><ul><li>(chocolate cake) </li></ul>
  10. 10. Compound Words II <ul><li>sjokoladekoek → sjokolade + koek </li></ul><ul><li>(chocolate cake) → (chocolate) + (cake) </li></ul>
  11. 11. Non-compound Words <ul><ul><li>balkonnetjie </li></ul></ul><ul><ul><li>(balcony + DIM) </li></ul></ul>
  12. 12. Non-compound Words II <ul><li>balkonnetjie → bal + kon + netjie </li></ul><ul><li>(balcony + DIM) </li></ul>
  13. 13. Error Words <ul><li>Real-world errors: </li></ul><ul><ul><li>aanbeve<b>e</b>lingsbrief </li></ul></ul><ul><ul><li>(letter of recommendation) </li></ul></ul><ul><li>Generated errors: </li></ul><ul><ul><li>prok<b>k</b>ureur </li></ul></ul><ul><ul><li>(lawyer) </li></ul></ul><ul><ul><li>bakkersl<b>u</b>sensie </li></ul></ul><ul><ul><li>(baker’s license) </li></ul></ul><ul><ul><li>gebr<b>uk</b>ersnaam </li></ul></ul><ul><ul><li>(username) </li></ul></ul>
  14. 14. Error Words II <ul><li>Real-world errors: </li></ul><ul><ul><li>aanbeve<b>e</b>lingsbrief → aanbeve<b>e</b>ling_s + brief </li></ul></ul><ul><ul><li>(letter of recommendation) → (recommendation) + (letter) </li></ul></ul><ul><li>Generated errors: </li></ul><ul><ul><li>prok<b>k</b>ureur → prok<b>k</b>ureur </li></ul></ul><ul><ul><li>(lawyer) </li></ul></ul><ul><ul><li>bakkersl<b>u</b>sensie → bakker_s + l<b>u</b>sensie </li></ul></ul><ul><ul><li>(baker’s license) → (baker) + (license) </li></ul></ul><ul><ul><li>gebr<b>uk</b>ersnaam → gebr<b>uk</b>er_s + naam </li></ul></ul><ul><ul><li>(username) → (user) + (name) </li></ul></ul>
  15. 15. Experiments <ul><li>TiMBL is used </li></ul><ul><li>Wordlists obtained from CKARMA (Van Huyssteen et al., 2005) </li></ul><ul><li>Words extracted for test purposes: </li></ul><ul><ul><li>5,000 compounds </li></ul></ul><ul><ul><li>2,500 non-compounds </li></ul></ul><ul><li>Two lists combined into training set </li></ul><ul><li>Errors file has 3,000 words: </li></ul><ul><ul><li>2,113 real-world errors </li></ul></ul><ul><ul><li>887 generated errors </li></ul></ul><ul><li>Substitution used to generate errors </li></ul><ul><ul><li>Search for assosiasie (association) </li></ul></ul><ul><ul><li>Create a<b>s</b>osiasiegrond (association ground) </li></ul></ul>
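The substitution step for generating errors can be sketched as follows. This is a minimal illustration under stated assumptions: a list of correctly spelled compounds and a table of known misspellings are given, and every compound containing a listed word gets a copy in which that word is replaced by its misspelling. The function name `generate_errors` and the tiny word lists are hypothetical and do not reproduce the authors' actual procedure or data.

```python
def generate_errors(compounds, misspellings):
    """Create misspelled compounds by substituting known misspellings.

    compounds    -- correctly spelled compound words
    misspellings -- mapping from a correct word to one misspelling of it
    """
    errors = []
    for word in compounds:
        for correct, wrong in misspellings.items():
            if correct in word:
                # Replace only the first occurrence so one error is introduced.
                errors.append(word.replace(correct, wrong, 1))
    return errors


# Hypothetical input, modelled on the assosiasie example on the slide.
compounds = ["assosiasiegrond", "sjokoladekoek"]
misspellings = {"assosiasie": "asosiasie"}
generated = generate_errors(compounds, misspellings)
print(generated)
```

Compounds that contain none of the listed words, such as sjokoladekoek here, produce no generated error.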
  16. 16. Experiment 1 <ul><li>[Diagram: data pools of 5000 compounds, 2500 non-compounds and 3000 errors; test-set sizes 250, 500, 750, ..., 5000; × 10] </li></ul>
  17. 17. Instance and Word Level <ul><li>Evaluation results calculated on two levels: </li></ul><ul><ul><li>Instance level </li></ul></ul><ul><ul><li>Word level </li></ul></ul><ul><li>For example: </li></ul><ul><ul><li>Consider the word stoelpoot (chair leg) </li></ul></ul><ul><ul><li>s + t + o + e + l + p + o + o + t </li></ul></ul><ul><ul><li>(c + h + a + i + r + l + e + g) </li></ul></ul><ul><ul><li>stoel + poot </li></ul></ul><ul><ul><li>(chair + leg) </li></ul></ul>
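The difference between the two levels can be sketched as below. This is an illustration only: it assumes one instance per character, labelled 1 when a compound boundary follows that character and 0 otherwise, and it assumes a word counts as correct only when all of its instances are correct. The slides do not give the actual instance encoding used with TiMBL, so the labelling scheme and function names here are assumptions.

```python
def to_instances(analysis):
    """One label per character: 1 if a '+' boundary follows it, else 0."""
    labels = []
    parts = [part.strip() for part in analysis.split("+")]
    for index, part in enumerate(parts):
        labels.extend([0] * len(part))
        # Mark the last character of every non-final part as a boundary.
        if index < len(parts) - 1 and labels:
            labels[-1] = 1
    return labels


def instance_and_word_accuracy(gold, predicted):
    """Return (instance-level accuracy, word-level accuracy)."""
    instances_correct = instances_total = words_correct = 0
    for g, p in zip(gold, predicted):
        gold_labels, pred_labels = to_instances(g), to_instances(p)
        instances_correct += sum(
            1 for a, b in zip(gold_labels, pred_labels) if a == b
        )
        instances_total += len(gold_labels)
        # A word is correct only if every one of its instances is correct.
        words_correct += int(gold_labels == pred_labels)
    return instances_correct / instances_total, words_correct / len(gold)


# Hypothetical output: the boundary in stoelpoot is missed.
gold = ["stoel + poot", "sjokolade + koek"]
predicted = ["stoelpoot", "sjokolade + koek"]
instance_acc, word_acc = instance_and_word_accuracy(gold, predicted)
print(instance_acc, word_acc)
```

A single missed boundary costs one of 22 instances but one of two words, which is why scores at the two levels can differ substantially.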
  18. 18. Results and Discussion <ul><li>Results compared for statistical significance </li></ul><ul><li>The null hypothesis, H<sub>0</sub>, of the experiment is: </li></ul><ul><ul><li>The performance of a smaller set cannot be shown to be significantly different compared to that of a larger set. </li></ul></ul><ul><li>F-test determines significant difference </li></ul><ul><li>p-value determines whether a result is significantly different </li></ul><ul><li>ANOVA and TukeyHSD summarize results </li></ul><ul><li>Determine significant difference between: </li></ul><ul><ul><li>Data types (compounds, non-compounds and errors) </li></ul></ul><ul><ul><li>Instance and word level </li></ul></ul><ul><ul><li>Different sizes </li></ul></ul>
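The F-test behind a one-way ANOVA can be sketched with the standard library alone. This is a minimal illustration, assuming each group holds the scores obtained for one test-set size; the scores below are made up for the example and are not results from the study. The p-value would then be read from the F distribution with (k − 1, n − k) degrees of freedom, typically via a statistics package; the slides do not say which software was used.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of score groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    k = len(groups)        # number of groups (e.g. test-set sizes)
    n = len(all_scores)    # total number of observations
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (score - mean(group)) ** 2 for group in groups for score in group
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical scores for three groups; NOT data from the experiments.
scores_by_group = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(scores_by_group)
print(f_statistic)
```

For these made-up groups the F statistic is approximately 13: the between-group mean square is 13 and the within-group mean square is 1. A large F relative to the F distribution leads to a small p-value, and a small F, as one would expect when test-set size has no effect, leads to a p-value close to 1.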
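  19. 19. ANOVA <table><tr><th>Variable</th><th>p-value</th></tr><tr><td>Compounds-NonCompounds-Errors</td><td>0.000</td></tr><tr><td>Instance-Word</td><td>0.000</td></tr><tr><td>Size</td><td>0.991</td></tr></table>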
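  20. 20. TukeyHSD <table><tr><th>Variable</th><th>p-value</th></tr><tr><td>Errors-Compounds</td><td>0.000</td></tr><tr><td>NonCompounds-Compounds</td><td>0.000</td></tr><tr><td>NonCompounds-Errors</td><td>0.000</td></tr><tr><td>Words-Instances</td><td>0.000</td></tr><tr><td>500-250</td><td>0.999</td></tr><tr><td>750-250</td><td>0.981</td></tr><tr><td>1000-250</td><td>0.997</td></tr><tr><td>1250-250</td><td>0.999</td></tr><tr><td>…</td><td>…</td></tr><tr><td>4750-4250</td><td>1.000</td></tr><tr><td>5000-4250</td><td>1.000</td></tr><tr><td>4750-4500</td><td>1.000</td></tr><tr><td>5000-4500</td><td>1.000</td></tr></table>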
  21. 21. Experiment 2 <ul><li>Expected goal not achieved </li></ul><ul><li>Experiment is repeated </li></ul><ul><li>[Diagram: data pools of 250 compounds, 250 non-compounds and 250 errors; test-set sizes 10, 20, 30, ..., 250; × 10] </li></ul>
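  22. 22. ANOVA <table><tr><th>Variable</th><th>p-value</th></tr><tr><td>Compounds-NonCompounds-Errors</td><td>0.000</td></tr><tr><td>Instance-Word</td><td>0.000</td></tr><tr><td>Size</td><td>0.000</td></tr></table>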
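  23. 23. TukeyHSD <table><tr><th>Variable</th><th>p-value</th></tr><tr><td>Errors-Compounds</td><td>0.000</td></tr><tr><td>NonCompounds-Compounds</td><td>0.000</td></tr><tr><td>NonCompounds-Errors</td><td>0.000</td></tr><tr><td>Words-Instances</td><td>0.000</td></tr><tr><td>20-10</td><td>0.986</td></tr><tr><td>30-10</td><td>0.375</td></tr><tr><td>40-10</td><td>0.0000005</td></tr><tr><td>50-10</td><td>0.044</td></tr><tr><td>…</td><td>…</td></tr><tr><td>250-30</td><td>0.322</td></tr><tr><td>50-40</td><td>0.783</td></tr><tr><td>60-40</td><td>1.000</td></tr><tr><td>…</td><td>…</td></tr></table>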
  24. 24. Conclusion <ul><li>Minimum number of words is 250 </li></ul><ul><li>Minimum size incremented by 250 for reliability </li></ul><ul><li>Standard test set consists of: </li></ul><ul><ul><li>500 compounds </li></ul></ul><ul><ul><li>500 non-compounds </li></ul></ul><ul><ul><li>500 errors </li></ul></ul><ul><li>Standard test set has 1500 words </li></ul><ul><li>Standard test set goal is achieved </li></ul><ul><li>Should be representative </li></ul><ul><li>Each data category should have enough examples </li></ul>
  25. 25. Future Work <ul><li>Test set not fully evaluated </li></ul><ul><li>Therefore, we need to compare: </li></ul><ul><ul><li>Evaluation metrics </li></ul></ul><ul><ul><li>Settings for algorithm </li></ul></ul><ul><ul><li>Algorithms </li></ul></ul><ul><ul><li>Training sets </li></ul></ul><ul><ul><li>Languages </li></ul></ul><ul><li>Results should remain statistically similar </li></ul>
  26. 26. Thank you!
