Commonsense knowledge for Machine Intelligence - part 2


These are the slides of the tutorial on commonsense knowledge for machine intelligence, presented by Dr. Niket Tandon, Dr. Aparna Varde, and Dr. Gerard de Melo at the CIKM 2017 conference.


*Part 2/3: Commonsense knowledge for detecting and correcting odd collocations in text*


Website: http://allenai.org/tutorials/csk/


  1. Part 2: Detecting and Correcting Odd Collocations in Text
     (Commonsense for Machine Intelligence: Text to Knowledge and Knowledge to Text)
  2. Introduction to Collocations
     • Correct native speaker expression in a given language
     • Strong tea (not powerful tea)
     • Clear sky (not pure sky)
     • Go home (not go to home)
     • Go to school (not go school)
     • House arrest (not arrest house)
     • Friend circle (not circle friend)
  3. Collocation Errors or Odd Collocations
     • Expressions that may be grammatically correct but are not typical among native speakers
     • Red meat & white meat are correct collocations in English
     • Their literal translations are odd collocations in German, not usually used by German speakers
     • Machine translation can often cause such collocation errors
     • These can be due to a lack of commonsense & world knowledge
  4. Collocations and Idioms
     • Some collocations are idiomatic expressions: “couch potato”
     • Literal idiom translation may be totally absurd: “sofa potato”
     • Note: correct idiom usage & translation is harder
     • Not all collocations are idioms, e.g., “fast cars” (vs “quick cars”)
     • Yet, correct collocation usage is important in many situations
  5. Motivation to Address Collocations – Daily Communication
     • A tourist wants “black coffee” (regular coffee without milk) in a coffee shop
     • Asks for “dark coffee” using online translation help
     • The server brings coffee with milk, made with the darkest coffee beans available
     • This is not what the tourist intended… What if he is lactose intolerant?
     • Note: a “coffee shop” in Amsterdam might mean something completely different: a place for drugs!
     • It is important to address collocations with commonsense & world knowledge
  6. Motivation to Address Collocations – Written Texts
     • Classic Bible quote (“the spirit is willing, but the flesh is weak”), also in Shakespeare’s Hamlet
     • Literal machine translation can yield a different meaning!
     • Collocations such as “willing spirit” & “weak flesh” must be translated with commonsense & reference to context
  7. Motivation to Address Collocations – Search Engines
     • The odd collocation “quick cars” returns fewer hits & less appropriate results
     • The correct collocation “fast cars” shows better sites & images of cars as good search results
     • Machine translation help for search engines should fix collocation errors
  8. Techniques to Address Odd Collocations
     • Treatment of Collocations
       • Different types of oddly collocated terms
       • Examples of each type with the problems caused
     • Linguistic Classification
       • Classifying terms as correct vs incorrect collocations
       • Considering associations / using the source language
     • Detection and Correction
       • Finding various incorrectly collocated terms using frequency etc.
       • Providing correct responses, similarity measures, ranking the suggestions
  9. Treatment of Collocations
     • Collocations are typically treated in different categories
     • Insertion Errors: adding a wrong term
     • Deletion Errors: omitting a required term
     • Transposition Errors: changing the order of terms
     • Substitution Errors: using one term instead of another
     • We briefly describe each type with examples and the problems they could cause
  10. Insertion Errors
     • These involve adding a term not appropriate in a correct native speaker expression:
       “I went to home” vs “I went home”
       “When will you return back from Singapore?” vs “When will you return from Singapore?”
       “Take a break for the lunch” vs “Take a break for lunch”
     • Article errors are quite common in this category (adding unnecessary articles)
     • Many of these errors involve grammatical mistakes
     • These types of errors create problems in
       • fluency of speech, especially at formal events
       • clarity of written documents
  11. Deletion Errors
     • These are the opposite of insertion errors & involve missing a term needed in an expression:
       “Einstein was scientist” vs “Einstein was a scientist”
       “Hire someone to do job” vs “Hire someone to do the job”
       “Let us wait her” vs “Let us wait for her”
     • They also create similar problems with respect to fluency and clarity
     • Many deletion errors also pertain to odd use of articles (omitting a necessary one)
     • Approaches in the literature for article error treatment are applicable here
     • These also often pertain to grammatical mistakes
  12. Transposition Errors
     • These errors occur when terms are not placed in the appropriate order
     • They could be more problematic than insertion & deletion errors:
       “Don’t talk with your full mouth” vs “Don’t talk with your mouth full”
       “How to make friendships close” vs “How to make close friendships”
     • They might convey the wrong meaning, e.g., talking with your full mouth is different from talking with your mouth full
     • Sometimes it’s almost the opposite meaning, e.g., close friendships vs friendships close
     • Often, knowing the native language of the speaker / origin of the source text might help here
  13. Substitution Errors
     • These involve using an inappropriate term in an expression instead of the term in correct usage:
       “This actor does money” vs “This actor makes money”
       “Where is the nearest quick food place?” vs “Where is the nearest fast food place?”
     • These are the most common type of collocation errors
     • They often cause miscommunication problems while talking, writing, searching etc.
     • Many approaches in the literature address mainly substitution errors
     • They can potentially be applied to address the other types as well
     • Incorporation of commonsense knowledge is particularly useful here
  14. Addressing Odd Collocations by Linguistic Classification
     • Some works focus on classifying collocation errors from a linguistic perspective
     • Using collocation measures on syntactic patterns for lexical classification as a correctly collocated term vs an error [Futagi et al., 2008]
     • Considering the source language (of an ESL learner or machine-generated text) to classify collocations [Dahlmeier and Ng, 2011]
  15. Collocation Measures on Syntactic Patterns [Futagi et al.]
     • This work addresses 7 aspects of lexical collocations
     • Collocation errors are lexically classified using candidate word strings
     • POS tagging of the texts is conducted, followed by pattern matching (see the sketch below)
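To make this step concrete, here is a minimal sketch (not the authors' code) of POS tagging followed by pattern matching, using NLTK. The adjective+noun pattern is just one illustrative choice among the syntactic patterns such a system could target, and NLTK resource names may vary across versions.

```python
import nltk

# One-time downloads; resource names may differ across NLTK versions
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def candidate_collocations(text):
    """Extract adjective+noun word strings as collocation candidates."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [(w1.lower(), w2.lower())
            for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
            if t1.startswith("JJ") and t2.startswith("NN")]

print(candidate_collocations("I drank a powerful tea under a pure sky."))
# [('powerful', 'tea'), ('pure', 'sky')]
```

Each extracted candidate (and its variants) would then be looked up in the reference database described on the next slide.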
  16. Collocation Measures on Syntactic Patterns (Contd.) [Futagi et al.]
     • After spell checking, variants of the word strings are built with articles, synonyms etc.
     • The word strings are looked up in a reference DB (RR DB) to find a match
     • If no match is found, the string is classified as a collocation error
  17. Collocation Measures on Syntactic Patterns (Contd.) [Futagi et al.]
     • Measure of collocation strength: the rank ratio statistic, computed from 1 billion words of native speaker texts, incorporating commonsense knowledge
     • When evaluated against a gold standard created with native speakers, this work gives around 85% precision in classification
     • This work does not provide correct suggestions as responses to collocation errors
  18. Source Language to Classify Collocations [Dahlmeier and Ng]
     • Errors are often caused by semantic similarity of words in the source language, called the L1 language
     • Literal translation to the destination language can cause collocation errors
     • Thus, L1-induced paraphrases are proposed for classifying collocations
     • Example: a single L1 word can have over a dozen English translations (look, see, watch, read etc.), so a possible literal translation from the source is “I like to look movies” instead of “I like to watch movies”
  19. Source Language to Classify Collocations (Contd.) [Dahlmeier and Ng]
     • NUCLE: an annotated 1-million-word corpus of 1,400 essays by ESL university students
     • Annotated with start & end offset, error type, and a gold standard correction
     • Incorporates commonsense knowledge from professional English instructors
     • They filter out preposition & article errors and focus on collocation errors involving semantics
     [Figure: statistics of the NUCLE analysis]
  20. Source Language to Classify Collocations (Contd.) [Dahlmeier and Ng]
     • Detected errors are classified as: Spelling, Homophone, Synonym, L1-transfer (see the sketch below)
     • Spelling: edit distance (erroneous phrase, correction) < threshold
     • Homophone: (erroneous word, correction) have the same pronunciation
     • Synonym: (erroneous word, correction) have similar meaning
     • L1-transfer: (erroneous phrase, correction) share a common translation
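The three monolingual checks can be approximated with standard lexical resources. Below is a minimal sketch, assuming NLTK's edit distance, the CMU pronouncing dictionary, and WordNet as stand-ins for the resources used in the paper; the edit-distance threshold is an illustrative assumption.

```python
import nltk
from nltk.corpus import cmudict, wordnet

for pkg in ("cmudict", "wordnet"):
    nltk.download(pkg, quiet=True)

PRON = cmudict.dict()  # word -> list of possible pronunciations

def classify_error(err, corr):
    """Heuristically type an (erroneous word, correction) pair."""
    # Spelling: small edit distance (a transposition counts as one edit)
    if nltk.edit_distance(err, corr, transpositions=True) <= 1:
        return "spelling"
    # Homophone: the two words share a pronunciation
    if any(p in PRON.get(corr, []) for p in PRON.get(err, [])):
        return "homophone"
    # Synonym: the two words share a WordNet synset
    if set(wordnet.synsets(err)) & set(wordnet.synsets(corr)):
        return "synonym"
    # L1-transfer needs a bilingual phrase table (see the next slide)
    return "other / possibly L1-transfer"

print(classify_error("recieve", "receive"))  # spelling
print(classify_error("their", "there"))      # homophone
print(classify_error("big", "large"))        # synonym
```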
  21. Source Language to Classify Collocations (Contd.) [Dahlmeier and Ng]
     • The number of errors in L1-transfer > other types
     • Extract English-L1 and L1-English phrases of at most 3 words
     • Phrase extraction heuristic: score paraphrase candidates by pivoting through the L1, p(e1|e2) = Σ_f p(e1|f) · p(f|e2), where f is a foreign (L1) language phrase
     • Translation probabilities p(e1|f), p(f|e2) are predicted by maximum likelihood estimation
     • Only phrases with probability > threshold (0.001 in this work) are kept
     • This serves as the basis for suggesting corrections (see the sketch below)
     [Figure: analysis of collocation errors]
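The pivot step can be sketched with a toy phrase table. Everything below is invented for illustration (the hypothetical L1 token "kan4" stands in for a source-language word with many English translations); the real system estimates the tables from a parallel corpus by maximum likelihood.

```python
from collections import defaultdict

# Toy phrase tables: p_e_given_f[f][e] ~ p(e|f), p_f_given_e[e][f] ~ p(f|e)
p_e_given_f = {"kan4": {"watch": 0.4, "look": 0.3, "see": 0.2, "read": 0.1}}
p_f_given_e = {"look": {"kan4": 0.5}}

def l1_paraphrases(e2, threshold=0.001):
    """Score paraphrase candidates e1 of e2 by pivoting through L1
    phrases f: p(e1|e2) = sum_f p(e1|f) * p(f|e2)."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e2, {}).items():
        for e1, p_e1 in p_e_given_f.get(f, {}).items():
            scores[e1] += p_e1 * p_f
    return {e1: p for e1, p in scores.items() if e1 != e2 and p > threshold}

print(l1_paraphrases("look"))
# {'watch': 0.2, 'see': 0.1, 'read': 0.05}
```

A flagged phrase like "look movies" would then be checked against paraphrases such as "watch movies" when classifying and correcting the collocation.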
  22. Discussion
     • These research works clearly focus more on lexical classification of collocation errors
     • Linguistic perspectives are significant here
     • Commonsense knowledge is included in collocation error classification using corpora from native speakers / English instructors
     • These works provide insight into the reasons for collocation errors and their grammatical placements
     • Such research heads towards proposing corrective measures
  23. Collocation Error Detection and Correction
     • These approaches develop tools for the actual detection and correction of collocation errors
     • AwkChecker: while a user writes a text document, flag collocation errors and suggest replacements that correspond closely to consensus usage, using word-level statistical n-grams [Park et al., 2008]
     • CollOrder: when a user enters a term in the tool, detect collocation errors and provide correctly ordered collocated responses as outputs, using an ensemble of similarity measures [Varghese et al., 2015]
  24. AwkChecker [Park et al.]
     • An end-user tool to correct collocation errors in written documents
     • As users write text, Awkward phrases are Checked by highlighting them (hence the name)
     • Users can click awkward phrases to see suggested replacements
     • The first tool for collocation error correction
     [Figure: AwkChecker’s user interface: (A) flagged phrases in the composition window; (B) suggested replacement for “powerful tea”]
  25. AwkChecker (Contd.) [Park et al.]
     • Builds statistical n-grams (sequences of n words) from a training corpus & records their frequencies
     • Analyzes user input against the corpus to decide whether a phrase is a collocation error
     • Flags an error if there exist similar phrases with frequency > the input phrase’s frequency
     • Generates replacements using the n-gram frequency based approach: candidates with much higher frequency are potential replacements (see the sketch below)
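A minimal sketch of this frequency test, assuming a pre-tokenized reference corpus and a thesaurus-style list of alternatives (hard-coded here for illustration; the frequency ratio of 5 is an invented cutoff):

```python
from collections import Counter
from itertools import islice

def bigram_counts(tokens):
    """Frequency table of word bigrams from a tokenized corpus."""
    return Counter(zip(tokens, islice(tokens, 1, None)))

def check_phrase(w1, w2, counts, alternatives, ratio=5):
    """Flag (w1, w2) if some variant (alt, w2) is at least `ratio` times
    more frequent in the reference corpus; return ranked replacements."""
    base = counts[(w1, w2)]
    suggestions = [(alt, counts[(alt, w2)]) for alt in alternatives
                   if counts[(alt, w2)] >= max(base, 1) * ratio]
    return sorted(suggestions, key=lambda s: -s[1])

counts = bigram_counts(("strong tea " * 50 + "powerful tea").split())
print(check_phrase("powerful", "tea", counts, alternatives={"strong", "weak"}))
# [('strong', 50)]
```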
  26. AwkChecker (Contd.) [Park et al.]
     • Statistical n-grams are used over relevant corpora, including Wikipedia
     • Helpful in capturing commonsense along with domain-specific knowledge using the frequency-based approach
     • Example: referring to a medical corpus to flag phrases that are awkward in medical research writing
     • Assumption: relevant corpora are correct more frequently than they are incorrect
     • Evaluation reveals usefulness in collocation correction, but details of accuracy are not discussed
  27. CollOrder [Varghese et al.]
     • Detects & corrects collocation errors in terms input to the tool
     • Outputs ranked responses of correctly collocated terms
     • Source of correct collocations: ANC / BNC (American / British National Corpus)
     • Includes commonsense knowledge from native speakers’ writings
     • Useful in Web queries, text documents, ESL translation etc.
     [Figure: approach in the CollOrder tool]
  28. CollOrder (Contd.) [Varghese et al.]
     • An ensemble of measures is used for similarity search and ranking (see the sketch below)
     • Conditional Probability: measures the relative occurrence of terms A & B
     • Jaccard’s Coefficient: measures the extent of semantic similarity between A & B
     • WebJaccard: reduces the adverse effects of random co-occurrence (due to scale & noise in Web data) [Bollegala et al., 2009]
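A rough sketch of these measures over Web hit counts (the counts below are hypothetical, and the cutoff c follows the idea in Bollegala et al. of suppressing low co-occurrence counts; its exact value here is an assumption):

```python
def conditional_probability(n_ab, n_b):
    """Relative occurrence of A with B: p(A|B) = H(A and B) / H(B)."""
    return n_ab / n_b if n_b else 0.0

def jaccard(n_a, n_b, n_ab):
    """H(A and B) / H(A or B) over hit counts."""
    denom = n_a + n_b - n_ab
    return n_ab / denom if denom else 0.0

def web_jaccard(n_a, n_b, n_ab, c=5):
    """Jaccard zeroed below a co-occurrence cutoff c, damping the
    random co-occurrence that plagues raw Web counts."""
    return 0.0 if n_ab <= c else jaccard(n_a, n_b, n_ab)

# Hypothetical hit counts: H("clear"), H("sky"), H("clear sky")
n_a, n_b, n_ab = 9_000_000, 4_000_000, 1_200_000
print(conditional_probability(n_ab, n_b))  # 0.3
print(jaccard(n_a, n_b, n_ab))             # ~0.102
print(web_jaccard(n_a, n_b, n_ab))         # same as Jaccard here
```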
  29. CollOrder (Contd.) [Varghese et al.]
     • These & other measures (Frequency Normalized, Frequency Ratio) are used [Varghese et al., 2015]
     • Different measures empirically yield good results in different scenarios
     • An ensemble of measures with classifiers is thus proposed to optimize performance (see the stand-in sketch below)
     • Classifier used: JRIP, an implementation of RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [Cohen, 1995]
     • CollOrder evaluation with MTurk on native speakers: average accuracy 92.44%
     • Example of ensemble learning by the classifier: “blue sky” is a valid suggestion, classified as “y”; “night sky” is not a valid suggestion, classified as “n”
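JRIP is Weka's rule learner; as a Python stand-in, any rule- or tree-based classifier over the similarity-measure scores illustrates how the ensemble labels candidate suggestions. The feature values and training pairs below are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# One row per candidate suggestion: scores from the similarity measures
# (conditional probability, Jaccard, WebJaccard); values are made up.
X_train = [
    [0.62, 0.41, 0.38],  # e.g., "blue sky"  -> valid suggestion
    [0.05, 0.02, 0.00],  # e.g., "night sky" -> not a valid suggestion
    [0.55, 0.35, 0.30],
    [0.08, 0.04, 0.01],
]
y_train = ["y", "n", "y", "n"]

# Stand-in for JRIP/RIPPER rule induction
clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(clf.predict([[0.58, 0.39, 0.33]]))  # expected: ['y']
```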
  30. Other Related Works
     • [Ramos et al., 2010] build an annotation schema with a 3D topology to classify collocations, mainly in Spanish & English translation:
       • the 1st dimension finds whether the error is for the whole or a part of the collocation
       • the 2nd dimension does language-oriented error analysis
       • the 3rd dimension does interpretive error analysis
     • [Li et al., 2009] use a probabilistic approach for collocation correction:
       • use the BNC and WordNet as language learning sources
       • suggest corrections based on commonly used expressions
       • do not develop a tool for collocation detection & correction
  31. Discussion
     • Collocation error correction tools in the literature are found useful by users
     • Commonsense knowledge from native speakers is typically embedded in the source corpora used for learning
     • Approaches in linguistic classification as well as in collocation correction rely heavily on frequency
     • Thus, potential issues related to sparse data with correct collocations call for further research
  32. Text to Knowledge and Knowledge to Text
     • Collocation approaches start with text and extract knowledge from corpora
     • Different methods are used for knowledge extraction: probabilistic, ensemble-based
     • The extracted knowledge is used for linguistic classification and error correction
     • The analysis in linguistic classification amounts to statistical text categorization
     • Error correction offers correctly collocated text responses as suggestions
     • Thus, the extracted knowledge serves to provide text-based outputs
     • Commonsense knowledge plays a role mainly in source corpora from native speakers & expert writings
     • This contributes to machine intelligence by providing better machine translation incorporating commonsense
  33. References
     • Bollegala, D., Matsuo, Y. and Ishizuka, M., Measuring the similarity between implicit semantic relations using web search engines, WSDM 2009, pp. 104–113.
     • Cohen, W., Fast effective rule induction. In Proceedings of the International Conference on Machine Learning, ICML 1995, pp. 115–123.
     • Dahlmeier, D. and Ng, H.T., Correcting semantic collocation errors with L1-induced paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, pp. 107–117.
     • Futagi, Y., Deane, P., Chodorow, M. and Tetreault, J., A computational approach to detecting collocation errors in the writing of non-native speakers of English, Computer Assisted Language Learning 2008, 21(4):353–367.
     • Li-E, L. A., Wible, D. and Tsao, N-L., Automated suggestions for miscollocations, Proceedings of the 4th Workshop on Innovative Use of NLP for Building Educational Applications, 2009, pp. 47–50.
     • Park, T., Lank, E., Poupart, P. and Terry, M., Is the sky pure today? AwkChecker: an assistive tool for detecting and correcting collocation errors, ACM Symposium on User Interface Software and Technology 2008, pp. 121–130.
     • Ramos, M.A., Wanner, L., Vincze, O., del Bosque, G.C., Veiga, N.V., Suárez, E.M. and González, S.P., Towards a Motivated Annotation Schema of Collocation Errors in Learner Corpora, LREC 2010, pp. 3209–3214.
     • Varghese, A., Varde, A., Peng, J. and Fitzpatrick, E., A framework for collocation error correction in Web pages and text documents, ACM SIGKDD Explorations 2015, 17(1):14–23.
