Supporting the authoring process with linguistic software


Presented on Day 1 by Melanie Siegel, 13:45–14:30



  1. Supporting the Authoring Process with Linguistic Software (Melanie Siegel)
  2. The authoring process – and where it needs support
     Challenges for correctness:
     • time pressure
     • non-native writing
     • not enough capacity for careful proofreading
     Automatic support possibilities:
     • spell checking
     • grammar checking
  3. The authoring process – and where it needs support
     Challenges for understandability and readability:
     • authors are experts of subject and language – users often are not
     Automatic support possibilities:
     • style checking
  4. The authoring process – and where it needs support
     Challenges for consistency and corporate wording:
     • guidelines for corporate wording exist – in a large document on the shelf
     • terminology lists exist – in an Excel sheet somewhere in the file system
     • distributed writing
     Automatic support possibilities:
     • terminology checking
     • sentence clustering
  5. The authoring process – and where it needs support
     Challenges for translatability:
     • authors write without having the translation process in mind
     • lexical, syntactic and semantic ambiguity
     • translation costs depend on translation memory matches
     Automatic support possibilities:
     • style checking
     • terminology checking
  6. • tokenization
     • POS tagging
     • morphology
     • dictionary
     • error dictionary
  7. tokenization
     • Close the door of our XYZ car. (capital word, lower word, space, dot_EOS)
     • 花子が本を読んだ。 → 花子 が 本 を 読ん だ 。 (Kanji, Hiragana, dot_EOS)
     based on rules and lists of abbreviations
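The rule- and list-based tokenization described on this slide can be sketched in a few lines of Python. The abbreviation list here is a hypothetical stand-in for the real lists such a system would ship with:

```python
# Small abbreviation list (hypothetical entries): periods after these
# tokens do not end a sentence and stay attached to the word.
ABBREVIATIONS = {"e.g.", "etc.", "Dr.", "Inc."}

def tokenize(sentence):
    """Split on whitespace, then detach a sentence-final period (dot_EOS)
    unless the word is a known abbreviation."""
    tokens = []
    for word in sentence.split():
        if word.endswith(".") and word not in ABBREVIATIONS:
            tokens.append(word[:-1])  # the word itself
            tokens.append(".")        # dot_EOS as its own token
        else:
            tokens.append(word)
    return tokens

print(tokenize("Close the door of our XYZ car."))
# → ['Close', 'the', 'door', 'of', 'our', 'XYZ', 'car', '.']
```

A real tokenizer additionally needs script-aware rules for languages without spaces, as in the Japanese example above.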
  8. POS tagging
     Close/V the/DET door/N of/PREP our/PRON XYZ/NE car/N
     XML and attribute-value structures, statistical methods, large dictionaries
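A minimal sketch of the tagging step, using a toy lexicon built for the slide's example sentence. Real taggers combine the large dictionaries and statistical methods the slide mentions; the fallback rules here are illustrative assumptions only:

```python
# Toy lexicon for the slide's example sentence (hypothetical).
LEXICON = {
    "close": "V", "the": "DET", "door": "N", "of": "PREP",
    "our": "PRON", "car": "N",
}

def tag(tokens):
    """Assign a POS tag per token: lexicon lookup first, then a crude
    fallback that treats unknown all-caps words as named entities (NE)
    and everything else as a noun."""
    tags = []
    for tok in tokens:
        if tok.lower() in LEXICON:
            tags.append(LEXICON[tok.lower()])
        elif tok.isupper():
            tags.append("NE")   # e.g. product names like "XYZ"
        else:
            tags.append("N")    # default guess
    return tags

print(tag(["Close", "the", "door", "of", "our", "XYZ", "car"]))
# → ['V', 'DET', 'N', 'PREP', 'PRON', 'NE', 'N']
```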
  9. morphology
     • Close the door of our XYZ car.
       close: Lemma: close, Tense: present_imp, Person: third, Number: singular
       car: Lemma: car, Number: singular, Case: nominative_accusative
     based on dictionaries, rules for inflection and derivation
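The dictionary-plus-rules approach to morphology can be illustrated with a tiny suffix-rule lemmatizer. The rules below are a deliberately small assumption; a production component would consult full inflection and derivation tables:

```python
# Minimal inflection rules (suffix, replacement) – illustrative only.
RULES = [("ies", "y"), ("es", "e"), ("s", "")]

def analyze(word):
    """Return a small feature structure: lemma plus a guessed number value,
    based on suffix stripping."""
    for suffix, repl in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return {"lemma": word[: -len(suffix)] + repl, "number": "plural"}
    return {"lemma": word, "number": "singular"}

print(analyze("cars"))  # → {'lemma': 'car', 'number': 'plural'}
print(analyze("car"))   # → {'lemma': 'car', 'number': 'singular'}
```

Note that pure suffix stripping misanalyzes verbs ("closes" is third-person singular, not plural), which is exactly why real systems combine rules with dictionaries.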
  10. dictionary
     • words unknown to the standard NLP system
  11. language analysis vs. error analysis
     Language analysis:
     • words are defined in a dictionary
     • unknown words that are not defined as terms are error candidates
     • high recall, low precision (depending on the domain)
     • consider terminology
     Error analysis:
     • errors are defined in an error dictionary
     • anything not in the error dictionary is not flagged
     • based on words and rules
     • high precision, recall is dependent on data work
  12. error dictionary
     • stylesheet → style sheet
     • begginning → beginning
     • beleive → believe
     • definately → definitely
     • gotta → have to
     • hided → hid|hidden|hides
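An error dictionary lookup is straightforward to sketch; the entries below are the ones from this slide, with "|" separating alternative corrections:

```python
# Error dictionary: known misspellings/colloquialisms -> suggestions.
ERROR_DICT = {
    "stylesheet": "style sheet",
    "begginning": "beginning",
    "beleive": "believe",
    "definately": "definitely",
    "gotta": "have to",
    "hided": "hid|hidden|hides",
}

def check(tokens):
    """Return (token, suggestions) pairs for every token found in the
    error dictionary."""
    return [(t, ERROR_DICT[t.lower()].split("|"))
            for t in tokens if t.lower() in ERROR_DICT]

print(check(["I", "definately", "hided", "it"]))
# → [('definately', ['definitely']), ('hided', ['hid', 'hidden', 'hides'])]
```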
  13. why work on terminology?
     • avoid false alarms in spelling
     • consistency
     • less ambiguity
     • translatability
     • corporate wording
     Ultimate goal: 1 term – 1 meaning – 1 translation
  14. reality: variants
     • web server – web-server
     • upload protection – upload-protection
     • timeout – time out
     • Reset – ReSet
     • sub station – sub-station
  15. term variants
     – orthographic variants: hyphen, blank, case (term bank, termbank)
     – semi-orthographic variants:
       - number: 6-digit, six-digit
       - trademark: MyCompany™, MyCompany
     – syntactic variants:
       - preposition: oil level, level of oil
       - gerund/noun: call center, calling center
     – "classical" synonyms: vehicle, car
     – language-specific variants (e.g. Fugenelemente (linking elements) in German, Katakana spellings in Japanese)
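Orthographic variants of the kind listed above (hyphen, blank, case) can be detected by normalizing terms to a shared key. This is a minimal sketch covering only the orthographic class; syntactic variants and synonyms need deeper analysis:

```python
import re

def normalize(term):
    """Collapse hyphen/blank/case differences so orthographic variants
    of a term map to the same key (covers 'term bank' vs. 'termbank')."""
    return re.sub(r"[\s\-]+", "", term).lower()

def find_variants(terms):
    """Group terms whose normalized forms collide."""
    groups = {}
    for t in terms:
        groups.setdefault(normalize(t), []).append(t)
    return [g for g in groups.values() if len(g) > 1]

print(find_variants(["web server", "web-server", "timeout", "time out",
                     "Reset", "ReSet"]))
# → [['web server', 'web-server'], ['timeout', 'time out'], ['Reset', 'ReSet']]
```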
  16. how to get consistent terminology
     • author/company defines the term bank
     • list of deprecated terms (deprecated term: vehicle → approved term: car)
     • list of approved terms → automatic identification of variants
       (approved term: SWASSNet User; deprecated: SWASSNet user, SWASS-Net User)
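A deprecated-term check against such a term bank can be sketched as a case-sensitive lookup, using the slide's own entries as a hypothetical term bank. Matching case-sensitively is what lets the checker distinguish the approved "SWASSNet User" from its deprecated casing variants:

```python
# Hypothetical term bank built from the slide's examples:
# deprecated term -> approved term.
APPROVED = "SWASSNet User"
DEPRECATED = {
    "vehicle": "car",
    "SWASSNet user": APPROVED,
    "SWASS-Net User": APPROVED,
}

def check_terms(text):
    """Flag deprecated terms (case-sensitive substring match) and
    propose the approved replacement."""
    return [(dep, app) for dep, app in DEPRECATED.items() if dep in text]

print(check_terms("Ask the SWASSNet user to log in."))
# → [('SWASSNet user', 'SWASSNet User')]
```

A production checker would additionally use the morphology component, so inflected forms of deprecated terms are caught as well.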
  17. terminology and spelling
  18. terminology and spelling
  19. NLP for terminology
     • NLP methods for term extraction
       – corpus analysis (morphology, POS, NER)
       – information extraction (potential product names)
       – ontologies (e.g. semantic groups)
     • NLP methods for setting up a term database
       – morphology (finding the base form)
       – POS
     • NLP methods for term checking
       – variants
       – similar words
       – inflection
  20. approaches to grammar checking
     Descriptive grammar:
     • definition of correct grammar, e.g. HPSG, LFG, chunk grammar, statistical grammars
     • anything that's not analyzable must be a grammar error
     • preconditions: grammar with large coverage, large dictionaries, robust (but not too robust) parsing, efficient parsing methods
     • high recall, low precision
     Error grammar:
     • implementation of grammar errors
     • preconditions: work with error corpora, error grammar with a high number of error types
     • "deepness" of analysis varies with the type of error to be described
     • high precision, recall is based on the number of rules
  21. grammar rules, examples
     • subject-verb agreement:
       – Check if instructions are programmed in such a way that a scan never finish.
       – When the operations is completed, the return to home completes.
  22. grammar rules, examples
     • a/an distinction:
       – a isolating transformer
       – an program
     • wrong verb form:
       – it cannot communicates with them
       – IP can be automatically get
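The a/an distinction is a classic example of a simple grammar rule. The sketch below uses a vowel-letter heuristic; real usage depends on pronunciation ("an hour", "a university"), which a production checker handles with exception lists:

```python
import re

def check_articles(text):
    """Flag 'a'/'an' mismatches using a vowel-letter heuristic
    (a simplification, see exceptions like 'an hour')."""
    findings = []
    for m in re.finditer(r"\b(a|an)\s+(\w+)", text, re.IGNORECASE):
        article, word = m.group(1), m.group(2)
        needs_an = word[0].lower() in "aeiou"
        if needs_an != (article.lower() == "an"):
            findings.append((article, word))
    return findings

print(check_articles("Install a isolating transformer and an program."))
# → [('a', 'isolating'), ('an', 'program')]
```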
  23. example grammar rule* (write_words_together)
     @can ::= [ TOK "^(can)$" MORPH.READING.MCAT "^Verb$" ];
     Examples: The application can not start. / The application can tomorrow not start.
     TRIGGER(80) == @can^1 [@adv]* not^2
       -> ($can, $not)
       -> { mark: $can, $not;
            suggest: $can -> , $not -> cannot; }
     Negative evidence: Branch circuits can not only minimize system damage but can interrupt the flow of fault current
     NEG_EV(40) == $can not only @verbInf []* but;
     * implemented in Acrolinx
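The trigger/negative-evidence mechanism of this rule can be approximated in plain Python. This is only a sketch: the regular expressions below stand in for the Acrolinx pattern language, and the trigger approximates `[@adv]*` by allowing at most one intervening word:

```python
import re

# Trigger: "can [word] not"; negative evidence: "can not only ... but".
TRIGGER = re.compile(r"\bcan(?:\s+\w+)?\s+not\b", re.IGNORECASE)
NEG_EVIDENCE = re.compile(r"\bcan\s+not\s+only\b.*\bbut\b", re.IGNORECASE)

def check_can_not(sentence):
    """Return the suggested correction, or None if nothing is flagged."""
    if NEG_EVIDENCE.search(sentence):
        return None           # negative evidence suppresses the flag
    if TRIGGER.search(sentence):
        return "cannot"       # suggest merging "can not" into "cannot"
    return None

print(check_can_not("The application can not start."))  # → cannot
```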
  24. style – controlled language
     • controlled languages:
       – AeroSpace and Defence Industries Association of Europe (ASD) ASD-STE100 (Simplified Technical English)
       – Caterpillar Technical English (CTE)
     • disadvantages:
       – very restrictive
       – low acceptance by users
  25. style – moderately controlled language
     • rules define errors (like grammar rules)
     • rules (and instructional information) are defined by authors
     • implementation in authoring support systems
     • high acceptance
     • good usability
  26. style guidelines
     Different for different usages:
     – text type (e.g., press release vs. technical documentation)
     – domain (e.g., software vs. machines)
     – readers (e.g., end users vs. service personnel)
     – authors (e.g., Germans tend to write long sentences)
  27. style rule examples*: best practice
     • avoid_latin_expressions
     • avoid_modal_verbs
     • avoid_passive
     • avoid_split_infinitives
     • avoid_subjunctive
     • use_serial_comma
     • use_comma_after_introductory_phrase
     • spell_out_numerals
     * style rules implemented in Acrolinx
  28. style rule examples: company
     • use_units_consistently
     • abbreviate_currency
     • COMPANY_trademark
     • do_not_refer_to_COMPANY_intranet
     • add_tag_to_UI_string
     • avoid_trademark_as_noun
     • avoid_articles_in_title
  29. style rule examples: MT pre-editing
     • avoid_nested_sentences
     • avoid_ing_words
     • keep_two_verb_parts_together
     • avoid_parenthetical_expressions
     Dependent on the MT system and language pair.
  30. automatic suggestions for style rules
     – replacement of words or phrases
     – replacement with the correct uppercase or lowercase writing
     – replacement of words with the correct inflection
     – generation of whole sentences (e.g. passive → active) requires semantic analysis and generation and is therefore not (yet) possible
  31. example style rule* (avoid_future_tense)
     /* Example: ".. It will be necessary .." */
     TRIGGER(80) == @will^1 [-@comma]* @verbInf^2
       -> ($will, $verbInf)
       -> { mark: $will, $verbInf; }
     /* Example: ".. The router services will be offered in the future .." */
     NEG_EV(40) == $will []* @in @det @time;
     * implemented in Acrolinx
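As with the grammar rule earlier, this style rule's trigger and negative evidence can be approximated in Python. The regexes are simplified assumptions: the trigger only looks at the word after "will" (optionally skipping "be"), and the negative evidence only recognizes "in the future" as a licensing time expression:

```python
import re

# Sketch of avoid_future_tense: flag "will" + verb unless a time
# expression like "in the future" licenses it.
TRIGGER = re.compile(r"\bwill\s+(?:be\s+)?(\w+)", re.IGNORECASE)
NEG_EVIDENCE = re.compile(r"\bwill\b.*\bin\s+the\s+(?:near\s+)?future\b",
                          re.IGNORECASE)

def check_future_tense(sentence):
    """Return the marked ('will', verb) pair, or None when not flagged."""
    if NEG_EVIDENCE.search(sentence):
        return None
    m = TRIGGER.search(sentence)
    return ("will", m.group(1)) if m else None

print(check_future_tense("It will be necessary to restart."))
# → ('will', 'necessary')
```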
  32. consistent phrasing
     • Use the same phrase for the same meaning.
     • Examples:
       – Congratulations on acquiring your new wearable digital audio player
       – Congratulations, you have acquired your new wearable digital audio player!
       – Dear Customer, congratulations on purchasing the new wearable digital audio player!
  33. Acrolinx intelligent reuse™
     Acrolinx server: Terminology, Writing Standards, Grammar & Spelling, Intelligent Reuse
     Content/Translation Repository → Reuse Repository → Clusters
     Micro-clustering groups near-identical sentences (e.g. "the cat sat on the mat", "the cat sat on the carpet", "the cat sat on the doormat") into clusters for review and release, supporting redundancy and quality checks with filters.
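The micro-clustering idea can be approximated with a character-level similarity measure from the standard library. The greedy strategy and the threshold are illustrative choices, not Acrolinx's actual algorithm:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio between two sentences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def micro_cluster(sentences, threshold=0.8):
    """Greedy clustering: attach each sentence to the first cluster whose
    representative (first member) is similar enough, else open a new one."""
    clusters = []
    for s in sentences:
        for cluster in clusters:
            if similarity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

sentences = [
    "the cat sat on the mat",
    "the cat sat on the carpet",
    "The fish swam in the green water",
    "the cat sat on the doormat",
    "Fish swam in the blue water",
]
for cluster in micro_cluster(sentences):
    print(cluster)
```

Each resulting cluster is a candidate for consolidation into a single approved phrasing.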
  34. DEMO
  35. checking OpenOffice documentation
  36. correctness
  37. understandability
  38. consistency
  39. consistency
  40. translatability
  41. summary
     • The authoring process is challenging:
       – correctness
       – consistency
       – understandability
       – translatability
     • It can be effectively supported by NLP-enhanced tools
  42. Thank you! Melanie Siegel, Hochschule Darmstadt – University of Applied Sciences