Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical domain - Natalia Grabar

Speaker: Natalia Grabar, NLP scientist

Summary: We propose a set of experiments with the general objective of ensuring a better understanding of technical health documents. Various experiments address different steps of this complex and ambitious process: (1) categorization of documents according to their complexity; (2) detection of complex passages within documents; (3) acquisition of resources for the lexical and semantic simplification of documents; (4) alignment of parallel sentences from comparable corpora for generating rules for syntactic transformation. According to the steps and tasks, various methods are exploited (rule-based, machine learning, with and without linguistic knowledge). In addition to text simplification, the results and resources can be used for other NLP applications and tasks (e.g., information retrieval and extraction, question-answering, textual entailment).

  1. Automatic text simplification in the biomedical domain. Natalia Grabar, STL CNRS UMR8163, France. Grammarly, Kyiv, Ukraine: 21/08/2018. Sections: Context, Difficulty, Paraphrases, Conclusion.
  2. Background. Lviv University: Languages, Linguistics.
  3. Background. Lviv University: Languages, Linguistics. Master, PhD (INaLCO, Université Paris 6): NLP, Medical area, Terminology.
  4. Background. Lviv University: Languages, Linguistics. Master, PhD (INaLCO, Université Paris 6): NLP, Medical area, Terminology. PostDoc, AHU (Inserm, Fondation HON Geneva): Information retrieval, Quality of information; Discourse analysis, Typology; Information for non-specialized users.
  5. Background. Lviv University: Languages, Linguistics. Master, PhD (INaLCO, Université Paris 6): NLP, Medical area, Terminology, Acquisition of lexical resources. PostDoc, AHU (Inserm, Fondation HON Geneva): Information retrieval, Quality of information; Discourse analysis, Typology; Information for non-specialized users. Researcher (CNRS): Information for non-specialized users; Semantic annotation, Information extraction.
  6. Automatic text simplification in the biomedical domain (work in French). Outline: 1 Context; 2 Detection of difficulties; 3 Acquisition of paraphrases; 4 Conclusion.
  7. Context. Evolution of the biomedical domain: specific knowledge and terms. Different kinds of users (medical staff, pharmacists, students, patients...) with various levels of specialization. Patients: quality of information, technicity and understanding of health information. ⇒ Close relation with the health and well-being of people (AMA, 1999; Berland et al., 2001; McCray, 2005; Tran et al., 2009).
  8. Readability of health documents. Health information must be readable, understandable, and usable in different situations: following up on treatments; making decisions (chronic disorders); communicating with medical doctors; making the healthcare process successful. Real difficulty: understanding the steps of correct drug intake (Patel et al., 2002); among 2,600 US patients (2 hospitals), 26% to 60% cannot understand instructions on drug intake, informed consent forms, or health brochures (Williams et al., 1995). Documents and health websites designed for patients often show high technicity (Berland et al., 2001).
  9. Objective. Make health documents and medical terms easier for patients to understand: detect reading difficulties; propose common paraphrases for technical terms. [Diagram: pipeline from Text to Simplified text through diagnosis of the text, detection of difficult words, and simplification/decoration, drawing on reference models, resources, and rules.] Interdisciplinary research: linguistics, psychology, terminology, NLP...
  10. Section 2: Detection of difficulties. Outline: 1 Context; 2 Detection of difficulties; 3 Acquisition of paraphrases; 4 Conclusion.
  11. Detection of difficulties (documents): existing work. Text typology; diagnosis of text readability. Classical measures: Flesch (Flesch, 1948), Fog (Gunning, 1973)... Computational measures: classical measures and medical vocabulary (Kokkinakis & Toporowska Gronostaj, 2006); character n-grams (Poprat et al., 2006); manual weighting of words (Zheng et al., 2002); morphology (Chmielik & Grabar, 2009); stylistic criteria (Grabar et al., 2007); discursive criteria (Goeuriot et al., 2007); various combinations (Wang, 2006; Zeng-Treitler et al., 2007; Goeuriot et al., 2007; Leroy et al., 2008)...
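The classical measures cited on this slide can be computed from simple counts. The sketch below uses the standard published constants of the Flesch Reading Ease score and the Gunning Fog index. Both formulas were calibrated on English, so using them on French medical text is an assumption, and the function names, parameter names, and sample counts are hypothetical and do not come from the talk.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def gunning_fog(n_words, n_sentences, n_complex_words):
    """Gunning Fog index: rough number of years of schooling needed.

    n_complex_words counts words of three or more syllables.
    """
    return 0.4 * ((n_words / n_sentences) + 100.0 * (n_complex_words / n_words))


# Hypothetical counts for a short passage (not taken from the talk).
ease = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
fog = gunning_fog(n_words=100, n_sentences=5, n_complex_words=10)
print(ease, fog)
```

Counting syllables and sentences is itself language-dependent, which is one reason the slide goes on to list computational measures beyond these formulas.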
  12. Detection of difficulties (documents): results (Chmielik & Grabar, 2009; Chmielik & Grabar, 2011). [Plots: lexical features and morphological features; axes range from 0 to 1.] Decision trees C4.5 (Quinlan, 1993), 10-fold cross-validation.
  13. Detection of difficulties (words): existing work. Facilitators: hyphen (Bertram et al., 2011), space (Frisson et al., 2008), morphological closeness (Lüttmann et al., 2011), primes (Bozic et al., 2007; Beyersmann et al., 2012), pictures (Dohmes et al., 2004; Koester & Schiller, 2011), etc. Morphological head (Jarema et al., 1999; Libben et al., 2003). NLP challenges (Specia et al., 2012): for a short text and a given word, several possible substitutions that fit the context are proposed → sort the substitutions according to their simplicity. Descriptors: Google n-grams, WordNet, word length, number of syllables, mutual information, frequency...
  14. Detection of difficulties. Psychology: eye-tracking (Grabar et al., 2018). Eye-tracking: recording eye movements during reading. Several indicators: fixations, periods during which the eyes are stable (visual information is analyzed); saccades, rapid movements of the eyes from one point to another; regressions, backward movements.
  15. Detection of difficulties: eye-tracking, text1.
      Original: EXAMEN : ECHOGRAPHIE DES MAINS ET DES PIEDS MOTIF : Bilan d’arthralgies Mains : On ne visualise pas de ténosynovite, ou d’arthrosynovite. Avant-pieds : On retrouve des remaniements intéressant les premières métatarsophalangiennes en rapport avec des antécédents de chirurgie d’Hallux valgus. Absence d’arthrosynovite au niveau des articulations métatarsophalangiennes.
      Simplified: EXAMEN : ECHOGRAPHIE DES MAINS ET DES PIEDS MOTIF : Bilan de douleurs articulaires Mains : On ne visualise pas d’inflammation des tendons, ni de la membrane articulaire. Avant-pieds : On retrouve des remaniements intéressants sur les premières articulations des pieds en rapport avec les antécédents de la chirurgie de la déformation du pied. Absence d’inflammation de la membrane au niveau des articulations du pied.
  16. Detection of difficulties: eye-tracking, text2.
      Original: Cette patiente avait constitué un infarctus du myocarde antérieur en novembre 2010, pour lequel avait été réalisée une angioplastie de l’IVA moyenne avec implantation d’un stent non actif Vision de 2.75 mm x 18 mm, un complément par angioplastie au ballon seul en aval. Une endoprothèse avait également été implantée au niveau de la circonflexe proximale, avec un stent Vision 2.5 x 18 mm. La fraction d’éjection était évaluée entre 35 et 40 %. Nous l’avions revue récemment, en insuffisance cardiaque, avec plusieurs autres problèmes : - une anémie microcytaire inexpliquée, - un déséquilibre important de son diabète pour lequel elle a été, entre temps, prise en charge par nos confrères diabétologues.
      Simplified: Cette patiente avait présenté une crise cardiaque en novembre 2010, pour laquelle avait été réalisée une intervention chirurgicale de l’artère cardiaque avec implantation d’un stent non actif. Un autre stent avait également été implanté au niveau d’une autre artère. La fraction d’éjection observée était basse. Nous l’avions revue récemment, en insuffisance cardiaque, avec plusieurs autres problèmes : - une anémie inexpliquée, - un déséquilibre important de son diabète pour lequel elle a été, entre temps,
  17. Detection of difficulties: eye-tracking. Results on text1.
  18. Detection of difficulties: eye-tracking. Results on text2.
  19. Detection of difficulties: eye-tracking results (df: degrees of freedom).
      text1
      Indicator         O        S       SD      p      df   t-test
      TRN           60.55    63.63    -3.08   0.23   45.00     1.22
      CRL           58.88    62.06    -3.19   0.22   45.00     1.25
      DPF          227.41   215.75    11.66   0.11   45.00     1.65
      NTF          587.61   370.48   217.14   0.00   45.00     7.38
      AMP            3.50     3.80    -0.30   0.02   45.00     2.44
      REG           27.26    21.21     6.06   0.05   45.00     2.05
      QCM         1304.35   869.57   434.78   0.02   21.00     2.08
      text2
      Indicator         O        S       SD      p      df   t-test
      TRN           62.73    59.67     3.06   0.22   45.00     1.24
      CRL           61.04    57.84     3.20   0.21   45.00     1.29
      DPF          214.73   214.69     0.04   0.50   45.00     0.68
      NTF          395.71   372.22    23.49   0.16   45.00     1.43
      AMP            3.33     3.82    -0.49   0.00   45.00     5.38
      REG           21.47    19.30     2.18   0.24   45.00     1.18
      QCM          602.77   538.95    63.82   0.00   21.00     2.08
      TRN, CRL: stable reading. DPF: no anticipation. NTF, AMP, REG: better significance on text1. QCM: better understanding with simplified versions.
  20. Detection of difficulties: NLP (Grabar et al., 2014). Medical words from Snomed International (Côté et al., 1993): 29,641 lemmatized words. Manually annotated by 3 independent annotators with three categories: (1) I can understand; (2) I am not sure; (3) I cannot understand. Inter-annotator agreement: Cohen’s Kappa 0.736. NLP task: supervised categorization to automatically reproduce the manual annotations: F=0.90. 24 descriptors: syntactic and morphological information, reference lexica, frequency, length, initial and final substrings, readability scores...
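As a reminder of what the agreement figure measures, here is a minimal sketch of Cohen's kappa for one pair of annotators using the three categories above. The slide does not say how the value for three annotators was aggregated, so the pairwise setting, the function name, and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations: 1 = I can understand, 2 = I am not sure, 3 = I cannot understand.
annotator_1 = [1, 1, 2, 3, 1, 3]
annotator_2 = [1, 2, 2, 3, 1, 3]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because the three categories are unlikely to be equally frequent.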
  21. Detection of difficulties: NLP.
  22. Detection of difficulties: typology. Abbreviations (OG, VG, PAPS, j, bat, cp); proper names (Gougerot, Sjögren, Bentall, Glasgow, Babinski, Barthel, Cockcroft); drug names; neoclassical compounds for disorders, procedures, treatments (pseudohémophilie, sclérodermie, hydrolase, tympanectomie, arthrodèse, synesthésie); borrowings from Latin or English; human anatomy (cloacal, pubovaginal, nasopharyngé, mitral, antre, inguinal, strontium, érythème, maxillo-facial, mésentère); lab test results.
  23. Section 3: Acquisition of paraphrases. Outline: 1 Context; 2 Detection of difficulties; 3 Acquisition of paraphrases; 4 Conclusion.
  24. Acquisition of paraphrases: existing work, general language. Revision of Simple Wikipedia articles (Yatskar et al., 2010): probabilistic models and filters; between 1,079 and 2,970 pairs, e.g. {stands for, is the same as}, {indigenous, native}; precision: 17% to 86%. Methods from machine translation (Zhu et al., 2010; Wubben et al., 2012): parallel and aligned corpora (Wikipedia/Simple Wikipedia). Distributional methods (Glavas & Stajner, 2015; Kim et al., 2016): monolingual corpora; vectors can contain equivalents that are easier to understand; filtering.
  25. Acquisition of paraphrases: existing work, medical language. Automatic translator of medical terms to general language (McCray et al., 1999): MEDLINEplus (brochures). Consumer Health Vocabulary (CHV) (Zeng & Tse, 2006): collaborative approach. Morpho-syntactic variants (Deléger & Zweigenbaum, 2008; Cartoni & Deléger, 2011): {consommation régulière, consommer de façon régulière}, {gêne à la lecture, empêche de lire}. Social media specificities (Tapi Nzali et al., 2015): misspellings {cirrhose, cyrose}, {métastase, metastase}; reduced words {oncologue, onco}, {chimiothérapie, chimio}.
  26. Acquisition of paraphrases. Definitions (Antoine & Grabar, 2017); reformulations (Antoine & Grabar, 2017); morphological composition (Grabar & Hamon, 2014; Grabar & Hamon, 2016).
  27. Definitions: methods. A definition is a structure with two elements: the definiendum (term to define) and the definiens (the definition), e.g. Myocarde est le tissu musculaire du coeur. Use of four patterns (Péry-Woodley & Rebeyrolle, 1998), with inflectional variants: désigne (means); est un (is a); est appelé (is called); peut être défini comme (can be defined as). Trigger: term.
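The pattern-based extraction described on this slide can be approximated with regular expressions. This is only a rough sketch under stated assumptions: the regexes are simplified stand-ins for the four patterns, cover few inflectional variants, and use plain substring matching for the term trigger. The names `PATTERNS`, `extract_definition`, and `TERMS` are hypothetical; the two test sentences are examples shown elsewhere in the deck.

```python
import re

# Simplified, hypothetical regexes for the four patterns on the slide.
# Inflectional variants are only partly covered (un/une, appelé/appelée...).
PATTERNS = [
    re.compile(r"^(?P<term>.+?) désigne (?P<definition>.+)$"),
    re.compile(r"^(?P<term>.+?) est (?P<definition>une? .+)$"),
    # With "est appelé", the definition comes first and the term last.
    re.compile(r"^(?P<definition>.+?) est appelée?s? (?P<term>.+)$"),
    re.compile(r"^(?P<term>.+?) peut être définie? comme (?P<definition>.+)$"),
]


def extract_definition(sentence, known_terms):
    """Return (term, definition) if a pattern matches around a known term."""
    text = sentence.strip().rstrip(".")
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        slot = match.group("term").lower()
        # Trigger: keep the match only if a known term fills the term slot.
        for term in known_terms:
            if term in slot:
                return term, match.group("definition")
    return None


TERMS = ["hypoglycémie", "péricarde"]
print(extract_definition("L'hypoglycémie est un manque de sucre dans l'organisme", TERMS))
print(extract_definition("La couche extérieure du cœur est appelée péricarde.", TERMS))
```

Surface patterns like these also match non-definitional sentences, which is consistent with the deck reporting both strict and weak precision for the extracted definitions.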
  28. Definitions: results. Extraction: 2,037 definitions; 1,286 unique terms. Evaluation: strict precision 52.5% (correct definitions: 849); weak precision 68% (correct and possibly correct definitions: 1,028). Types of terms: compound terms (hypoglycémie, acidocétose, angiographie, hypokaliémie); affixed terms (curetage, capsulite, arthrose, glaucome, durillon, pré-diabète); non-constructed terms (cataracte, impétigo, zona).
  29. Definitions: results (examples).
      L’hypoglycémie est un manque de sucre dans l’organisme
      Une septicémie est un empoisonnement du sang dû à un microbe
      Le curetage est un nettoyage en profondeur d’une gencive inflammée
      Pour un être humain adulte, une hypoglycémie est une glycémie inférieure à 0,8 g/L
      Les signes classiques annonciateurs de l’hypoglycémie sont des sueurs, pâleur, palpitations, fringales en particulier
      L’impétigo est une infection cutanée, qui provoque des pustules qui dégénèrent en croûtes jaunâtres, l’impétigo est due à...
  30. Definitions: results. Readability (péricarde):
      (+) La couche extérieure du cœur est appelée péricarde.
      (∼) Le péricarde est un sac à double paroi contenant le cœur et les racines des gros vaisseaux sanguins.
      (−) Le péricarde est un organe de glissement, formé de deux feuillets limitant une cavité virtuelle, la cavité péricardique, qui permet les mouvements cardiaques.
  31. Reformulations: motivation. Reformulation: saying something differently (Le Bot et al., 2008). The occurrence of a reformulation indicates the presence of difficult words/terms and provides triggers for the extraction. Exploit reliable data: health fora with moderators; Wikipedia.
  32. Reformulations: methods. Structure: concept, marker, reformulation, e.g. vésiculaire (concept), c’est-à-dire (marker), venant de la vésicule biliaire (reformulation). 3 markers: c’est-à-dire (I mean); autrement dit / Autrement dit (in other words); encore appelé(e)(s) (also called). Pre-processing: POS-tagging and syntactic analysis by Cordial (Laurent et al., 2009). Trigger: markers. Extraction of the concept and of the reformulation: syntactic information; boundaries: syntagms or propositions.
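The marker-triggered step can be illustrated with a regular expression over the three markers. Note the simplification: the method on the slide delimits the concept and the reformulation with the syntactic analysis from Cordial, whereas this sketch takes everything to the left and right of the marker within an already segmented string. The names `MARKER` and `split_on_marker` are hypothetical.

```python
import re

# The three markers from the slide; straight and curly apostrophes are accepted.
MARKER = re.compile(
    r"\s*,?\s*\b(c['’]est-à-dire|autrement dit|encore appelée?s?)\b\s*,?\s*",
    re.IGNORECASE,
)


def split_on_marker(segment):
    """Return (concept, marker, reformulation), or None if no marker occurs."""
    match = MARKER.search(segment)
    if match is None:
        return None
    # Naive boundaries: left context is the concept, right context the reformulation.
    concept = segment[:match.start()].strip(" ,")
    reformulation = segment[match.end():].strip(" ,.")
    return concept, match.group(1).lower(), reformulation


print(split_on_marker("vésiculaire, c'est-à-dire, venant de la vésicule biliaire"))
print(split_on_marker(
    "celle de troubles fonctionnels intestinaux encore appelés colopathie fonctionnelle"
))
```

On running text this naive split over-extends on both sides, which is the boundary-detection difficulty reported in the evaluation later in the deck.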
  33. Reformulations: example of the Cordial analysis.
      form           lemma          POS        POSMT     GS     type GS   Prop
      Vous           vous           PPER2P     Pp2.pn    1      S         1
      ne             ne             ADV        Rpn       3—1    S         1
      devez          devoir         VINDP2P    Vmip2p    3      V         1
      pas            pas            ADV        Rgn       3      Q         1
      employer       employer       VINF       Vmn–      5      D         2
      de             de             PREP       Sp        7      D         2
      savons         savon          NCMP       Ncmp      7      D         2
      ou             ou             COO        Cc        7      F         2
      des            de le          DETDPIG    Da-.p-i   10—7   F         2
      laits          lait           NCMP       Ncmp      10—7   F         2
      sophistiqués   sophistiqué    ADJMP      Afpmp     10—7   F         2
      ,              ,              PCTFAIB    Ypw       -      -         2
      c’             ce             PDS        Pd-..-    13     N         2
      est            est            ADV        Rgp       -      p         2
      -à             à              PREP       Sp        16     F         2
      -dire          dire           VINF       Vmn–      16     F         2
      contenant      contenant      NCMS       Ncms      17     D         2
      plusieurs      plusieurs      ADJIND     Dt-.p-    19     D         2
      composants     composant      NCMP       Ncmp      19     D         2
  36. Reformulations: evaluation. Data: development set, 96 occurrences (96 types); test set, 2,757 occurrences (2,710 types). Results: exact, P 0.24, R 0.24, F 0.24; inexact, P 0.98, R 0.98, F 0.98. Difficulties: detection of boundaries (en c’est-à-dire au contact du sang circulant; une toxi-infection, c’est-à-dire, qu’ elle peut); semantics (en 10 ans autrement dit sur 64 millions de personnes; un objectif c’est-à-dire une finalité).
  37. Reformulations: results (examples).
      des canaux galactophores c’est-à-dire sécrètent le lait
      erratiques c’est-à-dire qu’ils changent de d’aspect et d’endroit
      par une lithiase c’est-à-dire un caillou
      clivage du moi c’est-à-dire comme une opposition entre le moi et la réalité
      au gré de la désintégration radioactive du 18 F c’est-à-dire avec une demi-vie d’environ
      un trouble de l’identité sexuelle c’est-à-dire qu’ils s’identifient à un genre ne correspondant pas à leur sexe biologique
      une enzyme protéolytique c’est-à-dire digère les protéines comme le fait le suc pancréatique
      celle de troubles fonctionnels intestinaux encore appelés colopathie fonctionnelle
  38. Morphological composition. [Diagram: medical terms (POS-tagging, morphological analysis of components, translation) and corpus (POS-tagging, syntactic analysis), followed by alignment and evaluation.] Processing of terms: myocarde; myocarde/Nom; [[[myo N*] [carde N*] NOM] ique ADJ]; myo=muscle, carde=coeur. Processing of corpus: Les causes de tachycardie ventriculaire sont superposables à celles des extrasystoles ventriculaires: infarctus du myocarde, insuffisance cardiaque, hypertrophie du muscle du cœur et prolapsus de la valve mitrale.
  39. Morphological composition. [Diagram: medical terms (POS-tagging, morphological analysis of components, translation) and corpus (POS-tagging, syntactic analysis), followed by alignment and evaluation.] Processing of terms: myocarde; myocarde/Nom; [[[myo N*] [carde N*] NOM] ique ADJ]; myo=muscle, carde=coeur. Processing of corpus, with the aligned syntagm bracketed: Les causes de tachycardie ventriculaire sont superposables à celles des extrasystoles ventriculaires: infarctus du myocarde, insuffisance cardiaque, [hypertrophie du [muscle du cœur]] et prolapsus de la valve mitrale.
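The idea on this slide, decomposing a term into components, translating each component, and looking for a corpus syntagm that contains the translations, can be sketched as follows. This is a toy illustration under strong assumptions: the real pipeline relies on POS-tagging, a morphological analyzer, and syntactic analysis, while this sketch uses a hand-made lexicon, greedy string segmentation, and the shortest token window. The glosses for myo and carde come from the slide; the pathie entry and all function names are hypothetical.

```python
import re

# Hypothetical toy lexicon of neoclassical components and everyday equivalents.
COMPONENTS = {"myo": "muscle", "carde": "coeur", "pathie": "maladie"}

SENTENCE = (
    "Les causes de tachycardie ventriculaire sont superposables à celles des "
    "extrasystoles ventriculaires: infarctus du myocarde, insuffisance cardiaque, "
    "hypertrophie du muscle du cœur et prolapsus de la valve mitrale."
)


def decompose(term):
    """Split a term into known components, or return None if impossible."""
    if not term:
        return []
    for length in range(len(term), 0, -1):
        prefix = term[:length]
        if prefix in COMPONENTS:
            rest = decompose(term[length:])
            if rest is not None:
                return [prefix] + rest
    return None


def tokenize(text):
    """Lowercase, normalise the oe ligature, and keep word tokens only."""
    return re.findall(r"\w+", text.lower().replace("œ", "oe"))


def find_paraphrase(term, sentence):
    """Return the shortest token span containing every component gloss."""
    parts = decompose(term)
    if not parts:
        return None
    glosses = {COMPONENTS[p] for p in parts}
    tokens = tokenize(sentence)
    best = None
    for start in range(len(tokens)):
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end] in glosses:
                seen.add(tokens[end])
            if seen == glosses:
                if best is None or end - start < best[1] - best[0]:
                    best = (start, end)
                break
    if best is None:
        return None
    return " ".join(tokens[best[0]:best[1] + 1])


print(find_paraphrase("myocarde", SENTENCE))
```

A token window ignores punctuation and syntactic boundaries, so it can pair words from unrelated phrases; restricting candidates to syntagms, as on the slide, avoids that.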
  40. Morphological composition: results. Alignment syntagm/term (percentage of alignment): E1, full term and syntagm: {myo pathie, maladie du muscle}; E2, full term, partial syntagm: {myo pathie, maladie du muscle cardiaque}; E3, partial term, full syntagm: {myopathie, la maladie}; E4, partial term and syntagm: {myopathie, l’ origine de la maladie}.
  41. Morphological composition: evaluation.
                                unigrams           bigrams           trigrams
                                b     l     s      b     l     s      b     l     s
      correct paraphrases     549   785   644    378   517   461    195   290   257
      poss. correct            39    32    67     22    45    75     10    19    41
      processing of terms      47    60    44     28    28    46      9    10    26
      incorrect paraphrases    33   146   296     64    80   380     25    39   148
      Pstrict                  82    77    61     77    77    48     82    81    55
      Pweak                    88    80    68     81    84    40     86    86    63
      %incorrect                5    14    28     13    12    39     11    11    31
      Evaluation: strict precision 82 to 55%; weak precision 86 to 40%; error rate 5 to 39%. Resources: without, the best precision; morphology, good precision; synonymy, low precision.
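The percentage rows for the unigram columns can be reproduced from the four count rows. The sketch below assumes that the denominator is the sum of the four counts in a column; the slide does not state this, it is inferred from the numbers. The function name is hypothetical, and the counts are copied from the unigram columns b, l, and s.

```python
def precision_summary(correct, possibly_correct, term_processing, incorrect):
    """Strict precision, weak precision and error rate as rounded percentages."""
    total = correct + possibly_correct + term_processing + incorrect
    strict = 100 * correct / total
    weak = 100 * (correct + possibly_correct) / total
    error = 100 * incorrect / total
    return round(strict), round(weak), round(error)


# Unigram counts for columns b, l and s, copied from the slide.
unigram_b = precision_summary(549, 39, 47, 33)
unigram_l = precision_summary(785, 32, 60, 146)
unigram_s = precision_summary(644, 67, 44, 296)
print(unigram_b, unigram_l, unigram_s)
```

Strict precision counts only the correct paraphrases, while weak precision also accepts the possibly correct ones, which is why Pweak is the higher of the two in these columns.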
  42. Morphological composition: morphological analysis. Ambiguous analysis: [post [[uro N*] [graphie N*] NOM] NOM] vs. [[posturo N*] [graphie N*] NOM]. Incorrect analysis: sanglot: lot and sang; exotique: externe and oreille; divin: deux and vin (deux litres de vin).
  43. Morphological composition: extraction of paraphrases and their evaluation. Correct paraphrases. Raw: {podalgie, douleur du pied}; {mastite, inflammation du sein}; {cystoprostatectomie, ablation de la vessie et de la prostate}. Morphology: {desmorrhexie, rupture des ligaments} (ligament→ligaments); {bronchite, inflammation des bronches/inflammation bronchique} (bronche→bronches, bronche→bronchique); {dentalgie, douleurs dentaires} (dents→dentaires). Synonymy: {aclasie, absence de fracture} (cassure→fracture); {enterectomie, résection des intestins} (ablation→résection).
  44. Morphological composition: extraction of paraphrases and their evaluation. Semantic relations between components: well managed by data from corpora; errors with coordination/subordination, e.g. hematospermie: le sang ou le sperme, instead of → le sang dans le sperme. Non-compositional terms, e.g. ostéodermie: peau and os, instead of → une structure d’écailles, de plaques osseuses ou d’autres compositions dans les couches dermiques de la peau, comme chez les lézards ou dinosaures.
  45. Comparison with existing work.
      work / method                    term type    nb. para       precision
      (Zeng et al., 2006)              all          CHV
      (Elhadad & Sutaria, 2007)        all          152            0.58
      (Deléger & Zweigenbaum, 2008)    m-synt.      65, 82         0.67, 0.60
      (Cartoni & Deléger, 2011)        m-synt.      109            0.66
      definitions                      all          1,028          0.52, 0.68
      morphology                       compounds    1,128          0.76, 0.86
      abbreviations                    abbr.        42, 8,106      0.74/0.94
      reformulation                    all          96, 2,710      0.24/0.98
      parentheses                      all          305, 92,971    0.23/0.68
      m-synt. = morpho-syntactic: {consommation régulière, consommer de façon régulière}. Comparable performance, better coverage.
  46. Comparison with existing work. DériF (Namer, 2003): gloss in formal language for every analyzed word. Our method: coverage depends on the content of corpora. myocarde: DériF gloss ”(Partie de – Type particulier de) coeur en rapport avec le(s) muscle”; our method: muscle du coeur. desmorrhexie: DériF gloss ”rupture (du – liée au) ligament”; our method: rupture des ligaments.
  47. Conclusion. Detection of difficulties in reading and understanding. Acquisition of resources for explaining technical terms. Methods dedicated to different kinds of linguistic phenomena: paraphrases, reformulations... Exploitation of general language corpora. Complementary methods. Interesting and exploitable results. Work in French. [Diagram: pipeline from Text to Simplified text through diagnosis of the text, detection of difficult words, and simplification/decoration.]
  48. Future work. Increase the coverage of paraphrases and reformulations: more corpora, both comparable (Cochrane, patient package inserts, Wiki/Viki) and monolingual; more suppletive resources; other methods for extracting the paraphrases. Alignment with medical terminologies. Distribution of the resource. Other languages. Lexical simplification of medical texts. ANR project CLEAR (Communication, Literacy, Education, Accessibility, Readability). [Diagram: pipeline from Text to Simplified text through diagnosis of the text, detection of difficult words, and simplification/decoration.]
  49. 49. Context Difficulty Paraphrases Conclusion AMA (1999). Health literacy: report of the council on scientific affairs. Ad hoc committee on health literacy for the council on scientific affairs, American Medical Association. JAMA, 281(6), 552–7. Antoine, E. & Grabar, N. (2017). Acquisition of expert/non-expert vocabulary from reformulations. In MIE, Stud Health Technol Inform. 235, pp. 521–525. Berland, G., Elliott, M., Morales, L., Algazy, J., Kravitz, R., Broder, M., Kanouse, D., Munoz, J., Puyol, J. & et al, M. L. (2001). Health information on the internet. accessibility, quality, and readability in english ans spanish. JAMA, 285(20), 2612–2621. Bertram, R., Kuperman, V., Baayen, H. R. & Hy¨on¨a, J. (2011). The hyphen as a segmentation cue in triconstituent compound processing: It’s getting better all the time. Scandinavian Journal of Psychology, 52(6), 530–544. Beyersmann, E., Coltheart, M. & Castles, A. (2012). Parallel processing of whole words and morphemes in visual word recognition. The Quarterly Journal of Experimental Psychology, 65(9), 1798–1819. Bozic, M., Marslen-Wilson, W. D., Stamatakis, E. A., Davis, M. H. & Tyler, L. K. (2007). Differentiating morphology, form, and meaning: Neural correlates of morphological complexity. 45/45 Automatic text simplification in biomedical domain Natalia Grabar
  50. 50. Context Difficulty Paraphrases Conclusion Journal of Cognitive Neuroscience, 19(9), 1464–1475. Cartoni, B. & Del´eger, L. (2011). D´ecouverte de patrons paraphrastiques en corpus comparable: une approche bas´ee sur les n-grammes. In Traitement Automatique des Langues Naturelles (TALN). Chmielik, J. & Grabar, N. (2009). Comparative study between expert and non-expert biomedical writings: their morphology and semantics. Stud Health Technol Inform., 150, 359–63. Chmielik, J. & Grabar, N. (2011). D´etection de la sp´ecialisation scientifique et technique des documents biom´edicaux grˆace aux informations morphologiques. TAL, 51(2), 151–179. Cˆot´e, R. A., Rothwell, D. J., Palotay, J. L., Beckett, R. S. & Brochu, L. (1993). The Systematised Nomenclature of Human and Veterinary Medicine: SNOMED International. Northfield: College of American Pathologists. Del´eger, L. & Zweigenbaum, P. (2008). Paraphrase acquisition from comparable medical corpora of specialized and lay texts. In Ann Symp Am Med Inform Assoc (AMIA), pp. 146–50. Dohmes, P., Zwitserlood, P. & B¨olte, J. (2004).45/45 Automatic text simplification in biomedical domain Natalia Grabar
  51. 51. Context Difficulty Paraphrases Conclusion The impact of semantic transparency of morphologically complex words on picture naming. Brain and Language, 90(1-3), 203–212. Elhadad, N. & Sutaria, K. (2007). Mining a lexicon of technical terms and lay equivalents. In BioNLP, pp. 49–56. Flesch, R. (1948). A new readability yardstick. Journ Appl Psychol, 23, 221–233. Frisson, S., Niswander-Klement, E. & Pollatsek, A. (2008). The role of semantic transparency in the processing of english compound words. Br J Psychol, 99(1), 87–107. Glavas, G. & Stajner, S. (2015). Simplifying lexical simplification: Do we need simplified corpora? In ACL-COLING, pp. 63–68. Goeuriot, L., Grabar, N. & Daille, B. (2007). Caract´erisation des discours scientifique et vulgaris´e en fran¸cais, japonais et russe. In Traitement Automatique des Langues Naturelles (TALN), pp. 93–102. Grabar, N., Farce, E. & Sparrow, L. (2018). ´Etude de la lisibilit´e des documents de sant´e avec des m´ethodes d’oculom´etrie. In Traitement Automatique des Langues Naturelles (TALN), pp. 1–14. 45/45 Automatic text simplification in biomedical domain Natalia Grabar
  52. 52. Context Difficulty Paraphrases Conclusion Grabar, N. & Hamon, T. (2014). Automatic extraction of layman names for technical medical terms. In ICHI 2014, Pavia, Italy. Grabar, N. & Hamon, T. (2016). Exploitation de la morphologie pour l’extraction automatique de paraphrases grand public des termes m´edicaux. TAL, 57(1), 85–109. Grabar, N., Hamon, T. & Amiot, D. (2014). Automatic diagnosis of understanding of medical words. In EACL PITR Workshop, pp. 11–20. Grabar, N., Krivine, S. & Jaulent, M. (2007). Classification of health webpages as expert and non expert with a reduced set of cross-language features. In Ann Symp Am Med Inform Assoc (AMIA), pp. 284–288. Gunning, R. (1973). The art of clear writing. New York, NY: McGraw Hill. Jarema, G., Busson, C., Nikolova, R., Tsapkini, K. & Libben, G. (1999). Processing compounds: A cross-linguistic study. Brain and Language, 68(1-2), 362–369. Kim, Y.-S., Hullman, J., Burgess, M. & Adar, E. (2016). Simplescience: Lexical simplification of scientific terminology. In EMNLP, pp. 1–6.45/45 Automatic text simplification in biomedical domain Natalia Grabar
  53. 53. Koester, D. & Schiller, N. O. (2011). The functional neuroanatomy of morphology in language production. NeuroImage, 55(2), 732–741. Kokkinakis, D. & Toporowska Gronostaj, M. (2006). Comparing lay and professional language in cardiovascular disorders corpora. In A. Pham T., James Cook University, Ed., WSEAS Transactions on BIOLOGY and BIOMEDICINE, pp. 429–437. Laurent, D., Nègre, S. & Séguéla, P. (2009). L’analyseur syntaxique Cordial dans Passage. In Traitement Automatique des Langues Naturelles (TALN). Le Bot, M.-C., Schuwer, M. & Élisabeth Richard (dir.) (2008). La reformulation : Marqueurs linguistiques – Stratégies énonciatives. Rennes: Rivages linguistiques. Leroy, G., Helmreich, S., Cowie, J., Miller, T. & Zheng, W. (2008). Evaluating online health information: Beyond readability formulas. In Ann Symp Am Med Inform Assoc (AMIA), pp. 394–8. Libben, G., Gibson, M., Yoon, Y. B. & Sandra, D. (2003). Compound fracture: The role of semantic transparency and morphological headedness. Brain and Language, 84(1), 50–64. Lüttmann, H., Zwitserlood, P. & Bölte, J. (2011).
  54. 54. Sharing morphemes without sharing meaning: Production and comprehension of German verbs in the context of morphological relatives. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 65(3), 173–191. McCray, A. (2005). Promoting health literacy. J of Am Med Infor Ass, 12, 152–163. McCray, A., Loane, R., Browne, A. & Bangalore, A. (1999). Terminology issues in user access to web-based medical information. In Ann Symp Am Med Inform Assoc (AMIA), pp. 107–7. Namer, F. (2003). Automatiser l’analyse morpho-sémantique non affixale: le système DériF. Cahiers de Grammaire, 28, 31–48. Patel, V., Branch, T. & Arocha, J. (2002). Errors in interpreting quantities as procedures: The case of pharmaceutical labels. Int Journ Med Inform, 65(3), 193–211. Péry-Woodley, M. & Rebeyrolle, J. (1998). Domain and genre in sublanguage text: definitional microtexts in three corpora. In LREC, pp. 987–992. Poprat, M., Markó, K. & Hahn, U. (2006). A language classifier that automatically divides medical documents for experts and health care consumers.
  55. 55. In Int Congress of the European Federation for Medical Informatics, pp. 503–508, Maastricht. Quinlan, J. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann. Specia, L., Jauhar, S. & Mihalcea, R. (2012). Semeval-2012 task 1: English lexical simplification. In *SEM 2012, pp. 347–355. Tapi Nzali, M., Bringay, S., Lavergne, C., Opitz, T., Azé, J. & Mollevi, C. (2015). Construction d’un vocabulaire patient/médecin dédié au cancer du sein à partir des médias sociaux. In IC 2015. Tran, T., Chekroud, H., Thiery, P. & Julienne, A. (2009). Internet et soins : un tiers invisible dans la relation médecine/patient ? Ethica Clinica, 53, 34–43. Wang, Y. (2006). Automatic recognition of text difficulty from consumers health information. In IEEE, Ed., Computer-Based Medical Systems, pp. 131–136. Williams, M., Parker, R., Baker, D., Parikh, N., Pitkin, K., Coates, W. & Nurss, J. (1995). Inadequate functional health literacy among patients at two public hospitals. JAMA, 274(21), 1677–1682.
  56. 56. Wubben, S., van den Bosch, A. & Krahmer, E. (2012). Sentence simplification by monolingual machine translation. In Annual Meeting of the Association for Computational Linguistics, pp. 1015–1024. Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C. & Lee, L. (2010). For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In NAACL, pp. 365–368. Zeng, Q. & Tse, T. (2006). Exploring and developing consumer health vocabularies. JAMIA, 13, 24–29. Zeng, Q. T., Tse, T., Divita, G., Keselman, A., Crowell, J. & Browne, A. C. (2006). Exploring lexical forms: first-generation consumer health vocabularies. In Ann Symp Am Med Inform Assoc (AMIA), pp. 1155–1155. Zeng-Treitler, Q., Kim, H., Goryachev, S., Keselman, A., Slaughter, L. & Smith, C. (2007). Text characteristics of clinical reports and their implications for the readability of personal health records. In MEDINFO, pp. 1117–1121, Brisbane, Australia. Zheng, W., Milios, E. & Watters, C. (2002). Filtering for medical news items using a machine learning approach. In Ann Symp Am Med Inform Assoc (AMIA), pp. 949–53.
  57. 57. Zhu, Z., Bernhard, D. & Gurevych, I. (2010). A monolingual tree-based translation model for sentence simplification. In COLING 2010, pp. 1353–1361.


