
Natural Language Processing for biomedical text mining - Thierry Hamon

Speaker: Thierry Hamon, Associate Professor in Computer Science at Université Paris 13, member of the LIMSI-CNRS research lab.

Summary: Among the large amounts of unstructured data generated across the world and available nowadays, textual data represent an important source of information. This is particularly true in the biomedical domain, where a constantly increasing demand for access to textual content is observed, whether for Electronic Health Records, online discussion forums, or the scientific literature. Indeed, dealing with biomedical texts requires us to take into account a great variety of texts, languages and users.

For several years now, a lot of NLP research has focused on mining and retrieving information (i.e., medical entities and domain-specific relations), which are relevant for biologists, physicians, terminologists, epidemiologists, and patients. We will propose an overview of the NLP methods used for tackling several such research problems through text mining applications. First, we will present the resources and rule-based approaches we designed for extracting drug-related information from clinical texts, and for acquiring domain-specific semantic relations from digital libraries. Then we will present the cross-lingual approach we are developing for building multilingual terminologies from a patient-centered Ukrainian corpus.


Natural Language Processing for biomedical text mining - Thierry Hamon

  1. 1. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Natural Language Processing for biomedical text mining Thierry Hamon LIMSI, CNRS, Université Paris-Saclay, Orsay, France Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France hamon@limsi.fr 14/06/2017 1/75 Grammarly Meet-up T Hamon
  2. 2. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Context Most of the data are unstructured: about 90% of the data produced in 2011 (1.8 trillion gigabytes) [Oracle, 2011], 85% of the data produced in companies Unstructured data: textual data Important source of information Accessing and reading are costly, time-consuming and sometimes impossible Need for methods for information retrieval and information extraction 2/75 Grammarly Meet-up T Hamon
  3. 3. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Context In the biomedical domain, constant increase in the amount of: Scientific medical literature Scientific papers in digital libraries or portals Medical, pharmacological, epidemiological reports Electronic Health Records in hospitals Discharge summaries Radiological reports Patient-related textual data documents explaining diseases to patients, health behaviors social media (online discussion forums, Twitter messages) 3/75 Grammarly Meet-up T Hamon
  4. 4. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Context Example: Scientific article publications Medline (U.S. National Library of Medicine bibliographic database) - https://www.ncbi.nlm.nih.gov/pubmed/ Evolution of the number of references to articles in life sciences Citations Added to MEDLINE® per Year Currently: More than 27 million references 4/75 Grammarly Meet-up T Hamon
  5. 5. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion What is text mining? Objective: Extraction of useful and non-trivial knowledge from texts Extraction of information useful for a given application from textual data, i.e. written in natural language Collecting and linking this information Feed databases or knowledge bases with information extracted from texts Indirectly: allow data mining on unstructured/textual data 5/75 Grammarly Meet-up T Hamon
  6. 6. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Data mining vs. Text mining Data mining Methods and algorithms to explore structured data issued from databases, data warehouses or knowledge bases Objectives: Highlight rules, identify trends or behaviours which are invisible to humans 6/75 Grammarly Meet-up T Hamon
  7. 7. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Data mining vs. Text mining Data mining Methods and algorithms to explore structured data issued from databases, data warehouses or knowledge bases Objectives: Highlight rules, identify trends or behaviours which are invisible to humans Text mining Methods and algorithms to explore unstructured data, i.e. texts written in Natural Language Objectives: Extraction and categorisation of information available in the texts 6/75 Grammarly Meet-up T Hamon
  8. 8. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion What are text mining applications? EHR: Search and find relevant information, hospital information systems Provide synthetic views of patient-related information EHR / Scientific literature: Information storage in databases for statistics, epidemiologic surveys, hospital information systems, etc. Formalize information or knowledge Social media: Epidemiologic analysis, Therapeutic Patient Education, Potential adverse drug effect identification 7/75 Grammarly Meet-up T Hamon
  9. 9. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion What information to identify? Semantic entities: terms with semantic types Semantic relations between entities Temporal information related to events Numerical information Modifiers for identifying polarity, modality, presence/absence, uncertainty 8/75 Grammarly Meet-up T Hamon
  10. 10. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Needs for analysis of biomedical texts Various resources: Terminologies, Ontologies, Open Linked Data Lexica, Consumer Health Vocabularies Semantic description of entities NLP approaches and methods: Rule-based approaches (more or less sophisticated regular expressions) Machine Learning approaches (supervised, semi-supervised, unsupervised) Evaluation against independent reference data 9/75 Grammarly Meet-up T Hamon
  11. 11. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Difficulties Textual data may be noisy, sparse, multilingual Text processing is time-consuming, may require contextual information Terminological and semantic variation, semantic ambiguity, unknown or new words and terms, etc. → High and unpredictable number of dimensions Complex and embedded semantic relations 10/75 Grammarly Meet-up T Hamon
  12. 12. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Difficulties Ambiguities of the natural language at each level: lexicon: spell[N] vs. spell[V], Apple[company] vs. apple[fruit] гори[V] (a form of to burn) vs. гори[N] (an inflected form of mountain) syntax: the doctor examines the patient with a stethoscope Joe experienced severe shortness of breath and chest pain at home while having sex, which became more unpleasant at the emergency room. 11/75 Grammarly Meet-up T Hamon
  13. 13. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Difficulties Ambiguities of the natural language at each level: semantics: a red pencil, He reached the bank. поділися (a form of to disappear) vs. поділися (a form of to share) pragmatics: The chicken is ready to eat. Margaret invited Susan for a visit, and she gave her a good lunch. a very pleasant patient 12/75 Grammarly Meet-up T Hamon
  14. 14. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Difficulties Variation in semantically similar wording: Bayer is buying Monsanto Bayer clinches Monsanto Bayer and Monsanto [...] will merge Bayer's announced acquisition of Monsanto Monsanto-Bayer merger Metonymy: the latest Apple/Samsung Metaphor: Web giants, "or noir" ("black gold" in French) Spelling errors: Appel (call in French) / Apple Mix of Latin and Ukrainian characters (different UTF-8 codes): i vs. і, o vs. о, p vs. р, y vs. у... 13/75 Grammarly Meet-up T Hamon
  15. 15. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Three experiments in biomedical text mining 1 Recognition of Medication, assertion, temporal information in EHR [Hamon and Grabar10, Périnet et al.11, Grouin et al.13, Zweigenbaum et al.13, Hamon and Grabar14] Work with Natalia Grabar (CNRS STL - Lille 3), Amandine Périnet (LIM&BIO - Paris 13), Cyril Grouin, Sophie Rosset, Xavier Tannier, Pierre Zweigenbaum (LIMSI, CNRS) 2 Mining literature for identifying risk factors [Hamon et al.10] Work with Martin Graña, Víctor Raggio and Hugo Naya (Institut Pasteur de Montevideo), and Natalia Grabar (CNRS STL - Lille 3) 3 Cross-Lingual Transfer Methods for Terminology Acquisition [Hamon and Grabar16] Work with Natalia Grabar (CNRS STL - Lille 3) 14/75 Grammarly Meet-up T Hamon
  16. 16. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Mining Patients' Electronic Health Records [Hamon and Grabar10, Périnet et al.11, Grouin et al.13, Zweigenbaum et al.13, Hamon and Grabar14] Description of the hospitalization A lot of (personal) information about patients Problems Therapies (treatments, drugs, etc.) Tests and analyses (lab data, etc.) Assertions regarding facts (certainty, hypothesis, etc.) Temporal information (useful for the clinical timeline) The best way to record information (databases are difficult to maintain) BUT the texts are written by practitioners: in a hurry, with mistakes, with little or incorrect syntactic structure, etc. 15/75 Grammarly Meet-up T Hamon
  17. 17. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Objectives Identification of Medication names given to patients Related information (dosage, duration, frequency, mode of administration, reason for prescription) Assertion: certainty and uncertainty of information in medical texts, focusing on the relation {patient / medical problem} Temporal expressions: date, time and duration of medical events Participation in several i2b2 challenges 16/75 Grammarly Meet-up T Hamon
  18. 18. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Drug-related information [Diagram: a graph centred on a medication (solumedrol / methylprednisolone) linking it to its composition, its INN, its adverse effects (allergic shock, Quincke oedema, suffocation by larynx oedema, brain oedema), the conditions it is prescribed for (acne, osteoporosis, swelling of face, arterial hypertension, ulcer of stomach, depression), drug-drug and food-drug interactions (DDI, FDI, e.g. digitaline, insulin), and prescription features (dosage, mode, frequency, reason, duration)] 17/75 Grammarly Meet-up T Hamon
  19. 19. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Assertion task [Diagram: degrees of certainty of the assertion, illustrated with "abdominal pain": The patient suffers from abdominal pain (positive certainty); The patient denies suffering from abdominal pain (negative certainty); With shrimps, the patient suffers from abdominal pain (condition); The patient is to call the hospital if he suffers from abdominal pain (hypothesis); It was thought that the patient might suffer from abdominal pain (possibility)] 18/75 Grammarly Meet-up T Hamon
  20. 20. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Example Medication name, associated information, assertions and time expressions The patient is currently off diuretics at this time. Daily weights should be checked and if her weight increases by more than 3 pounds Dr. Bockoven should be notified. The patient was also started on calcitriol given elevation of parathyroid hormone. Cardiovascular: Rate and rhythm: The patient has a history of atrial fibrillation with a slow ventricular response. Two weeks ago, the patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control , however , this dose was decreased to 12.5 mg p.o. twice a day, given some bradycardia on her telemetry. The patient was also started on Flecainide 75 mg p.o. q.12 h. She will continue on these two medications upon discharge. 19/75 Grammarly Meet-up T Hamon
  21. 21. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Example Medication name, associated information, assertions and time expressions RRR , lots of BS's , neuro nonfocal , ext with 1+ edema. On atenolol , zestril , norvasc , premarin , detrol , lasix 60 qd , nebs prn at home. Labs sig for Cr 0.7 , CK 48 , TnI .05 , QBC 9.5 , Hct 41.3. From CV point of view , thought to be CHF exac. ROMI'd without events on monitor and diuresed 2L/day. IV Lasix 80 bid to start transitioned to 60 po bid. BNP>assay. 6/17 dobut MIBI with mod sized ant septal wall defect c/w diagonal lesion , 3/22 Echo with EF 55-60% , mild LAE/RAE , no WMA , mod large RV. No further CV studies. Cont previously meds on d/c. From FEN point of view , 2 L fluid restriction , 2 g Na restriction. Nutrition consult , but pt very resistant to diet changes. From GI point of view , GERD; nexium started. From pulm point of view , CXR c/w sl fluid overload , no focal findings , no pulm edema. Given NC O2 and BiPAP at night. 20/75 Grammarly Meet-up T Hamon
  22. 22. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Material Documents Discharge summaries: 1,249 documents (provided by the I2B2 challenges) 2009: 649 docs in the training set, 553 docs in the test set, 17 manually annotated documents (for illustrating the annotation guidelines) 2010: 349 annotated documents + 827 raw documents in the training set, 477 in the test set Assertions: 11,968 in the training set, 18,550 in the test set 2012: 190 docs in the training set, 120 docs in the test set 21/75 Grammarly Meet-up T Hamon
  23. 23. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Material Terminologies and lexica Medication names: RxNorm (243,869 entries) and therapeutic classes and groups of medication from the FDA website Ambiguous medications (red blood cells, magnesium, iron): specific status during the annotation process Medical problems: 45,898 terms (Diagnosis and Morphology axes of the Snomed International), 476 terms from the training set documents Medication-related information Regular expressions for frequency, dosage, duration and mode of administration 52 identification rules for reasons: characterization of Snomed Int terms and/or extracted terms as reasons 22/75 Grammarly Meet-up T Hamon
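To make the rule-based side concrete, here is a minimal Python sketch of how dosage, mode-of-administration and frequency expressions can be captured with regular expressions; the patterns, slot names and example sentence are simplified illustrations, not the actual rule set used in the challenge system.

```python
import re

# Illustrative patterns only; the real system relies on a much larger, curated rule set.
PATTERNS = {
    "dosage": re.compile(r"\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|units?)\b", re.I),
    "mode": re.compile(r"p\.o\.|i\.v\.|\bsubcutaneous(?:ly)?\b|\btopical(?:ly)?\b", re.I),
    "frequency": re.compile(r"q\.\d+\s*h\.|\bb\.i\.d\.|\bt\.i\.d\.|\btwice a day\b|\bonce daily\b|\bprn\b", re.I),
}

def extract_medication_info(sentence):
    """Return the medication-related expressions found for each slot."""
    return {slot: pattern.findall(sentence) for slot, pattern in PATTERNS.items()}

if __name__ == "__main__":
    text = "The patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control."
    print(extract_medication_info(text))
    # {'dosage': ['12.5 mg'], 'mode': ['p.o.'], 'frequency': ['q.6 h.']}
```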
  24. 24. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Material Terminologies and lexica Assertions: Negation: 284 markers from the NegEx resource [Chapman et al.01] Lexical clues: on exertion (condition) Morphological clues: afebrile (negative certainty) Contextual information (342 markers) Clues in the sentence, Section headings ... could represent a multifocal pneumonic process (possible) ALLERGIES, SOCIAL HISTORY, lists Lexico-syntactic patterns (137 patterns) be to (address | request | notify) DT (office | clinic | hospital) if PB (Hypothesis) TE to (evaluate | check | eval | consult) (from | if | with | against) PB (Possibility) 23/75 Grammarly Meet-up T Hamon
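The following toy Python sketch illustrates the general idea of marker-based assertion classification in the spirit of NegEx [Chapman et al.01]: markers found in the left context of a medical problem switch its assertion category. The marker lists here are tiny placeholders standing in for the 284 negation markers, 342 contextual clues and 137 lexico-syntactic patterns listed above.

```python
# Simplified illustration of marker-based assertion classification; the marker
# lists below are tiny stand-ins for the resources described on the slide.
NEGATION_MARKERS = ["denies", "no evidence of", "without", "negative for"]
HYPOTHESIS_MARKERS = ["is to call the hospital if", "should be notified if", "return if"]
POSSIBILITY_MARKERS = ["could represent", "might", "possibly", "it was thought that"]

def classify_assertion(sentence, problem):
    """Assign an assertion category to a medical problem mentioned in the sentence,
    based on markers found in its left context."""
    left_context = sentence.lower().split(problem.lower())[0]
    if any(marker in left_context for marker in NEGATION_MARKERS):
        return "negative certainty"
    if any(marker in left_context for marker in HYPOTHESIS_MARKERS):
        return "hypothesis"
    if any(marker in left_context for marker in POSSIBILITY_MARKERS):
        return "possibility"
    return "positive certainty"

if __name__ == "__main__":
    print(classify_assertion("The patient denies suffering from abdominal pain.", "abdominal pain"))
    # negative certainty
    print(classify_assertion("It was thought that the patient might suffer from abdominal pain.", "abdominal pain"))
    # possibility
```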
  25. 25. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Document processing Annotation of the documents Use of terminological and linguistic resources and selection and disambiguation rules CRF-based models [Grouin et al., Minard et al.11] Tuning of the HeidelTime system [Strotgen and Gertz12, Hamon and Grabar14] Design of post-processing modules for Disambiguation and negative contexts of medication names Computation of dependency relations between patient, medication names and related information, or assertion Improving the CRF-based system with extracted terms [Aubin and Hamon06] 24/75 Grammarly Meet-up T Hamon
  26. 26. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Enriching documents with linguistic information [Diagram: annotation pipeline - tokenisation, word and sentence segmentation, part-of-speech tagging and lemmatisation, named entity tagging (dictionary), tagging and extraction of terms (terminology, ontology), semantic tagging (specialised lexicon), turning an XML document with structural annotations into an XML document with linguistic and structural annotations] Symbolic approach: use of NLP methods Terminological resources and disambiguation rules Concurrent annotations and annotation selection Design of post-processing modules for Annotation disambiguation Establishment of dependency relations between patient, medication names and related information, or assertion Annotation based on the Ogmios NLP platform (developed during the EU project Alvis) 25/75 Grammarly Meet-up T Hamon
  27. 27. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Enriching document with linguistic information Identification of the sentences The patient has a history of atrial fibrillation with a slow ventricular response . Two weeks ago , the patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control ... 26/75 Grammarly Meet-up T Hamon
  28. 28. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Enriching document with linguistic information Identification of the sentences, words The patient has a history of atrial fibrillation with a slow ventricular response . Two weeks ago , the patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control ... 26/75 Grammarly Meet-up T Hamon
  29. 29. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Enriching document with linguistic information Identification of the sentences, words, lemma and part-of-speech The DT patient NN has VBZ a DT history NN of IN atrial JJ fibrillation NN with IN a DT slow JJ ventricular JJ response NN . Two CD weeks NNS ago RB , the DT patient NN was VBD started VBN on IN metoprolol FW 12.5 CD mg NN p.o. SYM q.6 FW h. NP for IN rate NN control NN ... 26/75 Grammarly Meet-up T Hamon
  30. 30. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Enriching document with linguistic information Identification of the sentences, words, lemma and part-of-speech, named entities [TIMEX3] [DOSAGE] [MODADM] [FREQ] The DT patient NN has VBZ a DT history NN of IN atrial JJ fibrillation NN with IN a DT slow JJ ventricular JJ response NN . Two CD weeks NNS ago RB , the DT patient NN was VBD started VBN on IN metoprolol FW 12.5 CD mg NN p.o. SYM q.6 FW h. NP for IN rate NN control NN ... 26/75 Grammarly Meet-up T Hamon
  31. 31. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Enriching document with linguistic information Identification of the sentences, words, lemma and part-of-speech, named entities and terms with semantic types [TIMEX3] [DOSAGE] [MODADM] [FREQ] [DISORDER] [DRUG] [DISORDER] [DISORDER] The DT patient NN has VBZ a DT history NN of IN atrial JJ fibrillation NN with IN a DT slow JJ ventricular JJ response NN . Two CD weeks NNS ago RB , the DT patient NN was VBD started VBN on IN metoprolol FW 12.5 CD mg NN p.o. SYM q.6 FW h. NP for IN rate NN control NN ... 26/75 Grammarly Meet-up T Hamon
  32. 32. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Concurrent annotation of documents Preparing material for document annotation Named Entity Recognition (frequency, duration, dosage, mode of administration) + internal disambiguation (avoid nested annotations of different types and merge annotations of the same type) Term and semantic tagging (medication and reasons, negation and reason markers, assertion) based on linguistic information (word and sentence segmentation, lemmatization) + internal disambiguation (nested terms, parenthesized medication names, etc.) 27/75 Grammarly Meet-up T Hamon
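Internal disambiguation of concurrent annotations can be pictured with a small sketch like the one below, which resolves overlapping spans by keeping the longest one. This is only one plausible strategy under my own assumptions; the actual system also merges same-type annotations and applies type-specific rules.

```python
def resolve_overlaps(annotations):
    """Keep the longest annotation when spans overlap.
    Each annotation is a tuple (start, end, type, text)."""
    selected = []
    for ann in sorted(annotations, key=lambda a: a[1] - a[0], reverse=True):
        # Keep this annotation only if it does not overlap any already kept span.
        if all(ann[1] <= kept[0] or ann[0] >= kept[1] for kept in selected):
            selected.append(ann)
    return sorted(selected, key=lambda a: a[0])

if __name__ == "__main__":
    anns = [
        (31, 50, "DISORDER", "atrial fibrillation"),
        (38, 50, "DISORDER", "fibrillation"),
        (31, 37, "FREQ", "atrial"),  # hypothetical spurious annotation
    ]
    print(resolve_overlaps(anns))
    # [(31, 50, 'DISORDER', 'atrial fibrillation')]
```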
  33. 33. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Time expression identification [Hamon and Grabar14] Tuning of the HeidelTime system [Strotgen and Gertz12] for English and French EHR Enrichment and encoding of linguistic temporal expressions specific to the medical and clinical domain: post-operative day #, b.i.d. meaning twice a day, day of life, etc. Admission date as the reference or starting point for computing relative dates and their normalised value: if the admission date is 14 June 2017, the normalised value of 2 days later is 16 June 2017. Additional normalizations of the temporal expressions: normalization of durations into approximate numerical values to avoid undefined values external computation for some durations and frequencies due to limitations in HeidelTime's internal arithmetic processor 28/75 Grammarly Meet-up T Hamon
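The normalisation of relative dates against the admission date amounts to simple date arithmetic, as in this minimal sketch; HeidelTime performs this internally with its own normalisation rules, and the helper name is mine.

```python
from datetime import date, timedelta

def normalize_relative(admission_date, offset_days):
    """Normalize a relative temporal expression such as '<n> days later'
    against the admission date used as reference point."""
    return admission_date + timedelta(days=offset_days)

if __name__ == "__main__":
    admission = date(2017, 6, 14)
    print(normalize_relative(admission, 2).isoformat())  # 2017-06-16
```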
  34. 34. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Annotation selection Processing of ambiguous medication names: laboratory data or medication; if in a list section, the status is changed to medication HOME MEDS: methadone 20 bid, imdur 120 bid, hydral taking 25 bid, lasix 20 bid, coumadin, colace, iron, nexium 40 bid Rejection of medication names if in allergy sections ALLERGY: prednisone, penicillins, tamsulosin, simvastatin Removal of drug names in negative contexts Guessing new drug names with semantic patterns m do mo? f [Hamon et al.13] 1 Noun phrases recognized by the term extractor YATEA 2 Stopwords rejected 3 Filtering with typical suffixes of the medication names Diovan 160mg PO BID, HCTZ 25mg PO QD, Imdur ER 60mg PO QD, NTG .4mg PRN CP, Norvasc 10mg PO QD, Pavachol 80mg PO QD. 29/75 Grammarly Meet-up T Hamon
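A sketch of the last filtering step (step 3), assuming a hypothetical list of typical medication-name suffixes; the real filter from [Hamon et al.13] uses its own curated suffix list and the noun phrases proposed by YATEA.

```python
# Illustrative suffixes commonly seen in medication names; the stopword list is
# equally a small stand-in for the one used in the actual system.
MEDICATION_SUFFIXES = ("olol", "pril", "statin", "azole", "cillin", "vir", "mab", "zide")
STOPWORDS = {"patient", "history", "control", "response"}

def guess_medication_names(candidate_terms):
    """Keep noun-phrase candidates whose head word ends with a typical
    medication suffix and is not a stopword."""
    guesses = []
    for term in candidate_terms:
        head = term.split()[-1].lower()
        if head not in STOPWORDS and head.endswith(MEDICATION_SUFFIXES):
            guesses.append(term)
    return guesses

if __name__ == "__main__":
    candidates = ["metoprolol", "rate control", "Pavachol", "lisinopril", "slow ventricular response"]
    print(guess_medication_names(candidates))  # ['metoprolol', 'lisinopril']
```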
  35. 35. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Results Medication task Focus on various parameters for reason identification and guessing medication names
            RUN2     RUN1               RUN3
  System    0.7801   0.7681 (-0.0120)   0.7719 (-0.0082)
  m         0.8142   0.8093 (-0.0049)   0.808  (-0.0062)
  do        0.8234   0.8172 (-0.0062)   0.821  (-0.0024)
  f         0.837    0.8304 (-0.0066)   0.8345 (-0.0025)
  mo        0.8655   0.8577 (-0.0078)   0.8624 (-0.0031)
  du        0.3575   0.3516 (-0.0059)   0.3505 (-0.0070)
  r         0.2867   0.2759 (-0.0108)   0.2666 (-0.0201)
  RUN1: All reasons RUN2: All reasons without semantic tagging and reason markers RUN3: All reasons without semantic tagging and use of reason markers Guessing medication names 30/75 Grammarly Meet-up T Hamon
  36. 36. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Results Medication task
            exact                        inexact
            F        P        R          F        P        R
  System    0.7801   0.7997   0.7614     0.7792   0.8111   0.7497
  m         0.8142   0.8448   0.7858     0.8304   0.8666   0.7971
  do        0.8234   0.8728   0.7793     0.8503   0.8799   0.8226
  f         0.837    0.8306   0.8435     0.8411   0.8436   0.8386
  mo        0.8655   0.8543   0.877      0.863    0.844    0.8828
  du        0.3575   0.3483   0.3673     0.3607   0.3669   0.3546
  r         0.2867   0.3047   0.2708     0.3386   0.4386   0.2757
  Reason: difficult to identify the exact noun phrases (-13% between inexact and exact precision) 31/75 Grammarly Meet-up T Hamon
  37. 37. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Results Assertion task and time expression identification List of markers + section headings
                                  Training              Test
  Categories                      P     R     F         P     R     F
  Associated to somebody else     0.96  0.80  0.88      0.84  0.74  0.79
  Hypothesis                      0.71  0.31  0.43      0.63  0.24  0.35
  Condition                       0.08  0.40  0.14      0.08  0.33  0.12
  Possibility                     0.46  0.57  0.51      0.51  0.47  0.49
  Absent                          0.92  0.75  0.82      0.87  0.75  0.81
  Present                         0.86  0.90  0.88      0.84  0.87  0.86
  Assertions                      0.82  0.82  0.82      0.80  0.80  0.80
  Temporal expressions: Precision 0.8611, Recall 0.8170, F-measure 0.8385 32/75 Grammarly Meet-up T Hamon
  38. 38. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Conclusion F-measure of the system: 0.800 (avg) Analysis of the resource contribution: Importance of the markers Need to include syntactic structures Difficulty to identify certainty degrees few examples for condition and hypothesis 33/75 Grammarly Meet-up T Hamon
  39. 39. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Further improvements Medication tasks: Duration extraction: identification of specific prepositional phrases based on parsing Medical problem identification: development of a specific reasoning module Assertion task: Enrich resources with synonyms (Wordnet) Improving the patterns: using syntactic dependencies integrating semantic classes (verbs of evidence, verbs to get in touch with somebody, etc.) 34/75 Grammarly Meet-up T Hamon
  40. 40. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Mining literature to identify relations between risk factors and their pathologies [Hamon et al.10] Objective: Massive exploitation of the Medline bibliographical database for extracting risk factors and their associations with health conditions Risk factors: increase people's chances of developing a given disease Information on risk factors is widespread on the web: websites, bibliographical databases, ... Previous works: Genomic scientific literature (BioCreative, TREC Genomics), clinical records (I2B2 NLP Challenge 2014), processing of narratives [Blake04] Data mining (KDD challenge 2004) [Ahmad and Bath05, Cerrito04, Kolyshkina and van rooyen06] 35/75 Grammarly Meet-up T Hamon
  41. 41. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Material Bibliographical database Medline (titles, abstracts) Selection of potential citations/PMIDs, i.e. containing the sequences "risk factors" or "factor of risk" 187,544 citations selected: over 42 million word occurrences MeSH (thesaurus for information storage and retrieval) Disease-related MeSH term recognition in citations 36/75 Grammarly Meet-up T Hamon
  42. 42. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Document processing 1 Annotation of Medline citations with linguistic information Ogmios NLP platform [Hamon et al.07] Segmentation, POS-tagging & lemmatization -- Genia Tagger [Tsuruoka et al.05] Term recognition but also term extraction -- YATEA [Aubin and Hamon06] 2 Risk factors identification 37/75 Grammarly Meet-up T Hamon
  43. 43. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Document processing 1 Annotation of Medline citations with linguistic information Ogmios NLP platform [Hamon et al.07] Segmentation, POS-tagging & lemmatization -- Genia Tagger [Tsuruoka et al.05] Term recognition but also term extraction -- YATEA [Aubin and Hamon06] 2 Risk factors identification 37/75 Grammarly Meet-up T Hamon
  44. 44. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term recognition vs. Term extraction Term recognition: Tagging of texts with terms issued from terminologies Use of more or less complex methods (string matching, terminological variant computing, semantic distances, ML methods...) Term extraction: Discovery of terms in texts Identification of noun phrases which are potential terms (term candidates) Computing the strength of the term components (unithood) and the strength of the relation to the domain (termhood) [Kageura and Umino96] 38/75 Grammarly Meet-up T Hamon
  45. 45. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA Yet Another Term ExtrActor (Aubin & Hamon, 2006) Term extraction from French and English texts Shallow parsing of texts Parsing focusing on the parts of the sentence which may contain terms (usually the noun phrases) With recursively applied minimal parsing patterns endogenous learning Term candidate decomposition into Head and Modifier components (component syntactic role in the noun phrase) Each component of a term candidate is also considered as a term candidate Unparseable noun phrases are rejected 39/75 Grammarly Meet-up T Hamon
  46. 46. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion YATEA Yet Another Term ExtrActor (Aubin & Hamon, 2006) Several statistical measures are associated with each term candidate (number of occurrences, C-Value1, C-Value*, etc.) [Hamon et al.14] CPAN module: http://search.cpan.org/~thhamon/Lingua-YaTeA/ Developed during the European project ALVIS Description of the shallow parsing with configuration files Possibility of tuning for a domain (BioYaTeA) [Golik et al.13] For other languages: ongoing work for Ukrainian and Arabic 40/75 Grammarly Meet-up T Hamon
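For reference, the standard C-Value termhood measure can be sketched as below; the C-Value1 and C-Value* variants used by YATEA adjust this formula (for instance for single-word terms), so the scores shown in the ranking table later in the talk are not reproduced exactly by this sketch.

```python
from math import log2

def c_value(candidates):
    """Standard C-Value: frequency adjusted for nesting, weighted by log2 of the
    term length. `candidates` maps a term (tuple of words) to its corpus frequency."""
    scores = {}
    for term, freq in candidates.items():
        # Frequencies of longer candidates in which this term appears as a
        # contiguous subsequence (nesting).
        nesting = [f for other, f in candidates.items()
                   if len(other) > len(term)
                   and any(other[i:i + len(term)] == term
                           for i in range(len(other) - len(term) + 1))]
        adjusted = freq - sum(nesting) / len(nesting) if nesting else freq
        scores[term] = log2(len(term)) * adjusted if len(term) > 1 else 0.0
    return scores

if __name__ == "__main__":
    freqs = {
        ("neuroectodermal", "tumor"): 4,
        ("primitive", "neuroectodermal", "tumor"): 3,
        ("tumor",): 9,
    }
    for term, score in c_value(freqs).items():
        print(" ".join(term), round(score, 2))
    # neuroectodermal tumor 1.0 / primitive neuroectodermal tumor 4.75 / tumor 0.0
```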
  47. 47. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA Textes lemmatisation + POS tagging 22CD yoJJ maleNN ,, hNN /SYMoNN primitiveJJ neuroectodermalJJ tumorNN withIN metsNNS toTO brainNN andCC spineNN ,, transferredVBN fromIN Hospital1NNP ,, initiallyRB inIN Dept1NNP andCC thenRB transferredVBN toTO theDT floorNN .. HePRP wasVBD initiallyRB diagnosedVBN withIN aDT thoracicJJ gangliogliomNN //resectedVBN inIN 2012CD .. HePRP hadVBD backJJ painNN inin 2CD /SYM04CD ,, seenVBN atIN Dept2NNP ,, andCC wasbe foundVBN toTO haveVB metsNNS toTO brainNN andCC spineNN .. 41/75 Grammarly Meet-up T Hamon
  48. 48. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA Textes lemmatisation + POS tagging Term extraction rule-based approaches Identification of chunks thanks to morpho-syntactic information (frontiers - verbs, adverbs, etc.) 22CD yoJJ maleNN ,, hNN /SYMoNN primitiveJJ neuroectodermalJJ tumorNN withIN metsNNS toTO brainNN andCC spineNN ,, transferredVBN fromIN Hospital1NNP ,, initiallyRB inIN Dept1NNP andCC thenRB transferredVBN toTO theDT floorNN .. HePRP wasVBD initiallyRB diagnosedVBN withIN aDT thoracicJJ gangliogliomNN //resectedVBN inIN 2012CD .. HePRP hadVBD backJJ painNN inin 2CD /SYM04CD ,, seenVBN atIN Dept2NNP ,, andCC wasbe foundVBN toTO haveVB metsNNS toTO brainNN andCC spineNN .. 41/75 Grammarly Meet-up T Hamon
  49. 49. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA Parsing of the noun phrases to detect term candidates 1. Identification of term candidates described by parsing patterns such as JJ NN → <M> <H> (<H>: head of the noun phrase, <M>: modifier of the head) neuroectodermal tumor → neuroectodermal<M> tumor<H> shortness of breath → shortness<H> of breath<M> 42/75 Grammarly Meet-up T Hamon
  50. 50. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA 2. Use of the previously parsed term candidates (island of reliability) to parse remaining noun phrases Example: primitive neuroectodermal tumor 43/75 Grammarly Meet-up T Hamon
  51. 51. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA 2. Use of the previously parsed term candidates (island of reliability) to parse remaining noun phrases Example: primitive neuroectodermal tumor Use of the already parsed term neuroectodermal tumor tumorneuroectodermal M H 43/75 Grammarly Meet-up T Hamon
  52. 52. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA 2. Use of the previously parsed term candidates (island of reliability) to parse remaining noun phrases Example: primitive neuroectodermal tumor Use of the already parsed term neuroectodermal tumor primitive tumorneuroectodermal M H 43/75 Grammarly Meet-up T Hamon
  53. 53. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA 2. Use of the previously parsed term candidates (island of reliability) to parse remaining noun phrases Example: primitive neuroectodermal tumor Use of the already parsed term neuroectodermal tumor primitive tumorneuroectodermal M H Temporary simplification (folding): primitiveJJ tumorNN 43/75 Grammarly Meet-up T Hamon
  54. 54. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA 2. Use of the previously parsed term candidates (island of reliability) to parse remaining noun phrases Example: primitive neuroectodermal tumor Use of the already parsed term neuroectodermal tumor primitive tumorneuroectodermal M H Temporary simplification (folding): primitiveJJ tumorNN Use of the parsing pattern: NNJJ M H → tumorprimitive M H 43/75 Grammarly Meet-up T Hamon
  55. 55. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA 2. Use of the previously parsed term candidates (island of reliability) to parse remaining noun phrases Example: primitive neuroectodermal tumor Use of the already parsed term neuroectodermal tumor primitive tumorneuroectodermal M H Temporary simplification (folding): primitiveJJ tumorNN Use of the parsing pattern: NNJJ M H → tumorprimitive M H Unfolding : tumorneuroectodermal M H primitive M H 43/75 Grammarly Meet-up T Hamon
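A toy Python rendering of the two parsing patterns and the folding step illustrated above: an already parsed sub-term is collapsed into a single head token so that the longer noun phrase still matches a simple pattern. YATEA's real grammar, configuration files and unfolding machinery are considerably richer; the function and data structures here are my own simplifications.

```python
# Toy illustration of YATEA-style head/modifier decomposition with folding.
def parse_np(tokens, known_terms):
    """tokens: list of (word, POS). known_terms: already parsed sub-terms,
    mapping a word tuple to its (head, modifier) parse.
    Returns a nested {'head': ..., 'modifier': ...} structure or None."""
    # Folding: replace a known sub-term by a single token carrying its parse.
    for term, parse in known_terms.items():
        n = len(term)
        for i in range(len(tokens) - n + 1):
            if tuple(w for w, _ in tokens[i:i + n]) == term:
                tokens = tokens[:i] + [(parse, "NN")] + tokens[i + n:]
                break
    words = [w for w, _ in tokens]
    tags = [t for _, t in tokens]
    if tags == ["JJ", "NN"]:            # pattern: JJ NN -> <M> <H>
        return {"head": words[1], "modifier": words[0]}
    if tags == ["NN", "IN", "NN"]:      # pattern: NN of NN -> <H> of <M>
        return {"head": words[0], "modifier": words[2]}
    return None

if __name__ == "__main__":
    known = {("neuroectodermal", "tumor"): {"head": "tumor", "modifier": "neuroectodermal"}}
    np = [("primitive", "JJ"), ("neuroectodermal", "JJ"), ("tumor", "NN")]
    print(parse_np(np, known))
    # {'head': {'head': 'tumor', 'modifier': 'neuroectodermal'}, 'modifier': 'primitive'}
```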
  56. 56. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA Textes lemmatisation + POS tagging 22CD yoJJ maleNN ,, hNN /SYMoNN primitiveJJ neuroectodermalJJ tumorNN withIN metsNNS toTO brainNN andCC spineNN ,, transferredVBN fromIN Hospital1NNP ,, initiallyRB inIN Dept1NNP andCC thenRB transferredVBN toTO theDT floorNN .. HePRP wasVBD initiallyRB diagnosedVBN withIN aDT thoracicJJ gangliogliomNN //resectedVBN inIN 2012CD .. HePRP hadVBD backJJ painNN inin 2CD /SYM04CD ,, seenVBN atIN Dept2NNP ,, andCC wasbe foundVBN toTO haveVB metsNNS toTO brainNN andCC spineNN .. 44/75 Grammarly Meet-up T Hamon
  57. 57. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA Textes lemmatisation + POS tagging Term extraction rule-based approaches Candidate terms yo male thoracic gangliogliom h back pain o mets primitive neuroectodermal tumor brain mets spine brain floor spine ... 44/75 Grammarly Meet-up T Hamon
  58. 58. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Term extraction with YATEA Texts: lemmatisation + POS tagging, term extraction rule-based approaches, candidate terms Term ranking: frequency (f), term length (l), C-Value (Cv1) Ranked term candidates:
  yo male                           f=1  l=1  Cv1=1.58
  spine                             f=2  l=1  Cv1=2
  h                                 f=1  l=1  Cv1=1
  floor                             f=1  l=1  Cv1=1
  o                                 f=1  l=1  Cv1=0
  thoracic gangliogliom             f=1  l=2  Cv1=1.58
  mets                              f=2  l=1  Cv1=2
  back pain                         f=1  l=2  Cv1=1.58
  brain                             f=2  l=1  Cv1=2
  primitive neuroectodermal tumor   f=1  l=3  Cv1=2.32
  ... 44/75 Grammarly Meet-up T Hamon
  59. 59. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Document processing 1 Annotation of Medline citations with linguistic information Ogmios NLP platform [Hamon et al.07] Segmentation, POS-tagging & lemmatization -- Genia Tagger [Tsuruoka et al.05] Term recognition and extraction -- YATEA [Aubin and Hamon06] 2 Risk factors identification 45/75 Grammarly Meet-up T Hamon
  60. 60. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Document processing 1 Annotation of Medline citations with linguistic information Ogmios NLP platform [Hamon et al.07] Segmentation, POS-tagging & lemmatization -- Genia Tagger [Tsuruoka et al.05] Term recognition and extraction -- YATEA [Aubin and Hamon06] 2 Risk factors identification 45/75 Grammarly Meet-up T Hamon
  61. 61. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Risk factor identification Semantico-syntactic patterns 5 patterns for risk factors and pathologies 12 patterns for handling enumerations 3 patterns for pathologies <NP-RF> as a risk factor for <NP-P> where as a risk factor for: trigger sequence <NP-RF>: noun phrases corresponding to risk factors <NP-P>: pathologies ? and *: optional and recurrent elements MeSH descriptors of citations Descriptors belonging to C heading of diseases 46/75 Grammarly Meet-up T Hamon
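One of the semantico-syntactic patterns can be approximated by a regular expression over sentences in which noun phrases have already been identified by the term extractor; the [NP ...] bracketing used below is a simplification of the actual annotation format, chosen for illustration.

```python
import re

# One of the risk-factor patterns, in a simplified form: noun phrases are assumed
# to have already been bracketed as [NP ...] upstream.
PATTERN = re.compile(
    r"\[NP (?P<rf>[^\]]+)\] (?:is|was) a risk factor for \[NP (?P<p>[^\]]+)\]"
)

def extract_risk_factor_pairs(annotated_sentence):
    """Return (risk_factor, pathology) pairs matched by the pattern."""
    return [(m.group("rf"), m.group("p")) for m in PATTERN.finditer(annotated_sentence)]

if __name__ == "__main__":
    sentence = ("[NP a high intake of calcium and phosphorus] is a risk factor for "
                "[NP the development of metabolic acidosis]")
    print(extract_risk_factor_pairs(sentence))
    # [('a high intake of calcium and phosphorus', 'the development of metabolic acidosis')]
```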
  62. 62. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Risk factor identification Examples Pattern: <NP-RF-list> is a risk factor for <NP-P> ...a high intake of calcium and phosphorus is a risk factor for the development of metabolic acidosis . (PMID 1435825) Pattern: risk factors for <NP-P>,? include <NP-RF-list> ...had more than one of the common risk factors for cerebrovascular accidents , including hypertension , advanced age , hyperfibrinogenemia , diabetes mellitus , and past history of cerebrovascular accident. (PMID 1560589) 47/75 Grammarly Meet-up T Hamon
  63. 63. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Risk factor identification Examples Pattern: <NP-RF-list> is a risk factor for <NP-P> ...a high intake of calcium and phosphorus is a risk factor for the development of metabolic acidosis . (PMID 1435825) Pattern: risk factors for <NP-P>,? include <NP-RF-list> ...had more than one of the common risk factors for cerebrovascular accidents , including hypertension , advanced age , hyperfibrinogenemia , diabetes mellitus , and past history of cerebrovascular accident. (PMID 1560589) 47/75 Grammarly Meet-up T Hamon
  64. 64. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Results Application of three kinds of patterns (1) {risk factor, pathology}, (2) risk factors, (3) pathologies Definition of relations: direct relations with patterns {risk factor, pathology} combination of information provided by (2) and (3) 10,445 PMIDs provide information 313 pairs {risk factor, pathology} 15,398 pairs by combination of (2) and (3) 5,873 risk factors (2) not associated with any pathology MeSH indexing: 5,106 pathologies and health conditions 21,584 triplets {risk factor, pathology_text?, pathology_MeSH?} 17,620 (14,895) pairs only provided by the patterns 5,717 (4,412) pairs contain MeSH descriptors as pathology 48/75 Grammarly Meet-up T Hamon
  65. 65. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Evaluation Evaluation of precision: ratio of correct extractions among the overall results Manual evaluation: no dedicated and comprehensive gold standard is available Comparison with three relationships provided by Snomed CT (nomenclature for organizing and exchanging clinical data) has causative agent: direct cause of the disorder or finding (92,807 relations) bacterial endocarditis has causative agent bacterium due to: relates a clinical finding directly to its cause (25,309 relations) acute pancreatitis due to infection associated with: clinically relevant association between terms without either asserting or excluding a causal or sequential relationship between the two (36,134 relations) fentanyl allergy has causative agent fentanyl 49/75 Grammarly Meet-up T Hamon
  66. 66. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Evaluation 1 Quality and exhaustiveness of risk factors for a given pathology Evaluation by a medical doctor of 1,102 risk factors for coronary heart disease: 88.38% precision hypertension: {smoking; cigarette smoking; smoking history; importance of total life consumption of cigarettes} 2 Comparison between text mining results for 20 pathologies (3,100 extractions, about 25%) and Snomed CT causal and associative relations (154,130 pairs) 19 extractions (0.6%) considered as already in Snomed CT Snomed CT is not dedicated to risk factors, but they may occur acquired immunodeficiency syndrome: {bisexuality, blood transfusion, intravenous drug abuse} 50/75 Grammarly Meet-up T Hamon
  67. 67. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Conclusion Extraction of information related to risk factors Relation with associated pathologies Text mining approach based on semantico-syntactic patterns Evaluation by a medical doctor and a computer scientist 88.38% of risk factors related to coronary heart disease are correct about 70% of extracted pathologies are equivalent to the MeSH indexing Snomed CT is not dedicated to the recording of risk factors, although they may occur ⇒ Creation of a dedicated resource for risk factors would be suitable 51/75 Grammarly Meet-up T Hamon
  68. 68. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Future work Use of other patterns, i.e. predictor, precursor ... Machine learning methods Knowledge representation: homogeneous groups of risk factors environmental, social, clinical, behavioral ... Characterization of this information modal, negative contexts Geographical, demographic variation 52/75 Grammarly Meet-up T Hamon
  69. 69. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Adaptation of Cross-Lingual Transfer Methods for the Building of Medical Terminology in Ukrainian [Hamon and Grabar16] Nowadays, methods and automatic tools exist for several European languages and Japanese [Kageura and Umino96, Cabre et al.01, Pazienza et al.05] For many languages, few NLP tools are available and suitable for automatic terminology extraction, while textual data exist and terminological resources are required 53/75 Grammarly Meet-up T Hamon
  70. 70. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Our objective Design of specific methods for the acquisition of such terminological resources in Ukrainian Approaches: Compilation of terminological resources Automatic building of terminologies Observations: increasing availability of parallel bilingual corpora Methodology: Use of specialized parallel corpora including a low-resourced language (Ukrainian) to build bilingual and trilingual terminologies by means of the cross-lingual transfer principle 54/75 Grammarly Meet-up T Hamon
  71. 71. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Cross-lingual transfer principle [Yarowsky et al.01, Lopez et al.02] Hypothesis: parallel and aligned corpora with two languages L1 and L2 syntactic or semantic annotations and information from L1 Method: transpose these annotations or information from L1 to L2, obtain the corresponding annotations and information in L2 Efficient way for [Zeman and Resnik08, Mcdonald et al.11] processing multilingual texts from low-resourced languages creating various types of annotations: part-of-speech, semantic categories or even acoustic and prosodic features 55/75 Grammarly Meet-up T Hamon
  72. 72. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Drawbacks of the transfer principle The transfer methodology depends on the quality of the extracted information and annotation from L1 texts the quality of alignment usually a statistical alignment method depending on the size of the corpora: the bigger the better → Define an approach to bypass these drawbacks 56/75 Grammarly Meet-up T Hamon
  73. 73. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Material Medical data in three languages (Ukrainian, French, and English): Ukrainian Wikipedia: source of relevant terms help for the word-level alignment of the MedlinePlus corpus MedlinePlus corpus: a collection of specialized texts providing the basis for the building of the terminology 57/75 Grammarly Meet-up T Hamon
  74. 74. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Medicine-related articles from Ukrainian Wikipedia Selection of the Ukrainian part of the Wikipedia using medicine-related categories, such as Медицина (medicine) or Захворювання (disorders) Potentially covers a wide range of medical notions Use of information in the infobox 58/75 Grammarly Meet-up T Hamon
  75. 75. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Parallel medical corpus [Hamon and Grabar17] -- http://natalia.grabar.free.fr/resources.php Patient-oriented brochures in three languages (Ukrainian, French, and English) from MedlinePlus on several medical topics (body systems, disorders and conditions, diagnosis and therapy, health and wellness) created in English and then translated into several other languages (including French and Ukrainian) About 43,000 words for each language English Ukrainian Cancer cells grow and divide more quickly than healthy cells. Cancer treatments are made to work on these fast growing cells. Ракові клітини ростуть і діляться швидше, ніж здорові клітини. При лікуванні раку здійснюється вплив на ці клітини, що швидко ростуть. - Tiredness - Втома - Nausea or vomiting - Нудота або блювота - Pain - Біль - Hair loss called alopecia - Втрата волосся, що називається алопецією 59/75 Grammarly Meet-up T Hamon
  76. 76. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the Wikipedia Objective: complete and help the alignment method applied to the MedlinePlus corpus Use of content of the infoboxes Ukrainian Wikipedia medical part 60/75 Grammarly Meet-up T Hamon
  77. 77. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the Wikipedia Objective: complete and help the alignment method applied to the MedlinePlus corpus Use of content of the infoboxes Ukrainian Wikipedia medical part Processing of the InfoBoxes 60/75 Grammarly Meet-up T Hamon
  78. 78. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the Wikipedia Objective: complete and help the alignment method applied to the MedlinePlus corpus Use of content of the infoboxes Ukrainian Wikipedia medical part Processing of the InfoBoxes Medical terms with MeSH codes Цукровий діабет тип 2 60/75 Grammarly Meet-up T Hamon
  79. 79. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the Wikipedia Objective: complete and help the alignment method applied to the MedlinePlus corpus Use of content of the infoboxes Ukrainian Wikipedia medical part Processing of the InfoBoxes Medical terms with MeSH codes UMLSQuerying UMLS UMLS Цукровий діабет тип 2 NIDDM Type 2 Diabetes Mellitus DID2, Diabète avec insulinorésistance 60/75 Grammarly Meet-up T Hamon
  80. 80. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the Wikipedia Objective: complete and help the alignment method applied to the MedlinePlus corpus Use of content of the infoboxes Ukrainian Wikipedia medical part Processing of the InfoBoxes Medical terms with MeSH codes UMLSQuerying UMLS Pairs of medical terms (UK/FR and UK/EN) Цукровий діабет тип 2 NIDDM Type 2 Diabetes Mellitus DID2, Diabète avec insulinorésistance 60/75 Grammarly Meet-up T Hamon
  81. 81. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Illustration of the transfer methods English Ukrainian Cancer cells grow and divide more quickly than healthy cells. Cancer treatments are made to work on these fast growing cells. Ракові клітини ростуть і діляться швидше, ніж здорові клітини. При лікуванні раку здійснюється вплив на ці клітини, що швидко ростуть. - Tiredness - Втома - Nausea or vomiting - Нудота або блювота - Pain - Біль - Hair loss called alopecia - Втрата волосся, що називається алопецією 61/75 Grammarly Meet-up T Hamon
  82. 82. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment 62/75 Grammarly Meet-up T Hamon
  83. 83. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA 62/75 Grammarly Meet-up T Hamon
  84. 84. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines 62/75 Grammarly Meet-up T Hamon
  85. 85. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines Pairs of candidate terms (UK/FR and UK/EN) 62/75 Grammarly Meet-up T Hamon
  86. 86. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 Transfer 2MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines Pairs of candidate terms (UK/FR and UK/EN) 62/75 Grammarly Meet-up T Hamon
  87. 87. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 Transfer 2MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment Giza++ suite (including MkCls) POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines Pairs of candidate terms (UK/FR and UK/EN) MedlinePlus corpora aligned at the word level 62/75 Grammarly Meet-up T Hamon
  88. 88. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 Transfer 2MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment Giza++ suite (including MkCls) POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines Pairs of candidate terms (UK/FR and UK/EN) MedlinePlus corpora aligned at the word level UK term extraction by transfer 62/75 Grammarly Meet-up T Hamon
  89. 89. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 Transfer 2MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment Giza++ suite (including MkCls) POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines Pairs of candidate terms (UK/FR and UK/EN) MedlinePlus corpora aligned at the word level UK term extraction by transfer Pairs of candidate terms (UK/FR and UK/EN) 62/75 Grammarly Meet-up T Hamon
  90. 90. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 Transfer 2MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment Giza++ suite (including MkCls) POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines Pairs of candidate terms (UK/FR and UK/EN) MedlinePlus corpora aligned at the word level UK term extraction by transfer Pairs of candidate terms (UK/FR and UK/EN) Cross-fertilization with single-word terms 62/75 Grammarly Meet-up T Hamon
  91. 91. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Extraction of bilingual terminology from the MedlinePlus corpus Transfer 1 Transfer 2MedlinePlus Corpora UK/FR & UK/EN Cleaning and manual paragraph alignment Giza++ suite (including MkCls) POS tagging with TreeTagger and Flemm FR & EN term extraction with YATEA Extraction of UK terms corresponding to lines Pairs of candidate terms (UK/FR and UK/EN) MedlinePlus corpora aligned at the word level UK term extraction by transfer Pairs of candidate terms (UK/FR and UK/EN) Wikipedia pairs of medical terms Cross-fertilization with single-word terms Cross-fertilization with single-word terms 62/75 Grammarly Meet-up T Hamon
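The Transfer 2 step boils down to projecting a term extracted on the French or English side onto the Ukrainian side through word alignments. The sketch below illustrates this with a toy alignment given as index pairs; in the actual pipeline the alignments come from Giza++, and the function and variable names here are my own.

```python
def transfer_term(source_term_indices, alignment, target_tokens):
    """Project a source-language term (given as token indices in the source
    sentence) onto the target sentence through word alignments, given as
    (source_index, target_index) pairs."""
    target_indices = sorted({t for s, t in alignment if s in source_term_indices})
    if not target_indices:
        return None
    # Keep a contiguous target span to avoid picking up unrelated words.
    span = range(target_indices[0], target_indices[-1] + 1)
    return " ".join(target_tokens[i] for i in span)

if __name__ == "__main__":
    # English: "Cancer cells grow"  /  Ukrainian: "Ракові клітини ростуть"
    en = ["Cancer", "cells", "grow"]
    uk = ["Ракові", "клітини", "ростуть"]
    alignment = [(0, 0), (1, 1), (2, 2)]   # toy one-to-one alignment
    print(transfer_term({0, 1}, alignment, uk))   # Ракові клітини
```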
  92. 92. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Evaluation Performed by a Ukrainian native speaker with knowledge of medical informatics Manual checking of the extracted candidates: correct / not correct Validation: Terms: independently in each language Bilingual and trilingual relations Computing the precision of the results (correct answers / all the answers), with exact and inexact match (the correct term is included in or includes the candidate) 63/75 Grammarly Meet-up T Hamon
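Precision with exact and inexact matching, as defined above, can be computed along these lines; this is a sketch, not the evaluation script actually used, and the inexact mode simply counts a candidate as correct when it contains or is contained in a validated term.

```python
def precision(candidates, gold_terms, inexact=False):
    """Precision of extracted candidates against validated terms. With
    inexact=True, a candidate counts as correct if it contains or is
    contained in a gold term."""
    def correct(candidate):
        if inexact:
            return any(candidate in g or g in candidate for g in gold_terms)
        return candidate in gold_terms
    hits = sum(1 for c in candidates if correct(c))
    return hits / len(candidates) if candidates else 0.0

if __name__ == "__main__":
    gold = {"серцевий напад", "холестерину"}
    extracted = ["серцевий напад", "напад", "склянок"]
    print(precision(extracted, gold))                # 0.333...
    print(precision(extracted, gold, inexact=True))  # 0.666...
```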
  93. 93. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Results Bilingual terminology from Wikipedia 357 Ukrainian medical terms (among them 177 single-word terms) Use of the MeSH codes and UMLS: 1428 French terms (among them, 339 single-word terms) 3625 English terms (among them, 448 single-word terms) Difference with the number of Ukrainian terms due to the MeSH synonyms Bilingual pairs: 1,515 Ukrainian/French term pairs (270 pairs between single-word terms) 3,789 Ukrainian/English term pairs (405 pairs between single-word terms) Precision: 1 because of the collecting method 64/75 Grammarly Meet-up T Hamon
  94. 94. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Results Bilingual terminology from the MedlinePlus - Transfer 1 436 Ukrainian terms with 0.966 precision associated with 316 French terms and 354 English terms 282 triples between Ukrainian/French/English terms (prec.: 0.954) 63 pairs only between Ukrainian/French terms (prec.: 0.937) 115 pairs only between Ukrainian/English terms (prec.: 0.965) Relations involving synonyms: {втома, fatigue/tiredness}, {фаллопієва труба, trompes de fallope/trompe utérine} (fallopian tube), {втрата слуху/втрачається слух, hearing loss} associating several case forms with same English or French form: {вагітність, pregnancy} and {вагітності, pregnancy} 65/75 Grammarly Meet-up T Hamon
  95. 95. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Analysis Bilingual terminology from the MedlinePlus - Transfer 1 Few errors, mainly partial matches between two languages: {ви можете спати, dormir/sleep} - lit. you can sleep. {появу виразок у роті, mouth sores} - lit. (appearance of) mouth sores Causes of silence: variation due to the translation, which prevents the Transfer 1 method from extracting the term in French or English Догляд: matches the French title Soins but not the English title Your care Problem solved by the Transfer 2 method errors in the POS tagging or term extraction strategy Inability of the term extractor to identify French or English terms 66/75 Grammarly Meet-up T Hamon
  96. 96. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Results Bilingual terminology from the MedlinePlus - Transfer 2 9,040 Ukrainian extracted terms (prec.: 0.454) Exact match: Higher precision of the French (0.674) and English terms (0.761) But low number of terms: 3,671 for French, 3,597 for English Due to the rich morphology of the Ukrainian language: {напад, нападу} - attack, {припадків, припадки} - seizure, {костей, кістки} - bones Extraction of synonymous terms: {биття, удару} - beats, {приступам, припадків} - attacks/seizures Relations: 3,724 pairs of Ukrainian/French terms (prec.: 0.309) 4,745 pairs of Ukrainian/English terms (prec.: 0.401) 4,724 triples of Ukrainian/French/English terms (prec.: 0.419) Inexact match: Higher precision: +0.40 points for the Ukrainian terms, +0.05 for the French and English terms. Due to the alignment quality? 67/75 Grammarly Meet-up T Hamon
  97. 97. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Analysis Bilingual terminology from the MedlinePlus - Transfer 2 Error analysis Most of the errors are due to the alignment problems when the alignment is correct, the Ukrainian terms are correctly extracted by the transfer Term analysis Most of the extracted terms are specific to the medical domain {шприца, syringe}, {холестерину, cholesterol}, {фактори ризику, risk factors}, {трахеотомією, tracheostomy}), Other terms: close and approximating notions: {діти, children}, {здорову їжу, healthy diet}, {серцевий напад, heart attack}, {склянок рідини, glasses of liquid} Interesting observation: French and English terms correspond to phrases in Ukrainian: undercooked foods: не до кінця приготовлену їжу (lit. food which is not fully cooked) indolore (painless): При цьому обстеженні Ви не відчуєте жодного болю (lit. With this exam you will feel no pain) 68/75 Grammarly Meet-up T Hamon
98. 98. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Conclusion Proposal of transfer-based methods to extract term candidates in Ukrainian and to create Ukrainian/French and Ukrainian/English term pairs Works on freely available multilingual corpora in French, English and Ukrainian Resulting terminological resource: 4,588 Ukrainian medical terms and 34,267 relations with French and English terms → Method suitable for building terminologies for low-resourced languages 69/75 Grammarly Meet-up T Hamon
99. 99. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Future Work Bilingual word alignment with Fast-Align [Dyer et al., 2013] using statistical and morphological cues Use of the transfer method for keyphrase extraction from scientific papers ⇒ ongoing work with the Kyiv Institute of Cybernetics Proposing a similar term extraction method to work with comparable corpora 70/75 Grammarly Meet-up T Hamon
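fast_align reads one sentence pair per line in the `source ||| target` format. The sketch below prepares such a file and runs the aligner in both directions before symmetrising with atools; the flags are the documented ones, but the file names, the toy sentence pair, and the assumption that the binaries are installed and on PATH are mine.

```python
# Minimal sketch: prepare a parallel corpus in the "source ||| target" format
# expected by fast_align and run it in both directions, then symmetrise with
# atools. Assumes the fast_align and atools binaries are on PATH; the file
# names and the toy sentence pair are illustrative.
import subprocess

def write_parallel(src_sents, tgt_sents, path):
    """Write one 'source ||| target' sentence pair per line."""
    with open(path, "w", encoding="utf-8") as out:
        for src, tgt in zip(src_sents, tgt_sents):
            out.write(f"{src} ||| {tgt}\n")

def run_fast_align(corpus="fr-uk.txt"):
    # Forward alignment (-d favour diagonal, -o optimise tension, -v variational Bayes)
    with open("forward.align", "w") as fwd:
        subprocess.run(["fast_align", "-i", corpus, "-d", "-o", "-v"],
                       stdout=fwd, check=True)
    # Reverse alignment (-r)
    with open("reverse.align", "w") as rev:
        subprocess.run(["fast_align", "-i", corpus, "-d", "-o", "-v", "-r"],
                       stdout=rev, check=True)
    # Symmetrisation with the grow-diag-final-and heuristic
    with open("sym.align", "w") as sym:
        subprocess.run(["atools", "-i", "forward.align", "-j", "reverse.align",
                        "-c", "grow-diag-final-and"], stdout=sym, check=True)

if __name__ == "__main__":
    write_parallel(["une crise cardiaque peut survenir"],
                   ["може статися серцевий напад"],
                   "fr-uk.txt")
    run_fast_align("fr-uk.txt")
```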
100. 100. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Overall conclusion Biomedical text mining: a complex task which involves several types of information to link together, many strategies for identifying that information, and many terminological and linguistic resources which are more or less available, or difficult to build, depending on the language and the area Current challenges: concept recognition (disambiguation, normalization), multilingual approaches, approaches for low-resourced languages, use of information from social media 71/75 Grammarly Meet-up T Hamon
101. 101. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Ongoing funded projects Mining literature and using Linked Open Data MIAM Project (French National Research Agency, 2016) Mining the literature to collect interactions between drugs and food which might lead to adverse drug events Example: grapefruit inhibits the CYP3A4 enzyme involved in the metabolism of many drugs Objectives: Aggregating information extracted from unstructured data with knowledge already recorded in knowledge bases or Linked Open Data repositories (DrugBank, Thériaque, Sider, Diseasome, etc.) Managing the certainty and reliability of this information Formalisation of the interactions as Linked Open Data 72/75 Grammarly Meet-up T Hamon
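As a rough idea of what formalising such a food-drug interaction as Linked Open Data could look like, the sketch below builds a few RDF triples with rdflib; the ex: namespace, the property names, the certainty score and the DrugBank URI pattern are illustrative assumptions, not the MIAM schema.

```python
# Minimal sketch of formalising a food-drug interaction as RDF triples with
# rdflib. The ex: namespace, property names, certainty score and DrugBank URI
# pattern are illustrative assumptions, not the schema used in the MIAM project.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/miam/")             # hypothetical namespace
DRUGBANK = Namespace("http://bio2rdf.org/drugbank:")   # Linked Open Data mirror of DrugBank

g = Graph()
g.bind("ex", EX)

interaction = EX["fdi/grapefruit-simvastatin"]
g.add((interaction, RDF.type, EX.FoodDrugInteraction))
g.add((interaction, EX.involvesFood, EX.grapefruit))
g.add((interaction, EX.involvesDrug, DRUGBANK["DB00641"]))   # simvastatin in DrugBank
g.add((interaction, EX.mechanism, Literal("inhibition of CYP3A4-mediated metabolism")))
g.add((interaction, EX.certainty, Literal(0.8)))  # reliability score attached to the extracted fact
g.add((interaction, RDFS.comment,
       Literal("Extracted from the literature; to be checked against knowledge bases.")))

print(g.serialize(format="turtle"))
```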
102. 102. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Drug-related information [Figure: example graph of drug-related information extracted from clinical text, linking drugs (solumedrol, cortisone, methylprednisolone, digitaline, insulin) to their composition, adverse effects (allergic shock, Quincke oedema, brain oedema, ...), the conditions they are prescribed for (acne, osteoporosis, arterial hypertension, ...), and prescription features (INN, DDI, FDI, dosage, mode, frequency, reason, duration)] 73/75 Grammarly Meet-up T Hamon
103. 103. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Terminology acquisition for Ukrainian Use of the transfer method for keyphrase extraction from scientific papers Tuning of YATEA for Ukrainian Definition and design of methods for terminological and semantic relation acquisition 74/75 Grammarly Meet-up T Hamon
104. 104. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Дякую! (Thank you!) 75/75 Grammarly Meet-up T Hamon
  105. 105. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Ahmad (Rabiah) et Bath (Peter A). -- Identification of risk factors for 15-year mortality among community-dwelling older people using Cox regression and a genetic algorithm. Journal of Gerontology, vol. 60 (8), 2005, pp. 1052--8. Aubin (Sophie) et Hamon (Thierry). -- Improving Term Extraction with Terminological Resources. In : Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006), éd. par Salakoski (Tapio), Ginter (Filip), Pyysalo (Sampo) et Pahikkala (Tapio). pp. 380--387. -- Springer. Blake (Catherine). -- A text mining approach to enable detection of candidate risk factors. In : Medinfo, pp. 1528--1528. Cabré (MT), Estopà (R) et Vivaldi (J). -- Automatic term detection: a review of current systems, pp. 53--88. -- John Benjamins, 2001. Cerrito (Patricia). -- Inside text Mining. Health management technology, vol. 25 (3), 2004, pp. 28--31. Chapman (Wendy), Bridewell (Will), Hanbury (Paul), Cooper (Gregory) et Buchanan (Bruce). -- Evaluation of negation phrases in narrative clinical reports. In : Annual Symposium of the American Medical Informatics Association (AMIA). -- Washington, 2001. Dyer (Chris), Chahuneau (Victor) et Smith (Noah A.). -- A Simple, Fast, and Effective Reparameterization of IBM Model 2. In : NAACL/HLT, pp. 644--648. Golik (Wiktoria), Bossy (Robert), Ratkovic (Zorana) et Nédellec (Claire). -- Improving term extraction with linguistic analysis in the biomedical domain. In : Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'13). -- Samos, Greece, March 2013. Grouin (Cyril), Abacha (Asma Ben), Bernhard (Delphine), Cartoni (Bruno), Deléger (Louise), Grau (Brigitte), Ligozat (Anne-Laure), Minard (Anne-Lyse), Rosset (Sophie) et Zweigenbaum (Pierre). --75/75 Grammarly Meet-up T Hamon
  106. 106. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion CARAMBA: Concept, Assertion, and Relation Annotation using Machine-learning Based Approaches. In : Proceedings of the workshop I2B2 2010. Grouin (Cyril), Grabar (Natalia), Hamon (Thierry), Rosset (Sophie), Tannier (Xavier) et Zweigenbaum (Pierre). -- Eventual situations for timeline extraction from clinical reports. Journal of American Medical Informatics Association, vol. 20 (5), September 2013, pp. 820--827. -- (IF: 3.609). Hamon (Thierry) et Grabar (Natalia). -- Linguistic approach for identification of medication names and related information in clinical narratives. Journal of American Medical Informatics Association, vol. 17 (5), Sep-Oct 2010, pp. 549--554. -- PMID: 20819862. Hamon (Thierry) et Grabar (Natalia). -- Tuning HeidelTime for identifying time expressions in clinical texts in English and French. In : Proceedings of The Fifth International Workshop on Health Text Mining and Information Analysis (LOUHI2014) -- Short paper/Poster, pp. 101--105. -- Gothenburg, Sweden, April 2014. Hamon (Thierry) et Grabar (Natalia). -- Adaptation of Cross-Lingual Transfer Methods for the Building of Medical Terminology in Ukrainian. In : Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING2016). -- Springer. Hamon (Thierry) et Grabar (Natalia). -- Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation. In : Proceedings of Computational Linguistics and Intelligent Systems (COLINS 2017), pp. 10--19. Hamon (Thierry), Nazarenko (Adeline), Poibeau (Thierry), Aubin (Sophie) et Derivière (Julien). -- A Robust Linguistic Platform for Efficient and Domain specific Web Content Analysis. In : Proceedings of RIAO 2007. -- Pittsburgh, USA, 2007. 15 pages. 75/75 Grammarly Meet-up T Hamon
  107. 107. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Hamon (Thierry), Graña (Martin), Raggio (Víctor), Grabar (Natalia) et Naya (Hugo). -- Identification of relations between risk factors and their pathologies or health conditions by mining scientific literature. In : Proceedings of MEDINFO 2010, pp. 964--968. -- PMID: 20841827. Hamon (Thierry), Grabar (Natalia) et Kokkinakis (Dimitrios). -- Medication Extraction and Guessing in Swedish, French and English. In : Proceedings of MedInfo 2013. -- Copenhagen, Danemark, August 2013. Hamon (Thierry), Engström (Christopher) et Silvestrov (Sergei). -- Term ranking adaptation to the domain: genetic algorithm based optimisation of the C-Value. In : Proceedings of PolTAL 2014 -- Advances in Natural Language Processing, éd. par Springer , pp. 71--83. Kageura (K) et Umino (B). -- Methods of Automatic Term Recognition. In : National Center for Science Information Systems, pp. 1--22. Kolyshkina (I) et van Rooyen (M). -- Text mining for insurance claim cost prediction, pp. 192--202. -- Springer-Verlag, 2006. Lopez (Adam), Nossal (Mike), Hwa (Rebecca) et Resnik (Philip). -- Word-Level Alignment for Multilingual Resource Acquisition. In : LREC Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Data. -- Las Palmas, Spain, 2002. McDonald (Ryan), Petrov (Slav) et Hall (Keith). -- Multi-source transfer of delexicalized dependency parsers. In : EMNLP. Minard (AL), Ligozat (AL), Ben Abacha (A), Bernhard (D), Cartoni (B), Deléger (L), Grau (B), Rosset (S), Zweigenbaum (P) et Grouin (C). -- Hybrid methods for improving information access in clinical documents: concept, assertion, and relation identification. J Am Med Inform Assoc, vol. 18 (5), 2011, pp. 588--93. Pazienza (Maria Teresa), Pennacchiotti (Marco) et Zanzotto (FabioMassimo). -- 75/75 Grammarly Meet-up T Hamon
  108. 108. Introduction Mining EHR Mining Literature Terminology building by Transfer Conclusion Terminology Extraction: An Analysis of Linguistic and Statistical Approaches. In : Knowledge Mining, éd. par Sirmakessis (Spiros), pp. 255--279. -- Springer Berlin Heidelberg, 2005. Périnet (Amandine), Grabar (Natalia) et Hamon (Thierry). -- Identification des assertions dans les textes médicaux : application à la relation {patient, problème médical}. Traitement Automatique des Langues (TAL), vol. 52 (1), 2011, pp. 97--132. Strötgen (Jannik) et Gertz (Michael). -- Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. In : Proceedings of the Eigth International Conference on Language Resources and Evaluation (LREC'12). pp. 3746--3753. -- ELRA. Tsuruoka (Yoshimasa), Tateishi (Yuka), Kim (Jin-Dong), Ohta (Tomoko), McNaught (John), Ananiadou (Sophia) et Tsujii (Jun'ichi). -- Developing a Robust Part-of-Speech Tagger for Biomedical Text. In : Proceedings of Advances in Informatics - 10th Panhellenic Conference on Informatics, pp. 382--392. Yarowsky (David), Ngai (Grace) et Wicentowski (Richard). -- Inducing multilingual text analysis tools via robust projection across aligned corpora. In : HLT. Zeman (D) et Resnik (P). -- Cross-language parser adaptation between related languages. In : NLP for Less Privileged Languages. Zweigenbaum (Pierre), Lavergne (Thomas), Grabar (Natalia), Hamon (Thierry), Rosset (Sophie) et Grouin (Cyril). -- Combining an expert-based medical entity recognizer to a machine-learning system: methods and a case study. Biomedical Informatics Insights, vol. 6 (Suppl. 1), 2013, pp. 51--62. 75/75 Grammarly Meet-up T Hamon
