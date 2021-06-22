
Natural Language Processing for Medical Data

Natural language is highly ambiguous, and the sense of a word depends heavily on the context it appears in. While slight uncertainties are acceptable in the texts you read on a daily basis, they can lead to fatalities in medical contexts. This talk gives an introduction to the underlying problem, word sense ambiguity, and to the technical approach that aims to resolve it: entity linking. Using practical examples, we highlight the crucial challenges that need to be overcome when dealing with German data, and show how we integrate the solutions into our product, damedic code.

Talk held at ML Conference 2021, online.

  1. Natural Language Processing for medical data
  2. Dr. Anja Pilz, ML Conference 2021
     About me @anja_pilz aplz
     ● PhD in machine learning & natural language processing from University of Bonn & Fraunhofer IAIS
     ● Now in industry: AI and data-driven products, since 2016 mostly in the medical and healthcare domain
     ● Main interests: NLP, especially German; information retrieval; recommender systems
  3. Motivation
     Doctors spend more time documenting what they do than actually treating patients
     ● 70% of work hours are dedicated to tasks not performed on the patient (organisation and documentation)
     Documentation matters because it covers symptoms, risk factors, intolerances, treatments, …
     ● each piece of information is vital for the patient, but can be buried somewhere
     It is not only complex cases that quickly become “unscannable”
     ● use NLP for information extraction: automatically search, analyze, and add structure to these unstructured texts
     Source: Swiss Medical Journal, 2016;97(1):6–8
  4. Motivation
     Support doctors’ daily work
     ● create warnings from automatically detected risks and contraindications
     ● summarize suspected and excluded diagnoses (differential diagnosis)
     ● add hints to treatment guidelines
     And much more!
  5. Motivation
     Support the billing process
     ● the billing process is super complex and needs to be watertight
     ● help medical controllers to find relevant information
     ● automatically find mentions of diseases and treatments
     ● align them with entries from the catalogs used for billing (e.g. ICD-10)
     Image: damedic code
  6. NLP Tasks for Medical Data
     Entity Recognition (NER): detect all relevant mentions
     ● diagnoses
     ● procedures
     ● body parts
     ● drugs
     ● measurements
     ● negations...
     Entity Linking (NEL/NED): link to unique concepts
     ● entries in (curated) medical ontologies or catalogs
     ● normalization
     Entity Filtering: filter relevant entities (clinical, billing)
     Used for documentation, summarization, & billing
  7. Challenges: Medical Domain is not News
     Typical medical texts are very different from common NLP data
     ● super condensed and short, sometimes like an enumeration
     ● full of abbreviations, acronyms and technical terms
     ● ambiguity is often resolved through sheer knowledge, not necessarily by the local context
     Example: “Indication: Acute hypoxia. Relapsed AML, GVHD, and renal failure with new hypoxia with clear chest x-ray.”
  8. Challenges: Ambiguity
     Abbreviations are used for convenience
     ● ambiguous ones may cause miscommunication
     ● potentially jeopardise patient care
     Entity Linking needs to expand acronyms but must not rely on priors
     ● TMZ: temazepam | temozolomide
     ● LFT: liver function test | lung function test
     ● HWI: Harnwegsinfekt (urinary tract infection) | Hinterwandinfarkt (posterior wall myocardial infarction)
     ● BCa: bladder cancer | breast cancer
     ● VF: Vorhofflimmern (atrial fibrillation) | Vorhofflattern (atrial flutter)
     ● MS: Magensonde (gastric tube) | Mitralstenose (mitral stenosis)
     Holper et al., Ambiguous medical abbreviation study: challenges and opportunities, Intern Med J. 2020
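A minimal sketch of what "expand acronyms but do not rely on priors" can look like in code. The lookup table only holds the examples from the slide; the function name `expand` is an assumption made for illustration. The point is that every expansion is returned as an equal candidate, so a later context-based step has to decide.

```python
# Ambiguous abbreviations from the slide, each mapped to all of its known expansions.
EXPANSIONS = {
    "TMZ": ["temazepam", "temozolomide"],
    "LFT": ["liver function test", "lung function test"],
    "HWI": ["Harnwegsinfekt", "Hinterwandinfarkt"],
    "BCa": ["bladder cancer", "breast cancer"],
    "VF": ["Vorhofflimmern", "Vorhofflattern"],
    "MS": ["Magensonde", "Mitralstenose"],
}


def expand(abbreviation):
    """Return every known expansion, unranked (hypothetical helper).

    No expansion is preferred a priori; an unknown abbreviation yields an empty list.
    """
    return list(EXPANSIONS.get(abbreviation, []))


print(expand("HWI"))
```

Returning the full candidate list instead of the most frequent expansion keeps the rare but dangerous reading in play until the context has been looked at.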
  9. Challenges: German
     Latin origin vs German spelling results in a bunch of variations
     ● Carcinom, Karcinom, Carzinom, Karzinom, Ca, CA
     The notorious compound words
     ● sound perception disorder: Schallempfindungsstörung
     ● retinal artery occlusion: Netzhautarterienverschluss
     ● detection of tuberculosis: Tuberkulosenachweis
     Decompounding is non-trivial and requires profound linguistic knowledge
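The spelling variations listed on this slide can be collapsed with a small normalisation step. The following is a sketch under stated assumptions, not the approach used in the talk: the function name `normalize_carcinoma` is hypothetical, and treating a standalone "Ca"/"CA" as carcinoma is an assumption that will be wrong in contexts where it means calcium.

```python
import re

# Matches Carcinom, Karcinom, Carzinom, Karzinom, also inside compounds.
_FULL_FORM = re.compile(r"[ck]ar[cz]inom", re.IGNORECASE)
# Matches a standalone "Ca" or "CA" that is not part of a longer word.
_SHORT_FORM = re.compile(r"(?<![A-Za-zÄÖÜäöüß])(?:Ca|CA)(?![A-Za-zÄÖÜäöüß])")


def _canonical(match):
    # Preserve the capitalisation of the first letter so compounds stay readable.
    return "Karzinom" if match.group(0)[0].isupper() else "karzinom"


def normalize_carcinoma(text):
    """Map the carcinoma spelling variants onto the single form 'Karzinom'."""
    text = _FULL_FORM.sub(_canonical, text)
    # Assumption: a standalone Ca/CA denotes carcinoma (it may also mean calcium).
    return _SHORT_FORM.sub("Karzinom", text)


print(normalize_carcinoma("Prostatacarcinom, Mamma-Ca"))
```

Decompounding is deliberately left out here, since the slide points out that it needs real linguistic knowledge rather than a regular expression.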
  10. Entity Recognition (EN)
      ● data is available, e.g. BC5CDR (1500 PubMed articles with annotated chemicals, diseases & their interactions)
      ● trained models are available
      ● not “solved” but at a pretty good state of the art
      https://scispacy.apps.allenai.org/
  11. Entity Recognition (DE)
      ● typical off-the-shelf models are not useful for the medical domain
      ● need to train domain models here
  12. Data?
      Real patient data
      ● resides in hospitals and medical practices
      ● not publicly available
      Public data
      ● netdoktor != Dr. B. Oss
      ● data in layman's language does not compare well to real medical texts
      ● may still help
      Patient: “Ich habe im Moment keine Blutdruckprobleme” (“I currently have no blood pressure problems”)
      Doctor: “RR gut eingestellt” (“blood pressure well controlled”)
  13. Entity Recognition
      Get data. Start annotating.
      ● entities are all concepts of interest: drugs, medical conditions, procedures, body parts, …
      ● annotation usually requires medical expert knowledge
      ● super specific vocabulary with lots of abbreviations & acronyms
      ● good to go after ~1k documents
  14. Entity Recognition
      Train your own model + data
  15. Entity Linking
      Most work in research: link entity mentions to concepts in the medical thesaurus UMLS
      ● higher-level metadata enrichment
      ● index new publications by topic & keywords
      ● hot topic, and a bunch of publications exist
      Why not?
      ● no German version (yet)
      ● concepts are sometimes not specific enough
      Murty et al., Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking, ACL 2018
      Kolitsas et al., End-to-End Neural Entity Linking, CoNLL 2018
      Mohan & Li, MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts, AKBC 2019
  16. ICD-10 Linking
      ICD: International Statistical Classification of Diseases and Related Health Problems
      ● catalogs mental and physical disorders in most specific and precise form
      ● global standard for clinical documentation and billing
      ● published yearly by the WHO
      https://icd.who.int/browse10/2019/en
  17. ICD-10 Linking
      ICD: International Statistical Classification of Diseases and Related Health Problems
      ● catalogs mental and physical disorders in most specific and precise form
      ● global standard for clinical documentation and billing
      ● published yearly by the WHO
      ● … comes with the German modification ICD-10-GM (BfArM)
      https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
  18. ICD-10 Linking
      Higher clinical relevance
      ● support doctors: can’t get much more specific than with a diagnosis code
      ● support medical controllers: ICD codes are the items used in billing, not UMLS concepts
      Requires entity filtering to avoid false positives
      ● excluded or suspected diagnoses
      ● “state after” (Z.n.) conditions: clinically relevant, but not billing relevant
      EHR example: “Keine Hinweis auf intrazerebrale Blutung. Z.n. Hysterektomie, 2006” (“No evidence of intracerebral haemorrhage. Status post hysterectomy, 2006”)
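  19. Entity Filtering for primary coding
      Most mentions may be clinically relevant, but not coding relevant. Need relation extraction approaches here.
      ● Prostatacarcinom in der Vorgeschichte (history of prostate carcinoma)
      ● Vorbekannte Osteochondrose (previously known osteochondrosis)
      ● Z.n. mehrfachem Apoplexen, zuletzt 2006 (status post multiple strokes, most recently 2006)
      ● Mamma-Ca wurde ausgeschlossen. (Breast cancer was ruled out.)
      ● Keine Hinweis auf intrazerebrale Blutung. (No evidence of intracerebral haemorrhage.)
      ● Die BWK 9-Fraktur zeigte sich mit fehlender knöcherner Durchbauung im Sinne einer Pseudarthrose. (The fracture of thoracic vertebra 9 showed no bony consolidation, consistent with a pseudarthrosis.)
      ● Intrazerebrale Blutung konnte nicht bestätigt werden. (Intracerebral haemorrhage could not be confirmed.)
      ● Verdacht auf arterielle Hypertonie. (Suspected arterial hypertension.)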
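The slide calls for relation extraction; as a much simpler stand-in, the example sentences above can already be separated with a trigger-phrase lookup. This is only a sketch: the trigger phrases are read off the slide examples, the status labels and the function name `mention_status` are assumptions, and real documents will need far more than substring matching.

```python
# Trigger phrases taken from the example sentences; checked in this order.
TRIGGERS = [
    ("excluded", ["kein hinweis auf", "keine hinweis auf",
                  "wurde ausgeschlossen", "konnte nicht bestätigt werden"]),
    ("suspected", ["verdacht auf"]),
    ("history", ["z.n.", "in der vorgeschichte", "vorbekannt"]),
]


def mention_status(sentence):
    """Classify a sentence as excluded, suspected, history or asserted.

    Hypothetical helper: plain substring matching on lower-cased text,
    not the relation extraction approach the slide asks for.
    """
    lowered = sentence.lower()
    for status, phrases in TRIGGERS:
        if any(phrase in lowered for phrase in phrases):
            return status
    return "asserted"


for text in ["Mamma-Ca wurde ausgeschlossen.",
             "Verdacht auf arterielle Hypertonie.",
             "Z.n. mehrfachem Apoplexen, zuletzt 2006"]:
    print(mention_status(text), "<-", text)
```

Only mentions that come back as "asserted" would be passed on as candidates for primary coding in this toy setup.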
  20. Toy example. Typical cases are much more complex.
  21. ICD-10 Linking
      To be really useful, the link must be super specific
      ● “some renal failure” (N17*) is not good enough
      Specificity relates to the stage of the disease
      ● hugely affects treatment complexity and care intensity
      ● treatment complexity directly corresponds to the hospital’s bill sent to the insurance company
      https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
  22. Specificity
      To describe a disease in a certain stage or manifestation, the catalog is super specific
      ● 40 entries for different instances of Diabetes Mellitus, Type 1 and Type 2 each
      ● there are even more forms of Diabetes...
      Difference is sometimes only one word
      ● “nicht” (not) or “mit/ohne” (with/without): usual stopwords are dangerous here!
      https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
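Why the usual stopword lists are dangerous can be shown in a few lines. This is a sketch with assumptions: the stopword set is a small hypothetical sample of a generic German list, and the two descriptions are illustrative wording, not quoted catalog entries.

```python
# Hypothetical excerpt of a generic German stopword list.
STOPWORDS = {"mit", "ohne", "nicht", "einer", "der", "die", "das"}

# Illustrative descriptions that differ in exactly one word.
WITH_CRISIS = "Mit Angabe einer hypertensiven Krise"
WITHOUT_CRISIS = "Ohne Angabe einer hypertensiven Krise"


def tokens(text, drop_stopwords):
    """Lower-case, strip simple punctuation, optionally remove stopwords."""
    words = [word.strip(".,:;").lower() for word in text.split()]
    words = [word for word in words if word]
    if drop_stopwords:
        words = [word for word in words if word not in STOPWORDS]
    return words


# With stopword removal the two opposite descriptions become indistinguishable.
print(tokens(WITH_CRISIS, True) == tokens(WITHOUT_CRISIS, True))
# Without it, the single distinguishing word survives.
print(tokens(WITH_CRISIS, False) == tokens(WITHOUT_CRISIS, False))
```

For catalog matching, words like “mit”, “ohne” and “nicht” therefore have to stay in the index, whatever a general-purpose analyzer does by default.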
  23. Precision vs Context
      ICD is completely different from Wikipedia
      ● catalog entries are precise descriptions without further context
      ● descriptions are not the most commonly used names
      ● descriptions tend to be very long: median number of words is 5, maximum is 28
      ● typically not used in this form by the doctors: low character overlap, low similarity
      Examples: “... RR 150/90...”, “... rezidiv. Bluthochdruck mit Schwächegefühl...” (“recurrent high blood pressure with a feeling of weakness”)
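The "low character overlap" point can be made concrete with character trigrams. The sketch below is an illustration only: Jaccard similarity over trigrams is one of many possible string measures, the helper names are hypothetical, and "Essentielle Hypertonie" is used as an illustrative description rather than a quoted catalog entry.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lower-cased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


description = "Essentielle Hypertonie"  # illustrative wording
for mention in ["RR 150/90", "rezidiv. Bluthochdruck"]:
    print(mention, "->", jaccard(mention, description))
```

Both mentions point to high blood pressure, yet they share no trigram with the description, which is why plain string similarity is not enough for linking.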
  24. About Context..
      Disambiguating information need not be located in the discharge letter
      ● can even be in a completely different data format, e.g. lab measurements
      ● N18*: multiple measurements of a specific lab value (Creatinine)
      ● not an NLP task anymore, time series analysis?
      https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
  25. Entity Linking in Practice
      Go-to solution for candidate retrieval: inverted index over catalog descriptions
      ● basically a vector space model with cosine similarity over (query, entry)
      ● make use of the analyzers that come with Lucene for tokenization, stemming, etc.
      Secret sauce
      ● add medical knowledge and extend the descriptions (e.g. synonyms)
      ● handcraft the search query from the mention context
      Gist: aim for high recall, you can’t link what you don’t find...
      Pilz & Paaß, Collective Search for Concept Disambiguation, COLING 2012
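The candidate retrieval idea (inverted index plus cosine similarity in a vector space model) can be sketched in plain Python. This is not Lucene and not the implementation from the talk: the class `CatalogIndex`, the TF-IDF weighting and the toy catalog with its IDs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return [token for token in re.split(r"\W+", text.lower()) if token]


class CatalogIndex:
    """Tiny inverted index with TF-IDF weights and cosine scoring (hypothetical)."""

    def __init__(self, entries):
        self.postings = defaultdict(set)   # term -> ids of entries containing it
        self.term_counts = {}              # entry id -> term frequencies
        for entry_id, description in entries.items():
            counts = Counter(tokenize(description))
            self.term_counts[entry_id] = counts
            for term in counts:
                self.postings[term].add(entry_id)
        total = len(entries)
        self.idf = {term: math.log(1 + total / len(ids))
                    for term, ids in self.postings.items()}

    def _weights(self, counts):
        # Terms unknown to the index carry no weight.
        return {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}

    def search(self, query, k=5):
        q = self._weights(Counter(tokenize(query)))
        q_norm = math.sqrt(sum(w * w for w in q.values()))
        if q_norm == 0:
            return []
        # Only entries sharing at least one term with the query are scored.
        candidates = set().union(*(self.postings[t] for t in q))
        scored = []
        for entry_id in candidates:
            d = self._weights(self.term_counts[entry_id])
            d_norm = math.sqrt(sum(w * w for w in d.values()))
            dot = sum(w * d.get(t, 0.0) for t, w in q.items())
            scored.append((dot / (q_norm * d_norm), entry_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:k]


# Toy catalog: IDs and wording are illustrative, not real ICD-10-GM entries.
TOY_CATALOG = {
    "C1": "Akutes Nierenversagen Stadium 1",
    "C2": "Akutes Nierenversagen Stadium 3",
    "C3": "Chronische Nierenkrankheit Stadium 3",
    "C4": "Diabetes mellitus Typ 2",
}
index = CatalogIndex(TOY_CATALOG)
print(index.search("Akutes Nierenversagen Stadium 3"))
```

Because every entry that shares a term with the query is returned as a candidate, the sketch favours recall, in line with the gist on the slide; precision is left to the later ranking step.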
  26. Demo
      Can handle typos and spelling variations. Query: “diabetes meltus” fetches all codes for Diabetes mellitus.
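The slide does not say how the demo tolerates typos. One simple stand-in is to map each unknown query token onto the closest term of the index vocabulary before searching. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name `correct_query`, the cutoff of 0.8 and the small vocabulary are assumptions.

```python
import difflib


def correct_query(query, vocabulary, cutoff=0.8):
    """Replace tokens missing from the vocabulary by their closest known term.

    Hypothetical helper: tokens without a sufficiently similar term are kept as typed.
    """
    known = sorted(vocabulary)
    corrected = []
    for token in query.lower().split():
        if token in known:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


# Illustrative vocabulary, as it might be collected from catalog descriptions.
VOCABULARY = ["diabetes", "mellitus", "typ", "nierenversagen"]
print(correct_query("diabetes meltus", VOCABULARY))
```

A stricter cutoff gives fewer wrong corrections but misses more typos; for candidate retrieval, erring on the lenient side fits the high-recall goal.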
  27. Demo
      Can handle alternative names like synonyms or acronyms. Query “ANV 3” fetches all “Akutes Nierenversagen ... Stadium 3” (acute renal failure, stage 3) codes.
      But which one is it? Cannot decide on the best candidate...
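Extending the catalog descriptions with alternative names is what lets a query like “ANV 3” find anything at all. A minimal sketch, assuming a hand-made alias table and a toy catalog: the IDs, the wording of the entries and the helper names are illustrative, and only the alias “ANV” for “Akutes Nierenversagen” comes from the slide.

```python
import re

# Alias table; "ANV" for "Akutes Nierenversagen" is the example from the slide.
ALIASES = {"akutes nierenversagen": ["ANV"]}

# Toy catalog: IDs and wording are illustrative, not real ICD-10-GM entries.
CATALOG = {
    "C1": "Akutes Nierenversagen mit Tubulusnekrose: Stadium 3",
    "C2": "Akutes Nierenversagen, nicht näher bezeichnet: Stadium 3",
    "C3": "Akutes Nierenversagen mit Tubulusnekrose: Stadium 1",
    "C4": "Chronische Nierenkrankheit, Stadium 3",
}


def extend_description(description, aliases=ALIASES):
    """Append every alias whose phrase occurs in the description."""
    lowered = description.lower()
    extras = [alias for phrase, names in aliases.items()
              if phrase in lowered for alias in names]
    return " ".join([description] + extras)


def candidates(query, catalog=CATALOG):
    """Ids of all entries whose extended description contains every query token."""
    wanted = set(re.findall(r"\w+", query.lower()))
    hits = []
    for entry_id, description in catalog.items():
        terms = set(re.findall(r"\w+", extend_description(description).lower()))
        if wanted <= terms:
            hits.append(entry_id)
    return sorted(hits)


print(candidates("ANV 3"))
```

The query returns both stage 3 entries, which reproduces the open question on the slide: retrieval finds the candidates, but cannot say which one is meant.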
  28. Best Candidate?
      Recipe: rank by context similarity to decide on the best candidate
      ● find expressive vector representations of mention-candidate pairs
        ○ word2vec
        ○ topic distributions (LDA)
        ○ graphical similarity …
      ● plug vectors into some ranking model
        ○ Ranking SVM
        ○ specific loss functions in neural networks (Hamming)
      But we have seen: the catalog does not provide extensive descriptions, so... Next time!
      Pilz & Paaß, From names to entities using thematic context distance, CIKM 2011
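The ranking recipe can be illustrated with the simplest possible representation. This sketch swaps in plain bag-of-words vectors and cosine similarity for the word2vec or LDA representations and the learned ranking models named on the slide; the candidate entries, the context sentence and the function names are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(context, candidates):
    """Order candidate ids by similarity between mention context and description."""
    context_vector = bag_of_words(context)
    return sorted(
        candidates,
        key=lambda cid: (-cosine(context_vector, bag_of_words(candidates[cid])), cid),
    )


# Toy candidates: IDs and wording are illustrative, not real ICD-10-GM entries.
CANDIDATES = {
    "C1": "Akutes Nierenversagen mit Tubulusnekrose: Stadium 3",
    "C2": "Akutes Nierenversagen, nicht näher bezeichnet: Stadium 3",
}
print(rank("ANV 3 bei akuter Tubulusnekrose", CANDIDATES))
```

The toy context happens to contain a word from the right description. As the slide notes, real catalog descriptions are short, so this kind of overlap is often missing and richer representations are needed.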
  29. Thanks! Questions? Say Hi!

