This document summarizes two projects aimed at improving representations of medical concepts and codes through natural language processing of clinical notes. The first project uses an external knowledge graph to enhance skip-gram embeddings of medical concepts. The second project jointly learns representations of medical codes and words from clinical notes using a modified skip-gram model that considers code-to-code, word-to-word, and code-to-word relationships. The document concludes by discussing future directions, including applying these methods to other domains and using more recent pre-trained language models such as Clinical BERT.
2. Contents
● Introduction
● Related works
● Project 1:
■ Improving medical concept representations with external knowledge
● Project 2:
■ Jointly learning medical concepts and code representation
● Conclusions and future work
3. Introduction
● What are medical concepts?
● Medical concepts are medical terms, abbreviations or short form words.
■ Ex: heart attack, breast cancer, tumor, ‘cp’ for chest pain, or drug names.
● What are medical codes?
● Standard codes for representing diagnoses, procedures, or drugs.
● Different medical organizations define their own standard code formats.
■ Ex: 1749 is an ICD-9 code for breast cancer.
■ Ex: 96409 is a CPT code for chemotherapy.
4. Introduction
● What do Electronic Health Records (EHRs) look like?
● EHR data contain both structured data (e.g., lab values, medical codes) and
unstructured data (physicians’ notes).
Unstructured data: clinical note
Structured data: medical code events
Code     Code description
1749     breast cancer
96409    chemotherapy
…        …
5. Introduction
● What is NLP?
● Natural Language Processing, or NLP, is defined as the automatic
manipulation of natural language, like speech and text.
● Finding machine-readable representations for words and documents.
■ Ex: Understanding the ‘severity’ of a patient’s condition from the physician’s note.
■ Ex: Understanding the semantic similarity between ‘kidney’ and ‘renal’.
■ Ex: Finding patient phenotypes or clusters from clinical note descriptions.
6. Related works
● Previous research used EHRs for patient phenotyping [1], health risk
prediction [2, 3], cohort selection [4], and visual explorations [5, 6].
● Understanding text written in the medical notes is a very important step
for such research studies.
● Many frequency-based methods, such as BOW, TF-IDF [9], PMI [10], and
GloVe [7], have been developed to represent documents and sentences.
● Recent studies have focused on neural-network-based methods.
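To make the frequency-based family concrete, here is a minimal sketch of bag-of-words counts and one common TF-IDF variant (term frequency normalized by document length, times log(N / document frequency)). The function names and the toy documents are assumptions made for illustration; other TF-IDF weightings exist and may be what a given paper uses.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Count how often each term occurs in a tokenized document."""
    return Counter(doc)

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = bag_of_words(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical toy notes, already tokenized.
toy_docs = [["chest", "pain"], ["chest", "tumor"]]
toy_vectors = tf_idf(toy_docs)
```

A term that appears in every document ("chest" here) gets weight zero, which is the property that separates TF-IDF from raw counts.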
7. Skip-gram Model
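● The skip-gram model scans each sentence and maximizes the log-likelihood of the
context words observed within a window around each scanned (target) word.
● The model assigns a likelihood to observing the context word w_i given the target word w_t,
where w_i lies inside the context window (here, w_{t-2} to w_{t+2}).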
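A common way to write this likelihood is the softmax form of the skip-gram model from [8]. It is assumed here as the standard formulation. The symbols v (input vector), v' (output vector), V (vocabulary), T (corpus length), and c (window size) are notation introduced for this sketch.

```latex
p(w_i \mid w_t) = \frac{\exp\left({v'_{w_i}}^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\left({v'_{w}}^{\top} v_{w_t}\right)},
\qquad
\mathcal{L} = \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
```

Training searches for the vectors v and v' that maximize the log-likelihood \mathcal{L} over the corpus.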
How would we learn this probability?
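One possible answer, sketched below, is stochastic gradient ascent on the log-likelihood with a full softmax over the vocabulary. This is a toy illustration under stated assumptions, not the implementation used in the projects: the function names, hyperparameters, and two-sentence corpus are all hypothetical, and practical systems usually replace the full softmax with negative sampling or a hierarchical softmax.

```python
import math
import random

def training_pairs(sentence, window=2):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for t, target in enumerate(sentence):
        lo, hi = max(0, t - window), min(len(sentence), t + window + 1)
        pairs.extend((target, sentence[i]) for i in range(lo, hi) if i != t)
    return pairs

def softmax_probs(target, v_in, v_out):
    """p(w | target) for every vocabulary word w, via a full softmax."""
    scores = {w: sum(a * b for a, b in zip(v_out[w], v_in[target])) for w in v_out}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def train(sentences, dim=8, window=2, lr=0.1, epochs=50, seed=0):
    """Learn input/output vectors by gradient ascent on the log-likelihood."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    v_in = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    v_out = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    pairs = [p for s in sentences for p in training_pairs(s, window)]
    for _ in range(epochs):
        for target, context in pairs:
            probs = softmax_probs(target, v_in, v_out)
            h = v_in[target][:]  # copy, because v_in[target] is updated below
            grad_h = [0.0] * dim
            for w in vocab:
                # d(log p) / d(score_w) = 1[w == context] - p(w | target)
                err = (1.0 if w == context else 0.0) - probs[w]
                for k in range(dim):
                    grad_h[k] += err * v_out[w][k]  # uses the pre-update value
                    v_out[w][k] += lr * err * h[k]
            for k in range(dim):
                v_in[target][k] += lr * grad_h[k]
    return v_in, v_out

def log_likelihood(sentences, v_in, v_out, window=2):
    """Sum of log p(context | target) over all training pairs."""
    return sum(
        math.log(softmax_probs(t, v_in, v_out)[c])
        for s in sentences
        for t, c in training_pairs(s, window)
    )

# Hypothetical two-sentence corpus in which 'renal' and 'kidney' share contexts.
toy_corpus = [
    ["patient", "has", "renal", "failure"],
    ["patient", "has", "kidney", "failure"],
]
```

Calling `train(toy_corpus)` returns the two vector tables. Comparing `log_likelihood` before and after training shows whether the model has learned to predict the observed context words.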
9. Problem: Learning medical concept representations
● We can run skip-gram on medical notes to find concept representations.
● However, many concepts are rarely used in notes, but are important and
have significant meaning.
● External knowledge can help to improve the medical concept representations.
How can we integrate this knowledge? Modified skip-gram model
10. Results: Qualitative analysis
● For a given medical concept, we check the 10 nearest neighbors based on the cosine similarity in
the learned vector space.
Top 10 nearest neighbor concepts of “bipolar disorder”
Our model
Bipolar disorder
depression
anxiety
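The nearest-neighbor check described above can be sketched as follows. The three-dimensional vectors are invented purely for illustration and are not the learned embeddings; the function names are likewise assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(query, embeddings, k=10):
    """Return the k concepts most similar to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = [
        (cosine(q, vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Made-up toy vectors (assumption), chosen so related concepts point the same way.
toy_embeddings = {
    "bipolar disorder": [0.9, 0.1, 0.0],
    "depression": [0.8, 0.2, 0.1],
    "anxiety": [0.7, 0.3, 0.0],
    "chemotherapy": [0.0, 0.1, 0.9],
}

print(nearest_neighbors("bipolar disorder", toy_embeddings, k=2))
```

This sketch leaves the query concept out of its own neighbor list; whether the slide's table includes it is not clear from the surviving text.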
11. Project: Jointly learning medical code and concept representation
T. Bai, A. K. Chanda, B. L. Egleston, S. Vucetic, Joint learning of representations of medical
concepts and words from EHR data, in: IEEE International Conference on Bioinformatics and
Biomedicine, BIBM, 2017.
12. Problem
● EHRs contain structured data, such as diagnostic codes and laboratory tests,
as well as unstructured clinical notes.
● The joint skip-gram model learns medical code and word representations together.
● Following the skip-gram model, four types of (context, target) pairs are used for
learning representations: code to word, code to code, word to word, and word
to code.
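The four pair types can be illustrated with a small sketch. How contexts are defined is an assumption made here for illustration: words are paired within a sliding window over the note, while codes from a visit are treated as unordered and paired with every other code and every word of that visit. The function name and the toy visit are hypothetical, and the published model may define contexts differently.

```python
from collections import Counter

def joint_pairs(codes, words, window=2):
    """Return (pair_type, context, target) triples for one visit."""
    pairs = []
    # word to word: ordinary skip-gram window over the note text
    for t, target in enumerate(words):
        lo, hi = max(0, t - window), min(len(words), t + window + 1)
        pairs += [("word to word", words[i], target) for i in range(lo, hi) if i != t]
    # code to code: codes of a visit are unordered, so pair each with all others
    for context in codes:
        for target in codes:
            if context != target:
                pairs.append(("code to code", context, target))
    # code to word and word to code: every code is paired with every note word
    for code in codes:
        for word in words:
            pairs.append(("code to word", code, word))
            pairs.append(("word to code", word, code))
    return pairs

# Hypothetical visit: two codes from the slides plus a three-word note.
toy_pairs = joint_pairs(["1749", "96409"], ["breast", "cancer", "chemotherapy"], window=1)
toy_counts = Counter(pair_type for pair_type, _, _ in toy_pairs)
```

All four pair types can then be fed to a single skip-gram objective, which is what places codes and words in one shared vector space.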
14. Conclusions and future plans:
● Improving medical concept representations using a contextual model.
● Clinical BERT is a recent contextual model; skip-gram embeddings, by contrast, are context-free.
● The skip-gram model could be applied to other fields.
■ Web mining: treating a user’s web clicks as a bag of words.
■ Business transactions: a user’s item purchase history is also a kind of sequence.
■ Human activity: human activity logs also form a sequence.
15. References:
1. Halpern, Y., Horng, S., Choi, Y., Sontag, D.: Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association 23(4), 731–740 (2016)
2. Choi, E., Schuetz, A., Stewart, W.F., Sun, J.: Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv preprint arXiv:1602.03686 (2016)
3. Choi, E., Schuetz, A., Stewart, W.F., Sun, J.: Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24(2), 361–370 (2016)
4. A. B. Nattinger, P. W. Laud, R. Bajorunaite, R. A. Sparapani, and J. L. Freeman, “An algorithm for the use of Medicare claims data to identify women with incident breast cancer.” Health services research, 39(6p1):1733–1750, 2004.
5. D. Gotz, F. Wang, and A. Perer. A methodology for interactive mining and visual analysis of clinical event patterns using electronic health record data. Journal of biomedical informatics, 48:148–159, 2014
6. J. Krause, N. Razavian, E. Bertini, and D. Sontag. Visual exploration of temporal data in electronic medical records. In AMIA, 2015
7. Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543
8. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, CoRR abs/1301.3781. arXiv:1301.3781.
9. Ramos, J., 2003, December. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (Vol. 242, pp. 133-142).
10. P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res., 37:141–188, 2010. doi:10.1613/jair.2934
11. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. 2017 Dec;5:135-46.
12. X. Rong, word2vec parameter learning explained, CoRR abs/1411.2738. arXiv:1411.2738. URL http://arxiv.org/abs/1411.2738
13. T. Bai, A. K. Chanda, S. Vucetic, B. L. Egleston. "EHR phenotyping via jointly embedding medical concepts and words into a unified vector space". BMC Medical Informatics and Decision Making, BioMed Central, vol. 18, 2018.
14. S. Vucetic, A. K. Chanda, S. Zhang, T. Bai, A. Maiti. "Peer assessment of CS doctoral programs shows strong correlation with faculty citations". Communications of the ACM, vol. 61, pp. 70–76, 2018.
15. T. Bai, A. K. Chanda, S. Vucetic, B. L. Egleston. "Joint learning of representations of medical concepts and words from EHR data". IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 764–769, 2017.
16. Aronson, A. R., and Lang, F.-M. 2010. An overview of metamap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17(3):229–236.
17. Bodenreider, O. 2004. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research 32(suppl 1):D267–D270
18. Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific data 3:160035.
19. Mullenbach, J.; Wiegreffe, S.; Duke, J.; Sun, J.; and Eisenstein, J. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018
20. Pei, Jian, et al. "Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth." Proceedings 17th international conference on data engineering. IEEE, 2001.
21. D. J. Gligorijevic, J. Stojanovic, and Z. Obradovic, “Modeling healthcare quality via compact representations of electronic health records.” Transactions on Computational Biology and Bioinformatics, 2016