2. Presentation outline
o Medical Ontologies
o Linked Open Data
o Dataset generation
o Data Augmentation
o Text based classification
o Classification model
o Embeddings
o eXtreme scale classification
3. About 80% of Electronic Health Records are in unstructured format
Need for NLP tools for processing clinical text
Lack of multilingual terminology resources and domain-specific ontologies
The automatic processing of medical records and knowledge extraction from them is a task of public importance
4. Clinical text
HISTORY OF PRESENT ILLNESS :The patient is an 80 female with
a history of diastolic function and heart failure , hypertension and
rheumatoid arthritis who presents from an outside hospital with
presyncope.
5. Clinical text
OPERATIONS / PROCEDURES :Dobutamine stress test , cardiac
ultrasound , EGD , chest x-ray , PICC placement .The patient is a
62-year-old female with a history of diabetes mellitus ,
hypertension , COPD , hypercholesterolemia , depression and CHF
6. Clinical text
HISTORY OF PRESENT ILLNESS :The patient is a 63 year-old
woman transferred for evaluation of thrombotic thrombocytopenic
purpura and bronchiolitis obliterans organizing pneumonia .
7. Why is the task of concept normalization so important?
o Disambiguation
o Use of URIs
o Data integration
o Reasoning
o Similarity search
o Phenotypes
14. How to find training data?
o For 150,000 classes we need a huge training dataset
o Clinical data are not publicly available due to GDPR restrictions
o There are very few manually annotated datasets
o We need to rely only on publicly available sources:
− Other standard classifications and ontologies
− Open data
31. Medical Ontologies Mappings
o 1:1
o 1:N
o N:M
o No mappings
Source: https://library.ahima.org/doc?oid=106975#.YKOy_agzaHu
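The mapping types above can be illustrated with a minimal Python sketch. It labels each (source, target) pair by how many partners each side has. The codes and the function name `mapping_cardinality` are hypothetical and are not taken from any real ontology or from the authors' pipeline; the sketch also distinguishes N:1 from 1:N, which the slide does not list separately.

```python
from collections import defaultdict

def mapping_cardinality(pairs):
    """Label each (source, target) pair as 1:1, 1:N, N:1 or N:M."""
    targets_of = defaultdict(set)   # source code -> set of target codes
    sources_of = defaultdict(set)   # target code -> set of source codes
    for src, tgt in pairs:
        targets_of[src].add(tgt)
        sources_of[tgt].add(src)

    labels = {}
    for src, tgt in pairs:
        many_targets = len(targets_of[src]) > 1
        many_sources = len(sources_of[tgt]) > 1
        if many_targets and many_sources:
            labels[(src, tgt)] = "N:M"
        elif many_targets:
            labels[(src, tgt)] = "1:N"
        elif many_sources:
            labels[(src, tgt)] = "N:1"
        else:
            labels[(src, tgt)] = "1:1"
    return labels

def unmapped(all_sources, pairs):
    """Source codes that have no mapping at all."""
    mapped = {src for src, _ in pairs}
    return sorted(set(all_sources) - mapped)

# Hypothetical codes, for illustration only.
pairs = [("A1", "X1"), ("A2", "X2"), ("A2", "X3"),
         ("A3", "X4"), ("A4", "X4"), ("A4", "X5")]
labels = mapping_cardinality(pairs)
no_mapping = unmapped(["A1", "A2", "A3", "A4", "A5"], pairs)
```

Here `A5` ends up in `no_mapping`, which corresponds to the "No mappings" case.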
32. ExaMode dataset
Dataset version 1
• Summary:
– 22M+ data records
• 128K+ SNOMED codes
• 280K+ textual descriptions
- 17K+ undiscovered connections
33. Dataset Generation
o More data – more problems
o Data cleaning
o Unbalanced dataset
o Overrepresented vs underrepresented classes
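One simple way to spot over- and underrepresented classes is to count samples per label. The sketch below is only an illustration: the function name `class_balance`, the threshold `min_count`, and the toy labels are assumptions, not part of the ExaMode pipeline.

```python
from collections import Counter

def class_balance(labels, min_count):
    """Return per-class counts, the classes below min_count,
    and the ratio between the largest and the smallest class."""
    counts = Counter(labels)
    underrepresented = sorted(c for c, n in counts.items() if n < min_count)
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return counts, underrepresented, imbalance_ratio

# Toy labels, for illustration only.
toy_labels = ["a"] * 6 + ["b"] * 3 + ["c"]
counts, underrepresented, ratio = class_balance(toy_labels, min_count=3)
```

Classes returned in `underrepresented` are natural candidates for the data augmentation discussed later in the talk.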
35. Data Augmentation
o Originally devised to enlarge datasets
− Image datasets for neural network training
o Popular techniques:
− Flip
− Rotation
36. Data Augmentation
o Popular techniques:
− Scale
− Crop
− Translate
− Pixel/Region change (fill with constant)
− Pixel/Region swap
− ….
37. Types of data augmentation that are applicable for textual data
o Swap random letters within a single word
o Swap random words within a text
o Replace a word with its synonym
o Delete random letter within a single word
o Replace a random letter with a letter close to it on the keyboard
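The five techniques listed above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the ExaMode dataset: the function names, the tiny `KEY_NEIGHBOURS` map, and the synonym dictionary passed to `replace_synonym` are all assumptions made for the example.

```python
import random

# Tiny illustrative QWERTY neighbour map, not a complete keyboard layout.
KEY_NEIGHBOURS = {"a": "qsz", "e": "wrd", "i": "uok",
                  "o": "ipl", "n": "bm", "t": "ry"}

def swap_letters(word, rng):
    """Swap two randomly chosen letters within a single word."""
    if len(word) < 2:
        return word
    i, j = rng.sample(range(len(word)), 2)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def swap_words(text, rng):
    """Swap two randomly chosen words within a text."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def replace_synonym(text, synonyms, rng):
    """Replace one random word that has an entry in the synonym dictionary."""
    words = text.split()
    candidates = [k for k, w in enumerate(words) if w in synonyms]
    if not candidates:
        return text
    k = rng.choice(candidates)
    words[k] = rng.choice(synonyms[words[k]])
    return " ".join(words)

def delete_letter(word, rng):
    """Delete one random letter from a word."""
    if len(word) < 2:
        return word
    k = rng.randrange(len(word))
    return word[:k] + word[k + 1:]

def keyboard_typo(word, rng):
    """Replace one random letter with a neighbouring key, if one is known."""
    candidates = [k for k, ch in enumerate(word) if ch in KEY_NEIGHBOURS]
    if not candidates:
        return word
    k = rng.choice(candidates)
    return word[:k] + rng.choice(KEY_NEIGHBOURS[word[k]]) + word[k + 1:]

rng = random.Random(0)  # fixed seed so repeated runs give the same variants
print(swap_letters("hypertension", rng))
print(swap_words("history of heart failure", rng))
print(replace_synonym("heart failure", {"heart": ["cardiac"]}, rng))
print(delete_letter("diabetes", rng))
print(keyboard_typo("patient", rng))
```

Each function returns a single noisy variant; calling them repeatedly on the descriptions of a rare class yields additional training samples for that class.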
38. ExaMode dataset
Dataset version 2
• Remove noise
• Additional data augmentations
• Additional heuristics
• Additional data cleaning
• Split the dataset into 3 subgroups:
– Disorders
– Procedures
– Findings
42. Binary classification
o Each sample takes exactly 1 label out of 2 classes
Review                             | Sentiment
Delivered as expected              | Positive
Good quality                       | Positive
There are scratches on the surface | Negative
Works great                        | Positive
I do not recommend it              | Negative
43. Multiclass classification
o Each sample takes exactly 1 label out of several classes
Movie             | Rating
Palmer            | 7
Bad Trip          | 6
Godzilla vs. Kong | 6
Band of Brothers  | 9
Big fish          | 8
44. Multilabel classification
o Each sample takes one or more labels out of several classes
Movie             | Drama | Comedy | Action | Sci-Fi | War | Adventure | Fantasy
Palmer            | 1     | 0      | 0      | 0      | 0   | 0         | 0
Bad Trip          | 0     | 1      | 0      | 0      | 0   | 0         | 0
Godzilla vs. Kong | 0     | 0      | 1      | 1      | 0   | 0         | 0
Band of Brothers  | 1     | 0      | 1      | 0      | 1   | 0         | 0
Big fish          | 1     | 0      | 0      | 0      | 0   | 1         | 1
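The multilabel table above is a multi-hot encoding: one 0/1 column per class. A minimal Python sketch, using the movie data from the slide, shows how such rows could be produced from sets of labels. The helper name `multi_hot` is an assumption made for this example.

```python
GENRES = ["Drama", "Comedy", "Action", "Sci-Fi", "War", "Adventure", "Fantasy"]

def multi_hot(labels, classes):
    """Encode a set of labels as a 0/1 vector with one position per class."""
    return [1 if c in labels else 0 for c in classes]

movies = {
    "Palmer": {"Drama"},
    "Bad Trip": {"Comedy"},
    "Godzilla vs. Kong": {"Action", "Sci-Fi"},
    "Band of Brothers": {"Drama", "Action", "War"},
    "Big fish": {"Drama", "Adventure", "Fantasy"},
}

encoded = {title: multi_hot(genres, GENRES) for title, genres in movies.items()}
for title, row in encoded.items():
    print(f"{title:<18} {row}")
```

In the binary and multiclass settings every row would contain exactly one 1; in the multilabel setting a row may contain several.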
46. Classification model
o BERT (Bidirectional Encoder Representations from Transformers)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
48. Classification model
o BERT core idea
Source: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
50. Classification model
o BERT advantages
− State-of-the-art results on many NLP benchmarks at release
− Open source
− Easy to further pretrain on a small amount of medical data
51. Classification model
o BERT pretrained models:
− BioBERT
− Multilingual BERT
− Slavic BERT
− ClinicalBERT
− PubMedBERT
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.
Mikhail Arkhipov, Maria Trofimova, Yurii Kuratov, and Alexey Sorokin. Tuning multilingual transformers for language-specific named entity recognition. 2019.
Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. Publicly available clinical BERT embeddings. In ClinicalNLP workshop at NAACL, 2019.
Yu Gu, et al. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020.
53. Embeddings
o Student: [2, 7]
o School: [3, 6]
o University: [1, 5]
o Dog: [6, 2.5]
o Cat: [5, 2]
o Fish: [7.5, 1]
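With the toy 2-D vectors above, related words lie close together. The following Python sketch finds the nearest neighbour of a word by Euclidean distance. The helper name `nearest` and the choice of Euclidean distance are assumptions for illustration; real embeddings have hundreds of dimensions and are often compared with cosine similarity instead.

```python
import math

# Toy 2-D embeddings taken from the slide.
EMBEDDINGS = {
    "Student": [2, 7],
    "School": [3, 6],
    "University": [1, 5],
    "Dog": [6, 2.5],
    "Cat": [5, 2],
    "Fish": [7.5, 1],
}

def nearest(word, embeddings):
    """Return the other word whose vector is closest to the given word."""
    return min(
        (other for other in embeddings if other != word),
        key=lambda other: math.dist(embeddings[word], embeddings[other]),
    )

for word in ("Student", "Dog", "Fish"):
    print(word, "->", nearest(word, EMBEDDINGS))
```

Education-related words end up near each other, and so do the animals, which is the property that makes embeddings useful for similarity search.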
54. Embeddings
o Deep learning embeddings
Figure is based on: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
62. Acknowledgements
o Alexander Tahchiev
o Andrey Avramov
o Hristo Papazov
o Pavlin Gyurov
o Todor Primov
o Stanislav Slavkov
https://www.datasciencesociety.net/
https://www.ontotext.com
63. Thank you!
See Ontotext Platform demos
Star Wars API: https://swapi-platform.ontotext.com/graphiql/
Platform monitoring: https://test-platform.ontotext.com/grafana/