Extreme-scale text-based classification of medical data
Anton Hristov & Svetla Boytcheva
18 May 2021
making sense of text and data
Presentation outline
o Medical Ontologies
o Linked Open Data
o Dataset generation
o Data Augmentation
o Text based classification
o Classification model
o Embeddings
o eXtreme scale classification
About 80% of
Electronic Health
Records are in
unstructured format
Need for NLP tools for
processing clinical text
Lack of multilingual
terminology
resources and
domain specific
ontologies
Automatic processing of medical records and knowledge extraction
from them is a task of public importance
Clinical text
HISTORY OF PRESENT ILLNESS :The patient is an 80 female with
a history of diastolic function and heart failure , hypertension and
rheumatoid arthritis who presents from an outside hospital with
presyncope.
Clinical text
OPERATIONS / PROCEDURES :Dobutamine stress test , cardiac
ultrasound , EGD , chest x-ray , PICC placement .The patient is a
62-year-old female with a history of diabetes mellitus ,
hypertension , COPD , hypercholesterolemia , depression and CHF
Clinical text
HISTORY OF PRESENT ILLNESS :The patient is a 63 year-old
woman transferred for evaluation of thrombotic thrombocytopenic
purpura and bronchiolitis obliterans organizing pneumonia .
Why is the task of concept normalization
so important?
o Disambiguation
o Usage of URI
o Data integration
o Reasoning
o Similarity search
o Phenotypes
Text-based classification
a process of assigning tags or categories to text
according to its content.
Standard Classification & Ontologies
SNOMED CT
SNOMED CT
Objective
To develop methods for automatic association of
SNOMED CT codes with textual descriptions of
diagnoses
How to find training data?
o For 150,000 classes we need a huge training dataset
o Clinical data are not publicly available due to GDPR issues
o There are very few manually annotated datasets
o We need to rely only on publicly available sources:
− Other standard classifications and ontologies
− Open data
ICD-10-CM
ICD-11
DOID
Linked Open Data
https://w.wiki/3Lyc
https://w.wiki/3Lyh
https://w.wiki/3Lys
Medical Ontologies Mappings
o 1:1
o 1:N
o N:M
o No mappings
Source: https://library.ahima.org/doc?oid=106975#.YKOy_agzaHu
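The four mapping situations listed on this slide can be told apart mechanically once the mapping pairs are available. The following is a minimal sketch under stated assumptions: the codes are hypothetical placeholders (not real ICD-10 or SNOMED CT identifiers), and `mapping_kind` is an illustrative helper, not part of any ontology toolkit.

```python
# Hypothetical placeholder codes; these are NOT real ICD-10 or SNOMED CT mappings.
PAIRS = [
    ("ICD:A", "SCT:1"),                       # one source, one target
    ("ICD:B", "SCT:2"), ("ICD:B", "SCT:3"),   # one source, several targets
    ("ICD:C", "SCT:4"), ("ICD:C", "SCT:5"),   # targets shared with ICD:D
    ("ICD:D", "SCT:4"), ("ICD:D", "SCT:5"),
]
SOURCE_CODES = ["ICD:A", "ICD:B", "ICD:C", "ICD:D", "ICD:E"]


def mapping_kind(code, pairs):
    """Classify how one source code maps into the target ontology."""
    targets = {t for s, t in pairs if s == code}
    if not targets:
        return "no mapping"
    # A target that is also reached from another source code makes the
    # relation many-to-many (this sketch also reports N:1 as N:M).
    shared = any(s != code and t in targets for s, t in pairs)
    if shared:
        return "N:M"
    return "1:1" if len(targets) == 1 else "1:N"


kinds = {code: mapping_kind(code, PAIRS) for code in SOURCE_CODES}
for code, kind in kinds.items():
    print(code, kind)
```

Codes that end up as "no mapping" are the ones for which training text cannot be borrowed from the other ontology.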
ExaMode dataset
Dataset version 1
• Summary:
– 22M+ data records
• 128K+ SNOMED codes
• 280K+ textual descriptions
- 17K+ undiscovered connections
Dataset Generation
o More data – more problems
o Data cleaning
o Unbalanced dataset
o Overrepresented vs underrepresented classes
Data Augmentation
o Originally introduced for dataset enlargement
− Image datasets for neural network training
o Popular techniques:
− Flip
− Rotation
Data Augmentation
o Popular techniques:
− Scale
− Crop
− Translate
− Pixel/Region change (fill with constant)
− Pixel/Region swap
− ….
Types of data augmentation that are applicable
for textual data
o Swap random letters within a single word
o Swap random words within a text
o Replace a word with its synonym
o Delete random letter within a single word
o Replace a random letter with a letter close to it on the keyboard
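The five text augmentations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the synonym and keyboard-neighbour tables are tiny hypothetical stand-ins, the letter swap is restricted to adjacent letters, and none of the function names come from the original pipeline.

```python
import random

# Hypothetical, tiny lookup tables used only for illustration.
SYNONYMS = {"heart": ["cardiac"], "failure": ["insufficiency"]}
KEYBOARD_NEIGHBOURS = {"a": "qsz", "e": "wrd", "o": "ipl", "t": "ryg"}


def swap_letters(word, rng):
    """Swap two adjacent letters at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def swap_words(text, rng):
    """Swap two randomly chosen words of the text."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def replace_synonym(text, rng):
    """Replace one word that has an entry in SYNONYMS."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not candidates:
        return text
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i]])
    return " ".join(words)


def delete_letter(word, rng):
    """Delete one random letter."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def keyboard_typo(word, rng):
    """Replace one letter with a neighbouring key from the lookup table."""
    candidates = [i for i, c in enumerate(word) if c in KEYBOARD_NEIGHBOURS]
    if not candidates:
        return word
    i = rng.choice(candidates)
    return word[:i] + rng.choice(KEYBOARD_NEIGHBOURS[word[i]]) + word[i + 1:]


rng = random.Random(0)  # fixed seed so runs are repeatable
print(swap_letters("hypertension", rng))
print(swap_words("history of heart failure", rng))
print(replace_synonym("history of heart failure", rng))
print(delete_letter("hypertension", rng))
print(keyboard_typo("heart", rng))
```

Each augmented string keeps the label of the description it was derived from, which is how underrepresented classes can be given more training samples.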
ExaMode dataset
Dataset version 2
• Remove noise
• Additional data augmentations
• Additional heuristics
• Additional data cleaning
• Split the dataset into 3 subgroups:
– Disorders
– Procedures
– Findings
ExaMode dataset
Dataset version 2
Summary:
– Disorders: ~105K SNOMED codes
– Procedures: ~67K SNOMED codes
– Findings: ~70K SNOMED codes
Text based classification
o Binary classification
o Multiclass classification
o Multilabel classification
Binary classification
o Each sample takes exactly 1 label out of 2 classes
Review                              Sentiment
Delivered as expected               Positive
Good quality                        Positive
There are scratches on the surface  Negative
Works great                         Positive
I do not recommend it               Negative
Multiclass classification
o Each sample takes exactly 1 label out of a number of classes
Movie              Rating
Palmer             7
Bad Trip           6
Godzilla vs. Kong  6
Band of Brothers   9
Big fish           8
Multilabel classification
o Each sample takes one or more labels out of a number of classes
Movie              Drama  Comedy  Action  Sci-Fi  War  Adventure  Fantasy
Palmer             1      0       0       0       0    0          0
Bad Trip           0      1       0       0       0    0          0
Godzilla vs. Kong  0      0       1       1       0    0          0
Band of Brothers   1      0       1       0       1    0          0
Big fish           1      0       0       0       0    1          1
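The 0/1 table on this slide is a multi-hot encoding of each movie's genre set. A minimal sketch of how such target vectors can be produced is shown here; `multi_hot` is an illustrative helper name, and the genre lists are read off the table.

```python
GENRES = ["Drama", "Comedy", "Action", "Sci-Fi", "War", "Adventure", "Fantasy"]


def multi_hot(labels, classes):
    """Return a 0/1 vector with one position per class."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in classes]


movies = {
    "Palmer": ["Drama"],
    "Bad Trip": ["Comedy"],
    "Godzilla vs. Kong": ["Action", "Sci-Fi"],
    "Band of Brothers": ["Drama", "Action", "War"],
    "Big fish": ["Drama", "Adventure", "Fantasy"],
}

targets = {title: multi_hot(genres, GENRES) for title, genres in movies.items()}
for title, vector in targets.items():
    print(f"{title:18} {vector}")
```

In the binary and multiclass cases every row would contain exactly one 1, so the target can be stored as a single class index instead of a vector.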
Classification model
o BERT (Bidirectional Encoder Representations from
Transformers)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805, 2018.
Classification model
o Why was BERT created?
o Big gap in the data
Classification model
o BERT core idea
Source: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
Classification model
o BERT used for classification
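In the usual fine-tuning setup described in the BERT paper, a single linear layer followed by a softmax is placed on top of the final hidden vector of the [CLS] token. The sketch below shows only that classification head in plain Python. All numbers are made up, the hidden size is 4 instead of the 768 used by BERT-base, and the encoder itself is assumed to have already produced `cls_vector`.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Made-up values: 4-dimensional [CLS] vector, 3 classes.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.0, 0.0]

probs = classify(cls_vector, weights, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

With 100K+ SNOMED codes the weight matrix would have one row per code, which is one reason the later slides turn to label clustering.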
Classification model
o BERT advantages
o Incredible performance
o Open source
o Easy to pretrain with a small amount of medical data
Classification model
o BERT pretrained models:
o BioBERT
o Multilingual BERT
o SlavicBERT
o ClinicalBERT
o PubMedBERT
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of NAACL, 2019.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical
language representation model for biomedical text mining. Bioinformatics, 2019.
Mikhail Arkhipov, Maria Trofimova, Yurii Kuratov, and Alexey Sorokin. Tuning multilingual transformers for language-specific named entity recognition. 2019.
Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. Publicly available clinical BERT
embeddings. In ClinicalNLP workshop at NAACL, 2019.
Yu Gu et al. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020.
Embeddings
o Student: [2, 7]
o School: [3, 6]
o University: [1, 5]
o Dog: [6, 2.5]
o Cat: [5, 2]
o Fish: [7.5, 1]
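The point of these toy vectors is that related words lie close together. A short sketch using the coordinates from this slide and Euclidean distance makes that concrete; the choice of distance measure is an assumption, and cosine similarity is a common alternative.

```python
import math

# Toy 2-D embeddings taken from the slide.
EMBEDDINGS = {
    "Student": [2, 7],
    "School": [3, 6],
    "University": [1, 5],
    "Dog": [6, 2.5],
    "Cat": [5, 2],
    "Fish": [7.5, 1],
}


def nearest(word):
    """Return the other word whose vector is closest to `word`."""
    return min(
        (w for w in EMBEDDINGS if w != word),
        key=lambda w: math.dist(EMBEDDINGS[word], EMBEDDINGS[w]),
    )


for word in EMBEDDINGS:
    print(word, "->", nearest(word))
```

Every education word finds another education word as its nearest neighbour, and every animal finds another animal.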
Embeddings
o Deep learning embeddings
Figure is based on: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
eXtreme scale classification
o Labels clustering
o Dataset with 10K+ classes
Labels clustering
o Labels embeddings
o Embeddings clustering
Labels embeddings
Embeddings clustering
o Clustering algorithms:
o Agglomerative clustering
o DBSCAN
o K-Means
o Mean Shift
o Spectral Clustering
o ...
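To make the label clustering step concrete, here is a minimal K-Means sketch in plain Python. It is an illustration only: K-Means is just one of the algorithms listed, the toy 2-D vectors from the Embeddings slide stand in for real label embeddings, and the initial centroids are fixed by hand so the run is deterministic.

```python
import math

LABELS = ["Student", "School", "University", "Dog", "Cat", "Fish"]
POINTS = [[2, 7], [3, 6], [1, 5], [6, 2.5], [5, 2], [7.5, 1]]


def kmeans(points, centroids, iterations=10):
    """Plain K-Means: assign each point to its nearest centroid, then re-average."""
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append([sum(axis) / len(members) for axis in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return assignment, centroids


# Hand-picked starting centroids (an assumption made for repeatability).
assignment, centroids = kmeans(POINTS, [[2, 7], [6, 2.5]])
for label, cluster in zip(LABELS, assignment):
    print(label, cluster)
```

After clustering, a classifier only has to choose among a small number of clusters rather than among all labels at once.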
Refinement
o Possible solutions:
o Classical shallow ANN
o Deep learning approach
o Binary classifiers for every label
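One way to picture refinement is as a second stage that runs only inside the cluster selected by the first stage. The sketch below uses nearest-neighbour lookup as a simple stand-in for the refinement models listed above (it is not a shallow ANN, a deep model, or per-label binary classifiers). The cluster names and the reuse of the toy 2-D vectors are assumptions made for illustration.

```python
import math

# Hypothetical clusters built from the toy embeddings used earlier in the deck.
CLUSTERS = {
    "education": {"Student": [2, 7], "School": [3, 6], "University": [1, 5]},
    "animals": {"Dog": [6, 2.5], "Cat": [5, 2], "Fish": [7.5, 1]},
}


def centroid(vectors):
    """Component-wise mean of a collection of vectors."""
    vectors = list(vectors)
    return [sum(axis) / len(vectors) for axis in zip(*vectors)]


CENTROIDS = {name: centroid(members.values()) for name, members in CLUSTERS.items()}


def predict(query):
    """Stage 1: pick the nearest cluster. Stage 2: pick the nearest label inside it."""
    cluster = min(CENTROIDS, key=lambda name: math.dist(query, CENTROIDS[name]))
    members = CLUSTERS[cluster]
    label = min(members, key=lambda name: math.dist(query, members[name]))
    return cluster, label


print(predict([5.2, 2.1]))
print(predict([1.2, 5.5]))
```

The second stage compares the query with three labels instead of six; with 10K+ classes the same idea keeps each refinement model small.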
Acknowledgements
o Alexander Tahchiev
o Andrey Avramov
o Hristo Papazov
o Pavlin Gyurov
o Todor Primov
o Stanislav Slavkov
https://www.datasciencesociety.net/
https://www.ontotext.com
Thank you!
See Ontotext Platform demos
Star Wars API: https://swapi-platform.ontotext.com/graphiql/
Platform monitoring: https://test-platform.ontotext.com/grafana/

DSS Ontotext Webinar - ExaMode: Extreme-scale text-based classification of medical data
