Exploiting biomedical literature to mine out a large
multimodal dataset of rare cancer studies
Anjani K. Dhrangadhariya et al.
MedGIFT group
University of Applied Sciences Western Switzerland (HES-SO)
Project supported by European Union
Horizon 2020 grant agreement 825292
SPIE Medical Imaging 2020, 16.02.2020
Motivation
> Rare cancers: fewer than 15 cases per 100,000 people per year
> Account for 25% of cancer-related deaths
> Lower prevalence = fewer patients
> Fewer tumor samples for research
> Lack of robust clinical models
Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature communications 9.1 (2018): 1-10.
Data resource
• Challenges
1) Private datasets
2) Limited size
3) Single center / scanner
4) Small variability
5) Some contain only images / only text
6) No or small subsets of manual annotations
7) Difficult to compare results
Medline/PubMed
[Diagram: PubMed / Medline, PubMed Central, and PubMed Central Open-Access (PMC-OA)]
https://www.nlm.nih.gov/bsd/difference.html
Figures shown in the diagram (2019): 30 million articles; ~80 million images; 5.9 million full texts; 2.09 million full texts; 6.73 million images
Rare cancer image harvesting through automated knowledge aggregation and data mining approaches?
Individual record
• Medical Subject Headings (MeSH)
• Title + Abstract
• Images
Medical Subject Headings (MeSH)
• Hierarchically organized controlled vocabulary
• Cataloguing biomedical information
• 16 thematic categories
  • A = Anatomy
  • B = Organism…
• Each term has a unique MeSH identifier
[Figure: example of a MeSH term with its MeSH code]
Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265.
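The hierarchy can be read directly off a MeSH tree number: the first letter gives the thematic category and each dot-separated segment adds one level. A minimal Python sketch, assuming tree numbers in the dotted form used later in these slides; the helper names are illustrative and not part of any MeSH tooling:

```python
def category(tree_number: str) -> str:
    """Return the top-level category letter, e.g. 'B' for the organism branch."""
    return tree_number[0]


def ancestors(tree_number: str) -> list[str]:
    """Return all proper ancestors of a tree number, ordered from root to parent."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def is_descendant(tree_number: str, root: str) -> bool:
    """True if tree_number equals root or lies beneath it in the hierarchy."""
    return tree_number == root or tree_number.startswith(root + ".")


if __name__ == "__main__":
    print(category("B01.050.150"))
    print(ancestors("B01.050.150"))
    print(is_descendant("C04.588", "C04"))
```

Matching on `root + "."` rather than on the bare prefix avoids treating an unrelated code that merely starts with the same characters as a descendant.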
MeSH as annotation
• Manually annotated by National Library of Medicine (NLM) staff
• For example, all studies about benign cancer are indexed under the MeSH annotation “Neoplasms”
• Ground-truth annotation
• Not all PMC / PMC-OA articles have MeSH annotations
Visual classification
• ImageCLEF medical image annotation challenge (since 2013)
• Small annotated subset of PMC-OA used to train CNNs
• Classify into 31 modalities: PET, light microscopy, CT, etc.
• State of the art: superficial modality classification
• 2000 annotated PMC-OA; 90% accuracy
“Deep Multimodal Classification of Image Types in Biomedical Journal Figures”, Andrearczyk and Müller, CLEF 2018
Pipeline
1. PMC-OA all images
2. Getting DLMI images (DLMI = Diagnostic Light Microscopy Images)
3. Getting “human” images
4. Getting “neoplastic” images
5. Getting “rare cancer” images
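The pipeline is a funnel: each stage keeps only the records that pass its filter and hands them to the next stage. A minimal Python sketch of that structure follows; the toy records and their boolean fields are hypothetical stand-ins for the classifier and rule outputs described in later slides.

```python
def run_pipeline(records, stages):
    """Apply each (name, predicate) stage in order; return survivors and sizes."""
    survivors = list(records)
    sizes = [("PMC-OA all images", len(survivors))]
    for name, keep in stages:
        survivors = [r for r in survivors if keep(r)]
        sizes.append((name, len(survivors)))
    return survivors, sizes


# Hypothetical toy records; in practice each flag would come from a model or rule.
records = [
    {"id": 1, "dlmi": True, "human": True, "neoplastic": True, "rare": True},
    {"id": 2, "dlmi": True, "human": True, "neoplastic": True, "rare": False},
    {"id": 3, "dlmi": True, "human": False, "neoplastic": True, "rare": True},
    {"id": 4, "dlmi": False, "human": True, "neoplastic": True, "rare": True},
]

stages = [
    ("DLMI", lambda r: r["dlmi"]),
    ("human", lambda r: r["human"]),
    ("neoplastic", lambda r: r["neoplastic"]),
    ("rare cancer", lambda r: r["rare"]),
]

survivors, sizes = run_pipeline(records, stages)

if __name__ == "__main__":
    for name, n in sizes:
        print(f"{name}: {n}")
```

Because the stages are applied in sequence, an error at an early stage removes a record from every later stage, so the order of the filters matters.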
Pipeline: visual vs. textual
• Visual: images, labeled with MeSH
• Textual: Title + Abstract, labeled with MeSH
Visual approach: CNNs
Model training and evaluation
• VGG19
• ImageNet weights
• With and without image augmentation
[Diagram: images labeled MeSH_1 / MeSH_0 train the model; the trained model then assigns MeSH_1 / MeSH_0 to images with no MeSH]
Textual approach
[Diagram: Title + Abstract records labeled MeSH_0 / MeSH_1 are used for model training & evaluation; the best performing model then assigns MeSH_0 / MeSH_1 to Title + Abstract records with no MeSH]
Getting “human” images
Data: DLMI articles with Title + Abstract and {MeSH}
Labels
• human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} & other B01 codes ∉ {MeSH}
• not human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH}
Model training and evaluation
Classifiers
1. Logistic regression
2. Support Vector Machine
3. K-nearest neighbor
Text features
1. Tf-idf
2. Word vectors
3. Paragraph vectors
Split: 80% training set, 20% test set
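The “human” labeling rule can be sketched in a few lines of Python. The tree number is the one given on the slide; the treatment of records that carry the human code together with other B01 codes (left unlabeled) and of records with no MeSH (also left unlabeled, to be predicted later) is an assumption, as is the function name.

```python
# Tree number for the human class, as given on the slide.
HUMAN = "B01.050.150.900.649.313.988.400.112.400.400"


def human_label(mesh_codes):
    """Return 'human', 'not human', or None for a record's MeSH tree numbers."""
    codes = set(mesh_codes)
    if not codes:
        # Assumption: records with no MeSH get no rule-based label.
        return None
    if HUMAN not in codes:
        return "not human"
    other_b01 = {c for c in codes if c.startswith("B01") and c != HUMAN}
    # The slide defines 'human' only when no other B01 code is present;
    # mixed records are left unlabeled here, which is an assumption.
    return None if other_b01 else "human"


if __name__ == "__main__":
    # The second and third codes below are illustrative examples only.
    print(human_label({HUMAN, "C04.588"}))
    print(human_label({"B01.050.150.900.649.313.992"}))
```

Excluding mixed records keeps the training labels clean at the cost of discarding some annotated articles.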
Getting “human” images (continued)
• Best performing model, hyper-params and vectors: SVM, tf-idf bigrams
• The best model is applied to Title + Abstract of DLMI articles with no MeSH to predict human / not human
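To make “tf-idf bigrams” concrete, the sketch below builds bigram features from raw text in plain Python. It uses the simple tf × log(N/df) weighting and whitespace tokenization, both of which are assumptions; the slides do not state which tf-idf variant, tokenizer, or library was used.

```python
import math
from collections import Counter


def bigrams(text):
    """Split on whitespace and return adjacent word pairs as strings."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf(docs):
    """Return one {bigram: weight} dict per document."""
    counts = [Counter(bigrams(d)) for d in docs]
    # Document frequency: in how many documents each bigram occurs.
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append(
            {t: (k / total) * math.log(n / df[t]) for t, k in c.items()}
        )
    return vectors


docs = ["human lung tissue", "mouse lung tissue"]
vectors = tfidf(docs)

if __name__ == "__main__":
    for doc, vec in zip(docs, vectors):
        print(doc, vec)
```

With this weighting, a bigram that occurs in every document gets weight zero, so only the bigrams that distinguish documents carry signal into the classifier.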
Getting “neoplastic” images
Data: DLMI, human articles with Title + Abstract and {MeSH}
Labels
• neoplastic ⇔ C04 ∈ {MeSH}
• not neoplastic ⇔ C04 ∉ {MeSH}
Model training and evaluation
Classifiers
1. Logistic regression
2. Support Vector Machine
3. K-nearest neighbor
Text features
1. Tf-idf
2. Word vectors
3. Paragraph vectors
Split: 80% training set, 20% test set
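The “neoplastic” rule can be sketched the same way. This version assumes that “C04 ∈ {MeSH}” means any tree number in the C04 branch (C04 itself or anything beneath it) and that records with no MeSH are left unlabeled; the function name and example codes are illustrative.

```python
def neoplastic_label(mesh_codes, root="C04"):
    """Return 'neoplastic', 'not neoplastic', or None for a set of tree numbers."""
    codes = set(mesh_codes)
    if not codes:
        # Assumption: records with no MeSH get no rule-based label.
        return None
    in_branch = any(c == root or c.startswith(root + ".") for c in codes)
    return "neoplastic" if in_branch else "not neoplastic"


if __name__ == "__main__":
    print(neoplastic_label({"C04.588", "B01.050"}))
    print(neoplastic_label({"A01.236"}))
```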
Getting “non-neoplastic” images
• Best performing model, hyper-params and vectors: SVM, tf-idf bigrams
• The best model is applied to Title + Abstract of DLMI, human articles with no MeSH to predict neoplastic / not neoplastic
Getting “rare cancer” images
• No MeSH terms for “rare” cancer class
• Set of {rare cancer} terms by National Center for Advancing Translational Sciences (NCATS)
https://rarediseases.info.nih.gov/diseases/diseases-by-category/1
• rare cancer ⇔ (Title + Abstract) ∩ {rare cancer} ≠ Ø
• non-rare cancer otherwise
[Diagram: the rule is applied to DLMI, human, neoplastic records, both those with {MeSH} and those with no MeSH]
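The term-intersection rule amounts to checking whether any term from the rare cancer list occurs in the title or abstract. A minimal Python sketch follows, assuming case-insensitive whole-word matching; the two terms in `RARE_TERMS` are placeholders for illustration and not the NCATS list itself.

```python
import re

# Placeholder terms for illustration; the real list would come from NCATS.
RARE_TERMS = {"chordoma", "thymoma"}


def is_rare_cancer(title, abstract, rare_terms=RARE_TERMS):
    """True if any rare cancer term occurs as a whole word in title + abstract."""
    text = f"{title} {abstract}".lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
        for term in rare_terms
    )


if __name__ == "__main__":
    print(is_rare_cancer("A case of chordoma", "Histology of the sacral mass."))
    print(is_rare_cancer("Breast carcinoma histology", "H&E staining."))
```

Whole-word matching misses plurals and spelling variants, so a real term list would need normalization or variant expansion.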
Visual: “rare cancer”
Model training and evaluation
• VGG19
• ImageNet weights
• With and without image augmentation
[Diagram: images labeled rare cancer / non-rare cancer train the model; the trained model then assigns rare cancer / non-rare cancer to images with no label]
Results
“human” vs. “non-human” classification
Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.69      | 0.71   | 0.68
Textual   | SVM        | Tf-idf trigrams        | 0.89      | 0.90   | 0.90
“neoplastic” vs. “non-neoplastic” classification
Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.68      | 0.65   | 0.64
Textual   | SVM        | Tf-idf bigrams         | 0.99      | 0.99   | 0.99
“rare cancer” vs. “non-rare cancer” classification
Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.62      | 0.77   | 0.69
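For a single class, the F1-score is the harmonic mean of precision and recall. The Python sketch below recomputes it from the reported precision and recall. The slides do not say how the scores were averaged across classes, so it is an assumption that any gap between the recomputed and reported values comes from class-wise averaging.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the results tables.
rows = {
    "human / visual": (0.69, 0.71, 0.68),
    "human / textual": (0.89, 0.90, 0.90),
    "neoplastic / visual": (0.68, 0.65, 0.64),
    "neoplastic / textual": (0.99, 0.99, 0.99),
    "rare cancer / visual": (0.62, 0.77, 0.69),
}

if __name__ == "__main__":
    for name, (p, r, reported) in rows.items():
        print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```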
Discussion: Textual vs. Visual
Textual approach
• Outperformed the visual approach on every task where both were evaluated
• Tf-idf n-grams with SVM performed best for both tasks
Visual approach
• Correctly classified some “human” test instances, with a recall of 0.71
• Worse performance for “neoplastic” identification
• “rare cancer” classification had a recall of 0.77
Conclusion
• First study targeting automatic rare cancer image extraction
• The approach relies on visual deep learning and textual NLP
• 15,028 light microscopy (DLMI), human, rare cancer images + corresponding journal articles
Thank you for your attention
More information:
http://medgift.hevs.ch
Contact:
anjani.dhrangadhariya@hevs.ch
Follow us:
https://twitter.com/MedGIFT_group

Exploiting biomedical literature to mine out a large multimodal dataset of rare cancers

  • 1. Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies Anjani K. Dhrangadhariya et al. MedGIFT group University of Applied Sciences Western Switzerland (HES-SO) Project supported by European Union Horizon 2020 grant agreement 825292 SPIE Medical Imaging 2020, 16.02.2020
  • 2. Motivation > Rare cancers = 15 out of 100,000 / year > Account for 25% cancer-related deaths > Lower prevalence = fewer patients > Less tumor samples for research > Lack of robust clinical models Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature communications 9.1 (2018): 1-10. 2
  • 3. Data resource • Challenges 1) Private datasets 2) Limited size 3) Single center / scanner 4) Small variability 5) Some contain only images / only text 6) No or small subsets of manual annotations 7) Difficult to compare results 3
  • 4. Medline/PubMed PubMed / Medline PubMed Central PubMed Central Open- Access (PMC-OA) https://www.nlm.nih.gov/bsd/difference.html 30 million articles ~ 80 million images 5.9 million full texts 2.09 million full texts 6.73 million images 4 Rare cancer image harvesting through automated knowledge aggregation and data mining approaches? 2019
  • 5. Individual record Medical Subject Headings (MeSH) Title + Abstract Images 1 2 3 5 ✓ ✓ ✓ ✓
  • 6. Medical Subject Headings (MeSH) • Hierarchically organized Controlled Vocabulary • Cataloguing biomedical information • 16 thematic categories • A = Anatomy • B = Organism… • Each term has a unique MeSH Identifier MeSH term MeSH code Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265. 6
  • 7. MeSH as annotation • Manually annotated by National library of Medicine (NLM) staff • For e.g., All the studies about benign cancer are indexed under MeSH annotation “Neoplasm” • Groundtruth annotation • Not all PMC / PMCOA have annotations 7
  • 8. Visual classification • ImageCLEF medical image annotation challenge (since 2013) • Small subset of annotated PMC-OA > train CNNs • Classify into 31 modalities - PET, light microscopy, CT, etc. • State of the art: Superficial modality classification 8 Deep Multimodal Classification of Image Types in Biomedical Journal Figures”, Andrearczyk and Müller, CLEF 2018 2000 Annotated PMC-OA 90% accuracy
  • 9. Pipeline 99 Getting DLMI images Getting “human” images Getting “neoplastic” images Getting “rare cancer” images PMC-OA all images 1 2 3 4 5 DLMI Diagnostic Light Microscopy Images
  • 10. 10 Pipeline Getting DLMI images Getting “human” images Getting “neoplastic” images Getting “rare cancer” images PMC-OA all images 1 2 3 4 5 Title + Abstract MeSH MeSH vs Visual Textual DLMI Diagnostic Light Microscopy Images
  • 11. Visual approach: CNNs 11 MeSH_1 MeSH_0 Model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation
  • 12. Visual approach: CNNs 12 MeSH_1 MeSH_0 No MeSH MeSH_1MeSH_0 Model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation
  • 13. Title + Abstract Title + Abstract Textual approach Title + Abstract Model training & evaluation Best performing model 13 MeSH_0 MeSH_1 Title + Abstract MeSH_0 MeSH_1 Title + Abstract No MeSH
  • 14. 14 Pipeline Getting DLMI images Getting “human” images Getting “neoplastic” images Getting “rare cancer” images PMC-OA all images 1 2 3 4 5 Title + Abstract MeSH MeSH vs
  • 15. - 0.5467 0.1111 0.5789 - 0.3789 - 0.4999 0.6687 - 0.1167 0.9976 Getting “human” images Title + Abstract Title + Abstract Title + Abstract {MeSH} DLMI human Model training and evaluation 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor 1. Tf-idf, 2. Word vectors, 3. paragraph vector Not human 20% 80% Training set Test set human Not human Title + Abstract Title + Abstract = = ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH} ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} & other B01 codes ∉ {MeSH} 15
  • 16. Getting “human” images Title + Abstract Title + Abstract human not human Best performing Model, hyper-params and vectors SVM, tf-idf bigrams No MeSH Title + Abstract DLMI Title + Abstract Title + Abstract Title + Abstract {MeSH} DLMI human Model training and evaluation 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor 1. Tf-idf, 2. Word vectors, 3. paragraph vector not human 20% 80% Training set Test set - 0.5467 0.1111 0.5789 - 0.3789 - 0.4999 0.6687 - 0.1167 0.9976 16
  • 17. 17 Pipeline Getting DLMI images Getting “human” images Getting “neoplastic” images Getting “rare cancer” images PMC-OA all images 1 2 3 4 5 Title + Abstract MeSH MeSH vs
  • 18. 18 Getting “neoplastic” images neoplastic not neoplastic Title + Abstract Title + Abstract = = ⇔ C04 ∉ {MeSH} ⇔ C04 ∈ {MeSH} Title + Abstract Title + Abstract Title + Abstract {MeSH} DLMI Model training and evaluation 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor 1. Tf-idf, 2. Word vectors, 3. paragraph vector 20% 80% Training set Test set human neoplastic not neoplastic - 0.5467 0.1111 0.5789 - 0.3789 - 0.4999 0.6687 - 0.1167 0.9976
  • 19. Getting “non-neoplastic” images Title + Abstract Title + Abstract Title + Abstract {MeSH} DLMI Model training and evaluation 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor 1. Tf-idf, 2. Word vectors, 3. paragraph vector 20% 80% Training set Test set human neoplastic not neoplastic Title + Abstract Title + Abstract Best performing Model, hyper-params and vectors SVM, tf-idf bigrams No MeSH Title + Abstract DLMI human neoplastic not neoplastic - 0.5467 0.1111 0.5789 - 0.3789 - 0.4999 0.6687 - 0.1167 0.9976 19
  • 20. 20 Pipeline Getting DLMI images Getting “human” images Getting “neoplastic” images Getting “rare cancer” images PMC-OA all images 1 2 3 4 5 Title + Abstract MeSH MeSH vs
  • 21. Getting “rare cancer” images • No MeSH terms for “rare” cancer class • Set of {rare cancer} terms by National Center for Advancing Translational Sciences (NCATS) https://rarediseases.info.nih.gov/diseases/diseases-by-category/1 21 Title + Abstract Title + Abstract DLMI humanNo MeSH {MeSH} DLMI neoplastic human neoplastic Title + Abstract rare cancer Title + Abstract rare cancer = ⇔ Title + Abstract ∩ {rare cancer} ≠ Ø Title + Abstract non-rare cancer
  • 22. Visual: “rare cancer”. Image classes: rare cancer / non-rare cancer. Model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation
  • 23. Visual: “rare cancer”. Images with no label are classified as rare cancer / non-rare cancer. Model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation
  • 24. Results
      “human” vs. “non-human” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.69 | 0.71 | 0.68
      Textual | SVM | Tf-idf trigrams | 0.89 | 0.90 | 0.90
  • 25. Results
      “human” vs. “non-human” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.69 | 0.71 | 0.68
      Textual | SVM | Tf-idf trigrams | 0.89 | 0.90 | 0.90
      “neoplastic” vs. “non-neoplastic” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.68 | 0.65 | 0.64
      Textual | SVM | Tf-idf bigrams | 0.99 | 0.99 | 0.99
  • 26. Results
      “human” vs. “non-human” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.69 | 0.71 | 0.68
      Textual | SVM | Tf-idf trigrams | 0.89 | 0.90 | 0.90
      “neoplastic” vs. “non-neoplastic” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.68 | 0.65 | 0.64
      Textual | SVM | Tf-idf bigrams | 0.99 | 0.99 | 0.99
      “rare cancer” vs. “non-rare cancer” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.62 | 0.77 | 0.69
  • 27. Discussion: Textual vs. Visual. Textual approach: outperformed the visual approach for all tasks; Tf-idf n-grams with SVM performed best for both tasks. Visual approach: correctly classified some “human” test instances, with a recall of 0.71; worse performance for “neoplastic” identification; “rare cancer” classification had a recall of 0.77
  • 28. Conclusion • First study targeting automatic rare cancer image extraction • The approach relies on visual deep learning and textual NLP • 15,028 light microscopy (DLMI), human, rare cancer images + corresponding journal articles. Pipeline: PMC-OA all data → Getting DLMI images → Getting “human” images → Getting “neoplastic” images → Getting “rare cancer” images
  • 29. Thank you for your attention. More information: http://medgift.hevs.ch Contact: anjani.dhrangadhariya@hevs.ch Follow us: https://twitter.com/MedGIFT_group
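
The MeSH-based labeling rules on slides 16, 18 and 21 can be sketched in Python as below. This is a minimal sketch, not the authors' code. It assumes a hypothetical record layout (a dict with `title`, `abstract`, `mesh_terms` and `tree_numbers` keys), reads "C04 ∈ {MeSH}" as a prefix match on MeSH tree numbers, and approximates the intersection with the NCATS term set by case-insensitive substring matching. The two rare cancer terms are illustrative placeholders and are not taken from the NCATS list. Records without MeSH terms, which the slides route through a trained text classifier, are not covered here.

```python
from typing import Any, Dict, Iterable, List

# Illustrative placeholder terms only (assumption); the real term set
# comes from NCATS and is not reproduced here.
EXAMPLE_RARE_TERMS = ["chordoma", "thymoma"]


def is_human(mesh_terms: Iterable[str]) -> bool:
    """Human label: the MeSH descriptor "Humans" is attached to the record."""
    return "Humans" in set(mesh_terms)


def is_neoplastic(tree_numbers: Iterable[str]) -> bool:
    """Neoplastic label: C04 ∈ {MeSH}, read here as 'some tree number is
    C04 itself or lies below it in the hierarchy'."""
    return any(t == "C04" or t.startswith("C04.") for t in tree_numbers)


def mentions_rare_cancer(title: str, abstract: str,
                         rare_terms: Iterable[str]) -> bool:
    """Rare cancer label: Title + Abstract ∩ {rare cancer} ≠ Ø, approximated
    by case-insensitive substring matching on the combined text."""
    text = f"{title} {abstract}".lower()
    return any(term.lower() in text for term in rare_terms)


def label_record(record: Dict[str, Any], rare_terms: List[str]) -> str:
    """Apply the three binary decisions in sequence and return the label of
    the first stage at which the record is filtered out."""
    if not is_human(record["mesh_terms"]):
        return "non-human"
    if not is_neoplastic(record["tree_numbers"]):
        return "non-neoplastic"
    if mentions_rare_cancer(record["title"], record["abstract"], rare_terms):
        return "rare cancer"
    return "non-rare cancer"


# Hypothetical record; the tree number is for illustration only.
example = {
    "title": "A case of sacral chordoma",
    "abstract": "Light microscopy findings are reported.",
    "mesh_terms": ["Humans", "Chordoma"],
    "tree_numbers": ["C04.557"],
}
print(label_record(example, EXAMPLE_RARE_TERMS))  # rare cancer
```

Applying the stages in this order mirrors the pipeline on slide 28, where each step only sees the records retained by the previous one.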

Editor's Notes

  4. How are the biomedical publications stored in Medline represented in PubMed? A PubMed record consists of a title and abstract, followed by the publication's images, shown as thumbnails, and a list of Medical Subject Headings (MeSH) annotations, which act like keywords describing the publication. The text, images and MeSH terms are all tied together by the unique PubMed Identifier (PMID). You can also see a PMCID, the unique PubMed Central identifier, which links to the full text of the publication. All these components (images, text and MeSH terms) thus have a one-to-one association with each other.
  6. PubMed records are manually annotated with MeSH terms by staff at the NLM. What is the significance of attaching MeSH terms to a PubMed record? MeSH annotation enforces uniform, consistent terminology: all articles about benign cancer are indexed under the MeSH term “Neoplasm”, and all articles or studies involving patients are annotated with the MeSH term “Humans”. MeSH terms can therefore be considered gold-standard, or ground-truth, annotations for a publication. Not all publications in PubMed have these manually attached MeSH terms.
  7. Have these PMC-OA images been used elsewhere for image analysis? Yes. An annotated subset of PMC-OA has been used in the ImageCLEF medical image annotation challenge, a public challenge that has taken place since 2013. This small annotated subset of 2000 images was used to train CNNs to classify images into 31 image modality classes, including PET, CT and light microscopy images. This approach achieved an overall accuracy of 90% for modality classification. However, it stops at the superficial task of modality classification. What about going beyond generic modality classification into more specialized image sets?
  8. To navigate towards rare cancer sets, we did the following. Take all the PMC-OA images and classify them into 31 modality types using the ImageCLEF setup. Retain all images classified as DLMI, or diagnostic light microscopy images. We focus only on DLMI images because they are fundamental to rare cancer diagnostics. All retained DLMI images are linked to their respective title, abstract and, if available, MeSH annotations. With this multimodal annotated dataset in hand, we propose an approach for sequential curation of article abstracts and images using MeSH terms, to eventually mine out a large multimodal set of rare cancer images and full texts.
  9. This involves three successive binary classification tasks: we first separate the “human” from the “non-human” set, then the “neoplastic” from the “non-neoplastic” set, and finally the “rare cancer” from the “non-rare cancer” set. Note that at each binary classification step we evaluate the visual and textual approaches separately and use MeSH terms as the ground-truth labels for the datasets.
  10. For the visual classification tasks, images from the two MeSH classes were used to train and evaluate a VGG19 model initialized with pretrained ImageNet weights and fine-tuned with and without image augmentation. Data augmentation: image mirroring and cropping. Why do we use VGG?
  11. These fine-tuned models were then used to classify unlabeled images into their respective classes.
  13. Let's get back to the pipeline for further curating the previously retrieved DLMI dataset. «human» records were first separated from «non-human» records in the following way.
  15. The best-performing model setup was used to classify the un-annotated DLMI records into “human” and “non-human”.
  16. Then «neoplastic», or tumor-related, records were separated from «non-neoplastic» records in a similar manner.
  18. The best-performing model setup was used to classify the un-annotated records into “neoplasm” and “non-neoplasm”. That covers the annotated text dataset. Similarly, the annotated image dataset was classified using the VGG19 setup.
  19. Finally, we separate the rare cancer dataset from the non-rare cancer dataset.
  20. Unfortunately, there are no MeSH terms pertaining to “rare cancer”, so we used a pre-defined set of rare cancer terms available from NCATS. All records recognized as “neoplasm” were retained, and a record was labeled “rare cancer” only if a rare cancer term from the NCATS set was present in the title and abstract text.
  21. After getting «rare cancer» and the «non-rare cancer» labels for images from the previous text classification, we used them to train and evaluate a VGG19 model for this binary classification task.
  23. For the «human» classification task, the textual approach performed far better than the visual approach. However, a recall of 0.71 hints that the visual classification model does learn something about retaining human images.
  24. For the neoplasm classification task too, the textual approach performed better than the visual approach, which did not achieve good results for this task.
  25. For the final task, a recall of 0.77 hints that the VGG19 model did learn to retain the «rare cancer» images, but there is much room for improvement.
  26. Classification: Individual images ≠ full-texts
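
As a worked check of how the reported metrics relate to each other, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it for the “rare cancer” row of the results (precision 0.62, recall 0.77, F1-score 0.69). The function name `f1_score` is introduced here for illustration only and is not taken from the slides.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "rare cancer" vs. "non-rare cancer", visual approach (VGG19)
rare_cancer_f1 = f1_score(precision=0.62, recall=0.77)
print(round(rare_cancer_f1, 2))  # 0.69
```

The higher recall than precision in this row means the model retains most rare cancer images at the cost of also letting through non-rare ones, which matches the remark that there is room for improvement. The other visual rows do not reproduce exactly from their rounded precision and recall, so they were presumably computed differently (for example, averaged over classes); the slides do not say.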