Exploiting biomedical literature to mine out a large
multimodal dataset of rare cancer studies
Anjani K. Dhrangadhariya et al.
MedGIFT group
University of Applied Sciences Western Switzerland (HES-SO)
Project supported by European Union
Horizon 2020 grant agreement 825292
SPIE Medical Imaging 2020, 02.16.2020
Motivation
Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature communications 9.1 (2018): 1-10.
> Rare cancers - 25% of cancer-related deaths
> Affect fewer than 15 out of 100,000 people / year
> Lower prevalence = fewer patients
> Fewer tumor samples for research
> Lack of robust clinical models
Data resource
• Challenges
1) Private datasets
2) Limited size
3) Single center / scanner
4) Small variability
5) Some contain only images / only text
6) No or small subsets of manual annotations
7) Difficult to compare results
Large database, Open Access, Annotation
Medline/PubMed
• PubMed / Medline: 30 million articles, ~ 80 million images
• PubMed Central (PMC): 5.9 million full texts
• PubMed Central Open-Access (PMC-OA): 2.09 million full texts, 6.73 million images
Source: https://www.nlm.nih.gov/bsd/difference.html (2019)
Rare cancer image harvesting through automated knowledge aggregation and data mining approaches?
Individual record
• Medical Subject Headings (MeSH)
• Title + Abstract
• Images
Medical Subject Headings (MeSH)
• Hierarchically organized controlled vocabulary
• Cataloguing biomedical information
• 16 thematic categories
• A = anatomy
• B = organism
• C = diseases …
• Subcategories
[Figure: example of a MeSH term and its MeSH code]
Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265.
MeSH as annotation
• Manually annotated by National Library of Medicine (NLM) staff
• E.g., all studies about benign cancer are indexed under the MeSH annotation “Neoplasm”
• Ground-truth annotation
• Not all PMC / PMC-OA articles have MeSH annotations
Visual classification
• ImageCLEF medical image annotation challenge (since 2013)
• Small annotated subset of PMC-OA → train CNNs
• Classify into 31 modalities - PET, light microscopy, CT, etc.
• State of the art: superficial modality classification
• 2000 annotated PMC-OA, > 90% accuracy
“Deep Multimodal Classification of Image Types in Biomedical Journal Figures”, Andrearczyk and Müller, CLEF 2018
Method: Pipeline
1. PMC-OA all images
2. Getting DLMI images
3. Getting “human” images
4. Getting “neoplastic” images
5. Getting “rare cancer” images
DLMI = Diagnostic Light Microscopy Images
Two approaches: Visual vs. Textual (Title + Abstract, MeSH)
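The five pipeline steps amount to a cascade of filters, each one narrowing the set kept by the previous step. A minimal sketch, assuming each record is a dict carrying hypothetical boolean fields (`is_dlmi`, `is_human`, `is_neoplastic`, `is_rare_cancer`) that stand in for the outputs of the real classifiers or rules; none of these names come from the slides:

```python
# Hypothetical stage predicates: in a real pipeline each one would be a
# trained classifier or a rule, not a flag stored on the record.
STAGES = [
    ("DLMI", lambda r: bool(r.get("is_dlmi"))),
    ("human", lambda r: bool(r.get("is_human"))),
    ("neoplastic", lambda r: bool(r.get("is_neoplastic"))),
    ("rare cancer", lambda r: bool(r.get("is_rare_cancer"))),
]


def run_cascade(records):
    """Apply the stages in order; return survivors and per-stage counts."""
    counts = [("PMC-OA all images", len(records))]
    kept = list(records)
    for name, keep in STAGES:
        kept = [r for r in kept if keep(r)]
        counts.append((name, len(kept)))
    return kept, counts


# Made-up toy records, for illustration only.
toy = [
    {"id": "a", "is_dlmi": True, "is_human": True,
     "is_neoplastic": True, "is_rare_cancer": True},
    {"id": "b", "is_dlmi": True, "is_human": False},
    {"id": "c", "is_dlmi": False},
]
survivors, stage_counts = run_cascade(toy)
```

Keeping the per-stage counts makes it easy to see how much data each step discards.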
Method: Visual approach
Model training and evaluation
• VGG19
• ImageNet weights
• With and without image augmentation
[Diagram: images labelled Class_1 / Class_0 are used for training; images with no class are then assigned to Class_1 or Class_0]
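Image augmentation means enlarging the training set with label-preserving transformations of each image. The slides do not say which transformations were used; the sketch below assumes horizontal flips and 90° rotations, and represents an image as a plain nested list of pixel values purely for illustration:

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img):
    """Return the original plus one flipped and three rotated copies.

    The choice of transformations is an assumption, not taken from the slides.
    """
    out = [img, hflip(img)]
    rotated = img
    for _ in range(3):
        rotated = rot90(rotated)
        out.append(rotated)
    return out


tiny = [[1, 2],
        [3, 4]]
variants = augment(tiny)
```

In practice a deep learning framework would apply such transformations on the fly during training rather than materialising every copy.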
Method: Textual approach
• Input: Title + Abstract of records labelled Class_0 / Class_1
• Model training & evaluation
• Best performing model applied to Title + Abstract of records with no class
Method: Vectors
Count vector
• Documents represented by weighted word counts
• No semantics
Word vector
• Multidimensional, numerical vectors
• Semantically similar words are projected in proximity in a geometric space
Document vector
• Learn to associate words with document labels
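As an illustration of count vectors, the sketch below builds TF-IDF weighted word counts in plain Python. The whitespace tokenizer, the smoothed IDF formula and the toy documents are assumptions for illustration; the slides do not specify the weighting that was used:

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one weighted count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


# Made-up toy documents standing in for Title + Abstract text.
docs = [
    "rare sarcoma in human tissue",
    "mouse model of sarcoma",
]
vocab, vectors = tfidf_vectors(docs)
```

Words shared by every document (here "sarcoma") get the lowest weight, while words unique to one document get a higher one. Word order and meaning are ignored, which is the "no semantics" limitation.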
Experiments: Pipeline
Getting “human” images
• Data: Title + Abstract and {MeSH} of DLMI records
• human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} & other B01 codes ∉ {MeSH}
• not human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH}
• Split: 80% training set, 20% test set
• Model training and evaluation
  Classifiers: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor
  Features: 1. Count vectors 2. Word vectors 3. Document vectors
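The MeSH rule for the “human” label can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, records without any MeSH codes are treated as unlabelled, and records that carry the human code together with other B01 codes are left unlabelled because the rule assigns them to neither class:

```python
HUMAN = "B01.050.150.900.649.313.988.400.112.400.400"


def label_human(mesh_codes):
    """Return 'human', 'not human', or None when the rule gives no label."""
    codes = set(mesh_codes)
    if not codes:
        return None  # assumption: no MeSH annotation means no label
    if HUMAN not in codes:
        return "not human"
    other_b01 = [c for c in codes if c.startswith("B01") and c != HUMAN]
    return "human" if not other_b01 else None


# "B01.000.000" and "C04.000" are made-up placeholders, not real tree numbers.
examples = {
    "human only": label_human([HUMAN, "C04.000"]),
    "other organism": label_human(["B01.000.000"]),
    "mixed": label_human([HUMAN, "B01.000.000"]),
    "no mesh": label_human([]),
}
```

Only the records that receive a label would enter the training and test sets.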
Getting “human” images
• Best performing model, hyper-params and vectors: SVM, count vectors
• Applied to Title + Abstract of DLMI records with no MeSH → human / not human
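Both model training and the choice of the best model rest on the 80% / 20% split of the MeSH-labelled records. A minimal sketch of such a split in plain Python; stratifying by class and the fixed random seed are assumptions, since the slides only give the proportions:

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_fraction=0.2, seed=0):
    """Split items into (train, test), keeping class proportions similar."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for lab in sorted(by_label):
        idxs = by_label[lab]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


# Made-up toy data: ten record ids, five per class.
items = list(range(10))
labels = ["human"] * 5 + ["not human"] * 5
train, test = stratified_split(items, labels)
```

Each classifier and vector combination would be fitted on `train` and compared on `test`.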
Getting “neoplastic” images
• Data: Title + Abstract and {MeSH} of DLMI, human records
• neoplastic ⇔ C04 ∈ {MeSH}
• not neoplastic ⇔ C04 ∉ {MeSH}
• Split: 80% training set, 20% test set
• Model training and evaluation
  Classifiers: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor
  Features: 1. Count vectors 2. Word vectors 3. Document vectors
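The C04 rule can be sketched as a membership test on the record's MeSH codes. Reading "C04 ∈ {MeSH}" as "any code lies in the C04 branch of the hierarchy" is an assumption, as are the function name and the handling of records without MeSH codes:

```python
def label_neoplastic(mesh_codes):
    """Return 'neoplastic', 'not neoplastic', or None if there are no codes."""
    codes = list(mesh_codes)
    if not codes:
        return None  # assumption: no MeSH annotation means no label
    # A code belongs to the C04 branch if it is C04 itself or starts with "C04."
    in_c04 = any(c == "C04" or c.startswith("C04.") for c in codes)
    return "neoplastic" if in_c04 else "not neoplastic"


# The codes below are illustrative placeholders, not looked-up tree numbers.
in_branch = label_neoplastic(["B01.000.000", "C04.000.000"])
outside = label_neoplastic(["C040.000", "A01.000"])
```

Matching on "C04." rather than the bare prefix "C04" avoids accidental hits on unrelated codes that merely start with the same characters.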
Getting “neoplastic” images
• Best performing model, hyper-params and vectors: SVM, count vectors
• Applied to Title + Abstract of DLMI, human records with no MeSH → neoplastic / not neoplastic
Getting “rare cancer” images
• No MeSH terms for “rare” cancer class
• Set of 495 {rare cancer} terms by National Center for Advancing
Translational Sciences (NCATS)
https://rarediseases.info.nih.gov/diseases/diseases-by-category/1
• Input: Title + Abstract of DLMI, human, neoplastic records ({MeSH} and No MeSH)
• rare cancer ⇔ (Title + Abstract) ∩ {rare cancer} ≠ Ø
• non-rare cancer otherwise
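The intersection rule can be sketched as a phrase search over the title and abstract. The function names, the case-insensitive whole-phrase matching and the two example terms are assumptions; the example terms merely stand in for the 495-term NCATS list:

```python
import re


def find_rare_terms(text, rare_terms):
    """Return the rare-cancer terms that occur in the text as whole phrases."""
    lowered = text.lower()
    hits = []
    for term in rare_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            hits.append(term)
    return hits


def label_rare(title, abstract, rare_terms):
    """Label a record 'rare cancer' if any term matches, else 'non-rare cancer'."""
    hits = find_rare_terms(title + " " + abstract, rare_terms)
    return "rare cancer" if hits else "non-rare cancer"


# Stand-in terms for illustration only.
terms = ["Ewing sarcoma", "chordoma"]
first = label_rare("A case of Ewing sarcoma", "Histology of the lesion.", terms)
second = label_rare("Breast carcinoma histology", "A common tumor.", terms)
```

Exact phrase matching misses synonyms and spelling variants, so a real term list would likely need normalisation beyond lowercasing.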
Visual: “rare cancer”
Model training and evaluation
• VGG19
• ImageNet weights
• With and without image augmentation
[Diagram: images labelled rare cancer / non-rare cancer are used for training; images with no class are then assigned to rare cancer or non-rare cancer]
Results
“human” vs. “non-human” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.69 | 0.71 | 0.68
Textual | SVM | Count vectors | 0.89 | 0.90 | 0.90
“neoplastic” vs. “non-neoplastic” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.68 | 0.65 | 0.64
Textual | SVM | Count vectors | 0.99 | 0.99 | 0.99
“rare cancer” vs. “non-rare cancer” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.62 | 0.77 | 0.69
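For reference, the three reported metrics are related as follows for a single class. The sketch uses made-up confusion counts; the slides do not give the underlying counts or say whether scores were averaged over classes, so a row's F1-score need not equal the harmonic mean of its precision and recall:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts of one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 8 true positives, 2 false positives, 2 false negatives.
p, r, f = precision_recall_f1(8, 2, 2)

# Precision and recall of the visual "rare cancer" row.
rare_f1 = round(f1_from(0.62, 0.77), 2)
```

For the “rare cancer” row the harmonic mean of 0.62 and 0.77 rounds to the reported 0.69.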
Discussion: Textual vs. Visual
Textual approach
• Outperformed the visual approach for all tasks
• Count vectors with SVM performed best for both tasks
Visual approach
• Correctly classified some “human” test instances, with a recall of 0.71
• Worse performance for “neoplastic” identification
• “rare cancer” classification had a recall of 0.77
Conclusion
• First study targeting automatic rare cancer image extraction
• The approach relies on visual deep learning and textual NLP
• 15,028 light microscopy (DLMI), human, rare cancer images + corresponding journal articles
Thank you for your attention
More information:
http://medgift.hevs.ch
Contact:
anjani.dhrangadhariya@hevs.ch
Follow us:
https://twitter.com/MedGIFT_group

Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies

  • 1. Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies Anjani K. Dhrangadhariya et al. MedGIFT group University of Applied Sciences Western Switzerland (HES-SO) Project supported by European Union Horizon 2020 grant agreement 825292 SPIE Medical Imaging 2020, 02.16.2020
  • 2. Motivation Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature communications 9.1 (2018): 1-10. 2 > Rare Cancer - 25% cancer-related deaths > Affect less than 15 out of 100,000 / year > Lower prevalence = fewer patients > Less tumor samples for research > Lack of robust clinical models
  • 3. Data resource • Challenges 1) Private datasets 2) Limited size 3) Single center / scanner 4) Small variability 5) Some contain only images / only text 6) No or small subsets of manual annotations 7) Difficult to compare results 3 Large database Open Access Annotation
  • 4. Medline/PubMed PubMed / Medline PubMed Central PubMed Central Open- Access (PMC-OA) https://www.nlm.nih.gov/bsd/difference.html 30 million articles ~ 80 million images 5.9 million full texts 2.09 million full texts 6.73 million images 4 Rare cancer image harvesting through automated knowledge aggregation and data mining approaches? 2019
  • 5. Individual record Medical Subject Headings (MeSH) Title + Abstract Images 1 2 3 5 ✓ ✓ ✓ ✓
  • 6. Medical Subject Headings (MeSH) • Hierarchically organized Controlled Vocabulary • Cataloguing biomedical information • 16 thematic categories • A = anatomy • B = organism • C = diseases … • Subcategories MeSH term MeSH code Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265. 6
  • 7. MeSH as annotation • Manually annotated by National Library of Medicine (NLM) staff • E.g., all studies about benign cancer are indexed under the MeSH annotation “Neoplasm” • Groundtruth annotation • Not all PMC / PMC-OA articles have annotations
  • 8. Visual classification • ImageCLEF medical image annotation challenge (since 2013) • Small subset of annotated PMC-OA > train CNNs • Classify into 31 modalities: PET, light microscopy, CT, etc. • State of the art: superficial modality classification • 2000 annotated PMC-OA > 90% accuracy (“Deep Multimodal Classification of Image Types in Biomedical Journal Figures”, Andrearczyk and Müller, CLEF 2018)
  • 9. Method: Pipeline. 1) PMC-OA all images → 2) Getting DLMI images → 3) Getting “human” images → 4) Getting “neoplastic” images → 5) Getting “rare cancer” images. DLMI = Diagnostic Light Microscopy Images
  • 10. Method: Pipeline. 1) PMC-OA all images → 2) Getting DLMI images → 3) Getting “human” images → 4) Getting “neoplastic” images → 5) Getting “rare cancer” images. Inputs: Title + Abstract, MeSH. Comparison: visual vs. textual. DLMI = Diagnostic Light Microscopy Images
  • 11. Method: Visual approach. Images labeled Class_1 / Class_0 → model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation
  • 12. Method: Visual approach. Images labeled Class_1 / Class_0 → model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation. The trained model then assigns images with no class to Class_1 / Class_0
  • 13. Method: Textual approach. Title + Abstract records labeled Class_0 / Class_1 → model training & evaluation → best performing model → assigns Title + Abstract records with no class to Class_0 / Class_1
  • 14. Method: Vectors
      Count vector: • Documents represented by weighted word counts • No semantics
      Word vector: • Multidimensional, numerical vectors • Semantically similar words are projected in proximity in a geometric space
      Document vector: • Learn to associate words with document labels
  • 15. Experiments: Pipeline. 1) PMC-OA all images → 2) Getting DLMI images → 3) Getting “human” images → 4) Getting “neoplastic” images → 5) Getting “rare cancer” images. Inputs: Title + Abstract, MeSH
  • 16. Getting “human” images. DLMI records (Title + Abstract) with {MeSH} are labeled as follows:
      human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} & other B01 codes ∉ {MeSH}
      not human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH}
      Split: 80% training set, 20% test set. Model training and evaluation: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor. Features: 1. Count vectors 2. Word vectors 3. Document vectors
  • 17. Getting “human” images. Best performing model, hyper-params and vectors (SVM, count vectors) → classifies DLMI Title + Abstract records with no MeSH into human / not human
  • 18. Experiments: Pipeline. 1) PMC-OA all images → 2) Getting DLMI images → 3) Getting “human” images → 4) Getting “neoplastic” images → 5) Getting “rare cancer” images. Inputs: Title + Abstract, MeSH
  • 19. Getting “neoplastic” images. DLMI human records (Title + Abstract) with {MeSH} are labeled as follows:
      neoplastic ⇔ C04 ∈ {MeSH}
      not neoplastic ⇔ C04 ∉ {MeSH}
      Split: 80% training set, 20% test set. Model training and evaluation: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor. Features: 1. Count vectors 2. Word vectors 3. Document vectors
  • 20. Getting “neoplastic” images. Best performing model, hyper-params and vectors (SVM, count vectors) → classifies DLMI human Title + Abstract records with no MeSH into neoplastic / not neoplastic
  • 21. Experiments: Pipeline. 1) PMC-OA all images → 2) Getting DLMI images → 3) Getting “human” images → 4) Getting “neoplastic” images → 5) Getting “rare cancer” images. Inputs: Title + Abstract, MeSH
  • 22. Getting “rare cancer” images • No MeSH terms for “rare” cancer class • Set of 495 {rare cancer} terms by National Center for Advancing Translational Sciences (NCATS) https://rarediseases.info.nih.gov/diseases/diseases-by-category/1
  • 23. Getting “rare cancer” images. DLMI human neoplastic records (with {MeSH} or with no MeSH) are labeled from their text:
      rare cancer ⇔ Title + Abstract ∩ {rare cancer} ≠ Ø
      non-rare cancer otherwise
  • 24. 24
  • 25. Visual: “rare cancer”. Images labeled rare cancer / non-rare cancer → model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation
  • 26. Visual: “rare cancer”. Images labeled rare cancer / non-rare cancer → model training and evaluation • VGG19 • ImageNet weights • With and without image augmentation. The trained model then assigns images with no class to rare cancer / non-rare cancer
  • 27. Results: “human” vs. “non-human” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.69 | 0.71 | 0.68
      Textual | SVM | Count vectors | 0.89 | 0.90 | 0.90
  • 28. Results: “neoplastic” vs. “non-neoplastic” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.68 | 0.65 | 0.64
      Textual | SVM | Count vectors | 0.99 | 0.99 | 0.99
  • 29. Results: “rare cancer” vs. “non-rare cancer” classification
      Data type | Classifier | Feature | Precision | Recall | F1-score
      Visual | VGG19 | With data augmentation | 0.62 | 0.77 | 0.69
  • 30. Discussion: Textual vs. Visual
      Textual approach: • Outperformed the visual approach on both tasks where the two were compared • Count vectors with SVM performed best for both tasks
      Visual approach: • Correctly classified some “human” test instances, with a recall of 0.71 • Worse performance for “neoplastic” identification • “Rare cancer” classification had a recall of 0.77
  • 31. Conclusion • First study targeting automatic rare cancer image extraction • The approach relies on visual deep learning and textual NLP • 15,028 light microscopy (DLMI), human, rare cancer images + corresponding journal articles. Pipeline: 1) PMC-OA all data → 2) Getting DLMI images → 3) Getting “human” images → 4) Getting “neoplastic” images → 5) Getting “rare cancer” images
  • 32. Thank you for your attention 32 More information: http://medgift.hevs.ch Contact: anjani.dhrangadhariya@hevs.ch Follow us: https://twitter.com/MedGIFT_group
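
The labeling rules shown on slides 16, 19 and 23 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, MeSH tree numbers are matched by prefix, records that carry the human code together with another B01 code are left unlabeled, and the example rare cancer term is a placeholder rather than an entry copied from the NCATS list.

```python
from typing import Iterable, Optional

# MeSH tree number used for "human" on slide 16.
HUMAN_CODE = "B01.050.150.900.649.313.988.400.112.400.400"


def _under(code: str, root: str) -> bool:
    """True if a MeSH tree number equals `root` or sits below it (assumed prefix match)."""
    return code == root or code.startswith(root + ".")


def human_label(mesh_codes: Iterable[str]) -> Optional[str]:
    """Label a record from its MeSH codes; None means 'no label from MeSH'."""
    codes = set(mesh_codes)
    if not codes:
        return None  # no MeSH: left for the trained text classifier
    if HUMAN_CODE not in codes:
        return "not human"
    other_b01 = [c for c in codes if _under(c, "B01") and c != HUMAN_CODE]
    # Assumption: records mixing human and other organisms stay unlabeled.
    return "human" if not other_b01 else None


def neoplastic_label(mesh_codes: Iterable[str]) -> Optional[str]:
    """neoplastic <=> some code falls under C04; None if the record has no MeSH."""
    codes = set(mesh_codes)
    if not codes:
        return None
    return "neoplastic" if any(_under(c, "C04") for c in codes) else "not neoplastic"


def rare_cancer_label(title: str, abstract: str, rare_terms: Iterable[str]) -> str:
    """rare cancer <=> title + abstract contains at least one rare cancer term.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    text = f"{title} {abstract}".lower()
    hit = any(term.lower() in text for term in rare_terms)
    return "rare cancer" if hit else "non-rare cancer"


if __name__ == "__main__":
    record = [HUMAN_CODE, "C04.557.470"]
    print(human_label(record), neoplastic_label(record))
    print(rare_cancer_label("A case of chordoma", "", ["chordoma"]))
```

Records for which a function returns None correspond to the "No MeSH" branch on the slides, where the best performing text model assigns the class instead.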

Editor's Notes

  1. 2
  2. 3
  3. 4
  4. How are these biomedical publications stored in PubMed? A PubMed record consists of the title and abstract of the publication, the publication images (shown as thumbnails), and a list of Medical Subject Headings (MeSH) terms, which act as annotations describing the publication. All these multimodal elements (the text, the images and the MeSH annotations) are tied together by the unique PubMed Identifier (PMID) and thus share a one-to-one association with each other.
  5. 6
  6. Publications stored in PubMed are annotated with MeSH to enforce uniformity and consistency across the database: all articles about benign cancer are indexed under the MeSH term “Neoplasm”, and all articles or studies involving patients are annotated with the MeSH term “Humans”. PubMed records are manually annotated with MeSH terms by staff at NLM, so MeSH terms can be considered gold-standard (groundtruth) annotations for a publication. Not all publications in PubMed have these manually attached MeSH terms.
  7. A small annotated subset of PMC-OA has already been used in the ImageCLEF medical image annotation challenge, a public challenge that has been taking place since 2013. This subset of 2000 images was used to train CNNs to classify images into 31 modality classes, including PET, CT, light microscopy, et cetera. This approach achieved an overall accuracy of 90% for modality classification. However, it stops at superficial modality classification. What about going beyond this generic modality classification into more specialized image sets?
  8. So what we did to navigate towards rare cancer sets was this: take all the PMC-OA images of unknown type and classify them into 31 modality types using the ImageCLEF setup. Retain all the images classified as DLMI, or diagnostic light microscopy images. We focus only on DLMI images because they are fundamental to rare cancer diagnostics. All the retained DLMI images are linked to their respective title, abstract and MeSH annotations, if available. With this multimodal annotated dataset in hand, we propose an approach for sequential curation of article abstracts and images using MeSH terms, to eventually mine out a large multimodal set of rare cancer images and full texts.
  9. This involves three successive binary classification tasks: we first filter the “human” set from the “non-human” set, then separate the “neoplastic” set from the “non-neoplastic” set, and finally separate “rare cancer” from “non-rare cancer”. Note that at each binary classification step we compare the visual and textual approaches separately and use MeSH terms as the groundtruth labels for the datasets.
  10. For the visual classification tasks, images were separated into two classes based on MeSH terms. These class-annotated images were used to train and evaluate a VGG19 model initialized with pretrained ImageNet weights and fine-tuned with and without image augmentation. Data augmentation: image mirroring and cropping. Why do we use VGG?
  11. These fine-tuned models were then used to classify unlabeled images into their respective MeSH classes.
  12. 13
  13. Just like images, text is not directly understandable by ML algorithms, so numerical vectors need to be extracted from it. Count vectors are based on the word counts of a document and do not take semantics into account. Word vectors are multidimensional numerical vectors in which semantically similar words lie close together in a geometric space. Finally, document (paragraph) vectors learn to associate words with document labels.
  14. Let's get back to the pipeline for further curating the previously retrieved DLMI dataset. «human» records were first separated from «non-human» records in the following way.
  15. 16
  16. The best performing model setup was used to classify the un-annotated DLMI records into “human” and “non-human”. That covers the text dataset. Similarly, the annotated image dataset was classified using the VGG19 setup.
  17. Next, «neoplastic» or tumor-related records were separated from «non-neoplastic» records in a similar manner.
  18. 19
  19. The best performing model setup was used to classify the un-annotated records into “neoplasm” and “non-neoplasm”. That covers the text dataset. Similarly, the annotated image dataset was classified using the VGG19 setup.
  20. Finally, we separate the rare cancer dataset from the non-rare cancer dataset.
  21. Unfortunately, there are no MeSH terms pertaining to “rare cancer”, so we used a pre-defined set of rare cancer terms available from NCATS.
  22. All the records recognized as “neoplasm” were retained, and a record was labeled “rare cancer” only if a rare cancer term from the NCATS set was present in its title and abstract text.
  23. After getting «rare cancer» and the «non-rare cancer» labels for the individual publications from the previous text classification, we used them to train and evaluate a VGG19 model for this binary classification task.
  24. And the fine-tuned VGG19 setup was used to classify the images into «rare cancer» and «non-rare cancer»
  25. For the «human» classification task, textual approach performed far better than visual approach. However, a recall of 0.71 hints that the visual classification model does learn something about retaining human images.
  26. For the neoplasm classification task too, textual performed better than visual. Visual approach did not have good results for this task.
  27. For the final task, a recall of 0.77 does hint that VGG19 model did learn something by better retaining the «rare cancer» images, but it has much room for improvement.
  28. Classification: Individual images ≠ full-texts
  29. In conclusion: this is the first study targeting automatic extraction of rare cancer datasets. It compares both the visual and the textual approaches. One outcome of this work is a large dataset containing about 15,000 rare cancer images.
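
The count-vector representation described in the notes (documents represented by word counts, with no semantics) can be illustrated with a minimal standard-library sketch. The slides do not say which library, tokenizer or weighting scheme was used, so everything here is an assumption: the function names are hypothetical, tokenization is a simple lowercase split on letters and digits, and raw counts are used where the slide mentions weighted counts.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters/digits (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_vocabulary(docs: List[str]) -> Dict[str, int]:
    """Map every token seen in the training documents to a column index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def count_vector(doc: str, vocab: Dict[str, int]) -> List[int]:
    """Represent a document as raw word counts; unseen words are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokenize(doc)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


if __name__ == "__main__":
    docs = [
        "Tumor biopsy of human tissue",
        "Mouse model of tumor growth in tumor cells",
    ]
    vocab = fit_vocabulary(docs)
    for doc in docs:
        print(count_vector(doc, vocab))
```

Vectors of this kind would be the input to a classifier such as the SVM named on the results slides. Word order and meaning are discarded, which is the "no semantics" limitation noted on slide 14.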