This document summarizes a study that mined biomedical literature to build a large multimodal dataset of rare cancer studies. The researchers harvested over 15,000 images and the corresponding journal articles related to rare cancers from public literature databases, and used both visual and textual classification approaches to identify images of humans, neoplastic tissue, and rare cancers. The textual approach, using TF-IDF features and SVMs, outperformed the visual CNN classifiers on all tasks. The result is the first dataset aimed at automatically extracting rare cancer images, helping to address the challenges of researching these less prevalent cancers.
1. Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies
Anjani K. Dhrangadhariya et al.
MedGIFT group
University of Applied Sciences Western Switzerland (HES-SO)
Project supported by the European Union Horizon 2020 programme, grant agreement 825292
SPIE Medical Imaging 2020, 16.02.2020
2. Motivation
> Rare cancers: fewer than 15 cases per 100,000 people per year
> Account for 25% of cancer-related deaths
> Lower prevalence = fewer patients
> Fewer tumor samples for research
> Lack of robust clinical models
Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature Communications 9.1 (2018): 1-10.
3. Data resource
• Challenges
1) Private datasets
2) Limited size
3) Single center / scanner
4) Small variability
5) Some contain only images / only text
6) No annotations, or only small manually annotated subsets
7) Difficult to compare results
6. Medical Subject Headings (MeSH)
• Hierarchically organized controlled vocabulary
• Used for cataloguing biomedical information
• 16 thematic categories (A = Anatomy, B = Organisms, …)
• Each term has a unique MeSH identifier
[Diagram: an example MeSH term paired with its MeSH tree code]
Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265.
7. MeSH as annotation
• Manually annotated by National Library of Medicine (NLM) staff
• E.g., all studies about benign cancer are indexed under the MeSH annotation "Neoplasm"
• Serves as ground-truth annotation
• Not all PMC / PMC-OA records have MeSH annotations
8. Visual classification
• ImageCLEF medical image annotation challenge (running since 2013)
• A small annotated subset of PMC-OA (2,000 images) is used to train CNNs
• Images are classified into 31 modalities: PET, light microscopy, CT, etc.
• 90% accuracy; the state of the art stops at this superficial modality classification
Andrearczyk and Müller, "Deep Multimodal Classification of Image Types in Biomedical Journal Figures", CLEF 2018
13. Textual approach
[Diagram: titles and abstracts of MeSH-annotated records (classes MeSH_0 vs. MeSH_1) feed model training and evaluation; the best-performing model then classifies the titles and abstracts of records with no MeSH annotations.]
14. Pipeline
1) PMC-OA: all images
2) Getting DLMI images
3) Getting "human" images
4) Getting "neoplastic" images
5) Getting "rare cancer" images
At each curation step, the textual (title + abstract) approach is compared against the visual one, with MeSH terms serving as ground truth; see the sketch after this list.
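To make the sequential nature of the curation concrete, here is a minimal Python sketch of the five-step chain. The record fields and predicates are hypothetical placeholders, not the authors' actual code; the real predicates are the classifiers described on the following slides.

```python
# Hypothetical sketch of the five-step curation pipeline: each step keeps
# only the records whose predicate holds before handing off to the next.
def curate(records, steps):
    for label, keep in steps:
        records = [r for r in records if keep(r)]
        print(f"after '{label}' filter: {len(records)} records")
    return records

# Toy records; real records carry an image, title + abstract, and MeSH terms.
records = [{"modality": "DLMI", "human": True, "neoplastic": True, "rare": True},
           {"modality": "CT",   "human": True, "neoplastic": False, "rare": False}]

steps = [("DLMI",        lambda r: r["modality"] == "DLMI"),  # step 2
         ("human",       lambda r: r["human"]),               # step 3
         ("neoplastic",  lambda r: r["neoplastic"]),          # step 4
         ("rare cancer", lambda r: r["rare"])]                # step 5

curate(records, steps)
```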
15. Getting "human" images
[Diagram: the titles, abstracts, and MeSH sets of DLMI records are split into "human" and "not human" according to the rule below.]
Labeling rule, applied to each record's MeSH set:
• human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} and other B01 codes ∉ {MeSH}
• not human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH}
Model training and evaluation (80% training set / 20% test set):
• Classifiers: 1) logistic regression, 2) support vector machine, 3) k-nearest neighbor
• Features: 1) tf-idf, 2) word vectors, 3) paragraph vectors
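A minimal sketch of this MeSH-tree-code rule in Python, assuming each record's annotations are available as a set of MeSH tree-code strings. Records carrying the Humans code plus other B01 (Organisms) codes fall outside both definitions above; here they are returned as None, which is this sketch's interpretation, not something the slides specify.

```python
HUMANS_CODE = "B01.050.150.900.649.313.988.400.112.400.400"  # MeSH tree code for "Humans"

def label_human(mesh_codes):
    """Label a record from its set of MeSH tree codes.

    "human"     iff the Humans code is present and no other B01 codes are;
    "not human" iff the Humans code is absent;
    None        for mixed-organism records (outside both definitions).
    """
    b01_codes = {c for c in mesh_codes if c.startswith("B01")}
    if HUMANS_CODE not in b01_codes:
        return "not human"
    if b01_codes == {HUMANS_CODE}:
        return "human"
    return None  # Humans plus other organisms: excluded from the labeled set

print(label_human({HUMANS_CODE, "C04.557"}))  # -> "human"
print(label_human({"B01.050.150"}))           # -> "not human"
```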
16. Getting "human" images
• Best-performing model, hyper-parameters and vectors: SVM with tf-idf bigrams
• This setup was used to classify the un-annotated (no-MeSH) DLMI records into "human" and "not human"
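A minimal scikit-learn sketch of the winning textual setup: tf-idf over unigrams and bigrams feeding an SVM, with the slide's 80/20 train/test split. The placeholder texts, the LinearSVC variant, and all hyper-parameters are assumptions; the slides do not specify them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder "title + abstract" strings with labels from the MeSH rule on
# the previous slide; the real study used thousands of PMC-OA records.
texts = ["human sarcoma case report", "murine xenograft tumor model",
         "patient cohort with rare carcinoma", "zebrafish melanoma screen",
         "clinical follow-up of human chordoma", "canine mammary tumor study"]
labels = ["human", "not human", "human", "not human", "human", "not human"]

# 80% training set / 20% test set, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels)

# Tf-idf over unigrams + bigrams feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

The fitted model can then be applied to the no-MeSH records, mirroring how the best-performing setup was used to label the un-annotated DLMI set.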
17. Pipeline (recap)
18. Getting "neoplastic" images
Labeling rule, applied to the "human" records:
• neoplastic ⇔ C04 ∈ {MeSH}
• not neoplastic ⇔ C04 ∉ {MeSH}
Model training and evaluation (80% training set / 20% test set):
• Classifiers: 1) logistic regression, 2) support vector machine, 3) k-nearest neighbor
• Features: 1) tf-idf, 2) word vectors, 3) paragraph vectors
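The "neoplastic" rule is simpler than the "human" one: every neoplasm term lives under the C04 (Neoplasms) branch of the MeSH tree, so a prefix check suffices. A one-function sketch, assuming the same set-of-tree-codes representation as before (reading "C04 ∈ {MeSH}" as "any code under C04"):

```python
def label_neoplastic(mesh_codes):
    """'neoplastic' iff any MeSH tree code is C04 or a C04 descendant."""
    if any(c == "C04" or c.startswith("C04.") for c in mesh_codes):
        return "neoplastic"
    return "not neoplastic"

print(label_neoplastic({"C04.557.450", "B01.050"}))  # -> "neoplastic"
print(label_neoplastic({"B01.050"}))                 # -> "not neoplastic"
```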
19. Getting "non-neoplastic" images
• Best-performing model, hyper-parameters and vectors: SVM with tf-idf bigrams
• This setup was used to classify the un-annotated (no-MeSH) "human" records into "neoplastic" and "not neoplastic"
20. Pipeline (recap)
21. Getting "rare cancer" images
• No MeSH terms exist for a "rare cancer" class
• A set of {rare cancer} terms is available from the National Center for Advancing Translational Sciences (NCATS)
https://rarediseases.info.nih.gov/diseases/diseases-by-category/1
Labeling rule, applied to the "neoplastic" records:
• rare cancer ⇔ (Title + Abstract) ∩ {rare cancer terms} ≠ Ø
• non-rare cancer otherwise
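A minimal sketch of this rule: a record is "rare cancer" iff its title + abstract mentions at least one NCATS rare-cancer term. The example terms below are a hypothetical sample of the NCATS list, and the case-insensitive substring matching is an assumption; the slides do not specify the matching details.

```python
# Illustrative subset of the NCATS rare-cancer terms (hypothetical sample;
# the full list: https://rarediseases.info.nih.gov/diseases/diseases-by-category/1).
RARE_CANCER_TERMS = {"chordoma", "adrenocortical carcinoma", "merkel cell carcinoma"}

def label_rare_cancer(title_abstract):
    """'rare cancer' iff (title + abstract) ∩ {rare cancer terms} != empty set."""
    text = title_abstract.lower()
    if any(term in text for term in RARE_CANCER_TERMS):
        return "rare cancer"
    return "non-rare cancer"

print(label_rare_cancer("A case of sacral chordoma in an adult patient"))  # -> "rare cancer"
print(label_rare_cancer("Breast cancer screening cohort"))                 # -> "non-rare cancer"
```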
22. Visual: "rare cancer"
Model training and evaluation:
• VGG19 with pretrained ImageNet weights
• Fine-tuned with and without image augmentation
• Classes: "rare cancer" vs. "non-rare cancer" (labels from the previous text classification step)
23. Visual: "rare cancer" (continued)
• The fine-tuned model was then used to classify unlabeled images into "rare cancer" and "non-rare cancer"
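A minimal Keras sketch (assuming TensorFlow 2.x) of the visual setup: VGG19 with pretrained ImageNet weights and a new binary head, with mirroring and random cropping as augmentation (the augmentations named in the speaker notes). The input size, head architecture, and optimizer are assumptions, since the slides do not specify them.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

# VGG19 convolutional base with pretrained ImageNet weights, frozen so only
# the new classification head is trained.
base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),      # images loaded slightly larger than VGG19's input
    layers.RandomFlip("horizontal"),        # mirroring (augmentation)
    layers.RandomCrop(224, 224),            # random crop to VGG19's 224x224 input size
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # binary: rare cancer vs. non-rare cancer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_ds, validation_data=val_ds)  # labeled image datasets not shown here
```

For the "without augmentation" run, the RandomFlip and RandomCrop layers would simply be dropped (with a plain resize to 224x224 instead).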
24. Results
"human" vs. "non-human" classification

Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.69      | 0.71   | 0.68
Textual   | SVM        | Tf-idf trigrams        | 0.89      | 0.90   | 0.90

25. Results (continued)
"neoplastic" vs. "non-neoplastic" classification

Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.68      | 0.65   | 0.64
Textual   | SVM        | Tf-idf bigrams         | 0.99      | 0.99   | 0.99

26. Results (continued)
"rare cancer" vs. "non-rare cancer" classification

Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.62      | 0.77   | 0.69
27. Discussion: Textual vs. Visual
Textual approach
• Outperformed the visual approach for all tasks
• Tf-idf n-grams with an SVM performed best for both textual tasks
Visual approach
• Correctly classified some "human" test instances, with a recall of 0.71
• Performed worse on "neoplastic" identification
• "rare cancer" classification reached a recall of 0.77
28. Conclusion
• First study targeting automatic rare cancer image extraction
• The approach relies on visual deep learning and textual NLP
• Yields 15,028 diagnostic light microscopy (DLMI), human, rare cancer images plus the corresponding journal articles
[Diagram: the five-step pipeline, from PMC-OA (all data) through DLMI, "human", "neoplastic", and "rare cancer" filtering]
29. Thank you for your attention
More information:
http://medgift.hevs.ch
Contact:
anjani.dhrangadhariya@hevs.ch
Follow us:
https://twitter.com/MedGIFT_group
Editor's Notes
How are the biomedical publications stored in Medline represented in PubMed?
A PubMed record consists of a title and abstract, followed by the publication images shown as thumbnails, and a list of Medical Subject Headings (MeSH) annotations that act like keywords describing the publication.
All this text, the images, and the MeSH terms are strung together by the unique PubMed identifier (PMID). You can also notice a PMCID, the unique PubMed Central identifier, which links to the full text of the publication.
The images, text, and MeSH terms thus have a one-to-one association with each other.
PubMed records are manually annotated with MeSH terms by staff at the NLM.
What is the significance of attaching MeSH terms to a PubMed record?
MeSH annotation enforces uniformity and consistency across the terminology: all articles about benign cancer are indexed under the MeSH term "Neoplasm", and all studies involving patients are annotated with the MeSH term "Humans".
MeSH terms can therefore be considered gold-standard, or ground-truth, annotations for a publication.
Not all publications in PubMed have these manually attached MeSH terms.
Have these PMC-OA images been used elsewhere for image analysis?
Yes, an annotated subset of PMC-OA has already been used in the ImageCLEF medical image annotation challenge, a public challenge that has taken place since 2013.
This small annotated subset of 2,000 images was used to train CNNs to classify images into 31 modality classes, including PET, CT, light microscopy, et cetera.
This classification approach achieved an overall 90% accuracy for modality classification.
However, the approach only goes as far as this superficial modality classification task.
What about going beyond generic modality classification into more specialized image sets?
So what we did to navigate towards rare cancer sets was this:
Take all the PMC-OA images and classify them with the ImageCLEF setup into the 31 modality types.
Retain all the images classified as DLMI, or diagnostic light microscopy images.
We focus only on DLMI images because they are fundamental to rare cancer diagnostics.
All the retained DLMI images are linked to their respective titles, abstracts, and MeSH annotations, where available.
With this multimodal annotated dataset in hand, we propose an approach for sequential curation of article abstracts and images using MeSH terms, to eventually mine out a large multimodal set of rare cancer images and full texts.
This involves three successive binary classification tasks: we first filter the "human" from the "non-human" set, then separate the "neoplastic" from the "non-neoplastic" set, and finally separate "rare cancer" from "non-rare cancer".
Note that at each binary classification step we compare the visual and textual approaches separately and use MeSH terms as the ground-truth labels for the datasets.
For the visual classification tasks, images with two different MeSH classes were used to train and evaluate a VGG19 model with pretrained ImageNet weights, fine-tuned with and without image augmentation.
Data augmentation: image mirroring and cropping.
Why do we use VGG?
These fine-tuned models were then used to classify unlabeled images into their respective classes.
Let's get back to the pipeline, to further curate the previously retrieved DLMI dataset.
"Human" records were first filtered out from "non-human" records in the following way.
The best-performing model setup was used to classify the un-annotated DLMI records into "human" and "non-human".
Then "neoplastic", or tumor-related, records were separated from "non-neoplastic" records in a similar manner.
The best-performing model setup was used to classify the un-annotated records into "neoplasm" and "non-neoplasm".
That covered the annotated text dataset; the annotated image dataset was similarly classified using the VGG19 setup.
Finally, we separate the rare cancer dataset from the non-rare cancer dataset.
Unfortunately, there are no MeSH terms pertaining to "rare cancer", so we used a predefined set of rare cancer terms available from NCATS.
All the records recognized as "neoplasm" were retained and labeled "rare cancer" only if a rare cancer term from the NCATS set was present in the title and abstract.
After obtaining the "rare cancer" and "non-rare cancer" labels for images from the previous text classification, we used them to train and evaluate a VGG19 model for this binary classification task.
For the "human" classification task, the textual approach performed far better than the visual approach.
However, a recall of 0.71 hints that the visual classification model does learn something about retaining human images.
For the neoplasm classification task too, the textual approach performed better than the visual one; the visual approach did not produce good results for this task.
For the final task, a recall of 0.77 hints that the VGG19 model did learn something about retaining the "rare cancer" images, but there is much room for improvement.