Image Mining from Gel Diagrams in Biomedical Publications

Tobias Kuhn and Michael Krauthammer
Krauthammer Lab, Department of Pathology
Yale University School of Medicine
5th International Symposium on
Semantic Mining in Biomedicine (SMBM)
3 September 2012
Zurich, Switzerland

Introduction
The inclusion of figure images is a recent trend in the area of
literature mining.
The increasing number of open access publications makes such
images available for automated analysis.
Image mining techniques can be used for image search interfaces,
for relation mining, and to complement text mining approaches.

Yale Image Finder
http://krauthammerlab.med.yale.edu/imagefinder/

Gel Images
Our approach focuses on gel images:
• They are the result of gel electrophoresis (e.g. Southern,
Western and Northern blotting)
• They are often shown in biomedical publications as evidence for
the discussed findings (e.g. protein-protein interactions and
protein expression under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns

Relations from Gel Images
Condition     Measurement   Result
MDA-MB-231    14-3-3σ       high expression
NHEM          14-3-3σ       no expression
C8161.9       14-3-3σ       high expression
LOX           14-3-3σ       low expression
MDA-MB-231    β-actin       high expression
NHEM          β-actin       high expression
C8161.9       β-actin       high expression
LOX           β-actin       high expression

Condition                      Measurement   Result
IL-1β (–) DEX (–) RU486 (–)    p-p38         low expression
IL-1β (+) DEX (–) RU486 (–)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (–)    p-p38         no expression
IL-1β (+) DEX (+) RU486 (–)    p-p38         low expression
IL-1β (–) DEX (–) RU486 (+)    p-p38         no expression
IL-1β (+) DEX (–) RU486 (+)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (+)    p-p38         low expression
IL-1β (+) DEX (+) RU486 (+)    p-p38         high expression
...                            ...           ...

Image Mining Processes
In principle, image mining involves the same processes as classical
literature mining [1] (with some subtle but important differences):
• Document categorization (image categorization has to deal
with the two-dimensional space of pixels, instead of text)
• Named entity tagging (pinpointing the mention of an entity is
more difficult with images; OCR errors have to be considered)
• Fact extraction (analysis of graphical elements instead of
parsing complete sentences)
• Collection-wide analysis
[1] Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature.
International Journal of Medical Informatics, 67(1-3):7–18.

Procedure
[Diagram: articles → (1) → figures → (2) → segments → (3) → text → (4) → gels → (5) → gel panels → (6) → named entities → (7) → relations]
1 Figure Extraction
2 Segmentation
3 Text Recognition
4 Gel Segment Detection
5 Gel Panel Detection
6 Named Entity Recognition
7 Relation Extraction

Figure Extraction
[Diagram: step 1 of the pipeline, from articles to figures]
We use structured XML files of the open access subset of PubMed
Central.
(Figure extraction from PDF files or even bitmaps of scanned articles
would be more difficult, but definitely feasible.)
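
As a rough illustration of this step, the sketch below pulls figure entries out of an article XML file. It assumes JATS-style markup in which each figure is a <fig> element containing a <label> and a <graphic> with an xlink:href attribute; the slides do not say which elements the pipeline actually reads, and the function name extract_figures and the sample document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace-qualified attribute name for xlink:href (assumed JATS-style markup).
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_figures(xml_text):
    """Return a list of (figure id, label, graphic reference) tuples."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        label = fig.findtext("label", default="")
        graphic = fig.find(".//graphic")
        href = graphic.get(XLINK_HREF) if graphic is not None else None
        figures.append((fig.get("id"), label, href))
    return figures


# Hypothetical, heavily shortened article used only for demonstration.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of 14-3-3σ expression.</p></caption>
      <graphic xlink:href="example-1-1"/>
    </fig>
  </body>
</article>"""

print(extract_figures(SAMPLE))
```

The graphic reference would then have to be resolved to the image file shipped alongside the XML, which is not shown here.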

Segmentation and Text Recognition
[Diagram: steps 2 and 3 of the pipeline, producing segments and text]
For segmentation and text recognition we rely on our previous work [2].
This includes:
• Detection of layout elements
• Text region detection
• OCR (using the Microsoft Document Imaging package of MS
Office)
[2] Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for
biomedical images. Journal of Biomedical Informatics, 43(6):924–931, December.
Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region
detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.

Gel Segment Detection
[Diagram: step 4 of the pipeline, producing gels]
Random forest classifiers (75 random trees each) trained on the
following features of image segments:
• coordinates of the relative position within the image
• relative and absolute width and height
• 16 grayscale histogram features
• color features: red, green and blue
• 13 texture features
• number of recognized characters
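
A minimal sketch of how such a feature vector might be assembled for one segment is given below. It is not the authors' implementation: the function name segment_features, the input layout, the grayscale conversion (plain RGB average) and the feature ordering are all assumptions, and the 13 texture features are left out because the slides do not specify them.

```python
def segment_features(pixels, box, image_size, n_chars):
    """Build a feature list for one image segment.

    pixels:     rows of (r, g, b) tuples with values 0-255 (assumed layout)
    box:        (x, y, width, height) of the segment in image pixels
    image_size: (width, height) of the whole figure
    n_chars:    number of characters the OCR step recognized in the segment
    """
    x, y, w, h = box
    img_w, img_h = image_size
    flat = [p for row in pixels for p in row]
    n = len(flat)

    # 16-bin grayscale histogram, normalized to fractions of pixels.
    hist = [0] * 16
    for r, g, b in flat:
        gray = (r + g + b) // 3          # assumption: simple channel average
        hist[gray * 16 // 256] += 1
    hist = [count / n for count in hist]

    # Mean red, green and blue, scaled to the range 0..1.
    mean_rgb = [sum(p[i] for p in flat) / n / 255 for i in range(3)]

    # Relative position of the segment centre, then relative and absolute size.
    position = [(x + w / 2) / img_w, (y + h / 2) / img_h]
    size = [w / img_w, h / img_h, w, h]

    return position + size + hist + mean_rgb + [n_chars]


# Toy 2x2 segment: half black, half white.
demo = segment_features(
    pixels=[[(0, 0, 0), (255, 255, 255)], [(0, 0, 0), (255, 255, 255)]],
    box=(10, 20, 2, 2),
    image_size=(100, 200),
    n_chars=0,
)
print(len(demo))
```

Vectors of this kind could then be passed to any random forest implementation configured with 75 trees.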

Gel Segment Detection Results
Manually annotated training and testing sets of 500 random figures
each.
Results for three different thresholds:
                  Threshold   Precision   Recall   F-score
high recall       0.15        0.439       0.909    0.592
                  0.30        0.765       0.739    0.752
high precision    0.60        0.926       0.301    0.455
Accuracy (area under ROC curve): 98.0%
Unbalanced set: 3% gel segments vs. 97% non-gel segments
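
The F-scores in this table appear consistent with the balanced F-measure (the harmonic mean of precision and recall); the slides do not state the formula, so that is an assumption. A small sketch for recomputing them from the reported precision and recall values:

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall) as reported on the slide.
rows = [(0.15, 0.439, 0.909), (0.30, 0.765, 0.739), (0.60, 0.926, 0.301)]
for threshold, p, r in rows:
    print(f"{threshold:.2f}  {f_score(p, r):.3f}")
```

Because the inputs are already rounded to three decimals, a recomputed value can differ from the reported one in the last digit.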

Gel Panel Detection
[Diagram: step 5 of the pipeline, producing gel panels]
Algorithm:
• Start with a segment that the high-precision classifier labels as
a gel
• Repeatedly look for adjacent segments that the high-recall
classifier labels as gels, and merge them
• Collect labels in the form of text segments around the detected
gel region
Results on another set of 500 manually annotated figures:
Precision Recall F-score
0.951 0.379 0.542
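
The seed-and-grow part of the algorithm above can be sketched as follows. This is an illustration under several assumptions: segments are axis-aligned boxes with a classifier score, a segment counts as a gel when its score reaches the threshold, "adjacent" means within a small pixel gap (the gap value is invented), and the thresholds 0.60 and 0.15 are taken from the gel segment results. Label collection is not shown.

```python
def adjacent(a, b, gap=5):
    """True if boxes a and b, given as (x, y, w, h), overlap or lie
    within `gap` pixels of each other (gap is a hypothetical parameter)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - gap <= bx + bw and bx - gap <= ax + aw and
            ay - gap <= by + bh and by - gap <= ay + ah)


def grow_panels(segments, high_precision=0.60, high_recall=0.15):
    """segments: list of (box, gel_score) pairs.
    Returns a list of panels, each a list of segment indices."""
    used = set()
    panels = []
    for seed, (_, score) in enumerate(segments):
        # Seeds must pass the strict (high-precision) threshold.
        if score < high_precision or seed in used:
            continue
        panel = [seed]
        used.add(seed)
        grew = True
        while grew:  # repeat until no further neighbour can be merged
            grew = False
            for i, (box, s) in enumerate(segments):
                # Neighbours only need to pass the lenient threshold.
                if i in used or s < high_recall:
                    continue
                if any(adjacent(box, segments[j][0]) for j in panel):
                    panel.append(i)
                    used.add(i)
                    grew = True
        panels.append(panel)
    return panels


# Toy input: one confident seed, two neighbouring weak gels,
# one far-away weak gel, and one nearby non-gel.
demo_segments = [
    ((0, 0, 50, 20), 0.90),
    ((52, 0, 50, 20), 0.20),
    ((104, 0, 50, 20), 0.20),
    ((300, 300, 50, 20), 0.20),
    ((0, 100, 50, 20), 0.05),
]
print(grow_panels(demo_segments))
```

Requiring a strict seed while merging lenient neighbours fits the reported outcome of high precision with comparatively low recall: panels without any confidently classified segment are never started.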

Named Entity Recognition
[Diagram: step 6 of the pipeline, producing named entities]
Detection of gene and protein names in gel labels:
• Tokenization of gel label texts
• Lookup in Entrez Gene database
• Case-sensitive matching
• Exclude tokens:
  – Fewer than 3 characters
  – Arabic or Roman numerals
  – Common short words (from a list of the 100 most frequent words
    in biomedical articles)
  – 22 general words frequently used in gel diagrams (e.g. min, hrs,
    line, type, protein, DNA)
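
The filtering and lookup steps above might look roughly like the sketch below. Everything concrete in it is a placeholder: the tokenizer (splitting on whitespace and a few punctuation marks), the tiny word lists, and the three-entry gene dictionary standing in for Entrez Gene. Only the six gel-diagram words named on the slide are included, and comparing excluded words case-insensitively is an assumption.

```python
import re

# Strict pattern for upper-case Roman numerals (non-empty).
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

# Placeholder lists; the real ones hold 100 and 22 words respectively.
STOP_WORDS = {"the", "and", "with"}
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "dna"}

# Toy stand-in for the Entrez Gene dictionary.
GENE_NAMES = {"MAPK14", "ACTB", "SFN"}


def gene_tokens(label):
    """Return the tokens of a gel label that match a known gene name."""
    hits = []
    for token in re.split(r"[\s,;:()/]+", label):
        if len(token) < 3:
            continue                      # too short
        if token.isdigit() or ROMAN.match(token):
            continue                      # Arabic or Roman numeral
        if token.lower() in STOP_WORDS or token.lower() in GEL_WORDS:
            continue                      # common or gel-specific word
        if token in GENE_NAMES:           # case-sensitive lookup
            hits.append(token)
    return hits


print(gene_tokens("MAPK14 protein 30 min III actb"))
```

In this toy run the lower-case token actb is not matched, which shows the effect of case-sensitive lookup.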

Named Entity Recognition Results
Recognized gene/protein tokens in 2000 random figures:
                                               absolute   relative
Total                                          156        100.0%
Incorrect                                      54         34.6%
– Not mentioned (OCR errors)                   28         17.9%
– Not references to genes or proteins          26         16.7%
Correct                                        102        65.4%
– Partially correct (could be more specific)   14         9.0%
– Fully correct                                88         56.4%

Relation Extraction
[Diagram: step 7 of the pipeline, producing relations]
Relation extraction is future work and we do not have concrete
results at this point.
It would involve the following steps:
• Gene/protein name disambiguation
• Identify semantic roles (condition, measurement, ...)
• Quantify degree of expression
Combination with classical text mining techniques seems promising.

Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed
Central:
Total articles                           410 950
Processed articles                       386 428
Total figures from processed articles    1 110 643
Processed figures                        884 152
Detected gel panels                      85 942
Detected gel panels per figure           0.097
Detected gel labels                      309 340
Detected gel labels per panel            3.599
Detected gene tokens                     1 854 609
Detected gene tokens in gel labels       75 610
Gene token ratio                         0.033
Gene token ratio in gel labels           0.068
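
Two of the derived figures in this table can be recomputed from the counts next to them, as in the short sketch below (variable names are invented for the example). The gene token ratios cannot be checked the same way, because the total token counts they are based on are not given on the slide.

```python
# Counts copied from the table above.
processed_figures = 884_152
gel_panels = 85_942
gel_labels = 309_340

panels_per_figure = gel_panels / processed_figures
labels_per_panel = gel_labels / gel_panels

print(f"gel panels per figure: {panels_per_figure:.3f}")
print(f"gel labels per panel:  {labels_per_panel:.3f}")

# Ratio of the two reported gene token ratios: gene tokens are roughly
# twice as dense in gel labels as in the token set overall.
density_factor = 0.068 / 0.033
print(f"density factor: {density_factor:.1f}")
```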

Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory
accuracy, but it is clear that this procedure is far from perfect.
Shouldn’t we standardize biomedical diagrams? A Unified
Modeling Language (UML) for biomedicine?

Conclusions and Future Work
Conclusions:
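• Gel segments can be detected with high accuracy (area under ROC
curve: 98.0%)
• Gel panels can be detected at high precision (0.951), though at
low recall (0.379)
• Gene/protein name recognition in gel labels at satisfactory
precision (about 65% of recognized tokens correct)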
→ Image mining from gel diagrams is feasible
Future Work:
• Relation extraction
• Combination with classical text mining techniques
• Other named entity types: cell lines, drugs, ...
• Standard for biomedical diagrams?

Thank you for your Attention!
Questions?
