(CC Attribution License does not apply to included third-party material on slides 5 and 17; see the paper for the references: http://www.tkuhn.ch/pub/kuhn2012smbm.pdf )
Image Mining from Gel Diagrams in Biomedical Publications
1. Image Mining from Gel Diagrams in Biomedical Publications
Tobias Kuhn and Michael Krauthammer
Krauthammer Lab, Department of Pathology
Yale University School of Medicine
5th International Symposium on
Semantic Mining in Biomedicine (SMBM)
3 September 2012
Zurich, Switzerland
2. Introduction
The inclusion of figure images is a recent trend in the area of
literature mining.
The increasing amount of open access publications makes such
images available for automated analysis.
Image mining techniques can be used for image search interfaces,
for relation mining, and to complement text mining approaches.
T. Kuhn and M. Krauthammer, Yale University Image Mining from Gel Diagrams in Biomedical Publications 2 / 19
4. Gel Images
Our approach focuses on gel images:
• They are the result of gel electrophoresis (e.g. Southern,
Western and Northern blotting)
• They are often shown in biomedical publications as evidence for
the discussed findings (e.g. protein-protein interactions and
protein expression under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns
5. Relations from Gel Images
Condition     Measurement   Result
MDA-MB-231    14-3-3σ       high expression
NHEM          14-3-3σ       no expression
C8161.9       14-3-3σ       high expression
LOX           14-3-3σ       low expression
MDA-MB-231    β-actin       high expression
NHEM          β-actin       high expression
C8161.9       β-actin       high expression
LOX           β-actin       high expression

Condition                      Measurement   Result
IL-1β (–) DEX (–) RU486 (–)    p-p38         low expression
IL-1β (+) DEX (–) RU486 (–)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (–)    p-p38         no expression
IL-1β (+) DEX (+) RU486 (–)    p-p38         low expression
IL-1β (–) DEX (–) RU486 (+)    p-p38         no expression
IL-1β (+) DEX (–) RU486 (+)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (+)    p-p38         low expression
IL-1β (+) DEX (+) RU486 (+)    p-p38         high expression
...                            ...           ...
6. Image Mining Processes
In principle, image mining involves the same processes as classical
literature mining¹ (with some subtle but important differences):
• Document categorization (image categorization has to deal
with the two-dimensional space of pixels, instead of text)
• Named entity tagging (pinpointing the mention of an entity is
more difficult with images; OCR errors have to be considered)
• Fact extraction (analysis of graphical elements instead of
parsing complete sentences)
• Collection-wide analysis
¹ Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics, 67(1-3):7–18.
7. Procedure
[Diagram: articles →(1) figures →(2) segments →(3) text →(4) gels →(5) gel panels →(6) named entities →(7) relations]
1 Figure Extraction
2 Segmentation
3 Text Recognition
4 Gel Segment Detection
5 Gel Panel Detection
6 Named Entity Recognition
7 Relation Extraction
8. Figure Extraction
[Diagram: pipeline step 1, articles → figures]
We use structured XML files of the open access subset of PubMed
Central.
(Figure extraction from PDF files or even bitmaps of scanned articles
would be more difficult, but definitely feasible.)
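
As a rough illustration of step 1, the sketch below pulls figure entries out of an article's XML. It assumes the JATS-style layout used by PubMed Central, where each figure is a `fig` element with a `label`, a `caption`, and a `graphic` element carrying an `xlink:href` attribute. The sample document and the name `extract_figures` are made up for this example; the slides do not show the authors' actual code.

```python
import xml.etree.ElementTree as ET

# Attribute name of xlink:href in ElementTree's {namespace}local notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily shortened article used only for this sketch.
SAMPLE_ARTICLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of protein expression.</p></caption>
      <graphic xlink:href="example-1-1"/>
    </fig>
    <fig id="F2">
      <label>Figure 2</label>
      <caption><p>Schematic overview.</p></caption>
      <graphic xlink:href="example-1-2"/>
    </fig>
  </body>
</article>"""


def extract_figures(xml_text):
    """Return one dict per <fig> element with its id, label, caption and image reference."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        caption = fig.find("caption")
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": "".join(caption.itertext()).strip() if caption is not None else "",
            "image": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


for figure in extract_figures(SAMPLE_ARTICLE):
    print(figure["label"], figure["image"])
```

The image reference would still have to be resolved to an actual image file in the article package, which this sketch leaves out.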
9. Segmentation and Text Recognition
[Diagram: pipeline steps 2 and 3, producing segments and text]
For segmentation and text recognition we rely on our previous work.²
This includes:
• Detection of layout elements
• Text region detection
• OCR (using the Microsoft Document Imaging package of MS
Office)
² Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for biomedical images. Journal of Biomedical Informatics, 43(6):924–931, December.
  Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
10. Gel Segment Detection
[Diagram: pipeline step 4, producing gels]
Random forest classifiers (based on 75 random trees) on the following
features of image segments:
• coordinates of the relative position within the image
• relative and absolute width and height
• 16 grayscale histogram features
• color features: red, green and blue
• 13 texture features
• number of recognized characters
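
To make the feature list above more concrete, here is a minimal sketch that assembles part of such a feature vector for one segment in plain Python. Several things are assumptions of this example rather than details from the slides: the segment is given as rows of `(r, g, b)` tuples, grayscale is the plain average of the three channels, the color features are taken to be channel means, and the 13 texture features are left out entirely. The function name `segment_features` is hypothetical.

```python
def segment_features(pixels, box, image_size, n_chars):
    """Build a feature list for one image segment.

    pixels:     rows of (r, g, b) tuples covering the segment (assumed layout)
    box:        (x, y, width, height) of the segment in image pixels
    image_size: (width, height) of the whole figure
    n_chars:    number of characters recognized by OCR inside the segment
    """
    x, y, w, h = box
    img_w, img_h = image_size
    flat = [p for row in pixels for p in row]
    n = len(flat)

    # 16-bin grayscale histogram, normalized so that the bins sum to 1.
    hist = [0] * 16
    for r, g, b in flat:
        gray = (r + g + b) // 3          # assumed grayscale conversion, 0..255
        hist[gray * 16 // 256] += 1
    hist = [count / n for count in hist]

    # Color features, assumed here to be the mean of each channel.
    mean_rgb = [sum(p[i] for p in flat) / n for i in range(3)]

    return (
        [x / img_w, y / img_h]           # relative position within the image
        + [w / img_w, h / img_h, w, h]   # relative and absolute width and height
        + hist
        + mean_rgb
        + [n_chars]
    )


# Tiny 2x2 checkerboard segment as a usage example.
checker = [[(0, 0, 0), (255, 255, 255)], [(255, 255, 255), (0, 0, 0)]]
print(segment_features(checker, (10, 20, 2, 2), (100, 200), 3))
```

Vectors of this kind could then be passed to any random forest implementation configured with 75 trees, for instance scikit-learn's `RandomForestClassifier(n_estimators=75)`; the slides do not say which toolkit was used.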
11. Gel Segment Detection Results
Manually annotated training and testing sets of 500 random figures
each.
Results for three different thresholds:
                 Threshold   Precision   Recall   F-score
high recall      0.15        0.439       0.909    0.592
                 0.30        0.765       0.739    0.752
high precision   0.60        0.926       0.301    0.455
Accuracy (area under ROC curve): 98.0%
Unbalanced set: 3% gel segments vs. 97% non-gel segments
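
The following sketch shows how precision, recall and F-score depend on the classifier threshold, using the standard definitions (F-score as the harmonic mean of precision and recall). The toy scores and labels are invented for illustration. Plugging the precision and recall values from the table into `f_score` reproduces the reported F-scores for the first two rows; the third row comes out slightly lower, presumably because the table values are themselves rounded.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(scores, labels, threshold):
    """Classify a segment as gel if its score reaches the threshold, then evaluate."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


# Invented classifier scores and true labels (True = gel segment).
scores = [0.9, 0.7, 0.4, 0.2, 0.1]
labels = [True, True, False, True, False]

for threshold in (0.15, 0.30, 0.60):
    print(threshold, precision_recall_f(scores, labels, threshold))

# Consistency check against the first two table rows.
print(round(f_score(0.439, 0.909), 3), round(f_score(0.765, 0.739), 3))
```

A low threshold accepts more segments and so favors recall, while a high threshold favors precision, which matches the trend in the table.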
12. Gel Panel Detection
[Diagram: pipeline step 5, producing gel panels]
Algorithm:
• Start with a gel segment according to the high-precision classifier
• Repeatedly look for adjacent gel segments according to the
high-recall classifier, and merge them
• Collect labels in the form of text segments around the detected
gel region
Results on another set of 500 manually annotated figures:
Precision Recall F-score
0.951 0.379 0.542
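
The panel detection algorithm above can be sketched as a simple region-growing loop. This is only an illustration under stated assumptions: segments are axis-aligned boxes `(x, y, width, height)` with a classifier score, the thresholds 0.60 and 0.15 are the high-precision and high-recall settings from the gel segment results, and "adjacent" is taken to mean "within a small pixel gap", with `gap=5` chosen arbitrarily. All function names are hypothetical.

```python
HIGH_PRECISION = 0.60   # seed threshold (high-precision classifier)
HIGH_RECALL = 0.15      # growth threshold (high-recall classifier)


def adjacent(a, b, gap=5):
    """True if boxes (x, y, w, h) overlap or lie within `gap` pixels of each other."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - gap <= bx + bw and bx - gap <= ax + aw
            and ay - gap <= by + bh and by - gap <= ay + ah)


def bounding_box(boxes):
    """Smallest box (x, y, w, h) enclosing all given boxes."""
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


def detect_panels(segments, text_segments):
    """segments: list of (box, gel_score); text_segments: list of (box, text)."""
    used = set()
    panels = []
    for i, (box, score) in enumerate(segments):
        if score < HIGH_PRECISION or i in used:
            continue                      # only confident, unused segments seed a panel
        members = [i]
        used.add(i)
        grown = True
        while grown:                      # repeat until no further segment can be merged
            grown = False
            for j, (other, other_score) in enumerate(segments):
                if j in used or other_score < HIGH_RECALL:
                    continue
                if any(adjacent(segments[m][0], other) for m in members):
                    members.append(j)
                    used.add(j)
                    grown = True
        region = bounding_box([segments[m][0] for m in members])
        labels = [text for tbox, text in text_segments if adjacent(region, tbox)]
        panels.append({"box": region, "labels": labels})
    return panels


segments = [((0, 0, 100, 20), 0.8), ((0, 22, 100, 20), 0.2),
            ((0, 200, 100, 20), 0.3), ((300, 0, 50, 50), 0.05)]
texts = [((104, 0, 30, 10), "p-p38"), ((500, 500, 30, 10), "B")]
print(detect_panels(segments, texts))
```

Seeding only from high-precision hits and growing with the high-recall classifier is consistent with the reported result profile: few false panels, at the cost of missing panels that contain no confidently classified segment.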
13. Named Entity Recognition
[Diagram: pipeline step 6, producing named entities]
Detection of gene and protein names in gel labels:
• Tokenization of gel label texts
• Lookup in Entrez Gene database
• Case-sensitive matching
• Exclude tokens:
  – Less than 3 characters
  – Arabic or Roman numerals
  – Common short words (from a list of the 100 most frequent words in biomedical articles)
  – 22 general words frequently used in gel diagrams (e.g. min, hrs, line, type, protein, DNA)
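
The lookup and exclusion steps above might look roughly like the following sketch. The small sets stand in for the Entrez Gene database and for the two word lists, and their contents are placeholders chosen for this example. Splitting on non-alphanumeric characters, comparing excluded words case-insensitively, and the crude numeral pattern are further assumptions; the slides do not specify these details.

```python
import re

# Placeholder data (assumptions): the real pipeline looks names up in Entrez Gene
# and uses lists of 100 common words and 22 gel-specific words.
GENE_NAMES = {"p38", "AKT1", "GAPDH", "TP53"}
COMMON_WORDS = {"the", "and", "with", "for"}
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "dna"}

# Crude pattern for upper-case numerals such as III or VIII (assumed rule).
NUMERAL = re.compile(r"^[IVXLCDM]+$")


def tokenize(label):
    """Split a gel label into alphanumeric tokens (assumed tokenization)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", label) if t]


def is_excluded(token):
    """Apply the exclusion rules listed on the slide."""
    return (len(token) < 3
            or token.isdigit()
            or NUMERAL.match(token) is not None
            or token.lower() in COMMON_WORDS
            or token.lower() in GEL_WORDS)


def recognize_genes(label):
    """Return tokens that survive the exclusion rules and match a gene name exactly."""
    return [t for t in tokenize(label) if not is_excluded(t) and t in GENE_NAMES]


print(recognize_genes("p-p38 30 min"))
print(recognize_genes("AKT1 akt1 III DNA"))
```

Because matching is case-sensitive, `akt1` is not recognized even though `AKT1` is in the placeholder set.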
14. Named Entity Recognition Results
Recognized gene/protein tokens in 2000 random figures:
                                               absolute   relative
Total                                          156        100.0%
Incorrect                                      54         34.6%
  – Not mentioned (OCR errors)                 28         17.9%
  – Not references to genes or proteins        26         16.7%
Correct                                        102        65.4%
  – Partially correct (could be more specific) 14         9.0%
  – Fully correct                              88         56.4%
15. Relation Extraction
[Diagram: pipeline step 7, producing relations]
Relation extraction is future work and we do not have concrete
results at this point.
It would involve the following steps:
• Gene/protein name disambiguation
• Identify semantic roles (condition, measurement, ...)
• Quantify degree of expression
Combination with classical text mining techniques seems promising.
16. Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed
Central:
Total articles                            410 950
Processed articles                        386 428
Total figures from processed articles   1 110 643
Processed figures                         884 152
Detected gel panels                        85 942
Detected gel panels per figure              0.097
Detected gel labels                       309 340
Detected gel labels per panel               3.599
Detected gene tokens                    1 854 609
Detected gene tokens in gel labels         75 610
Gene token ratio                            0.033
Gene token ratio in gel labels              0.068
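
The two per-item ratios in the table follow directly from the counts, as this short check shows. The variable names are chosen for this example. The gene token ratios cannot be recomputed in the same way, because the total token counts they are divided by do not appear on the slide.

```python
# Counts copied from the table above.
processed_figures = 884_152
gel_panels = 85_942
gel_labels = 309_340

panels_per_figure = gel_panels / processed_figures
labels_per_panel = gel_labels / gel_panels

print(f"Detected gel panels per figure: {panels_per_figure:.3f}")
print(f"Detected gel labels per panel:  {labels_per_panel:.3f}")
```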
17. Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory
accuracy, but it is clear that this procedure is far from perfect.
Shouldn’t we standardize biomedical diagrams? A Unified
Modeling Language (UML) for biomedicine?
18. Conclusions and Future Work
Conclusions:
• Gel segments can be detected with high accuracy
• Detection of gel panels at high precision
• Gene/protein name recognition in gel labels at satisfactory
precision
→ Image mining from gel diagrams is feasible
Future Work:
• Relation extraction
• Combination with classical text mining techniques
• Other named entity types: cell lines, drugs, ...
• Standard for biomedical diagrams?
19. Thank you for your Attention!
Questions?