(CC Attribution License does not apply to included third-party material on slides 5 and 17; see the paper for the references: http://www.tkuhn.ch/pub/kuhn2012smbm.pdf )

Image Mining from Gel Diagrams in Biomedical Publications

  1. Image Mining from Gel Diagrams in Biomedical Publications
     Tobias Kuhn and Michael Krauthammer
     Krauthammer Lab, Department of Pathology, Yale University School of Medicine
     5th International Symposium on Semantic Mining in Biomedicine (SMBM), 3 September 2012, Zurich, Switzerland
  2. Introduction
     • The inclusion of figure images is a recent trend in the area of literature mining.
     • The increasing amount of open access publications makes such images available for automated analysis.
     • Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.
     T. Kuhn and M. Krauthammer, Yale University, Image Mining from Gel Diagrams in Biomedical Publications
  3. Yale Image Finder
     http://krauthammerlab.med.yale.edu/imagefinder/
  4. Gel Images
     Our approach focuses on gel images:
     • They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)
     • They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expression under different conditions)
     • About 15% of all subfigures are gel images
     • They are structured according to common regular patterns
  5. Relations from Gel Images

     | Condition  | Measurement | Result          |
     |------------|-------------|-----------------|
     | MDA-MB-231 | 14-3-3σ     | high expression |
     | NHEM       | 14-3-3σ     | no expression   |
     | C8161.9    | 14-3-3σ     | high expression |
     | LOX        | 14-3-3σ     | low expression  |
     | MDA-MB-231 | β-actin     | high expression |
     | NHEM       | β-actin     | high expression |
     | C8161.9    | β-actin     | high expression |
     | LOX        | β-actin     | high expression |

     | Condition                   | Measurement | Result          |
     |-----------------------------|-------------|-----------------|
     | IL-1β (–) DEX (–) RU486 (–) | p-p38       | low expression  |
     | IL-1β (+) DEX (–) RU486 (–) | p-p38       | high expression |
     | IL-1β (–) DEX (+) RU486 (–) | p-p38       | no expression   |
     | IL-1β (+) DEX (+) RU486 (–) | p-p38       | low expression  |
     | IL-1β (–) DEX (–) RU486 (+) | p-p38       | no expression   |
     | IL-1β (+) DEX (–) RU486 (+) | p-p38       | high expression |
     | IL-1β (–) DEX (+) RU486 (+) | p-p38       | low expression  |
     | IL-1β (+) DEX (+) RU486 (+) | p-p38       | high expression |
     | ...                         | ...         | ...             |
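Each row on this slide can be read as a (condition, measurement, result) triple. The following is a minimal sketch of such a record in Python. The `GelRelation` type and its field names are assumptions made for illustration and do not come from the slides.

```python
from typing import NamedTuple


class GelRelation(NamedTuple):
    """One relation read off a gel image (hypothetical type, not from the slides)."""
    condition: str    # cell line or treatment combination of the lane
    measurement: str  # gene or protein whose band is measured
    result: str       # qualitative band intensity


# Rows copied from the first table on the slide.
relations = [
    GelRelation("MDA-MB-231", "14-3-3σ", "high expression"),
    GelRelation("NHEM", "14-3-3σ", "no expression"),
    GelRelation("C8161.9", "14-3-3σ", "high expression"),
    GelRelation("LOX", "14-3-3σ", "low expression"),
]

# Conditions under which 14-3-3σ shows high expression.
high = [r.condition for r in relations
        if r.measurement == "14-3-3σ" and r.result == "high expression"]
print(high)
```

Keeping the three roles as separate fields makes collection-wide queries, such as the filter above, straightforward.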
  6. Image Mining Processes
     In principle, image mining involves the same processes as classical literature mining [1] (with some subtle but important differences):
     • Document categorization (image categorization has to deal with the two-dimensional space of pixels, instead of text)
     • Named entity tagging (pinpointing the mention of an entity is more difficult with images; OCR errors have to be considered)
     • Fact extraction (analysis of graphical elements instead of parsing complete sentences)
     • Collection-wide analysis
     [1] Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics, 67(1-3):7–18.
  7. Procedure
     Pipeline diagram: articles → figures → segments → text → gels → gel panels → named entities → relations
     • Step 1: Figure Extraction (produces figures from articles)
     • Step 2: Segmentation (produces segments)
     • Step 3: Text Recognition (produces text)
     • Step 4: Gel Segment Detection (produces gels)
     • Step 5: Gel Panel Detection (produces gel panels)
     • Step 6: Named Entity Recognition (produces named entities)
     • Step 7: Relation Extraction (produces relations)
  8. Figure Extraction
     Step 1: articles → figures
     We use structured XML files of the open access subset of PubMed Central. (Figure extraction from PDF files or even bitmaps of scanned articles would be more difficult, but definitely feasible.)
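The slides do not show the extraction code. As a rough sketch, assuming the articles follow the NLM/JATS-style markup used by PubMed Central (figures as `<fig>` elements with a `<label>`, a `<caption>` and a `<graphic xlink:href="...">`), figures could be listed with the Python standard library as follows. The sample document and the `extract_figures` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xlink:href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Made-up miniature article; real files are much larger.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of p-p38.</p></caption>
      <graphic xlink:href="example-1-1"/>
    </fig>
    <fig id="F2">
      <label>Figure 2</label>
      <caption><p>Expression of 14-3-3 sigma.</p></caption>
      <graphic xlink:href="example-1-2"/>
    </fig>
  </body>
</article>"""


def extract_figures(xml_text):
    """Return id, label, caption text and image reference of every <fig>."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find("graphic")
        caption = fig.find("caption")
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": "".join(caption.itertext()).strip() if caption is not None else "",
            "image": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


figures = extract_figures(SAMPLE)
for f in figures:
    print(f["id"], f["label"], f["image"])
```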
  9. Segmentation and Text Recognition
     Steps 2 and 3: figures → segments → text
     For segmentation and text recognition we rely on our previous work [2]. This includes:
     • Detection of layout elements
     • Text region detection
     • OCR (using the Microsoft Document Imaging package of MS Office)
     [2] Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for biomedical images. J. of Biomedical Informatics, 43(6):924–931, December.
         Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
  10. Gel Segment Detection
     Step 4: segments → gels
     Random forest classifiers (based on 75 random trees) on the following features of image segments:
     • coordinates of the relative position within the image
     • relative and absolute width and height
     • 16 grayscale histogram features
     • color features: red, green and blue
     • 13 texture features
     • number of recognized characters
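To make the feature list concrete, here is a minimal sketch that builds part of such a feature vector in plain Python: relative position, relative and absolute size, a 16-bin grayscale histogram, and the number of recognized characters. The exact feature definitions, the segment dictionary layout and the function names are assumptions. The color and texture features are left out.

```python
def grayscale_histogram(pixels, bins=16):
    """Normalized histogram of 8-bit grayscale values (0..255)."""
    counts = [0] * bins
    total = 0
    for row in pixels:
        for value in row:
            counts[value * bins // 256] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


def segment_features(seg, image_width, image_height):
    """Feature vector of one segment (assumed layout: x, y, w, h, pixels, text)."""
    x, y, w, h = seg["x"], seg["y"], seg["w"], seg["h"]
    features = [
        (x + w / 2) / image_width,   # relative horizontal position of the center
        (y + h / 2) / image_height,  # relative vertical position of the center
        w / image_width,             # relative width
        h / image_height,            # relative height
        w,                           # absolute width
        h,                           # absolute height
    ]
    features += grayscale_histogram(seg["pixels"])  # 16 histogram features
    features.append(len(seg.get("text", "")))       # number of recognized characters
    return features


segment = {"x": 10, "y": 20, "w": 4, "h": 2,
           "pixels": [[0, 0, 255, 255], [0, 16, 128, 255]],
           "text": ""}
vector = segment_features(segment, image_width=100, image_height=50)
print(len(vector))
```

Vectors of this kind could then be passed to any random forest implementation configured with 75 trees. The slides do not say which implementation was used.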
  11. Gel Segment Detection Results
     Manually annotated training and testing sets of 500 random figures each. Results for three different thresholds:

     | Setting        | Threshold | Precision | Recall | F-score |
     |----------------|-----------|-----------|--------|---------|
     | high recall    | 0.15      | 0.439     | 0.909  | 0.592   |
     |                | 0.30      | 0.765     | 0.739  | 0.752   |
     | high precision | 0.60      | 0.926     | 0.301  | 0.455   |

     Accuracy (area under ROC curve): 98.0%
     Unbalanced set: 3% gel segments vs. 97% non-gel segments
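The reported F-scores are consistent with the balanced F-score, i.e. the harmonic mean of precision and recall. A short sketch that recomputes them from the rounded precision and recall values on the slide (the function name is an assumption):

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall) as reported on the slide.
rows = [
    (0.15, 0.439, 0.909),
    (0.30, 0.765, 0.739),
    (0.60, 0.926, 0.301),
]
for threshold, p, r in rows:
    print(f"threshold {threshold:.2f}: F = {f_score(p, r):.3f}")
```

For the first two thresholds this reproduces 0.592 and 0.752. For the last one the recomputed value is 0.454 rather than the reported 0.455, presumably because the slide's precision and recall are themselves rounded.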
  12. Gel Panel Detection
     Step 5: gels → gel panels
     Algorithm:
     • Start with a gel segment according to the high-precision classifier
     • Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them
     • Collect labels in the form of text segments around the detected gel region
     Results on another set of 500 manually annotated figures:

     | Precision | Recall | F-score |
     |-----------|--------|---------|
     | 0.951     | 0.379  | 0.542   |
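The three steps of the algorithm amount to seeded region growing. The following is a rough sketch, assuming each segment carries a bounding box, a classifier score and optional OCR text. The 0.60 and 0.15 thresholds are the high-precision and high-recall settings from the gel segment results; the adjacency test, its `gap` tolerance and all names are assumptions.

```python
HIGH_PRECISION = 0.60  # seed threshold (high-precision setting)
HIGH_RECALL = 0.15     # growth threshold (high-recall setting)


def adjacent(a, b, gap=2):
    """True if boxes (x0, y0, x1, y1) overlap or lie within `gap` pixels."""
    return (a[0] <= b[2] + gap and b[0] <= a[2] + gap and
            a[1] <= b[3] + gap and b[1] <= a[3] + gap)


def union(a, b):
    """Smallest box that contains both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def detect_panels(segments):
    """segments: list of dicts with 'box', 'score' (gel probability) and 'text'."""
    used = set()
    panels = []
    for i, seed in enumerate(segments):
        if i in used or seed["score"] < HIGH_PRECISION:
            continue
        used.add(i)
        box = seed["box"]
        grown = True
        while grown:  # repeat until no further adjacent gel segment is found
            grown = False
            for j, seg in enumerate(segments):
                if (j not in used and seg["score"] >= HIGH_RECALL
                        and adjacent(box, seg["box"])):
                    box = union(box, seg["box"])
                    used.add(j)
                    grown = True
        # Text segments next to the final panel region become its labels.
        labels = [s["text"] for k, s in enumerate(segments)
                  if k not in used and s.get("text") and adjacent(box, s["box"])]
        panels.append({"box": box, "labels": labels})
    return panels


segments = [
    {"box": (10, 10, 50, 30), "score": 0.90, "text": ""},        # seed
    {"box": (51, 10, 90, 30), "score": 0.20, "text": ""},        # merged into the seed
    {"box": (92, 12, 120, 28), "score": 0.01, "text": "p-p38"},  # label next to the panel
    {"box": (200, 200, 240, 220), "score": 0.30, "text": ""},    # isolated, never seeded
]
panels = detect_panels(segments)
print(panels)
```

Seeding with the strict threshold and growing with the lenient one matches the reported trade-off: high precision at the cost of recall.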
  13. Named Entity Recognition
     Step 6: gel panels → named entities
     Detection of gene and protein names in gel labels:
     • Tokenization of gel label texts
     • Lookup in Entrez Gene database
     • Case-sensitive matching
     • Exclude tokens:
       • Less than 3 characters
       • Arabic or Roman (Latin) numerals
       • Common short words (from a list of the 100 most frequent words in biomedical articles)
       • 22 general words frequently used in gel diagrams (e.g. min, hrs, line, type, protein, DNA)
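The lookup and exclusion rules can be sketched as a simple token filter. The dictionary below is a tiny hypothetical stand-in for Entrez Gene, the common-word list is a made-up excerpt, and only the six gel words named on the slide are included. The tokenizer and the Roman numeral pattern are likewise assumptions.

```python
import re

# Hypothetical stand-ins; the real lists are far larger.
GENE_DICTIONARY = {"AKT1", "TP53", "GAPDH"}
COMMON_WORDS = {"the", "and", "with"}
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "DNA"}

# Strict Roman numeral pattern (upper case).
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def recognize_genes(label_text):
    """Return the tokens of a gel label that match the gene dictionary."""
    hits = []
    for token in re.findall(r"[\w-]+", label_text):
        if len(token) < 3:
            continue  # fewer than 3 characters
        if token.isdigit() or ROMAN.fullmatch(token.upper()):
            continue  # Arabic or Roman numerals
        if token.lower() in COMMON_WORDS or token in GEL_WORDS:
            continue  # common short words and general gel vocabulary
        if token in GENE_DICTIONARY:  # case-sensitive lookup
            hits.append(token)
    return hits


print(recognize_genes("AKT1 30 min III akt1 GAPDH type"))
```

In this example `akt1` is rejected because matching is case-sensitive, while `30`, `III`, `min` and `type` are removed by the exclusion rules.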
  14. Named Entity Recognition Results
     Recognized gene/protein tokens in 2000 random figures:

     | Category                                    | absolute | relative |
     |---------------------------------------------|----------|----------|
     | Total                                       | 156      | 100.0%   |
     | Incorrect                                   | 54       | 34.6%    |
     | – Not mentioned (OCR errors)                | 28       | 17.9%    |
     | – Not references to genes or proteins       | 26       | 16.7%    |
     | Correct                                     | 102      | 65.4%    |
     | – Partially correct (could be more specific) | 14      | 9.0%     |
     | – Fully correct                             | 88       | 56.4%    |
  15. Relation Extraction
     Step 7: named entities → relations
     Relation extraction is future work and we do not have concrete results at this point. It would involve the following steps:
     • Gene/protein name disambiguation
     • Identify semantic roles (condition, measurement, ...)
     • Quantify degree of expression
     Combination with classical text mining techniques seems promising.
  16. Overall Results on PubMed Central
     We ran our pipeline on the whole open access subset of PubMed Central:

     | Quantity                             | Value     |
     |--------------------------------------|-----------|
     | Total articles                       | 410 950   |
     | Processed articles                   | 386 428   |
     | Total figures from processed articles | 1 110 643 |
     | Processed figures                    | 884 152   |
     | Detected gel panels                  | 85 942    |
     | Detected gel panels per figure       | 0.097     |
     | Detected gel labels                  | 309 340   |
     | Detected gel labels per panel        | 3.599     |
     | Detected gene tokens                 | 1 854 609 |
     | Detected gene tokens in gel labels   | 75 610    |
     | Gene token ratio                     | 0.033     |
     | Gene token ratio in gel labels       | 0.068     |
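The per-figure and per-panel values follow directly from the counts, as the short check below shows. The two coverage fractions are derived here and are not stated on the slide. The two gene token ratios cannot be rederived, because the slide does not give the total number of tokens they are based on. Variable names are assumptions.

```python
# Counts as reported on the slide.
total_articles = 410_950
processed_articles = 386_428
total_figures = 1_110_643
processed_figures = 884_152
gel_panels = 85_942
gel_labels = 309_340

panels_per_figure = gel_panels / processed_figures
labels_per_panel = gel_labels / gel_panels
article_coverage = processed_articles / total_articles  # derived, not on the slide
figure_coverage = processed_figures / total_figures     # derived, not on the slide

print(f"gel panels per figure: {panels_per_figure:.3f}")
print(f"gel labels per panel:  {labels_per_panel:.3f}")
print(f"processed articles:    {article_coverage:.1%}")
print(f"processed figures:     {figure_coverage:.1%}")
```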
  17. Discussion: Standardized Biomedical Diagrams?
     It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.
     Shouldn’t we standardize biomedical diagrams? A Unified Modeling Language (UML) for biomedicine?
  18. Conclusions and Future Work
     Conclusions:
     • Gel segments can be detected with high accuracy
     • Detection of gel panels at high precision
     • Gene/protein name recognition in gel labels at satisfactory precision
     → Image mining from gel diagrams is feasible
     Future Work:
     • Relation extraction
     • Combination with classical text mining techniques
     • Other named entity types: cell lines, drugs, ...
     • Standard for biomedical diagrams?
  19. Thank you for your attention! Questions?