Image Mining from Gel Diagrams in
Biomedical Publications
Tobias Kuhn and Michael Krauthammer
Krauthammer Lab, Department of Pathology
Yale University School of Medicine
5th International Symposium on
Semantic Mining in Biomedicine (SMBM)
3 September 2012
Zurich, Switzerland
Introduction
The inclusion of figure images is a recent trend in the area of
literature mining.
The increasing number of open access publications makes such
images available for automated analysis.
Image mining techniques can be used for image search interfaces,
for relation mining, and to complement text mining approaches.
T. Kuhn and M. Krauthammer, Yale University Image Mining from Gel Diagrams in Biomedical Publications 2 / 19
Yale Image Finder
http://krauthammerlab.med.yale.edu/imagefinder/
Gel Images
Our approach focuses on gel images:
• They are the result of gel electrophoresis (e.g. Southern,
Western and Northern blotting)
• They are often shown in biomedical publications as evidence for
the discussed findings (e.g. protein-protein interactions and
protein expression under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns
Relations from Gel Images
Condition     Measurement   Result
MDA-MB-231    14-3-3σ       high expression
NHEM          14-3-3σ       no expression
C8161.9       14-3-3σ       high expression
LOX           14-3-3σ       low expression
MDA-MB-231    β-actin       high expression
NHEM          β-actin       high expression
C8161.9       β-actin       high expression
LOX           β-actin       high expression

Condition                       Measurement   Result
IL-1β (–) DEX (–) RU486 (–)     p-p38         low expression
IL-1β (+) DEX (–) RU486 (–)     p-p38         high expression
IL-1β (–) DEX (+) RU486 (–)     p-p38         no expression
IL-1β (+) DEX (+) RU486 (–)     p-p38         low expression
IL-1β (–) DEX (–) RU486 (+)     p-p38         no expression
IL-1β (+) DEX (–) RU486 (+)     p-p38         high expression
IL-1β (–) DEX (+) RU486 (+)     p-p38         low expression
IL-1β (+) DEX (+) RU486 (+)     p-p38         high expression
...                             ...           ...
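The rows above can be read as (condition, measurement, result) triples. A minimal Python sketch of one way to hold and query them follows; the `GelRelation` class, its field names and the `conditions_with` helper are hypothetical and not part of the presented system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GelRelation:
    condition: str    # lane label, e.g. a cell line or a treatment combination
    measurement: str  # row label, e.g. the probed protein
    result: str       # "no expression", "low expression" or "high expression"


# Rows copied from the first example table above.
relations = [
    GelRelation("MDA-MB-231", "14-3-3σ", "high expression"),
    GelRelation("NHEM", "14-3-3σ", "no expression"),
    GelRelation("C8161.9", "14-3-3σ", "high expression"),
    GelRelation("LOX", "14-3-3σ", "low expression"),
]


def conditions_with(result, measurement, rels):
    """Return the conditions under which `measurement` shows `result`."""
    return [r.condition for r in rels
            if r.measurement == measurement and r.result == result]


print(conditions_with("high expression", "14-3-3σ", relations))
```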
Image Mining Processes
In principle, image mining involves the same processes as classical
literature mining [1] (with some subtle but important differences):
• Document categorization (image categorization has to deal
with the two-dimensional space of pixels, instead of text)
• Named entity tagging (pinpointing the mention of an entity is
more difficult with images; OCR errors have to be considered)
• Fact extraction (analysis of graphical elements instead of
parsing complete sentences)
• Collection-wide analysis
[1] Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature.
International Journal of Medical Informatics, 67(1-3):7–18.
Procedure
[Pipeline diagram: articles → figures → segments → text → gels → gel panels → named entities → relations, connected by steps 1 to 7 listed below]
1 Figure Extraction
2 Segmentation
3 Text Recognition
4 Gel Segment Detection
5 Gel Panel Detection
6 Named Entity Recognition
7 Relation Extraction
Figure Extraction
[Diagram detail: step 1, articles → figures]
We use structured XML files of the open access subset of PubMed
Central.
(Figure extraction from PDF files or even bitmaps of scanned articles
would be more difficult, but definitely feasible.)
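As an illustration of why structured XML makes this step easy, here is a minimal sketch that lists figure references in one article. It assumes JATS-style tagging (`<fig>` elements containing a `<graphic>` with an `xlink:href` attribute); the `list_figures` helper and the sample document are invented for illustration and are not the code used in this work.

```python
import xml.etree.ElementTree as ET

# Attribute name of xlink:href in ElementTree's namespace notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def list_figures(xml_text):
    """Return (figure id, image reference) pairs for every <fig> element."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        if graphic is not None:
            figures.append((fig.get("id"), graphic.get(XLINK_HREF)))
    return figures


# Invented sample article with two figures.
sample = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1"><label>Figure 1</label><graphic xlink:href="example-1"/></fig>
    <fig id="F2"><label>Figure 2</label><graphic xlink:href="example-2"/></fig>
  </body>
</article>"""

print(list_figures(sample))
```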
Segmentation and Text Recognition
[Diagram detail: steps 2 and 3, segments and text]
For segmentation and text recognition we rely on our previous work [2].
This includes:
• Detection of layout elements
• Text region detection
• OCR (using the Microsoft Document Imaging package of MS
Office)
[2] Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for
biomedical images. Journal of Biomedical Informatics, 43(6):924–931, December.
Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region
detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
Gel Segment Detection
[Diagram detail: step 4, gels]
Random forest classifiers (based on 75 random trees) on the following
features of image segments:
• coordinates of the relative position within the image
• relative and absolute width and height
• 16 grayscale histogram features
• color features: red, green and blue
• 13 texture features
• number of recognized characters
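To make the feature list more concrete, the sketch below builds a feature vector for one segment in plain Python. It covers position, size, a 16-bin grayscale histogram, mean color and character count; the 13 texture features are left out. The function name, the pixel representation and the normalizations are assumptions made for illustration, not the implementation behind these slides.

```python
def segment_features(pixels, x, y, image_width, image_height, n_chars=0):
    """Build a feature vector for one image segment.

    pixels: rows of (r, g, b) tuples with values 0..255 (assumed layout).
    x, y:   top-left corner of the segment within the full image.
    """
    height = len(pixels)
    width = len(pixels[0])
    n = width * height

    hist = [0] * 16          # 16 grayscale bins, 16 gray levels each
    color_sums = [0, 0, 0]   # running sums of red, green, blue
    for row in pixels:
        for r, g, b in row:
            gray = (r + g + b) // 3
            hist[gray // 16] += 1
            color_sums[0] += r
            color_sums[1] += g
            color_sums[2] += b

    features = [
        x / image_width, y / image_height,           # relative position
        width / image_width, height / image_height,  # relative size
        width, height,                               # absolute size
    ]
    features += [count / n for count in hist]        # normalized histogram
    features += [s / (255 * n) for s in color_sums]  # mean color in 0..1
    features.append(n_chars)                         # recognized characters
    return features


# Tiny 2x2 segment: a black column next to a white column.
demo = [[(0, 0, 0), (255, 255, 255)],
        [(0, 0, 0), (255, 255, 255)]]
vector = segment_features(demo, x=10, y=20, image_width=100, image_height=200)
print(len(vector))
```

Vectors of this kind could then be passed to any random forest implementation configured with 75 trees.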
Gel Segment Detection Results
Manually annotated training and testing sets of 500 random figures
each.
Results for three different thresholds:
                 Threshold   Precision   Recall   F-score
high recall      0.15        0.439       0.909    0.592
                 0.30        0.765       0.739    0.752
high precision   0.60        0.926       0.301    0.455
Accuracy (area under ROC curve): 98.0%
Unbalanced set: 3% gel segments vs. 97% non-gel segments
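The F-score column is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded values in the table; because the inputs are already rounded, the recomputed value can differ from the reported one in the last digit. The `f_score` helper is an illustrative name.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall) as reported in the table above.
rows = [(0.15, 0.439, 0.909), (0.30, 0.765, 0.739), (0.60, 0.926, 0.301)]
for threshold, p, r in rows:
    print(f"threshold {threshold:.2f}: F = {f_score(p, r):.3f}")
```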
Gel Panel Detection
[Diagram detail: step 5, gel panels]
Algorithm:
• Start with a gel segment according to the high-precision classifier
• Repeatedly look for adjacent gel segments according to the
high-recall classifier, and merge them
• Collect labels in the form of text segments around the detected
gel region
Results on another set of 500 manually annotated figures:
Precision Recall F-score
0.951 0.379 0.542
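The seed-and-grow part of the algorithm can be sketched as follows. This is a simplified illustration under several assumptions: segments are axis-aligned boxes with a gel probability, the two classifiers are modeled as the 0.60 and 0.15 thresholds of one score, and "adjacent" means within a small pixel gap. Label collection is omitted, and all names are hypothetical.

```python
HIGH_PRECISION = 0.60  # seed threshold (assumed from the segment results)
HIGH_RECALL = 0.15     # growth threshold (assumed from the segment results)


def adjacent(a, b, gap=5):
    """True if boxes (x0, y0, x1, y1) overlap or lie within `gap` pixels."""
    return (a[0] <= b[2] + gap and b[0] <= a[2] + gap and
            a[1] <= b[3] + gap and b[1] <= a[3] + gap)


def grow_panels(segments):
    """segments: list of (box, gel_probability) pairs.

    Returns one sorted list of segment indices per detected panel.
    """
    used = set()
    panels = []
    for seed, (_, prob) in enumerate(segments):
        if prob < HIGH_PRECISION or seed in used:
            continue
        panel = [seed]
        used.add(seed)
        frontier = [seed]
        while frontier:  # repeatedly merge adjacent high-recall segments
            current_box = segments[frontier.pop()][0]
            for i, (box, p) in enumerate(segments):
                if i not in used and p >= HIGH_RECALL and adjacent(current_box, box):
                    used.add(i)
                    panel.append(i)
                    frontier.append(i)
        panels.append(sorted(panel))
    return panels


# Invented example: three stacked strips, one distant segment, one weak segment.
example = [
    ((0, 0, 50, 20), 0.90),
    ((0, 22, 50, 40), 0.20),
    ((0, 42, 50, 60), 0.30),
    ((200, 200, 250, 220), 0.40),
    ((0, 100, 50, 120), 0.05),
]
print(grow_panels(example))
```

Seeding only from high-confidence segments keeps precision high, while growing with the permissive threshold recovers neighboring strips that the strict threshold alone would miss.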
Named Entity Recognition
[Diagram detail: step 6, named entities]
Detection of gene and protein names in gel labels:
• Tokenization of gel label texts
• Lookup in Entrez Gene database
• Case-sensitive matching
• Exclude tokens:
• Less than 3 characters
• Arabic or Roman numerals
• Common short words (from a list of the 100 most frequent words
in biomedical articles)
• 22 general words frequently used in gel diagrams (e.g. min, hrs,
line, type, protein, DNA)
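The filtering rules above can be sketched as a small function. The gene dictionary and word lists below are toy stand-ins (the real lookup uses Entrez Gene, a 100-word frequency list and 22 gel-specific words), and the tokenization and the case-insensitive stop-word comparison are assumptions made for this illustration.

```python
import re

# Toy stand-ins for the real resources.
GENE_NAMES = {"STAT3", "AKT1", "p38", "MIN", "XII"}
COMMON_WORDS = {"the", "and", "with", "for"}
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "dna"}

ARABIC = re.compile(r"\d+")
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def gene_tokens(label):
    """Return the tokens of a gel label that pass all filters."""
    found = []
    for token in re.split(r"[^A-Za-z0-9-]+", label):
        if len(token) < 3:
            continue  # fewer than 3 characters
        if ARABIC.fullmatch(token) or ROMAN.fullmatch(token):
            continue  # Arabic numbers or Latin (Roman) numerals
        if token.lower() in COMMON_WORDS or token.lower() in GEL_WORDS:
            continue  # frequent or gel-specific general words
        if token in GENE_NAMES:  # case-sensitive dictionary lookup
            found.append(token)
    return found


print(gene_tokens("STAT3 and p38 after 30 min, lane XII"))
```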
Named Entity Recognition Results
Recognized gene/protein tokens in 2000 random figures:
                                               absolute   relative
Total                                          156        100.0%
Incorrect                                      54         34.6%
– Not mentioned (OCR errors)                   28         17.9%
– Not references to genes or proteins          26         16.7%
Correct                                        102        65.4%
– Partially correct (could be more specific)   14         9.0%
– Fully correct                                88         56.4%
Relation Extraction
[Diagram detail: step 7, relations]
Relation extraction is future work and we do not have concrete
results at this point.
It would involve the following steps:
• Gene/protein name disambiguation
• Identify semantic roles (condition, measurement, ...)
• Quantify degree of expression
Combination with classical text mining techniques seems promising.
Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed
Central:
Total articles                           410 950
Processed articles                       386 428
Total figures from processed articles    1 110 643
Processed figures                        884 152
Detected gel panels                      85 942
Detected gel panels per figure           0.097
Detected gel labels                      309 340
Detected gel labels per panel            3.599
Detected gene tokens                     1 854 609
Detected gene tokens in gel labels       75 610
Gene token ratio                         0.033
Gene token ratio in gel labels           0.068
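The two per-item rates in this table follow directly from the counts, as the short check below shows. The variable names are illustrative; the gene token ratios are not recomputed because the total token counts they are based on are not given here.

```python
# Counts copied from the table above.
processed_figures = 884_152
gel_panels = 85_942
gel_labels = 309_340

panels_per_figure = gel_panels / processed_figures
labels_per_panel = gel_labels / gel_panels

# Both values agree with the reported 0.097 and 3.599.
print(f"gel panels per figure: {panels_per_figure:.3f}")
print(f"gel labels per panel:  {labels_per_panel:.3f}")
```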
Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory
accuracy, but it is clear that this procedure is far from perfect.
Shouldn’t we standardize biomedical diagrams? A Unified
Modeling Language (UML) for biomedicine?
Conclusions and Future Work
Conclusions:
• Gel segments can be detected with high accuracy
• Detection of gel panels at high precision
• Gene/protein name recognition in gel labels at satisfactory
precision
→ Image mining from gel diagrams is feasible
Future Work:
• Relation extraction
• Combination with classical text mining techniques
• Other named entity types: cell lines, drugs, ...
• Standard for biomedical diagrams?
Thank you for your Attention!
Questions?

More Related Content

What's hot

A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...ijitcs
 
Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0TELKOMNIKA JOURNAL
 
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...CSCJournals
 
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...CSCJournals
 
Cheminformatics in drug design
Cheminformatics in drug designCheminformatics in drug design
Cheminformatics in drug designSurmil Shah
 
Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...IJCSEA Journal
 
An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identificationarx-deidentifier
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolarx-deidentifier
 
IEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarIEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarOgan Gurel MD
 
IRJET- Plant Disease Identification System
IRJET- Plant Disease Identification SystemIRJET- Plant Disease Identification System
IRJET- Plant Disease Identification SystemIRJET Journal
 
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...IJERA Editor
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurIAEME Publication
 
diffraction techniques
 diffraction techniques diffraction techniques
diffraction techniqueskarthi keyan
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...arx-deidentifier
 
Segmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosomeSegmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosomeAboul Ella Hassanien
 
Advances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic imagesAdvances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic imagesecij
 

What's hot (18)

(2011) Comparison of Face Image Quality Metrics
(2011) Comparison of Face Image Quality Metrics(2011) Comparison of Face Image Quality Metrics
(2011) Comparison of Face Image Quality Metrics
 
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
 
Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0
 
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
 
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
 
Cheminformatics in drug design
Cheminformatics in drug designCheminformatics in drug design
Cheminformatics in drug design
 
Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...
 
An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identification
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization tool
 
IEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarIEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU Seminar
 
IRJET- Plant Disease Identification System
IRJET- Plant Disease Identification SystemIRJET- Plant Disease Identification System
IRJET- Plant Disease Identification System
 
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructur
 
diffraction techniques
 diffraction techniques diffraction techniques
diffraction techniques
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
 
Segmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosomeSegmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosome
 
CV
CVCV
CV
 
Advances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic imagesAdvances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic images
 

Similar to Image Mining from Gel Diagrams in Biomedical Publications

Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
 
Introduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and CypherIntroduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and CypherAnjani Dhrangadhariya
 
Algorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsAlgorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsS P Sajjan
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesAshutosh Jogalekar
 
Images as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for SegmentationImages as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for Segmentationjohn236zaq
 
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...Institute of Information Systems (HES-SO)
 
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Kevin Mader
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1Double Check ĆŐNSULTING
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuAlexander Pico
 
Chemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsChemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsunyil96
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformaticsphilmaweb
 
BolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisBolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisJustin P. Bolinger
 
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...IOSR Journals
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsElena Sügis
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)Michael Atkins
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례mothersafe
 

Similar to Image Mining from Gel Diagrams in Biomedical Publications (20)

Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
Introduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and CypherIntroduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and Cypher
 
Algorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsAlgorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphs
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related Sciences
 
Images as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for SegmentationImages as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for Segmentation
 
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
 
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang Su
 
Viva201393(1).pptxbaru
Viva201393(1).pptxbaruViva201393(1).pptxbaru
Viva201393(1).pptxbaru
 
Research summary
Research summaryResearch summary
Research summary
 
Chemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsChemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientists
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
BolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisBolingerJustin - Honors Thesis
BolingerJustin - Honors Thesis
 
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
 
A01110107
A01110107A01110107
A01110107
 
Bio ontology drtc-seminar_anwesha
Bio ontology drtc-seminar_anweshaBio ontology drtc-seminar_anwesha
Bio ontology drtc-seminar_anwesha
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in Bioinformatics
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례
 

More from Tobias Kuhn

Nanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingNanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingTobias Kuhn
 
Linked Data Publishing with Nanopublications
Linked Data Publishing with NanopublicationsLinked Data Publishing with Nanopublications
Linked Data Publishing with NanopublicationsTobias Kuhn
 
Genuine semantic publishing
Genuine semantic publishingGenuine semantic publishing
Genuine semantic publishingTobias Kuhn
 
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of DataA Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of DataTobias Kuhn
 
The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer Tobias Kuhn
 
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...Tobias Kuhn
 
nanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublicationsnanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for NanopublicationsTobias Kuhn
 
Semantic Publishing and Nanopublications
Semantic Publishing and NanopublicationsSemantic Publishing and Nanopublications
Semantic Publishing and NanopublicationsTobias Kuhn
 
Scientific Data Publishing
Scientific Data PublishingScientific Data Publishing
Scientific Data PublishingTobias Kuhn
 
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...Tobias Kuhn
 
Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?Tobias Kuhn
 
Data Publishing and Post-Publication Reviews
Data Publishing and Post-Publication ReviewsData Publishing and Post-Publication Reviews
Data Publishing and Post-Publication ReviewsTobias Kuhn
 
Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications Tobias Kuhn
 
Meme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation NetworksMeme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation NetworksTobias Kuhn
 
A Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural LanguageA Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural LanguageTobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureTobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureTobias Kuhn
 
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Tobias Kuhn
 
Automatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen WikiAutomatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen WikiTobias Kuhn
 

Image Mining from Gel Diagrams in Biomedical Publications

  • 1. Image Mining from Gel Diagrams in Biomedical Publications
    Tobias Kuhn and Michael Krauthammer
    Krauthammer Lab, Department of Pathology, Yale University School of Medicine
    5th International Symposium on Semantic Mining in Biomedicine (SMBM)
    3 September 2012, Zurich, Switzerland
  • 2. Introduction
    The inclusion of figure images is a recent trend in the area of literature mining. The increasing number of open access publications makes such images available for automated analysis. Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.
  • 3. Yale Image Finder
    http://krauthammerlab.med.yale.edu/imagefinder/
  • 4. Gel Images
    Our approach focuses on gel images:
      • They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)
      • They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expression under different conditions)
      • About 15% of all subfigures are gel images
      • They are structured according to common regular patterns
  • 5. Relations from Gel Images
    Condition    Measurement  Result
    MDA-MB-231   14-3-3σ      high expression
    NHEM         14-3-3σ      no expression
    C8161.9      14-3-3σ      high expression
    LOX          14-3-3σ      low expression
    MDA-MB-231   β-actin      high expression
    NHEM         β-actin      high expression
    C8161.9      β-actin      high expression
    LOX          β-actin      high expression

    Condition                      Measurement  Result
    IL-1β (–) DEX (–) RU486 (–)    p-p38        low expression
    IL-1β (+) DEX (–) RU486 (–)    p-p38        high expression
    IL-1β (–) DEX (+) RU486 (–)    p-p38        no expression
    IL-1β (+) DEX (+) RU486 (–)    p-p38        low expression
    IL-1β (–) DEX (–) RU486 (+)    p-p38        no expression
    IL-1β (+) DEX (–) RU486 (+)    p-p38        high expression
    IL-1β (–) DEX (+) RU486 (+)    p-p38        low expression
    IL-1β (+) DEX (+) RU486 (+)    p-p38        high expression
    ...                            ...          ...
  • 6. Image Mining Processes
    In principle, image mining involves the same processes as classical literature mining¹ (with some subtle but important differences):
      • Document categorization (image categorization has to deal with the two-dimensional space of pixels, instead of text)
      • Named entity tagging (pinpointing the mention of an entity is more difficult with images; OCR errors have to be considered)
      • Fact extraction (analysis of graphical elements instead of parsing complete sentences)
      • Collection-wide analysis
    ¹ Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics, 67(1-3):7–18.
  • 7. Procedure
    [Pipeline diagram: articles → figures → segments → text → gels → gel panels → named entities → relations]
      1 Figure Extraction
      2 Segmentation
      3 Text Recognition
      4 Gel Segment Detection
      5 Gel Panel Detection
      6 Named Entity Recognition
      7 Relation Extraction
  • 8. Figure Extraction
    [Diagram: step 1, articles → figures]
    We use structured XML files of the open access subset of PubMed Central. (Figure extraction from PDF files or even bitmaps of scanned articles would be more difficult, but definitely feasible.)
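As an illustration of step 1, the following is a minimal sketch of pulling figure records out of an article XML file. It assumes the JATS-style markup used by PubMed Central, where each figure is a `fig` element with `label`, `caption` and `graphic` children; the sample document, the `extract_figures` helper and the `href` values are made up for this example and are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xlink:href attribute carried by <graphic> elements.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Toy article in JATS-style markup (invented content, for illustration only).
SAMPLE_XML = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of 14-3-3 sigma expression.</p></caption>
      <graphic xlink:href="example-1-1"/>
    </fig>
    <fig id="F2">
      <label>Figure 2</label>
      <caption><p>Growth curves.</p></caption>
      <graphic xlink:href="example-1-2"/>
    </fig>
  </body>
</article>"""


def extract_figures(xml_text):
    """Return one dict per <fig> element: id, label, caption text, image reference."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find("graphic")
        caption = fig.find("caption")
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": "".join(caption.itertext()).strip() if caption is not None else "",
            "href": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


figures = extract_figures(SAMPLE_XML)
for figure in figures:
    print(figure["id"], figure["label"], figure["href"])
```

Working from structured XML keeps the caption and the image reference together, which is the information the later steps need per figure.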
  • 9. Segmentation and Text Recognition
    [Diagram: steps 2 and 3, segments and text]
    For segmentation and text recognition we rely on our previous work.² This includes:
      • Detection of layout elements
      • Text region detection
      • OCR (using the Microsoft Document Imaging package of MS Office)
    ² Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for biomedical images. J. of Biomedical Informatics, 43(6):924–931, December.
      Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
  • 10. Gel Segment Detection
    [Diagram: step 4, gels]
    Random forest classifiers (based on 75 random trees) using the following features of image segments:
      • coordinates of the relative position within the image
      • relative and absolute width and height
      • 16 grayscale histogram features
      • color features: red, green and blue
      • 13 texture features
      • number of recognized characters
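To make the feature list more concrete, here is a rough sketch of how part of such a feature vector might be assembled for one segment. It covers only position, size, the grayscale histogram and the character count; the color and texture features are omitted. The function names, the choice of 16 equal-width bins over 8-bit gray values, and the use of the segment's top-left corner for the relative position are assumptions made for this example, not details given on the slide.

```python
def grayscale_histogram(pixels, bins=16):
    """Normalized histogram of 8-bit gray values (0..255) with equal-width bins."""
    counts = [0] * bins
    width = 256 // bins
    for value in pixels:
        counts[min(value // width, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts] if total else [0.0] * bins


def segment_features(x, y, w, h, img_w, img_h, pixels, n_chars):
    """Partial feature vector for one image segment (color and texture omitted)."""
    position = [x / img_w, y / img_h]          # relative position within the image
    size = [w / img_w, h / img_h, w, h]        # relative and absolute width and height
    return position + size + grayscale_histogram(pixels) + [n_chars]


# Tiny made-up segment: four gray values, no recognized characters.
hist = grayscale_histogram([0, 15, 16, 255])
features = segment_features(100, 50, 200, 30, 800, 600, [0, 15, 16, 255], 0)
print(len(features))
```

A vector like this, extended with the remaining features, is what a random forest classifier would take as input for each segment.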
  • 11. Gel Segment Detection Results
    Manually annotated training and testing sets of 500 random figures each. Results for three different thresholds:
    Threshold               Precision  Recall  F-score
    0.15 (high recall)      0.439      0.909   0.592
    0.30                    0.765      0.739   0.752
    0.60 (high precision)   0.926      0.301   0.455
    Accuracy (area under ROC curve): 98.0%
    Unbalanced set: 3% gel segments vs. 97% non-gel segments
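The F-score column can be checked against precision and recall. The sketch below assumes the balanced F-score (harmonic mean of precision and recall), which matches the reported figures; because the inputs are already rounded to three digits, a recomputed value can differ from the reported one in the last digit.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall, reported F-score) as given on the slide.
rows = [
    (0.15, 0.439, 0.909, 0.592),
    (0.30, 0.765, 0.739, 0.752),
    (0.60, 0.926, 0.301, 0.455),
]

for threshold, precision, recall, reported in rows:
    computed = f_score(precision, recall)
    print(f"threshold {threshold:.2f}: computed {computed:.4f}, reported {reported:.3f}")
```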
  • 12. Gel Panel Detection
    [Diagram: step 5, gel panels]
    Algorithm:
      • Start with a gel segment according to the high-precision classifier
      • Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them
      • Collect labels in the form of text segments around the detected gel region
    Results on another set of 500 manually annotated figures:
    Precision  Recall  F-score
    0.951      0.379   0.542
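The three steps above can be sketched as a simple region-growing loop. This is only an illustration under stated assumptions: segments are axis-aligned rectangles carrying a classifier score, the thresholds 0.60 and 0.15 are the high-precision and high-recall settings from the results table, and the `Segment` class, the `adjacent` test and its 5-pixel gap are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass

HIGH_PRECISION = 0.60  # threshold used to pick a seed segment
HIGH_RECALL = 0.15     # threshold used when growing the panel


@dataclass(frozen=True)
class Segment:
    x: int
    y: int
    w: int
    h: int
    score: float = 0.0  # classifier's gel score (assumed to lie in 0..1)
    text: str = ""      # OCR text; empty for non-text segments


def adjacent(a, b, gap=5):
    """True if the two rectangles overlap or lie within `gap` pixels of each other."""
    return (a.x - gap <= b.x + b.w and b.x - gap <= a.x + a.w and
            a.y - gap <= b.y + b.h and b.y - gap <= a.y + a.h)


def bounding_box(segments):
    """Smallest rectangle that contains all given segments."""
    x0 = min(s.x for s in segments)
    y0 = min(s.y for s in segments)
    x1 = max(s.x + s.w for s in segments)
    y1 = max(s.y + s.h for s in segments)
    return Segment(x0, y0, x1 - x0, y1 - y0)


def detect_panel(seed, segments):
    """Grow a panel from `seed`, then collect the text segments around it."""
    panel = [seed]
    grown = True
    while grown:  # repeat until no further segment can be merged
        grown = False
        for s in segments:
            region = bounding_box(panel)
            if (s not in panel and not s.text
                    and s.score >= HIGH_RECALL and adjacent(region, s)):
                panel.append(s)
                grown = True
    region = bounding_box(panel)
    labels = [s.text for s in segments if s.text and adjacent(region, s)]
    return region, labels


segments = [
    Segment(100, 50, 200, 30, score=0.80),   # confident gel segment (seed)
    Segment(100, 82, 200, 30, score=0.20),   # weaker gel segment directly below
    Segment(100, 300, 200, 30, score=0.90),  # gel segment far away, not merged
    Segment(30, 55, 70, 20, text="14-3-3σ"),
    Segment(30, 87, 70, 20, text="β-actin"),
    Segment(400, 400, 50, 20, text="B"),     # unrelated text elsewhere in the figure
]
seed = next(s for s in segments if s.score >= HIGH_PRECISION)
region, labels = detect_panel(seed, segments)
print(region.x, region.y, region.w, region.h, labels)
```

Seeding with the strict threshold and growing with the lenient one is consistent with the reported outcome: few false panels (high precision) at the cost of missing panels that never receive a confident seed (lower recall).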
  • 13. Named Entity Recognition
    [Diagram: step 6, named entities]
    Detection of gene and protein names in gel labels:
      • Tokenization of gel label texts
      • Lookup in Entrez Gene database
      • Case-sensitive matching
      • Exclude tokens:
          • Fewer than 3 characters
          • Arabic or Latin numbers
          • Common short words (from a list of the 100 most frequent words in biomedical articles)
          • 22 general words frequently used in gel diagrams (e.g. min, hrs, line, type, protein, DNA)
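The lookup and exclusion rules above might look roughly like the following. Everything here is a toy stand-in: `GENE_NAMES` replaces the Entrez Gene lookup, `COMMON_WORDS` replaces the list of the 100 most frequent words, and `GEL_WORDS` contains only the six example words named on the slide rather than all 22. Reading "Latin numbers" as Roman numerals, matching the exclusion lists exactly, and the tokenizer pattern are further assumptions.

```python
import re

# Toy stand-ins for the real resources (assumptions, see lead-in).
GENE_NAMES = {"TP53", "AKT1", "EGFR", "CAT"}
COMMON_WORDS = {"the", "and", "was", "with"}
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "DNA"}

# Upper-case Roman numerals such as III, IV or XII.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def tokenize(label):
    """Split a gel label on whitespace and common punctuation."""
    return re.findall(r"[^\s,;:()\[\]/]+", label)


def gene_tokens(label):
    """Return the tokens of a gel label that pass the filters and match a gene name."""
    hits = []
    for token in tokenize(label):
        if len(token) < 3:
            continue  # too short
        if token.isdigit() or ROMAN.fullmatch(token):
            continue  # Arabic or Roman number
        if token in COMMON_WORDS or token in GEL_WORDS:
            continue  # frequent or gel-specific word
        if token in GENE_NAMES:  # case-sensitive lookup
            hits.append(token)
    return hits


print(gene_tokens("TP53 and Cat 30 min III AKT1"))
```

In this toy run `Cat` is rejected because matching is case-sensitive and only `CAT` is in the dictionary, which shows how case sensitivity reduces false hits on ordinary words.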
  • 14. Named Entity Recognition Results
    Recognized gene/protein tokens in 2000 random figures:
                                                    absolute  relative
    Total                                           156       100.0%
    Incorrect                                       54        34.6%
      – Not mentioned (OCR errors)                  28        17.9%
      – Not references to genes or proteins         26        16.7%
    Correct                                         102       65.3%
      – Partially correct (could be more specific)  14        9.0%
      – Fully correct                               88        56.4%
  • 15. Relation Extraction
    [Diagram: step 7, relations]
    Relation extraction is future work and we do not have concrete results at this point. It would involve the following steps:
      • Gene/protein name disambiguation
      • Identify semantic roles (condition, measurement, ...)
      • Quantify degree of expression
    Combination with classical text mining techniques seems promising.
  • 16. Overall Results on PubMed Central
    We ran our pipeline on the whole open access subset of PubMed Central:
    Total articles                           410 950
    Processed articles                       386 428
    Total figures from processed articles    1 110 643
    Processed figures                        884 152
    Detected gel panels                      85 942
    Detected gel panels per figure           0.097
    Detected gel labels                      309 340
    Detected gel labels per panel            3.599
    Detected gene tokens                     1 854 609
    Detected gene tokens in gel labels       75 610
    Gene token ratio                         0.033
    Gene token ratio in gel labels           0.068
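Two of the derived figures can be reproduced directly from the counts. The short sketch below assumes that "per figure" refers to processed figures and "per panel" to detected gel panels; the variable names are illustrative. The two gene token ratios cannot be recomputed this way, since the total number of tokens is not given on the slide.

```python
# Counts as reported on the slide.
total_articles = 410_950
processed_articles = 386_428
total_figures = 1_110_643
processed_figures = 884_152
gel_panels = 85_942
gel_labels = 309_340

# Derived ratios (assumed denominators: processed figures, detected panels).
panels_per_figure = gel_panels / processed_figures
labels_per_panel = gel_labels / gel_panels

# Coverage of the pipeline, not stated on the slide but implied by the counts.
article_coverage = processed_articles / total_articles
figure_coverage = processed_figures / total_figures

print(f"gel panels per figure: {panels_per_figure:.3f}")
print(f"gel labels per panel:  {labels_per_panel:.3f}")
print(f"articles processed:    {article_coverage:.1%}")
print(f"figures processed:     {figure_coverage:.1%}")
```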
  • 17. Discussion: Standardized Biomedical Diagrams?
    It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.
    Shouldn’t we standardize biomedical diagrams?
    A Unified Modeling Language (UML) for biomedicine?
  • 18. Conclusions and Future Work
    Conclusions:
      • Gel segments can be detected with high accuracy
      • Detection of gel panels at high precision
      • Gene/protein name recognition in gel labels at satisfactory precision
    → Image mining from gel diagrams is feasible
    Future Work:
      • Relation extraction
      • Combination with classical text mining techniques
      • Other named entity types: cell lines, drugs, ...
      • Standard for biomedical diagrams?
  • 19. Thank you for your Attention!
    Questions?