Finding and Accessing Diagrams in
Biomedical Publications
Tobias Kuhn, ThaiBinh Luong, and Michael Krauthammer
Krauthammer Lab, Department of Pathology
Yale University School of Medicine
AMIA 2012 Annual Symposium
6 November 2012
Chicago
Introduction
The inclusion of figure images is a recent trend in the area of
literature mining.
The increasing number of open access publications makes such
images available for automated analysis.
Image mining techniques can be used for image search interfaces,
for relation mining, and to complement text mining approaches.
T. Kuhn, T. Luong, and M. Krauthammer, Yale University Finding and Accessing Diagrams in Biomedical Publications 2 / 20
Answer Queries with Images
Often, a query is best answered by an image.
For example, WolframAlpha for “growth age 6”:
Idea: Use existing diagrams of scientific articles to answer queries.
Yale Image Finder
http://krauthammerlab.med.yale.edu/imagefinder/
Detection and Analysis of Specific Image Types
For the next version of the Yale Image Finder, we are working on the
detection and analysis of specific image types:
• Axis Diagrams
• Gel Images
• Network Diagrams (work in progress)
Axis Diagrams: Examples
Axis Diagrams
Axis diagrams are important for several reasons:
• They are abundant in biomedical literature: about 38% of all
subfigures are axis diagrams
• They follow simple common patterns based on axes
• They are complex in the sense that they combine several
dimensions
• They summarize data for human readers
Axis Diagram Detection Steps
Basic Idea: Large segments are detected as center segments of axis
diagrams if surrounded by a number of small label segments.
1. original
2. segments
3. center candidates
4. label candidates
5. result
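The basic idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Segment` class, the `gap` measure, and all thresholds (`min_center_area`, `max_label_area`, `max_gap`, `min_labels`) are hypothetical choices that the slides do not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Axis-aligned bounding box of an image segment (hypothetical structure)."""
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels

    @property
    def area(self) -> int:
        return self.w * self.h


def gap(a: Segment, b: Segment) -> int:
    """Largest of the horizontal and vertical gaps between two boxes (0 if they overlap)."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0)
    return max(dx, dy)


def find_axis_centers(segments, min_center_area=10_000, max_label_area=1_000,
                      max_gap=30, min_labels=4):
    """Return (center, labels) pairs for large segments with enough small neighbors.

    All four thresholds are assumed values chosen for this example.
    """
    results = []
    for center in segments:
        if center.area < min_center_area:
            continue  # too small to be the center of an axis diagram
        labels = [s for s in segments
                  if s is not center
                  and s.area <= max_label_area
                  and gap(center, s) <= max_gap]
        if len(labels) >= min_labels:
            results.append((center, labels))
    return results


# Toy input: one plot area, four tick labels next to it, one distant text segment.
segments = [
    Segment(100, 50, 300, 200),
    Segment(80, 60, 15, 10), Segment(80, 150, 15, 10),
    Segment(150, 255, 20, 10), Segment(300, 255, 20, 10),
    Segment(600, 400, 20, 10),
]
found = find_axis_centers(segments)
```

In this toy input the large segment is accepted as a center because four small segments lie within 30 pixels of it, while the distant segment is ignored.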
Additional Classifiers
To compare and improve our approach, we apply SVM classifiers with
the following two types of features:
• Image: texture and histogram features of the bitmap image
• Caption: word vector of the tokenized caption text
These classifiers only act on the complete figure and cannot spot the
location of axis diagrams.
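As a rough illustration of the caption features, the sketch below turns a caption into a count-based word vector. The slides do not say how captions are tokenized or weighted, so the lowercase alphanumeric tokenizer and the raw counts are assumptions; the function names are hypothetical, and the SVM that would consume these vectors is not shown.

```python
import re
from collections import Counter


def tokenize(caption):
    """Lowercase the caption and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocabulary(captions):
    """Assign each distinct token a fixed vector index, in order of first appearance."""
    vocab = {}
    for caption in captions:
        for token in tokenize(caption):
            vocab.setdefault(token, len(vocab))
    return vocab


def word_vector(caption, vocab):
    """Count-based word vector; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(caption))
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:
            vector[vocab[token]] = n
    return vector


# Made-up captions, used only to build a small vocabulary.
captions = ["Growth curve of cells", "Western blot of cells"]
vocab = build_vocabulary(captions)
vector = word_vector("Blot of blot", vocab)
```

Each figure then has one fixed-length vector per caption, which is the kind of input a classifier over the complete figure can use.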
Results
Evaluation on a random sample of 100 articles from PubMed Central
with at least one figure. The 404 figures in these articles were
manually annotated; they contained 508 axis diagrams.
Task: detection of figures with axis diagrams
  method                        precision  recall  F-score
  segments                      0.87       0.66    0.75
  image                         0.66       0.90    0.76
  caption                       0.84       0.77    0.80
  image + segments              0.80       0.73    0.76
  caption + segments            0.90       0.85    0.88
  image + caption               0.85       0.84    0.84
  image + caption + segments    0.90       0.89    0.89

Task: extraction of axis diagram locations
  method                        precision  recall  F-score
  segments                      0.85       0.40    0.54
  image + segments              0.84       0.39    0.54
  caption + segments            0.88       0.39    0.54
  image + caption + segments    0.89       0.39    0.55
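The F-score column is consistent with the balanced F-score (F1), the harmonic mean of precision and recall. The slides do not name the variant, so that reading is an assumption. A short check on two rows of the table:

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "segments" method, detection of figures with axis diagrams
detection_f = round(f_score(0.87, 0.66), 2)   # 0.75
# "segments" method, extraction of axis diagram locations
location_f = round(f_score(0.85, 0.40), 2)    # 0.54
```

Because the harmonic mean is dominated by the smaller value, a recall near 0.40 holds the F-score close to 0.54 even when precision approaches 0.90.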
Gel Images
Gel diagrams are another important type of image:
• They are the result of gel electrophoresis (e.g. Southern,
Western and Northern blotting)
• They are often shown in biomedical publications as evidence for
the discussed findings (e.g. protein-protein interactions and
protein expression under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns
Relations from Gel Images
Condition     Measurement  Result
MDA-MB-231    14-3-3σ      high expression
NHEM          14-3-3σ      no expression
C8161.9       14-3-3σ      high expression
LOX           14-3-3σ      low expression
MDA-MB-231    β-actin      high expression
NHEM          β-actin      high expression
C8161.9       β-actin      high expression
LOX           β-actin      high expression

Condition                      Measurement  Result
IL-1β (–) DEX (–) RU486 (–)    p-p38        low expression
IL-1β (+) DEX (–) RU486 (–)    p-p38        high expression
IL-1β (–) DEX (+) RU486 (–)    p-p38        no expression
IL-1β (+) DEX (+) RU486 (–)    p-p38        low expression
IL-1β (–) DEX (–) RU486 (+)    p-p38        no expression
IL-1β (+) DEX (–) RU486 (+)    p-p38        high expression
IL-1β (–) DEX (+) RU486 (+)    p-p38        low expression
IL-1β (+) DEX (+) RU486 (+)    p-p38        high expression
...                            ...          ...
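Relations of this kind map naturally onto simple records. The sketch below is one possible representation, not a format defined in the talk: the `GelRelation` class and the `conditions_with` helper are hypothetical names, and the rows are copied from the first table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GelRelation:
    condition: str    # experimental condition, e.g. a cell line or treatment combination
    measurement: str  # what the gel band measures, e.g. a protein
    result: str       # "no expression", "low expression", or "high expression"


# Rows taken from the first table (14-3-3σ measurements only).
RELATIONS = [
    GelRelation("MDA-MB-231", "14-3-3σ", "high expression"),
    GelRelation("NHEM", "14-3-3σ", "no expression"),
    GelRelation("C8161.9", "14-3-3σ", "high expression"),
    GelRelation("LOX", "14-3-3σ", "low expression"),
]


def conditions_with(relations, measurement, result):
    """Return the conditions under which a measurement showed the given result."""
    return [r.condition for r in relations
            if r.measurement == measurement and r.result == result]


high = conditions_with(RELATIONS, "14-3-3σ", "high expression")
```

Stored this way, the content of a gel image can be queried like any other structured data, for example to list every condition with high expression of a given protein.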
Procedure
[Figure: processing pipeline. Steps 1 to 3 lead from articles to figures, segments, and text; step 4 yields gels, step 5 gel panels, step 6 named entities, and step 7 relations.]
We focus here on steps 4, 5, and 6. Steps 1, 2, and 3 have been
addressed in prior work. Step 7 is future work.
Gel Segment Detection
[Step 4: gels]
Random forest classifiers on a number of features of image segments
(position, size, grayscale histogram, color, texture, and number of
recognized characters).
Results on 1000 manually annotated, random figures:
Setting         Threshold  Precision  Recall  F-score
high recall     0.15       0.439      0.909   0.592
-               0.30       0.765      0.739   0.752
high precision  0.60       0.926      0.301   0.455
AUC: 0.980
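The trade-off between the thresholds can be illustrated with a small calculation. The sketch assumes that the classifier emits a score per segment and that a segment counts as a gel when its score is at least the threshold; the scores and labels below are made up for the example and are not data from the evaluation.

```python
def precision_recall(scores, is_gel, threshold):
    """Precision and recall when segments with score >= threshold are called gels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(predicted, is_gel))        # true positives
    fp = sum(p and not g for p, g in zip(predicted, is_gel))    # false positives
    fn = sum(g and not p for p, g in zip(predicted, is_gel))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up classifier scores and ground-truth labels for six segments.
scores = [0.9, 0.7, 0.5, 0.35, 0.2, 0.1]
is_gel = [True, True, False, True, False, False]

low_threshold = precision_recall(scores, is_gel, 0.15)    # favors recall
high_threshold = precision_recall(scores, is_gel, 0.60)   # favors precision
```

On this toy data the low threshold finds every gel segment at the cost of false positives, and the high threshold makes no false calls but misses one gel segment, which mirrors the pattern in the table.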
Gel Panel Detection
[Step 5: gel panels]
Algorithm:
• Start with a gel segment according to the high-precision classifier
• Repeatedly look for adjacent gel segments according to the
high-recall classifier, and merge them
• Collect labels in the form of text segments around the detected
gel region
Results on another set of 500 manually annotated figures:
Precision Recall F-score
0.951 0.379 0.542
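The seed-and-merge part of the algorithm can be sketched as a region-growing loop. This is a minimal illustration, not the authors' code. It assumes that the high-precision and high-recall classifiers correspond to the 0.60 and 0.15 thresholds from the gel segment detection slide, that segments are bounding boxes `(x, y, w, h)`, and that "adjacent" means a gap of at most `max_gap` pixels; the label collection step is left out.

```python
from collections import deque


def boxes_adjacent(a, b, max_gap=5):
    """Boxes are (x, y, w, h); adjacent if the gap between them is at most max_gap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(ax - (bx + bw), bx - (ax + aw), 0)
    dy = max(ay - (by + bh), by - (ay + ah), 0)
    return max(dx, dy) <= max_gap


def detect_gel_panels(boxes, scores, seed_threshold=0.60, grow_threshold=0.15):
    """Grow panels from high-precision seeds through adjacent high-recall segments.

    boxes maps segment id -> (x, y, w, h); scores maps segment id -> gel score.
    """
    panels, used = [], set()
    for seed in sorted(boxes):
        if seed in used or scores[seed] < seed_threshold:
            continue  # only confident, not yet merged segments start a panel
        panel, queue = {seed}, deque([seed])
        used.add(seed)
        while queue:
            current = queue.popleft()
            for other in boxes:
                if other in used or scores[other] < grow_threshold:
                    continue
                if boxes_adjacent(boxes[current], boxes[other]):
                    used.add(other)
                    panel.add(other)
                    queue.append(other)
        panels.append(panel)
    return panels


# Toy input: three stacked strips and one distant segment.
boxes = {"a": (0, 0, 100, 20), "b": (0, 22, 100, 20),
         "c": (0, 44, 100, 20), "d": (300, 0, 50, 20)}
scores = {"a": 0.9, "b": 0.2, "c": 0.05, "d": 0.3}
panels = detect_gel_panels(boxes, scores)
```

Here segment `a` seeds a panel and absorbs its neighbor `b`; `c` is rejected by the lower threshold, and `d` is neither a seed nor adjacent to one. Starting only from confident seeds is a plausible reason for the high precision and low recall reported above.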
Named Entity Recognition
[Step 6: named entities]
Detection of gene and protein names in gel labels from a sample
of 2000 random figures (tokenization; case-sensitive Entrez Gene
lookup; exclude very short and very common words):
                                                absolute  relative
Total                                           156       100.0%
Incorrect                                       54        34.6%
  – Not mentioned (OCR errors)                  28        17.9%
  – Not references to genes or proteins         26        16.7%
Correct                                         102       65.4%
  – Partially correct (could be more specific)  14        9.0%
  – Fully correct                               88        56.4%
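The lookup steps named above (tokenization, case-sensitive matching, exclusion of very short and very common words) can be sketched as follows. The gene list and common-word list are tiny made-up stand-ins, not Entrez Gene content, and the tokenizer, the minimum length of 3, and the case-insensitive common-word check are assumptions.

```python
import re

# Hypothetical stand-in for a gene and protein name dictionary.
GENE_NAMES = {"p38", "AKT1", "GAPDH", "CAT"}
# Hypothetical list of very common English words, stored in lowercase.
COMMON_WORDS = {"cat", "was", "and", "the"}
MIN_LENGTH = 3  # assumed cutoff for "very short" tokens


def find_gene_tokens(label_text):
    """Return tokens of a gel label that match the gene dictionary."""
    tokens = re.findall(r"[A-Za-z0-9]+", label_text)
    matches = []
    for token in tokens:
        if len(token) < MIN_LENGTH:
            continue  # very short tokens are too ambiguous
        if token.lower() in COMMON_WORDS:
            continue  # very common words are excluded even if they are gene names
        if token in GENE_NAMES:  # case-sensitive lookup
            matches.append(token)
    return matches


hits = find_gene_tokens("p-p38 AKT1 akt1 CAT GAPDH")
```

In this example `akt1` is rejected by the case-sensitive lookup and `CAT` by the common-word filter. Filters of this kind can reduce false matches but cannot repair OCR errors, which the table lists as a separate error category.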
Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed
Central:
Total articles 410 950
Processed articles 386 428
Total figures from processed articles 1 110 643
Processed figures 884 152
Detected gel panels 85 942
Detected gel panels per figure 0.097
Detected gene tokens 1 854 609
Detected gene tokens in gel labels 75 610
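The derived figure in the table can be reproduced from the counts. The short calculation below uses only numbers from the table; it shows that the reported 0.097 matches gel panels divided by processed figures. The two coverage ratios are additional arithmetic that the slides do not state.

```python
total_articles = 410_950
processed_articles = 386_428
total_figures = 1_110_643
processed_figures = 884_152
gel_panels = 85_942

# Detected gel panels per processed figure (reported as 0.097).
panels_per_figure = gel_panels / processed_figures

# Share of articles and figures that were processed.
article_coverage = processed_articles / total_articles
figure_coverage = processed_figures / total_figures

print(f"{panels_per_figure:.3f} {article_coverage:.1%} {figure_coverage:.1%}")
```

On these numbers roughly 94% of the articles and 80% of their figures were processed, so the panel counts describe a large but not complete part of the open access subset.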
Conclusions and Future Work
Conclusions
• The location of certain diagram types, such as axis and gel
diagrams, can be extracted with high precision (about 90%), but
recall is low (about 40%), which gives an F-score of around 55%
Future Work
• Relation extraction
• Include other image types like network diagrams
• Combination with classical text mining techniques
• Detection of other named entity types: cell lines, drugs, ...
• Sophisticated diagram search interface
• Standard for biomedical diagrams?
Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory
accuracy, but it is clear that this procedure is far from perfect.
Do we need a standard for biomedical diagrams? A Unified
Modeling Language (UML) for biology and medicine?
Thank you for your Attention!
Questions?
T. Kuhn, T. Luong, and M. Krauthammer, Yale University Finding and Accessing Diagrams in Biomedical Publications 20 / 20

More Related Content

What's hot

Booster in High Dimensional Data Classification
Booster in High Dimensional Data ClassificationBooster in High Dimensional Data Classification
Booster in High Dimensional Data Classification
rahulmonikasharma
 
Sampling and its variability
Sampling  and its variabilitySampling  and its variability
Sampling and its variability
DrBhushan Kamble
 
PRINCIPLE COMPONENT ANALYSIS.pptx
PRINCIPLE COMPONENT ANALYSIS.pptxPRINCIPLE COMPONENT ANALYSIS.pptx
PRINCIPLE COMPONENT ANALYSIS.pptx
ASHUTOSHGAURAV10
 
Errors in research
Errors in researchErrors in research
Errors in research
AasthaBhatia18
 
Neutrosophic multi criteria_decision_mak
Neutrosophic multi criteria_decision_makNeutrosophic multi criteria_decision_mak
Neutrosophic multi criteria_decision_mak
Dr. Hari Arora
 
Type of Sampling design
Type of Sampling  designType of Sampling  design
Complex random sampling designs
Complex random sampling designsComplex random sampling designs
Complex random sampling designs
Dr.Sangeetha R
 
Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square
IJECEIAES
 
Graduate Paper--Hierarchical clustring and topology for psychometrics paper
Graduate Paper--Hierarchical clustring and topology for psychometrics paperGraduate Paper--Hierarchical clustring and topology for psychometrics paper
Graduate Paper--Hierarchical clustring and topology for psychometrics paper
Colleen Farrelly
 
An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identification
arx-deidentifier
 
The sampling design process
The sampling design processThe sampling design process
The sampling design process
Kritika Jain
 
SAMPLE DESIGN, Probability sampling
SAMPLE DESIGN, Probability samplingSAMPLE DESIGN, Probability sampling
SAMPLE DESIGN, Probability sampling
Jyoti Rastogi
 
Classification of Health Care Data Using Machine Learning Technique
Classification of Health Care Data Using Machine Learning TechniqueClassification of Health Care Data Using Machine Learning Technique
Classification of Health Care Data Using Machine Learning Technique
inventionjournals
 
sampling types
sampling typessampling types
sampling types
Faizan Anjum
 
Lesson01_Static.11
Lesson01_Static.11Lesson01_Static.11
Lesson01_Static.11
thangv
 
Rm mc qs
Rm mc qsRm mc qs
Rm mc qs
kuldeep Dwivedi
 
Research methodology ppt
Research methodology pptResearch methodology ppt
Research methodology ppt
bgshalini
 
Sampling Procedure
Sampling ProcedureSampling Procedure
Sampling Procedure
Jalen Rebolledo
 
Chapter 3 Census and Sample Methods
Chapter   3 Census and Sample MethodsChapter   3 Census and Sample Methods
Chapter 3 Census and Sample Methods
Ritvik Tolumbia
 
Sampling Design
Sampling DesignSampling Design
Sampling Design
Jale Nonan
 

What's hot (20)

Booster in High Dimensional Data Classification
Booster in High Dimensional Data ClassificationBooster in High Dimensional Data Classification
Booster in High Dimensional Data Classification
 
Sampling and its variability
Sampling  and its variabilitySampling  and its variability
Sampling and its variability
 
PRINCIPLE COMPONENT ANALYSIS.pptx
PRINCIPLE COMPONENT ANALYSIS.pptxPRINCIPLE COMPONENT ANALYSIS.pptx
PRINCIPLE COMPONENT ANALYSIS.pptx
 
Errors in research
Errors in researchErrors in research
Errors in research
 
Neutrosophic multi criteria_decision_mak
Neutrosophic multi criteria_decision_makNeutrosophic multi criteria_decision_mak
Neutrosophic multi criteria_decision_mak
 
Type of Sampling design
Type of Sampling  designType of Sampling  design
Type of Sampling design
 
Complex random sampling designs
Complex random sampling designsComplex random sampling designs
Complex random sampling designs
 
Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square
 
Graduate Paper--Hierarchical clustring and topology for psychometrics paper
Graduate Paper--Hierarchical clustring and topology for psychometrics paperGraduate Paper--Hierarchical clustring and topology for psychometrics paper
Graduate Paper--Hierarchical clustring and topology for psychometrics paper
 
An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identification
 
The sampling design process
The sampling design processThe sampling design process
The sampling design process
 
SAMPLE DESIGN, Probability sampling
SAMPLE DESIGN, Probability samplingSAMPLE DESIGN, Probability sampling
SAMPLE DESIGN, Probability sampling
 
Classification of Health Care Data Using Machine Learning Technique
Classification of Health Care Data Using Machine Learning TechniqueClassification of Health Care Data Using Machine Learning Technique
Classification of Health Care Data Using Machine Learning Technique
 
sampling types
sampling typessampling types
sampling types
 
Lesson01_Static.11
Lesson01_Static.11Lesson01_Static.11
Lesson01_Static.11
 
Rm mc qs
Rm mc qsRm mc qs
Rm mc qs
 
Research methodology ppt
Research methodology pptResearch methodology ppt
Research methodology ppt
 
Sampling Procedure
Sampling ProcedureSampling Procedure
Sampling Procedure
 
Chapter 3 Census and Sample Methods
Chapter   3 Census and Sample MethodsChapter   3 Census and Sample Methods
Chapter 3 Census and Sample Methods
 
Sampling Design
Sampling DesignSampling Design
Sampling Design
 

Similar to Finding and Accessing Diagrams in Biomedical Publications

Image Mining from Gel Diagrams in Biomedical Publications
Image Mining from Gel Diagrams in Biomedical PublicationsImage Mining from Gel Diagrams in Biomedical Publications
Image Mining from Gel Diagrams in Biomedical Publications
Tobias Kuhn
 
Interpretability of machine learning
Interpretability of machine learningInterpretability of machine learning
Interpretability of machine learning
Daiki Tanaka
 
Images as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for SegmentationImages as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for Segmentation
john236zaq
 
Leibniz: A Digital Scientific Notation
Leibniz: A Digital Scientific NotationLeibniz: A Digital Scientific Notation
Leibniz: A Digital Scientific Notation
khinsen
 
Algorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsAlgorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphs
S P Sajjan
 
Bio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformaticsBio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformatics
abdelazim Galal
 
Homework 21. Complete Chapter 3, Problem #1 under Project.docx
Homework 21. Complete Chapter 3, Problem #1 under Project.docxHomework 21. Complete Chapter 3, Problem #1 under Project.docx
Homework 21. Complete Chapter 3, Problem #1 under Project.docx
adampcarr67227
 
CV_of_ArulMurugan (2017_01_18)
CV_of_ArulMurugan (2017_01_18)CV_of_ArulMurugan (2017_01_18)
CV_of_ArulMurugan (2017_01_18)
ArulMurugan Ambikapathi
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
Double Check ĆŐNSULTING
 
Diagrammatic elicitation & When to use diagrams, drawings and cartoons?
Diagrammatic elicitation & When to use diagrams, drawings and cartoons?Diagrammatic elicitation & When to use diagrams, drawings and cartoons?
Diagrammatic elicitation & When to use diagrams, drawings and cartoons?
Tünde Varga-Atkins
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
FranciscoJAzuajeG
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Robert Grossman
 
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
Damian R. Mingle, MBA
 
Bioinformatics _ an Introduction - Ramsden, Jeremy.pdf
Bioinformatics _ an Introduction - Ramsden, Jeremy.pdfBioinformatics _ an Introduction - Ramsden, Jeremy.pdf
Bioinformatics _ an Introduction - Ramsden, Jeremy.pdf
GajahNauli2
 
A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...
ASWATHY VG
 
Experimental
ExperimentalExperimental
Experimental
Carla Piper
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang Su
Alexander Pico
 
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
CSCJournals
 
Broadening the Scope of Nanopublications
Broadening the Scope of NanopublicationsBroadening the Scope of Nanopublications
Broadening the Scope of Nanopublications
Tobias Kuhn
 
ДНК составляет лишь половину объёма хромосом
ДНК составляет лишь половину объёма хромосомДНК составляет лишь половину объёма хромосом
ДНК составляет лишь половину объёма хромосом
Anatol Alizar
 

Similar to Finding and Accessing Diagrams in Biomedical Publications (20)

Image Mining from Gel Diagrams in Biomedical Publications
Image Mining from Gel Diagrams in Biomedical PublicationsImage Mining from Gel Diagrams in Biomedical Publications
Image Mining from Gel Diagrams in Biomedical Publications
 
Interpretability of machine learning
Interpretability of machine learningInterpretability of machine learning
Interpretability of machine learning
 
Images as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for SegmentationImages as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for Segmentation
 
Leibniz: A Digital Scientific Notation
Leibniz: A Digital Scientific NotationLeibniz: A Digital Scientific Notation
Leibniz: A Digital Scientific Notation
 
Algorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsAlgorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphs
 
Bio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformaticsBio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformatics
 
Homework 21. Complete Chapter 3, Problem #1 under Project.docx
Homework 21. Complete Chapter 3, Problem #1 under Project.docxHomework 21. Complete Chapter 3, Problem #1 under Project.docx
Homework 21. Complete Chapter 3, Problem #1 under Project.docx
 
CV_of_ArulMurugan (2017_01_18)
CV_of_ArulMurugan (2017_01_18)CV_of_ArulMurugan (2017_01_18)
CV_of_ArulMurugan (2017_01_18)
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
 
Diagrammatic elicitation & When to use diagrams, drawings and cartoons?
Diagrammatic elicitation & When to use diagrams, drawings and cartoons?Diagrammatic elicitation & When to use diagrams, drawings and cartoons?
Diagrammatic elicitation & When to use diagrams, drawings and cartoons?
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large Datasets
 
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
 
Bioinformatics _ an Introduction - Ramsden, Jeremy.pdf
Bioinformatics _ an Introduction - Ramsden, Jeremy.pdfBioinformatics _ an Introduction - Ramsden, Jeremy.pdf
Bioinformatics _ an Introduction - Ramsden, Jeremy.pdf
 
A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...
 
Experimental
ExperimentalExperimental
Experimental
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang Su
 
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
 
Broadening the Scope of Nanopublications
Broadening the Scope of NanopublicationsBroadening the Scope of Nanopublications
Broadening the Scope of Nanopublications
 
ДНК составляет лишь половину объёма хромосом
ДНК составляет лишь половину объёма хромосомДНК составляет лишь половину объёма хромосом
ДНК составляет лишь половину объёма хромосом
 

More from Tobias Kuhn

Nanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingNanopublications and Decentralized Publishing
Nanopublications and Decentralized Publishing
Tobias Kuhn
 
Linked Data Publishing with Nanopublications
Linked Data Publishing with NanopublicationsLinked Data Publishing with Nanopublications
Linked Data Publishing with Nanopublications
Tobias Kuhn
 
Genuine semantic publishing
Genuine semantic publishingGenuine semantic publishing
Genuine semantic publishing
Tobias Kuhn
 
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of DataA Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
Tobias Kuhn
 
The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer
Tobias Kuhn
 
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Tobias Kuhn
 
nanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublicationsnanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublications
Tobias Kuhn
 
Semantic Publishing and Nanopublications
Semantic Publishing and NanopublicationsSemantic Publishing and Nanopublications
Semantic Publishing and Nanopublications
Tobias Kuhn
 
Scientific Data Publishing
Scientific Data PublishingScientific Data Publishing
Scientific Data Publishing
Tobias Kuhn
 
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
Tobias Kuhn
 
Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?
Tobias Kuhn
 
Data Publishing and Post-Publication Reviews
Data Publishing and Post-Publication ReviewsData Publishing and Post-Publication Reviews
Data Publishing and Post-Publication Reviews
Tobias Kuhn
 
Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications
Tobias Kuhn
 
Nanopubs
NanopubsNanopubs
Nanopubs
Tobias Kuhn
 
Meme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation NetworksMeme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation Networks
Tobias Kuhn
 
A Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural LanguageA Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural Language
Tobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
Tobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
Tobias Kuhn
 
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Tobias Kuhn
 
Automatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen WikiAutomatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen Wiki
Tobias Kuhn
 

More from Tobias Kuhn (20)

Nanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingNanopublications and Decentralized Publishing
Nanopublications and Decentralized Publishing
 
Linked Data Publishing with Nanopublications
Linked Data Publishing with NanopublicationsLinked Data Publishing with Nanopublications
Linked Data Publishing with Nanopublications
 
Genuine semantic publishing
Genuine semantic publishingGenuine semantic publishing
Genuine semantic publishing
 
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of DataA Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
 
The Controlled Natural Language of Randall Munroe’s Thing Explainer



Finding and Accessing Diagrams in Biomedical Publications

  • 1. Finding and Accessing Diagrams in Biomedical Publications
    Tobias Kuhn, ThaiBinh Luong, and Michael Krauthammer
    Krauthammer Lab, Department of Pathology, Yale University School of Medicine
    AMIA 2012 Annual Symposium, 6 November 2012, Chicago
  • 2. Introduction
    The inclusion of figure images is a recent trend in the area of literature mining. The increasing number of open access publications makes such images available for automated analysis. Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.
  • 3. Answer Queries with Images
    Often, a query is best answered by an image. For example: WolframAlpha's result for “growth age 6” [image not included in transcript].
    Idea: Use existing diagrams of scientific articles to answer queries.
  • 4. Yale Image Finder
    http://krauthammerlab.med.yale.edu/imagefinder/
  • 5. Detection and Analysis of Specific Image Types
    For the next version of the Yale Image Finder, we are working on the detection and analysis of specific image types:
      • Axis Diagrams
      • Gel Images
      • Network Diagrams (work in progress)
  • 6. Axis Diagrams: Examples
  • 7. Axis Diagrams
    Axis diagrams are important for several reasons:
      • They are abundant in biomedical literature: about 38% of all subfigures are axis diagrams
      • They follow simple common patterns based on axes
      • They are complex in the sense that they combine several dimensions
      • They summarize data for human readers
  • 8. Axis Diagram Detection Steps
    Basic Idea: Large segments are detected as center segments of axis diagrams if surrounded by a number of small label segments.
    [Figure: five panels showing 1. original, 2. segments, 3. center candidates, 4. label candidates, 5. result]
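The basic idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Segment` bounding-box type, the function names, and all thresholds (`min_center_area`, `max_label_area`, `margin`, `min_labels`) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Axis-aligned bounding box of an image segment (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int

    @property
    def area(self) -> int:
        return self.w * self.h


def intersects(a: Segment, b: Segment) -> bool:
    """True if the two boxes overlap."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def expand(s: Segment, margin: int) -> Segment:
    """Grow a box by `margin` pixels on every side."""
    return Segment(s.x - margin, s.y - margin, s.w + 2 * margin, s.h + 2 * margin)


def detect_axis_centers(segments, min_center_area=10_000, max_label_area=600,
                        margin=40, min_labels=3):
    """Return large segments that are surrounded by enough small label segments.

    All thresholds are illustrative assumptions, not values from the talk.
    """
    centers = [s for s in segments if s.area >= min_center_area]
    labels = [s for s in segments if s.area <= max_label_area]
    result = []
    for center in centers:
        band = expand(center, margin)
        # Labels must lie in the band around the center, not inside it.
        nearby = [l for l in labels
                  if intersects(band, l) and not intersects(center, l)]
        if len(nearby) >= min_labels:
            result.append(center)
    return result
```

A large segment with three small segments just outside its border would be accepted as a center segment, while an equally large segment with no neighbouring labels would not.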
  • 9. Additional Classifiers
    To compare and improve our approach, we apply SVM classifiers with the following two types of features:
      • Image: texture and histogram features of the bitmap image
      • Caption: word vector of the tokenized caption text
    These classifiers only act on the complete figure and cannot spot the location of axis diagrams.
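As an illustration of the caption feature only, a word vector can be built from tokenized caption text along these lines. The tokenizer, the vocabulary construction, and the function names are assumptions made for this sketch; the slide does not specify them, and the SVM training step is not shown.

```python
import re
from collections import Counter


def tokenize(caption):
    """Lower-case the caption and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocabulary(captions):
    """Map every token seen in the training captions to a vector index."""
    vocab = sorted({tok for c in captions for tok in tokenize(c)})
    return {tok: i for i, tok in enumerate(vocab)}


def word_vector(caption, vocabulary):
    """Count how often each vocabulary word occurs in the caption."""
    counts = Counter(tokenize(caption))
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # words outside the vocabulary are ignored
            vec[vocabulary[tok]] = n
    return vec
```

Vectors of this kind could then be passed to any standard SVM implementation as one feature row per figure.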
  • 10. Results
    Evaluation on a random sample of 100 articles from PubMed Central with at least one figure. The 404 figures in these articles were manually annotated; they contained 508 axis diagrams.

    Task                    Method                       Precision  Recall  F-score
    Detection of figures    segments                     0.87       0.66    0.75
    with axis diagrams      image                        0.66       0.90    0.76
                            caption                      0.84       0.77    0.80
                            image + segments             0.80       0.73    0.76
                            caption + segments           0.90       0.85    0.88
                            image + caption              0.85       0.84    0.84
                            image + caption + segments   0.90       0.89    0.89
    Extraction of axis      segments                     0.85       0.40    0.54
    diagram locations       image + segments             0.84       0.39    0.54
                            caption + segments           0.88       0.39    0.54
                            image + caption + segments   0.89       0.39    0.55
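The F-score column is consistent with the balanced F1 measure, the harmonic mean of precision and recall. That reading is an assumption, since the slide does not define the score. A short sketch for recomputing it:

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two rows from the table above, using the rounded values shown on the slide.
print(round(f_score(0.87, 0.66), 2))  # detection, segments
print(round(f_score(0.85, 0.40), 2))  # location extraction, segments
```

Because the precision and recall values on the slide are already rounded, recomputed scores can differ from the reported ones by about 0.01 in some rows.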
  • 11. Gel Images
    Gel diagrams are another important type of image:
      • They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)
      • They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expressions under different conditions)
      • About 15% of all subfigures are gel images
      • They are structured according to common regular patterns
  • 12. Relations from Gel Images

    Condition     Measurement  Result
    MDA-MB-231    14-3-3σ      high expression
    NHEM          14-3-3σ      no expression
    C8161.9       14-3-3σ      high expression
    LOX           14-3-3σ      low expression
    MDA-MB-231    β-actin      high expression
    NHEM          β-actin      high expression
    C8161.9       β-actin      high expression
    LOX           β-actin      high expression

    Condition                      Measurement  Result
    IL-1β (–) DEX (–) RU486 (–)    p-p38        low expression
    IL-1β (+) DEX (–) RU486 (–)    p-p38        high expression
    IL-1β (–) DEX (+) RU486 (–)    p-p38        no expression
    IL-1β (+) DEX (+) RU486 (–)    p-p38        low expression
    IL-1β (–) DEX (–) RU486 (+)    p-p38        no expression
    IL-1β (+) DEX (–) RU486 (+)    p-p38        high expression
    IL-1β (–) DEX (+) RU486 (+)    p-p38        low expression
    IL-1β (+) DEX (+) RU486 (+)    p-p38        high expression
    ...                            ...          ...
  • 13. Procedure
    [Figure: processing pipeline. Stages shown: articles, figures, segments, text, gels (step 4), gel panels (step 5), named entities (step 6), relations (step 7)]
    We focus here on steps 4, 5, and 6. Steps 1, 2, and 3 have been addressed in prior work. Step 7 is future work.
  • 14. Gel Segment Detection (step 4: gels)
    Random forest classifiers on a number of features of image segments (position, size, grayscale histogram, color, texture, and number of recognized characters).
    Results on 1000 randomly selected, manually annotated figures:

                      Threshold  Precision  Recall  F-score  AUC
    high recall       0.15       0.439      0.909   0.592
                      0.30       0.765      0.739   0.752    0.980
    high precision    0.60       0.926      0.301   0.455
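The trade-off between the high-recall and high-precision settings comes from applying different thresholds to the same classifier score. The sketch below shows the mechanism on invented toy scores and labels; none of the data comes from the talk, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented classifier scores and ground-truth labels (1 = gel segment).
toy_scores = [0.05, 0.10, 0.20, 0.25, 0.35, 0.50, 0.65, 0.70, 0.90]
toy_labels = [0, 0, 1, 0, 0, 1, 1, 0, 1]

for threshold in (0.15, 0.30, 0.60):
    p, r = precision_recall_at(toy_scores, toy_labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.3f} recall={r:.3f}")
```

Raising the threshold keeps only the most confident predictions, which tends to raise precision and lower recall, as in the table above.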
  • 15. Gel Panel Detection (step 5: gel panels)
    Algorithm:
      • Start with a gel segment according to the high-precision classifier
      • Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them
      • Collect labels in the form of text segments around the detected gel region
    Results on another set of 500 manually annotated figures:

    Precision  Recall  F-score
    0.951      0.379   0.542
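The three steps of this algorithm can be sketched as a simple region-growing procedure. Everything below is an illustrative assumption rather than the authors' code: the `Seg` type, the `gel_score` field standing in for the classifier output, the adjacency test, the pixel gaps, and the use of 0.60 and 0.15 as the high-precision and high-recall thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    """Bounding box of a segment with a hypothetical gel-classifier score."""
    x: int
    y: int
    w: int
    h: int
    gel_score: float = 0.0
    is_text: bool = False


def adjacent(a, b, gap):
    """True if the boxes overlap or lie within `gap` pixels of each other."""
    return (a.x - gap < b.x + b.w and b.x < a.x + a.w + gap
            and a.y - gap < b.y + b.h and b.y < a.y + a.h + gap)


def grow_panel(seed, segments, recall_threshold, gap):
    """Step 2: repeatedly merge adjacent segments accepted at the high-recall threshold."""
    panel = [seed]
    remaining = [s for s in segments
                 if s is not seed and not s.is_text and s.gel_score >= recall_threshold]
    changed = True
    while changed:
        changed = False
        for s in list(remaining):
            if any(adjacent(p, s, gap) for p in panel):
                panel.append(s)
                remaining.remove(s)
                changed = True
    return panel


def detect_panels(segments, precision_threshold=0.60, recall_threshold=0.15,
                  gap=5, label_gap=30):
    """Return (panel segments, label segments) pairs; thresholds are assumptions."""
    panels, used = [], set()
    for seed in segments:
        # Step 1: seeds must pass the high-precision threshold.
        if seed.is_text or seed.gel_score < precision_threshold or seed in used:
            continue
        panel = grow_panel(seed, segments, recall_threshold, gap)
        used.update(panel)
        # Step 3: collect text segments around the merged gel region.
        labels = [t for t in segments
                  if t.is_text and any(adjacent(p, t, label_gap) for p in panel)]
        panels.append((panel, labels))
    return panels
```

Seeding with the strict threshold and growing with the lenient one is one way to read the slide: a region is only started where the classifier is confident, but once started it may absorb less certain neighbours.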
  • 16. Named Entity Recognition (step 6: named entities)
    Detection of gene and protein names in gel labels from a sample of 2000 random figures (tokenization; case-sensitive Entrez Gene lookup; exclude very short and very common words):

                                                    Absolute  Relative
    Total                                           156       100.0%
    Incorrect                                       54        34.6%
      – Not mentioned (OCR errors)                  28        17.9%
      – Not references to genes or proteins         26        16.7%
    Correct                                         102       65.3%
      – Partially correct (could be more specific)  14        9.0%
      – Fully correct                               88        56.4%
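The lookup described in parentheses can be sketched as a dictionary match over label tokens. The tiny `GENE_NAMES` set is a toy stand-in for an Entrez Gene name list, and the tokenizer, the `COMMON_WORDS` list, and the minimum length are assumptions; the slide gives none of these details.

```python
import re

# Toy stand-in for a gene/protein name dictionary derived from Entrez Gene.
GENE_NAMES = {"ACTB", "MAPK14", "SFN", "SET"}

# Hypothetical list of very common words to exclude (compared in lower case).
COMMON_WORDS = {"and", "the", "was", "for", "set"}


def find_gene_tokens(label_text, min_length=3):
    """Return label tokens that match the gene dictionary case-sensitively."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", label_text)
    return [t for t in tokens
            if len(t) >= min_length            # exclude very short tokens
            and t.lower() not in COMMON_WORDS  # exclude very common words
            and t in GENE_NAMES]               # case-sensitive lookup
```

With this sketch, `ACTB` matches but `actb` does not, and a symbol that is also a common English word (here `SET`) is dropped. Whether the original pipeline treats such cases the same way is not stated on the slide.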
  • 17. Overall Results on PubMed Central
    We ran our pipeline on the whole open access subset of PubMed Central:

    Total articles                            410 950
    Processed articles                        386 428
    Total figures from processed articles   1 110 643
    Processed figures                         884 152
    Detected gel panels                        85 942
    Detected gel panels per figure              0.097
    Detected gene tokens                    1 854 609
    Detected gene tokens in gel labels         75 610
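The derived figure in this table, and a few ratios that follow from it, can be rechecked with simple arithmetic. The counts are copied from the slide; the dictionary keys and the additional ratios are only illustrative and are not reported by the authors.

```python
stats = {
    "total_articles": 410_950,
    "processed_articles": 386_428,
    "total_figures": 1_110_643,
    "processed_figures": 884_152,
    "gel_panels": 85_942,
    "gene_tokens": 1_854_609,
    "gene_tokens_in_gel_labels": 75_610,
}

# Reported on the slide as 0.097.
panels_per_figure = stats["gel_panels"] / stats["processed_figures"]

# Additional ratios, derived here for orientation only.
article_coverage = stats["processed_articles"] / stats["total_articles"]
figure_coverage = stats["processed_figures"] / stats["total_figures"]
gel_label_share = stats["gene_tokens_in_gel_labels"] / stats["gene_tokens"]

print(f"gel panels per processed figure: {panels_per_figure:.3f}")
print(f"articles processed: {article_coverage:.1%}")
print(f"figures processed: {figure_coverage:.1%}")
print(f"gene tokens found in gel labels: {gel_label_share:.1%}")
```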
  • 18. Conclusions and Future Work
    Conclusions
      • The locations of certain diagram types, such as axis and gel diagrams, can be extracted with high precision (about 90%), at an F-score of around 55%
    Future Work
      • Relation extraction
      • Include other image types like network diagrams
      • Combination with classical text mining techniques
      • Detection of other named entity types: cell lines, drugs, ...
      • Sophisticated diagram search interface
      • Standard for biomedical diagrams?
  • 19. Discussion: Standardized Biomedical Diagrams?
    It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.
    Do we need a standard for biomedical diagrams? A Unified Modeling Language (UML) for biology and medicine?
  • 20. Thank you for your Attention!
    Questions?