PaperMaker: Validation of biomedical scientific
publications


January 19th, 2011

Workshop: „BeyondThePdf“
Dietrich Rebholz-Schuhmann, MD, PhD
Group Leader Rebholz Group
European Bioinformatics Institute
Publishing is about …

    • ... Agreeing / disagreeing about current science
           • Only peer review can judge current science
    • ... Bringing new results
           • Conceptual results are more difficult than new data
    • ... Gaining new knowledge
           • New data and new results can imply new knowledge of which
             even the author is still unaware
    • ... Rewarding the scientist
           • Count whatever you can count that could have an impact.
           • Validating the scientist’s claim is the key reward.
           • Any scientist can fool any system, but (hopefully) only short-term



2   20.01.2011                       Literature and Text Mining
                               BioCreative III, Rebholz
Future of biomedical text mining

    Working towards ...

    • ... Literature integration
           • to make it a fully fledged part of bioinformatics data resources
    • ... Cross-domain support
           • to deliver the content to different scientific communities.
    • ... Provenance
           • to carry credit of findings into analytical biomedical research
    • ... Inference & Reasoning
           • to make use of the full semantic support in the scientific literature




Literature content in the Semantic Web




Terminologies vs. Ontologies




    Terminologies:
    • Database type Resource building
    • Terminologies, collection of terms
    • Automatic generation
    • Exploitation of terminological features
    • Standardisation of TM solutions
    • Interoperability with database resources

    Ontologies:
    • Ontological resources
    • Explicit semantics
    • Manual generation
    • Consistency, inference, reasoning
    • Interoperability with all semantic resources
    • Working towards a reasoning infrastructure
Efforts in the Rebholz group towards
    interoperability of literature with bioinformatics
    •    Whatizit infrastructure
           •     Biomedical NER as a public, large-scale service
    •    LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U)
           •     Biomedical terminological resource, standardisation of semantics
    •    IeXML (BioLink SIG 2006, Brazil)
           •     Put the annotations into the document (inline annotations)
    •    CALBC project
           •     Collaborative annotation of a large-scale biomedical corpus
    •    UKPMC: UK PubMed Central (collab. w. NaCTeM, BL)
           •     Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality
                 annotation services to the public
    •    SESL project
           •     Joint project with pharma & publishers, literature content in a triple store
    •    PaperMaker
           •     Validation of the scientific literature against the above


1
                 Whatizit
Integrating biomedical literature and data
    Rebholz-Schuhmann, D., et al. Text Processing through Web Services: Calling Whatizit. Bioinformatics 24, no. 2 (2008): 296-98.




2
                 BioLexicon
                   LexEBI
LexEBI: content
Group        Resource           # Labels   # Variants       Total   Total / Labels   # Unique terms   Uniq. T. / Labels
Prot./Gene   GP 7.0              516,113    4,005,040   4,521,153             8.76        1,726,853                3.35
Prot./Gene   GP 6.0              488,577    3,389,316   3,877,893             7.94        1,564,436                3.20
Chemicals    Jochem              278,578    1,691,980   1,970,558             7.07        1,527,752                5.48
Chemicals    ChEBI                19,645       94,748     114,393             5.82          101,307                5.16
Chemicals    ChEBI (all)         549,838    1,187,322   1,737,160             3.16
Other        Enzymes               4,905        8,082      12,987             2.65           12,377                2.52
Other        Species             643,280      199,130     842,410             1.31          838,135                1.30
Other        Interpro             20,671            0      20,671             1.00           20,671                1.00
UMLS         Antineuro., Neo       4,718        6,488      11,206             2.38
UMLS         Bio. Act.            54,148       87,209     141,357             2.61
UMLS         Enzymes              26,065       56,332      82,397             3.16
UMLS         Lipid, Carb.         11,518        9,770      21,288             1.85
UMLS         Pharm. Act.         104,201      123,840     228,041             2.19
UMLS         Vit., Horm.           6,877       10,258      17,135             2.49
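
The derived columns of the table appear to follow directly from the counts: Total is the sum of labels and variants, and the two ratio columns divide by the number of labels. The short Python sketch below recomputes them for three rows copied from the table. The function name `derive` is an illustrative choice, not part of LexEBI.

```python
# Rows copied from the LexEBI table: (resource, labels, variants, unique_terms).
rows = [
    ("GP 7.0", 516_113, 4_005_040, 1_726_853),
    ("ChEBI", 19_645, 94_748, 101_307),
    ("Species", 643_280, 199_130, 838_135),
]


def derive(labels, variants, unique_terms):
    """Recompute Total, Total / Labels and Uniq. T. / Labels for one row."""
    total = labels + variants
    return total, round(total / labels, 2), round(unique_terms / labels, 2)


derived = {name: derive(l, v, u) for name, l, v, u in rows}

for name, (total, per_label, unique_per_label) in derived.items():
    print(f"{name:8s} total={total:>9,} total/labels={per_label:.2f} "
          f"unique/labels={unique_per_label:.2f}")
```

A high Total / Labels ratio (GP 7.0) indicates many term variants per entry, whereas a ratio close to 1 (Species, Interpro) indicates few or none.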

3
                  IeXML
IeXML: Annotating entities in text


     • Inline annotations: any part of the document can carry
       annotations
     • No hassle with character or byte counts or layout
       modifications to the document
     • “Alignment” of annotated documents to
           • Compare annotations
           • Validate annotations
           • Harmonise annotations (SESL project)
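
The idea of aligning inline-annotated documents can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the flat `<e id="...">` tag and the identifiers used here are hypothetical stand-ins and are not taken from the IeXML schema, and nested annotations are not handled. Offsets are computed from the text itself, so nobody has to maintain character counts by hand.

```python
import re

# Hypothetical flat inline tag; the real IeXML markup may differ.
TAG = re.compile(r'<e id="([^"]+)">(.*?)</e>')


def extract(annotated):
    """Return (plain_text, [(start, end, id), ...]) for an inline-annotated string."""
    parts, spans = [], []
    pos = 0      # read position in the annotated string
    offset = 0   # write position in the plain text
    for m in TAG.finditer(annotated):
        before = annotated[pos:m.start()]
        parts.append(before)
        offset += len(before)
        spans.append((offset, offset + len(m.group(2)), m.group(1)))
        parts.append(m.group(2))
        offset += len(m.group(2))
        pos = m.end()
    parts.append(annotated[pos:])
    return "".join(parts), spans


def align(doc_a, doc_b):
    """Compare two annotated versions of the same underlying text."""
    text_a, spans_a = extract(doc_a)
    text_b, spans_b = extract(doc_b)
    if text_a != text_b:
        raise ValueError("documents differ in their underlying text")
    a, b = set(spans_a), set(spans_b)
    return {"agree": sorted(a & b), "only_a": sorted(a - b), "only_b": sorted(b - a)}


doc_a = ('Mutations in <e id="gene:BRCA1">BRCA1</e> raise the risk of '
         '<e id="disease:breast_cancer">breast cancer</e>.')
doc_b = 'Mutations in <e id="gene:BRCA1">BRCA1</e> raise the risk of breast cancer.'
result = align(doc_a, doc_b)
print(result)
```

Annotations on which both versions agree could feed a harmonised version, while the differences point to spans that need validation.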




4
                  CALBC
The challenge
    • 150,000 documents or more ...
    • Test set for all systems
    • Assessment, benchmarking


CALBC Challenge II


(1) 75,000 documents training data
(2) 175,000 testing data
(3) Additional 700,000 testing data

•    September 13th 2010:
     Second harmonized corpus available for CALBC
     Challenge II
•    December 15th, 2010: Challenge II closes
•    March 2011: CALBC Workshop II
•    June 30th, 2011:
     Final harmonized corpus available

5
     UKPMC/Elixir
UKPMC




                  ~10% of the size of PubMed
6
                  SESL
SESL Project: from publisher to pharma
[Diagram: SESL architecture between content suppliers and consumers. Labels recovered from the figure:]
    • Multiple Consumers
    • Knowledge Applications; Disease Dossier
    • Service Layer (RDF, Web 2.0)
    • Assertions, SPARQL, Triple Store
    • Integration, Inference, Reasoning
    • Sharing of data
    • Open Standards; Std Public Vocabularies; Business Rules
    • Common Service Broker
    • Content Suppliers
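
To make the "Assertions, SPARQL, Triple Store" layer more concrete, here is a minimal Python sketch of assertions stored as (subject, predicate, object) triples and queried by pattern. It only imitates the idea: the `TripleStore` class, the predicates and the identifiers are hypothetical, and a real deployment would use an RDF store queried with SPARQL rather than this in-memory set.

```python
class TripleStore:
    """A tiny in-memory set of (subject, predicate, object) assertions."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


store = TripleStore()
# Hypothetical assertions: two from the literature, one from a data resource.
store.add("doc:1", "mentions", "gene:X")
store.add("doc:2", "mentions", "gene:Y")
store.add("gene:X", "associatedWith", "disease:D")

# Which documents mention a gene that is associated with disease:D?
genes = {s for s, _, _ in store.match(predicate="associatedWith", obj="disease:D")}
docs = sorted(s for s, _, o in store.match(predicate="mentions") if o in genes)
print(docs)
```

The two-step lookup corresponds to a join over two triple patterns, which is the kind of integration across literature and data resources that the diagram places in the service layer.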
Literature content in the Semantic Web




7
      PaperMaker
PaperMaker - Overview

• PaperMaker - a tool to support authors writing biomedical
  papers:
       • Interactive feedback on the contents of papers (related
         work and concept annotations)
       • Formal consistency criteria checking (spelling,
         terminology, acronyms, references)




30.03.2009               Literature and Text Mining
                   BioCreative III, Rebholz
Consistency parameters

Domain-independent


•    General spelling and grammar
•    General readability
•    Appropriate use of references
•    Finding and acknowledging related work
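
As one illustration of a domain-independent check, the Python sketch below compares the citations used in the body of a manuscript with its reference list. It is a minimal sketch, assuming numeric citations in square brackets such as [3] or [1, 2]; the function name `check_references` is hypothetical and the slides do not describe how PaperMaker implements this check.

```python
import re


def check_references(body, reference_numbers):
    """Compare numeric citations like [3] or [1, 2] in the body with the reference list."""
    cited = set()
    for group in re.findall(r"\[(\d+(?:\s*,\s*\d+)*)\]", body):
        cited.update(int(n) for n in group.split(","))
    listed = set(reference_numbers)
    return {
        "missing_entries": sorted(cited - listed),  # cited but not in the list
        "never_cited": sorted(listed - cited),      # listed but never cited
    }


body = "Text mining services exist [1]. Lexicons help [2, 4]."
report = check_references(body, [1, 2, 3])
print(report)
```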




Consistency parameters

Domain-specific

• The use of terminology:

       • Should be consistent with domain-specific naming guidelines
       • Should not be ambiguous
       • Should conform to the conventional usage (possible clashes
         between naming guidelines and common-sense convention)
       • Useful to resolve terminology to reference databases (e.g.
         UniProt for protein names, ChEBI for chemical entities, etc.)
       • The special case of acronyms
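
Resolving terminology to reference databases and flagging ambiguity can be pictured as a lexicon lookup. The Python sketch below is a toy illustration under loud assumptions: the lexicon, the function name `resolve_terms` and the identifiers are hypothetical placeholders (not real UniProt or ChEBI accessions), and only single-word terms are matched.

```python
import re

# Hypothetical toy lexicon: lower-cased term -> candidate database identifiers.
# The identifiers are placeholders, not real UniProt or ChEBI accessions.
LEXICON = {
    "insulin": ["UNIPROT:EXAMPLE_0001", "CHEBI:EXAMPLE_0001"],
    "glucose": ["CHEBI:EXAMPLE_0002"],
}


def resolve_terms(text, lexicon=LEXICON):
    """Split the known terms of a text into uniquely resolved and ambiguous ones."""
    report = {"resolved": {}, "ambiguous": {}}
    for token in re.findall(r"[A-Za-z0-9-]+", text.lower()):
        ids = lexicon.get(token)
        if not ids:
            continue  # unknown word, nothing to resolve
        bucket = "resolved" if len(ids) == 1 else "ambiguous"
        report[bucket][token] = ids
    return report


report = resolve_terms("Insulin lowers blood glucose.")
print(report)
```

A term listed under "ambiguous" is one the author could be asked to disambiguate, for example by choosing the intended database entry.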




Content feedback

• Resolving the contents to literature repositories
       • Finding related work (document retrieval)
       • Finding related ideas (passage retrieval)

• Resolving the contents to ontological reference
  databases
       • MeSH descriptors have been demonstrated to improve
         biomedical information retrieval. Can we suggest MeSH terms
         directly to the authors?
       • Gene Ontology (GO) terms are increasingly used in information
         extraction systems.
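
Finding related work can be approximated by ranking documents in a repository by their similarity to the draft. The Python sketch below uses TF-IDF weighting with cosine similarity as a stand-in technique; the slides do not state which retrieval method PaperMaker uses, and the function names and the three-document corpus are illustrative only. Passage retrieval would work the same way with passages in place of whole documents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_related(draft, corpus):
    """Rank corpus document ids by TF-IDF cosine similarity to the draft."""
    docs = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for counts in docs.values() for term in counts)

    def weigh(counts):
        # Smoothed IDF, so unseen or ubiquitous terms still get a finite weight.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    query = weigh(Counter(tokenize(draft)))
    scores = {doc_id: cosine(query, weigh(counts)) for doc_id, counts in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


corpus = {
    "a": "protein interaction extraction from biomedical text",
    "b": "quarterly earnings report for shareholders",
    "c": "gene ontology annotation of proteins",
}
ranking = rank_related("extracting protein interaction from text", corpus)
print(ranking[0])
```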




PaperMaker workflow




Conclusions

• PaperMaker can help the author conform to the formal
  requirements of paper writing with special emphasis on
  the domain

• It also provides feedback on the contents by relating them to
  reference resources and literature repositories

• It may improve the indexing of a paper in literature
  repositories (less ambiguous terminology)

• http://www.ebi.ac.uk/Rebholz-srv/PaperMaker
  Work in progress 



8
                  Summary

Efforts in the Rebholz group towards
     interoperability of literature with bioinformatics
     •    Whatizit infrastructure
            •     Biomedical NER as a public, large-scale service
     •    LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U)
            •     Biomedical terminological resource, standardisation of semantics
     •    IeXML (BioLink SIG 2006, Brazil)
            •     Put the annotations into the document (inline annotations)
     •    CALBC project
            •     Collaborative annotation of a large-scale biomedical corpus
     •    UKPMC: UK PubMed Central (collab. w. NaCTeM, BL)
            •     Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality
                  annotation services to the public
     •    SESL project
            •     Joint project with pharma & publishers, literature content in a triple store
     •    PaperMaker
            •     Validation of the scientific literature against the above



PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

  • 1. PaperMaker: Validation of biomedical scientific publications January 19th, 2011 Workshop: „BeyondThePdf“ Dietrich Rebholz-Schuhmann, MD, PhD Group Leader Rebholz Group European Bioinformatics Institute
  • 2. Publishing is about … • ... Agreeing / disagreeing about current science • Only peer review can judge current science • ... Bringing new results • Conceptual results are more difficult than new data • ... Gaining new knowledge • New data and new results can imply new knowledge where even the author is still unaware of • ... Rewarding the scientist • Count whatever you can count that could have an impact. • Validating the scientist’s claim is the key reward. • Any scientist can fool any system, but (hopefully) only short-term 2 20.01.2011 Literature and Text Mining BioCreative III, Rebholz
  • 3. Future of biomedical text mining Working towards ... • ... Literature integration • to have it full fledged as part of bioinformatics data resources • ... Cross-domain support • to deliver the content to different scientific communities. • ... Provenance • to carry credit of findings into analytical biomedical research • ... Inference & Reasoning • to make use of the full semantic support in the scientific literature 3 20.01.2011 Literature and Text Mining BioCreative III, Rebholz
  • 4. Literature content in the Semantic Web 4 20.01.2011 Literature and Text Mining
  • 5. Terminologies vs. Ontologies Ontological resources Database type Resource building Explicit semantics Terminologies, collection of terms Manual generation Automatic generation Consistency, inference, reasoning Exploitation of terminological features Interoperability with all semantic Standardisation of TM solutions resources Interoperability with database Working towards a reasoning resources infrastructure 5 Literature and Text Mining
  • 6. Efforts in the Rebholz group towards interoperability of literature with bioinformatics • Whatizit infrastructure • Biomedical NER as a public, large-scale service • LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U) • Biomedical terminological resource, standardisation of semantics • IeXML (BioLink SIG 2006, Brasil) • Put the annotations into the document (inline annotations) • CALBC project • Collaborative annotation of a large-scale biomedical corpus • UKPMC: U.K. Pubmed Central (collab. w. NaCTeM, BL) • Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality annotation services to the public • SESL project • Joint project with pharma & publishers, literature content in a triple store • PaperMaker • Validation of the scientific literature against the above 6 20.01.2011 Literature and Text Mining BioCreative III, Rebholz
  • 7. 1 Whatizit
  • 8. Integrating biomedical literature and data Rebholz-Schuhmann, D., et al. Text Processing through Web Services: Calling Whatizit. Bioinformatics 24, no. 2 (2008): 296-98.
  • 9. 2 BioLexicon LexEBI
  • 10. LexEBI: content
      Group | Resource | # Labels | # Variants | Total | Total / Labels | # Unique terms | Uniq. T. / Labels
      Prot. / Gene | GP 7.0 | 516,113 | 4,005,040 | 4,521,153 | 8.76 | 1,726,853 | 3.35
      Prot. / Gene | GP 6.0 | 488,577 | 3,389,316 | 3,877,893 | 7.94 | 1,564,436 | 3.20
      Chemicals | Jochem | 278,578 | 1,691,980 | 1,970,558 | 7.07 | 1,527,752 | 5.48
      Chemicals | ChEBI | 19,645 | 94,748 | 114,393 | 5.82 | 101,307 | 5.16
      Chemicals | ChEBI (all) | 549,838 | 1,187,322 | 1,737,160 | 3.16 | |
      Other | Enzymes | 4,905 | 8,082 | 12,987 | 2.65 | 12,377 | 2.52
      Other | Species | 643,280 | 199,130 | 842,410 | 1.31 | 838,135 | 1.30
      Other | Interpro | 20,671 | 0 | 20,671 | 1.00 | 20,671 | 1.00
      UMLS | Antineuro., Neo | 4,718 | 6,488 | 11,206 | 2.38 | |
      UMLS | Bio. Act. | 54,148 | 87,209 | 141,357 | 2.61 | |
      UMLS | Enzymes | 26,065 | 56,332 | 82,397 | 3.16 | |
      UMLS | Lipid, Carb. | 11,518 | 9,770 | 21,288 | 1.85 | |
      UMLS | Pharm. Act. | 104,201 | 123,840 | 228,041 | 2.19 | |
      UMLS | Vit., Horm. | 6,877 | 10,258 | 17,135 | 2.49 | |
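The derived columns of the LexEBI table above follow from the counts: Total is # Labels plus # Variants, and each ratio divides by # Labels. The short Python check below recomputes them for rows taken from the slide; the function name `derived_columns` and the dictionary keys are illustrative assumptions only.

```python
def derived_columns(labels, variants, unique_terms=None):
    """Recompute Total, Total / Labels and Uniq. T. / Labels for one table row."""
    total = labels + variants
    row = {"total": total, "total_per_label": round(total / labels, 2)}
    # Some rows on the slide report no unique-term count; skip that ratio then.
    if unique_terms is not None:
        row["unique_per_label"] = round(unique_terms / labels, 2)
    return row

# Counts copied from the GP 7.0, ChEBI and ChEBI (all) rows of the slide.
gp7 = derived_columns(516_113, 4_005_040, 1_726_853)
chebi = derived_columns(19_645, 94_748, 101_307)
chebi_all = derived_columns(549_838, 1_187_322)
```

Read this way, the Total / Labels column is the average number of term forms (label plus variants) per entry, which is why the gene and protein resources stand out against Species or Interpro.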
  • 11. 3 IeXML
  • 12. IeXML: Annotating entities in text • Inline annotations can be attached to any part of the document • No hassle with character or byte counts or layout modifications to the document • “Alignment” of annotated documents to: • Compare annotations • Validate annotations • Harmonise annotations (SESL project)
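The idea of inline annotation and of aligning two annotated versions of one document can be illustrated with a minimal Python sketch. The `<e id="...">` tag shape, the two-entry lexicon with its `demo:` identifiers, and all function names are assumptions made for illustration; they are not the actual IeXML schema or the Whatizit pipeline.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mini-lexicon: lowercase surface form -> made-up identifier.
LEXICON = {"insulin": "demo:0001", "glucose": "demo:0002"}

def annotate_inline(text, lexicon=LEXICON):
    """Wrap every lexicon term found in text in an inline <e id="..."> element."""
    if not lexicon:
        return escape(text)
    # Longest terms first so that longer matches win over their substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    out, last = [], 0
    for m in pattern.finditer(text):
        out.append(escape(text[last:m.start()]))
        ident = lexicon[m.group(0).lower()]
        out.append("<e id=%s>%s</e>" % (quoteattr(ident), escape(m.group(0))))
        last = m.end()
    out.append(escape(text[last:]))
    return "".join(out)

def extract_annotations(annotated):
    """Return the set of (identifier, surface form) pairs in annotated text."""
    return set(re.findall(r'<e id="([^"]*)">(.*?)</e>', annotated))

def compare(annotated_a, annotated_b):
    """Compare two annotated versions of the same document."""
    a, b = extract_annotations(annotated_a), extract_annotations(annotated_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}
```

Because the annotation travels inside the text, no character or byte offsets have to be kept in sync when the layout changes; comparing two annotators reduces to comparing the tagged spans.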
  • 13. 4 CALBC
  • 14. The challenge • 150,000 documents or more ... • Test set for all systems • Assessment, benchmarking
  • 15. CALBC Challenge II • (1) 75,000 documents training data • (2) 175,000 testing data • (3) Additional 700,000 testing data • September 13th, 2010: Second harmonized corpus available for CALBC Challenge II • December 15th, 2010: Challenge II closes • March 2011: CALBC Workshop II • June 30th, 2011: Final harmonized corpus available
  • 16. 5 UKPMC/Elixir
  • 18. UKPMC: ~10% of the size of PubMed
  • 19. 6 SESL
  • 20. SESL Project: from publisher to pharma [Architecture diagram; labels: Multiple Consumers • Disease Knowledge Dossier • Applications • Service Layer (RDF, Web 2.0) • Std Public Vocabularies • Open Standards • Common Service Broker • Assertions, SPARQL, Triple Store • Integration, Inference, Reasoning • Sharing of data • Business Rules • Content Suppliers]
  • 21. Literature content in the Semantic Web
  • 22. 7 PaperMaker
  • 23. PaperMaker - Overview • PaperMaker: a tool to support authors writing biomedical papers • Interactive feedback on the contents of papers (related work and concept annotations) • Formal consistency criteria checking (spelling, terminology, acronyms, references) 30.03.2009 Literature and Text Mining BioCreative III, Rebholz
  • 24. Consistency parameters Domain-independent • General spelling and grammar • General readability • Appropriate use of references • Finding and acknowledging related work
  • 25. Consistency parameters Domain-specific • The use of terminology: • Should be consistent with domain-specific naming guidelines • Should not be ambiguous • Should conform to conventional usage (possible clashes between naming guidelines and common-sense convention) • Useful to resolve terminology to reference databases (e.g. UniProt for protein names, ChEBI for chemical entities, etc.) • The special case of acronyms
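One of the domain-specific checks listed above, the special case of acronyms, could be approximated as in the following Python sketch. It rests on loud assumptions: an acronym counts as introduced only if it appears in parentheses somewhere in the text, the two regular expressions are a crude heuristic, and the function name `undefined_acronyms` is hypothetical. Nothing here is taken from the PaperMaker implementation.

```python
import re

# Heuristic: an acronym is a token of two or more capitals/digits, e.g. GO, NER.
ACRONYM = re.compile(r"\b[A-Z][A-Z0-9]{1,}\b")
# Heuristic: an acronym is introduced when given in parentheses, e.g. "(GO)".
DEFINITION = re.compile(r"\(([A-Z][A-Z0-9]{1,})\)")

def undefined_acronyms(text):
    """Return, sorted, the acronyms used in text but never introduced in parentheses."""
    defined = set(DEFINITION.findall(text))
    used = set(ACRONYM.findall(text))
    return sorted(used - defined)

sample = ("Gene Ontology (GO) terms are widely used. "
          "NER and UMLS appear without being introduced, unlike GO.")
flagged = undefined_acronyms(sample)
```

A real checker would also need the reverse direction (resolving the long form to a reference database) and would have to tolerate mixed-case names, which this pattern deliberately ignores.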
  • 26. Content feedback • Resolving the contents to literature repositories • Finding related work (document retrieval) • Finding related ideas (passage retrieval) • Resolving the contents to ontological reference databases • MeSH descriptors have been demonstrated to improve biomedical information retrieval. Can we suggest MeSH terms directly to the authors? • Gene Ontology (GO) terms are increasingly used in information extraction systems.
  • 27. PaperMaker workflow
  • 32. Conclusions • PaperMaker can help the author conform to the formal requirements of paper writing, with special emphasis on the domain • It also provides feedback on the contents by relating them to reference resources and literature repositories • It may improve the indexing of a paper in literature repositories (less ambiguous terminology) • http://www.ebi.ac.uk/Rebholz-srv/PaperMaker • Work in progress
  • 33. 8 Summary
  • 34. Efforts in the Rebholz group towards interoperability of literature with bioinformatics • Whatizit infrastructure • Biomedical NER as a public, large-scale service • LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U) • Biomedical terminological resource, standardisation of semantics • IeXML (BioLink SIG 2006, Brazil) • Put the annotations into the document (inline annotations) • CALBC project • Collaborative annotation of a large-scale biomedical corpus • UKPMC: U.K. PubMed Central (collab. w. NaCTeM, BL) • Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality annotation services to the public • SESL project • Joint project with pharma & publishers, literature content in a triple store • PaperMaker • Validation of the scientific literature against the above