RDF for PubMedCentral


Published on

we present our approach to the generation of self-describing machine-readable scholarly documents. We understand the scientific document as an entry point and interface to the Web of Data. We have semantically processed the full-text, open-access subset of PubMed Central. Our RDF model and resulting dataset make extensive use of existing ontologies and semantic enrichment services.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Not limited to open access modelsNot limited to closed business models
  • In spite of the advances, scientific publications remain poorly connected to each other as well as to external resources. Furthermore, most of the information remains locked up in discrete documents without machine-processable content.
  • 3rd point : As easy as building mash-ups
  • 4th point then The paper becomes an
  • Distribution of the first 20 journals, corresponding to about 40% of the total of 270,834 Number of biological entities identified across the papers
  • SPARQL queries can be used to retrieve metadata and content. It is possible to specify words and sentences that should appear in the text or in the section title.
  • As content has been semantically enriched, it is possible to retrieve articles based on either the annotated terms, e.g., “mixture,” or their corresponding biological entities, e.g., CHEBI:60004.
  • Users search for a human gene names; the term is initially resolved against GeneWiki, the associated UniProt accession is then used in the SPARQL query. The resulting set includes publication metadata, abstract, and a cloud of annotations. b) Enriched content based on annotations is displayed on the interactive zone; this may be the annotated paragraph, a chemical entity, or protein related information and so on.
  • Graph-based retrieval for the terms “catalase”; only shared terms with more than 30 associated biological terms are included in the results. We want to visualoize the network of interconected documents. How is a document related to another document, what do they share.
  • What are the challenges we have face so far in the PMC rdfication case? Well a first challenge is related to content, tables and images are not part of the core RDF but are referenced as links so we still have access to them. However, there are some inline tables whose files and columns are described in XML; so far we recover the content but not the format so we are missing information there. The supplementary material is still difficult to integrate, it could be a Section, another paper, a technical report following a total different schema, can be outside PMC, can be a footnote... Also, most of the documents follow the same schema, but we have found some few with a different one, those cannot be currently processed. Those cases are less than the 5%.As for the references, we have at least 4 different styles, so the process is different in each case and some times we get references that are just plain text so we have to do some extra process in order to get the metadata associated.And the annotators, well, they are web services so they can be down, or too busy. Also, the stop words are tricky, we have tried to avoid them as much as possible but it is always possible to get some noisy terms in the annotations.Nota: stop words son palabrascomo “they” “can” que son evitadaspor los anotadorespor ser muycomunes o muycortas. Peroesdifícilcubrirlastodas, tenemoslasmáscomunes. Otracuestiónesquepuede ser queperdamosinformaciónporquealgunosacrónimospuede ser iguales a un stop word. Porejemplo CAN es un acrónimousado en CHEBI perocomoesunapalabracomún la evitamos, si en algún paper CAN esusado en el contexto de CHEBI y no del verboauxiliar, estaríamosperdiendoesainformación.
  • Methods&materials, what Olga is working onCollaborative element… still do not know but something should be done there
  • RDF for PubMedCentral

    1. 1. 2/14/2014 Biotea, RDF4PMC RDF4PMC, RDFizing PubMed Central Alexander Garcia1, Leyla Jael García Castro2, Casey McLaughlin1 1Florida State University 2Universitat Jaume I 1
    2. 2. The Biotea project Why Semantic Web Technologies? RDF4PMC in a nutshell Architecture RDFization process • • • • PMC RDFization Content enrichment Some numbers for RDF4PMC Architecture • Using the data • • • • • • • • • SPARQL Bio2RDF integration Web services A first prototype Challenges and Lessons Currently working on… Future Work Conclusions Acknowledgments Biotea, RDF4PMC • • • • • 2/14/2014 Outline 2
    3. 3. Christine L. Borgman • Methodologies, methods and techniques supporting semantic enrichment of scholarly communication • Once enriched, then how is this changing our user experience? Biotea, RDF4PMC Scholarly data and documents are of most value when they are interconnected rather than independent 2/14/2014 Biotea 3
    4. 4. Biotea • How are publications connected to each other? • Putting together explicit assertions from different papers to form new implicit assertions • Semantic Web Technology supporting scholarly communication, Literature Based Discovery and the SearchRetrieval-and-Interacting-with-the-Document (SRID) processes Biotea, RDF4PMC Christine L. Borgman 2/14/2014 Scholarly data and documents are of most value when they are interconnected rather than independent 4
    5. 5. • Retrieve all papers that have a component X (CHEBI) and the cellular location in GO terms Biotea, RDF4PMC • Generates an adaptable open approach, the data becomes the platform • The SW delivers an integrative platform • Makes it easier for the community to build over the platform • Simplifies programmatic access to information 2/14/2014 Why SWT for research documents • As simple as relating terminologies • Delivers Social Network ready content 5
    6. 6. Biotea, RDF4PMC • Delivers an interoperable, interlinked, and selfdescribing document model in the biomedical domain. • A network of interconnected documents • Semantic infrastructure for PMC • An interface to the Web of Data • A knowledge model for biomedical literature – easily extendible 2/14/2014 RDF4PMC in a nutshell 6
    7. 7. Biotea, RDF4PMC • RDFizing biomedical literature by orchestrating ontologies such as • DoCO, BIBO, DC, FOAF, W3CPROV, and others • Datasets are available • RDF for metadata and content • RDF for annotations from text-mining • RDFizator will be available • Adding other ontologies and annotators is possible • Working with XML from other sources is possible 2/14/2014 RDF4PMC in a nutshell 7
    8. 8. PMC RDFization RDF Generation Biotea, RDF4PMC References Enrichment 2/14/2014 Metadata+ Content + References RDFReactor PMC NXML 8
    9. 9. 9 Biotea, RDF4PMC 2/14/2014
    10. 10. Annotations: Content Enrichment Biotea, RDF4PMC 2/14/2014 Enriched RDF RDF Generation Automatic Annotation Web service Metadata+ Content + References Web service 10
    11. 11. 11 Biotea, RDF4PMC 2/14/2014
    12. 12. Biotea, RDF4PMC 2/14/2014 RDF4PMC, some numbers 12
    13. 13. RDF4PMC Server Architecture RDF DB Slave RDF DB Master Master Server Import scripts + RDF files PMC RDFization Web & SPARQL Server (development) RDF DB Slave Web & SPARQL Server (production)
    14. 14. Consuming the data: SPARQL Query expressed in natural SPARQL query language Retrieving PubMed ?article a bibo:Document ; bibo:pmid ?pmid ; identifier, article title, dcterms:title ?title . section title, and ?section a doco:Section ; paragraphs for those dcterms:isPartOf ?article ; dcterms:title ?secTitle . Biotea, RDF4PMC WHERE { 2/14/2014 SELECT ?pmid ?title ?secTitle ?text  articles containing the FILTER (regex(str(?secTitle), "introduction", "i")). ?para a doco:Paragraph ; dcterms:isPartOf ?section ; term “cancer” in any section whose title cnt:chars ?text . FILTER (regex(str(?text), "cancer", "i")). } LIMIT 50 includes “introduction” 14
    15. 15. Consuming the data: SPARQL Query expressed in natural Retrieving PubMed identifier SELECT distinct ?pmid for those articles that have WHERE { been semantically annotated ?article a bibo:AcademicArticle ; with the biological entity bibo:pmid ?pmid .  Biotea, RDF4PMC language 2/14/2014 SPARQL query CHEBI:60004. The semantic ?annotation a aot:ExactQualifier ; annotation comes from the ao:annotatesResource ?article ; occurrence of the term ao:hasTopic <http://purl.obolibrary.org/obo/CHEBI_60004> . “mixture” in any paragraph } of the retrieved articles. CHEBI:60004 A mixture is a chemical substance composed of multiple molecules, at least two of which are of a different kind 15
    16. 16. Annotations Biotea, RDF4PMC 2/14/2014 Content Metadata & References Bio2RDF Integration 16
    17. 17. Consuming the data: Web services Retrieval Service A list of topics and their related vocabularies http://biotea.idiginfo.org/api/topics All topics related to a term e.g., http://biotea.idiginfo.org/api/topics?term=cancer All vocabularies related to a term e.g., http://biotea.idiginfo.org/api/vocabularies?term=cancer All terms that start with a specific string (for autocompletion) e.g.,http://biotea.idiginfo.org/api/terms?prefix=canc All topics related to a vocabulary e.g., http://biotea.idiginfo.org/api/topics?vocabulary=po RDF of articles that include a term e.g., http://biotea.idiginfo.org/api/articles?term=cancer Count of RDF of articles that include a term e.g., http://biotea.idiginfo.org/api/articles?term=cancer&count=true 2/14/2014 http://biotea.idiginfo.org/api/terms Biotea, RDF4PMC A list of terms and their related topics 17 A list of vocabularies and their prefixes http://biotea.idiginfo.org/vocabularies RDF of articles that include a vocabulary e.g., http://biotea.idiginfo.org/api/articles?vocabulary=po
    18. 18. Semantically enriched publication Metadata+ Content + References Automatically Annotated RDF Biotea, RDF4PMC 2/14/2014 Consuming the data: a dashboard for semantic bio-publications SPARQL 18 Catalase
    19. 19. Consuming the data: first prototype Cloud of Bio-annotations (term + # of bio-entities) 2/14/2014 Title & authors Biotea, RDF4PMC Links Abstract Paragraphs containing the annotation selected by the user Graphical tools 19
    20. 20. Biotea, RDF4PMC 2/14/2014 Consuming the data: A first prototype 20
    21. 21. Challenges and Lessons Tables and images  Links Inline tables  Format is lost Supplementary material Most of them follow one DTD but … • References • At least 4 different styles • Some times are just plain text Biotea, RDF4PMC • • • • 2/14/2014 • Content • Annotators • Not always available • Stop words are tricky 21
    22. 22. Challenges and Lessons • Annotation is context dependent Biotea, RDF4PMC • Delivering the expressivity of the data set to the end user is a complex issue 2/14/2014 • Where are the facts? How to validate the facts? • Maintaining the triplet store has a learning curve of its own • Building SW infrastructure is H A R D 22
    23. 23. Currently working on: Literature Discovery Process • Search • Usually string-based search mechanisms • Little cognitive support • Retrieval • Simple list of DB entries • Little cognitive support • Interacting with the document • Straight into the PDF • Zero cognitive support • Data availability
    24. 24. Currently working on: Literature Discovery Process • Search • Usually string-based search mechanisms • Little cognitive support • Retrieval • Simple list of DB entries • Little cognitive support • How, why and where are a set of documents similar? • Interacting with the document • Straight into the PDF • Zero cognitive support
    25. 25. Currently working on: Literature Discovery Process • Search • Usually string-based search mechanisms • Little cognitive support • Retrieval • Simple list of DB entries • Little cognitive support • Interacting with the document • Straight into the PDF • Zero cognitive support
    26. 26. Future Work • User Experience • • • • Web services for data analysis RDF browser More visualization tools Supporting and taking advantage of the structure of the document • Collaborative element Biotea, RDF4PMC • URI standardization following similar patterns to identifiers.org and Bio2RDF • Integration into Bio2RDF • Dataset identification and summary (void) • Improve data for references 2/14/2014 • RDF 29
    27. 27. Future Work Biotea, RDF4PMC • From PDF to XML to RDF to Enriched Metadata for the PDF • The PDF is gently introduced in the WoD • Once the metadata has been enriched then 2/14/2014 • Application in Clinical Psychology, the MSRC case • Rich interaction supporting: SEARCH-RETRIEVALINTERACTION WITH THE DOCUMENT (PDF) 30
    28. 28. Conclusions • New vocabularies as well as annotators can easily be plugged in • Our approach is useful for both open and non-open access datasets Biotea, RDF4PMC • the transformation into RDF from the original PMC files • the annotation of the RDF • an API which makes that data available. 2/14/2014 • We provide • Publishers may decide what to expose via RDF and what content to make available • Our approach is also applicable for PDF-only environments 31
    29. 29. The MSRC consortium Greg Riccardi, FSU Oscar Corcho, UPM Olga Giraldo, UPM Bob Morris, Harvard University Michel Dumontier, Carleton University Dietrich Rebholz-Schuhmann, University of Zurich Diane Leiva, FSU US DoD Grant MOMRP Grant w81xwh-10-2-0181 All of those who gave us feedback about the RDFization and the quality of our RDF datasets Biotea, RDF4PMC • • • • • • • • • • 2/14/2014 Acknowledgments 32
    30. 30. Contacts • Alexander García: agarciac@gmail.com • L. Jael García Castro: leylajael@gmail.com Biotea, RDF4PMC 2/14/2014 Thanks for you attention 33