We need to start understanding documents within an electronic, machine-processable environment. Such a conception goes beyond PDF and HTML; it entails, I argue, understanding the document as a fluid aggregator.
The International Federation of Library Associations and Institutions (IFLA) is responsible for the development and maintenance of International Standard Bibliographic Description (ISBD), UNIMARC, and the "Functional Requirements" family for bibliographic records (FRBR), authority data (FRAD), and subject authority data (FRSAD). ISBD underpins the MARC family of formats used by libraries world-wide for many millions of catalog records, while FRBR is a relatively new model optimized for users and the digital environment. These metadata models, schemas, and content rules are now being expressed in the Resource Description Framework language for use in the Semantic Web.
This webinar provides a general update on the work being undertaken. It describes the development of an Application Profile for ISBD to specify the sequence, repeatability, and mandatory status of its elements. It discusses issues involved in deriving linked data from legacy catalogue records based on monolithic and multi-part schemas following ISBD and FRBR, such as the duplication which arises from copy cataloging and FRBRization. The webinar provides practical examples of deriving high-quality linked data from the vast numbers of records created by libraries, and demonstrates how a shift of focus from records to linked-data triples can provide more efficient and effective user-centered resource discovery services.
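The shift from records to triples described above can be sketched in plain Python (this is an illustration, not an IFLA tool): two copy-cataloged records describing the same manifestation collapse into a single set of statements, which is how triple-based data avoids the duplication that record-based copy cataloging produces. The identifiers and element names are hypothetical.

```python
# Minimal sketch: flatten catalog records into (subject, predicate, object)
# triples and merge them as a set, so shared statements appear only once.

def record_to_triples(record):
    """Flatten a catalog record (dict) into a set of triples."""
    subject = record["uri"]
    return {(subject, pred, value)
            for pred, value in record.items() if pred != "uri"}

# Two copy-cataloged records for the same book (values are illustrative).
rec_a = {"uri": "urn:isbn:9780000000001",
         "isbd:title": "Example Title", "isbd:publisher": "Example Press"}
rec_b = {"uri": "urn:isbn:9780000000001",
         "isbd:title": "Example Title", "isbd:placeOfPublication": "London"}

# A triple store behaves like a set: merging keeps each statement once.
graph = record_to_triples(rec_a) | record_to_triples(rec_b)
print(len(graph))  # 3 distinct statements, not 4 fields across 2 records
```

The duplicated title statement is stored once, while each record's unique elements survive; the same idea scales to deduplicating millions of copy-cataloged records.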
MR^3: Meta-Model Management based on RDFs Revision Reflection
Takeshi Morita
We propose a tool to manage several kinds of relationships between RDF and RDFS. Our tool consists of three main functions: graphical editing of RDF content, graphical editing of RDFS content, and a meta-model management facility. The meta-model management facility supports maintaining the relationship between RDF and RDFS content. These facilities are implemented on a plug-in system. We provide basic plug-in modules for consistency checking of RDFS classes and properties. The prototype tool, called MR^3 (Meta-Model Management based on RDFs Revision Reflection), is implemented in Java. Through an experiment using MR^3, we show how MR^3 contributes to the Semantic Web paradigm from the standpoint of RDFs content management.
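The kind of consistency check mentioned in the abstract can be illustrated with a toy sketch (this is not MR^3's actual plug-in API; all names and data are hypothetical): verify that every use of a property in the RDF layer respects the rdfs:domain declared in the RDFS layer.

```python
# Toy RDFS domain check: flag triples whose subject's type conflicts
# with the property's declared domain.

rdfs_domains = {"ex:teaches": "ex:Professor"}       # schema (RDFS) layer
rdf_types = {"ex:alice": "ex:Professor",
             "ex:bob": "ex:Student"}                # instance typing
rdf_triples = [("ex:alice", "ex:teaches", "ex:sw101"),
               ("ex:bob", "ex:teaches", "ex:ai101")]

def domain_violations(triples, domains, types):
    """Return triples whose subject's type conflicts with the property's domain."""
    return [(s, p, o) for s, p, o in triples
            if p in domains and types.get(s) != domains[p]]

violations = domain_violations(rdf_triples, rdfs_domains, rdf_types)
print(violations)  # the ex:bob triple violates the declared domain
```

A real checker would also handle rdfs:range, class hierarchies, and inference, but the core of the check is this comparison between instance-level use and schema-level declaration.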
Research Data Sharing: A Basic Framework
Paul Groth
Some thoughts on thinking about data sharing. Prepared for the 2016 LERU Doctoral Summer School - Data Stewardship for Scientific Discovery and Innovation.
http://www.dtls.nl/fair-data/fair-data-training/leru-summer-school/
How to use an index to highlight social networks in historical digital corpora?
Presented at Digital Humanities, 6 July 2006 (Paris).
Note: it is a little dated...
National Workshop on Research Methodology, Statistical Analysis and Stress Management
Organized by the Panjab University Campus Students Council (PUCSC) in collaboration with the Centre for Public Health, Panjab University, Chandigarh
SA2: Text Mining from User Generated Content
John Breslin
ICWSM 2011 Tutorial
Lyle Ungar and Ronen Feldman
The proliferation of documents available on the Web and on corporate intranets is driving a new wave of text mining research and application. Earlier research addressed extraction of information from relatively small collections of well-structured documents such as newswire or scientific publications. Text mining from other corpora such as the Web requires new techniques drawn from data mining, machine learning, NLP, and IR. Text mining requires preprocessing document collections (text categorization, information extraction, term extraction), storage of the intermediate representations, analysis of these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the algorithms and methods used to build text mining systems. The tutorial will cover the state of the art in this rapidly growing area of research, including recent advances in unsupervised methods for extracting facts from text and methods used for web-scale mining. We will also present several real-world applications of text mining. Special emphasis will be given to lessons learned from years of experience in developing real-world text mining systems, including recent advances in sentiment analysis and how to handle user-generated text such as blogs and user reviews.
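The preprocessing stage described above can be sketched minimally: tokenize a tiny "document collection", drop stopwords, and extract the most frequent terms as an intermediate representation for later analysis. This is a toy using only the standard library; real systems layer categorization and information extraction on top.

```python
# Toy term extraction: tokenize, filter stopwords, count frequencies.
import re
from collections import Counter

docs = ["Text mining extracts facts from text.",
        "Web-scale mining requires machine learning."]
stopwords = {"from", "the", "a", "of"}

def extract_terms(documents, top_n=3):
    """Return the top_n most frequent non-stopword terms across documents."""
    tokens = []
    for doc in documents:
        tokens += [t for t in re.findall(r"[a-z]+", doc.lower())
                   if t not in stopwords]
    return Counter(tokens).most_common(top_n)

print(extract_terms(docs))  # 'text' and 'mining' dominate this toy corpus
```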
Lyle H. Ungar is an Associate Professor of Computer and Information Science (CIS) at the University of Pennsylvania. He also holds appointments in several other departments at Penn in the Schools of Engineering and Applied Science, Business (Wharton), and Medicine. Dr. Ungar received a B.S. from Stanford University and a Ph.D. from M.I.T. He directed Penn's Executive Masters of Technology Management (EMTM) Program for a decade, and is currently Associate Director of the Penn Center for BioInformatics (PCBI). He has published over 100 articles and holds eight patents. His current research focuses on developing scalable machine learning methods for data mining and text mining.
Ronen Feldman is an Associate Professor of Information Systems at the Business School of the Hebrew University in Jerusalem. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University and his Ph.D. in Computer Science from Cornell University in NY. He is the author of the book "The Text Mining Handbook" published by Cambridge University Press in 2007.
These slides were presented as part of a W3C tutorial at the CSHALS 2010 conference (http://www.iscb.org/cshals2010). The slides are adapted from a longer introduction to the Semantic Web available at http://www.slideshare.net/LeeFeigenbaum/semantic-web-landscape-2009 .
A PDF version of the slides is available at http://thefigtrees.net/lee/sw/cshals/cshals-w3c-semantic-web-tutorial.pdf .
Talk delivered at YOW! Developer Conferences in Melbourne, Brisbane and Sydney Australia on 1-9 December 2016.
Abstract: Governments collect a lot of data. Data on air quality, toxic chemicals, laws and regulations, public health, and the census are intended to be widely distributed. Some data is not for public consumption. This talk focuses on open government data — the information that is meant to be made available for benefit of policy makers, researchers, scientists, industry, community organisers, journalists and members of civil society.
We’ll cover the evolution of Linked Data, which is now being used by Google, Apple, IBM Watson, federal governments worldwide, non-profits including CSIRO and OpenPHACTS, and thousands of others worldwide.
Next we’ll delve into the evolution of the U.S. Environmental Protection Agency’s Open Data service that we implemented using Linked Data and an Open Source Data Platform. Highlights include how we connected to hundreds of billions of open data facts in the world’s largest, open chemical molecules database PubChem and DBpedia.
WHO SHOULD ATTEND
Data scientists, software engineers, data analysts, DBAs, technical leaders and anyone interested in utilising linked data and open government data.
This poster presents referencing services for linking bibliographic papers and citations with existing Linked Open Data. It aims to convert current bibliographic data in various digital library databases into semantic bibliographic data to enable research profiling and intelligent knowledge discovery.
Semantic Web Technologies: Changing Bibliographic Descriptions?
Stuart Weibel
Keynote presentation at the North Atlantic Health Science Library meeting, October 26, 2009.
An introduction to semantic web technologies and their relationship to libraries and bibliographic data.
Stuart Weibel, Senior Research Scientist, OCLC Research
Towards an Open Research Knowledge Graph
Sören Auer
The document-oriented workflows in science have reached (or already exceeded) the limits of adequacy, as highlighted, for example, by recent discussions of the increasing proliferation of scientific literature and the reproducibility crisis. It is now possible to rethink this dominant paradigm of document-centered knowledge exchange and transform it into knowledge-based information flows by representing and expressing knowledge through semantically rich, interlinked knowledge graphs. At the core of establishing knowledge-based information flows is the creation and evolution of information models for a common understanding of data and information among the various stakeholders, as well as the integration of these technologies into the infrastructure and processes of search and knowledge exchange in the research library of the future. By integrating these information models into existing and new research infrastructure services, the information structures that are currently still implicit and deeply hidden in documents can be made explicit and directly usable. This has the potential to revolutionize scientific work, because information and research results can be seamlessly interlinked with each other and better mapped to complex information needs. Research results also become directly comparable and easier to reuse.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
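The regex features listed in the abstract (alternation, ranges, capture groups, Kleene star, lookarounds) can be illustrated with Python's `re` module. This only shows plain matching; Reef's contribution is proving such a match over a *committed* document in zero knowledge, which no standard regex engine does.

```python
# Illustrating the PCRE-style features Reef supports, using Python's re.
import re

# Alternation, a character range, Kleene star, and a capture group:
m = re.search(r"(cat|dog)[a-z]*", "hotdogs")
print(m.group(1))  # 'dog'

# A lookahead: a "password strength" style check (one of Reef's motivating
# applications) requiring a digit somewhere and at least 8 characters.
strong = re.compile(r"^(?=.*\d).{8,}$")
print(bool(strong.match("s3cretpass")))  # True
print(bool(strong.match("short1")))      # False
```

In Reef's password application, a server could verify that a committed password matches such a policy pattern without ever seeing the password itself.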
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
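To give a feel for what "grid simulation" means concretely, here is a back-of-the-envelope sketch of a power-flow computation using the linear "DC" approximation on a hypothetical 3-bus network. This is illustrative pure Python, not the PowSyBl or pypowsybl API that the workshop itself demonstrates.

```python
# Toy DC power flow on a 3-bus network: solve B' * theta = P for the
# non-slack bus voltage angles, then derive line flows.

# Line susceptances (per unit) between buses; bus 0 is the slack bus.
b01, b02, b12 = 10.0, 10.0, 10.0
# Net injections at buses 1 and 2 (per unit); the slack absorbs the rest.
p1, p2 = 1.0, -0.5

# Reduced susceptance matrix over the non-slack buses 1 and 2.
B = [[b01 + b12, -b12],
     [-b12, b02 + b12]]

# Solve the 2x2 linear system by Cramer's rule.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
theta1 = (B[1][1] * p1 - B[0][1] * p2) / det
theta2 = (B[0][0] * p2 - B[1][0] * p1) / det

# Flow on a line is its susceptance times the angle difference.
flow_1_0 = b01 * (theta1 - 0.0)  # power flowing from bus 1 to the slack
print(round(flow_1_0, 3))
```

Real tools like PowSyBl solve the full nonlinear AC equations over networks with thousands of buses, but the structure (build a network model, solve for a state, read off flows) is the same.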
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphRAG is All You Need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
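The core idea these papers explore can be sketched minimally: instead of retrieving raw text chunks, retrieve a neighborhood from a knowledge graph and serialize it as grounded context for an LLM prompt. The graph, entities, and prompt below are hypothetical, and no LLM call is made; this only illustrates the retrieval step.

```python
# Toy GraphRAG-style retrieval: expand a k-hop neighborhood from the
# query entity and serialize the facts as prompt context.

graph = {  # subject -> list of (predicate, object)
    "FalkorDB": [("is_a", "graph database"), ("founded_by", "Guy Korland")],
    "Guy Korland": [("role", "CEO")],
}

def neighborhood(entity, hops=2):
    """Collect facts reachable from `entity` within `hops` edges."""
    facts, frontier = [], [entity]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for pred, obj in graph.get(node, []):
                facts.append(f"{node} {pred} {obj}")
                next_frontier.append(obj)
        frontier = next_frontier
    return facts

context = "\n".join(neighborhood("FalkorDB"))
prompt = f"Answer using only these facts:\n{context}\nQ: Who leads FalkorDB?"
print(prompt)
```

Because the context comes from explicit graph edges rather than fuzzy text similarity, the model's answer can be traced back to specific facts, which is the usual argument for graphs mitigating hallucination.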
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
Paper as a Research Object
1. Research around and about the scientific paper in the biomedical domain. Supporting Literature Based Discovery. From the paper to the data, back and forth.
Alexander Garcia, PhD.
FSU
2. 350 Years and Counting
Scientific articles have adopted electronic dissemination channels.
Scholarly communication has been complemented by the adoption of blogs, mailing lists, social networks, and other technologies.
Information remains locked up in PDFs.
3. And so we are…
Managing the publication on a postmortem basis…
The paper as an interface to the Web of Data?
The problem remains, so…
To be born semantic… why not?
4. Heading towards
A semantic document: one where human-readable knowledge is augmented to enable its interpretation by machines.
A human-interpretable document fully processable by machines.
Human interoperability and machine interoperability.
Literature Based Discovery and the paper as an interface to the WoD.
5. We all know that
Information is locked up in discrete documents, mostly PDF.
Controlled vocabularies are not always available.
Text mining depends on availability of data.
Poor metadata.
7. Literature Based Discovery
• The key idea is: putting together explicit assertions from different papers to form new implicit assertions
– PTSD and suicide
– Magnesium and migraine
– Fish oil and Raynaud's, or calcium-channel blockers
• Sophisticated access to online information
• Supplement document retrieval with:
– Information extraction
– Automatic summarization
– Question answering
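The ABC model behind these examples can be sketched as a toy (the assertions below are illustrative, simplified versions of Swanson's fish-oil/Raynaud's reasoning, not extracted facts): if one body of papers asserts A→B and a disjoint body asserts B→C, the A→C link is a candidate implicit assertion.

```python
# Toy ABC-model literature-based discovery: join explicit assertions
# from two disjoint corpora on their shared middle term B.

corpus_1 = [("fish oil", "reduces", "blood viscosity")]
corpus_2 = [("blood viscosity", "aggravates", "Raynaud's syndrome")]

def implicit_links(assertions_1, assertions_2):
    """Return (A, B, C) candidates where A->B and B->C share the term B."""
    return [(a, b, c)
            for a, _, b in assertions_1
            for b2, _, c in assertions_2 if b == b2]

print(implicit_links(corpus_1, corpus_2))
```

The output is a hypothesis to test, not a proven fact; real systems rank thousands of such candidates and rely on semantic predications rather than raw string matching.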
8. The White Paper Challenge
Search and Retrieval: how to get relevant documents faster
Info Sources
Query Builders
Notifications
How to “scan” the document in a meaningful manner?
How to repurpose fragments of the documents?
9. Literature Discovery Process
Search: usually string-based search mechanisms; little cognitive support
Retrieval: a simple list of DB entries; little cognitive support
Interacting with the document: straight into the PDF; zero cognitive support
Data availability
15. Challenge: Language Complexity
"The average age of participants (approximately 63 years), the predominance of women, and the high prevalence of comorbid conditions (for example, hypertension and cardiovascular disease) reflect typical characteristics of patients with osteoarthritis."
Language encodes a lot of information
17. Semantic Predications
"The average age of participants (approximately 63 years), the predominance of women, and the high prevalence of comorbid conditions (for example, hypertension and cardiovascular disease) reflect typical characteristics of patients with osteoarthritis."
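A semantic predication reduces a sentence like the one above to subject-PREDICATE-object triples, in the style of tools such as SemRep. A toy sketch, with the triples hand-written for this one example rather than extracted automatically:

```python
# Hand-crafted semantic predications for the osteoarthritis sentence.
# Predicate names follow the SemRep style; the triples themselves are
# illustrative, not the output of a real extractor.
predications = [
    ("hypertension", "COEXISTS_WITH", "osteoarthritis"),
    ("cardiovascular disease", "COEXISTS_WITH", "osteoarthritis"),
    ("osteoarthritis", "PROCESS_OF", "women"),
]

def objects_of(subject, predicate, triples):
    """Look up all objects asserted for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("hypertension", "COEXISTS_WITH", predications))
```

Once a sentence is in this form, the assertions become queryable and can feed the ABC-style discovery shown earlier.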
19. What is needed
Disambiguate text and tag/link concepts
Meta-analyse information at the concept level
Provide meta-analysed information
Support information-based knowledge discovery, especially new associations
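The first step, tagging and linking concepts, can be done at its simplest with a dictionary lookup from surface mentions to ontology identifiers. A minimal sketch; the term-to-identifier table is illustrative (a real system would resolve against terminologies such as UMLS or the NCBO ontologies):

```python
# Dictionary-based concept tagging: map mentions in text to concept
# identifiers so that information can be meta-analysed at the concept
# level. The CONCEPTS table is a hand-made example.
import re

CONCEPTS = {
    "hypertension": "UMLS:C0020538",
    "osteoarthritis": "UMLS:C0029408",
    "cardiovascular disease": "UMLS:C0007222",
}

def tag_concepts(text):
    """Return sorted (mention, concept_id) pairs found in the text."""
    hits = []
    lowered = text.lower()
    for term, cid in CONCEPTS.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, cid))
    return sorted(hits)

sentence = ("The high prevalence of comorbid conditions, for example "
            "hypertension and cardiovascular disease, reflects typical "
            "characteristics of patients with osteoarthritis.")
print(tag_concepts(sentence))
```

Real taggers additionally disambiguate (the same string can denote several concepts); a plain lookup sidesteps that problem.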
20. In order to support Literature-Based Discovery
Ontologies, communities, annotation, machine-readable documents
In a nutshell… documents as interfaces to the Web of Data…
Biotea
• Machine-readable and processable documents
• Interactive documents
• Enriched metadata
• Full content management, document centric
• Social hub
Citagora
• Aggregated search
• Single entry point
• Social hub
• Citation centric
21. Biotea in a nutshell
It is a knowledge model for biomedical literature
We are semantically annotating literature with text mining and ontologies
Delivers a network of interrelated documents
Delivers a semantic infrastructure for PMC and scientific literature in general
23. RDF4PMC, some results
Metadata + content + references make possible:
• How similar are two articles, based on authors, keywords, abstracts, and ontological terms?
• What articles use this reference in a section with the title "Results"?
Annotations make possible:
• How similar are two articles, based on semantic distance?
• Which annotation co-occurs most with this "YYY" annotation?
• Which articles include "TERM" but not this other "TERM"?
Some numbers for article PMC126253, "Computational method for reducing variance with Affymetrix microarrays":
• NCBO: 407 annotations, 633 topics
• Whatizit: 14 annotations, 203 topics
Delivering: the platform that makes it possible to build interactive environments for semantic publications
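The co-occurrence question above ("which annotation co-occurs most with annotation YYY?") can be sketched with articles modelled as plain sets of annotation terms; in Biotea the same query would run over the RDF via SPARQL. The mini-corpus here is made up for the example:

```python
# Annotation co-occurrence across a (toy) annotated corpus: count how
# often other annotations appear in the same article as a given term.
from collections import Counter

articles = {
    "PMC-A": {"catalase", "oxidative stress", "microarray"},
    "PMC-B": {"catalase", "oxidative stress"},
    "PMC-C": {"catalase", "microarray"},
}

def top_cooccurring(term, corpus):
    """Count annotations appearing in the same articles as `term`."""
    counts = Counter()
    for annotations in corpus.values():
        if term in annotations:
            counts.update(annotations - {term})
    return counts.most_common()

print(top_cooccurring("catalase", articles))
# "oxidative stress" and "microarray" each co-occur twice here
```

The same counting over millions of PMC articles is what turns annotations into a discovery signal rather than mere tags.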
24. A dashboard for semantic biopublications
[Diagram: a semantically enriched publication (metadata + content + references), automatically annotated as RDF and queried via SPARQL; example query term: "Catalase"]
25. Cloud of Bioannotations
[Screenshot: an annotation cloud (term + number of bioentities), with title and authors, links, abstract, and the paragraphs containing the annotation selected by the user]
27. Citagora
An Agora for Citations
From citations, to the Social Web, to an interactive document
Aggregating activity from social networks, reference management systems, blogs, publishers, etc.
Aggregating sources from Google Scholar, Microsoft Academic, Zotero, Mendeley, etc.
28. What is MSRC.CITAGORA?
A corpus of documents for one specific domain
• BibRef centric
• Enrichment mechanism
• Based on heterogeneous data sources; an aggregator
– Heterogeneous BibRef data sources
– Heterogeneous PDF layouts
• Value in
– Enriching semantics around the BibRef
– Aggregating social activity around the BibRef
– Social activity as part of the BibRef
– Making use of the content without exposing it
Data for, and compatible with, the Web of Data
29. MSRC.CITAGORA
Data sources
• Users uploading ENL files that have the corresponding PDF for each record
• Results from harvesting Mendeley, Zotero, the Elsevier API, the Microsoft Academic API, etc.
Extracting meaningful information by processing the data source
• The list of references this document cites_to
• A meaningful bag of words
• Authors, affiliations, emails
Outcome: RDF
• A BibRef for the original PDF
• Annotations for the whole document
• Text
• The list of cites_to
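The "meaningful bag of words" step of this pipeline can be sketched as plain frequency counting over the document text after dropping stop words. The stop-word list and sample text below are illustrative only:

```python
# Extract a bag of words describing a document: tokenize, drop stop
# words, keep the most frequent terms as descriptors.
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "for", "with", "to", "is"}

def bag_of_words(text, top_n=5):
    """Return the top_n most frequent non-stop-word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

sample = ("Computational method for reducing variance with Affymetrix "
          "microarrays. Variance estimates for microarrays improve "
          "downstream analysis of microarray experiments.")
print(bag_of_words(sample))
```

A production pipeline would stem or lemmatize ("microarray" vs "microarrays") and weight terms, e.g. with TF-IDF, rather than raw counts.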
31. Moving Towards OPEN.CITAGORA
Let's build the largest OPEN repository of everything around a standardized, interoperable bibliographic reference
[Diagram: a BibRef living in the Web of Data, with has_part links to Annotations, References, Content, and the PDF]
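The has_part structure sketched on this slide can be written out directly as N-Triples. The namespace and URI pattern below are made up for the example; Biotea itself reuses established vocabularies such as Dublin Core and BIBO:

```python
# Emit the has_part skeleton of a bibliographic reference as
# N-Triples. BASE and the URI layout are hypothetical.
BASE = "http://example.org/citagora/"

def bibref_triples(ref_id,
                   parts=("Annotations", "References", "Content", "PDF")):
    """One has_part triple per component of the BibRef."""
    subject = f"<{BASE}bibref/{ref_id}>"
    return [
        f"{subject} <{BASE}has_part> <{BASE}bibref/{ref_id}/{p.lower()}> ."
        for p in parts
    ]

for triple in bibref_triples("PMC126253"):
    print(triple)
```

Minting one URI per reference, as here, is what gives "all in one place, one URI" its meaning on the next slides.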
33. Semantic Enrichment
Jailbreaking the PDF: content is locked up
From the PDF we extract:
• Meaningful text: content as text; a bag of words describing the content
• Citations: this paper cites_to …
• Authors: this paper has_authors …
• Title, DOI, etc.
[Diagram: BibRef has_part Annotations, PDF, Content, and References]
34. Semantic Enrichment
Jailbreaking the PDF: content is locked up
• Heterogeneous formats
• Diversity in APIs for collecting BibRefs
• Poor in descriptors anchored in the content
• Not just about the PDF
• Lousy metadata
Standardization, all in one place, one URI, etc.
[Diagram: BibRef has_part Annotations, PDF, References, and Content]
38. Translational Research
How is MSRC contributing to translational research in clinical psychology?
Data standards
Semantic infrastructure
Bridging the gap between documents and data repositories
43. We have learned so far
Born-semantic publishing makes the semantics useful to the authors themselves, since they are present in the publication process from the start. To add value for readers and for computational consumption, these semantics must then be "preserved" throughout the publication process; so we need to address the publication process to achieve this goal.
44. Acknowledgments
Special Thanks to John Gomez, John Patterson, Dietrich
Rebholz-Schuhmann, Robert Morris, Oscar Corcho, Diane
Leiva and Greg Riccardi
Editor's Notes
From paper-based journals to purely electronic formats.
The next step consisted of emphasizing the importance of adding semantics to the data or annotations made in different kinds of experimental procedures or laboratory techniques. In the notebooks analysed, annotations of different experimental procedures were found, the most recurrent being DNA extraction, PCR (including some of its variants), and electrophoresis in agarose and polyacrylamide gels. The annotations found relate to materials and methods and to experimental design, with data from some form of analysis of results also observed. Based on this rhetorical structure of the laboratory notebooks, the construction of two ontologies was planned: one providing the metadata that self-describes the laboratory notebook and an experimental activity, and another containing terms related to laboratory processes commonly used in plant molecular biology. The purpose of these ontologies is to support competency questions such as "On what dates was DNA extracted from the rice materials used in the project titled 'identification of molecular markers associated with yield QTLs in rice'?" and "In which research projects did OXG participate between 2005 and 2009?"