The document discusses the use of digital identifiers to identify bioresources. It provides background on digital identifiers and their importance for tracking use and impact. It discusses use cases for identifying different types of resources like datasets, databases, and projects. Key challenges include getting authors to use appropriate identifiers and a lack of solutions for some resource types like physical samples. Next steps include recommendations for identifier use and exploring identification schemes for clinical studies and trials.
The document introduces Sean Bechhofer and provides his contact information, including that he is from the University of Manchester, his email address, Twitter handle, and blog. It then lists several publications and projects related to reproducible and open research, including myExperiment and Research Objects, with the goal of facilitating exchange and reuse of digital knowledge. Key challenges discussed are how to move beyond linear paper publications to frameworks that better support reuse of digital assets like workflows and datasets.
Co-presented for the course INLS 720: Metadata Architectures and Applications at UNC SILS. Subsequently, we also presented at the February 2013 meeting of the UNC Scholarly Communications Working Group. This presentation covered copyright in the context of metadata re-use, plus two case studies (one examining Duke University Press and the other examining open bibliographic data).
This document summarizes Rob Grim's presentation on e-Science, research data, and the role of libraries. It discusses the Open Data Foundation's work in promoting metadata standards like DDI and SDMX. It also outlines the research data lifecycle and how metadata management can help libraries support research through services like data registration, archiving, discovery and access. Finally, it provides examples of how Tilburg University library supports research data through services aligned with data availability, discovery, access and delivery.
University of Bath Research Data Management training for researchers - Jez Cope
Slides from a workshop on Research Data Management for research staff and students at the University of Bath.
Part of the Research360 project (http://blogs.bath.ac.uk/research360).
Authors: Cathy Pink and Jez Cope, University of Bath
Getting started in digital preservation - Sarah Jones
Digital preservation requires active management of digital information over time to ensure ongoing accessibility. It involves addressing issues like file formats becoming obsolete, storage media degradation, and a lack of descriptive information. The document provides an overview of digital preservation principles and practical initial steps organizations can take to get started, such as focusing on file formats and metadata collection, and establishing basic processes for storage, backup, and access.
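The "basic processes" mentioned above usually start with fixity checking: record a checksum for every file at ingest, then re-verify on a schedule to catch silent corruption or loss. A minimal sketch in Python (file layout and function names are illustrative, not taken from any particular preservation tool):

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(folder):
    """Record a checksum for every file under `folder` at ingest time."""
    return {str(p): sha256_of(p) for p in Path(folder).rglob("*") if p.is_file()}

def verify_manifest(manifest):
    """Re-hash each file and report any that changed or disappeared."""
    problems = {}
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file():
            problems[path] = "missing"
        elif sha256_of(p) != expected:
            problems[path] = "checksum mismatch"
    return problems
```

Running `verify_manifest` periodically turns preservation from a one-off deposit into the ongoing activity the document describes.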
Tutorial on Hybrid Data Infrastructures: D4Science as a case study - Blue BRIDGE
An e-Infrastructure is a distributed network of service nodes, residing on multiple sites and managed by one or more organizations, that allows scientists at distant places to collaborate. E-Infrastructures may offer a multiplicity of facilities as-a-service, supporting data sharing and usage at different levels of abstraction, and can have different implementations (Andronico et al. 2011). A major distinction is between (i) Data e-Infrastructures, i.e. digital infrastructures promoting data sharing and consumption within a community of practice (e.g. MyOcean; Blanc 2008), and (ii) Computational e-Infrastructures, which support the processes required by a community of practice using Grid and Cloud computing facilities (e.g. Candela et al. 2013). A more recent type of e-Infrastructure is the Hybrid Data Infrastructure (HDI) (Candela et al. 2010), i.e. a combined Data and Computational e-Infrastructure that adopts a delivery model in which computing, storage, data and software are all made available as-a-service. HDIs support, for example, data transfer, data harmonization and data processing workflows. Hybrid Data e-Infrastructures have already been used in several European and international projects (e.g. i-Marine 2011; EuBrazil OpenBio 2011), and their exploitation is growing fast, supporting new projects and initiatives such as Parthenos, Ariadne and Descramble.
A particular HDI, named D4Science (Candela et al. 2009), has been used by communities of practice in the fields of biodiversity conservation, geothermal energy monitoring, fisheries management, and cultural heritage. This e-Infrastructure hosts models and resources from several international organizations involved in these fields. Its capabilities help scientists to access and manage data, reuse data and models, obtain results in a short time, and share those results with colleagues.
The document provides an overview of the EDINA & Data Library service at the University of Edinburgh. It discusses that EDINA is a JISC-funded National Data Centre that provides online resources for education and research, while the Data Library assists university users in discovering, accessing, using and managing research datasets. The Data Library offers consultancy services and has developed projects like Edinburgh DataShare, an institutional repository of research datasets, and the Research Data MANTRA online course on research data management.
The document provides an introduction to digital preservation, including definitions of key terms like preservation, digital preservation, and digital curation. It outlines some of the challenges of digital preservation, such as storage media issues, hardware and software dependence, conceptual problems dealing with digital objects, and issues of scale with large amounts of digital data. It then describes some common digital preservation strategies like technology preservation, technology emulation, information migration, and digital archaeology. The document emphasizes that digital preservation requires a life-cycle management approach.
ARCLib project presentation from PASIG 2016 - dp-blog-cz
A digital preservation project by a group of Czech libraries, financed by an applied research grant from the Ministry of Culture of the Czech Republic. First information.
Digital data preservation involves planning and allocating resources to ensure digital information remains accessible and usable over time. It requires policies, strategies and preservation methods to ensure future access. The document outlines 13 ways to approach digital preservation, including as an ongoing activity that avoids crises through continuous effort rather than periodic bursts; as a cooperative effort across organizations to enhance funding; and as a selection process to determine what data is worth preserving given limits on storage. The conclusion notes that while the article outlines preservation approaches, it does not explain how to achieve preservation or select essential data to retain.
Meeting the Research Data Management Challenge - Rachel Bruce, Kevin Ashley, ... - Jisc
Universities and researchers need to be able to manage research data effectively, both to fulfil research funders' requirements and ultimately to contribute to research excellence. UK universities are comparatively well advanced in what is a global challenge, but nonetheless further advances are needed in university policy, technical services and support services. This session will share best practice in research data management and information about key tools that can help to develop university solutions; it will also inform participants about the latest Jisc initiatives to help build university research data services and shared services.
Semi-automated metadata extraction in the long term - PERICLES_FP7
This presentation was delivered by Emma Tonkin (King's College London) at the Digital Preservation Coalition (DPC) event entitled 'Practical Preservation and People: a briefing about metadata', which took place at the Public Records Office of Northern Ireland, Belfast on 3 December 2015.
LIBER is a network of over 425 European research libraries that aims to promote the interests of research libraries. It plays a key role in several EU-funded projects related to digitization, open access, and digital preservation. Some of LIBER's current projects include Europeana Libraries, which will provide over 5 million digitized objects to Europeana, and APARSEN, a digital preservation network. LIBER encourages French research libraries to get involved in its activities and partner on future projects.
Slides for presentation given at the first Digital Humanities Congress held in Sheffield from 6 – 8 September 2012 with the support of the Network of Expert Centres and Centernet.
URL http://www.shef.ac.uk/hri/dhc2012
Today libraries face new and growing challenges in enabling access to information. The growing amount of information, combined with new non-textual media types, demands constant adaptation of established workflows and standard definitions. Knowledge, as published through scientific literature, is the last step in a process that originates from primary scientific data: these data are analysed, synthesised and interpreted, and the outcome is published as a scientific article. Access to the original data as the foundation of knowledge has become an important issue throughout the world, and different projects have started to find solutions.
Nevertheless, science itself is international: scientists are involved in global unions and projects, they share their scientific information with colleagues all over the world, and they use national as well as foreign information providers.
When facing the challenge of increasing access to research data, one possible approach is global cooperation for data access via national representatives:
* a global cooperation, because scientists work globally, scientific data are created and accessed globally.
* with national representatives, because most scientists are embedded in their national funding structures and research organisations.
DataCite was officially launched on 1 December 2009 in London and has 12 information institutions and libraries from nine countries as members. By assigning DOI names to datasets, data becomes citable and can easily be linked to from scientific publications.
Data integration with text is an important aspect of scientific collaboration. DataCite takes global leadership in promoting the use of persistent identifiers for datasets, to satisfy the needs of scientists. Through its members, it establishes and promotes common methods, best practices, and guidance. The member organisations work independently with data centres and other holders of research datasets in their own domains. Based on the work of the German National Library of Science and Technology (TIB), the first DOI registration agency for data, DataCite has registered over 850,000 research objects with DOI names, thus starting to bridge the gap between data centres, publishers and libraries.
This presentation will introduce the work of DataCite and give examples of how scientific data can be included in library catalogues and linked to from scholarly publications.
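As an illustration of how DOI registration makes data citable, a small sketch that formats dataset metadata in the citation style DataCite recommends (Creator (PublicationYear): Title. Publisher. Identifier). All metadata values and the DOI below are hypothetical:

```python
def datacite_citation(creators, year, title, publisher, doi):
    """Format dataset metadata in the DataCite-recommended citation style:
    Creator (PublicationYear): Title. Publisher. Identifier."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. https://doi.org/{doi}"

# Hypothetical example record:
citation = datacite_citation(
    ["Miller, A.", "Schmidt, B."],
    2010,
    "Ocean Temperature Profiles",
    "PANGAEA",
    "10.1594/EXAMPLE",
)
```

A string like this is what a library catalogue or journal reference list would display, with the `https://doi.org/...` part serving as the stable, resolvable link back to the dataset.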
Here are the results of the dotmocracy voting:
- "Libraries are the best departments at universities to take on research data archiving." Received the most dots.
- "High cost research facilities should be obliged to share (and preserve) their data." Received the second most dots.
- "Each dataset should also include the data in its rawest form." Received the third most dots.
Presentation of current challenges in upgrading the infrastructure for access and preservation of social science research data, and the workflow in the Slovene social science data archive.
A North Carolina Connecting to Collections (C2C) workshop co-taught by Audra Eagle Yun (WFU), Nicholas Graham (UNC), and Lisa Gregory (State Archives of NC). This workshop took place on June 13, 2011 in Wilson, NC.
1) Linked data is a set of best practices for publishing structured data on the web so that both humans and machines can access and link related data across different sources. It realizes Tim Berners-Lee's vision of a Semantic Web.
2) The key principles of linked data are using URIs to identify things, providing HTTP URIs so that URIs can be looked up, and including links to other URIs to allow for discovery of related data on the web.
3) By following these principles, data sources on the web have been connected into a large Web of Data, with over 31 billion RDF triples organized into different domains such as media, geography, life sciences, and libraries. This enables new applications for data integration and reuse.
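The principles above can be sketched with plain (subject, predicate, object) triples. The URIs below are invented examples, and a real deployment would serve RDF over HTTP rather than query an in-memory list, but the link-following idea is the same:

```python
# A minimal sketch of the Linked Data principles using plain Python tuples
# as (subject, predicate, object) triples. All URIs are hypothetical examples.

TRIPLES = [
    ("http://example.org/book/1", "http://purl.org/dc/terms/title", "Linked Data Basics"),
    ("http://example.org/book/1", "http://purl.org/dc/terms/creator", "http://example.org/person/42"),
    ("http://example.org/person/42", "http://xmlns.com/foaf/0.1/name", "Ada Example"),
]

def describe(uri, triples):
    """Return every (predicate, object) pair for a subject URI --
    roughly what dereferencing an HTTP URI should yield."""
    return [(p, o) for s, p, o in triples if s == uri]

def follow_links(uri, triples):
    """Discover related resources by following object URIs (the fourth
    Linked Data principle: include links to other URIs)."""
    return [o for _, o in describe(uri, triples) if o.startswith("http://")]
```

Following the creator link from the book to the person record is the in-miniature version of how the Web of Data connects sources across domains.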
The document discusses requirements for data management plans from the National Science Foundation. It notes that as of January 2011, NSF will require a data management plan for all new grant proposals as well as existing grants. The plan must address what data will be collected and how it will be organized, preserved, shared, and accessed. It emphasizes the importance of effective data management for facilitating research by both the principal investigators and other researchers. The document provides guidance on developing a data management plan that meets NSF's criteria and effectively manages research data.
The document discusses Gudmundur Thorisson's work with ORCID and JISC MRD projects. ORCID is working to create a global registry of researcher identifiers to help disambiguate author names and attribute contributions. This will help link researchers to their work more accurately. The registry will be open, free for researchers to use, and follow open principles. JISC MRD projects could benefit from ORCID's efforts to better attribute researchers and incentivize data sharing.
TNC2012 Federated and scholarly identity - match made in heaven? - Gudmundur Thorisson
This document discusses federated identity and scholarly identity. It provides an overview of scholarly identity and challenges related to name ambiguity and fragmented online identities. It then describes the Open Researcher & Contributor ID (ORCID) initiative, which aims to provide unique identifiers for researchers and link them unambiguously to their works. ORCID currently has over 300 participating organizations and is working to support the creation of a clear record of scholarly contributions through unique identifiers. Examples of how ORCID could enable knowledge discovery by linking contributors to their works are also provided.
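One concrete detail of ORCID iDs worth illustrating: the final character is a check character computed with ISO 7064 MOD 11-2 over the first 15 digits, which lets software catch mistyped iDs before any registry lookup. A sketch of that published algorithm (the valid iD below is the sample iD used in ORCID's own documentation):

```python
def orcid_checksum(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first
    15 digits of an ORCID iD, per ORCID's published algorithm."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid):
    """Check the format and checksum of an iD like '0000-0002-1825-0097'."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_checksum(digits[:15]) == digits[15]
```

A registry of unambiguous contributor identifiers only works if the identifiers entered into systems are themselves valid, so this kind of client-side check is a small but practical part of the disambiguation story.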
We took a journey to visit AnnaMaria. The trip involved traveling by car for several hours to reach our destination. We arrived safely and were happy to see AnnaMaria.
This document proposes collaborating with the BioDBCore initiative to standardize the registration and description of biological databases. It identifies challenges in uniquely identifying databases due to unstable URLs. The proposal suggests adopting the MIRIAM registry's persistent identifiers to decouple identification from location. Benefits include globally identifying life science databases, improved discovery of relevant resources, and potential for BioDBCore to evolve into a database publishing platform. Open questions remain regarding technical details and integrating existing database lists.
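The decoupling of identification from location can be sketched as a tiny resolver in the style of the MIRIAM registry and its identifiers.org resolver: an identifier is a (namespace, accession) pair validated against a per-namespace pattern, and the resolver URL stays stable even if the underlying database changes hosts. The namespace patterns below are simplified illustrations, not the registry's authoritative entries:

```python
import re

# Illustrative, simplified namespace patterns -- a real registry entry
# defines the authoritative regex for each database's accessions.
NAMESPACES = {
    "uniprot": r"^[A-Z][0-9][A-Z0-9]{3}[0-9]$",  # e.g. P12345 (simplified)
    "pubmed":  r"^\d+$",
}

def resolve(namespace, accession):
    """Validate an accession against its namespace pattern and return a
    resolver URL that is independent of where the database is hosted."""
    pattern = NAMESPACES.get(namespace)
    if pattern is None or not re.match(pattern, accession):
        raise ValueError(f"unrecognised identifier {namespace}:{accession}")
    return f"https://identifiers.org/{namespace}/{accession}"
```

Because publications cite the resolver URL rather than a database's current address, the unstable-URL problem the proposal identifies is pushed down to a single registry record per database.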
GEN2PHEN GAM8 meeting Leiden - Update on ORCID and other ID developments - Gudmundur Thorisson
This document summarizes updates on identity initiatives including ORCID and contributions tracking tools for Drupal websites. ORCID is developing an API to allow unique identification of scholarly authors and tracking of author-publication links. An IRISC workshop discussed challenges around unambiguous author identification and opportunities for ORCID and identity federations to collaborate. The document also describes plans to develop a Drupal module to enhance tracking of content contributions and link local user accounts to ORCID profiles.
This document describes ODIN, a demonstration project linking ORCID identifiers and DataCite identifiers. The project aims to connect researchers and datasets via persistent identifiers. It is a two-year EC-funded project with seven partners. A proof-of-concept tool was developed that allows researchers to claim datasets in their ORCID profile by searching and linking from CrossRef and DataCite metadata. The tool demonstrates prospective linking of ORCID iDs in data workflows as well as retrospective claiming of published datasets.
BioMed Central is a large open access publisher that is committed to open data initiatives. They have implemented several solutions to promote open data practices, including data journals, an open data award, and enabling data citation. They also work to integrate data hosting and deposition, address data licensing issues, and provide guidance on best practices. Future goals include adding more value to text and data mining applications and building business models around open data.
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... - Sarah Anna Stewart
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
INNOVATION AND RESEARCH (Digital Library Information Access) - Libcorpio
Innovation and research, Digital Library Information Access, LIS Education, Library and Information Science, LIS Studies, Information Management, Education and Learning, Library science, Information science, Digital Libraries, Research on Digital Libraries, DL, Innovation in libraries and publishing, Areas of Research for DL, Information Discovery, Collection Management and Preservation, Interoperability, Economic, Social and Legal Issues, Core Topics In Digital Libraries, DL Research Around The World
OpenAIRE and Eudat services and tools to support FAIR DMP implementation Research Data Alliance
The document provides an overview of the Open Research Data Pilot, the data management plan, and OPENAIRE tools and services to support implementation of FAIR data management plans. It discusses the aims of the Open Research Data Pilot, which Horizon 2020 projects are required to participate, and the types of data that must be deposited. It also covers topics like creating a data management plan, selecting a repository, making data FAIR, and OPENAIRE support resources like briefing papers, webinars, and the Zenodo repository.
OpenAIRE and Eudat services and tools to support FAIR DMP implementation Research Data Alliance
The document provides an overview of the Open Research Data Pilot, the data management plan, and OPENAIRE tools and services to support implementation of FAIR data management plans. It discusses the aims of the Open Research Data Pilot, which Horizon 2020 projects are required to participate, and the types of data that must be deposited. It also covers topics like creating a data management plan, selecting a repository, making data FAIR, and OPENAIRE support resources like briefing papers, webinars, and the Zenodo repository.
The document discusses creating a permanent online and offline WASH information repository to address common problems like broken links, removed documents, and servers going down. It proposes using a repository with persistent identifiers (DOIs) to prove added value through improved search efficiency, avoiding duplicating existing information, and user services. Examples of existing repositories are provided. Options discussed include submitting information to existing institutional repositories or creating a dedicated shared WASH repository. The benefits of using Digital Object Identifiers (DOIs) to provide actionable, interoperable links are also covered.
Data Repositories: Recommendation, Certification and Models for Cost RecoveryAnita de Waard
Talk at NITRD Workshop "Measuring the Impact of Digital Repositories" February 28 – March 1, 2017 https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
Persistent Identifiers (PiDs) for research – why we have them, why there are so many PiD systems, how they work looking at a few examples (Handles, DOIs, ORCIDs), how to choose one, can PiD systems fail and what’s happening in the international PiD community
Panel presentation given at: Policy and Technology for e-Science, ESOF (Euroscience Open Forum) Satellite Event, Institut d\'Estudis Catalans, Barcelona, Spain, 16-17 July 2008
February 18 2015 NISO Virtual Conference Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Learning to Curate Research Data
Jennifer Doty, Research Data Librarian, Emory Center for Digital Scholarship, Emory University, Robert W. Woodruff Library
Creating a sustainable business model for a digital repository: the Dryad exp...ASIS&T
Creating a sustainable business model for a digital repository: the Dryad experience
Peggy Schaeffer
Datadryad.org
Presentation at Research Data Access & Preservation Summit
22 March 2012
Researchers require infrastructures that ensure a maximum of accessibility, stability and reliability to facilitate working with and sharing of research data. Such infrastructures are being increasingly summarised under the term Research Data Repositories (RDR). The project re3data.org – Registry of Research Data Repositories – began to index research data repositories in 2012 and offers researchers, funding organisations, libraries and publishers an overview of the heterogeneous research data repository landscape. In December 2014 re3data.org listed more than 1,030 research data repositories, which are described in detail using the re3data.org schema (http://dx.doi.org/10.2312/re3.003). Information icons help researchers to identify easily an adequate repository for the storage and reuse of their data. This talk describes the heterogeneous RDR landscape and presents a typology of institutional, disciplinary, multidisciplinary and project-specific RDR. Further, it outlines the features of re3data. org and it shows current developments for integration into data management planning tools and other services.
By the end of 2015 re3data.org and Databib (Purdue University, USA) will merge their services, which will then be managed under the auspices of DataCite. The aim of this merger is to reduce duplication of effort and to serve the research community better with a single, sustainable registry of research data repositories. The talk will present this organisational development as a best practice example for the development of international research information services.
- Persistent identifiers (PIDs) play a key role in discoverability, accessibility, and reproducibility of research by providing long-lasting references to digital resources like publications, data, software, and people.
- There are many PID systems that vary in purpose, governance, metadata collected, and other factors such as Handles, DOIs, and ORCIDs. DOIs are most widely used for research data.
- When choosing a PID, factors to consider include purpose, scope, underlying technology, governance, and trustworthiness to ensure the PID remains long-lasting. It is important that PID systems and their social infrastructure are maintained to avoid failures.
General introduction to Open Data Policies H2020, influence of OD policies on...Nancy Pontika
This document provides an overview of open data policies in Horizon 2020 (H2020) research projects. It discusses how H2020 mandates open access to peer-reviewed publications and research data generated by projects. Projects participating in the H2020 Open Research Data Pilot are required to make their data publicly available by depositing it in an open research data repository. Exceptions can be made if openly sharing the data would jeopardize commercialization, privacy, or the project's main goals. The document also outlines licensing options, metadata standards, and resources like Zenodo that can help researchers comply with H2020 open data requirements.
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...OpenAIRE
OpenAIRE Interoperability Workshop (8 Feb. 2013).
DataCite – Bridging the gap and helping to find, access and reuse data – Herbert Gruttemeier, INIST-CNRS
Similar to BRIF workshop Toulouse 2012 Digital IDs subgroup (20)
ODIN 1st year Conference Oct 2013 Interoperability: connecting identifiersGudmundur Thorisson
This document summarizes a presentation about connecting identifiers like ORCID and DOIs to link researchers and their works. It describes prototypes created by the ODIN project, including a DataCite2ORCID tool that allows users to search DataCite metadata, find their works, and add them to their ORCID profile with a click. The presentation discusses challenges in linking heterogeneous metadata and next steps to capture contributor-work relationships and align with community standards.
The document discusses the Open Researcher & Contributor ID (ORCID) initiative. ORCID aims to solve the problem of ambiguous author attribution in scholarly works by assigning unique identifiers to individual researchers. It outlines how ambiguous names and the increasing number of authors per work have broken the current scholarly attribution system. ORCID launched in 2009 with support from research institutions, publishers, and organizations to create a central registry of researcher profiles linked to contributions. The document promotes the benefits of ORCID for reliable author identification and attribution across the scholarly community.
BRIF workshop Toulouse 2012 ORCID intro and status updateGudmundur Thorisson
This document discusses ORCID (Open Researcher and Contributor ID), an organization that aims to solve the problem of name ambiguity in scholarly research by assigning unique identifiers to individual researchers. ORCID has recently launched a live service where researchers can register for a free ORCID iD and begin managing their profile and research contributions. The document outlines several ways ORCID identifiers could be integrated by research institutions, publishers, and other organizations to streamline author attribution and research management processes.
GEN2PHEN GAM9 Toulouse - Launching the ORCID system, what do we do now?Gudmundur Thorisson
This document summarizes the launch and status of the ORCID system for uniquely identifying academic authors. It notes that the ORCID service is now live but still has some bugs and missing features. It encourages researchers to register for an ORCID identifier and integrators like publishers and organizations to begin using the public and members APIs to integrate ORCID into their systems. Finally, it discusses challenges around encouraging broader adoption, including by smaller organizations, and efforts with the ORCID and DataCite Interoperability Network project.
This document discusses open access to scientific research data. It notes that scientific research is increasingly data-driven and large-scale, especially in fields like high-energy physics, astronomy, and biology. However, inadequate access to research data is a problem, limiting opportunities to reuse data and validate or build upon past findings. The document examines some incentive-based approaches and key developments related to improving data sharing. It provides examples of large-scale data generation projects and challenges around managing and analyzing big data. Overall, the document argues that unrestricted sharing of scientific data deposited in the public domain could accelerate research and advance knowledge.
VIVO conference Aug 2011: The VIVO platform and ORCID in the scholarly identi...Gudmundur Thorisson
A major challenge facing VIVO is the retrieval of published works associated with specific authors from participating institutions, and automated disambiguation & identification of authors and scholarly works. VIVO thus shares many of the same goals as the Open Researcher and Contributor ID not-for-profit organization (ORCID: http://www.orcid.org). ORCID is working to solve the long-standing name ambiguity problem in scholarly communication globally, not only for researchers affiliated with academic institutions, but for contributors to scholarly works of all kinds. The aim of this mini-grant collaborative project is to explore how VIVO and ORCID could interact in the scholarly identity ecosystem, by way of small-scale implementation work and technology evaluation&review. The presentation will provide a brief introduction to ORCID and a background to the project, summarize the technical development undertaken thus far and outline the work remaining, and discuss some possilities for future work beyond this specific short-term project.
ORCID participant meeting May 2011: The digital scholar, identity on the Web ...Gudmundur Thorisson
The document discusses Gudmundur Thorisson's involvement with ORCID and related projects. It describes ongoing and planned genetic research data publication projects that incorporate ORCID to help address challenges around name ambiguity and attribution. Specifically, it outlines projects using ORCID to provide publication credit and unique identifiers for data deposits in Cafe Variome and nanopublications in GWAS Central. It also discusses how ORCID could help aggregate a digital scholar's various online identities and contributions across publications, data, code, and other research objects.
Data Citation Principles Harvard May 2011: ORCID and data publication - Ident...Gudmundur Thorisson
The document discusses integrating ORCID researcher identifiers with data publication to provide incentives for data sharing. It describes two of the author's data publication projects: a disease genetics data project and a project called Cafe Variome that facilitates the exchange of genetic data between diagnostic laboratories and databases. The author argues that treating data as publications that are cited and attributed to their creators, such as through assigning DOIs and linking to ORCID IDs, can help address challenges around data sharing by incentivizing researchers.
sameAs London May 2011: The digital scholar, identity on the Web and ORCIDGudmundur Thorisson
The document discusses the challenges of identity fragmentation for digital scholars and how ORCID aims to address this issue. ORCID seeks to provide a single global registry of researcher identifiers that can be used to attribute contributions across publications, datasets, software, and other research outputs. This would help address problems like a lack of incentives for data sharing by allowing all contributions to be properly attributed and credited. The document outlines several potential use cases for how ORCID could aggregate different aspects of a researcher's identity and online presence.
The document discusses two initiatives - Cafe RouGE and ORCID - for improving data sharing and attribution for genetic research data. Cafe RouGE is a central clearinghouse that assigns DOIs to genetic variation data submitted by diagnostic laboratories to facilitate sharing and tracking data usage. ORCID seeks to address challenges in attributing work to contributors by providing a global registry of disambiguated IDs for researchers. The initiatives aim to improve data publication, citation and credit for data submitters.
G.A. Thorisson presents on the collaborative project between VIVO and ORCID to address challenges in author identification and attribution. The document discusses problems with name ambiguity and the need for unique researcher identifiers. ORCID aims to assign persistent identifiers to individual researchers to disambiguate names and track author contributions. The collaborative project between VIVO and ORCID involves evaluating how the two systems can interact technically by identifying overlaps in capabilities, reusing software components, and developing extensions to better integrate researcher profiles and publication data.
Identity in research data publication - meeting with SageCite people march2011Gudmundur Thorisson
The document discusses the problem of non-unique author names in scholarly literature. Approximately two-thirds of the 6 million authors in MEDLINE have names that are ambiguous. It introduces ORCID as a solution to provide unique identifiers for authors and contributors to automatically disambiguate names and accurately attribute publications. ORCID assigns persistent digital identifiers to individuals and links author names to research works, facilitating credit and recognition of contributions.
The document discusses incentivizing data sharing by treating data like publications. It proposes a system where researchers can publish datasets online, receive digital object identifiers (DOIs) for datasets, and have their ORCID researcher identifiers linked to the DOIs. This would allow researchers to be unambiguously attributed to the datasets they generate and provide metrics like the number of times their datasets are cited, incentivizing data sharing similar to how the current publication system works.
1. BRIF Digital identifiers subgroup
Gudmundur A. Thorisson <gt50@leicester.ac.uk> GEN2PHEN / University of Leicester
Pierre-Antoine Gourraud <pierreantoine.gourraud@ucsf.edu> UCSF
-- Overview --
‣Brief backgrounder on identification & digital identifiers
‣Use cases for bio-resource identification in BRIF
‣Digital resources: datasets, databases (Mummi)
‣Non-digital resources: projects, studies, cohorts [...] (Pierre)
‣Conclusions and next steps
This work is published under the Creative Commons Attribution license
(CC BY: http://creativecommons.org/licenses/by/3.0/) which means that
it can be freely copied, redistributed and adapted, as long as proper
attribution is given.
Monday, 22 October 12
BRIF workshop, Toulouse Oct 22 2012
4. BRIF and bio-resource identification
• The identification requirement: need to identify resources in
order to
– track use/reuse and impact
– credit those who contribute to them
• Example: biobanking projects frequently rely on...
– Project/study/cohort names
• Example: the GAZEL study in France >20 years http://www.gazel.inserm.fr
• Challenges:
– ad hoc agreements with research groups who reuse samples or data
– painstaking manual searching through literature for mentions of ‘GAZEL‘
– project names are often ambiguous in a global context
– Citations to journal publications
• Which paper to cite? Tricky to keep track of which citations are relevant to impact
• Also troublesome if there is no paper to cite (e.g. for a new study)
5. Digital identifiers - some background
• Definition: a digital identifier is a character string used to uniquely
identify i) a digital object in a computer system, or ii) a record in a
computer system which describes a non-digital object
• Persistence - once assigned, identifier MUST NOT change
• Uniqueness - global scope vs local scope
– Most ID schemes require tacit knowledge of the type of identifier to interpret
• Example: EC grant identifiers in acknowledgement statements
6. This work has received funding from the European Community's
Seventh Framework Programme (FP7/2007-2013) under grant
agreement number 200754 - the GEN2PHEN project.
7. This work has received funding
under grant
agreement number 200754
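The ambiguity shown in the redacted grant statement above is the scope problem in miniature: "grant agreement number 200754" only identifies anything once you know it is an EC FP7 grant number. Namespace-qualified (CURIE-style) compact identifiers, in the style used by identifiers.org, make that scope explicit. A minimal sketch — the prefix names follow identifiers.org conventions, but the accessions are purely illustrative:

```python
# Sketch: a locally scoped accession becomes globally unique only when
# qualified with a namespace prefix (CURIE style, as used by identifiers.org).
# The prefixes and accessions below are illustrative, not an authoritative registry.

def qualify(namespace: str, accession: str) -> str:
    """Build a compact, globally scoped identifier from a local accession."""
    return f"{namespace}:{accession}"

def resolve_url(curie: str) -> str:
    """Turn a compact identifier into a resolver URL (identifiers.org style)."""
    return f"https://identifiers.org/{curie}"

# The same bare accession can mean different things in different databases;
# the prefix carries the tacit context that a human reader would otherwise need:
print(qualify("pubmed", "12345"))              # pubmed:12345
print(resolve_url(qualify("pubmed", "12345"))) # https://identifiers.org/pubmed:12345
```

The point of the resolver URL is that the identifier becomes actionable as well as unambiguous: anyone can follow it without knowing which database minted the accession.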
8. Digital identifiers - some background
• Some problem domains require globally unique IDs
– Example: ISBNs to identify books, e.g. for copyright purposes
• Some problem domains require resolvable IDs
– Resolve = retrieve information about the thing being identified, including where
to access it (for a digital object, its location on the Internet)
– Digital Object Identifiers (DOIs) are the best-known system, but several others exist
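A DOI name has a simple surface structure — a registrant prefix always starting "10.", then a registrant-chosen suffix — and resolves through the doi.org proxy. A minimal sketch; the example used is the DOI Handbook's own DOI (10.1000/182), and no network request is made here:

```python
# Sketch: DOI name structure and resolution via the doi.org proxy.
# A DOI name is "<prefix>/<suffix>": the prefix (always starting "10.")
# identifies the registrant, and the suffix is chosen by the registrant.

def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI name into (prefix, suffix); reject obvious non-DOIs."""
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix

def resolver_url(doi: str) -> str:
    """URL at which the DOI resolves (no request is actually made here)."""
    return f"https://doi.org/{doi}"

prefix, suffix = split_doi("10.1000/182")  # the DOI Handbook's own DOI
print(prefix, suffix)                      # 10.1000 182
print(resolver_url("10.1000/182"))         # https://doi.org/10.1000/182
```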
10. Identifier use cases in BRIF
• Three broad categories of “stuff” to identify
i) Digital resources
Resources that actually “live” in computers (born-digital or digitized content):
datasets and databases
ii) Physical resources
Resources corresponding to actual physical things: samples, groups of samples,
experimental instruments, etc.
iii) Project-level and other “meta” resources
Higher-level aggregates of things, projects, organizations, consortia etc.
NB in many cases identifiers already exist for these things, but they are
not exposed to the outside world in a usable form (i.e. made resolvable,
citable, globally-unique).
11. Datasets
• Definition: a data set (or dataset) is a collection of data, often presented in
tabular form but in the bio-sciences also frequently in a multitude of
domain-specific formats, such as FASTA for biological sequences
• Data publication and data citation is a hot topic - lots of
research and infrastructure-building activity in recent years
• Emerging best practices for data citation & attribution
• Identifiers for datasets - persistent data DOIs issued via DataCite
• Little new for BRIF to add here, except to issue recommendations
– KEY POINT: infrastructure for data preservation and access is a prerequisite for any
sort of persistent bio-dataset identification scheme. Many projects don’t have this!
12. Data DOI scenario (simplified)
1. Research group registers a dataset and metadata in a suitable domain
repository (or their own repository)
2. Repository archives the dataset and assigns a DOI name to it
3. Unique DOI name is used by article authors (and others) to indicate resource
reuse (ideally via formal data citation)
4. Journal article reference listings & full-text and other sources are mined to
identify references to dataset and/or downloads
5. Dataset-level metrics calculated from collected data
e.g. - total no. citations in scholarly articles
- no. secondary citations (citations to papers which cited the original dataset)
- no. downloads in the last 2 years
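The five-step scenario above can be sketched as a toy, in-memory repository: register a dataset, assign it a DOI-like name, record citations mined from articles, and compute dataset-level metrics. Everything here (the 10.9999 prefix, the suffix pattern, the field names) is invented for illustration, not a real registration service:

```python
# Toy sketch of the data-DOI scenario: registration, DOI assignment,
# citation tracking, and dataset-level metrics. The prefix 10.9999 and
# all names here are illustrative only.

class ToyRepository:
    def __init__(self, prefix: str = "10.9999"):
        self.prefix = prefix
        self.records = {}    # doi -> dataset metadata
        self.citations = {}  # doi -> list of citing article IDs
        self._counter = 0

    def register(self, metadata: dict) -> str:
        """Steps 1-2: archive the dataset's metadata and assign a DOI name."""
        self._counter += 1
        doi = f"{self.prefix}/dataset.{self._counter}"
        self.records[doi] = metadata
        self.citations[doi] = []
        return doi

    def cite(self, doi: str, article_id: str) -> None:
        """Steps 3-4: record a citation mined from an article's references."""
        self.citations[doi].append(article_id)

    def metrics(self, doi: str) -> dict:
        """Step 5: dataset-level metrics computed from collected citations."""
        return {"total_citations": len(self.citations[doi])}

repo = ToyRepository()
doi = repo.register({"title": "Example cohort dataset", "year": 2012})
repo.cite(doi, "pmid:11111111")
repo.cite(doi, "pmid:22222222")
print(doi, repo.metrics(doi))  # 10.9999/dataset.1 {'total_citations': 2}
```

A real pipeline would replace the `cite` calls with text mining over article reference lists and full text, as in step 4, but the bookkeeping shape is the same.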
13. ORCID and DataCite Interoperability Network
• Persistent identifiers for connecting people and
datasets
• 2y EC-funded project, 7 partners in Europe + USA
• Two main proof-of-concept pilots
– Social Science data - use and citation of British Birth Cohort
Studies
• historical data, decades old, steadily being curated by lots of
different people
• high rate of reuse, often cited in papers
– High-energy physics - attribution challenges
• dealing with large no. authors on HEP papers - ‘dilution’ of the term
authorship
• Linking HEP papers to supporting datasets
http://odin-project.eu/
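ORCID iDs, the "people" half of the ODIN linkage above, carry a built-in check digit: the final character is computed over the first 15 digits with ISO 7064 MOD 11-2, with 'X' standing for the value 10. A small validator sketch, using the example iD from ORCID's own documentation (0000-0002-1825-0097):

```python
# Sketch: validate an ORCID iD's check digit (ISO 7064 MOD 11-2).
# The 16th character of an ORCID iD is a checksum over the first 15
# digits; a checksum value of 10 is written as 'X'.

def orcid_check_digit(base_digits: str) -> str:
    """Compute the MOD 11-2 check digit for the 15 base digits."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Check a hyphenated ORCID iD like 0000-0002-1825-0097."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]

print(is_valid_orcid("0000-0002-1825-0097"))  # True  (ORCID docs example iD)
print(is_valid_orcid("0000-0002-1825-0098"))  # False (corrupted check digit)
```

The checksum means a mistyped iD is usually detectable locally, before any registry lookup — useful when linking iDs to dataset DOIs in bulk.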
14. Databases
• Definition: an online database can be regarded as a collection of
data, but made accessible in such a way that facilitates using the data
to answer scientific questions, via structured querying and/or free-text
searching of the data over the Internet
• Broad range, from large-scale DNA and protein sequence
repositories to small locus-specific databases
– E.g. GenBank, UniProt, GWAS Central, Ehlers-Danlos Syndrome Variant Database
• Challenges in assessing impact & attributing curators
– Reliance on citations to the database paper, if there is one (sometimes many)
• Analyzing website traffic is another indicator - highly-accessed database =~ important
– Database URLs sometimes change
– Database name + URL often mentioned only in materials & methods, no citation
– Credit via authorship impossible if there is no database journal paper
15. BioDBCore - global catalogue of bio-db’s
• BioDBCore aims
– annotation - organize the bio-database
‘resourceome’
– discovery - e.g. which protein sequence
databases are available?
• Who’s behind it?
– International Society for Biocuration
– Resource catalogues: Bioinformatics Links,
BioSiteMaps, NAR db-issue etc
– Working group includes reps from NAR and
DATABASE journals, MIBBI, Model
organism db’s, others
• Catalogue will have persistent
identifiers for each db entry
http://www.biosharing.org/biodbcore
17. •[slot in Pierre]
18. From Patients to BioBanks and back…
• Persistent IDs for datasets & other digital resources
– Absolute need
• From BioresourceResearchIF to BioresourceXIF
– More than an IP address?
• Increased need of identification for the source of information in general
– Not only for research purposes…
– “Big data”
– Quantified self
• Blurring the border between: research data (non-CLIA), clinically approved data, consumer-centered data
20. Conclusions / next steps
• Complex landscape, lots of problems to tackle
• Key challenge will be to get authors to use the right identifiers
– education, awareness, best practices, journal guidelines etc.
– build support into tools that researchers use
• Potential outputs from BRIF subgroup, by end of GEN2PHEN
– Continue work on whitepaper on identifiers (partially drafted earlier in the year)
– Compile recommendations for authors & biobankers, for use cases where workable
solutions exist or are emerging (data DOIs, BioDBCore)
• Need some biobanker-expert help in ID subgroup!
– Esp. to look in-depth into study catalogues with established identifier schemes
• International Clinical Trials Registry Platform
• ClinicalTrials.gov
• P3G study catalogue
21. Acknowledgements
GEN2PHEN Consortium
http://www.gen2phen.org/about-gen2phen/partners
Prof Anthony J. Brookes Bioinformatics Group, Leicester
This work has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project.
Contact me!
<gt50@le.ac.uk> |<gthorisson@gmail.com>
http://www.linkedin.com/in/mummi
http://www.twitter.com/gthorisson
http://www.gthorisson.name
Published under the CC BY license (http://creativecommons.org/licenses/by/3.0/)