PREPARING DATA FOR
SHARING
The FAIR Principles
Gareth Knight
London School of Hygiene & Tropical Medicine
gareth.knight@lshtm.ac.uk
ADMIT Network Meeting
01 December 2015
FAIR Principles
Findable
•Descriptive
metadata
•Persistent
Identifiers
Accessible
•Determining
what to share
•Participant
consent and risk
management
•Access status
Interoperable
•XML standards
•Data
Documentation
Initiative
•CDISC
Reusable
•Rights and
licence models
•Permitted and
non-permitted
use
http://datafairport.org/
Make your data:
• Findable
• Accessible
• Interoperable
• Reusable
Data Sharing in the sciences
• Data sharing has always taken place in
some form
• The Enlightenment of the 17th and 18th
centuries was built upon open debate and
sharing of knowledge
• Science depends on openness and
transparency to advance
– Replicate results
– Correct errors & address bias
• Negative as well as positive findings
need to be in the public domain
“Systematic Dictionary of the Sciences, Arts, and Crafts”
Diderot & d'Alembert (1751 onwards)
Data Sharing in the News
“To make progress in science, we need to be open and share.”
Neelie Kroes (2012)
vice president of the European Commission
http://europa.eu/rapid/press-release_SPEECH-12-258_en.htm
Key Motivators
• Research / policy development
• Ensure validity
• Funder requirements
• Publisher requirements
Data reuse improves citation rate
• Studies that made data available in a
public repository received 9% more
citations than similar studies where
data was not available
• Creators tend to cite their own data for up to 2
years
• Third party use grew over time: for 100
datasets deposited in year 0,
– 40 reuse papers in PubMed in year 2
– 100 by year 4
– 150+ by year 5.
Piwowar, H.A. & Vision, T.J. (2013). Data reuse and the open data citation advantage. https://peerj.com/articles/175/
Study of 10,557 articles published between 2001 and 2009 that
collected gene expression microarray data
DATA
DISCOVERY
Is your data findable?
Discovery Metadata
• Descriptive metadata created to
describe key attributes of data:
– Title
– Creator
– Content description
• Data repositories/journals capture
and publish discovery metadata in
several formats (DC, DataCite, DDI)
• Metadata ‘harvested’ by research
data catalogues & search engines
• Metadata available to all, even if data
is not
Registry of Research Data Repositories
http://service.re3data.org
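A minimal sketch of a discovery metadata record, assuming a simple Dublin Core (DC) serialisation. The `build_dc_record` helper, the `metadata` wrapper element and the dataset values are hypothetical and for illustration only; real repositories usually generate and publish this record for you.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def build_dc_record(title, creator, description, identifier=None):
    """Return a simple Dublin Core record as an XML string."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    fields = (
        ("title", title),
        ("creator", creator),
        ("description", description),
        ("identifier", identifier),
    )
    for name, value in fields:
        if value:  # skip attributes that were not supplied
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description, for illustration only
record = build_dc_record(
    title="Example household survey 2014",
    creator="Example, A.",
    description="Anonymised responses from a hypothetical household survey.",
)
print(record)
```

The same three attributes (title, creator, content description) could equally be mapped to DataCite or DDI; DC is used here only because it is the simplest of the formats listed.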
Citing Data
• Research data are a citable resource, same as papers & books
• 44-75 days is the estimated average lifespan of web URLs
• A unique, long-term identifier is necessary to enable citation
• Many persistent ID systems developed to solve problem
– DOI, Handle, ARK, etc.
• Data citation in reports and publications
UK Data Service: Citing Data
https://www.ukdataservice.ac.uk/use-data/citing-data
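A minimal sketch of building a data citation around a persistent identifier. It assumes the commonly recommended pattern "Creator (Year). Title. Publisher. Identifier"; the `format_data_citation` helper, the names and the DOI `10.1234/example.5678` are hypothetical. Check the citation style required by your repository or journal.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Format a dataset citation as: Creator (Year). Title. Publisher. Identifier."""
    names = "; ".join(creators)
    # Express the DOI as a resolvable URL so the citation stays actionable
    return f"{names} ({year}). {title}. {publisher}. https://doi.org/{doi}"


# Hypothetical values, for illustration only
citation = format_data_citation(
    creators=["Example, A.", "Sample, B."],
    year=2015,
    title="Example household survey 2014",
    publisher="Example Data Repository",
    doi="10.1234/example.5678",
)
print(citation)
```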
DATA
ACCESS
Do you have permission to share? If so, what?
Data Selection
Motivation
• Meet funder / journal obligations
• Encourage research use
• Higher citation rate
• Reproduce & validate results
Constraints
• Concern that sharing will attract a lower rate of response or that people will be less honest
• Intellectual Property Rights issues
• Participant consent doesn’t address sharing
• Data Protection legislation
Data sharing decisions built upon
recognition of all influencing factors
Information Commissioner's Office. Data Sharing Code of Practice
http://www.ico.org.uk/for_organisations/data_protection/topic_guides/data_sharing/
Handling individual level data
• Collected and analysed for
specific purpose
• Stored no longer than is
necessary
• Kept securely and safely to
prevent unauthorised or
unlawful access, processing, loss,
or destruction
EU Data Protection Directive 95/46/EC establishes limitations on
how information on living individuals is held and used
Reform of the data protection legal framework in the EU
http://ec.europa.eu/justice/data-protection/reform/index_en.htm
Data Sharing as a barrier
Investigation of influence of open data policies on consent rate:
• No participants declined to participate, regardless of condition
• Rates of drop-out vs completion did not vary between open/non-open policies
• No significant change in potential consent rates when participants were openly asked
about the influence of open data policies on their likelihood of consent.
Some researchers consider sharing obligations to be a
barrier to research participation
Access Status
Control method
• Data Transfer Agreement
• Access controls
Application process:
• Request form
• Review process
Access criteria:
• Permitted users – how do you identify?
• Permitted use – topic, academic use, etc.
• Other criteria: encryption, time period
Open vs. controlled access
https://www.flickr.com/photos/toruokada/16958186672/
DATA
INTEROPERABILITY
Can data be analysed and harmonized?
Data Standards
Data exchange is dependent upon:
• Open formats
• Common standards
• Documented metadata specification
• Consistent vocabulary
• Documented workflows
https://biosharing.org/
Clinical Data Interchange Standards Consortium (CDISC)
Standards intended to improve consistency
across the clinical trial lifecycle
• Protocol: Protocol Representation Model
• Data Collection: Clinical Data Acquisition Standards Harmonization (CDASH)
• Data Tabulation: Study Data Tabulation Model (SDTM)
• Data Analysis: Analysis Data Model (ADaM)
• Archiving and exchange: Operational Data Model (ODM) and Define-XML
Data Documentation Initiative
• Maintained & developed by DDI Alliance
• Supported by data archives, producers,
research data centers, university data
libraries, statistics organizations, etc.
• Two versions:
– DDI2 / Codebook: An archived instance of
a study
– DDI3 / DDI Lifecycle: Suitable for
longitudinal and repeated surveys
An XML-based metadata standard developed for social science
and economic statistics
http://www.ddialliance.org/
Survey Data Model
[Diagram: a Study measures Concepts about Universes, using Survey Instruments made up of Questions. Questions collect Responses, resulting in Data Files comprised of Variables with values of Categories/Codes, Numbers.]
Slide source:
https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.33/2011/mtg2/WP_1_Arofan.ppt
DDI Codebook
A codeBook consists of:
1. docDscr: describes the DDI document
2. stdyDscr: Title, abstract, methodologies, agencies, access policy
3. fileDscr: a description of files in the dataset
4. dataDscr: variables (name, code, etc.), variable groups, cubes
5. otherMat: other related materials, e.g. document citation
3 levels - Study, dataset, variable
Preserves the collection of files associated with
an archival copy of a survey
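A minimal sketch of the five-part codeBook structure, built with Python's standard library. It is not schema-valid DDI: namespaces and mandatory elements are omitted, and the nested element names (`citation`, `titlStmt`, `titl`, `fileTxt`, `fileName`, `var`, `labl`) are given from memory and should be checked against the DDI Codebook schema. The fifth section is written as `otherMat`, which is the element name in the published schema as far as I am aware. The `build_codebook` helper and all values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_codebook(study_title, file_name, variables):
    """Build a skeletal codeBook element with its five top-level sections."""
    root = ET.Element("codeBook")

    # 1. docDscr: describes the DDI document itself (left empty here)
    ET.SubElement(root, "docDscr")

    # 2. stdyDscr: study-level description; only the title is filled in
    stdy = ET.SubElement(root, "stdyDscr")
    titl_stmt = ET.SubElement(ET.SubElement(stdy, "citation"), "titlStmt")
    ET.SubElement(titl_stmt, "titl").text = study_title

    # 3. fileDscr: dataset-level description of a data file
    file_txt = ET.SubElement(ET.SubElement(root, "fileDscr"), "fileTxt")
    ET.SubElement(file_txt, "fileName").text = file_name

    # 4. dataDscr: variable-level description (name and label only)
    data = ET.SubElement(root, "dataDscr")
    for name, label in variables:
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(var, "labl").text = label

    # 5. otherMat: other related materials (left empty here)
    ET.SubElement(root, "otherMat")
    return root


# Hypothetical study, for illustration only
codebook = build_codebook(
    "Example household survey 2014",
    "survey_2014.csv",
    [("V101", "Age at last birthday"), ("V102", "Sex")],
)
xml_text = ET.tostring(codebook, encoding="unicode")
print(xml_text)
```

The three levels (study, dataset, variable) correspond to sections 2, 3 and 4 of the skeleton.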
DDI Lifecycle
http://www.ddialliance.org/what
Data collector
Data Analyst Data Curator
Secondary user
Each stage may be performed by different groups
DDI Metadata reuse
Basic metadata can be reused during study life:
• Concepts, questions, responses, variables, categories, codes, survey
instruments, etc. may be adopted from earlier waves
Referencing earlier iterations:
• Unique identifier
• Version number - control over time
Common metadata ‘groups’ maintained by specific agencies:
• Schemes: lists of items of a single type
• Modules: metadata for a specific purpose or lifecycle stage
• All maintainable metadata has a known owner or agency
Unique ID example
urn=“urn:ddi:3_0:VariableScheme.Variable=pop.umn.edu:STUDY0145_VarSch01(1_0).V101(1_1)”
• This is a URN from DDI Version 3.0 for a variable
• The scheme agency is pop.umn.edu
• The scheme identifier is STUDY0145_VarSch01, version 1.0
• The variable ID is V101, version 1.1
http://www.iza.org/conference_files/eddi09/ppt/thomas_wendy_course.pdf
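A minimal sketch of splitting the example URN into its parts. The regular expression is an assumption that covers only the form shown on this slide (scheme type, item type, agency, versioned scheme ID, versioned item ID); it is not the full DDI URN syntax, and `parse_ddi_urn` is a hypothetical helper.

```python
import re

# Pattern for the slide's example form only:
# urn:ddi:<ddi version>:<SchemeType>.<ItemType>=<agency>:<scheme id>(<ver>).<item id>(<ver>)
URN_PATTERN = re.compile(
    r"^urn:ddi:(?P<ddi_version>[^:]+):"
    r"(?P<scheme_type>\w+)\.(?P<item_type>\w+)="
    r"(?P<agency>[^:]+):"
    r"(?P<scheme_id>[^()]+)\((?P<scheme_version>[^()]+)\)\."
    r"(?P<item_id>[^()]+)\((?P<item_version>[^()]+)\)$"
)


def parse_ddi_urn(urn):
    """Split a DDI 3.0 style URN into a dict of named parts."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"Not a recognised DDI URN: {urn!r}")
    parts = match.groupdict()
    # Versions are written with underscores in the URN (1_0 means 1.0)
    for key in ("ddi_version", "scheme_version", "item_version"):
        parts[key] = parts[key].replace("_", ".")
    return parts


parts = parse_ddi_urn(
    "urn:ddi:3_0:VariableScheme.Variable=pop.umn.edu:"
    "STUDY0145_VarSch01(1_0).V101(1_1)"
)
for name, value in parts.items():
    print(f"{name}: {value}")
```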
DDI Cross-study comparison
Variables are comparable if they possess same properties:
• Age is comparable if it has:
– Same concept (e.g., age at last birthday)
– Same top-level universe (people)
– Same representation (i.e., an integer from 0-99)
DDI Comparison module:
• Place similar items in same group and perform tailored comparison
• Mappings are context-dependent, i.e. sufficient for purposes of particular
research
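A minimal sketch of the comparability rule, assuming each variable is summarised by its concept, top-level universe and representation. The `Variable` class and `are_comparable` function are hypothetical and are not part of the DDI Comparison module; in practice mappings are context-dependent and may be looser than strict equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Hypothetical summary of the properties that determine comparability."""
    name: str
    concept: str
    universe: str
    representation: str


def are_comparable(a, b):
    """Two variables are comparable if concept, universe and representation all match."""
    return (a.concept, a.universe, a.representation) == (
        b.concept, b.universe, b.representation
    )


# Hypothetical variables from two survey waves, for illustration only
age_wave1 = Variable("V101", "age at last birthday", "people", "integer 0-99")
age_wave2 = Variable("Q7", "age at last birthday", "people", "integer 0-99")
age_banded = Variable("AGEGRP", "age at last birthday", "people", "5-year bands")

print(are_comparable(age_wave1, age_wave2))   # same concept, universe, representation
print(are_comparable(age_wave1, age_banded))  # representation differs
```

Variable names are deliberately ignored: two waves can label the same measure differently and still be comparable.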
DDI Tools
DDI Codebook:
• Nesstar Publisher & Server
• IHSN Microdata Management
Toolkit
• Colectica
• NADA
• UKDA - DExT, ODaF DeXtris
DDI Lifecycle:
• Colectica Designer, Colectica for
Excel, Colectica Portal
• Sledgehammer
DDI Tools
http://www.ddialliance.org/resources/tools
DATA
REUSE
Can data be used for further research?
Data Rights
• Many rights apply to data
– Copyright
– Moral
– Database
– Patents & trade secrets
• Rights issues vary between
countries
• Ensure your project has clarified
rights issues before sharing
https://www.flickr.com/photos/riekhavoc/4813140176/
Rights issues influence how data can be shared, used and cited
FAIR data
Findable
• Describe your data in a data repository
• Apply a persistent identifier
Accessible
• Consider what will be shared
• Obtain participant consent & perform risk management
Interoperable
• Use open formats
• Consistent vocabulary
• Common metadata standards
Reusable
• Consider permitted use
• Apply appropriate licence
Thank You for your attention!
Questions
