RESEARCH DATA SHARING:
A BASIC FRAMEWORK
Paul Groth @pgroth
pgroth.com
Elsevier Labs @elsevierlabs
LERU Summer School 2016
Data Stewardship for Scientific Discovery and Innovation
WHAT IS DATA?
“Data refers to entities used as evidence of phenomena for
the purposes of research or scholarship”
[Borgman, Big Data, Little Data, No Data, 2015, p. 29]
WHY COLLECT
DATA?
Borgman, C. L. (2012). The conundrum of sharing
research data. Journal of the American Society for
Information Science and Technology.
HOW IS DATA
OBTAINED?
Borgman, C. L. (2012). The conundrum of sharing
research data. Journal of the American Society for
Information Science and Technology.
WHY SHARE DATA?
• R1: reproduce or verify research
• R2: make results of publicly funded
research available to the public
• R3: enable others to ask new
questions of extant data
• R4: advance the state of research
and innovation
Borgman, C. L. (2012). The conundrum of sharing research data.
Journal of the American Society for Information Science and
Technology.
• All empirical papers must archive their data upon acceptance in order to be published unless the authors provide
a compelling reason why they cannot (e.g., expense, confidentiality). The action editor will be the final arbiter of whether the reason is
sufficiently compelling.
• “Data” refers to an electronic file containing nonidentified responses that are potentially already coded. Normally, the data would
represent an early stage of electronic processing, before individual responses have been aggregated. The data must be in
a form that allows all reported statistical analyses to be reproduced
while retaining the confidentiality of individual participants. This entails that the data are formatted and documented in a way that makes
the structure of the data set readily apparent.
• Archiving consists either of submitting the data to the journal (to be displayed as supplementary material at the end of the article),
sending it to some other archive that is accessible to established researchers and maintained by a substantial established institution, or
authors making the data available on their own website, assuming that they can assure us the site will be maintained by a recognized
institution for a reasonable period of time. Again, action editors will be the final arbiters of the appropriateness of an archive.
• Any publication that reports analyses of or refers to archived data will be expected to cite the original
publication in which the data were reported.
• This policy is new and therefore open to modification. Our aim is to implement a policy that maximizes transparency while minimizing the
burden on authors.
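The policy's requirement that shared data be nonidentified and documented so that its structure is readily apparent can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the column names, the sample responses, and the list of identifying fields are hypothetical. Dropping direct identifiers is only a first step and does not by itself guarantee confidentiality.

```python
import csv
import io

# Hypothetical raw responses; column names and values are illustrative only.
RAW = """participant_name,email,age_group,condition,score
Alice,a@example.org,18-25,control,4
Bob,b@example.org,26-35,treatment,7
"""

# Assumed list of direct identifiers for this hypothetical study.
IDENTIFYING = {"participant_name", "email"}


def deidentify(rows, identifying):
    """Return copies of the row dicts with identifying columns dropped."""
    return [{k: v for k, v in row.items() if k not in identifying} for row in rows]


def codebook(rows):
    """Map each remaining column to the sorted distinct values observed in it."""
    columns = list(rows[0].keys()) if rows else []
    return {c: sorted({row[c] for row in rows}) for c in columns}


rows = list(csv.DictReader(io.StringIO(RAW)))
shared = deidentify(rows, IDENTIFYING)  # individual responses, not aggregates
book = codebook(shared)                 # simple documentation of the structure
```

Here `shared` keeps one record per participant, matching the policy's "before individual responses have been aggregated", while `book` is a bare-bones stand-in for the documentation an archive would expect.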
THE IMPORTANCE OF CITING DATA
Data Citation Synthesis Group: Joint Declaration of Data Citation
Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014
[https://www.force11.org/group/joint-declaration-data-citation-principles-final].
1. Importance
2. Credit and Attribution
3. Evidence
4. Unique Identification
5. Access
6. Persistence
7. Specificity and Verifiability
8. Interoperability and Flexibility
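As a rough illustration of how principles such as Credit and Attribution, Unique Identification, and Specificity might show up in practice, the sketch below assembles a citation string from dataset metadata. The class name, field names, citation layout, and the example dataset with its DOI are all hypothetical; the Joint Declaration does not prescribe a particular citation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCitation:
    """Hypothetical container for the metadata a data citation might carry."""
    authors: tuple    # credit and attribution
    year: int
    title: str
    repository: str   # where the data can be accessed
    version: str      # specificity: which version was actually used
    doi: str          # unique, persistent identifier

    def format(self):
        """Render the metadata as a single human-readable citation string."""
        names = "; ".join(self.authors)
        return (
            f"{names} ({self.year}). {self.title} "
            f"(Version {self.version}) [Data set]. "
            f"{self.repository}. https://doi.org/{self.doi}"
        )


# Made-up example values; the DOI below is not a real identifier.
example = DatasetCitation(
    authors=("Doe, J.", "Roe, R."),
    year=2016,
    title="Example survey responses",
    repository="Example Repository",
    version="2",
    doi="10.1234/example.5678",
)
text = example.format()
```

Keeping the version and identifier as separate fields means the same metadata can be rendered for human readers or exported for machines, which is in the spirit of Interoperability and Flexibility.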
10 ASPECTS OF HIGHLY EFFECTIVE RESEARCH DATA
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
https://storify.com/chenghlee/dataformathell
http://isps.yale.edu/sites/default/files/files/IDCC14_DQR_PeerGreenStephenson.pdf
ALL DATA ISN’T SUCCESSFUL
BARRIERS TO REACHING SUCCESSFUL
DATA?
Common practice: data is very fragmented
Using antibodies
and squishy bits
Grad Students experiment
and enter details into their
lab notebook.
The PI then tries to make
sense of their slides,
and writes a paper.
End of story.
ALL DATA ISN’T CURATED
Cost of documentation
http://www.indoition.com/en/services/costs-prices-software-documentation.htm
Yolanda Gil, USC Information Sciences Institute, gil@isi.edu
Measuring Time Savings with
“Reproducibility Maps” [Garijo et al PLOS CB12]
2 months of effort in reproducing published method (in PLoS’10)
Authors’ expertise was required
Comparison of
ligand binding
sites
Comparison of
dissimilar protein
structures
Graph network
generation
Molecular Docking
Work with D. Garijo of UPM and P. Bourne of UCSD
CURRENT STRATEGIES FOR DATA SHARING
SUBJECT SPECIFIC REPOSITORIES
COMMUNITY SPECIFIC REPOSITORIES
GENERIC REPOSITORIES
http://data.mendeley.com/
Each dataset receives a versioned
DOI, so it can be cited
The citation for the
associated article is
displayed
DATA PUBLICATION
BENEFITS OF MACHINE READABILITY
HOW DO WE MOVE UP THE PYRAMID?
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
60% OF TIME IS SPENT ON DATA
PREPARATION
CURATED DATA SETS
http://ivory.idyll.org/blog/replication-i.html
MORE SEMANTICS
A FRAMEWORK FOR HELPING
RESEARCHERS SHARE DATA
• What data?
• Determine the context
• Why is data being collected?
• How is data obtained?
• What is the researchers’ reason for sharing?
• Document
• Understand Cost/benefit tradeoffs
• Target audience
• Automation
FURTHER READING
• Syllabus for Data Management and Practice, Part I, Winter 2016. Data
Management and Practice, Part I (2016). Christine L. Borgman.
https://works.bepress.com/borgman/381/
• Christine L. Borgman. “Big Data, Little Data, No Data”
• Reference list
://www.zotero.org/groups/borgman_big_data_little_data_no_data
• Borgman, C. L. (2012). The conundrum of sharing research data. Journal of
the American Society for Information Science and Technology.
• Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, et al. (2014)
Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Comput
Biol 10(4): e1003542. doi: 10.1371/journal.pcbi.1003542

Editor's Notes

  • #19 http://www.tamr.com/piketty-revisited-improving-economics-data-science/
  • #30 NASA, A.40 Computational Modeling Algorithms and Cyberinfrastructure, tech. report, NASA, 19 Dec. 2011