RDFC2012 Open Access to Research Data

Transcript of "RDFC2012 Open Access to Research Data"

  1. Open access to scientific research data
     Gudmundur A. Thorisson, PhD <gt50@leicester.ac.uk>
     Research associate, University of Leicester
     Guest scientist, University of Iceland
     Participant in the GEN2PHEN Consortium and the ORCID Technical Working Group
     This work is published under the Creative Commons Attribution license (CC BY: http://creativecommons.org/licenses/by/3.0/), which means that it can be freely copied, redistributed and adapted, as long as proper attribution is given.
  2. Overview
     ๏ Intro to the world of Big Science & Big Data
        • Why is inadequate access to data such a problem?
     ๏ Incentive-based approaches to tackling the sharing problem
        • Identification, identification, identification
     ๏ Key relevant developments internationally
     ๏ Some food for thought for funders, institutions, other key players
     ๏ Concluding remarks
     RDFC2012 Conference on Open Access and Digital Rights, Reykjavik, March 29th 2012
  3. Big Science, Big Data
     • Scientific research increasingly large-scale and data-driven
     • High-profile discipline examples
        – High-energy particle physics - experiments performed in the Large Hadron Collider
        – Astronomy - data from ground-based and space telescopes, the Virtual Observatory (VO)
     • Doctorow, C. Big data: Welcome to the petacentre. Nature 455, 16-21 (2008). http://dx.doi.org/10.1038/455016a
  4. Hypothesis generation guided by available data
     Kell and Oliver. Bioessays (2004) vol. 26 (1)
     • Science paradigms
        – 1st: Empirical - describing natural phenomena
        – 2nd: Theoretical - models, generalizations
        – 3rd: Computational - simulating complex phenomena
        – 4th (1+2+3): Data exploration, e-Science
     Gray, J. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research
  5. Biological research too is increasingly big and data-driven
     • From: small-scale datasets that fit into a printed journal article
     Richards, M. et al. Paleolithic and neolithic lineages in the European mitochondrial gene pool. American Journal of Human Genetics 59, 185-203 (1996). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1915109/
  6. Biological research too is increasingly big and data-driven
     • To: large-scale collection of biological data in digital form
     • Huge technological advances in last 5-10 years
        – experimental / observations <-- gathering data with high-throughput equipment
        – computer technology <-- storing & analyzing massive data volumes
     • Example: massively-parallel sequencing
        – Determine human genome sequence in <1 day - the $1000 genome
        – Metagenomics: sequence *everything* in environment samples
        – Large bio-specimen collections
           • 100,000s of individuals in disease/population biobanks
  7. Examples: domain repositories for sequence data
     • GenBank - genetic sequence repository, established 1986
     • UniProt - knowledge base for protein sequence & function
  8. “Community resource projects” - large-scale data generation for the purpose of making the data available for broad reuse
     • The sequence of the human genome
        – International Human Genome project - mandatory rapid data sharing, the Bermuda principles
     • Pattern of variation in the human genome
        – International Haplotype Map Project - genotyping population samples
        – 1000 Genomes Project - sequencing population samples
  9. Big Data – challenges, opportunities
     • Managing & making sense of large-scale datasets
        – Data easy/cheap to generate - not so cheap to store & use
        – Favorite quote: “the $1000 genome sequence, followed by the ++$10,000 analysis”
     • Integration & analysis - combining datasets
        – more data of the same type - e.g. combine sequences from multiple species
        – related data of different type - e.g. a person’s genome sequence + his/her phenotype
     • Potential for accelerating research, creating new knowledge and (in biomedicine) improving human health
     • Key driver = unrestricted sharing of scientific data deposited in the public domain
  10. Data = “fuel” of science
      Smith, V. Data publication: towards a database of everything. BMC Res Notes (2009) vol. 2 (1)
  13. Data = “fuel” of science
      “[..] If digital technologies are the engine of this revolution, digital data are its fuel. But for many scientific disciplines, this fuel is in short supply. [..]”
      Smith, V. Data publication: towards a database of everything. BMC Res Notes (2009) vol. 2 (1)
  14. Biology and data sharing in the “long tail”
      • Biology is complex, so data are often very heterogeneous
      • Technologies changing rapidly
      • Lots of small-scale research projects
      • Lots of small/medium datasets
      The ‘long tail’ of dark bio-data
      • Data in the long tail usually *not* shared OR not shared in a useful way
      • Contrast with other data-intensive disciplines with
         – a long history of sharing research data - a “culture of sharing”
         – big, expensive, shared facilities = the only way to do this kind of research
         – relatively homogeneous datasets, easier to scale up to big volumes (e.g. telescope images)
  15. “[…] Overall, only 47 papers (9%) deposited full primary raw data online. None of the 149 papers not subject to data availability policies made their full primary data publicly available.
      Conclusion: A substantial proportion of original research papers published in high-impact journals are either not subject to any data availability policies, or do not adhere to the data availability instructions in their respective journals. This empiric evaluation highlights opportunities for improvement.”
  16. DATA → analysed, synthesised, interpreted → INFORMATION → published → KNOWLEDGE (Publication)
      • Lots of published knowledge, but hard/impossible to go back and reproduce work & validate findings
      • + Opportunity for maximising the value of data through reuse is wasted
  17. Credit: http://cutcaster.com/photo/800902839-The-hand-drawing-question-WHY/
  18. Lots and lots of diverse reasons!!
      Some quotes from researchers:
      • “Don't know which digital repository I should upload to”
      • “Too much work, got better things to do!”
      • “My competitors will just take the data and ‘scoop’ me”
      • “It's my data, I collected them and no one else is entitled to use them”
      • “[myriad other reasons]”
      Worryingly, many authors don't seem to care whether evidence underpinning their published findings is accessible or not
      Koslow. Should the neuroscience community make a paradigm shift to sharing primary data? Nat Neurosci (2000) vol. 3 (9). http://dx.doi.org/10.1038/78760
  19. Gnarly issue #1: “ownership” vs “stewardship”
      • Many researchers consider data their property, even if research funded by public money
         – e.g. want to do further analysis on data in future, publish more papers
      • ..which conflicts with interests of other stakeholders in the game (e.g. funders, universities) who want:
         – to maximize return on investment in the funded research
         – to ensure good, solid evidence-based science is done, etc.
  20. Gnarly issue #2 – biomedical data
      • Usually sensitive, cannot be shared without restrictions
         – Detailed, reidentifiable biomedical data that cannot be fully anonymized
         – Personal privacy considerations
      • Specialized controlled-access archives deal with some of this
         – NCBI's database of Genotypes and Phenotypes – dbGaP
         – European Genome-phenome Archive – EGA
         – [specific diseases / disorders, research consortia, others]
  21. How to Make a Tackle in Rugby
      Tackling in rugby is one of the most important aspects of the game. [...]
      Credit: http://djamba.com/how-to-make-a-tackle-in-rugby.html
  22. ...which are an imperfect solution
      • Arguments that mandates by themselves are not the way
      • Mandates likely to ensure only minimum compliance
         – sharing would be done in minimally useful form (as in, whatever is the least effort)
      • …. and are meaningless if not enforced (currently the case with many journals)
  23. Sharing now tends to be driven by mandates...
      * Journals increasingly require data to be made available
        “Provide supporting data in a repository OR we won’t publish your paper”
      * Funders increasingly require data sharing plan & budget baked into grant proposals
        “Publish data we are funding you to generate OR we will not fund your research again”
      Using just a stick gets you only so far
  24. Strategies focused on encouraging sharing
      - Make it easy -
      - Make it useful -
      - Make it citable -
  25. Treating data as citable publications in their own right
      • Core strategy: enable data to be treated as 1st class citizens of the scholarly record which:
         i) are indexed and can be discovered, located and accessed, and
         ii) can be properly identified & cited unambiguously like other scholarly works
      • Link datasets with the primary journal publication - citation crosslinks
      • Give data creators/curators/analysts proper credit for their contribution to the digital resource
      • Focus on the benefits to researchers from publishing their data
         – Data sharing → Data PUBLICATION + CITATION
         – Others reuse & cite their stuff → more citations → more impact
         – The more useful a dataset, the more likely to be used & cited
  26. Exemplar – Data Dryad
      “international repository of data underlying peer-reviewed articles in the basic and applied biosciences” http://datadryad.org
      • Combines
         – Mandates (journal policy) and
         – Citable data publication
      • Citation cross-linking
         – Paper references dataset
         – Dataset references paper
  27. Key building blocks: the 3 I’s of identification
  28. 1I: Identifying scholarly publications (and other research outputs)
      • Why? So it is possible to..
         ..cite the work unambiguously (‘..we used the method described in Thorisson et al (2009)’)
         ..locate the work (retrieve Nature article as PDF from journal website)
         ..give credit to persons/entities who contributed to the work (G. Thorisson authored paper X)
      • Need for globally unique, persistent identifiers to combat unstable Web URLs, broken hyperlinks
      • e.g. Digital Object Identifiers (DOIs) for pubs, datasets and more:
         – Bell et al. 2009. Science 323(5919) doi:10.1371/journal.pone.0024357
         – Goodwillie C et al (2005) Data from: The evolutionary enigma of mixed mating systems in plants: occurrence, theoretical explanations, and empirical evidence. Dryad Digital Repository. doi:10.5061/dryad.292q34fp
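A minimal sketch of what "persistent identifier" means in practice for slide 28: a DOI written as `doi:...` in a citation can be turned into a stable link by prefixing a resolver address. The helper name `doi_to_url` and the `RESOLVER` constant are illustrative assumptions, not part of the talk; the slides themselves cite the older `http://dx.doi.org/` form of the resolver.

```python
from urllib.parse import quote

# Assumption: the public DOI resolver address. The slides use the older
# http://dx.doi.org/ prefix, which serves the same purpose.
RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a citation-style DOI ('doi:10.xxxx/suffix') into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Every DOI starts with the directory indicator '10.' and has a
    # prefix/suffix separated by a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return RESOLVER + quote(doi, safe="/")


# Dataset DOI exactly as it appears on the slide.
print(doi_to_url("doi:10.5061/dryad.292q34fp"))
```

The point of the indirection is that the identifier stays fixed while the resolver redirects to wherever the paper or dataset currently lives.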
  29. 2I: Identifying use/reuse - measuring impact
      – Historical reliance on formal citations and citation-based metrics
      – ISI Impact Factor widely used, but really a metric for the influence of a scholarly journal
      – Citation analysis not going away - remains the gold standard
      – Many other use/reuse indicators for impact of individual research outputs
         • Focus on the impact of the *publication* itself, not the journal in which it appears
         • Indicators: no. of full-text downloads, tweets (i.e. mentions on Twitter), social bookmarking
         • AltMetrics - a growing grassroots movement “to better measure and reward all the different ways that people contribute to the messy and complex process of scientific progress [..] born out of a simple recognition: Many of the traditional measurements are too slow or simplistic to keep pace with today’s Internet-age science” http://altmetrics.org
      – Lots of new tools and projects emerging to explore possibilities in this space
         • e.g. http://total-impact.org
  30. 3I: Identifying contributors – attributing credit
      – Why? So we can..
         ..link content creators with their works - attribute credit accurately
         ..figure out: who contributed to publication X? which publications has person/organization Y contributed to?
      – What kind of contributions? Characterizing ‘contributorship’
         author, creator, analyst, reviewer, ‘conceived of study & designed experiment’ etc
  31. Tackling the author name ambiguity problem (or ‘Who’s Who?’)
      Are these authors all the same person?
         G. Thorisson, University of Leicester
         G. A. Thorisson, University of Leicester
         G. A. Thorisson, Cold Spring Harbor Laboratory
      How about these? Or these?
         J. Smith, J. Smith, J. Smith, J. Smith, J. Smith [etc.]
      “∼2/3 of the ∼6 million authors in MEDLINE share a last name and first initial with at least one other author, and an ambiguous name refers to ∼8 persons on average.”
      Torvik and Smalheiser. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (2009) vol. 3 (3)
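To make the ambiguity on slide 31 concrete, here is a small sketch that groups author records by last name plus first initial, the same key the MEDLINE statistic refers to. The functions `name_key` and `ambiguous_groups` are hypothetical helpers written for this illustration, and the two "Institution A/B" affiliations for J. Smith are invented placeholders; only the Thorisson entries come from the slide.

```python
from collections import defaultdict


def name_key(display_name):
    """Reduce a name like 'G. A. Thorisson' to ('thorisson', 'g')."""
    parts = display_name.replace(".", " ").split()
    return parts[-1].lower(), parts[0][0].lower()


def ambiguous_groups(records):
    """Return only the name keys shared by more than one record."""
    groups = defaultdict(list)
    for name, affiliation in records:
        groups[name_key(name)].append((name, affiliation))
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


records = [
    ("G. Thorisson", "University of Leicester"),
    ("G. A. Thorisson", "University of Leicester"),
    ("G. A. Thorisson", "Cold Spring Harbor Laboratory"),
    ("J. Smith", "Institution A (hypothetical)"),
    ("J. Smith", "Institution B (hypothetical)"),
]

for key, recs in ambiguous_groups(records).items():
    print(key, "->", len(recs), "records that may or may not be one person")
```

The name string alone cannot say whether the three Thorisson records are one person or the two Smith records are two people, which is the gap a unique contributor identifier is meant to close.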
  32. The Open Researcher & Contributor ID initiative
      Launched end of 2009, ORCID will work to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors through unique identifiers
  33. The Open Researcher & Contributor ID initiative
      ORCID will add value for scholars and the organizations that they are interacting with, including universities, scholarly societies, funding organizations and publishers
      • Joins faculty or student body
      • Joins scholarly society
      • Applies for grant
      • Submits manuscript
  34. ORCID transcends discipline, geographic, national and institutional boundaries - now >300 participants
      http://www.orcid.org
  35. Some food for thought / recommendation kind of stuff to conclude
      • Status of research data in Iceland is unclear → need research
         – Build on & extend 2007 Rannís report “Gagnagrunnar á Íslandi um náttúru, umhverfi og orku”
         Rannís, we're looking at you!
      • Funders to take lead
         – Mandates (aka sticks) - require data management plan + budget in grant proposals
            • Many best practices & tools available to draw upon, e.g. by the UK Digital Curation Centre
         – Call for & fund research proposals to build infrastructural foundations & explore technologies/initiatives
         – Raise awareness in the local research community
  36. Even more food for thought / recommendation kind of stuff
      • Universities & other research institutions need to
         – Take research data seriously
         – Build infrastructure for data storage & preservation, support personnel (e.g. data officers / coordinators)
         – Include datasets and other non-conventional outputs in professional evaluations
      • Identify & engage with key international initiatives in this space
         – ORCID, DataCite, Dryad, Open Knowledge Foundation, others
         – OpenAIREplus ← Solveig's talk coming up!
  37. Final bite of food-for-thought
      Let's make research data an integral part of the OA mission in Iceland, NOT an afterthought
  38. Acknowledgements
      • GEN2PHEN Consortium - http://www.gen2phen.org/about-gen2phen/partners
      • Prof Anthony J. Brookes
      • Bioinformatics Group, Leicester
      • ORCID - http://www.orcid.org
      This work has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project.
      Contact me!
         <gthorisson@gmail.com>
         http://www.linkedin.com/in/mummi
         http://www.twitter.com/gthorisson
         http://www.gthorisson.name
      Published under the Creative Commons BY license (http://creativecommons.org/licenses/by/3.0/)