A Future Where Data Citation Counts


Presentation by Heather Piwowar at the International Digital Curation Conference #idcc

  1. A future where data attribution Counts. Heather Piwowar, @researchremix. DataONE postdoc with NESCent and Dryad. #idcc11. Some photos NC, SA.
  2. http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm "If I have seen farther it is by standing on the shoulders of giants," said Isaac Newton, and others before him. While historians speculate that Isaac Newton was actually being sarcastic,
  3. http://www.flickr.com/photos/jsmjr/62443357/ most of us would agree that science progresses by standing on the shoulders of those who came before. Or by kneeling on their backs. Or clambering up their work any other way we can.
  4. http://www.flickr.com/photos/camilleharrington/3587294608/ Many of us believe that when we share our research output, not only as published research descriptions but also in the form of open datasets and methods, we are, in effect, making our shoulders broader.
  5. http://www.flickr.com/photos/rkuhnau/3318245976/ All of a sudden, a lot more people can build on our work.
  6. http://www.flickr.com/photos/conformpdx/1796399674/ Researchers can climb higher than otherwise possible,
  7. http://www.flickr.com/photos/rkuhnau/3317418699/ and jump up and down on our findings to make sure they are really stable.
  8. http://www.flickr.com/photos/zemlinki/261617721/ It allows contributions from places we may never have expected,
  9. http://www.flickr.com/photos/tracenmatt/3020786491/ and investigators can explore places they never could have on their own.
  10. http://www.flickr.com/photos/the-o/2078239333/ In short, our broad-shouldered research can make a contribution that far exceeds its original role.
  11. This is a great story, right? And it is why we are all here. But it is also a great metaphor for the problem.
  12. http://www.flickr.com/photos/davemurr/4592014327/ What exactly do broad shoulders get the individual researcher? Pain! Because a few citations, as much as we'd like to think otherwise, aren't enough to offset the hard work and the Fear, Uncertainty and Doubt that accompany the costs of uploading a dataset in the current culture.
  13. http://www.flickr.com/photos/joshb/25983792 Nobody looks at the supporting structure of an impressive tower. We are all busy ogling the top. That means these people? These ones with the shoulders? They've got nothing.
  14. http://www.flickr.com/photos/joshb/25983792 Everyone is looking at this guy,
  15. http://www.flickr.com/photos/joshb/25983792 not this one. He's not getting any fame or glory here; he isn't making great strides in his career.
  16. http://www.flickr.com/photos/joshb/25983792 OK, maybe this guy gets some citations. Not enough.
  17. http://www.flickr.com/photos/joshb/25983792 Everyone is looking at this guy.
  18. http://www.flickr.com/photos/supersam5/216868485/ This person.
  19. http://www.flickr.com/photos/commissariat/4829261601/in/faves-30112411@N02/ Somebody else gets to be top dog. And I think a lot of researchers actually believe that by making their shoulders broader they enable others to become top dog at their expense.
  20. http://www.flickr.com/photos/sunrise/35819369/ A few citations aren't enough to overcome that fear.
  21. Gleditsch et al. 2003. Posting Your Data: Will You Be Scooped or Will You Be Famous? International Studies Perspectives 4(1): 89–97. Piwowar et al. 2007. Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE. Ioannidis et al. 2009. Repeatability of Published Microarray Gene Expression Analyses. Nature Genetics 41: 149–155. Pienta et al. 2010. NSR Social Science Secondary Use. Michigan IR. Henneken et al. 2011. Linking to Data – Effect on Citation Rates in Astronomy. ESO. Sears 2011. Data Sharing Effect on Article Citation Rate in Paleoceanography. AGU. Don't get me wrong, I'm a fan of studies that show a citation benefit for sharing data :) But it won't be enough.
  22. http://www.flickr.com/photos/bfhoyt/4606049592/ If it were, we'd have researchers knocking down the doors of our IR for the 10-minute job of sending in their preprints. They aren't doing that.
  23. But...
  25. So. What to do about it? How to change the culture?
  26. We need to facilitate deep recognition of the labour of dataset creation. Hat tip to John Wilbanks. OK, let me say that again because it is so important: we need to facilitate deep recognition of the labour of dataset creation.
  27. http://www.flickr.com/photos/g_kat26/4255119413/ Let's dig in to how these groups do impact tracking now, and how they'd like to do it in the future.
  28. http://www.flickr.com/photos/joshb/25983792 How do researchers value their own contributions now?
  29. http://www.flickr.com/photos/europedistrict/5692787622/ Data repositories, which we might view as personal trainers, perhaps.
  30. http://www.flickr.com/photos/digitaljourney/5767535618/ And funders, the ones who pay for all of the gym equipment.
  31. Researchers
  32. Investigators, today, can list research products on their CV. This can include datasets.
  34. http://total-impact.org A CV is sort of bland, don't you think? It has no context of use. One version of a more useful future comes from a tool called total-Impact. Continuing a project that started as a hackathon at the Open Society Foundation workshop Beyond Impact, organized by Cameron Neylon here in the UK last spring, Jason Priem, I, and a few other people have been working on it.
  35. http://total-impact.org total-Impact aggregates metrics, both traditional and non-traditional, for traditional research products like articles.
  36. http://total-impact.org You can drill in. The metrics are citations, but also altmetrics. PLoS has done some of the groundbreaking work in this space with article-level citations, but a lot of other metrics are available also: various indications that others have found your research worth bookmarking, or blogging, or referencing on Wikipedia.
  37. http://total-impact.org It also covers non-traditional research products like datasets. It doesn't currently look for dataset identifiers in public R packages, but it could, for example, as an indication of use. This makes a "live CV", if you will, giving post-publication context to research output.
  38. http://total-impact.org This is where citations would go. More on that later.
  39. Repositories. Repositories, today,
  40. http://dx.doi.org/10.5061/dryad.18 can look at graphs of their deposit counts. Many know their own download statistics; some share these with their authors or the public.
  41. http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/3131/utilization As a result of intensive manual digging, some have metrics about how many times their datasets have been mentioned in the literature.
  42. http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/3131/utilization They have details about what was downloaded.
  43. http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/3131/utilization In cases where logons are required to get the data, they have information about who is downloading. These stats are from ICPSR for one dataset. Publicly available.
  44. I'll splash by a few graphs of preliminary research findings... come find me or my blog if you want more info. Using manual annotation we are starting to be able to estimate third-party reuse, in terms of raw numbers, with extrapolations.
  45. Teasing out use by the original authors from use by third parties who probably only got access to the data because of the repository. Tools that support data citation will help with this.
  46. We have observed reuse of at 35% of GEO datasets submitted in 2005. And the distribution of data use across all of the datasets in the repository: is it 1% of the datasets that drive all the use? Nope, it looks like use is often distributed across a broad population of datasets.
  47. Piwowar, Vision, Whitlock (2011). Data archiving is a good investment. Nature 473, p. 285 (letter to the editor). http://researchremix.wordpress.com/2011/05/19/nature-letter/ This sort of information is very valuable for repositories when they want to make their case. As I said, right now we can get some of this information through a lot of painful manual searching across the internet. Data citations will help reduce some of this burden.
  48. Indispensable. What repositories really want, though (correct me if I'm wrong), is to show that they are indispensable. That they generate new, profound science not otherwise possible. That they are a great financial investment in scientific progress. This requires knowing more than just a citation count; it requires knowing the context of reuse. This means we need access to the full text of the paper that cites the data.
  49. Funders. What about funders?
  50. http://www.flickr.com/photos/n2artscapes/3527520456/ They want to know the impact the data had on society. Did it facilitate innovation, reduce discrimination, create jobs, save the rainforest, increase our GDP? That kind of tracking is beyond what any of us know how to do yet :) We're going to need digital tracking technology that, as far as I know, isn't available yet but I'm sure people are working on. Google Analytics meets digital RFID tags... I dunno... but I do know we need it. Furthermore, we need these digital tracking mechanisms to be affordable and open, to facilitate mashups.
  51. OK, so with that sort of future vision for tracking, what do we as a scholarly ecosystem need to power this future world?
  52. Innovation and experimentation. We need innovation and experimentation.
  53. http://www.flickr.com/photos/jo-h/2688026447/ We need 1000 flowers blooming. We need solutions that are open and generative. We need data that is open and generative. I don't have all the answers, but here is part of it:
  54. Open access to citation data. We can't just rely on Scopus, Thomson, and Google Scholar. Those are only three players. They are good at what they do and have been invaluable, but they can't possibly be as nimble as a whole bunch of startups. It is taking them a long time to come out with a data tracking tool. Why? Probably because they have an ambitious vision and need time to fit it into their other product offerings. That isn't a bad thing... but at the same time, some of the rest of us would be happy with iterating on a quick and dirty solution. We need more competition in this space. The barrier to entry is extraordinarily high because, of course, reference lists are almost all behind copyright and paywalls... but open access publications give us a toehold.
  55. Open access to full text. Open access also gives us a toehold into citation context information. A citation to a dataset tells us that the dataset played some role in that new research paper. What role? Was it used to validate a new method? Detect errors? Was it combined with other datasets to solve a problem that was otherwise intractable? The answers to these questions are fundamental to what funders and others need to know about impact. It won't be easy to derive them from the text of the paper, but I strongly believe it is possible.
  56. Open access to other metrics. Open access to other use. We need broad-based metrics: not just citations, but blog posts about data, slides that include R and Stata tutorials about data, bookmarks to data on bookmarking sites. Altmetrics. If you run a data repository, make your download stats publicly available. We frankly don't know what all of this info means yet, but we didn't know what citations to papers meant 50 years ago either. We'll all figure it out; the more data the better.
  57. Here's what each of us needs to do:
  58. 1. Raise our expectations.
  59. http://www.flickr.com/photos/quinnanya/2055471833 What can and should be open and able to be mashed up. What each of us can do to make a difference. What we must do.
  60. 2. Raise our voices.
  61. 3. Get excited and make things.
  62. http://www.flickr.com/photos/blackbeltjones/3365682994/
  63. Here's what each of us needs to do: 1. Raise our expectations. 2. Raise our voices. 3. Get excited and make things.
  64. http://www.flickr.com/photos/huzzahvintage/4577075021/ These things will make shoulders that get noticed wherever they go, and that get recognition when they make a dramatic impact.
  65. A future where data attribution Counts.
  66. A future about what kind of impact a dataset makes, not just a citation number.
  67. http://www.flickr.com/photos/myklroventine/892446624/ The future is open.
  68. Open data. Open data about our data.
  69. Thank you: Todd Vision, Jonathan Carlson, Estephanie Sta Maria, Jason Priem, total-Impact and Beyond Impact, the Dryad and DataONE teams, the open science online community, and those who release their articles, datasets and photos openly. Blog: ResearchRemix.wordpress.com. @researchremix
  70. 1. Raise our expectations. 2. Raise our voices. 3. Get excited and make things.
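
Slide 46 asks whether 1% of the datasets drive all the use. For a repository that already has a per-dataset reuse count, that question can be checked in a few lines. What follows is only a rough sketch under stated assumptions: the function names and the sample counts are made up for illustration, and they are not GEO figures or the analysis behind the talk.

```python
import math


def top_share(reuse_counts, fraction=0.01):
    """Share of all observed reuse that comes from the most-reused
    `fraction` of datasets (hypothetical helper, not from the talk)."""
    if not reuse_counts:
        raise ValueError("reuse_counts must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(reuse_counts)
    if total == 0:
        return 0.0
    # Always look at at least one dataset, even in a tiny repository.
    k = max(1, math.ceil(len(reuse_counts) * fraction))
    top = sorted(reuse_counts, reverse=True)[:k]
    return sum(top) / total


def reused_fraction(reuse_counts):
    """Fraction of datasets with at least one observed reuse."""
    if not reuse_counts:
        raise ValueError("reuse_counts must not be empty")
    return sum(1 for c in reuse_counts if c > 0) / len(reuse_counts)


# Made-up reuse counts for ten datasets, for illustration only.
counts = [0, 0, 1, 1, 2, 0, 3, 1, 0, 2]
print(top_share(counts, fraction=0.10))
print(reused_fraction(counts))
```

A low `top_share` together with a high `reused_fraction` would be the "use is distributed across a broad population of datasets" pattern described on the slide; a `top_share` near 1 would mean a handful of datasets drive nearly all of the use.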
