a future where data citation Counts

A future where
data attribution Counts

Heather
Piwowar
@researchremix

DataONE
postdoc
with
NESCent
and
Dryad

#idcc11

some photos NC, SA

http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
If I have seen farther it is by standing on the shoulders of giants, said Isaac Newton and others before him.

While historians speculate that Isaac Newton was actually being sarcastic,

http://www.flickr.com/photos/jsmjr/62443357/
most of us would agree that science progresses by standing on the shoulders of those who came before. Or by kneeling on their backs. Or clambering up their work any
other way we can.

http://www.flickr.com/photos/camilleharrington/3587294608/

Many of us believe that when we share our research output, not only as published research descriptions, but also in the form of open datasets and methods, we are,
in effect, making our shoulders broader.

http://www.flickr.com/photos/rkuhnau/3318245976/

All of a sudden, a lot more people can build on our
work.

http://www.flickr.com/photos/conformpdx/1796399674/
Researchers can climb higher than otherwise
possible,

http://www.flickr.com/photos/rkuhnau/3317418699/
and jump up and down on our findings to make sure they are really stable.

http://www.flickr.com/photos/zemlinki/261617721/

It allows contributions from places we may never have
expected,

http://www.flickr.com/photos/tracenmatt/3020786491/

and investigators can explore places they never could have on their
own.

http://www.flickr.com/photos/the-o/2078239333/
In short, our broad-shouldered research can make a contribution that far exceeds its original
role.

This is a great story, right? And it is why we are all here.

But it is also a great metaphor for the problem

http://www.flickr.com/photos/davemurr/4592014327/

What exactly do broad shoulders get the individual researcher?

Pain!

Because a few citations, as much as we'd like to think otherwise, aren't enough to offset the hard work and the Fear, Uncertainty and Doubt that come with uploading
a dataset in the current culture.

http://www.flickr.com/photos/joshb/25983792
Nobody looks at the supporting structure of an impressive tower. We are all busy ogling the top. That means these people? These ones with the shoulders? They've got
nothing.


everyone is looking at this guy


not this one. he’s not getting any fame or glory here, he isn’t making great strides in his
career.


ok, maybe this guy gets some citations. Not enough.

http://www.flickr.com/photos/supersam5/216868485/

This person

http://www.flickr.com/photos/commissariat/4829261601/in/faves-30112411@N02/
somebody else gets to be top dog. And I think a lot of researchers actually believe that by
making their shoulders broader they enable others to become top dog at their expense.

http://www.flickr.com/photos/sunrise/35819369/

A few citations aren’t enough to overcome that
fear.

Gleditsch et al. 2003. Posting Your Data: Will You Be
Scooped or Will You Be Famous?, International Studies
Perspectives 4(1): 89–97.

Piwowar et al. 2007. Sharing detailed research data is
associated with increased citation rate. PLoS ONE.

Ioannidis et al. Repeatability of published microarray gene
expression analyses. Nature Genetics 41, 149 - 155

Pienta et al. 2010. NSR Social Science Secondary Use.
Michigan IR.

Henneken et al. 2011. Linking to Data – Effect on Citation
Rates in Astronomy. ESO.

Sears 2011. Data Sharing Effect on Article Citation rate in
Paleoceanography. AGU.
Don't get me wrong, I'm a fan of studies that show a citation benefit for sharing data :) . But it won't be enough.

http://www.flickr.com/photos/bfhoyt/4606049592/
If it were, we'd have researchers knocking down the doors of our institutional repositories (IRs) for the 10-minute job of sending in their preprints. They aren't doing
that.

So.

So.

What to do about it? How to change the culture?

We need to facilitate
deep recognition of the
labour of dataset creation.

We need to facilitate deep recognition of the labour of dataset creation. Hat tip to John Wilbanks.

Ok let me say that again because it is so important

We need to facilitate deep recognition of the labour of dataset creation.

http://www.flickr.com/photos/g_kat26/4255119413/
Let's dig in to how these groups do impact tracking now, and how they'd like to do it in the
future.


How do researchers value their own contributions now?

http://www.flickr.com/photos/europedistrict/5692787622/

Data repositories, who we might view as perhaps personal trainers.

http://www.flickr.com/photos/digitaljourney/5767535618/

and funders, the ones who pay for all of the gym equipment

Investigators, today, can list research products on their CVs. This can include datasets.

http://total-impact.org
A CV is sort of bland, don't you think? It has no context of use.

One version of a more useful future comes from a tool called total-Impact. It continues a project that started as a hackathon at the Open Society Foundation
workshop Beyond Impact, organized by Cameron Neylon here in the UK last spring. Jason Priem, a few other people, and I have been working on it since.

total-Impact aggregates metrics, both traditional and non-traditional, for traditional research products like articles.

can drill in

The metrics are citations, but also altmetrics. PLoS has done some of the groundbreaking work in this space with article-level citations, but a lot of other metrics are available
too: various indications that others have found your research worth bookmarking, or blogging, or referencing on Wikipedia.

Also non-traditional research products like datasets.

It doesn't currently look for dataset identifiers in public R packages, but it could, for example, as indication of use.
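
As a rough illustration of what "looking for dataset identifiers" could mean, here is a minimal Python sketch. It is not how total-Impact works. It assumes Dryad DOIs share the prefix 10.5061/dryad. (as in the Dryad DOI shown elsewhere in this talk), and the package names and contents are made up.

```python
import re
from collections import Counter

# Assumption: Dryad dataset DOIs start with "10.5061/dryad."; other
# repositories would need their own patterns.
DRYAD_DOI = re.compile(r"10\.5061/dryad\.[A-Za-z0-9]+")


def count_dataset_mentions(sources):
    """Count how many source texts mention each dataset identifier.

    `sources` maps a (hypothetical) package name to the text of its
    code and documentation.
    """
    counts = Counter()
    for text in sources.values():
        # Use a set so each package counts a given dataset at most once.
        for doi in set(DRYAD_DOI.findall(text)):
            counts[doi.lower()] += 1
    return counts


# Made-up package contents, for illustration only.
example_sources = {
    "pkgA": "data <- read.csv('http://dx.doi.org/10.5061/dryad.18')",
    "pkgB": "# see doi:10.5061/dryad.18 and doi:10.5061/dryad.18 again",
    "pkgC": "plot(1:10)",
}
mentions = count_dataset_mentions(example_sources)
```

A mention in code is only a hint of use, not proof, so counts like these would sit alongside citations rather than replace them.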

This makes a “live CV” if you will, giving post-publication context to research output.

This is where citations would go. More on that later.

Repositories

Repositories, today,

http://dx.doi.org/10.5061/dryad.18
can look at graphs of their deposit counts.

Many know their own download statistics, some share this with their authors or the public.

http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/3131/utilization
As a result of intensive manual digging, some have metrics about how many times their datasets have been mentioned in the
literature.

They have details about what was downloaded

In cases where logons are required to get the data, they have information about who is downloading. These stats are from ICPSR for one dataset. Publicly
available.

I'll splash by a few graphs of preliminary research findings.... come find me or my blog if you want more info.
Using manual annotation we are starting to be able to estimate third party reuse. In terms of raw numbers, with extrapolations

We are teasing out use by the original authors from use by 3rd parties who probably only got access to the data because of the repository. Tools that support data citation will help
with this.

We have observed reuse of at 35%
of GEO datasets submitted in 2005.

And distribution of the data use across all of the datasets in the repository. Is it 1% of the datasets that
drive all the use? Nope, it looks like often use is distributed across a broad population of datasets.
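
One simple way to ask the "is it 1% of the datasets?" question is to compute the share of all reuse that comes from the most-reused datasets. The Python sketch below is only an illustration of that arithmetic: the function name is hypothetical and the counts are invented, not the GEO numbers.

```python
def share_of_use_from_top(reuse_counts, top_fraction=0.01):
    """Fraction of all reuse that comes from the most-reused datasets."""
    if not reuse_counts:
        return 0.0
    ranked = sorted(reuse_counts, reverse=True)
    # Always include at least one dataset in the "top" group.
    top_n = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Made-up counts: 100 datasets, half reused twice, half reused once.
broad = [2] * 50 + [1] * 50
# Made-up counts: one dataset accounts for nearly all reuse.
concentrated = [99] + [0] * 98 + [1]

broad_share = share_of_use_from_top(broad)
concentrated_share = share_of_use_from_top(concentrated)
```

A small share for the top 1% points to broadly distributed use; a share near 1.0 would mean a handful of datasets drive everything.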

Piwowar, Vision, Whitlock (2011)
Data archiving is a good investment.
Nature letter to the editor: 473, p285.

http://researchremix.wordpress.com/2011/05/19/nature-letter/
This sort of information is very valuable for repositories when they want to make their case.

As I said, right now we can get some of this information through a lot of painful manual searching
across the internet. Data citations will help reduce some of this burden.

Indispensable

What repositories really want, though -- correct me if I’m wrong -- is to show that they are indispensable. That they generate new, profound science not otherwise
possible. That they are a great financial investment in scientific progress. This requires knowing more than just a citation count; it requires knowing the context of reuse. This
means we need access to the full text of the paper that cites the data.

Funders

What about funders?

http://www.flickr.com/photos/n2artscapes/3527520456/
They want to know the impact the data had on society. Did it facilitate innovation, reduce discrimination, create jobs, save the rainforest, increase our GDP?

That kind of tracking is beyond what any of us know how to do yet :)

We're going to need digital tracking technology that, as far as I know, isn't available yet but that I'm sure people are working on. Google Analytics meets digital RFID tags.... I
dunno... but I do know we need it. Furthermore, we need these digital tracking mechanisms to be affordable and open, to facilitate mashups.

Ok, so with that sort of future vision for tracking, what do we, as a scholarly ecosystem, need to power this future world?

innovation and
experimentation

We need innovation and experimentation.

http://www.flickr.com/photos/jo-h/2688026447/

We need 1000 flowers blooming
We need solutions that are open and generative
We need data that is open and generative

I don't have all the answers, but here is part of it:

open access to citation data

We can't just rely on Scopus, Thomson, and Google Scholar.

Those are only three players. They are good at what they do and have been invaluable, but they can't possibly be as nimble as a whole bunch of startups.

It is taking them a long time to come out with a data tracking tool. Why? Probably because they have an ambitious vision and need time to fit it into their other product
offerings. That isn’t a bad thing... but at the same time, some of the rest of us would be happy with iterating on a quick and dirty solution.

We need more competition in this space. The barrier to entry is extraordinarily high because of course reference lists are almost all behind copyright and paywalls.... but open
access publications give us a toehold.

open access to full text

Open access to full text.
Open access also gives us a toehold into citation context information.
A citation to a dataset tells us that the dataset played some role in that new research paper. What role? Was it used to validate a new method? Detect errors? Was it combined
with other datasets to solve a problem that was otherwise intractable? The answers to these questions are fundamental to what funders and others need to know about impact.
It won't be easy to derive them from the text of the paper, but I strongly believe it is possible.
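
A first, very rough step toward citation context is simply pulling out the sentences in which a dataset identifier appears. The Python sketch below assumes the full text is available as a plain string and uses a naive sentence splitter. The article text is made up, and working out the role the dataset played from those sentences is the hard part this does not attempt.

```python
import re

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def citation_contexts(full_text, identifier):
    """Return the sentences of `full_text` that mention `identifier`."""
    sentences = SENTENCE_END.split(full_text)
    return [s.strip() for s in sentences if identifier in s]


# Made-up article text, for illustration only.
paper = (
    "We propose a new clustering method. "
    "To validate it, we reanalyzed the dataset doi:10.5061/dryad.18 "
    "and compared our results with the original study. "
    "Runtime was modest."
)
contexts = citation_contexts(paper, "10.5061/dryad.18")
```

None of this is possible unless the full text is open to begin with, which is the point of this slide.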

open access to other metrics

Open access to other use.

We need broad-based metrics... not just citations, but blog posts about data, slides that include R and STATA tutorials about data, bookmarks to data on bookmarking sites.
altmetrics. If you run a data repository, make your download stats publicly available. We frankly don't know what all of this info means yet, but we didn't know what citations
to papers meant 50 years ago either. We'll all figure it out, the more data the better.

here’s what each of us needs to
do

1. raise our expectations

raise our expectations

http://www.flickr.com/photos/quinnanya/2055471833

what can and should be open and able to be mashed up
what each of us can do to make a difference
what we must do

2. raise our voices

raise our voices

3. get excited and
make things

do

http://www.flickr.com/photos/blackbeltjones/3365682994/

http://www.flickr.com/photos/huzzahvintage/4577075021/

These things will make shoulders that get noticed wherever they go, and that get recognition when
they make a dramatic impact.

A future where
data attribution Counts

A future about
what kind of impact
a dataset makes,
not just a citation number.

The future is

http://www.flickr.com/photos/myklroventine/892446624/

The future is open.

Open data.
Open data about our data.

thank you
Todd Vision,
Jonathan Carlson, Estephanie Sta Maria,
Jason Priem, total-Impact and Beyond Impact
Dryad and DataONE teams
The open science online community and those who
release their articles, datasets and photos openly
blog: ResearchRemix.wordpress.com
@researchremix
thank you
