Creating a sustainable business model for a digital repository: the Dryad experience
Peggy Schaeffer
Datadryad.org
Presentation at Research Data Access & Preservation Summit
22 March 2012
1. Creating a sustainable business model for a digital
repository: the Dryad experience
"Cherish old knowledge so that you may acquire new"
- The Analects of Confucius
Peggy Schaeffer
Research Data Access and Preservation (RDAP) summit
March 22, 2012
datadryad.org 1
2. • The End
– To make data archiving and reuse standard within scientific communication.
• The Means
– Enable low-burden data archiving at the time of manuscript submission.
– Promote researcher benefits from data archiving.
– Promote responsible data reuse.
– Empower journals, societies & publishers in shared governance.
– Ensure sustainability and long-term preservation.
• The Scope
– Research data in the basic and applied biosciences, broadly defined
– Primarily data underlying findings in peer-reviewed articles
– Also data from some non-peer reviewed publications (e.g. dissertations)
– And some non-data content (e.g. software scripts, figures)
3. The value proposition
• For researchers, Dryad…
– increases the impact of, and citations to, published research
– preserves and makes available others’ data
– frees researchers from the burden of data preservation and access
• For journals, publishers and societies, Dryad…
– frees journals from the burden of maintaining supplemental data
• For libraries and institutions, Dryad…
– makes data available at no cost, under clear terms of use
– helps fulfill their research data management mandates
• For funders, Dryad…
– provides a cost-effective mechanism to make research more accessible
4. 2007 NSF/ESA Data Sharing and NESCent Small Science workshops
Beginning negotiation of Joint Data Archiving Policy (JDAP)
2008 Journals/societies join NESCent & others to fund Dryad through NSF
Initial NSF funding for Dryad begins (through 2012)
2009 Repository went online
First consortium board meetings
Initial development of sustainability and revenue plans
Debut of integrated data submission
2010 Announcement of Joint Data Archiving Policy (JDAP)
JISC funding begins
Discussions with potential charter partners
2011 JDAP (and NSF DMP mandate) takes effect
New funding from NSF begins (through 2016)
2012 Approval of cost-recovery plan and governance structure
Cost-recovery begins
Transitional funding begins
5. Dryad integrates article and data submission
• Why?
– Ensures permanent link to data
within each article (and vice versa).
– Makes data deposition fast and easy
for authors (once files are prepared)
• Options are customized to
meet the requirements of
individual journals
– Submission of data prior to
manuscript review OR upon
acceptance
– Whether authors have the option of
a 1-yr no-questions asked embargo
OR not*
*By default, data is released upon article publication, and editors can permit longer embargoes for
special cases
6. Over 20 integrated partner journals
... and more being added regularly
The American Naturalist
BMJ Open
Biological Journal of the Linnean Society
Ecological Monographs
Evolutionary Applications
Evolution
Heredity
Journal of Evolutionary Biology
Journal of Fish and Wildlife Management
Journal of Heredity
Journal of Paleontology
Molecular Ecology and M.E. Resources
Paleobiology
PLoS Biology
Systematic Biology
ZooKeys & 7 other Pensoft journals
8. And using the data for research and education
9. Dryad principles & priorities
• Enable data archiving as an extension of traditional
publication
• Assert the value-added benefits of
– Citable data (for depositors)
– Economies of scale (for journals and publishers)
– Professional curation (for users of data)
– Long-term preservation of data (for all)
• Align incentives with the business model
– Lower costs for partner journals
• Ensure international participation
• Understand the impact of data citation
• Understand the true costs of hosting supplementary data
10. Sustainability planning
• Long-term preservation requires an organization
with a viable business model
– Not one dependent on the success of future grant
proposals.
• Goal: a business model based on the added value
of repository to stakeholders:
– Depositors of data
– Users of archived data
– Journals, publishers, societies
– Universities, research institutions, and libraries
– Funding organizations
11. Assumptions (2009)
• Institutional support: host provides
efficiencies (accounting, contracts & grants,
legal, shared staff, IT network, facilities)
• Hardware and storage costs decline faster
than repository growth
• Curation effort
– is primary staff expense
– scales with level of curation tasks and volume
12. Potential sources of revenue (2009)
• an archiving charge (similar to a page charge)
• pay-per-use, or individual subscriptions, for access to
repository contents (never seriously considered)
• institutional subscriptions (possibly for higher service
levels?)
• subscriptions from societies and journals (possibly in return
for full partnership benefits?)
• fees from publishers
• recovery of cost from archiving of large data packages
• grants from government funding agencies across the globe
as well as private foundations
• angel donors
13. Two consultancies (2009)
• Cost model (Lorraine Richards)
– Examined current literature & environment
– Developed list of potential exemplar repositories
– Interviewed Dryad staff
– Identified relevant cost categories & assumptions
– Made best estimates
• Broad sustainability plan (Charles Beagrie, Ltd.)
– Strategy, performance indicators and measures
– Comparators and understanding of the costs
– Advantages, benefits and revenue options
– Drafted a proposal for sustainability
14. Development of cost model
• Based on JISC Keeping Research Data Safe 2
• Total and per paper costs estimated
• Per paper cost estimates, by volume
– 5,000 papers per yr = $40 (approximate)
– 10,000 papers per yr = $32 (approximate)
• Cost categories:
– Repository management
– Curation
– Storage and hardware
– Outsourcing
– Infrastructure, facilities, & administration
– R&D
– Maintenance
– Outreach and promotion
– Documentation
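The per paper estimates on this slide imply a total annual cost at each volume. The short Python sketch below only multiplies the two approximate figures quoted on the slide; the function name and the lookup table are illustrative assumptions, and the underlying cost model has more inputs than this.

```python
# Approximate per-paper cost estimates quoted on the slide (USD),
# keyed by annual volume of papers. Only these two points are given.
PER_PAPER_COST = {5_000: 40, 10_000: 32}


def total_annual_cost(papers_per_year: int) -> int:
    """Total annual cost implied by the slide's per-paper estimate.

    Hypothetical helper: works only for the two volumes on the slide.
    """
    return papers_per_year * PER_PAPER_COST[papers_per_year]


for volume in sorted(PER_PAPER_COST):
    print(f"{volume:>6,} papers/yr -> ${total_annual_cost(volume):,} per year")
```

On these figures, doubling the volume raises the implied total by 60% (from $200,000 to $320,000) while the per paper cost drops by 20%, which is the economy of scale the slide points to.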
15. Curation levels: minutes and tasks
• Level 1 (Low), 5 minutes
– Verify that the DOI points to the correct article
– Spell check
– Verify that article metadata is correct
– Verify that data files have expected kind of data
• Level 2 (Medium), 20 minutes
– Expand keywords based on submitted metadata
– Convert data files to preservable formats
– Deposit additional supplemental data at publisher site
– Create/approve relationships to content in partner repositories
– Approve updates submitted by the author
– View the contents of metadata fields across the repository, and enforce consistency
• Level 3 (High), 140 minutes
– Enter/verify authors in name authority file (LCNAF)
– Expand keywords based on text of the article
– Within-file annotations (spreadsheet columns, taxon names in trees)
– Evaluate comments from end users and relay to the author
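To see how curation level drives staff effort, the minutes per level can be turned into curator hours for a given number of records. This is a rough sketch: it assumes each figure is the average total time per record at that level (the slide does not say whether the levels are cumulative), and the function name is hypothetical. It covers curator time only, not the full cost model.

```python
# Average curation minutes per record, by level, as shown on the slide.
CURATION_MINUTES = {1: 5, 2: 20, 3: 140}


def curator_hours(records: int, level: int) -> float:
    """Curator hours needed to process `records` at one curation level.

    Assumption: the slide's minutes are the total average time per record.
    """
    return records * CURATION_MINUTES[level] / 60


# Example: a year of roughly 100 submissions per month.
for level in sorted(CURATION_MINUTES):
    print(f"Level {level}: {curator_hours(1_200, level):,.0f} hours")
```

Under that assumption, Level 3 takes 28 times the curator time of Level 1 per record.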
17. Growing Dryad
• Enlarging repository scope
– Biomedical data
– Dissertation data
– Software & other supplementary materials
• Building journal & publisher connections
– DryadUK at the British Library, funded by JISC
• Wiley Blackwell
• Oxford Univ. Press
• Nature Publishing Group
• Elsevier
• PLoS
• Expanding Consortium membership
18. Dryad as an organization
• Dryad Consortium, soon to be a 501(c)3 nonprofit
• Membership composed of journal & organization
representatives
– Open to the full spectrum of stakeholder organizations,
including scientific societies, publishers, funding agencies,
universities & institutes
– Nominal annual fee - no more than $1000 USD
• Governed by a Board of Directors (12 members)
– Nominated and elected by the Membership
• Next board meeting July in North Carolina
– Transition to 501(c)3 status, hosted at Duke Univ.
– Adopt governance model
– Adopt cost-recovery model
19. Dryad’s sustainability model
• Deposit fees are the primary source of
revenue, for several reasons:
– The time of deposit is when the majority of costs are
incurred
– Revenue scales with costs (i.e. volume of deposits)
– The costs are distributed both fairly and widely
– This enables Dryad to make access to the data free in
perpetuity
• Membership fees will cover costs of annual
Membership meetings
• Additional revenue
– Project grants will supplement the operational budget
for R&D activities
20. Payment plans (proposed)
• Journal subscription
– Contract: yes
– Paid by: journal (1), in advance
– Cost (2), approximate: based on annual volume of research articles ($25-30/article)
• Pre-paid per-deposit
– Contract: yes
– Paid by: journal (1), in advance
– Cost (2), approximate: $50-60/data package
• Pay-as-you-go per-deposit
– Contract: yes
– Paid by: journal (1), invoiced periodically for prior deposits
– Cost (2), approximate: $60-70/data package
• Individual deposit
– Contract: no
– Paid by: author, at time of deposit
– Cost (2), approximate: $70-80/data package, with a process for granting waivers under development
(1) Or other sponsoring organization
(2) Up to a fixed deposit size (currently 10GB). Additional charges for larger deposits.
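As a rough illustration of how the proposed plans compare, the sketch below turns the approximate price ranges into an annual fee range for a given volume. The plan keys, the `Plan` type, and the function name are hypothetical; the prices are the proposed, approximate figures from the slide, and surcharges for deposits over 10GB are not modeled.

```python
from typing import NamedTuple, Tuple


class Plan(NamedTuple):
    unit: str   # what is counted: research articles or data packages
    low: int    # lower bound of the approximate price (USD per unit)
    high: int   # upper bound of the approximate price (USD per unit)


# Proposed (not ratified) price ranges from the slide.
PLANS = {
    "journal_subscription": Plan("research article", 25, 30),
    "prepaid": Plan("data package", 50, 60),
    "pay_as_you_go": Plan("data package", 60, 70),
    "individual": Plan("data package", 70, 80),
}


def annual_fee_range(plan: str, units: int) -> Tuple[int, int]:
    """Return the (low, high) annual fee in USD for `units` under `plan`."""
    p = PLANS[plan]
    return units * p.low, units * p.high


for name, plan in PLANS.items():
    low, high = annual_fee_range(name, 100)
    print(f"{name}: ${low:,}-${high:,} for 100 {plan.unit}s")
```

Note that the journal subscription is priced per research article published, while the other plans are priced per data package deposited, so the two kinds of plan are not directly comparable at the same unit count.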
21. Projections and issues
• Rate of deposit
• High volume journals and publishers
• How long before sustainability achieved?
• Potential for growth
• Enlarged scope?
22. To learn more
• Repository home: http://datadryad.org
• News: http://blog.datadryad.org
• Project documentation: http://wiki.datadryad.org
• Announce and User mailing lists: http://datadryad.org/about
• Twitter: @datadryad
• Code: http://code.google.com/p/dryad
or contact me: Peggy Schaeffer, pschaeffer@nescent.org
23. References
Beagrie, N, Lavoie, B, Woollard, M. Keeping Research Data Safe 2, JISC, 2010.
http://www.jisc.ac.uk/publications/reports/2010/keepingresearchdatasafe2.aspx
Beagrie, N, Eakin-Richards, L and Vision, T. Business Models and Cost Estimation: Dryad
Repository Case Study, iPRES2010 Vienna, September 2010.
http://wiki.datadryad.org/wg/dryad/images/4/47/IPRES2010_Paper37.pdf
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with
Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature
473(7347): 285. doi:10.1038/473285a
Vision, TJ. (2010) Open Data and the Social Contract of Scientific Publishing. BioScience
60(5):330-330. doi:10.1525/bio.2010.60.5.2
Ware M, Mabe M (2009) The STM report: An overview of scientific and scholarly journal
publishing.
The complete list of Dryad publications and presentations is at
http://wiki.datadryad.org/Publications
Editor's Notes
Some quick Dryad background to introduce you to the repository. Dryad builds on the traditional scientific publishing model by augmenting the journal article with links to its underlying data. It is designed for data that does not already have a discipline-specific home, e.g. GenBank. The original focus was on evolutionary biology and ecology, but Dryad now embraces all biosciences, including medicine, epidemiology, paleontology, etc. It is a data repository broadly defined, and also contains non-data materials, about which more later. It is important to note that ALL deposits in Dryad are associated with a publication.
Dryad offers services and assets to all the major players in the scientific research cycle. For authors and researchers: citable data, and access to data to verify their published results, to refine methodologies, and to repurpose. For journals and publishers: Dryad enables them to increase the discoverability and impact of their articles, and to add value to the communities they serve. For libraries and institutions: Dryad makes data widely available & indexed, could complement institutional repositories, and is a tool to suggest when advising about research data management. For funders: Dryad preserves and makes universally available the data which is the outcome of research funding. And also: the repository is open source, based on DSpace. All deposits in the repository are freely available under a CC0 waiver. Dryad is an organization too, & community-led: the Dryad Consortium is composed of partner journals, societies and institutions.
Timeline to date. Dryad was incubated at the National Evolutionary Synthesis Center, in Durham NC, with partnership from Duke Univ., the Univ. of NC at Chapel Hill and NC State Univ. in Raleigh. Sustainability planning has been integral to Dryad since before inception: it was the first topic on the agenda at the first workshop in 2007.
How does Dryad work? Workflow diagram: see brochure. The time and effort required to prepare and submit data to a repository is a major barrier to data archiving. Dryad addresses this problem in part by integrating into the manuscript submission workflow of its partner journals. Automated email messages containing bibliographic metadata (e.g. author names, DOIs) are passed between the journal and the repository, so that the journal links to the data, the repository links to the article, and minimal effort is required of the author or the editor. Note that the workflow diagram shows a journal that asks authors to deposit AFTER manuscript acceptance. Alternative workflows are supported, for example making data securely & anonymously available to editors and peer reviewers during the review process.
Currently over 20 journals have integrated with Dryad. This is a major focus of my work: working closely with our partner journals to implement integrated processing so that authors of accepted manuscripts, as well as those under review if the journal wishes, are invited (or in some cases, required) to deposit data in Dryad. Result: the published articles include a link to the data in Dryad, and Dryad links to the published articles. Journal partners may become members of the Dryad Consortium.
Dryad currently receives over 100 submissions a month, and the rate of submission continues to grow steadily. About one quarter of submissions are volunteered from authors publishing in non-partner journals, which explains why Dryad has data for over five times as many journals as there are integrated partners.
There is evidence for widespread data reuse. This 2009 data package has been downloaded over 2 thousand times. For data packages deposited in 2012, the median number of downloads for the most popular file in each data package is twelve. Some of this download activity is known to be from use of data in the classroom. Eagle eyes will note that the author list for the data is not the same as for the article. An independent data citation can provide an incentive for members of collaboratives who want to get primary credit for their contributions.
Goals and priorities are set by a consensus process: the Dryad leadership, Todd Vision & Jane Greenberg at UNC, working with a Board consisting of representatives of partner journals, later augmented with publishers’ reps. Internationalization is of course an ongoing effort: members from the UK, US, Canada, and Bulgaria, & 3 Board meetings so far in three different countries. Impact of data citation: see the 2007 article, Sharing Detailed Research Data Is Associated with Increased Citation Rate, and continued efforts to work toward standards for data citation. It is very difficult to ascertain the costs of the current practice of supplementary data hosted by journals; these were in most cases unknown or only partially known. We tried to study this in 2010 but found that journals and publishers really didn’t know or manage this function directly.
For a data repository, of course, it is critical to ensure that adequate funds exist to preserve data in perpetuity, not dependent on grant funding. There was early recognition that in the long term, some combination of archiving charges, society and institutional subscriptions, grants and other sources will be needed to ensure that the repository is not dependent on ephemeral grant funding, and one of the major tasks of the Consortium Board is to reach agreement on a sustainability plan and help implement it. There are large flows of money in publishing, but diminishing flows in libraries mean that we’ve focused on the very small marginal cost of data archiving when viewed as part of the publication cycle, and on the case that investment in data archiving is money well spent. An analysis we did that was published in Nature showed that, compared to the average number of research papers that result from $400K in grant funding (16 papers over 4 years), a similar investment in a data repository can be expected to yield over 1000 papers over 4 years. The money for scholarly communication in open access typically flows through the funder to institutions and researchers to publishers, which then covers the cost of publication. The primary revenue in our model taps into that same flow, primarily at the publisher end.
Important assumptions underlying these plans
These are the sources of revenue initially identified. These ideas were all considered in 2009, and some, like pay-per-use, were very quickly rejected.
Two consultants were retained with the goal of preparing the basis for a sustainability plan to be presented to the Board at the second meeting, in London, Dec. 2009. Lori Richards is a PhD student at UNC SILS. Neil Beagrie is a London-based consultant specializing in work with digital archives & collections, libraries, and government agencies. More detail is available in their paper presented in 2010.
The cost model found that curation costs are the largest driver. Some components of the cost model are shown here, along with some rough estimates of the efficiencies of scale and effect of volume. Curation costs vary according to the level of additional work, e.g., metadata enhancement, and the packaging and documentation for re-use in teaching that may be undertaken by Dryad. We have developed a set of “curation service levels” and their associated costs; levels of curation are set very deliberately on the part of the journals, with knowledge of the costs and benefits involved. Dryad’s level of data curation is moderate, because we have the great advantage that all data is associated with a publication, and the metadata for the article is used as the basis for our metadata.
Since curation is key to costs, here’s a little detail about our curation levels. These times are, of course, averages for a record; we are not using this to charge depositors by the minute but to understand how curation level drives cost. This intentionally lightweight method enables higher volume as we scale up. Our metadata doesn’t dig deeply into the content of the files. This flexible approach suits all sorts of files, since Dryad is multidisciplinary and our data files are heterogeneous. Level 1 with selected steps from Level 2 is now the standard approach. The curator also serves as a help desk for editors & authors.
The cost model also allowed us to make these projections. Cost drivers are volume and curation level. Two points: (1) Curation levels 1 & 2 are nearly the same in cost. The consequence for repository management: the vast majority of files do not warrant or require Level 3 curation. (2) Economies of scale are reached at 5,000 to 10,000 articles per yr, at which point costs are approximately $40/article. Current volume is about 25-30 new data submissions per week, or 100+ per month. Of course, the idea that we will be sustainable with 50-100 integrated journals depends on many factors, such as the degree of journal support for data archiving (which affects the volume of submissions for a given journal), the percentage of data-rich papers in a journal, and just the volume of the journal. We are just under 20% of the way there now, and have a few years to reach that goal. We are committed to expanding the repository in order to reach the sustainability levels anticipated here, by bringing in as new partners journals & publishers who either offer or require data archiving in Dryad; more about this in a minute.
Expanding scope actively: embracing biomedical data, accepting dissertation data, and enabling deposit of non-data materials, such as software that doesn’t have another more appropriate archive. We’re expanding our connections to publishers and journals that understand the issues around research data and either wish to offer Dryad as an option or require data availability as a condition of publication. Although authors can voluntarily deposit data in Dryad, the most effective way for us to add volume is to collaborate with journals and publishers who support data archiving & who seek a sustainable solution for research data (NOT journal Supplementary Online Materials). This is an ongoing effort, and a primary focus of my work. A one-year grant funded by JISC in the UK and based at the British Library allowed us to build connections with, and begin working toward, several important publisher-wide implementations, since so many major STM publishers are based in Europe. Currently we are working with several high-volume journals and publishing platforms: PLoS Biology will be the first PLoS journal, soon to be followed (we hope) by the others, incl. PLoS ONE, with a volume of over 13,000 articles last year! To help put Dryad's 10K article target in perspective: as of 2009, according to STM, there were 11,550 journals, to which 1.5 million articles were being added per year. That is across all STM disciplines, but it's safe to say bioscience and biomedicine is at least 33% of that, since PubMed indexes 0.5M abstracts/yr. Anyway, using the STM figure, the avg # of articles/journal is 130. To reach 10K, Dryad would need to be receiving 100% of the content of ~75 journals, or, more realistically, 50% of the content from 150 journals or 25% of the content from 300 journals.
Another way to put it in perspective is to say that Dryad would need to receive less than 2/3 of the content from the largest of the megajournals, PLoS ONE (at 17K articles/yr), alone. We are also implementing integration with BMJ and BioMed Central, and talking to F1000 Research and eLife, 2 new open access, data-centric journals based in the UK. F1000 Research will publish articles in biology and medicine (and link to their data) immediately after a “sanity check”, for open peer review & reader comment, while eLife is supported by funders who are motivated to see data openly available in a public archive, so it's a very interesting precedent. We are expanding the Dryad Consortium to invite membership involvement from many interested parties; we certainly welcome more participants, if your organization is interested!
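The back-of-envelope arithmetic behind the 10K article target can be checked with a few lines of Python. The constants are the figures quoted in the notes (the STM journal and article counts, the 10K target, and the 17K articles/yr figure for PLoS ONE); the variable and function names are illustrative only.

```python
# Figures quoted in the notes (STM report, as of 2009) and Dryad's target.
STM_JOURNALS = 11_550
STM_ARTICLES_PER_YEAR = 1_500_000
TARGET_ARTICLES = 10_000
PLOS_ONE_ARTICLES = 17_000

# Average number of articles per journal per year (about 130).
avg_per_journal = STM_ARTICLES_PER_YEAR / STM_JOURNALS


def journals_needed(share: float) -> float:
    """Average-sized journals needed if Dryad receives `share` of each one's content."""
    return TARGET_ARTICLES / (avg_per_journal * share)


print(f"average articles per journal: {avg_per_journal:.0f}")
for share in (1.0, 0.5, 0.25):
    print(f"{share:.0%} of content: about {journals_needed(share):.0f} journals")
print(f"share of PLoS ONE needed: {TARGET_ARTICLES / PLOS_ONE_ARTICLES:.0%}")
```

The unrounded results (about 77, 154 and 308 journals) are close to the rounder figures of ~75, 150 and 300 used in the notes, and the PLoS ONE share comes out just under 60%, consistent with "less than 2/3".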
As I said earlier, Dryad is an organization too. We have focused on creating and extending an international community of journals and other partners interested in actively supporting data archiving. The Dryad organization is a Consortium with heterogeneous membership, including journals and publishers as well as other organizations and institutions that promote data archiving. Funding agencies, professional societies, research institutes, and universities, for example, can all be members of the Dryad Consortium and offer streamlined & discounted data archiving services to their affiliates through Dryad. The full membership elects the board, votes on changes to bylaws, and is actively engaged in discussions about the future of the repository. The 12-member BoD has fiduciary and decision-making responsibility. The Board will be elected in April, but I can say that the board nominees include outstanding leaders with diverse expertise in publishing, intellectual property, journal editing, information science and sustainability. The governance and sustainability plans presented here are proposals that have yet to be ratified, and are subject to change.
Two separate kinds of fees: fees for depositing data & a Membership fee
• Deposit fees will cover operating costs for new data deposits
• Costs are recovered upfront, in order to
– allow free dissemination
– assure preservation
• Fees predominantly paid by journals, which may be
– passed on to authors
– subsidized by societies
– rolled into publisher costs/revenue
• Fees should be
– attractive: cost-effective relative to SOM
– fair: to all different types of journals
• Model will surely evolve
– Under control of consortium of partner journals and members
• Membership fees are proposed at $1,000 or less (annually) & provide a 10% discount on deposit costs
As currently proposed, these are 3 primary plans for deposit fees, designed to accommodate partner journals with very different business models (subscription, open-access, etc.). You can see that a journal, or a publisher or society, or a research institute, can choose the first 2 prospective options and have a fixed annual cost, or may prefer the third option, in which case they’ll be billed retrospectively at a slightly higher rate. The last option is designed for cases when authors wish to archive data associated with an article in a journal not affiliated with Dryad; this carries higher costs for us since we don’t have the metadata from the journal. Of course we encourage authors to budget for data archiving costs like these when they compile their data management plans, as required now by many funding agencies. Dryad need recover only 1-2% of the total article publishing costs, assuming the repository has economy of scale (over 10K data packages annually). At that scale, fixed costs are low, and marginal costs per deposit are mostly due to human curation (not storage). The Board meets this summer and will vote soon to ratify this plan; we expect to implement payment plans later this year. We don’t expect this model to be perfect and we know it will need revision. The board will be required to review the revenue structure at regular intervals of no longer than 3 yrs. Stay tuned on Dryad’s blog for news!
Open questions: Will the rate of deposits meet expectations? How effective will we be at scaling up to handle the large volume of data from high-volume publishers like PLoS? Will we reach sustainability at 7-10K data deposits per year, as anticipated? What is the true potential for growth? Will Dryad’s disciplinary scope continue to enlarge?