Opening up data – Jisc and CNI conference 10 July 2014



MacKenzie Smith, university librarian, University of California, Davis








Usage Rights

CC Attribution-NonCommercial-NoDerivs License

  • 25 minutes + 15 minutes Q&A
  • In the U.S., Open Data (or data sharing) has advanced mainly through grassroots efforts. Historically, data sharing was the domain of social scientists (e.g. ICPSR), with notable exceptions like the NIH’s National Center for Biotechnology Information and a few federal agencies that dealt in very big data, such as NASA and NOAA. It’s fair to say that every discipline, even sub-discipline, has taken a different approach to the question of whether and how to openly share its data.
  • Starting around 2005, organizations like Science Commons (part of Creative Commons) and the Open Knowledge Foundation started to formalize the idea of open data, building on the Open Access movement in scholarly communication.
    These early efforts focused mainly on health and life sciences, probably because that’s where OA was getting the most traction. They eventually came up with tools like the Open Database License and the CC0 waiver to help researchers share their data.
  • And, for a variety of reasons, U.S. funding agencies have started to get involved, beginning with the NIH (managers of NCBI), then later the NSF, leading up to…
  • Even more recently, we’re experiencing a sudden surge of scientific misconduct cases, some very serious, others contested. Turns out that scientific reproducibility is a lot harder than it sounds, and data availability is a necessary but insufficient component.
  • Related to this question of reproducibility, and a big driver for researchers to open up their data, are the growing requirements from publishers that data associated with the articles they publish be deposited and publicly retrievable. PLoS is just the latest of these. Nature has required deposit of sequence data in GenBank before publication for many years, and in some cases an entire discipline adopts this practice, as evolutionary biology has with the Dryad data repository.
  • An OSTP memo calling for ALL funding agencies of a certain size to develop an Open Access strategy for articles AND research data.
    The U.S. federal funding agencies are still working out their policies for data, and seem to be aligning with their historical practices. So NASA, for example, will make data sharing a requirement and already has the infrastructure to support it; NOAA and NIH are similar, while NSF and other funders less oriented toward the physical and life sciences are taking more time.
    Now we’re seeing a few private foundations that fund research, and a few State governments, considering similar open access policies for the research they fund.
  • NIH has often been a bellwether for funder policies in the U.S., and with the recent hire of Phil Bourne as their Associate Director for Data Science, that trend is continuing. In a fairly short time he’s developed a framework for thinking about digital assets in the context of academic research and is beginning to fund new pieces of infrastructure.
    Note here an important development: he mentions software as equal in importance to articles and data. This is a theme of growing importance in the U.S.
  • Integrated journals may allow authors to embargo one or more datafiles within a data package from release for one year following the date of publication, or they may disallow this option. Editors may also direct Dryad to grant longer custom embargoes upon request. It is of interest to know how often embargoes are used when authors are given the choice, as a measure of the level of comfort researchers have with the idea of publishing data alongside an article. Reassuringly, we find that since 2009, more than 90% of datafiles are being released either immediately or at the time of article publication in those cases where the authors have freedom to choose. Less than 1% of datafiles were placed under specially requested embargoes of greater than one year, and those came from a limited number of journals (Vision T, Scherle R, Mannheimer S (2013) Embargo selections of Dryad data authors. Figshare).
  • Stanford Repository List
  • The SHARE initiative from the ARL, AAU, and APLU is very well represented at this meeting, so I won’t discuss it in detail. I’ll just say that it’s new, and one of the first national initiatives to come from Higher Education and address the problem of Open Access to publications and data at scale.
  • The first problem SHARE identified is how to know what researchers have done that should be shared.
    In the U.S., most institutions, including major research universities, have no idea what their researchers have accomplished, much less whether or not they’re complying with funder requirements, local Open Access policies, etc. It’s very difficult to keep up with new publications, much less other research products that aren’t part of the formal publishing ecosystem. Today there exists no single, structured way to report research output releases in a timely and reliable manner.
    The Notification System will be a digest of metadata about publicly available research from which institutions, repositories, and funding agencies can receive information about research outputs they’re interested in. The digest will be created mainly through harvest of available streams of data and will not require the direct participation of principal investigators or present additional burdens to them. Notification System Project underway: Beta release fall 2014, Full release fall 2015.
  • Longer-term, there are big issues to tackle related to rights, and to relating SHARE research to the Open Access goals of the federal mandates it was intended to address. As all of you know, there is a tension between researchers’ desire to get credit for their work in the scholarly reward system and their reluctance to give up control of that work to get the credit. In the world of research publications that got sorted out a long time ago, and researchers are motivated to publish as quickly as they can. That’s not always true for data or related software, so SHARE is going to work on a rights framework that includes data.
  • So what are these areas of RDM services? We distinguished the types of service that are most common in libraries’ data management support, and distinguished levels of service for particular libraries by how many of these services they offered, which corresponds roughly to the depth of library resources devoted to these services.
    Levels of service ranged from website resources on data management planning to staffed consulting and archiving, but most libraries with RDM had offerings in our three categories of services.
    The most common RDM services (more than 40 institutions) are:
      • Providing data management planning services, mainly through online resources but also DMP consulting
      • Broader data management support, such as training on particular DM topics or providing research metadata support
      • Providing support for sharing data, such as on data citation
    And many libraries are starting to directly archive researchers’ data.
  • 89% provide researchers with what we call Consulting, defined broadly as in-person help of some kind, both email and office visits. (Researchers rarely if ever visit the library directly for help.)
    Training sessions for writing data management plans are also a common offering, at 61%, most as in-person workshops and some delivered online.
  • He observes that trying to change behavior in an intensely competitive field like academic research is counter-productive, if it’s even possible. He argues that our top priority in HE should be to make every researcher and student ‘data literate’, in the sense of knowing how to create and manage data efficiently and effectively, and to provide them with simple-to-use tools to publish and cite data and software, so they can reap the credit.

Presentation Transcript

  • July 10, 2014 JISC-CNI 2014 ©UC Regents, 2014 1 MacKenzie Smith University Librarian University of California, Davis
  • At Creative Commons, we believe scientific data should be freely available to everyone. We call this idea Open Data. Creative Commons legal tools can be used to make data and databases freely available. We’ve already had successful implementations in taxonomic, energy, genomics, disease research, geospatial, polar, and bibliometric disciplines, and are providing guidance to funders, institutions, private foundations, governments, the corporate sector, and other stakeholders.
  • NIH (2003): “The NIH expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers.” (>$500,000, include data sharing plan)
    NSF grant guidelines: “NSF ... expects investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages grantees to share software and inventions or otherwise act to make the innovations they embody widely useful and usable.” (2005 and earlier)
    NSF peer-reviewed Data Management Plan (DMP), January 2011
  • Policy level                                                 2011     2012
    Required as condition of publication, barring exceptions     10.6%    11.2%
    Required but may not affect editorial decisions              1.7%     5.9%
    Encouraged/addressed, may be reviewed and/or hosted          20.6%    17.6%
    Implied                                                      0%       2.9%
    No mention                                                   67.1%    62.4%
    Source: Stodden, Guo, Ma (2013) PLoS ONE, 8(6)
  • Policy level                                                 2011     2012
    Required as condition of publication, barring exceptions     3.5%     3.5%
    Required but may not affect editorial decisions              3.5%     3.5%
    Encouraged/addressed, may be reviewed and/or hosted          10%      12.4%
    Implied                                                      0%       1.8%
    No mention                                                   82.9%    78.8%
    Source: Stodden, Guo, Ma (2013) PLoS ONE, 8(6)
  • JASA June    Computational Articles    Code Publicly Available
    1996         9 of 20                   0%
    2006         33 of 35                  9%
    2009         32 of 32                  16%
    2011         29 of 29                  21%
  • Executive Memorandum directing federal funding agencies to develop plans for public access to data and publications (Feb 2013)
    “data is defined... as the digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications...”
    Executive Order directing federal agencies to make their own data publicly available (May 9)
  • • Consists of digital assets
      • Datasets, papers, software, lab notes
    • Each asset is uniquely identified and has provenance, including access control
      • e.g., publishing simply involves changing the access control
    • Digital assets are interoperable across the enterprise
  • Code    Data    Barrier
    77%     54%     Time to document and clean up
    52%     34%     Dealing with questions from users
    44%     42%     Not receiving attribution
    40%     -       Possibility of patents
    34%     41%     Legal Barriers (e.g. copyright)
    -       38%     Time to verify release with admin
    30%     35%     Potential loss of future publications
    30%     33%     Competitors may get an advantage
    Source: Survey of the Machine Learning Community, NIPS (Stodden 2010)
  • Pass National Data Breach Legislation that provides for a single national data breach standard, along the lines of the Administration's 2011 Cybersecurity legislative proposal.
  • • Infrastructure
      • Developing new tools across the research life cycle
      • Mostly individual institutions or disciplines
      • National initiatives emerging (e.g. ARL/AAU/APLU SHARE initiative)
    • Policy
      • Institutional Open Access policies
      • SHARE copyright group
    • Training
      • ARL e-science institute
      • ARL spec kit on RDM activities
      • Current events
  • Dissemination Platforms, e.g. DataONE, DataVerse
    Workflow Tracking and Research Environments, e.g. VisTrails, Kepler, Taverna
    Embedded Publishing, e.g. Sweave, Knitr, VCR (Verifiable Computational Research)
  • • Disciplinary
      • ICPSR, GenBank
      • Dryad, ONEShare
      • Sage Commons (Sage Bionetworks)
    • Disciplinary/Institutional
      • DataVerse, Nesstar
    • Institutional
      • IRs galore: e.g., UC’s Dash and Chronopolis, Purdue’s PURR, JHU’s Data Conservancy, Stanford Digital Repository, many local DSpace/Fedora/Hydra/Islandora instances, locally run and cloud hosted
      • Data Centers on every campus
    • Generic/cloud
      • Figshare
      • DuraCloud
  • We continued to refine the infrastructure for linking between articles and data. The web service for returning the corresponding Dryad data DOI when queried with an article DOI is now being used by Elsevier to provide a link to the data from ScienceDirect for 40 different Elsevier journals that have at least one data package in Dryad. Dryad is an international collaborator in the EU-funded ORCID DataCite Interoperability Network Project, which this past year introduced a tool enabling researchers to add research outputs with DataCite DOIs (such as Dryad data packages) to their ORCID profiles. We also introduced regular updating of linkages between related records in PubMed, GenBank, and EuropePMC to data packages in Dryad. To further promote discoverability and accessibility, Dryad officially became a DataONE Tier 1 member node. Improvements to the curation interface have led to an increase in curation efficiency of greater than 25% in the past year.
    Source: Dryad Annual Report, 2013
  • Embargo selections of Dryad data authors for the 10,108 files in Dryad deposited by September 20, 2013. Data include only datasets related to articles published in journals for which the authors had the option of selecting an embargo. (B) Longer term embargoes (>1 year) by journal that granted them.
    Source: Data Archiving: Suggestions to Increase Participation. PLoS Biol 12(1): e1001779. doi:10.1371/journal.pbio.1001779
  • COMING
      • NIH data catalog (part of the BD2K initiative)
      • SHARE registry
    HERE NOW
      • Thomson Reuters Data Citation Index
      • OCLC WorldShare (includes OAIster)
      • Google/Google Scholar
  • • DOIs for Data (DataCite, CrossRef, EZID)
    • ORCIDs for Researchers
    • FundRef for funding agencies
    • Still missing good institutional identifiers
  • • An IP rights strategy, including the promotion of university-based Open Access policies and favorable licensing terms, will be part of the scaffolding that will enable the layers of SHARE to develop.
    • Rights subgroup formed to deal with this
    • A broad collective action by AAU and APLU, to be discussed with AAU Presidents in April
  • Key finding: RDM Service Offering
    Category                    Service                          N
    Data management planning    Online DMP resources             47
    Data management planning    DMP consulting                   48
    Data management planning    DMP training                     33
    Data management support     Other Data Management…           23
    Data management support     Research metadata support        42
    Data sharing & archiving    Data citation support            38
    Data sharing & archiving    Data sharing & access support    22
    Data sharing & archiving    Data archiving by library        40
    Source: ARL SPEC Kit 334: Research Data Management Services (July 2013)
  • DMP consulting: 89% (N = 48)
    DMP training: 61% (N = 33)
    Source: ARL SPEC Kit 334: Research Data Management Services (July 2013)
  • An ACRL e-Learning Online Course, July 14-August 1, 2014
    Description: Demand for data management plan consultants is growing as more granting agencies add this requirement. Most presentations concerning data management do not provide practical advice on how to consult with researchers writing a data management plan for grant submission. This course teaches participants about the elements of a successful data management plan, and provides practice critiquing data management plans in a supportive learning environment where no grant funding is at stake. Join two experienced data management plan consultants with experience in liaison librarianship and information technology as they demonstrate how all librarians have the ability to successfully consult on data management plans. Each week will include assigned readings, a written lecture, discussion questions, weekly assignments, and live chats with the instructors. Participants will examine how data and metadata are defined, open data formats, dark archives, and secure repositories, as well as addressing specialty concerns such as how to securely preserve information related to at-risk populations, etc. Selection of effective long term data preservation and sharing strategies will also be examined. Lastly, participants will evaluate sample data management plans from the sciences, social sciences, and the arts and humanities as a final project for the course. Critiques of each plan will be presented to the class during the final chat session at the end of the course.
    Learning Outcomes:
      • List specific data depository resources in order to formulate recommendations for researchers to securely deposit and share their data.
      • Learn about how different funding agencies, and departments within those agencies, have different requirements for data management plans in order to determine how to effectively advise each researcher according to the requirements for their specific plan.
      • Analyze sample data management plans in order to develop an understanding of what constitutes a thorough data management plan.
    Presenters: Dee Ann Allison, Professor, University of Nebraska-Lincoln; Kiyomi Deards, Assistant Professor, University of Nebraska-Lincoln
    Course Requirements: Your participation will require approximately three to five hours per week of primarily asynchronous activities to:
      • Read the online seminar material
      • Post to online discussion boards
      • Synchronous chat sessions (optional)
      • Complete online exercises
      • Complete a seminar evaluation form
  • “CLIR Postdoctoral Fellows work on projects that forge and strengthen connections among library collections, educational technologies, and current research. The program offers recent PhD graduates the chance to help develop research tools, resources, and services while exploring new career opportunities. Host institutions benefit from fellows' field-specific expertise by gaining insights into their collections' potential uses and users, scholarly information behaviors, and current teaching and learning practices within particular disciplines.”
    • >110 fellows so far
    • UC Davis postdoc in neuroscience: Jonathan Cachat
  • “Painstakingly detailed surveys have been performed across several research organizations, particularly in North America (CLIR; ARL; CDL), Europe (DCC; RIN; NESTA) and Australia (ANDS). The same overall picture emerges:
    • Research data is found in a dizzying number of file formats (some proprietary)
    • Research data can be digital or non-digital
    • Lack of metadata & documentation
    • Data storage is desperate, unorganized, unsecured and researchers need more space
    • Researchers welcome help with federal funding mandates (Data Management Plans)
    • PIs are not concerned with data sharing preparation – a time consuming, thankless job in the current publish-or-perish merit system”
  • “There is ample evidence of a need for research data management services as provided by reports published from libraries and organizations (cited above). However, it is critical to recognize that sloppy record keeping and the constant, fast-paced strive for bigger, faster, stronger technological infrastructure are inherent to the scientific enterprise. Any services that sterilize or mandate rigid process control may provide solutions to specific data concerns, but will do so at a detriment to science – not an ideal solution”
    Amari, Beltrame, Bjaalie, & Dalkara, 2002; Gardner et al., 2003; Kubilius, 2014; Landreth & Silva, 2013; Wallis et al., 2013; White, Baldridge, Brym, Locey, & McGlinn, 2013.
  • “Mandated changes that are detrimental to the flow rate of a daily research enterprise will not be successful. This challenges the core of research data management, curation and service efforts. It highlights the fact that sometimes efforts to help an external group (e.g., neuroscientists) with internal expertise (e.g., library skill sets), even with the best intentions and solid rationale, can be unhelpful and unsustainable.”
    “The problem we are trying to solve is advancing the environmental support and training provided by the university to researchers and students in order to fulfill its mission. Researchers and students are aware of the growing popularity and potential of big data, open data, interdisciplinary data. They desire opportunities, skills and support. Advancing the environmental support will improve their research, it will improve their education – it gives them an edge, and for that a university is recognized.”
  • • Less emphasis on infrastructure
    • More emphasis on policy
      • Citation practices in different research disciplines for data, software
      • Legal tools for data and software sharing in different contexts
    • Lots more emphasis on training and culture change
      • Not of librarians, but researchers themselves