A great deal has been written of late about the growing prevalence of data, data analysis and data sharing in science. This is just one of those works, but one I highly recommend, edited by Tony Hey, who is head of Microsoft Research, and others – its available freely online and worth the time to go though.
And as science has changed, so too have the ways in which scholarship is communicated and distributed. Increasingly paper is no longer the main medium of distributing findings. Most academic journals have already moved online and many are ceasing print publication altogether. Increasingly, the limitations of physical distribution are also falling away. We are no longer tied simply to text or static images. Science can be communicated in moving images, data visualizations, programs, and even datasets themselves, which can be separately analyzed, reprocessed and reused.
Massive in scale – terabytes orpetabytes of data is difficult to manage, store
This is just one example of the problems surrounding data publishing and use in science, on one of the vectors which I’ve just mentioned
Finally, another critical question about data is provenance.
So I’m going to talk about two projects of the many, many, many initiatives that are underway related to data that I and NISO are directly involved in related to data and data management. The first is data citation.
CODATA = International Council for Science (ICSU) : Committee on Data for Science and Technology –
Unfortunately, my crystal ball doesn’t work terribly well.
Each community is doing its own things, for its own reasons to support their own needs. This isn’t wrong. It’s completely understandable. However, it will increasingly cause problems when we start talking about Machines talking to Machines and interoperability, which is how access to data can be so valuable. Sure a human might be able to decipher the differences between biology and ecology, population science from voting results. However, computers can’t. Just as one example, we can’t do anything if we don’t have the right metadata. Data tables are filled with figures that are just numbers. Without context, a 22 is as useless a figure as as 356,000, when one is tons and one is grams.
Data Exchange, Data Citation: An overview of some community work Todd Carpenter Executive Director, NISO May 20, 2012 Council of Science Editors
National Information Standards Organization Non-profit industry trade association accredited by ANSI 110+ Organizational Members Divided roughly in thirds among publishers, libraries and support service providers Mission of developing and maintaining technical standards related to information, documentation, discovery and distribution of published materials and media Volunteer driven organization: 400+ contributors whoMay 20, 2012 CSE: Data Standards and Data Citation 1 Issues
Standards are familiar, even if you don‟t noticeImage: DanTaylor Image: Joel WashingMay 20, 2012 CSE: Data Standards and Data Citation 2 Issues
Data PublishingMay 20, 2012 CSE: Data Standards and Data Citation 3 Issues
Large Scale Data-DrivenScienceIncreasingly, scientificbreakthroughs will bepowered by advancedcomputing capabilitiesthat help researchersmanipulate and exploremassive datasets.May 20, 2012 CSE: Data Standards and Data Citation 4 Issues
Science data isn‟t what it used to beImage: Walters Art Museum Image: Domenico, Caron, Davis, et al.May 20, 2012 CSE: Data Standards and Data Citation 5 Issues
Challenges of Data Publishing/SharingData – is massive in scale. – can be constantly changing. – is difficult to describe/catalog. – is often difficult to integrate with other data. – can be complex to analyze (tools required). – is hard to peer review. – is challenging to curate. – is challenging to preserve.May 20, 2012 CSE: Data Standards and Data Citation 6 Issues
Data – Complexities Storage Provenance Legal Issues Technical Interoperability Metadata Privacy IdentificationMay Citation 20, 2012 CSE: Data Standards and Data Citation 7 Issues
Problem with data, e.g.:mutabilityThe fact data sets are constantly changing poses the following problems (not exhaustive, BTW) – How does someone rerun your experiment, when the data set isn‟t now what it was when you ran your experiment – For citation – You have to cite the subset of the data that existed on XXX date when you ran your analysis – How can you preserve a snapshot of a dataset? – How do you describe the dataset at that point in time? – Problems of managing the metadata of each data depositMay 20, 2012 CSE: Data Standards and Data Citation 8 – Problems of synchronization across repositories Issues
Data Complexity – Provenance • How can you be certain that the data set you are looking at hasn‟t been altered or changed? • Can you trust the data source? • Has the data been updated/added to/or appended since the analysis was initially run • Dealing with issues of abstractionMay 20, 2012 CSE: Data Standards and Data Citation Issues 9
Data CitationMay 20, 2012 CSE: Data Standards and Data Citation 10 Issues
We all know what a citation looks like?Right?May 20, 2012 CSE: Data Standards and Data Citation 11 Issues
But what about citing a data set?Source: Citations for SEER Databases Source: International Polar Year Source: Global Land Cover Facility Source: The Economist Source: ICPSR May 20, 2012 CSE: Data Standards and Data Citation 12 Issues
CODATA Group on DataCitation• Approved by the CODATA 27th General Assembly in Cape Town in October, 2010TASKS – Survey existing literature and existing data citation initiatives. – Obtain input from stakeholders in library, academic, publishing, and research communities. – Hold at least one meeting and a workshop to help establish a solid foundation of the state of the art and practices in this area. – Work with the ISO and major regional and national standards organizations to develop formal data citation standards and good practices.May 20, 2012 CSE: Data Standards and Data Citation 13 Issues
Issues Requiring AttentionTechnical Sustainability• Metadata Standards IdentifiersScientific • DOI, PURL, ARC• Publication and peer review Legal/IntellectualInstitutional Property Rights• Promotion & Tenure • Accessibility and reuse rights• Reward infrastructure Socio-culturalFinancial • Culture of citing data• Infrastructure support • Proper attributionMay 20, 2012 CSE: Data Standards and Data Citation 14 Issues
Accomplishments so far• Compiled a significant bibliography of 200+ references• Conducted a survey for repository managers, publishers, and funding organizations. – More than 60 narrative responses• Upcoming meeting in Copenhagen in June• Final meeting in Taipei in October with final report to follow.May 20, 2012 CSE: Data Standards and Data Citation 15 Issues
BRDI Meeting in Berkeley• Two-day meeting in Berkeley, CA on data citation issues in August 2011• Covered topics critical to data citation – Administration -- Legal – Publishing -- Policy – Identification -- P&T – Funding support -- ProvenanceNational Academies report to be published in JuneMay 20, 2012 CSE: Data Standards and Data Citation 16 Issues
What does the future hold?May 20, 2012 CSE: Data Standards and Data Citation 17 Issues
Standards for Data: Workareas? Many potential areas for work in sharing of scientific data including: • Author/Contributor disambiguation & other issues • Data Equivalence – How does one know that this thing and that are equivalent (i.e., contain same data)? • Systemic metadata What is the form of this information? What are its structural components? • Archival issues Storage, physical level, but also migration issues • Bibliographic information for discovery, delivery and reuse • Rights issues – Ownership, recognition, sharing, privacyMay 20, 2012 CSE: Data Standards and Data Citation Issues 18
ORCID – Author disambiguationFounded by CrossRef, Thomson-Reuters, Nature in 2009Now 328 participant organizations, 50 of which have provided sponsorship fundingPrototype technologyFull launch in Fall 2012May 20, 2012 CSE: Data Standards and Data Citation Issues 19
International Standard Name ID• ISO 27729:2012 Information & Documentation -- International Standard Name Identifier• Another identifier for public identity of parties• Application for institutions (NISO I2)• Launched in SpringMay 20, 2012 CSE: Data Standards and Data Citation Issues 20 „12
Data Equivalence• Creation of a high- level conceptual model of data description• A “FRBR” for data• Defines the distinctions between states & transformations of data• Basis for identificationMay 20, 2012 CSE: Data Standards and Data Citation Issues 21 & description
Too many players, doing too manythings? Source: XKCD May 20, 2012 CSE: Data Standards and Data Citation Issues 22
Other Initiatives & Links • CODATA/ICSTI – Task Group on Data Citation Standards and Practices • DataCite – datacite.org • NISO/NFAIS Group on Supplemental Journal Materials • National Academies Board on Research Data and Information – For Attribution: Developing Data Attribution and Citation Practices and Standards in Berkely, CA. August 2011 Report due in January • Cite Datasets and Link to Publicationsby Digital Curation Centre • ORCID - about.orcid.org • International Standard Name ID – www.isni.org • ScienceCommons - Scholar‟s Copyright Project • International Council for Scientific and Technical Information (ICSTI) – Multimedia Search and Retrieval, and Interactive Journal Articles Projects • Dublin Core Metadata Initiative - Science and Metadata Community • DataNet projects funded by NSF, launched in 2009 • NISO/OAI ResourceSyncproject funded by Alfred P. Sloan FoundationMay 20, 2012 CSE: Data Standards and Data Citation 23 Issues
Information Standards Quarterly (ISQ) Summer 2010 Issue of ISQ Special issue on Enhanced Journal Articles Article-level Enhancements in the Humanities and Social Sciences Hosting Supplemental Materials Archiving Supplemental Materials DOIs - Linking and Beyond Supplemental Materials Survey Report on NISO/NFAIS project NOW AVAILABLE OPEN ACCESS www.niso.org/publications/is qMay 20, 2012 CSE: Data Standards and Data Citation 24 Issues
Questions? Todd Carpenter Executive Director email@example.com National Information Standards Organization (NISO) One North Charles Street, Suite 1905 Baltimore, MD 21201 USA +1 (301) 654-2512 www.niso.orgMay 20, 2012 CSE: Data Standards and Data Citation 25 Issues