This work, by Micah Altman (http://micahaltman.com), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Best practices aren't. The core issue is that there are few models for the systematic valuation of data: we have no robust, general, proven way of answering the question of how much data X will be worth to community Y at time Z. Thus the "bestness" (optimality) of practices is generally strongly dependent on operational context, and the context of data sharing is currently both highly complex and dynamic. Until there is systematic descriptive evidence that best practices are used, predictive evidence that best practices are associated with future desired outcomes, and causal evidence that the application of best practices yields improved outcomes, we will be unsure that practices are "best".

Nevertheless, one should use established "not-bad" practices, for a number of reasons: first, to avoid practices that are clearly bad; second, because use of such practices acts to document operational and tacit knowledge; third, because selecting practices can help to elicit the underlying assumptions under which practices are applied; and finally, because not-bad practices provide a basis for auditing, evaluation, and eventual improvement.

Specific not-bad practices for data sharing fall into roughly three categories:
• Analytic practices: lifecycle analysis and requirements analysis
• Policy practices for data dissemination, licensing, privacy, availability, citation, and reproducibility
• Technical practices for sharing and reproducibility, including fixity, replication, and provenance

This presentation at the Second Open Economics International Workshop (sponsored by the Sloan Foundation, MIT and OKFN) provides an overview of these and links to specific practice recommendations, standards, and tools:
The LHC produces a petabyte of data every two weeks, Sloan Galaxy Zoo has hundreds of thousands of “authors”, 50K people attend a class from the University of Michigan, and to understand public opinion, instead of surveying hundreds of people per month, we can analyze 10,000 tweets per second.
Most of the different stakeholders have stronger relationships with, and stakes in, research at different stages. But researchers and research institutions are in the middle: they have a strong stake in most stages. Researchers are more directly concerned with collection, processing, analysis, and dissemination. Organizations have a higher stake in internal sharing, re-use, and long-term access.
Best Practices for Sharing Economics Data
Prepared for
Second Open Economics International Workshop
June 2013

“Not-bad” Practices for Sharing Economics Data

Dr. Micah Altman <email@example.com>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, The Brookings Institution
DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
• Jonathan Crabtree, Merce Crosas, Gary King, Michael McDonald, Nancy McGovern, Salil Vadhan & many others
• Research Support
Thanks to the Library of Congress, the National Science Foundation, IMLS, the Sloan Foundation, the Joyce Foundation, the Massachusetts Institute of Technology, & Harvard University.
Related Work
• M. Altman. 2013. "Data Citation in The Dataverse Network®." In Developing Data Attribution and Citation Practices and Standards: Report from an International Workshop.
• National Digital Stewardship Alliance. 2013 (Forthcoming). 2014 National Agenda for Digital Stewardship.
• M. Altman, Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist. 72(1): 169-182.
• M. Altman. 2008. "A Fingerprint Method for Verification of Scientific Data." In Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer-Verlag.
• M. Altman and G. King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data." D-Lib, 13, 3/4 (March/April).
Most reprints available from: informatics.mit.edu
‘Not Bad’ Practices
Some Trends
• Shifting Evidence Base
• High Performance Collaboration (here comes everybody…)
• More Data
• Publish, then Filter
• More Learners
• More Open
Why not ‘best’ practices?
Best practices are neither best nor practiced.
• Few models for systematic valuation of data
– How much will data X be worth to community Y at time Z?
See: National Digital Stewardship Alliance, 2013 (Forthcoming), 2014 National Agenda for Digital Stewardship. Library of Congress.
• Optimality of practices is generally strongly dependent on operational context
• Context of data sharing is very dynamic
– change in publication models
– change in evidence base
– change in data management methodologies
– change in policies
• Paucity of evidence to establish data practices as best:
– Descriptive: adoption, compliance
– Predictive: association of best practices & desired outcomes
– Causal: intervention with best practices linked to improvement
Why ‘not bad’ practices?
• Avoid clearly bad practices
• Document operational and tacit knowledge
• Elicit assumptions
• Provide basis for auditing, evaluation, and improvement
Probably Not Bad Practices
Types of Practices
• Analytic practices
– Lifecycle analysis
– Requirements analysis
• Policy practices
– Data dissemination policies
– Data citation policies
– Reproducibility policies
• Technical practices
– Sharing technologies
– Reproducibility technologies
Core Dimensions of Shared Information Infrastructure
• Stakeholder incentives
– recognition; citation; payment; compliance; services
• Dissemination
– access to metadata; documentation; data
• Access control
– authentication; authorization; rights management
• Provenance
– chain of control; verification of metadata, bits, semantic content
• Persistence
– bits; semantic content; use
• Legal protection
– rights management; consent; record keeping
• Usability for…
– discovery; deposit; curation; administration; annotation; collaboration
• Economic model
– valuation models; cost models; business models
• Trust model
– verification; transparency; enforcement
See: King 2007; ICSU 2004; NSB 2005; Schneier 2011
[Figure: Modeling. Research data lifecycle stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access. Stakeholders: Scholarly Publishers; Researchers; Data Archives/Publishers; Research Sponsors; Data Sources/Subjects; Consumers; Service/Infrastructure Providers; Research Organizations.]
Data Dissemination Policies - How
• License: Creative Commons
Version 4.0 of the Creative Commons licenses:
– Legally well crafted
– Avoids attribution stacking: attribution through links
– Handles sui generis database rights, licensee rights to publicity, etc.
– Machine actionable
See: wiki.creativecommons.org/4.0
• Confidentiality
Deidentification & public use files are insufficient.
– Need multiple modes of access, including protected access to confidential data.
See: National Research Council. 2005. Expanding access to research data: Reconciling risks and opportunities. Washington, DC: The National Academies Press.
Vadhan, S., et al. 2010. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections”. Available from: http://dataprivacylab.org/projects/irb/Vadhan.pdf
Data Dissemination Policy - When
• Timeliness [NRC Recommendations]
– Sharing data should be a regular practice.
– Investigators should share their data by the time of publication of initial major results of analyses of the data except in compelling circumstances.
– Data relevant to public policy should be shared as quickly and widely as possible.
– Plans for data sharing should be an integral part of a research plan whenever data sharing is feasible.
Fienberg, et al. (eds). 1985. Sharing Research Data. Washington, DC: The National Academies Press.
Data Dissemination Policy - Where
• With journals. Follow the NISO supplementary materials recommendations:
http://www.niso.org/workrooms/supplementalrecommendations
• With sustainable, well-known, collaboratively-stewarded repositories
– Example: data-pass.org
Also see:
M. Altman, Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist. 72(1): 169-182.
Data Citation Policies
• Data Citation First Principles
(Harvard Workshop, NRC Report, CODATA Forthcoming)
– Data citations should be treated as first-class objects of publication
– At minimum, all data necessary to understand, assess, and extend conclusions in scholarly work should be cited.
See:
Altman, Micah. “Data Citation in The Dataverse Network.” Developing Data Attribution and Citation Practices and Standards: Report from an International Workshop. Ed. Paul F. Uhlir. National Academies Press, 2012.
M. Altman and G. King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib, 13, 3/4 (March/April).
• Data-PASS recommendations
– Minimal elements: author, date, title, persistent id
– Location: must appear with other elements
– Recommended: fixity information, such as a Universal Numeric Fingerprint
See: data-pass.org/citations.html
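The minimal citation elements listed above can be checked mechanically. The following is a minimal sketch, not the Data-PASS or Dataverse implementation: the class name DataCitation, the output layout, and the example identifier are all hypothetical, and a plain SHA-256 digest stands in for the recommended fixity element. Note that SHA-256 is a bit-level checksum; a Universal Numeric Fingerprint is computed over the semantic content of the data and is not reproduced here.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical record holding the minimal elements of a data citation."""
    author: str
    date: str
    title: str
    persistent_id: str
    fixity: Optional[str] = None  # recommended, not required

    def missing_elements(self) -> List[str]:
        """Return the names of required elements that are empty."""
        required = ("author", "date", "title", "persistent_id")
        return [name for name in required if not getattr(self, name).strip()]

    def format(self) -> str:
        """Render the citation; refuse if a minimal element is absent."""
        missing = self.missing_elements()
        if missing:
            raise ValueError("citation lacks minimal elements: " + ", ".join(missing))
        text = f'{self.author}. {self.date}. "{self.title}". {self.persistent_id}'
        if self.fixity:
            text += " " + self.fixity
        return text


def sha256_fingerprint(data: bytes) -> str:
    """Bit-level stand-in for fixity information (an assumption, not a UNF)."""
    return "SHA256:" + hashlib.sha256(data).hexdigest()
```

A journal or repository could run missing_elements() over submitted citations as a cheap first-pass compliance check; whether a persistent id actually resolves would still need a separate test.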
Reproducibility Policies
• Science
– “Unpublished data and personal communications. Citations to unpublished data and personal communications cannot be used to support claims in a published paper. Papers will be held for publication until all "in press" citations are published.”
– “Data and materials availability. All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science.”
• Support for publishing replication
– Registered replication reports:
http://www.psychologicalscience.org/index.php/replication
– ICMJE Clinical Trials Registration:
http://www.icmje.org/publishing_10register.html
– Journals of Negative/Null Results
Policies are not Self-Enforcing / Sustaining
• Technical and financial sustainability must be planned, to ensure long-term access
See: National Science Board, Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. NSF.
http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf
• Long-term access requires initial investment in data preparation
– Capture tacit knowledge, create metadata
– Transfer to stable formats
Compliance with Data Sharing Policies is often Low
• Compliance is low even in the best examples of journals
• Checking compliance is labor-intensive without citation and repository standards
[See Glandon 2011; McCullough et al. 2008]
Technical Infrastructure Examples
• CKAN
– Open Source
– Established
– Built on Drupal platform
– http://ckan.org/
• Dataverse Network
– http://thedata.org
– Open Source
– Flexible archival models
– Semantic Fixity (UNF) [Altman 2008]
• MyExperiment
– http://www.myexperiment.org/
– Long lasting
– Archives complete workflows to produce results

Technical Criteria
• Long term access
– Replication, independence
• Verifiability and fixity
• Provenance
• Workflows/code
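The fixity and replication criteria above can be illustrated with a small audit routine. This is a sketch under stated assumptions: the function names file_digest and audit_replicas are hypothetical, replicas are assumed to be reachable as local file paths, and bit-level SHA-256 is used rather than the semantic fixity (UNF) that the Dataverse Network provides. None of the platforms listed is claimed to work this way.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_replicas(expected: str, replicas: Iterable[Path]) -> Dict[str, str]:
    """Compare each replica against the digest recorded at deposit time.

    Returns a status per replica path: "ok", "corrupt", or "missing".
    """
    report: Dict[str, str] = {}
    for path in replicas:
        if not path.is_file():
            report[str(path)] = "missing"
        elif file_digest(path) == expected:
            report[str(path)] = "ok"
        else:
            report[str(path)] = "corrupt"
    return report
```

Running such an audit on a schedule across independently managed copies is one way to turn "replication, independence" into something that can be evaluated; the digest recorded at deposit is the same kind of fixity information a data citation can carry.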
Final Observations
• Best practices aren’t…
– document context of practice & measure desired outcomes
• Not-bad practice starts with analysis…
– lifecycle; requirements; sustainability; predicted costs and benefits
• Effective data sharing requires policies:
– dissemination, citation, replication, auditing
• Effective data sharing requires infrastructure:
– for verifiability, provenance, workflows/code, & long-term access
• Policies are not self-enforcing
– combine incentives, transparency, auditing, & evaluation
Additional Bibliography (Selected)
• McCullough, B.D., Kerry Anne McGeary, and Teresa D. Harrison. 2008. "Do Economics Journal Archives Promote Replicable Research?" Canadian Journal of Economics 41, no. 4.
• Schneier, Bruce. 2012. Liars and Outliers. Wiley.
• Borgman, Christine. 2011. “The Conundrum of Research Sharing.” Journal of the American Society for Information Science and Technology: 1-40.
• Glandon, P. 2011. Report on the American Economic Review Data Availability Compliance Project. http://www.aeaweb.org/aer/2011_Data_Compliance_Report.pdf
• King, Gary. 2007. An Introduction to the Dataverse Network as an Infrastructure for Data Sharing. Sociological Methods and Research 36: 173–199.
• National Science Board (NSB). 2005. Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. NSF. http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf
• International Council for Science (ICSU). 2004. ICSU Report of the CSPR Assessment Panel on Scientific Data and Information. Report.
Questions?
E-mail: firstname.lastname@example.org
Web: micahaltman.com
Twitter: @drmaltman