Best Practices for Sharing Economics Data

Prepared for
Second Open Economics International Workshop
June 2013
“Not-bad” Practices for Sharing
Economics Data
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, The Brookings Institution

DISCLAIMER
These opinions are my own; they are not the opinions
of MIT, Brookings, any of the project funders, or (with
the exception of co-authored previously published
work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi,
Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle,
George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White,
etc.

Collaborators & Co-Conspirators
• Jonathan Crabtree, Merce Crosas, Gary
King, Michael McDonald, Nancy
McGovern, Salil Vadhan & many others
• Research Support
Thanks to the Library of Congress, the National
Science Foundation, IMLS, the Sloan
Foundation, the Joyce Foundation, the
Massachusetts Institute of Technology, &
Harvard University.

Related Work
• Altman (2013) Data Citation in The Dataverse Network®. In Developing Data
Attribution and Citation Practices and Standards: Report from an International
Workshop.
• National Digital Stewardship Alliance, 2013 (Forthcoming), 2014 National
Agenda for Digital Stewardship.
• M. Altman, Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young,
C. 2009. "Digital preservation through archival collaboration: The Data Preservation
Alliance for the Social Sciences." The American Archivist. 72(1): 169-182
• M. Altman, 2008, "A Fingerprint Method for Verification of Scientific Data" in,
Advances in Systems, Computing Sciences and Software Engineering, (Proceedings of
the International Conference on Systems, Computing Sciences and Software
Engineering 2007) , Springer-Verlag.
• M. Altman and G. King. 2007. “A Proposed Standard for the Scholarly Citation of
Quantitative Data”, D-Lib, 13, 3/4 (March/April).
Most reprints available from:
informatics.mit.edu

“Not Bad” Practices

Some Trends
Shifting Evidence Base
High Performance Collaboration
(here comes everybody…)
More Data
Publish, then Filter
More Learners
More Open
The Lifecycle and Institutional Ecology of Data

Why not ‘best’ practices?
• Few models for systematic valuation of data
– how much will data X be worth to community Y at time Z?
See: National Digital Stewardship Alliance, 2013 (Forthcoming), 2014
National Agenda for Digital Stewardship. Library of Congress
• Optimality of practices is generally strongly dependent on operational
context
• Context of data sharing very dynamic
– change in publication models
– change in evidence base
– change in data management methodologies
– change in policies
• Paucity of evidence to establish data practices as best:
– Descriptive: adoption, compliance
– Predictive: association of best practices & desired outcomes
– Causal: intervention with best practices linked to improvement
Best practices neither best nor practiced.

Why ‘not bad’ practices?
• Avoid clearly bad practices
• Document operational and tacit knowledge
• Elicit assumptions
• Provide basis for auditing, evaluation, and
improvement

Probably Not Bad Practices

Types of Practices
• Analytic practices
– Lifecycle analysis
– Requirements analysis
• Policy practices
– Data dissemination policies
– Data citation policies
– Reproducibility policies
• Technical practices
– Sharing technologies
– Reproducibility technologies

Core Dimensions of Shared Information Infrastructure
• Stakeholder incentives
– recognition; citation; payment; compliance; services
• Dissemination
– access to metadata; documentation; data
• Access control
– authentication; authorization; rights management
• Provenance
– chain of control; verification of metadata, bits, semantic content
• Persistence
– bits; semantic content; use
• Legal protection
– rights management; consent; record keeping;
• Usability for…
– discovery; deposit; curation; administration; annotation; collaboration
• Economic model
– valuation models; cost models; business models
• Trust model
– verification; transparency; enforcement
See: King 2007; ICSU 2004; NSB 2005; Schneier 2011

Modeling
• Lifecycle stages
– Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access
• Stakeholders
– Scholarly Publishers; Researchers; Data Archives/Publisher; Research Sponsors; Data Sources/Subjects; Consumers; Service/Infrastructure Providers; Research Organizations

Legal Constraints
[Figure: legal constraints, labeled by area: Contract; Intellectual Property; Access Rights; Confidentiality]
• Constraints shown in the figure:
– Copyright; Fair Use; DMCA; Database Rights; Moral Rights; Intellectual Attribution; Trade Secret; Patent; Trademark
– Common Rule (45 CFR 46); HIPAA; FERPA; EU Privacy Directive; Privacy Torts (Invasion, Defamation); Rights of Publicity
– Sensitive but Unclassified; Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …); Classified
– FOIA; CIPSEA; State Privacy Laws; EAR; State FOI Laws
– Journal Replication Requirements; Funder Open Access
– Contract; License; Click-Wrap; TOU; ITAR; Export Restrictions

Data Dissemination Policies - How
• License: Creative Commons
Version 4.0 of the Creative Commons licenses
– Legally well crafted
– Avoids attribution stacking – attribution through links
– Handles sui-generis database rights, licensee rights to publicity, etc.
– Machine actionable
See: wiki.creativecommons.org/4.0
• Confidentiality
Deidentification & public use files insufficient.
– Need multiple modes of access, including protected access to confidential data.
See: National Research Council. 2005. Expanding access to research data: Reconciling risks and
opportunities. Washington, DC: The National Academies Press.
Vadhan, S., et al. 2010. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research
Protections”. Available from: http://dataprivacylab.org/projects/irb/Vadhan.pdf

Data Dissemination Policy - When
• Timeliness [NRC Recommendations]
– Sharing data should be a regular practice.
– Investigators should share their data by the time of
publication of initial major results of analyses of the
data except in compelling circumstances.
– Data relevant to public policy should be shared as
quickly and widely as possible.
– Plans for data sharing should be an integral part of a
research plan whenever data sharing is feasible.
Fienberg, et al. (eds). 1985. Sharing Research data.
Washington, DC: The National Academies Press.

Data Dissemination Policy - Where
• With journals. Follow NISO supplementary materials:
http://www.niso.org/workrooms/supplementalrecommendations
• With sustainable, well-known, collaboratively-stewarded repositories
– Example: data-pass.org
Also see:
M. Altman, Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., &
Young, C. 2009. "Digital preservation through archival collaboration: The Data
Preservation Alliance for the Social Sciences." The American Archivist. 72(1):
169-182

Data Citation Policies
• Data Citation First Principles
(Harvard Workshop, NRC Report, CODATA Forthcoming)
– Data citations should be treated as first-class objects of publication
– At minimum, all data necessary to understand, assess, and extend the conclusions in scholarly
work should be cited.
See:
Altman, Micah. “Data Citation in The Dataverse Network.” Developing Data Attribution and Citation Practices and
Standards: Report from an International Workshop. Ed. Paul F. Uhlir. National Academies Press, 2012.
M. Altman and G. King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib, 13, 3/4
(March/April).
• Data-PASS recommendations
– Minimal elements: author, date, title, persistent id
– Location: must appear with other elements
– Recommended: fixity information, such as
Universal Numeric Fingerprint
See: data-pass.org/citations.html
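
As a rough illustration of the minimal elements above, a citation could be assembled and checked programmatically. This is only a sketch, not the Data-PASS format: the `DataCitation` class, its field names, the comma-separated layout, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Hypothetical container for the citation elements listed above."""
    author: str
    date: str
    title: str
    persistent_id: str
    fixity: Optional[str] = None  # recommended element, e.g. a fingerprint string

    def missing_elements(self) -> List[str]:
        """Return the names of minimal elements that are empty."""
        required = {
            "author": self.author,
            "date": self.date,
            "title": self.title,
            "persistent_id": self.persistent_id,
        }
        return [name for name, value in required.items() if not value.strip()]

    def render(self) -> str:
        """Render all elements together on one line (illustrative layout only)."""
        missing = self.missing_elements()
        if missing:
            raise ValueError("missing minimal elements: " + ", ".join(missing))
        parts = [self.author, self.date, f'"{self.title}"', self.persistent_id]
        if self.fixity:
            parts.append(self.fixity)
        return ", ".join(parts)


# Placeholder values only; this is not a real dataset or identifier.
example = DataCitation(
    author="Doe, Jane",
    date="2013",
    title="Example Survey Data",
    persistent_id="hdl:0000/EXAMPLE",
    fixity="FIXITY-PLACEHOLDER",
)
print(example.render())
```

Keeping the persistent id and fixity string in the same rendered line reflects the "must appear with other elements" recommendation; a check like `missing_elements()` is one way a journal or repository might screen submissions automatically.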

Reproducibility Policies
• Science
– “Unpublished data and personal communications. Citations to unpublished
data and personal communications cannot be used to support claims in a
published paper. Papers will be held for publication until all "in press" citations
are published.”
– “Data and materials availability. All data necessary to understand, assess, and
extend the conclusions of the manuscript must be available to any reader of
Science. All computer codes involved in the creation or analysis of data must
also be available to any reader of Science.”
• Support for publishing replication
– Registered replication reports:
http://www.psychologicalscience.org/index.php/replication
– ICMJE Clinical Trials Registration:
http://www.icmje.org/publishing_10register.html
– Journals of Negative/Null Results

Policies are not Self-Enforcing/Sustaining
• Technical and financial sustainability must be
planned, to ensure long term access
See: National Science Board, Long-Lived Digital Data
Collections: Enabling Research and
Education in the 21st Century. NSF.
http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf
• Long-term access requires initial investment in
data preparation
– Capture tacit knowledge, create metadata
– Transfer to stable formats

Compliance with Data Sharing Policies is often Low
• Compliance is low even in best examples of journals
• Checking compliance is labor-intensive without citation and repository standards
[See Glandon 2011; McCullough, et al. 2008]

Technical Infrastructure Examples
• CKAN
– Open Source
– Established
– Written in Python (DKAN is the Drupal-based counterpart)
– http://ckan.org/
• Dataverse Network
– http://thedata.org
– Open Source
– Flexible archival models
– Semantic Fixity (UNF)
[Altman 2008]
• MyExperiment
– http://www.myexperiment.org/
– Long lasting
– Archives complete workflows to produce results
Technical Criteria
• Long term access
– Replication, independence
• Verifiability and fixity
• Provenance
• Workflows/code
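
The verifiability and fixity criterion can be illustrated with a simple bit-level check. Note the swap: the slide refers to semantic fixity via UNF, while the sketch below uses an ordinary SHA-256 digest of the file bytes and does not implement UNF. The function names and the sample file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: str, recorded_digest: str) -> bool:
    """Compare the file's current digest with one recorded at deposit time."""
    return sha256_of_file(path) == recorded_digest.lower()


# Demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, "example.csv")
    with open(data_path, "w", encoding="utf-8", newline="") as handle:
        handle.write("year,gdp\n2012,1.0\n")

    recorded = sha256_of_file(data_path)            # digest stored with the citation
    unchanged = verify_fixity(data_path, recorded)  # file untouched so far

    with open(data_path, "a", encoding="utf-8", newline="") as handle:
        handle.write("2013,1.1\n")                  # simulate a later modification
    change_detected = not verify_fixity(data_path, recorded)

print(unchanged, change_detected)
```

A byte-level digest changes whenever the file is reformatted, even if the data values are identical, which is the limitation a semantic fingerprint is meant to address.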

Final Observations
• Best practices aren’t…
– document context of practice & measure desired outcomes
• Not-bad practice starts with analysis…
– lifecycle; requirements; sustainability; predicted costs and
benefits
• Effective data sharing requires policies:
– dissemination, citation, replication, auditing
• Effective data sharing requires infrastructure:
– For verifiability, provenance, workflows/code, & long term
access
• Policies are not self-enforcing
– combine incentives, transparency, auditing, & evaluation

Additional Bibliography (Selected)
• McCullough, B.D., Kerry Anne McGeary, and Teresa D. Harrison. "Do Economics Journal Archives Promote Replicable
Research?" Canadian Journal of Economics 41, no. 4 (2008).
• Schneier, Bruce, 2012, Liars and Outliers. Wiley.
• Borgman, Christine. “The Conundrum of Research Sharing.” Journal of the American Society for Information Science and
Technology (2011):1-40.
• Glandon P. , 2011. Report on the American Economic Review Data Availability Compliance Project.
http://www.aeaweb.org/aer/2011_Data_Compliance_Report.pdf
• King, Gary. 2007. An Introduction to the Dataverse Network as an Infrastructure for Data Sharing. Sociological
Methods and Research 36: 173–199.
• National Science Board (NSB). 2005. Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. NSF.
• International Council For Science (ICSU). 2004. ICSU Report of the CSPR Assessment Panel on Scientific Data and Information.
Report.

Questions?
E-mail: escience@mit.edu
Web: micahaltman.com
Twitter: @drmaltman
