Managing data responsibly
to enable research integrity
Heather Coates | Digital Scholarship & Data Management Librarian
http://ulib.iupui.edu/digitalscholarship/datasupport/
Introduction to Research Ethics
Quaid G504 (Fall 2016)
What is research integrity?
security
privacy
trust
honesty
accuracy
efficiency
objectivity
personal responsibility
ownership
stewardship
governance
Why does RDM matter?
RDM as a component of RCR
Roles & Responsibilities
Practical RDM
WHY DOES RDM MATTER?
The value of data increases with their use.
-Paul Uhlir
The World of Data Around Us
[Chart: worldwide information created vs. available storage (petabytes), 2005–2010; the growing gap represents transient information or unfilled demand for storage. Source: John Gantz, IDC Corporation, The Expanding Digital Universe]
The World of Data Around Us: Data Loss
• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies (e.g., PKI failure)
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements
CC images by Sharyn Morrow and momboleum on Flickr
Poor Data Management Affects Everyone
“MEDICARE PAYMENT ERRORS NEAR $20B” (CNN) December 2004
Miscoding and billing errors from doctors and hospitals totaled $20,000,000,000 in FY 2003 (a 9.3% error rate).
The error rate measured claims that were paid despite being medically unnecessary, inadequately
documented, or improperly coded. In some instances, Medicare asked health care providers for medical
records to back up their claims and got no response. The survey did not document instances of alleged fraud.
This error rate was actually an improvement over the previous fiscal year (9.8% error rate).
“AUDIT: JUSTICE STATS ON ANTI-TERROR CASES FLAWED” (AP) February 2007
The Justice Department Inspector General found only two sets of data out of 26 concerning terrorism attacks
were accurate. The Justice Department uses these statistics to argue for their budget. The Inspector General
said the data “appear to be the result of decentralized and haphazard methods of collections … and do not
appear to be intentional.”
“OOPS! TECH ERROR WIPES OUT Alaska Info” (AP) March 2007
A technician managed to delete the data and backup for the $38 billion Alaska oil revenue fund – money
received by residents of the State. Correcting the errors cost the State an additional $220,700 (which, of course,
was taken off the receipts to Alaska residents).
"Good data management practice allows reliable
verification of results and permits new and
innovative research built on existing information.
This is important if the full value of public
investment in research is to be realized."
Managing and Sharing Data: Best Practices for Researchers
UK Data Archive
Benefits: Good Data Practices & Open Data
• Open data addresses social justice issues
• Open data enhances social welfare
• Open data supports effective governance and policy making
• Open data grows the economy
• Open data improves the integrity of the scholarly record
• Open data facilitates the education and training of new generations
• Open data enables validation or replication to support published results
• Open data accelerates the pace of discovery
• Good data practices (GDP) increase the impact of your work by making your data, code & other products shareable
• Good data practices improve the quality and consistency of the research data you produce (save $$$)
• Good data practices improve the efficiency of your research (save time)
Personal Experience
“Please forgive my paranoia about protocols, standards, and data review. I'm in the
latter stages of a long career with USGS (30 years, and counting), and have experienced
much. Experience is the knowledge you get just after you needed it.
Several times, I've seen colleagues called to court in order to testify about conditions
they have observed. Without a strong tradition of constant review and approval of
basic data, they would've been in deep trouble under cross-examination. Instead, they
were able to produce field notes, data approval records, and the like, to back up their
testimony.
It's one thing to be questioned by a college student who is working on a project for
school. It's another entirely to be grilled by an attorney under oath with the media
present.”
- Nelson Williams, Scientist
US Geological Survey
RDM AS A COMPONENT OF RCR
Concepts of Data Management
• Data ownership
• Data collection
• Data storage
• Data protection
• Data retention
• Data analysis
• Data sharing
• Data reporting
Steneck, 2004
[Figure: DataONE Data Life Cycle diagram]
The purpose of data management planning is to ensure that
research data produced by a project are high quality, well
organized, thoroughly documented, preserved, and accessible so
that the validity of the data can be determined at any time.
ORI Guidelines for Responsible Data Management
The goal of data management is to produce self-describing data
sets.
DataONE Primer on Data Management
Why is good data management so challenging?
ambiguity effect
availability heuristic
confirmation bias
experimenter’s or expectation bias
framing effect
hindsight bias
neglect of probability
optimism bias
planning fallacy
well-traveled road effect
ROLES & RESPONSIBILITIES
Funder progress towards openness
• 1985: National Research Council
• 1999: Office of Management & Budget, Circular A-110 revisions
• 2003: NIH Data Sharing Policy
• 2008: NIH Public Access Policy
• 2011: NSF DMP requirement
• 2012: NEH Office of Digital Humanities DMP requirement
• 2013: NSF Biosketch change
• 2013: OSTP Memo on Public Access to Results of Federally-Funded Research
• 2014: OSTP Memo on Improving the Management of & Access to Scientific Collections
• 2014: OMB Circular A-81 (Uniform Guidance) takes effect
• 2015: Federal funding agencies release plans responding to the 2013 OSTP memo
• 2016: Federal funding agency plans take effect (DMP requirement)
Funder Policies: DMP & data sharing
• Agency for Healthcare Research & Quality
• Centers for Disease Control & Prevention
• National Institutes of Health
• National Science Foundation
More agency policies at datasharing.sparcopen.org
Publisher Policies: Data availability
• DataDryad Publishers
• PLoS Journals
• Nature Publishing Group
• American Economic Review
• BioMed Central
• JORD - Social Science Journals with a research data policy
• Data policies of Economic Journals
https://ulib.iupui.edu/digitalscholarship/datasupport/publisher_policies
Institutional Policies
• Vary greatly, with lots of gaps
• Distributed – address specific
local or state requirements for
specific types of data
• Often focus on institutional data
rather than research data
• Do not provide practical guidance
• Do not distinguish between
institutional and personal
responsibilities
Roles/Responsibilities of project personnel
• On your own
– Fill in the team members responsible for key data activities
• In small groups (2-3 people), share and discuss
– What kind of training is provided for team members to complete
these tasks accurately?
• Whole group discussion
– What barriers do you face in tracking roles and responsibilities?
– What barriers do you face in providing training?
Activities and descriptions (the Team Member Name and Project Role columns are filled in during the exercise)
• Project design [+ documentation]
– Determining the aims of the project, the methods used to achieve those aims, and identifying the products resulting from the project.
– Translating the aims of the project into measurable research questions or hypotheses.
• Instrument/measure/data collection tool design [+ documentation]
– Creating tools that adapt the research questions or hypotheses into questions that can be addressed by discrete data points.
– Validating tools through external review or pilot testing.
• Data collection [+ documentation]
– Conducting surveys, interviews, experiments, and other project procedures according to the protocol in order to generate data.
• Data processing [+ documentation]: entry, proofing/cleaning, preparation for analysis
– Entering analog data into a spreadsheet or database. Documenting procedures, date, and person responsible.
– Checking data entry for accuracy and completeness. Documenting procedures, date, and person responsible.
– Checking data for missing values, errors, and outliers. Documenting procedures, date, and person responsible.
– Deciding what data to include/exclude. Documenting decision-making process and criteria used.
• Data analysis [+ documentation]
– Selecting analytical tools to be used. Documenting decision-making process and criteria used.
– Conducting data characterization and screening tests, running analyses, generating results. Documenting process and files generated.
– Deciding what data are relevant to the project aims and objectives. Documenting decision-making process and criteria used.
• Data reporting
– Creating summary tables, graphs, and other visuals to represent the data.
– Writing up the project details and relevant results in the packages/format requested by the client, as specified by the deliverables agreed upon in the contract.
PRACTICAL RDM
Basic Principles: Good Research Data Practices
1. Have a plan & use it
2. Follow the 3-2-1 rule of data storage
3. Document
4. Be consistent; when you aren’t, document the deviation
5. Use common, standardized terminology
6. Monitor the quality of the data as it is being created
7. Report enough detail about your research so that others in your
field can reproduce it and others outside your field can evaluate it
8. Be as open as possible
9. Think about how your research might be useful to others
1: Functional data management plans support teams
• A tool for planning all the key activities related to data before
you have a messy pile of bits on your hands
• A working document that reflects how a study is conducted
• Communication device for the team
• Documents the team members and their roles
• Customized to address the issues most relevant to your
research
1: Planning…learning from Good Clinical Data
Management Practices
• Begin with the end in mind OR Produce report-ready outputs
• Plan, test, revise, plan, test, revise…implement
• Include all stakeholders in the design of the protocol, data
collection tools, data management plan, etc.
• Document, document, document
– Specify documents required for reproducible research
– Facilitates clear communication and shared understanding
throughout the project
– Specify roles and responsibilities from the beginning
2: Follow the 3-2-1 Rule
The accepted rule for backup best practices is the three-two-one rule. It can be
summarized as: if you’re backing something up, you should have:
• at least three copies (in different places),
• on two different types of storage media,
• with one of those copies off-site.
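To make the rule concrete, here is a minimal Python sketch (an illustration added here, not part of the original slides) that copies a project’s data folder to a second drive and to a network share standing in for off-site storage. All paths are hypothetical placeholders.

import shutil
from datetime import date
from pathlib import Path

# Hypothetical locations; substitute real project paths.
WORKING_COPY = Path("C:/projects/study01/data")        # copy 1: working data
LOCAL_BACKUP = Path("E:/backups/study01")               # copy 2: external drive
OFFSITE_BACKUP = Path("//campus-server/lab/study01")    # copy 3: off-site share

def backup(source: Path, destinations: list) -> None:
    """Copy the data directory to each backup location, stamped with today's date."""
    stamp = date.today().isoformat()
    for dest in destinations:
        target = dest / f"data_{stamp}"
        shutil.copytree(source, target, dirs_exist_ok=True)
        print(f"Backed up {source} -> {target}")

if __name__ == "__main__":
    backup(WORKING_COPY, [LOCAL_BACKUP, OFFSITE_BACKUP])

In practice this job is usually delegated to existing backup software or institutional storage; the point is simply that all three copies exist and one lives somewhere else.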
3: Document: How much?
More than you think you will need BUT less than everything
[Figure: Information entropy over time, from Michener et al. 1997. The amount of information retained about a dataset declines from the time of data development: specific details about problems with individual items or specific dates are lost relatively rapidly; general details about datasets are lost through time; accident or technology change may make data unusable; retirement or career change makes access to “mental storage” difficult or unlikely; loss of the data developer leads to loss of the remaining information.]
3: Document, document, document
Documentation should capture the crucial details needed for post-publication
peer review or validation of results
• Study: research questions/aims, IRB protocol, informed
consents/authorizations, etc.
• Data collection instruments or tools OR data sources
• Data collection process or workflow
• Can take many forms, but should be consistent with standards or norms
of practice for your field (e.g., data dictionary, data model, codebook,
readme.txt, lab notebook)
3A: Know what you have - Data Inventory
• On your own
– Fill in as much of the data inventory as you can
• In small groups (2-3 people), share and discuss
– Benefits of knowing exactly what data you have?
– How hard would it be to complete this fully and accurately?
• Whole group discussion
– How might this be helpful throughout various phases of the project?
– How might it be helpful to have an inventory for completed and active projects?
3A: Know what you have - Data Inventory Example
• Funding source
• Program or initiative
• Project title
• PI First Name
• PI Surname
• Other Researchers/Data Contacts
• Project Start Date dd-mm-yyyy
• Project End Date dd-mm-yyyy
• New datasets created?
• How many datasets created
• Data location(s)
• Dataset Type (qualitative, quantitative,
mixed methods, model)
• Sharing data?
– Deposit location?
– Licensing?
– Embargo?
http://www.data-archive.ac.uk/create-manage/strategies-for-centres/data-inventory
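As a hedged illustration (not from the UK Data Archive template itself), the fields above can live in a simple spreadsheet; the Python sketch below writes one with the csv module. The field names and sample row are hypothetical.

import csv

FIELDS = [
    "funding_source", "program", "project_title", "pi_first_name", "pi_surname",
    "other_contacts", "start_date", "end_date", "new_datasets_created",
    "number_of_datasets", "data_locations", "dataset_type",
    "sharing", "deposit_location", "license", "embargo",
]

# One hypothetical entry; dates follow the dd-mm-yyyy convention used on the slide.
example_row = {
    "project_title": "Example project (hypothetical)",
    "pi_first_name": "Jane",
    "pi_surname": "Doe",
    "start_date": "01-09-2016",
    "end_date": "31-08-2018",
    "dataset_type": "quantitative",
    "sharing": "yes",
    "deposit_location": "institutional repository",
}

with open("data_inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)   # unfilled fields stay blank
    writer.writeheader()
    writer.writerow(example_row)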
3B: Documentation Strategies
• Lab notebooks (print or electronic)
• Codebooks
• Data Dictionaries
• Procedures Manuals
• Protocols
• Readme.txt
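A hedged sketch of the simplest strategy on this list, a readme.txt: the outline below is a generic template written out by a short Python script, not the author’s prescribed format; the section headings are suggestions only.

README_TEMPLATE = """\
PROJECT TITLE: <title>
PI / CONTACT: <name, email>
DATA COLLECTION PERIOD: <start date> to <end date>

FILE LISTING
  data/raw/        original, unedited data files
  data/processed/  cleaned files used for analysis
  docs/            protocol, consent forms, codebook, data dictionary

METHODS
  <brief description of how the data were collected and processed>

VARIABLES
  <see data_dictionary.csv or codebook>

ACCESS AND LICENSING
  <e.g., de-identified data only; IRB protocol number; license>
"""

# Write the template next to the data so it travels with the files.
with open("readme.txt", "w", encoding="utf-8") as f:
    f.write(README_TEMPLATE)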
3C: Structured documentation [metadata] is crucial for
discovery, reuse, and interoperability
• Metadata describes the who, what, when, where, how, why of the data
• Metadata = documentation for machines (standardized, structured)
• Purpose is to enable evaluation, discovery, organization, management, reuse,
authority/identification, and preservation
• Standards are commonly agreed upon terms and definitions in a structured
format
• Good documentation builds trust in your data – provenance, data integrity,
transparency, audit trail, etc.
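To make “documentation for machines” concrete, here is a minimal sketch of a structured metadata record covering the who/what/when/where/how/why, saved as JSON. The field names are an ad hoc assumption for illustration; a real project would follow a community standard (for example, DataCite or DDI elements).

import json

# Hypothetical record; values are placeholders, not real project details.
record = {
    "title": "Example survey dataset",                                # what
    "creator": "Doe, Jane (Example University)",                      # who
    "date_collected": {"start": "2016-01-15", "end": "2016-06-30"},   # when
    "location": "Indianapolis, IN, USA",                              # where
    "methods": "Online survey administered per protocol v2",          # how
    "purpose": "Collected to examine the study aims under grant X",   # why
    "identifier": "doi:10.xxxx/placeholder",                          # placeholder identifier
    "license": "CC BY 4.0",
}

with open("dataset_metadata.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)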
4: Be consistent
• We’re human – recognize the challenge
• Prevention – design research instruments & processes to prevent
mistakes
• Pilot everything to identify potential problem areas
• When you aren’t, document the deviation
• Train your project personnel to be consistent & monitor performance
• Do internal audits, quality checks, data screening periodically to
detect inconsistencies
5: Use common, standardized terminology
• For things/concepts
– Diagnoses
– Species/cell lines
– Locations
– Variable names
– Samples & materials
• For formats, too
– Dates
– Codes
– Identifiers
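One hedged illustration of standardizing formats and terms in practice: normalize dates to ISO 8601 and check a field against a controlled vocabulary before analysis. The vocabulary and the list of date formats below are assumptions made for this example.

from datetime import datetime

# Hypothetical controlled vocabulary for a 'species' field.
ALLOWED_SPECIES = {"mus musculus", "rattus norvegicus", "homo sapiens"}

# Date formats assumed to occur during data entry for this example.
KNOWN_DATE_FORMATS = ["%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d"]

def to_iso_date(value: str) -> str:
    """Convert a date string to ISO 8601 (YYYY-MM-DD); raise if unrecognized."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def check_species(value: str) -> bool:
    """Return True if the value matches the controlled vocabulary (case-insensitive)."""
    return value.strip().lower() in ALLOWED_SPECIES

print(to_iso_date("03/07/2016"))      # -> 2016-03-07
print(check_species("Mus musculus"))  # -> True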
6: Monitor the quality of the data
• Don’t wait until data collection is over
• Quality Assurance
• Quality Control
• Build it into the project timeline
• Make it someone’s job
• Document what you find and how it was corrected
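A minimal sketch of what a periodic quality-control pass might look like in code (variable names, ranges, and rows are hypothetical; real checks belong in whatever tools the team already uses): flag missing values and out-of-range scores as data come in, so findings and corrections can be documented.

# Hypothetical rows as they might arrive from data entry.
rows = [
    {"pid": "P001", "age": 34,   "score": 12},
    {"pid": "P002", "age": None, "score": 31},   # missing age; score out of range
]

# Each check returns True when the value is acceptable.
CHECKS = {
    "age":   lambda v: v is not None and 18 <= v <= 99,
    "score": lambda v: v is not None and 0 <= v <= 27,
}

def screen(records):
    """Return (participant id, field, value) for every failed check."""
    problems = []
    for row in records:
        for field, is_ok in CHECKS.items():
            if not is_ok(row.get(field)):
                problems.append((row["pid"], field, row.get(field)))
    return problems

for pid, field, value in screen(rows):
    print(f"QC flag: {pid} {field}={value!r}")   # record what was found and how it was corrected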
7: Better reporting
• Report enough detail about your research so that others in your field
can reproduce it and others outside your field can evaluate it
• This includes ALL aspects of the study: study design, data collection
methods, sampling, population, data screening & processing, QA/QC
procedures, analytical procedures, visualization procedures, etc.
I know you can’t fit all of this into a journal article, but you can write up pieces of it as
the study is conducted to support publications and reporting to the funder. Plus, it
makes writing those products much easier.
8: Be as open as possible (Open Science)
Open isn’t an all or nothing choice
• Study registration
• Open notebook science
• Data sharing (raw data, processed data, data supporting published
results)
• Open Data
• Open Access publishing (deposit pre/post print in a repository,
choose OA journal, choose Gold OA option)
Want to learn more? Center for Open Science Why Open Research?
9: Think ahead
How might your research be useful to others? To yourself
in 5/15/50 years? Your students or trainees? Historians?
• Consider what you will forget in that time and document it
• Consider whether your data will be useful beyond the life of the project. If
so, put it somewhere safe like an institutional or subject repository to share
it and ensure long-term access.
Stakeholders in (Academic) RDM
• Research Administration
• Research Compliance
• University IT
• University Libraries
• University Archives
• Consortia (e.g., CIC)
• NIH CTSA Hubs/NCATS
• Research & Technology Corporation: http://iurtc.iu.edu/
Case studies: Discussion
1. http://retractionwatch.com/2015/11/05/got-the-blues-you-
can-still-see-blue-after-all-paper-on-sadness-and-color-
perception-retracted/
2. https://ori.hhs.gov/content/case-summary-anderson-david
Resources
1. Uhlir, P. F. (2010). Information Gulags, Intellectual Straightjackets, and Memory Holes. Data Science Journal, 9, ES1-ES5.
2. DataONE Education Module: Data Management. DataONE. Retrieved December 2013. From
http://www.dataone.org/sites/all/documents/L01_DataManagement.pptx
3. Scientists are hoarding data and it’s ruining medical research: http://www.buzzfeed.com/bengoldacre/deworming-trials
4. Losing data from the National Centre for E-Social Science (NCESS) Portal:
http://datastories.jiscinvolve.org/wp/2015/08/10/losing-data-from-the-national-centre-for-e-social-science-ncess-
portal/
5. Biomedical data sharing and reuse: Attitudes and practices of clinical and scientific research staff:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129506
6. Over half of psychology studies fail reproducibility test: http://www.nature.com/news/over-half-of-psychology-
studies-fail-reproducibility-test-1.18248
7. Value of Open Data Sharing: https://www.fosteropenscience.eu/sites/default/files/pdf/2536.pdf
8. Michener, W. K., Brunt, J. W., Helly, J. J., Kirchner, T. B., & Stafford, S. G. (1997). Nongeospatial metadata for the
ecological sciences. Ecological Applications, 7(1), 330-342.
9. Society for Clinical Data Management. (2013). Good Clinical Data Management Practices. Washington, D.C.
10. UK Data Archive. (2015). Prepare and manage data. From http://ukdataservice.ac.uk/manage-data.


Editor's Notes

  • #2 Value/The why Content Activity
  • #4 SHARED VALUES IN SCIENTIFIC RESEARCH. Honesty: convey information truthfully and honor commitments. Accuracy: report findings precisely and take care to avoid errors. Efficiency: use resources wisely and avoid waste. Objectivity: let the facts speak for themselves and avoid improper bias. (Steneck, N. H. 2007. ORI - Introduction to the Responsible Conduct of Research. Washington, D.C.: U.S. Government Printing Office, p. 3)
  • #7 This is the idea behind the push towards greater openness in information and the products of research.
  • #8 7
  • #9 There are lots of ways to lose data – some estimate that 40% or nearly half of researchers have lost data due to human, hardware, software, and organizational failure
  • #10 Other hot topics in research these days include reproducibility, data sharing & reuse, and loss of data due to organizational failure.
  • #11 As our personal, professional, and societal systems are increasingly built on data, we must make sure that those data are as accurate as possible.
  • #12 We are also facing an immense amount of data being created everyday that while not produced by researchers is potentially useful in addressing grand research challenges. How can we tailor cancer treatment to the individual? How can we cure cancer? How can we create environments to promote mental health? How do we sort, find, and make use of this distributed, disorganized, temporary stream of data? Source: http://sustainablesmartbusiness.com/2016/02/how-can-big-data-help-deliver-sustainability-strategies/
  • #13 Mobile Parkinson Disease Study: http://parkinsonmpower.org/
  • #14 Good data practices are a key but often unrecognized part of the research process. If it is done badly, it leads to wasted time and money and potentially even lives as we pursue the wrong path.
  • #15 On an individual level, the immediate challenge for researchers is how to choose from the options available to them and how and choose the easiest and most appropriate tools that will integrate with their research workflows. Source: http://www.gocertify.com/articles/online-resources-can-help-you-get-started-in-big-data
  • #16 While there are many societal benefits to good data practices, particularly open data, behavioral change only happens when the right incentives are in place. Currently, there are far more sticks in place, albeit rather weak ones, than carrots. For the existing workforce, it also helps if they see that there is a problem to be solved by changing their data practices. But we must also train new researchers are trained to use good data practices from day one. It helps a lot if the new way is easier or more convenient, but that isn’t always the case.
  • #17 Another perspective on the value of managing data well comes from a scientist who works on the US Geological Survey. When data are used for policy decisions that potentially affect millions of people, the integrity of the data is crucial and must be demonstrated.
  • #21 It can be helpful to think of data management as a set of activities that occur throughout the data lifecycle. This lifecycle extends beyond the grant period and the writing phases.
  • #23 Although people are good at problem solving, we are bad at being consistent over time and documenting all the details of how a study is conducted.
  • #24 We assume we will remember; we assume bad things won’t happen to us; we underestimate the time required to get things done; we incorrectly interpret information based on our chosen idea or concept. While technical solutions can solve some of the problems as we transition to a fully digital research environment, the more challenging issues to be solved are cultural and behavioral. There are an increasing number of people engaged in meta-research – studying the way that research is funded, conducted, reported, published, and preserved for long-term access. https://en.wikipedia.org/wiki/List_of_cognitive_biases#Decision-making.2C_belief.2C_and_behavioral_biases
  • #26 In contrast to how we usually think of the people who deal with data within a particular research project, this diagram reflects the diverse roles that people have with data across the full data lifecycle. While you might argue with the titles – I do – the recognition that data produced by a research team has value and purpose beyond that project and requires many people to take care of it is an important one. Since the purpose of the diagram is to highlight the skills needed to support data management activities, it leaves out organizational stakeholders such as funding agencies, publishers, data archives/centers, and academic and research institutions. Source: http://data-forum.blogspot.com/2008/12/rdmf2-core-skills-diagram.html
  • #27 The need for greater transparency in research has been ongoing for several decades, at least. More recently, the power of the internet to disseminate data with funding agency responses to demands for public access have accelerated the conversation.
  • #33 We are still learning what the most effective research data practices are in the digital environment. Talking through all of them would take more time than we have today, so I’m going to share those that are generally relevant to the widest range of researchers and tools. Though the technologies and techniques we use today will surely change, the principles underlying these practices will remain true.
  • #34 This is a list of some core practices – as usual, how you do this depends on the norms of your field, funder policies, institutional policies, publisher policies, and other factors. So let’s dive a little deeper into specifics.
  • #39 When we rely on our memory of a project, significant information is lost over time. During a research project, we are storing and using rich information about a project or dataset. But as we move through the project towards completion, the details begin to fade. A variety of circumstances can intervene, and eventually detailed knowledge about the dataset fades. Without documentation, this data might be unusable. A dataset is not considered complete without documentation to accompany it.
  • #51 Lastly, I hope you REMEMBER: Research is a group enterprise; knowledge is based on an accumulation of evidence supporting one theory over another, NOT a single study