Big data is changing how research is conducted and allowing new kinds of questions to be asked. Meanwhile, data management has enabled a rapid increase in the dissemination and preservation of research products, and many funding agencies, including the National Science Foundation and the National Institutes of Health, now require data management plans in grant applications. The combination of big data applications and data management processes has created new opportunities and pitfalls for researchers. In the past year, prominent scientists, including the Director of the NIH, have suggested that inappropriate methodology for data acquisition, analysis, and storage has led to a gap in the translation of basic research findings into clinical cures. In this session we will track data through all research stages and describe best practices and the university resources available to faculty grappling with these important issues.
5. Infrastructure
• Where do you store it?
• How do you move it?
• How do you analyze it? (HPC?)
+ Ultra High Speed Research LAN
+ College or Department Servers
+ Bioinformatics & other Clusters
http://istec.colostate.edu/activities/hpc/
6. Data Acquisition/Generation
Reuse existing data
• Where to find it?
• How to understand/use it?
• Do you trust it?
Create your own data
• Metadata + README files (see the sketch below)
• Data provenance
• Privacy, security, proprietary data
• Dual Use Research of Concern (DURC)
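A minimal sketch of a metadata sidecar written alongside a newly created data file; the field names and file names here are hypothetical illustrations, not a required standard:

```python
import json
from datetime import date

# Hypothetical metadata for a raw data file. Capturing who, when, how,
# and with what instrument supports later reuse and provenance tracking.
metadata = {
    "title": "Example assay readings",
    "creator": "J. Researcher",
    "date_created": date.today().isoformat(),
    "instrument": "PlateReader-X (hypothetical)",
    "protocol": "protocols/assay_v2.txt",  # pointer to the method used
    "units": {"absorbance": "AU"},
    "notes": "Raw, unprocessed export; see README for column definitions.",
}

# Write the sidecar next to the data file so the metadata travels with it.
with open("assay_readings.csv.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

A plain-text README in the same directory can carry what JSON handles poorly: column definitions, collection conditions, and known caveats.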
7. Data Management
• Access/Permissions
• File Naming
• Metadata
• Organization
• Collaboration
• Version Control
• Fixity/Integrity (see the checksum sketch below)
http://lib.colostate.edu/services/data-management
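One concrete way to maintain fixity/integrity is to record a cryptographic checksum when a file is deposited and recompute it later; a mismatch means the file changed or was corrupted. A minimal sketch (the file name is hypothetical):

```python
import hashlib

def sha256_checksum(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file in chunks (handles large files)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the checksum at deposit time...
recorded = sha256_checksum("assay_readings.csv")
# ...then recompute after any copy, transfer, or archival step.
assert sha256_checksum("assay_readings.csv") == recorded, "Fixity check failed"
```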
8. Dissemination
Where to share your data?
• Institutional Repository
• Discipline Specific Repository
How to cite your data?
• Permanent identifier (DOI, Handle, PURL, etc.)
• Citation standards (worked example below)
http://lib.colostate.edu/services/data-management/citing-data
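As a worked example, a DataCite-style data citation pairs creator, year, title, repository, and a permanent identifier (the name and DOI below are hypothetical placeholders):

Researcher, J. (2014). Example assay dataset [Data set]. Colorado State University Digital Repository. https://doi.org/10.xxxx/example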
10. Public Outcry Regarding Data Integrity
• “Why Most Published Research Findings Are False,” Ioannidis, 2005
• “Update of the Stroke Therapy Academic Industry Roundtable Preclinical Recommendations,” Fisher et al., 2009
• “Science Publishing: The Trouble with Retractions,” Van Noorden, 2011
• “Believe It or Not: How Much Can We Rely on Published Data on Potential Drug Targets?” Prinz et al., 2011
• “Misconduct Accounts for the Majority of Retracted Scientific Publications,” Fang et al., 2012
• “Drug Development: Raise Standards for Preclinical Cancer Research,” Begley & Ellis, 2012
11. Integrity - Reliability - Translation
• “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience,” Button et al., 2013
• “Challenges in Translating Academic Research into Therapeutic Advancement,” Matos et al., 2013 (epilepsy)
• “Reproducibility,” McNutt, 2014
• “NIH Plans to Enhance Reproducibility,” Collins & Tabak, 2014
• “Reproducibility: Fraud Is Not the Big Problem,” Gunn, 2014
• Taxpayers are wasting their investment because the integrity of basic research is flawed, not due to intentional misconduct but to unintentional mismanagement.
12. Research Misconduct
1. Fabrication, falsification, plagiarism, or other practices that seriously deviate from those that are commonly accepted within the relevant scientific/academic community for proposing, conducting, reviewing, or reporting research; that
2. Has been committed intentionally, knowingly, or recklessly; and, that
3. Has been proven by a preponderance of the evidence (more likely than not).
Misconduct does not include honest error or honest differences in interpretations or judgments of data.
13. Reporting Concerns
• All employees and individuals associated with CSU should report observed, suspected, or apparent research misconduct to their Department Head, Dean, the RIO, and/or the Vice President for Research.
• If an individual is unsure whether a suspected incident falls within the definition of research misconduct, a call may be placed to one of these individuals to discuss the suspected misconduct informally.
http://reportinghotline.colostate.edu/
14. Research Integrity Officer
› Primary contact for departments and deans with questions about potential misconduct issues
› Represents CSU with the PHS Office of Research Integrity (ORI), NSF, USDA, etc.
› Manages the CSU MIS process to meet institutional, state, and federal standards
› Kathy.Partin@colostate.edu
15. External Pressure to Fix or Be Fixed
• Issues with data reliability have brought external pressure on the scientific community
• From Congress
• President’s Council of Advisors on Science and Technology (PCAST) – “Improving Scientific Reproducibility in an Age of International Competition and Big Data,” 2014
http://www.tvworldwide.com/events/pcast/140131/
• From the popular press and “watchdog” websites/blogs
• The Economist – “Unreliable Research: Trouble at the Lab,” 2013
• NYT – “New Truths That Only One Can See,” 2014
• RetractionWatch.com
16. The Gap Between Applied & Basic Research
The two opposite and contrary forces of data:
• Innovation: dynamic, agile, discovery, exploration, optimization, creative, outside-the-box, anti-dogmatic (pre-preclinical study)
• Reliability: reproducible, robust, translatable to bedside, rigid, immutable, non-optimized, boring (preclinical or clinical study)
17. What needs to change?
• Funding agencies need to raise the bar for data acquisition
• Publishers need to raise the bar for data quality
• Academic institutions need to reassess how success is defined
• Academic institutions need to provide their faculty with the right tools and training to do it right
• Faculty need to pass this down to their trainees
18. External Changes
• NIH appears to be
• Developing a new training module on good experimental design to disseminate
• Developing a data checklist for grant proposals
• DDI – Data Discovery Index
• Adopting a new biosketch format to reduce the focus on the number of publications and increase the focus on the impact of publications
• Considering blinded review of grant proposals
• Science Exchange Reproducibility Initiative
19. DDI
“In summary, a Data Discovery Index (DDI) emphasizes development of an adaptable, scalable system through active community engagement that would serve as an index to large biomedical datasets.”
Rather than a traditional “catalog,” the DDI concept stresses discoverability, access, and citability.
It is an index of raw data, which rarely saw the light of day in academic research before.
20. Publishers
• Preventing plagiarism with iThenticate
• Preventing Fabrication/Falsification with new data checklists
• Abolishing word limits on methods sections
21. Six Common Experimental Failings
1. Poor experimental design
2. Poor reagents
3. Poor analysis
4. Failure to reject hypothesis after observing discordant, valid experimental results
5. Deliberate bias in selecting positive rather than negative results to report, publish, cite, and fund
6. Failure to follow through when wondering “Why is this result NOT what I expected?”
22. Statistics & General Methods
1. How was the sample size chosen to ensure adequate power to detect a pre-specified effect size? (A power-calculation sketch follows this list.)
2. Describe inclusion/exclusion criteria if samples, subjects, or animals were excluded from the analysis. Were the criteria pre-established?
3. If a method of randomization was used to determine how samples/subjects/animals were allocated to experimental groups and processed, describe it.
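Item 1 is the kind of question a prospective power calculation answers; a minimal sketch using statsmodels, with illustrative numbers (choose your effect size, alpha, and power before collecting any data):

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative values: detect a standardized effect size d = 0.8 with a
# two-sample t-test at alpha = 0.05 and 80% power.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"Need about {n_per_group:.0f} subjects per group")  # ~26 per group
```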
23. Statistics & General Methods
4. If the investigator was blinded to the group allocation during the experiment and/or when assessing the outcome, state the extent of blinding.
5. For every figure, are statistical tests justified as appropriate? Do the data meet the assumptions of the tests (e.g., normal distribution)? (An assumption-check sketch follows this list.)
a) Is there an estimate of variation within each group of data?
b) Is the variance similar between the groups that are being statistically compared?
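Items 5, 5a, and 5b can be checked directly before a test is chosen; a sketch using scipy with simulated data (a real analysis would load the measured values instead):

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for two experimental groups (illustration only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.5, scale=2.0, size=30)

# Item 5: do the data meet the normality assumption of a t-test?
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Item 5a: report an estimate of variation within each group.
sd_a, sd_b = group_a.std(ddof=1), group_b.std(ddof=1)

# Item 5b: is the variance similar between the compared groups?
_, p_equal_var = stats.levene(group_a, group_b)

print(f"Normality p: {p_norm_a:.3f}, {p_norm_b:.3f}; SDs: {sd_a:.2f}, {sd_b:.2f}")
if p_equal_var > 0.05:
    print(stats.ttest_ind(group_a, group_b))                   # standard t-test
else:
    print(stats.ttest_ind(group_a, group_b, equal_var=False))  # Welch's t-test
```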
25. Good Laboratory Practice for Data
A Attributable (who made the entry)
L Legible
C Contemporaneous/Complete
O Original
A Accurate
26. Data Notebooks – Another Vulnerability
• Binders
• Electronic Notebooks
• Software documentation
• Field notes
• Images
• Algorithms
27. Data Corrections & Amendments
• Errors, additions, and modifications should be identified by crossing out the original data with a single line (do not obscure the initial data) and initialing, dating, and providing a reason for the change.
• Missing or obscured data/pages are often interpreted as intentional obfuscation of data
• Absence is interpreted as guilt
Academic institutions are under more scrutiny than ever, due to public outcry over a perceived lack of data integrity. THIS WILL HAVE AN IMPACT ON FUNDING!
Most people suspect intentional misconduct – altering the data record in favor of your hypothesis. In fact, most studies suggest that unintentional mismanagement is the more likely culprit.
Research Misconduct definition – a key discriminator is intentionality.
If you have concerns about data integrity, you can be an anonymous whistleblower.
Or you can contact me with a “hypothetical” scenario and I will protect your identity.
Let’s take a closer look at unintentional problems with data. We need to auto-correct and we need to expect greater external scrutiny.
As we strive to move toward applied solutions instead of pure research, think about the tension between pure discovery and its application.
So, if we are ready to keep our side of the street clean, what do we need to do?
Sponsors will add increased requirements regarding data integrity; you may need help with boilerplate verbiage in your grants to demonstrate your approach to this issue.
Uploading your raw data! I never dreamed of doing this in the past. Of course, if you have protected data (human subjects), there are more hoops to jump through.
The Libraries can hook you up with iThenticate. Expect that when you submit a peer-reviewed article, it will be run through iThenticate.