Conquering Chaos in the Age of Networked Science: Research Data Management
1. Conquering Chaos in the Age of Networked Science:
Research Data Management*
*Adaptation of the NECDMC First Module
Kathryn M. Houk, MLIS
Tufts University Hirsh Health Sciences Library
Wednesday June 4, 2014
Librarians: Your Partners in Research
2. Today’s Objectives
Recognize what research data is and what data
management entails
Recognize why managing data is important
Identify common data management issues
Learn best practices and resources for managing these
issues
Learn about how the library can help you identify data
management resources, tools, and best practices
3. What is Data?
• “Research data, unlike other types of information, is
collected, observed, or created, for purposes of
analysis to produce original research results”
(University of Edinburgh).
• Observational
• Experimental
• Simulation data
• Derived or compiled data
4. Why Should I Manage it?
• Transparency & Integrity
• Compliance
5. Science & Personal Benefits
• Who uses your data now?
• Who COULD use your data?
• Shared/Open Data
• Scientific progress
• Impact on your career
• Citation counts
6. What if I Don’t Consider RDM?
Data Sharing and Management Snafu in 3 Short Acts:
A data management horror story by Karen Hanson,
Alisa Surkis and Karen Yacobucci.
http://www.youtube.com/watch?v=N2zK3sAtr-4
8. Data Management Plans
• What types of data will be created?
• Who will own, have access to, and be responsible
for managing these data?
• What equipment and methods will be used to
capture and process data?
• Where will data be stored during and after?
9. Simplified Data Management Plan
1. Types of data
• What types of data will you be creating or capturing? (experimental measures, observational
or qualitative, model simulation, existing)
• How will you capture, create, and/or process the data? (Identify instruments, software,
imaging, etc. used)
2. Contextual Details (Metadata) Needed to Make Data Meaningful to others
• What file formats and naming conventions will you be using?
3. Storage, Backup and Security
• Where and on what media will you store the data?
• What is your backup plan for the data?
• How will you manage data security?
4. Provisions for Protection/Privacy
• How are you addressing any ethical or privacy issues (IRB, anonymization of data)?
• Who will own any copyright or intellectual property rights to the data?
5. Policies for re-use
• What restrictions need to be placed on re-use of your data?
6. Policies for access and sharing
• What is the process for gaining access to your data?
7. Plan for archiving and preservation of access
• What is your long-term plan for preservation and maintenance of the data?
10. Creating a DMP & Considering
Long-Term DM Issues
• Read the case study provided
• Your group is assigned a set of questions (labeled Group 1-6)
to answer as best you can
• First set of questions are from one section of the simplified DMP
• 2nd set of questions highlight an issue that arises in day-to-day or
long-term management of research data (a more detailed level)
• Elect a group speaker
• Each group will discuss their answers
• We will go over the issue associated with your section, common
problems, and best practices
11. Group 1
• DMP Section 1: Types of Data
1. What types (e.g. images, lists of readings, text documents) of
data are being collected for this study?
2. What analytical methods and tools are being used in this
study?
3. What types of data will be generated from these analytical
tools and methods?
• Detailed Planning
1. What naming conventions are being used in the lab?
2. Is there a structure for saving files in the lab?
3. What kind of information would you include in a naming
convention for files?
4. What kinds of things would you avoid in naming/labeling files?
12. Issue: Records Management
• Does this sound familiar?
• Inconsistently labeled files
• in multiple versions…
• inside poorly structured folders…
• stored on multiple media…
• in multiple locations…
• and in various formats…
13.
14. Issue: Records Management
• Best Practices:
• Avoid special characters in a file name.
• Use capitals or underscores instead of periods or spaces.
• Use 25 or fewer characters.
• Use documented & standardized descriptive information
about the project/experiment.
• Use the ISO 8601 date format: YYYYMMDD.
• Include a version number.
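The naming practices above can be checked mechanically. What follows is a minimal Python sketch, not part of the original slides. The function names `build_name` and `check_name`, the field order (project, experiment, date, version), the sample values, and the choice to apply the 25-character limit to the name without its extension are all assumptions made for illustration.

```python
import re
from datetime import date, datetime

MAX_STEM_LEN = 25  # assumption: the limit applies to the name without its extension
ALLOWED = re.compile(r"^[A-Za-z0-9_]+$")  # letters, digits, and underscores only


def build_name(project, experiment, day, version, ext):
    """Assemble a name as project_experiment_YYYYMMDD_vNN.ext (hypothetical layout)."""
    return f"{project}_{experiment}_{day:%Y%m%d}_v{version:02d}.{ext}"


def check_name(filename):
    """Return a list of problems found; an empty list means the name passes."""
    stem, dot, _ext = filename.rpartition(".")
    if not dot:  # no extension present, so the whole name is the stem
        stem = filename
    problems = []
    if len(stem) > MAX_STEM_LEN:
        problems.append("more than 25 characters")
    if not ALLOWED.match(stem):
        problems.append("contains spaces, periods, or special characters")
    found = re.search(r"(?<!\d)(\d{8})(?!\d)", stem)
    if not found:
        problems.append("no YYYYMMDD date")
    else:
        try:
            datetime.strptime(found.group(1), "%Y%m%d")
        except ValueError:
            problems.append("date is not a valid YYYYMMDD value")
    if not re.search(r"_v\d+$", stem):
        problems.append("no version number")
    return problems


good = build_name("Heart", "E12", date(2014, 6, 4), 1, "csv")
print(good, check_name(good))
print(check_name("final data (new).xlsx"))
```

A lab could run a check like this over a shared folder to find files that drift from the agreed convention; the exact fields to include should come from the lab's own documented standard.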
16. Group 2
• DMP Section 2: Contextual Details (Metadata)
1. What contextual details would the researcher need to document
to make her data meaningful to others?
2. How would a lack of naming and labeling conventions impact later
data access by other researchers and possibly herself?
• Detailed Planning
1. What general information do you think is needed for scientific data
to make it discoverable? (ex. Think of a search screen and a
dropdown menu of where you can search for a term: Title, Author,
Genre, etc.)
2. Are you aware of any metadata standards for the life or health
sciences?
3. Do you think all metadata has to be hand-entered or recorded?
4. How would you ensure lab members knew to collect and record
specific information in standard ways?
17. Issue: Metadata
• How will someone make sense of your data e.g. the cells
and values of your spreadsheet?
• What universal or disciplinary standards could be used to
label your data?
• How can you describe a data set to make it
discoverable?
20. Issue: Metadata
• Title
• Creator
• Identifier
• Subject
• Funders
• Rights
• Access information
• Language
• Dates
• Location
• Methodology
• Data processing
• Sources
• List of file names
• File Formats
• File structure
• Variable list
• Code lists
• Versions
• Checksums
21. Issue: Metadata
• Best Practices
• Describe the contents of data files
• Define the parameters and the units on the parameter
• Explain the formats for dates, time, geographic coordinates,
and other parameters
• Define any coded values
• Describe quality flags or qualifying values
• Define missing values
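One way to act on these practices is a small machine-readable data dictionary kept next to the data file. The Python sketch below is illustrative only: the column names, units, codes, and the `-999` missing-value marker are invented for the example and do not come from the case study.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column in the data file.
DATA_DICTIONARY = {
    "subject_id": {"description": "De-identified participant code", "units": None},
    "visit_date": {"description": "Date of visit", "units": "ISO 8601 YYYYMMDD"},
    "sex": {"description": "Self-reported sex", "units": None,
            "codes": {"1": "female", "2": "male"}},
    "wbc": {"description": "White blood cell count", "units": "10^9 cells/L",
            "missing": "-999"},
}


def undocumented_columns(header):
    """List columns that appear in the data but not in the dictionary."""
    return [name for name in header if name not in DATA_DICTIONARY]


def describe(row):
    """Translate one raw row into labeled, human-readable values."""
    labeled = {}
    for name, raw in row.items():
        entry = DATA_DICTIONARY[name]
        if raw == entry.get("missing"):
            labeled[name] = "missing"          # defined missing-value marker
        elif "codes" in entry:
            labeled[name] = entry["codes"].get(raw, f"undefined code {raw}")
        elif entry["units"] is None:
            labeled[name] = raw
        else:
            labeled[name] = f"{raw} ({entry['units']})"
    return labeled


SAMPLE = "subject_id,visit_date,sex,wbc\nP001,20140604,1,-999\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(undocumented_columns(rows[0].keys()))
print(describe(rows[0]))
```

Keeping units, coded values, and missing-value markers in one place means a later reader does not have to guess what `1` or `-999` meant.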
22. Group 3
• DMP Section 3: Data Backup, Storage, and Security
1. Where and on what media will the data from each source be
stored?
2. How, how often, and where will the data be backed up?
3. Are there any security concerns for the data and have they
been addressed?
• Detailed Planning
1. How many copies of your data do you think you should have
and where should you keep them?
2. Is there any group on campus you think could help you with
backup and security/access concerns?
3. What are some good data storage and backup practices you
know about or practice?
23. Issue: Backup & Security
• How often should data be backed up?
• How many copies of data should you have?
• Where can you store your data?
• How much server space can I get?
24. Issue: Backup & Security
• Best Practices
• Make 3 copies (original + external/local + external/remote)
• Have them geographically distributed (local vs. remote)
• Use a hard drive (e.g. Vista backup, Mac Time Machine, UNIX
rsync) or tape backup system
• Cloud Storage - some examples of private sector storage
resources include: (Amazon S3, Elephant Drive, Jungle
Disk, Mozy, Carbonite)
• Unencrypted is ideal for storing your data because it will
make it most easily read by you and others in the future…but
if you do need to encrypt your data because of human
subjects then:
• Keep passwords and keys on paper (2 copies), and in a PGP
(pretty good privacy) encrypted digital file
• Uncompressed is also ideal for storage, but if you need to
compress data to conserve space, limit compression to your 3rd
backup copy
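Keeping three copies only helps if the copies still match the original. The slides list checksums among common metadata fields; the Python sketch below shows one way to use them to compare backup copies. It is an illustration, not a tool recommended by the original presentation: the choice of SHA-256, the function names, and the file names in `demo` are assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copies):
    """Map each copy's file name to True if its checksum equals the original's."""
    expected = sha256_of(original)
    return {Path(copy).name: sha256_of(copy) == expected for copy in copies}


def demo():
    """Create an original plus two copies in a temp folder, then damage one copy."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "readings.csv"
        original.write_text("sample,value\nA,1\n")
        local = root / "local_copy.csv"
        remote = root / "remote_copy.csv"
        shutil.copy(original, local)
        shutil.copy(original, remote)
        remote.write_text("sample,value\nA,7\n")  # simulate silent corruption
        return copies_match(original, [local, remote])


print(demo())
```

Recording the checksum of the original alongside the data lets anyone verify a copy later without access to the original file.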
25. Group 4
• DMP Sections 4. Data protection/privacy and 5. Policies for
reuse of data
1. How is the lab addressing any privacy or ethical issues?
2. Who will own any copyright or intellectual property rights to
the data?
3. Are there any restrictions to the reuse of the data?
• Detailed Planning
1. Are there any reasons to not share or reuse data? Are these
ethical or cultural issues?
2. Will having public funding affect data sharing and reuse
differently than having private funding?
3. Who has the right to make decisions about reuse of your data?
26. Issue: Ownership & Retention
• Intellectual Property Policy
• IRB data retention policy
• Funders’ data retention policy
• Publishers’ data retention policy
• Federal and State laws
28. Issue: Ownership & Retention
• IRB OHRP Requirements: 45 CFR 46 requires research records to be retained
for at least 3 years after the completion of the research.
• HIPAA Requirements: Any research that involved collecting identifiable health
information is subject to HIPAA requirements. As a result records must be
retained for a minimum of 6 years after each subject signed an authorization.
• FDA Requirements 21 CFR 312.62.c Any research that involved drugs,
devices, or biologics being tested in humans must have records retained for a
period of 2 years following the date a marketing application is approved for the
drug for the indication for which it is being investigated; or, if no application is
to be filed or if the application is not approved for such indication, until 2 years
after the investigation is discontinued and FDA is notified.
• VA Requirements: At present records for any research that involves the VA
must be retained indefinitely per VA federal regulatory requirements.
• Intellectual Property Requirements: Any research data used to support a
patent must be retained for the life of the patent in accordance with
Intellectual Property Policy.
• Check with your Funder and Publisher Requirements
• Questions of data validity: If there are questions or allegations about the validity
of the data or appropriate conduct of the research, you must retain all of the
original research data until such questions or allegations have been completely
resolved.
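Because several retention periods can apply to the same project, the governing date is the latest one. The Python sketch below covers only the date arithmetic for the fixed-length periods listed above (3, 6, and 2 years). The function and parameter names are hypothetical, and the sketch deliberately returns no date for open-ended cases such as VA research, patents, or unresolved questions about data validity.

```python
from datetime import date


def add_years(start, years):
    """Shift a date by whole years, mapping 29 February to 28 February when needed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:  # 29 February in a non-leap target year
        return start.replace(year=start.year + years, day=28)


def earliest_disposal(completed, last_authorization=None, fda_trigger=None,
                      retain_indefinitely=False):
    """Return the latest applicable retention end date, or None if open-ended.

    completed:           date the research was completed (3-year period)
    last_authorization:  date the last subject signed an authorization (6-year period)
    fda_trigger:         marketing approval date, or the date the investigation was
                         discontinued and FDA notified (2-year period)
    retain_indefinitely: True for cases with no fixed end date
    """
    if retain_indefinitely:
        return None
    candidates = [add_years(completed, 3)]
    if last_authorization is not None:
        candidates.append(add_years(last_authorization, 6))
    if fda_trigger is not None:
        candidates.append(add_years(fda_trigger, 2))
    return max(candidates)


print(earliest_disposal(date(2014, 6, 4)))
print(earliest_disposal(date(2014, 6, 4), last_authorization=date(2013, 1, 15)))
```

Funder and publisher requirements are not modeled here; they would add further candidate dates to the same comparison.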
29. Group 5
• DMP Section 6: Policies for access and sharing
1. How will others be able to gain future access to the study
data?
2. How does the graduate student plan to link her datasets to her
published article?
• Detailed Planning
1. Could there be a use for the graduate student’s data that was
not used in the published article?
2. Are the data the student collected open formats or proprietary
(will people need specialized software to access and interpret
the data)?
a) How would this affect future accessibility & reuse?
30. Group 6
• DMP Section 7: Plan for archiving and preservation of access
1. What is the long-term strategy for maintaining, curating and
archiving the data?
2. Where will the data be stored?
3. What contextual data (data that describes your data) or other
related data will be included in the archive?
• Detailed Planning
1. What data should be included in an archive?
2. Do you know of any data repositories that you could use for your
data?
3. How can you ensure that your data is discoverable and
interpretable?
4. How long should the data be maintained? What factors affect the
length of time you retain your data?
31. Issue: Long-Term Planning
• What will happen to my data after my project ends?
• How can I appraise the value of my data?
• What are my options for archiving and preserving my
data?
• What are my options for publishing and sharing data?
32. Data Formats
• Is the file format open (i.e. open source) or closed
(i.e. proprietary)?
• Is a particular software package required to read
and work with the data file? If so, the software
package, version, and operating system platform
should be cited in the metadata
• Do multiple files comprise the data file structure? If
so, that should be specified in the metadata
34. Issue: Long-Term Planning
• Best Practices
• When choosing a file format, select a consistent
format that can be read well into the future and is
independent of changes in applications.
• Non-proprietary: Open, documented standard,
Unencrypted, Uncompressed, ASCII formatted
files will be readable into the future.
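As a small illustration of this advice, tabular data can be written out as plain, uncompressed ASCII CSV using only the Python standard library. This is a sketch under assumptions: the function names and the sample columns are hypothetical, and CSV is one open format among several that might suit a given data set.

```python
import csv
import io


def export_ascii_csv(rows, fieldnames):
    """Serialize rows (a list of dicts) as CSV and return ASCII-encoded bytes.

    Raises UnicodeEncodeError if any value contains a non-ASCII character,
    which flags content that needs attention before archiving as ASCII.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("ascii")


def is_plain_ascii(data):
    """Return True if the bytes decode cleanly as ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


exported = export_ascii_csv([{"sample": "A1", "temp_c": "36.6"}], ["sample", "temp_c"])
print(exported)
print(is_plain_ascii(exported), is_plain_ascii("37 °C".encode("utf-8")))
```

Units that rely on special symbols, such as the degree sign, can be moved into the column name or the metadata so the data file itself stays plain ASCII.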
35. Issue: Long-Term Planning
• Librarians can help:
• Identify file formats suitable for long-term preservation
• Interpret your funder or publisher’s repository
requirements
• Find and evaluate a suitable repository for your data
• Upload your data sets to a repository
• Make your data in a repository searchable and
discoverable
• Create a DOI and persistent ID
• Choose metadata standards for increased
discoverability
36. Issue: Data Stewardship
• Challenges
• Team Science
• Managing Laboratory Notebooks
• Rotating Lab Personnel
37. Issue: Data Stewardship
• Best Practices
• Define roles and assign responsibilities for data
management
• Identify skills needed to perform tasks outlined in
DMP and match to available staff
• Develop training plans for continuity
• Assign responsible parties and monitor results
38. How the Library Can Help:
• Teach you, your lab, or
your classes about data
management best
practices
• Write a data
management and/or
sharing plan
• Comply with federal,
funder, and publisher
data sharing policies
• Find & submit your data
to a repository
• Find standards to
describe & label your
data & data files
• Find a data set
• Cite others’ data
• Publish a data set
• Get a DOI for a data set
• Measure the citation
impact of your data set
• Build a collection of
research data that others
can search & access
• Archive & preserve your
data
• Learn about copyright &
license issues
surrounding your data
39. Find Help
• Ask your librarian if the library can help!
• Make it known you are interested in receiving
assistance from the library
• Ask your IT department for information on storage and
security available
• Let them help you make a backup and storage plan
40. Learn More
• Data Management Principles & Education:
• Research Data MANTRA
• DataONE: Best Practices
• UK Data Archives
• MIT Data Management and Publishing Guide
• Data Management Plans
• Digital Curation Centre
• DMPTool2
• DataONE: Data Management Planning
41. Works Cited
Lamar Soutter Library, University of Massachusetts Medical School. 2014.
“New England Collaborative Data Management Curriculum: Module 1.”
http://library.umassmed.edu/necdmc.
DataONE. 2013. “Best Practices for Data Management.”
http://www.dataone.org/best-practices.
MIT Libraries. 2013. “Data Management and Publishing.” MIT
http://libraries.mit.edu/guides/subjects/data-management/index.html.
Office of Research Integrity. 2013. “Data Management.” United States
Department of Health and Human Services. United States Federal
Government.
http://ori.hhs.gov/education/products/rcradmin/topics/data/open.shtml.
Special thanks to Jen Ferguson, Richard Moore and Glenn Gaudette for
permission to use their slides.
Editor's Notes
There are a number of definitions for ‘research data,’ but this is my favorite.
Data covers a broad range of types of information. Can you think of any other types of data that get created during research?
Documents (text, Word), spreadsheets
Laboratory notebooks, field notebooks, diaries
Questionnaires, transcripts, codebooks
Survey responses
Health indicators such as blood cell counts, vital signs
Audio and video recordings
Images, films
Protein or genetic sequences
You may be required by a funder or publisher to maintain the data that underlies your published works and findings.
Managing data is a part of compliance with the University’s IRB and with your funders’ data sharing and data management policies. Funders like the NIH reserve the right to audit your lab notebooks and pre-publication data. Since 2011 the NSF has required a data management plan, and the federal government is currently working to make publicly funded research data available to the public.
The Fair Access to Science and Technology Research (FASTR) Act is a bipartisan effort aiming to make data from federally funded research more open and accessible.
“The Administration is committed to ensuring that…the direct results of federally funded scientific research are made available to and useful for the public, industry, and the scientific community. Such results include peer-reviewed publications and digital data” (Holdren 2013).
Expanding Public Access to the Results of Federally Funded Research, Office of Science and Technology Policy
Publications, private foundations, and specific funders, like the American Heart Association, may also require data management provisions.
Managing data saves you time and effort and avoids duplication of effort: “good RDM = good research.” You can easily find the data you need and make them available should you be asked.
In addition, publishing your data can increase your citation impact and discoverability of your research & help with promotion and tenure.
You don’t know how someone else may use your data in the future.
Anna Gold. Cyberinfrastructure, Data, and Libraries, Part 1: A Cyberinfrastructure Primer for Librarians. D-Lib Magazine, September/October, 2007, Volume 13 Number 9/10 http://www.dlib.org/dlib/september07/gold/09gold-pt1.html.
“Managing and sharing data…
increases the impact and visibility of research;
promotes innovation and potential new data uses;
leads to new collaborations between data users and creators;
maximizes transparency and accountability;
enables scrutiny of research findings;
encourages improvement and validation of research methods;
reduces cost of duplicating data collection;
and provides important resources for education and training”
Increase the visibility of your research
Save time
Simplify your life
Preserve your data
Increase your research efficiency
Documentation
Meet grant requirements
Facilitate new discoveries
Support Open Access
This video from NYU lays a solid groundwork for the issues we will discuss today. In it are several scenarios that highlight data management issues that were identified by the Department of Health and Human Services’ Office of Research Integrity.
Data has a life that extends beyond the project where it was created. This cycle helps you to visualize the activities in order to plan for the project’s data management needs and how data may be collected, stored, described, preserved, and/or shared.
While a 2-page plan for a grant application is very important, every research project will benefit from planning for managing a project’s data throughout the life of the project, including planning for how data will be produced, collected, analyzed, stored, archived or shared, etc.
The DMP is just a snapshot – an executive summary – compared to a comprehensive data management policy for your lab.
Many research funders require that you have a plan to manage and/or share your data. For example, in 2011 the NSF began requiring a 2-page data management supplement with all submitted grant applications. The NIH requires a plan for projects in excess of $500,000.
These are some questions that are commonly addressed in a data management plan.
The NSF has laid the foundation for requiring a data management and sharing plan. You have a copy of a simplified data management plan. It is 7 sections with at least one question that should be answered per section to satisfy the requirements for a 2-page data management document.
This simplified Data Management Plan (DMP) is based on the NSF recommendations for its required 2-page data management plan. If you can answer the questions in the 7 sections of this plan, then you will likely be able to meet any other data management requirements. Some of the sections can be standardized based on your institutional practices. For example, at Tufts, we have standardized language in section 3 because most of our researchers use a University research drive to store and back up their data.
Remember that this is an executive summary for people who do not necessarily know much about your research or the process you will use. It is meant to be high-level and a broad overview. Going into too much detail could be counter-productive as it may make the document too long and might also make it too confusing for grant reviewers.
Provided is a case study from a real researcher and their project. This means that the questions you have may not be answered explicitly. The researcher may not be aware of the need or did not mention it in their interview.
You are split into groups based on the sections of the simplified DMP. For example, Group 1 will answer Part 1 of the DMP based on the information in the case study, but will also have a second set of questions that dive into more detailed issues that arise in your day-to-day or long-term management of data. The second set of questions are more reflective and you can use your own experience and knowledge to answer them. The second set of questions also highlight issues that often arise when starting to think about data management holistically.
When we come back together, your team speaker will tell us your questions and answers and any other insights or questions the group had. We will then talk about the issue related to your group and some Best Practices in regards to that issue before moving on to the next group.
You may not be able to answer everything with confidence – that’s OK! It’s only practice to help get you thinking about these topics and discussing with your colleagues. I will be walking around the room if you need clarification.
You have 20 minutes to discuss and prepare.
If we think back to the video, we realize that a lot of the issues regarding data management relate to inconsistent and confusing file and folder labels, saving data in multiple locations, and not thinking about how someone might find and make sense of your data.
Records management requires thinking about how you and others can both easily find and make sense of your data.
This slide comes from a colleague at Northeastern University. She looked at a sample of data files produced by students collecting data in a bioscience lab.
As you can see, their file naming conventions do not always take into consideration how someone not involved in a project will make sense of what is in the file.
After some time, these files would probably not even make sense to the person involved in creating the file!
These are some best practices for creating file names. Poorly constructed file names can cause issues when transferring files from one format to another, or to another operating system.
For example, a researcher recently found that when she moved files from REDCap™ to her analysis software, the dates were reformatted. In addition, operating systems like Unix can have issues reading files with spaces or special characters.
Here is an example from a biomedical engineering lab that shows how you can add in project information into the file name. Notice that he labels each file with an experiment that links back to the laboratory notebook, so that there could be multiple people in the lab and multiple experiments involving the same sample, but having a systematic approach to labeling and mapping files allows for the efficient retrieval and interpretation of the data.
Thanks to the NSA & Ed Snowden scandal, ‘metadata’ has become a household word!
Often described as data about data, metadata are simply descriptors that can help you record information and create labels to catalog and make sense of your data.
Metadata standards can be used to describe the data’s field labels, their values, elements and parameters, and they can also describe the nature of the files that are produced, such as how many bytes, the format, the software used to create the file, the version, and who created it.
Here is an example of metadata collected about a data set. It states who created the file and when, the format, and descriptive information about the data and its location.
Using REDCap you can upload or create a data dictionary to define the fields, elements, and parameters for your data collection.
Here is another example of metadata from a dataset uploaded to the NCBI “Flybase”.
It incorporates a large amount of scientific disciplinary information such as the strain, tissue, and cell line used in the sample.
Most databases where you upload your data will inform you of the basic metadata they require. Many data repositories actually have experts that create more metadata after you submit in order to make it findable and interpretable.
Here is a list of common metadata fields associated with a data set.
IRB guidelines and IT departments can help you learn where and how to best store & back up your data.
Electronic data should be saved on a device that has the appropriate security safeguards such as unique identification of authorized users, password protection, encryption, automated operating system patch (bug fix), anti-virus controls, firewall configuration, and scheduled and automatic backups to protect against data loss or theft.
When it comes to data ownership and data retention there are a lot of overlapping policies. IP policy can cover the ownership and retention of data related to patents, the IRB wants to ensure that documentation of human subjects’ data are retained and/or destroyed appropriately, and the funders and publishers want you to retain data to defend the integrity of your findings, and then there are federal guidelines like HIPAA.
“How long should I retain data?” is not a clear-cut data management question. Last June, for example, the Journal of Clinical Investigation retracted a published article after 6 years because one of its data tables was duplicated.
The publisher contacted the researchers to have them update the data, but they could not locate the original data files after six years, so the journal was forced to issue a retraction.
This case highlights how difficult it is to know how long to keep data. This article was peer reviewed and cited over 55 times, but it took six years for the representation of its data to be called into question. Thus, thinking about ways to digitize documents and store and preserve electronic files of data in a self-archived, disciplinary, or local data repository is important, and one of the many tasks the library can help you with.
RetractionWatch.com has three different categories dealing with retractions due to data misconduct: fabrication of data, duplication of data, and manipulation of figures/images. There’s also the issue of non-reproducible results. Honest researchers can avoid all of these issues through good data management and preservation practices.
As you can see, data retention is very situation dependent. A sponsor may require you to retain your research related documents. Prior to agreeing to a contract that specifies how long records will be maintained, you should ensure that you will receive adequate funding to pay for their storage and preservation. The library can help provide guidance on long-term preservation.
Again, it would be prudent to check with the sponsor, IRB, and Office of Research before destroying any records…
After a project you may want to consider appraising, and publishing or depositing your data in a repository. There are a variety of factors that impact your ability to share data with outside parties. According to the OHRP, you should contact the IRB prior to proceeding with a release of human subject data unless (a) your subjects signed an IRB approved consent document with HIPAA compliant authorization language that clearly details what information will be collected, used, and disclosed and (b) the outside party is specified in the document.
Archiving & Preserving versus Storage: there’s a big difference! Digital data degrades if it is not properly taken care of. Depositing data in a repository for preservation and open access ensures that data will be properly cared for throughout the rest of its life, however long you determine that to be. If it is forever, then the repository will migrate the data onto the newest, most abundant storage media and convert it into a format that can be interpreted by computers in the future. (Think of the 3.5-inch floppy, the zip drive, the cassette, etc.)
One of the greatest challenges for preservation is thinking ahead about the formats of your data. Type of data is what it is – an image, a survey, etc. Format is how it is encoded by a computer – jpg, doc, txt, etc. Some formats are produced by a specific software that is owned by a particular company. If that software becomes obsolete, so does the ability to read the file formatting and the information contained in that digital object.
This graphic was created by a colleague who observed the number of instruments in just one biomedical lab relying on proprietary software.
This means that to be able to open and view this file, someone would need to know the software that created it, and be able to access that software. Thus, converting your files to open source and sustainable formats and standards is essential for long-term sharing, preservation and access.
One of the greatest challenges in managing data is the distributed nature of modern research. With so many responsibilities, it is easy to not prioritize data management. By assigning data management tasks, you will increase the efficiency of your research.
Laboratory notebooks, paper and electronic, may be audited by the funder, such as NIH. Managing and preserving these notebooks requires a plan.
In many labs personnel are changing constantly. There must be a plan to bridge the data management knowledge of new and outgoing students, post-docs, and staff.
Unless the distribution of responsibility is clear, misunderstandings can result and compliance can be jeopardized.
We hear a lot from students that they have had to learn DM on the go and may have little to no formal training on how to manage a specific project’s data, so do not be afraid to ask for clarification and to ask that DM roles & responsibilities be documented and formalized. This is an important aspect of a DM plan.