This is the PowerPoint for my "Data Management for Undergraduate Researchers" workshop for the Office of Undergraduate Research Seminar and Workshop Series. Major topics include motivations behind good data management, file naming, version control, metadata, storage, and archiving.
1. Data Management for
Undergraduate
Researchers
Office of Undergraduate Research Seminar and Workshop Series
Rebekah Cummings, Research Data Management Librarian
J. Willard Marriott Library, University of Utah
June 18, 2015
2. • Introductions
• What are data?
• Why manage data?
• Data Management Plans
• File Naming
• Metadata
• Storage and Archiving
• Questions
4. What are data?
“The recorded factual material
commonly accepted in the research
community as necessary to validate
research findings.”
- U.S. OMB Circular A-110
7. Why manage data?
Your best collaborator is yourself
six months from now, and your past
self doesn’t answer emails.
8. Why else manage data?
• Save time and work more efficiently
• Meet grant requirements
• Promote reproducible research
• Enable new discoveries from your data
• Make the results of publicly funded research
publicly available
10. Two bears data
management problems
1. Didn’t know where he stored the data
2. Saved one copy of the data on a USB drive
3. Data was in a format that could only be read by
outdated, proprietary software
4. No codebook to explain the variable names
5. Variable names were not descriptive
6. No contact information for the co-author Sam Lee
12. Scenario
You develop a research project during your
undergraduate experience. You write up the
results, which are accepted by a reputable
journal. People start citing your work! Three
years later someone accuses you of falsifying
your work.
Scenario adapted from MANTRA training
module
13. • Would you be able to prove you did the
work as you described in the article?
• What would you need to prove you hadn’t
falsified the data?
• What should you have done throughout
your research study to be able to prove
you did the work as described?
14. Elements of a DMP
• Types of data, including file formats
• Data description
• Data storage
• Data sharing, including confidentiality or
security restrictions
• Data archiving and responsibility
• Data management costs
18. File naming best practices
• File names should include only letters, numbers, and
underscores.
• No special characters (%@#*?!)
• No spaces
• Lowercase or camel case (LikeThis)
• Not all systems are case sensitive. Assume this,
THIS, and tHiS are the same.
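The rules on this slide can be checked automatically. The following is a minimal sketch, assuming a strict reading of the slide (letters, numbers, and underscores only, plus one optional extension); the function names `is_safe_name` and `find_case_collisions` are hypothetical and not part of the workshop. A hyphenated date such as 2015-06-18 would fail this check unless the pattern were relaxed.

```python
import re

# Assumed rule set, taken from the slide: letters, digits, and underscores
# only, plus one optional extension made of letters and digits.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9]+)?")


def is_safe_name(name: str) -> bool:
    """Return True if the name has no spaces or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def find_case_collisions(names):
    """Group names that differ only by case, since some systems treat them as one file."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    for candidate in ["soil_samples_v02.csv", "Soil Samples%.csv"]:
        print(candidate, is_safe_name(candidate))
    print(find_case_collisions(["this.txt", "THIS.txt", "other.txt"]))
```

Running a check like this over a project folder before sharing it is one way to catch names that will break on another person's system.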
19. Dates and numbering…
1. Use leading zeros for scalability
001
002
009
019
999
2. If using dates use YYYYMMDD
June2015 = BAD!
06-18-2015 = BAD!
20150618 = GREAT!
2015-06-18 = This is fine too
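The reason for leading zeros and YYYYMMDD is that file browsers sort names as text, not as numbers or dates. A small sketch of both conventions follows; the helper names `numbered` and `datestamp` are hypothetical.

```python
from datetime import date


def numbered(n: int, width: int = 3) -> str:
    """Zero-pad a sequence number so that text sorting matches numeric order."""
    return f"{n:0{width}d}"


def datestamp(d: date) -> str:
    """Format a date as YYYYMMDD so that text sorting matches date order."""
    return d.strftime("%Y%m%d")


if __name__ == "__main__":
    # Without padding, text sorting puts "19" before "2".
    print(sorted(["1", "2", "9", "19", "999"]))
    # With padding, the order is the one you expect.
    print(sorted(numbered(n) for n in [1, 2, 9, 19, 999]))
    print(datestamp(date(2015, 6, 18)))
```

Three digits cover up to 999 files; if a project could grow beyond that, a wider `width` chosen at the start avoids renaming later.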
20. Who filed better?
• July 24 2014_SoilSamples%_v6
• 20140724_NSF_SoilSamples_Cummings
• SoilSamples_FINAL
21. File organization best
practices
• Top level folder should include project title
and date.
• Sub-structure should have a clear and
consistent naming convention.
• Document your structure in a README
text file.
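One way to make a folder structure consistent is to create it with a short script rather than by hand. The sketch below assumes a hypothetical set of subfolder names and a hypothetical `create_project` helper; the slide only asks for a project title and date at the top level, a consistent convention below it, and a README that documents the layout.

```python
from pathlib import Path
import tempfile

# Hypothetical subfolder names; any clear, consistent convention would do.
SUBFOLDERS = ["01_raw_data", "02_processed_data", "03_analysis", "04_docs"]


def create_project(root: Path, title: str, start_date: str) -> Path:
    """Create <title>_<start_date>/ with numbered subfolders and a README.txt."""
    project = root / f"{title}_{start_date}"
    for sub in SUBFOLDERS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    lines = [f"Project: {title}", f"Started: {start_date}", "", "Folders:"]
    lines += [f"  {sub}/" for sub in SUBFOLDERS]
    (project / "README.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return project


if __name__ == "__main__":
    # A temporary directory keeps the demonstration from touching real files.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp), "soil_samples", "20150618")
        print(sorted(p.name for p in project.iterdir()))
```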
23. Metadata
Unstructured Data
There was a study put out by Dr. Gary Bradshaw from the University of Nebraska Medical Center in 1982 called “Growth of Rodent Kidney Cells in Serum Media and the Effect of Viral Transformation On Growth”. It concerns the cytology of kidney cells.
Structured Data
Title: Growth of rodent kidney cells in serum media and the effect of viral transformations on growth.
Author: Gary Bradshaw
Date: 1982
Publisher: University of Nebraska Medical Center
Subject: Kidney -- Cytology
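The practical difference between the two forms is that a structured record can be searched and exchanged by software. As an illustration only, the record from this slide can be written as key-value pairs; the lowercase field names and the `find_by_field` helper are assumptions, not a formal metadata standard.

```python
import json

# The structured record from the slide, expressed as key-value pairs.
record = {
    "title": "Growth of rodent kidney cells in serum media and the effect "
             "of viral transformations on growth.",
    "author": "Gary Bradshaw",
    "date": "1982",
    "publisher": "University of Nebraska Medical Center",
    "subject": "Kidney -- Cytology",
}


def find_by_field(records, field, value):
    """Return every record whose field exactly equals value."""
    return [r for r in records if r.get(field) == value]


if __name__ == "__main__":
    # JSON is one common plain-text format for storing such records.
    print(json.dumps(record, indent=2))
    print(len(find_by_field([record], "date", "1982")))
```

Answering "which studies are from 1982?" against the unstructured sentence would require reading it; against the structured record it is a single lookup.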
30. Storing sensitive data
• If possible, collect the necessary data
without using direct identifiers
• Otherwise, de-identify your data upon
collection or immediately afterwards
• Do not store or share sensitive data on
unencrypted devices
• Talk to IRB
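As a rough illustration of de-identifying upon collection, the sketch below replaces direct identifiers with participant codes. The column names (`name`, `email`) and the `deidentify` helper are hypothetical, and removing direct identifiers alone may not be enough for every dataset, which is one reason to talk to your IRB.

```python
def deidentify(rows, direct_identifiers=("name", "email")):
    """Replace direct identifiers with sequential participant codes.

    Returns (clean_rows, key). The key maps each code back to the removed
    identifiers, so it is itself sensitive: store it separately from the
    data, and only on an encrypted device.
    """
    clean_rows, key = [], {}
    for i, row in enumerate(rows, start=1):
        code = f"P{i:03d}"
        key[code] = {f: row[f] for f in direct_identifiers if f in row}
        clean = {f: v for f, v in row.items() if f not in direct_identifiers}
        clean_rows.append({"participant_id": code, **clean})
    return clean_rows, key


if __name__ == "__main__":
    # Made-up example rows, not real data.
    survey = [
        {"name": "Example Person A", "email": "a@example.org", "age": 20},
        {"name": "Example Person B", "email": "b@example.org", "age": 22},
    ]
    clean, key = deidentify(survey)
    print(clean)
```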
33. Major takeaways
• Data management starts at the beginning of
a project
• Document your data so that someone else
could understand it
• Have more than one copy of your data
• Consider archiving options when you are
done with your project
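A second copy only helps if it is an intact copy. One common way to confirm that is to compare checksums of the original and the backup; the sketch below uses SHA-256 from Python's standard library, and the helper names `sha256_of` and `backup_and_verify` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> bool:
    """Copy source into backup_dir and confirm the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / source.name
    shutil.copy2(source, copy)
    return sha256_of(source) == sha256_of(copy)


if __name__ == "__main__":
    # A temporary directory stands in for a real data folder and backup drive.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "20150618_soil_samples.csv"
        data.write_text("site,ph\nA,6.8\n", encoding="utf-8")
        print(backup_and_verify(data, Path(tmp) / "backup"))
```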
Specifically, we are going to be talking about data management of your research data, but some of the principles will help you when thinking about the organization of any digital materials: your notes, your PowerPoints, your grocery lists….
Most of these concepts are pretty straightforward and almost seem like common sense, but the reality is that very few people manage their data well. If you do, you will be at a big advantage.
Overview of what we will be covering in this session. Each of these could be a one hour course, but we are going to hit the highlights so to speak.
Introductions
Name
Major
Are you working on a research project?
What is data?
(are/is debate)
This is the definition that most people refer to.
Recorded factual material
Validate your research findings – when you write up your research it usually ends with your findings. What you discovered in the course of your research. Data is how you got there. It’s your proof.
Data are a lot more complicated than that OMB definition. Data is whatever you consider to be evidence for the research that you do. In that way, data can be very subjective.
Scientific data – observations, computational models, lab notebooks
Social sciences – results of surveys, video recordings, field notes
Humanities – text mining, newspapers, records of human history
So what is data – EVIDENCE FOR YOUR RESEARCH
Another attribute of data is that it tends to get messy
Most of us just don’t realize this because our messy, disorganized files are locked up in a neat little box called a computer.
Don’t believe me? How long would it take you to find a photo from five years ago on your computer? Here is a hint. If your image files start with DSC_ or IMG_ followed by some number, it will probably take you a very long time.
If most people’s digital files were analog, this is exactly what they would look like.
The main reason you should manage your data is for yourself and for your own research team.
Data management is one of those essential skills you need to learn, just like learning how to manage citations or understand research methods.
But it can feel a bit boring like filing. But six months later when you want to locate a file, or even understand your file, your future self will thank you.
The most important reason to have good data management is for your own good and the good of your research team. If you want to be able to locate your files or understand your files in the future, good data management is crucial. Plus, unlike research methods and managing citations, this is something that even seasoned scientists are not very good at. So you will have something to offer your research team in the future even as a young scientist.
https://www.youtube.com/watch?v=N2zK3sAtr-4
For all the reasons we have talked about, many agencies are now requiring data management plans at the start of a research project. This means when you apply for funding for a project, you will have to have a two-page data management plan as part of your proposal. That plan is going to talk about the “lifecycle” of your data throughout the course of the project.
How many of you plan on applying for a grant at some point in your careers?
Introduce data lifecycle.
Funders know that the earlier you start thinking about your data, the better. Planning early makes it much more likely that the results of your research will be reproducible, helps avoid data loss, and increases the value of your research.
Hopefully by now you can all see why data management is important. Now we’re going to think a little more deeply about how we can avoid the “Two bears” situation.
Let’s look at this scenario…
Get in groups and talk about this for a few minutes.
The first thing that you would want to have is a DMP. The DMP is going to be your roadmap for good data management. This is the document that you create at the start of a project to think about the lifecycle of your data.
We’ve talked about data management at kind of a high level. What is data? Why should you manage it well?
Now we are going to talk about some of the nuts and bolts of data management. Starting with file naming. How do you currently name files? Do you have a system?
To some extent we are all guilty of bad file naming but when it comes to your research it is important to create a system that makes sense not just to you, but other people as well.
File names should reflect the contents of a file and include enough information to uniquely identify the data file without getting way too long.
Don’t be generic in your file names
Be consistent!!!!
Your file name may include project acronym, location, investigator, date of data collection, data type, and version number. Whatever will help you or someone else uniquely identify that file in the future.
Think about what can be added and what can be omitted in your file names. If you are the only person on a project, you probably don’t need your name. If there are going to be multiple versions of a file, make sure you add a version number or a date to differentiate.
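If you want to show what a consistent pattern looks like in practice, here is a minimal sketch. The function name, the underscore separator, the order of the pieces, and the "BEARS" project acronym are all assumptions made up for illustration, not a required standard. The point is only that every file name is built the same way from the same pieces.

```python
from datetime import date


def build_file_name(project, data_type, collected, version, extension, investigator=None):
    """Build a file name from the same pieces, in the same order, every time.

    The pattern (project_investigator_type_date_version) is a hypothetical
    convention chosen for this example.
    """
    parts = [project]
    if investigator:  # omit when you are the only person on the project
        parts.append(investigator)
    parts.append(data_type)
    parts.append(collected.strftime("%Y%m%d"))  # year first, so names sort by date
    parts.append(f"v{version:02d}")  # zero-padded, so v02 sorts before v10
    return "_".join(parts) + "." + extension


print(build_file_name("BEARS", "survey", date(2015, 5, 14), 2, "csv"))
# prints BEARS_survey_20150514_v02.csv
```

Whatever pattern you pick, write it down once and reuse it, rather than deciding again for every file.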
#1 is the best one.
Descriptive
Not too long, not too short
Nothing that makes it look like your file name is swearing at me.
Uppercase lettering can affect numbering.
There are also best practices around version control and numbering.
Version control is often achieved by using dates or a standard numbering system
#2 is the best choice here.
First example here has spaces, irregular dates that won’t line up in order, special characters
Third example may not be descriptive enough for a secondary user. Also, beware of the “FINAL” as opposed to using a standardized numbering system.
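A quick way to see why irregular dates "won’t line up in order" is to sort a few names the way a file browser does. The file names below are invented for illustration; they are not the ones on the slide.

```python
# Irregular dates (month-day-year, no padding) sort alphabetically, not chronologically.
irregular = ["notes 5-14-15.txt", "notes 12-1-14.txt", "notes 1-3-15.txt"]

# YYYYMMDD dates sort in true date order.
standard = ["notes_20150514.txt", "notes_20141201.txt", "notes_20150103.txt"]

print(sorted(irregular))  # January 2015 lands before December 2014
print(sorted(standard))   # December 2014, January 2015, May 2015

# The same thing happens with version numbers that are not zero-padded.
print(sorted(["draft_v2.docx", "draft_v10.docx"]))   # v10 lands before v2
print(sorted(["draft_v02.docx", "draft_v10.docx"]))  # v02, then v10
```

This is plain alphabetical sorting, so the exact order in a given file browser may differ, but year-first dates and zero-padded versions stay in order either way.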
That is how to name an individual file. What about your whole file structure?
All your research materials need to be in one folder. The top-level folder should include the project title and year. If it is a multi-year project, include the first and last year in the title.
The substructures should have a clear and consistent naming convention that is documented in a README file.
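For anyone who likes to set this up with a script, here is one possible sketch of a top-level project folder with documented subfolders. The subfolder names, the README wording, and the "BearSurvey" title are assumptions for illustration; use whatever structure fits your project. The example runs in a temporary directory so it does not touch your real files.

```python
import tempfile
from pathlib import Path

# Hypothetical subfolders and the one-line description that goes in the README.
SUBFOLDERS = {
    "data_raw": "Original data exactly as collected; never edited.",
    "data_clean": "Cleaned data used for analysis.",
    "documentation": "Survey instruments, codebook, and notes.",
    "analysis": "Scripts and output.",
}


def create_project(root, title, first_year, last_year=None):
    """Create Title_Year (or Title_FirstYear-LastYear) with subfolders and a README."""
    years = str(first_year) if last_year in (None, first_year) else f"{first_year}-{last_year}"
    top = Path(root) / f"{title}_{years}"
    lines = [f"Project: {title} ({years})", "", "Folders:"]
    for name, purpose in SUBFOLDERS.items():
        (top / name).mkdir(parents=True, exist_ok=True)
        lines.append(f"  {name}: {purpose}")
    (top / "README.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return top


with tempfile.TemporaryDirectory() as scratch:
    project = create_project(scratch, "BearSurvey", 2014, 2015)
    print(project.name)
    print((project / "README.txt").read_text(encoding="utf-8"))
```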
Exercise!!
Possible solutions:
Organize by type of file (all transcripts in one folder all audio recordings in another)
Organize by person (Have a Cliff Barrett folder and a Robert Bennett folder)
Problems with file names:
Dates are not standardized
Special characters/spaces
File type in the file name which is unnecessary
Unnecessary information in file name – “found on Internet, think okay, better than mine” picture
NO consistency to file naming
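Some of these problems can be spotted automatically. The sketch below checks a name for spaces, special characters, a missing YYYYMMDD date, and a camera-default prefix. The rules and the example names are assumptions for illustration (the exercise files themselves are not reproduced here), and a real project might choose different rules.

```python
import re
from pathlib import PurePath


def file_name_problems(name):
    """Return a list of naming problems found in a single file name."""
    stem = PurePath(name).stem  # the name without its final extension
    problems = []
    if " " in stem:
        problems.append("contains spaces")
    # Anything other than letters, digits, underscores, and hyphens counts as special.
    if re.search(r"[^A-Za-z0-9_-]", stem.replace(" ", "")):
        problems.append("contains special characters")
    # Look for exactly eight digits in a row, e.g. 20150503.
    if not re.search(r"(?<!\d)\d{8}(?!\d)", stem):
        problems.append("no YYYYMMDD date")
    if re.match(r"(DSC|IMG)_\d+$", stem):
        problems.append("generic camera name")
    return problems


print(file_name_problems("Barrett interview 5-3-15 (FINAL!).docx"))
print(file_name_problems("BEARS_interview_Barrett_20150503_v01.docx"))  # prints []
```

A check like this will not tell you whether a name is descriptive, which is still a judgment call.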
Metadata is very, very important for other people looking to use your project.
Often called data about data.
Structured information about an object.
Mention that there are standards for creating metadata (e.g., Dublin Core), including subject-specific standards.
Data needs context to be understandable
If you have a spreadsheet of survey responses, you need to have the survey to understand the responses.
You also need the codebook that explains your variable names and the values that you used, how you cleaned your data. Once again, try to think how a secondary user would interpret your data.
Going back to file organization, make sure your data documentation is stored in the same folder as the data.
You must make a codebook and include it in your documentation.
This is documenting at a variable level. It’s just as important that you document at a Project and file level as well.
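A codebook does not need to be fancy. As a rough sketch, it can be a small table with one row per variable, saved next to the data. The column headings and the survey variables below are invented for illustration. The second function shows a simple check a secondary user would appreciate: every column in the data should appear in the codebook.

```python
import csv
import io

# Hypothetical codebook for a small survey: one row per variable.
codebook = [
    {"variable": "resp_id", "description": "Anonymous respondent number", "values": "integer", "missing_code": ""},
    {"variable": "q1_year", "description": "Year in school", "values": "1=first year; 2=sophomore; 3=junior; 4=senior", "missing_code": "-9"},
]


def codebook_to_csv(rows):
    """Render the codebook as CSV text that can be saved alongside the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["variable", "description", "values", "missing_code"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented(columns, rows):
    """List data columns that have no entry in the codebook."""
    documented = {row["variable"] for row in rows}
    return [column for column in columns if column not in documented]


print(codebook_to_csv(codebook))
print(undocumented(["resp_id", "q1_year", "q2_hours"], codebook))  # prints ['q2_hours']
```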
Summary, good data documentation includes…
Through the course of your research your data needs to be stored securely, backed up, and maintained regularly. Once again this sounds like common sense, but you will be happy that you paid some attention to it (e.g., when your laptop crashes or is stolen).
https://www.youtube.com/watch?v=QyMgNZHtdk8
#1 rule of data storage – never just keep your data on one device. You are one dropped computer, one spilled glass of water, one unscrupulous thief away from losing all of your data. Every single day I go to Mom’s Café and see people leave their computers at their table while they go to the bathroom or grab a cup of coffee.
LOCKSS (Lots of Copies Keep Stuff Safe) - There should never be just one copy of your data. Do you back up your data? It is the most important data management task. Keep no fewer than two, preferably three, copies of research data.
How well are you covered against unexpected loss? Make sure that when disaster strikes, it isn’t a disaster.
There are three options for storing your data:
Personal computers and laptops – Convenient for storing your data while in use. Should not be used for storing master copies of your data.
Networked drives – Highly recommended. You can share data. Your data is stored in a single place and backed up regularly. Available to you from any place at any time. If you use a department drive or Box, your data is stored securely, thereby minimizing the risk of loss, theft, or unauthorized access. BEST!!!
External storage devices – thumb drives, flash drives, external hard drive. Cheap, easy to store and pass around. Feel better knowing it’s in your hands where you can see it. Not recommended for the long-term storage of your data.
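To make the "more than one copy" rule concrete, here is a minimal sketch that copies a file to several locations and confirms each copy matches the original by comparing checksums. The folder names are placeholders and everything runs in a temporary directory; in real use the backup locations would be a networked drive and an external drive, not folders on the same computer.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dirs):
    """Copy source into each backup folder and verify every copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for folder in backup_dirs:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also keeps the file timestamps
        if sha256_of(target) != expected:
            raise OSError(f"Backup at {target} does not match the original")
        copies.append(target)
    return copies


with tempfile.TemporaryDirectory() as scratch:
    original = Path(scratch) / "BEARS_survey_20150514_v02.csv"
    original.write_text("resp_id,q1_year\n1,3\n", encoding="utf-8")
    # Placeholder folders standing in for a networked drive and an external drive.
    made = back_up(original, [Path(scratch) / "network_drive", Path(scratch) / "external_drive"])
    print(len(made) + 1, "copies in total")
```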
Throughout the course of your research, many of you may collect data that is referred to as human subject data. If you do this, you will need to work with the IRB office on campus to figure out how to protect the privacy of your research subjects. Ultimately, the IRB has the final say, but here are some tips for keeping your confidential data confidential.
Direct vs. Indirect identifiers
Another area of data management that you will have to consider is data archiving.
Archiving is not the same thing as storage
Archiving adds additional value to your data.
Long-term preservation
Metadata
Sharable, usually through a persistent identifier
Makes data citable
There are lots of archiving options for your data. Some people choose to put their data on their website which is an option, but not a best practice.