A very short, very minimal presentation I prepared for the Yale Libraries' SCOPA event to introduce librarians in diverse disciplines to the concepts and challenges of data curation.
5. What is data? Definition varies by discipline and can include experimental, observational, and computational data. In general, “research data” refers to the raw or processed products of a research project. These products can be video, images, or numeric files in the form of geographic information, spreadsheets, and other formats.
8. What is data curation? “Data curation is the active and ongoing management of research data through its lifecycle of interest and usefulness to scholarship, science, and education.” – Carole Palmer, UIUC GSLIS. “Curation” includes selection, appraisal, maintenance, and preservation.
11. Why is data curation important for us? According to Paul F. Uhlir, Director of the Board on Research Data and Information, researchers are “contributing to a networked information enterprise where data are a fundamental infrastructural component of the modern research system.” Increasingly, data itself is a product and record of scholarship.
18. Some problems that make curation difficult: No standards. Lack of interoperability. Missing controlled vocabularies. Limited storage space. Unclear domain of stewardship/responsibility. Individual repositories become silos of content.
24. Ideas for solutions! Experiment tracking software and electronic notebooks. Automatic metadata. Integrating curation early into the researcher workflow. Educating graduate students on proper data management. DataONE and the Data Conservancy.
25. Other stuff! Data citation. Data sharing. Reward models. Identity control (ORCID, EZID). Semantic web and linked data. Cyberinfrastructure.
Hi! I’m Michelle Hudson. I’m the science and social science librarian, working in the Social Science Library for now, but I’ll be in the new Center for Science and Social Science Information when we open in January up in the Kline Biology Tower. I started at Yale in April and have been going to conferences and workshops all summer. Today I’m going to talk about things I learned at the Summer Institute on Data Curation, put on by UIUC’s GSLIS, and at the NSF-funded Princeton Research Data Lifecycle Workshop.
Director of CIRSS (Center for Informatics Research in Science & Scholarship).
As well as many other things – too many to list.
In many cases there are no standards for storage or metadata, either within or across disciplines.
Poor interoperability makes it difficult to maintain or convert data to appropriate formats – open source software and standards are lacking.
Even simple controlled vocabularies to standardize the outcomes of experiments are missing – people don’t talk about the same things the same way.
Some data are cheaper to generate than to store. That’s not to say they’re cheap to generate – it’s just that adequate backed-up storage is still unreasonably expensive for most research enterprises.
Domain of stewardship/responsibility: whose job is it to keep university assets for thirty, forty years, or more? Is it the library? Is it IT? As a profession, we don’t have liaison models and services to researchers in place for this. Who can grant this authority, and who is willing to fill the gap?
So Cornell has a repository with a lot of great data they’re willing to share. This isn’t helpful unless we know data sets are stored there. It’s even less helpful if we can’t access them. This is true for all kinds of universities and centers all over the world.
Using software that tracks and records the progress of experiments and captures data as it is produced is gaining popularity. Every scientist uses a notebook, and, until recently, these notebooks were paper, shelved somewhere inaccessible or forgotten once the project was done. With electronic notebooks, it’s much easier to keep track of what was done to the data, and when.
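As a toy illustration of what an electronic notebook records, here is a minimal sketch of an append-only provenance log. It is not based on any particular notebook product; the `log_step` function and its field names are invented for this example.

```python
import json
from datetime import datetime, timezone

def log_step(logfile, action, inputs, outputs, note=""):
    """Append one processing step to a JSON Lines provenance log.

    A stand-in for what an electronic lab notebook captures:
    what was done to the data, when, and to which files.
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "inputs": inputs,
        "outputs": outputs,
        "note": note,
    }
    # Append-only: earlier steps are never overwritten, so the
    # file reads as a chronological history of the experiment.
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because each line is a complete JSON record, a curator can replay the history of a dataset years later without any special software.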
Automatic metadata application came up as a main point during the Princeton workshop. Scientists want to advocate for instrument makers to give more robust options for creating metadata at the point of data creation/observation.
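One way to picture “metadata at the point of creation” is a small script that writes a metadata sidecar file whenever an instrument saves data. This is only a sketch: the field names, the `write_sidecar` function, and the `.meta.json` convention are assumptions for illustration, not any instrument vendor’s actual API or a community standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_sidecar(data_path, instrument, operator, extra=None):
    """Write a JSON metadata sidecar next to a data file at creation time.

    Field names are illustrative; a real deployment would follow a
    community metadata standard instead.
    """
    path = Path(data_path)
    record = {
        "file": path.name,
        "size_bytes": path.stat().st_size,
        # A checksum lets curators verify file integrity later.
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "instrument": instrument,
        "operator": operator,
    }
    if extra:
        record.update(extra)
    sidecar = path.parent / (path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

The point is that none of this requires the researcher to fill in a form after the fact: the richest, most accurate metadata is available automatically at the moment the data is created.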
If we start early in the workflow, it’s easier to get usable data that we can select and keep with less work down the line. If data is taken care of early, with the aim of keeping it safe for a long time, it requires less work on the part of the curators.
Some institutions are beginning to educate graduate students on good data management techniques. Simple data and file management isn’t often taught as part of research methods courses, so students have been welcoming this information, and they see librarians as experts on this kind of thing.
DataONE and the Data Conservancy are two NSF-funded projects with important goals for nationwide cyberinfrastructure and research data storage. DataONE intends to connect “nodes” of research institutions to make the data they share widely available. It also has a mission to educate students, librarians, and researchers on data management best practices. The Data Conservancy is a model for a large-scale repository for research data. Instances will start rolling out as early as next year, and it’ll be interesting to see how it develops and whether it’s viable. You can read a lot more about it on their website, which is on your handout.
Feel free to talk to me about anything else related to data as well. Obviously not much time to cover it all today, but let’s meet and chat! Also refer to the handout for good resources on learning more.