This document summarizes a presentation about preserving scientific data at the American Museum of Natural History (AMNH). It discusses current issues in digital preservation and science, provides an overview of the AMNH including its collections and staff, and outlines a project to understand the AMNH's digital preservation needs. A survey was conducted and results showed challenges around management, personnel, infrastructure, and preservation risks. The AMNH needs to improve its digital preservation practices to better protect and provide access to its valuable scientific data and collections.
Our regular Introduction to Data Management (DM) workshop (90 minutes). It covers very basic DM topics and concepts. The audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
Tools and approaches for data deposition into nanomaterial databases (Valery Tkachenko)
Sustainable research progress in many scientific disciplines critically depends on the existence of robust specialized databases that integrate and structure all available experimental information in the respective fields. The need for such reference databases is especially critical for nanoscience and nanomaterial research, given the significant diversity of shapes, sizes, and properties of engineered nanomaterials and the difficulty of synthesizing engineered nanoparticles with controlled properties. The acquisition of data from public sources is inefficient, time-consuming, and limited in scope. Moreover, it is not clear where the resources to support this activity on a perpetual basis will come from. The NIH has recently posted its intention to provide special funds toward data deposition by experimental investigators through the ‘data sharing plan’ required for each proposal. However, this points to a current weakness: all laboratories use different data collection approaches, each of which requires interpretation by the staff hosting the database. It would be far more efficient and useful if each investigator had a template with key terms that could be modified to add new or important additional data or parameters. We will discuss tools and approaches to facilitate the collection and direct deposition of experimental data into the Nanomaterial Registry (https://www.nanomaterialregistry.org/), a versatile, semantically enriched, template-based platform for registering diverse data pertaining to nanomaterials research.
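A template-based deposition workflow of the kind proposed might look like the following sketch. The field names, types, and units here are hypothetical placeholders, not the actual Nanomaterial Registry schema.

```python
# Hypothetical sketch of a template-driven deposition check; the field
# names, units, and the template itself are illustrative, not the actual
# Nanomaterial Registry schema.

REQUIRED_FIELDS = {
    "material_type": str,       # e.g. "gold nanoparticle"
    "mean_diameter_nm": float,  # core size in nanometres
    "synthesis_method": str,
    "zeta_potential_mV": float,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is
    structurally ready for deposition."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems

record = {"material_type": "gold nanoparticle", "mean_diameter_nm": 12.5,
          "synthesis_method": "citrate reduction", "zeta_potential_mV": -38.0}
assert validate_record(record) == []          # complete record passes
assert validate_record({"material_type": "TiO2"}) != []  # incomplete fails
```

A shared template like this lets investigators extend the key terms while keeping the deposited records machine-checkable, which is the efficiency argument made above.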
Presentation given to the BEACON 2013 Congress during the "Collaborating with Industry" sandbox
Original w/ slide notes at: https://docs.google.com/presentation/d/1mmvD0R3fLIl11TmFHij5fGcMDb9qJxy_nwENO2Rt-YI/edit?usp=sharing
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc... (ASIS&T)
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Part of “Beyond metadata: Supporting non-standardized documentation to facilitate data reuse”
RDAP 15: “This is just for me”: Researchers on their data documentation pract... (ASIS&T)
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Part of "Beyond metadata: Supporting non-standardized documentation to facilitate data reuse"
Sara Mannheimer, Data Management Librarian, Montana State University
RDAP 16 Poster: Challenges and Opportunities in an Institutional Repository S... (ASIS&T)
Research Data Access and Preservation Summit, 2016
Atlanta, GA
May 4-7, 2016
Poster session (Wednesday, May 4)
Presenters:
Amy Koshoffer, University of Cincinnati
Eric J. Tepe, University of Cincinnati
Why Data Science Matters - 2014 WDS Data Stewardship Award Lecture, by Xiaogang (Marshall) Ma
A presentation reviewing technical trends in data management, publication, and citation, as well as methodologies for data interoperability, research provenance, and semantic eScience.
The International Geo Sample Number (IGSN) is designed to provide an unambiguous globally unique persistent identifier (PID) for physical samples. It facilitates the location, identification, and citation of physical samples used in research.
Released on July 1, the ANDS IGSN minting service was developed in collaboration with AuScope to enable the Australian earth science community to assign IGSNs to geologic and environmental samples such as rocks, drill cores and soils, as well as related sampling features such as sections, dredges, wells and drill holes.
Join us for this webinar to:
--learn more about IGSNs and their place in the PID ecosystem
--understand the many benefits of assigning IGSN to research samples
--gain insights into the current status and future directions for IGSN implementation in Australia and internationally
--find out about the ANDS IGSN service including service scope and access, as well as plans to expand the service beyond the earth sciences domain
--hear from IGSN experts and ask questions of them
Our speaker line up includes:
--Prof Brent McInnes, Curtin University
--Dr Jens Klump, CSIRO
--Dr Lesley Wyborn, NCI
--Joel Benn, ANDS
RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas... (ASIS&T)
Research Data Access and Preservation Summit, 2016
Atlanta, GA
May 4-7, 2016
Poster session (Wednesday, May 4)
Presenters:
Sean Buckner, Texas A&M University
Jeremy Donald, Trinity University
Bruce Herbert, Texas A&M University
Wendi Kaspar, Texas A&M University
Nick Lauland, Texas Digital Library
Kristi Park, Texas Digital Library
Todd Peters, Texas State University
Denyse Rodgers, Baylor University
Cecilia Smith, Texas A&M University
Chris Starcher, Texas Tech University
Ryan Steans, Texas Digital Library
Santi Thompson, University of Houston
Ray Uzwyshyn, Texas State University
Laura Waugh, Texas Digital Library
RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep... (ASIS&T)
Research Data Access and Preservation Summit, 2016
Atlanta, GA
May 4-7, 2016
Poster session (Wednesday, May 4)
Poster available at: https://repository.library.brown.edu/studio/item/bdr:650020/
Presenters:
Andrew Creamer, Brown University
Hope Lappen, Brown University
John Santiago, Brown University
In response to the data management implications of ecosystem approach-based assessments, the ICES Data Centre has developed EcoSystemData, a tool that facilitates a holistic view and management of marine ecosystems. The use of integrated data structures improves support for integrated data requests covering diverse scientific topics. Data requests are received from client commissions (the Oslo-Paris Commission for the Protection of the Marine Environment of the North-East Atlantic (OSPAR), the Helsinki Commission (HELCOM), and the European Environment Agency (EEA)), ICES working groups, individual scientists, and research students.
The study of the physical conditions, of the chemical nature of the ocean waters, of the currents, etc., is of greatest importance for the investigation of the problems connected with life … and consequently a sharp line should never be drawn between these two main divisions.
Opportunities in chemical structure standardization (Valery Tkachenko)
This talk was given at EBI's Wellcome Trust Genome Campus and is dedicated to outlining problems with chemical information standardization and various efforts to tackle this problem.
Presentation held at XSEDE'14 - Atlanta, USA
Abstract - Reproducibility of published results is a cornerstone of scientific publishing and progress. Currently, most approaches to conservation in computational science, in particular for scientific workflow executions, have focused on data, code, and the workflow description, but not on the underlying infrastructure. We propose a logical-oriented approach to conserving computational environments, in which the capabilities of the resources (virtual machines, VMs) are semantically described. The resources involved in the execution of an experiment are described using a set of semantic vocabularies, and those descriptions are used to define an infrastructure specification. This specification can then be used to derive the set of instructions that can be executed to obtain a new, equivalent infrastructure.
More information: www.rafaelsilva.com
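A minimal sketch of the idea in the abstract above, with invented vocabulary terms and package names; the actual work uses RDF vocabularies rather than Python dicts.

```python
# Illustrative only: a logical description of a VM's capabilities, and a
# function that derives setup instructions from it. All terms (the keys,
# the OS image name, the package names) are invented for this sketch.

vm_description = {
    "type": "VirtualMachine",
    "os": "ubuntu-14.04",
    "requires": ["python-2.7", "pegasus-wms", "condor"],
}

def derive_instructions(desc: dict) -> list[str]:
    """Turn a logical description into an ordered list of shell commands
    that would rebuild an equivalent environment."""
    cmds = [f"provision --image {desc['os']}"]
    cmds += [f"apt-get install -y {pkg}" for pkg in desc["requires"]]
    return cmds

for cmd in derive_instructions(vm_description):
    print(cmd)
```

The point of describing resources logically rather than snapshotting them is that the same description can be re-executed later, on different hardware, to obtain an equivalent infrastructure.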
SERVICES THAT GD-INCO OFFERS TO PROPERTY OWNERS (BIMGENIA S.L.)
A brief description of the services that GD-INCO offers to companies that own real estate, companies that manage the physical assets of others, etc.
DataONE Education Module 01: Why Data Management?DataONE
Lesson 1 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
DataONE Education Module 03: Data Management PlanningDataONE
Lesson 3 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Spring 2014 Data Management Lab: Session 1 Slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Sediment Experimentalist Network (SEN): Sharing and reusing methods and data ... (hsuleslie)
Presentation given to the Summer Institute for Earth Surface Dynamics (SIESD) 2014 at St. Anthony Falls Laboratory, University of Minnesota, about the Sediment Experimentalist Network (SEN). SEN is an EarthCube Research Coordination Network, whose goal is to integrate the efforts of sediment experimentalists and build a knowledge base for guidance on best practices for data collection and management.
Immersive informatics - research data management at Pitt iSchool and Carnegie... (Keith Webster)
A joint presentation by Liz Lyon and Keith Webster on providing education for librarians engaged in research data management. This was delivered at Library Research Seminar VI, at the University of Illinois Urbana Champaign in September 2014. The presentation looks at a class delivered by Lyon at the University of Pittsburgh's iSchool in 2014, and the related needs for immersive training opportunities amongst experienced practicing librarians, using Carnegie Mellon University's library, led by Webster, as a case study.
This presentation was provided by Lisa Johnston, University of Minnesota, for a NISO Virtual Conference on data curation held on Wednesday, August 31, 2016
Introduction to research data management; Lecture 01 for GRAD521Amanda Whitmire
Lesson 1: Introduction to research data management. From a series of lectures from a 10-week, 2-credit graduate-level course in research data management (GRAD521, offered at Oregon State University).
The course description is: "Careful examination of all aspects of research data management best practices. Designed to prepare students to exceed funder mandates for performance in data planning, documentation, preservation and sharing in an increasingly complex digital research environment. Open to students of all disciplines."
Major course content includes: Overview of research data management, definitions and best practices; Types, formats and stages of research data; Metadata (data documentation); Data storage, backup and security; Legal and ethical considerations of research data; Data sharing and reuse; Archiving and preservation.
See also, "Whitmire, Amanda (2014): GRAD 521 Research Data Management Lectures. figshare. http://dx.doi.org/10.6084/m9.figshare.1003835. Retrieved 23:25, Jan 07, 2015 (GMT)"
Research Data Access and Preservation Summit, 2014
San Diego, CA
March 26-28, 2014
Jared Lyle, ICPSR
Jennifer Doty, Emory University
Joel Herndon, Duke University
Libbie Stephenson, University of California, Los Angeles
Elaine Martin, D.A., presented "Teaching Data Management" at Purdue University in September 2013. She demonstrated strategic data management plans and the skills librarians will need to help researchers develop a plan for organizing, preserving, and storing their data for easy access and retrieval. Details can also be found at the Twitter hashtag #datainfolit.
Curation and Preservation of Crystallography Data (Manjula Patel)
A presentation given by Manjula Patel (UKOLN) at "Chemistry in the Digital Age: A Workshop connecting research and education", June 11-12th 2009, Penn State University.
http://www.chem.psu.edu/cyberworkshop09
About the Webinar
Big data is being collected at a rate that is surpassing traditional analytical methods due to the constantly expanding ways in which data can be created and mined. Faculty in all disciplines are increasingly creating and/or incorporating big data into their research, and institutions are creating repositories and other tools to manage it all. There are many challenges to effectively managing and curating this data, challenges that are both similar to and different from managing document archives. Libraries can and do assume a key role in making this information more useful, visible, and accessible, such as by creating taxonomies, designing metadata schemes, and systematizing retrieval methods.
Our panelists will talk about their experience with big data curation, best practices for research data management, and the tools used by libraries as they take on this evolving role.
2. Itinerary
1. Current State of DigiPres + Science
2. An Overview of the AMNH
3. Project Specifics
a. Timeline
b. Methodology
c. Results
d. Recommendations
3. NDSA Levels of Digital Preservation
Levels: Level 1: Protect Your Data; Level 2: Know Your Data; Level 3: Monitor Your Data; Level 4: Repair Your Data
Rows: Storage & Geographic Locations; File Fixity & Data Integrity; Information Security; Metadata; File Formats
4. USGS Levels of Digital Preservation
Levels: Level 1; Level 2; Level 3; Level 4
Rows: Storage & Geographic Locations; Data Integrity; Information Security; Metadata; File Formats; Physical Media
5. Digital Preservation in Science in the US
❏ North Carolina County Geospatial Data
❏ Caroline Dean Wildflower Collection
❏ FSU Biological Scientist, Dr. A.K.S.K. Prasad, Diatomscapes I and II Collections Photographs
❏ FSU Department of Oceanography Technical Reports
source: http://www.digitalpreservation.gov/collections/
10. Review of SSC Report 2012
● “Digital needs can be broken out into 3 parts:
○ Pure storage needs
■ digital collections & research data (Science)
■ digital technology infrastructure
■ HD video (Education, Exhibition)
■ other data (Library, Photo Studio)
○ Interfaces to digital archives
■ pointer management
○ Cataloguing digital collections & archiving
■ naming conventions
■ best practices by discipline or type”
“The most pressing science need in the near term is for pure storage of CT and genomics data.”
19. Results: Outline of Challenges
Theme ● Participant Response
Management Structure ● “I think only the division chair knows how much server space we have in total.”
Personnel Practices ● “My research assistants manage all my data.”
Interdepartmental Relationships ● “I don’t know where exactly to get support for my database. I don’t know if it’s IT’s jurisdiction or job.”
Workflow Structure ● “I have no time to standardize my data management.”
Technical Infrastructure ● “There are not enough computer terminals in the imaging lab.”
22. Results: Management
Research Data Management Strategies:
1. Scientific staff/assistants
2. The curators themselves
3. Research databases
Collections Data Management Strategies:
1. Collections databases per department
2. KE EMu
3. Paper catalogue
23. Results: Preservation Risks
A CAT scan image of a Velociraptor skull.
Image sources: http://publications.nigms.nih.gov/thenewgenetics/chapter1.html; Fossil Insect Collaborative-Digitization Project | Facebook
Hello again! I’m very delighted to share my experiences working on an NDSR project at the American Museum of Natural History with you.
This will be the basic structure of this presentation--I will give you some context about the state of digital preservation of science data, then describe the AMNH and the role Science plays there. After that, I’ll delve into my project with considerations made for methodology, preliminary results, and my first forays into recommendations for the Museum.
These are the National Digital Stewardship Alliance levels of digital preservation. What’s nice about these is that each row is independent of the others--an institution could be at level four for metadata and level one for file formats. These levels are intended as tiered recommendations for how institutions should begin or enhance their digital preservation activities. It’s a fairly simple yet robust guide useful for everyone on the spectrum of digital preservation--from those just beginning to think about it to those who think about it exclusively. It’s basically a great self-assessment tool.
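As a toy illustration of the row-independence point (this is not an official NDSA tool), a self-assessment can be modeled as a mapping from row to achieved level, with each row scored on its own:

```python
# Hypothetical self-assessment against the NDSA rows; the scores below are
# invented to mirror the example in the talk (level four for metadata,
# level one for file formats).

NDSA_ROWS = ["Storage & Geographic Locations", "File Fixity & Data Integrity",
             "Information Security", "Metadata", "File Formats"]

assessment = {  # row -> achieved level (1-4)
    "Storage & Geographic Locations": 2,
    "File Fixity & Data Integrity": 1,
    "Information Security": 3,
    "Metadata": 4,
    "File Formats": 1,
}

# Because rows are independent, the weakest row (not an overall average)
# is the natural place to focus next.
weakest = min(NDSA_ROWS, key=assessment.get)
print(weakest, assessment[weakest])
```

The design point is that no single "overall level" is computed; each row advances separately, which is what makes the levels usable as a self-assessment tool.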
So the US Geological Survey actually took these levels and adapted them for their own purposes, using language that they believe their scientific data managers will better understand. You can see I’ve bolded the main differences between the NDSA levels and the USGS levels, though they are minuscule. The USGS wanted to account for the physical media that stores the digital data and the file formatting for each, and replaced “file fixity” with simply “data integrity,” though based on my reading of their levels, it means the same thing to them.
Perhaps most importantly, this is the first national science institution to explicitly bring attention to digital preservation. Their website also has a very robust set of pages explaining what digital preservation is, what it means for science, and why more people should be paying attention to it. They are really trying to get a conversation started, which is absolutely commendable. (Click) Great job, USGS! Gold star!
This map comes from the NDSA website, and lists some preservation partners in the US that are currently preserving at-risk digital content. I just wanted you all to have some context about how little science is in the conversation in terms of digital preservation. Out of the 14,000 collections listed, four are scientific: the North Carolina County Geospatial Data, the Caroline Dean Wildflower Collection, FSU scientist Dr. A.K.S.K. Prasad’s Diatomscapes I & II collections photographs, and the FSU Department of Oceanography’s Technical Reports.
As you can see, science data as a whole is very underrepresented in our digital preservation community.
Also, for those wondering: diatoms, the subject of the Diatomscapes collections, are single-celled algae that live in aquatic environments of all types.
Science is the heart of the AMNH--it makes everything in the Museum tick, from exhibitions to education initiatives. I’m here to begin the process of preserving the legacy of that research, which has taken an increasingly digital and complex shape since the 90s.
The Museum has five research science divisions, some with nested departments. These divisions are: Anthropology, Paleontology, Invertebrate Zoology, Vertebrate Zoology, and Earth & Planetary Science, all supported by the AMNH Research Library, where I am located. The AMNH is in the unique position of being both a cultural heritage organization and a research institution, so the Library is in an equally unique position of supporting traditional archives as well as research data management.
Currently at the AMNH, we have over 200 scientists in our employ and over 33 million specimens in our collections. Luckily for me, I only had to speak to 41 curators and their immediate scientific staff. Many of the other scientists are members of their “labs” or “teams,” so speaking with the curators was more than sufficient.
However, for all the unique and rare scientific data the Museum produces, there is no institution-wide plan to store, manage, or preserve this data. Each curator is left to their own devices, literally, to store their research. This is a function of the relationship each scientist has with the Museum: they are considered “free agents” and receive little funding for research from the Museum. The Museum pays their salaries but leaves it up to the curators to find their own funding for research initiatives, including data and storage management.
Because it is such a time sink, most curators at the Museum don’t manage their data with regularity. Backups are made, but there is no real plan for the data after publication. I am here to provide the recommendations that would enable the AMNH to remedy that situation.
Prior to my arrival at the Museum, there was an effort to identify what storage projections would look like for the Museum, as well as to identify the most at-risk data. This was undertaken by the Science Support Committee, a committee in the Science Senate. The report, published in 2012, sought to review the digital storage needs across science by surveying the Microscopy and Imaging Facility and the Sackler Institute for Comparative Genomics. It gave special consideration to management and preservation, largely due to the federal data management plan requirement that had become part of NSF grants, which fund the majority of research at the Museum.
The recommendation in 2012 was to build a trustworthy digital repository--a recommendation I agree with, I might add. Though the focus was decidedly on storage needs, the Committee did make strong mention of management in terms of accessibility: “It is not wise to build up large digital collections that have not been archived correctly from the start. Therefore, it is urgent that best practices be established for file naming and directory structuring of CT and genomics data, so that it remains accessible and discoverable in the future. This issue is best addressed at the level of management of individual collections, in consultation with Library staff.”
As you can see, the Museum has had a demonstrated need for storage, management, and preservation of scientific research data for at least three years. My project was a natural next step. With me, the Museum would get storage projections for the next five years with accompanying management and preservation recommendations.
My project at the AMNH has three phases:
SURVEY: To develop and implement a survey of existing digital assets at the AMNH
this includes interviewing those 41 curators and select scientific staff about their digital data, with questions that would inform choices made for digital preservation. The three main question categories were “storage” (do they have enough, and what is their rate of growth?), “management” (do they have any management practices, and what are they?), and “preservation” (essentially, how long do they think their data will be useful to others, and what needs to be preserved alongside it to make it useful?).
You can see here some excerpts from the survey I developed. I formed a lot of the interview questions from the great tool from Purdue called the “Data Curation Profiles” and the toolkit they provide--they gave a really great framework for the interview, so if you need to do something like this at your institution, I’d really recommend it.
So, I would email or call each curator and schedule a time I could come visit their office. I would sit down with them and start by asking them to describe their research cycle, which often gave me answers about their data’s lifecycle. This helps me understand how much and what kind of data is generated at each stage so I can account for that in my recommendations.
for example: if a curator tells me it is only important to store processed gene sequences, which are aligned and tend to be annotated, and not the raw sequence data, the storage and preservation concerns are significantly different than if they wanted to preserve it all. Most data have an expiration date, and these questions help me identify it.
At the moment, I have done the majority of my ANALYSIS: this means I took the results of the survey and have begun to translate them into functional requirements to be met at the Museum.
After sitting down with each curator, I would transfer the recording off of my phone and digitize my notes for the next step of the process: analysis. I transcribed all the audio interviews with free transcription software and used analysis methodologies from sociology, as all of my data are qualitative. These methods are called “grounded theory.”
As I reviewed the data collected, I found quite a few repeated ideas and tagged them with codes; a code is literally just a word or phrase to represent one of those ideas. As more data are collected, and as data are re-reviewed, codes can be grouped into concepts, and then into categories. These categories may become the basis for new theory. I did this using software called NVivo, which allowed me to import transcriptions of audio interviews--or interview notes, for those who didn’t want to be recorded--and code them. Because my background is in computer science and not sociology, there was a bit of a learning curve, but NVivo was a powerful yet simple tool to use as I worked through my first attempt at qualitative analysis. The key was to create codes for concepts that weren’t so abstract that they couldn’t be empirically supported--after all, I work primarily with scientists. This was difficult, but after re-reading all 50+ of my interviews over and over again, I was able to nail down the core concepts.
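To make the two coding passes concrete, here is a minimal sketch of the workflow: tag excerpts with codes, then roll codes up into broader concepts. Every code, concept, and quote below is an invented illustration, not actual survey data.

```python
# Illustrative sketch of grounded-theory-style coding. All codes, concepts,
# and quotes are hypothetical examples, not data from the AMNH survey.
from collections import defaultdict

# First pass: each excerpt tagged with a code (a short word or phrase)
coded_excerpts = [
    ("external-drives", "I keep everything on a drive in my office."),
    ("no-backup", "Honestly, I don't back anything up."),
    ("external-drives", "We bought three more externals last year."),
    ("shared-server", "The department server fills up fast."),
]

# Second pass: codes grouped into higher-level concepts
concepts = {
    "storage-practices": {"external-drives", "shared-server"},
    "preservation-risk": {"no-backup"},
}

def group_by_concept(excerpts, concept_map):
    """Bucket each coded excerpt under its parent concept."""
    grouped = defaultdict(list)
    for code, quote in excerpts:
        for concept, codes in concept_map.items():
            if code in codes:
                grouped[concept].append((code, quote))
    return dict(grouped)

grouped = group_by_concept(coded_excerpts, concepts)
for concept, items in grouped.items():
    print(f"{concept}: {len(items)} excerpt(s)")
```

In practice a tool like NVivo handles this bookkeeping, but the underlying operation is just this kind of tagging and grouping.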
I am currently in the reporting phase, where I am articulating all my analysis into a comprehensive report to be given to the Museum at the end of May. For our purposes, I’ll outline some of those results and speak briefly about some recommendations I’m thinking about.
These are just a few of those core concepts I mentioned earlier. This is an example of what my semi-finalized data will look like--obviously with more participant responses; for the sake of this presentation, I didn’t want to overwhelm you with too many. On the left side of this table are the specific overlapping themes--the codes--and on the right, the corresponding participant quotes--the evidence.
Though the majority of my data is qualitative, I have been able to determine how much digital data is currently stored by AMNH scientists, and the projected annual growth of digital data. The figures for fiscal year 2014, as well as the projections for FY 2015, are based on my interviews with curators and their staff. Using both these data points, I was able to construct five-year projections for the growth of digital data at the AMNH. You can see it’s exponential, with the AMNH generating close to 3 petabytes of data in the next five fiscal years.
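The projection above is a compound-growth calculation. Here is a minimal sketch of how such a five-year total can be computed; the starting volume (250 TB generated in the base year) and the 50% year-over-year growth rate are illustrative assumptions, not the actual survey figures.

```python
# Hypothetical five-year projection of cumulative data generation.
# BASE_TB and GROWTH_RATE are assumed values for illustration only.
BASE_TB = 250.0      # assumed data generated in the base fiscal year (TB)
GROWTH_RATE = 1.5    # assumed year-over-year multiplier (50% growth)

def projected_total(base_tb, rate, years):
    """Cumulative data generated over `years` fiscal years of compound growth."""
    return sum(base_tb * rate ** t for t in range(years))

total_tb = projected_total(BASE_TB, GROWTH_RATE, 5)
print(f"Projected five-year total: {total_tb / 1000:.2f} PB")  # ~3.3 PB
```

Even with modest assumed inputs, the compounding term dominates quickly, which is why the chart looks exponential.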
You can see here a pie chart depicting the number of users per storage medium--many curators use several different media to store their data. The two most popular are desktops and laptops, and external hard drives. Each curator is largely responsible for storing their own data, so while the AMNH will usually provide them with both a laptop and a desktop, money comes out of the curator’s personal funding (grants or otherwise) to cover additional storage. This means that the majority of the scientific research data at the AMNH is stored on external hard drives sitting in each curator’s office. Some departments and labs have internal servers, bought with department funds, though divvying that space up between curators means the allotment is usually quite small compared to their needs. With the institution-wide SAN, the situation is the same, only with many more users vying for space. Some rely on cloud-based services, like Amazon S3 or Dropbox, though that is mainly to share data with collaborators.
These are the most popular management strategies employed by the curators at the AMNH. Research data is often left to their scientific staff to manage--and even then, it tends to be management in the sense of making backups and selecting storage devices rather than creating file naming conventions or file tree structures--though some staff do that. After that, the curators will manage their own data, although, like the staff, there isn’t much management of the actual files in terms of naming and placement; it’s more project-level management, where they put all the files from one project together, and so forth. A couple of curators and their labs maintain research databases, but they are a rare few. These, too, are Access, FileMaker Pro, or homegrown.
For collections databases, there are quite a few disparate databases spread across the divisions and nested departments--mainly FileMaker Pro, Access, or homegrown SQL. The Museum recently bought a database system from KE Software called EMu, which is in use in our Vertebrate Zoology department, with our Invertebrate Zoology department in the process of moving their data over. In addition to those two systems, paper catalogues are maintained in a number of departments to this day. This is often described as a “romantic nod to the history of science,” but it also acts as a fail-safe if technology fails.
So as you can see, there are about as many management systems as there are curators. This is a bit of a problem when combined with all the many devices on which each curator stores their data. Sometimes, it’s a matter of finding one out of a thousand files, on one out of a hundred projects, on one out of five drives, two USB sticks, two computers, Dropbox, and Amazon’s cloud. With no standardized management, the risk of data loss is obvious.
Here are what I believe are the data at greatest risk--CT data (click), genomics data (click), and digital images (click), both moving and still. These are not only some of the largest files at the AMNH, but the ones most subject to data loss. To preserve CT data, it is important to store the raw CT data--from this, other scientists can reconstruct whatever part of the taxon they’d like. The same raw skull data can be used by many scientists: those focusing on inner ear development, or brain casing, or facial morphology. As with CT data, it is important to keep the raw genomic sequence data as well, because different scientists use different algorithms to align them. For digital images, well, the digital preservation world has a suite of recommendations for those, but the baseline is: keep the file formats open.
For these data, the usual threats--file format or software obsolescence--sometimes take a backseat to data simply walking out the door. Because the Museum has its own gene sequencing and imaging facilities, visiting scientists, students, and research collaborators will often come use AMNH specimens on AMNH machines, and take the resulting digital files with them, because the Museum lacks the storage capacity and infrastructure to keep a copy. This puts our specimen data at risk, which is of extreme concern because many of the specimens are unique.
This is essentially where the AMNH is now: the first step is providing centralized, safe storage for all our data before we can take the next steps toward better management and preservation practices.
That being said, my ideal next step is the construction of a central, trustworthy institutional repository, based on open-source software and run out of the AMNH Research Library, to house scientific data throughout its whole lifecycle. I believe that providing central storage, management, and preservation infrastructure at the AMNH would mitigate the risk of data loss--which only grows as the amount of data we generate does. A repository would answer the needs of the curators while putting the Library in a position to begin preserving the data so intrinsically important to the Museum.
Thank you so much. You can find me on Twitter at Vicky Steeves and you can email me at my AMNH email listed there.