Preserving Scientific Data
@ the AMNH
Vicky Steeves | ARLIS-NY 2015
Itinerary
1. Current State of DigiPres + Science
2. An Overview of the AMNH
3. Project Specifics
a. Timeline
b. Methodology
c. Results
d. Recommendations
NDSA Levels of Digital Preservation
Level 1: Protect Your Data
Level 2: Know Your Data
Level 3: Monitor Your Data
Level 4: Repair Your Data
Storage & Geographic Locations
File Fixity & Data Integrity
Information Security
Metadata
File Formats
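
Because each row of the NDSA grid is scored independently, a self-assessment can be recorded as one level per row. The following is a minimal sketch of that idea, not an official NDSA tool: the function names and the example scores are hypothetical, and treating an unassessed row as level 0 is an assumption made here.

```python
# Row and level names are taken from the slide; everything else is illustrative.
NDSA_ROWS = [
    "Storage & Geographic Locations",
    "File Fixity & Data Integrity",
    "Information Security",
    "Metadata",
    "File Formats",
]

LEVEL_NAMES = {
    1: "Protect Your Data",
    2: "Know Your Data",
    3: "Monitor Your Data",
    4: "Repair Your Data",
}


def summarize(assessment):
    """Return one summary line per row; unassessed rows count as level 0 (an assumption)."""
    lines = []
    for row in NDSA_ROWS:
        level = assessment.get(row, 0)
        label = LEVEL_NAMES.get(level, "Not yet at Level 1")
        lines.append(f"{row}: Level {level} ({label})")
    return lines


def weakest_rows(assessment):
    """Return the rows with the lowest level, i.e. where to focus first."""
    lowest = min(assessment.get(row, 0) for row in NDSA_ROWS)
    return [row for row in NDSA_ROWS if assessment.get(row, 0) == lowest]


# Hypothetical institution: strong on metadata, weak on file formats.
example_assessment = {
    "Storage & Geographic Locations": 2,
    "File Fixity & Data Integrity": 2,
    "Information Security": 3,
    "Metadata": 4,
    "File Formats": 1,
}

for line in summarize(example_assessment):
    print(line)
print("Focus first on:", weakest_rows(example_assessment))
```

Keeping one score per row, rather than a single overall level, preserves the point of the grid: an institution can be at Level 4 for metadata and Level 1 for file formats at the same time.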
USGS Levels of Digital Preservation
Level 1 Level 2 Level 3 Level 4
Storage & Geographic Locations
Data Integrity
Information Security
Metadata
File Formats
Physical Media
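
"File fixity" in the NDSA grid, which the USGS folds into "Data Integrity," is commonly checked by recording a checksum for each file and recomputing it later. The sketch below shows one way to do that with Python's standard `hashlib`; the manifest layout and the function names are assumptions for illustration, and neither the NDSA nor the USGS levels prescribe this particular code or hash algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Compare files against a manifest of expected digests.

    `manifest` is a hypothetical dict mapping file path -> expected SHA-256 hex digest.
    Returns a dict mapping each path to "ok", "changed", or "missing".
    """
    report = {}
    for path, expected in manifest.items():
        if not Path(path).exists():
            report[path] = "missing"
        elif sha256_of(path) == expected:
            report[path] = "ok"
        else:
            report[path] = "changed"
    return report
```

Reading in chunks matters for large files such as CT scans, since the whole file never has to fit in memory.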
Digital Preservation in Science in the US
❏ North Carolina County Geospatial Data
❏ Caroline Dean Wildflower Collection
❏ FSU Biological Scientist, Dr. A.K.S.K. Prasad Diatomscapes I and II Collections Photographs
❏ FSU Department of Oceanography Technical Reports
source: http://www.digitalpreservation.gov/collections/
The AMNH
Vicky Steeves, AMNH
41 Curators & their scientific staff
over 33 million specimens in collections
Why Me @ the AMNH?
Review of SSC Report 2012
● “Digital needs can be broken out into 3 parts:
○ Pure storage needs
■ digital collections & research data
(Science)
■ digital technology infrastructure
■ HD video (Education, Exhibition)
■ other data (Library, Photo Studio)
○ Interfaces to digital archives
■ pointer management
○ Cataloguing digital collections & archiving
■ naming conventions
■ best practices by discipline or type”
“The most pressing science need in the near term
is for pure storage of CT and genomics data.”
Project Timeline
Survey: Example
Survey Methodology
Survey: Example
vs
Project Timeline
Analysis Methodology
Analysis Methodology
NVivo Screenshots
Current Phase
Results: Outline of Challenges
Theme: Participant Responses
Management Structure: “I think only the division chair knows how much server space we have in total.”
Personnel Practices: “My research assistants manage all my data.”
Interdepartmental Relationships: “I don’t know where exactly to get support for my database. I don’t know if it’s IT’s jurisdiction or job.”
Workflow Structure: “I have no time to standardize my data management.”
Technical Infrastructure: “There are not enough computer terminals in the imaging lab.”
Results: Storage Projections
Results: Storage Mediums
Number of users per medium
Results: Management
Research Data Management Strategies:
1. Scientific Staff/Assistant
2. Curator themselves
3. Research databases
Collections Data Management Strategies:
1. Collections Databases per department
2. KE EMu
3. Paper Catalogue
Results: Preservation Risks
A CAT scan image of a Velociraptor skull.
http://publications.nigms.nih.gov/thenewgenetics/chapter1.html
Fossil Insect Collaborative-Digitization Project | Facebook
Results: Preservation Steps
What the AMNH Needs
Vicky Steeves, AMNH
@VickySteeves
vsteeves@amnh.org


Editor's Notes

  1. Hello again! I’m delighted to share my experiences working on an NDSR project at the American Museum of Natural History with you.
  2. This will be the basic structure of this presentation--I will give you some context about the state of digital preservation of science data, then describe the AMNH and the role Science plays there. After that, I’ll delve into my project, covering my methodology, preliminary results, and first forays into recommendations for the Museum.
  3. These are the National Digital Stewardship Alliance levels of digital preservation. What’s nice about these is that each row is independent of the others--an institution could be at level four for metadata and level one for file formats. These levels are intended as tiered recommendations for how institutions should begin or enhance their digital preservation activities. It’s a fairly simple yet robust guide, useful for everyone on the spectrum of digital preservation--from those just beginning to think about it to those who exclusively think about it. It’s basically a great self-assessment tool.
  4. So the US Geological Survey actually took these levels and adapted them for their own purposes, in language that they believe their scientific data managers will better understand. You can see I’ve bolded the main differences between the NDSA levels and the USGS levels, though they are minor. You can see that the USGS wanted to account for the physical media that stores the digital data and the file formatting for each, and replaced “file fixity” with simply “data integrity,” though based on my reading of their levels, it means the same thing to them. Perhaps most importantly, this is the first national science institution to explicitly bring attention to digital preservation. Their website also has a very robust set of pages explaining what digital preservation is, what it means for science, and why more people should be paying attention to it. They are really trying to get a conversation started, which is absolutely commendable. (Click) Great job, USGS! Gold star!
  5. This map comes from the NDSA website, and lists some preservation partners in the US that are currently preserving at-risk digital content. I just wanted you all to have some context about how little science is in the conversation in terms of digital preservation. Out of the 14,000 collections listed, four are scientific: the North Carolina County Geospatial Data, the Caroline Dean Wildflower Collection, FSU scientist Dr. A.K.S.K. Prasad’s Diatomscapes I and II collections of photographs, and the FSU Department of Oceanography’s Technical Reports. As you can see, science data as a whole is very underrepresented in our digital preservation community. Also, for those wondering: diatoms are single-celled algae that live in aquatic environments of all types.
  6. Science is the heart of the AMNH--it makes everything in the Museum tick, from exhibitions to education initiatives. I’m here to begin the process of preserving the legacy of that research, which has taken an increasingly digital and complex shape since the 90s.
  7. The Museum has five research science divisions, some with nested departments. These divisions are: Anthropology, Paleontology, Invertebrate Zoology, Vertebrate Zoology, and Earth & Planetary Science, all supported by the AMNH Research Library, where I am located. The AMNH is in the unique position of being both a cultural heritage organization and a research institution, so the Library is in an equally unique position of supporting traditional archives as well as research data management.
  8. Currently at the AMNH, we have over 200 scientists in our employ with over 33 million specimens in our collections. Luckily for me, I only had to speak to 41 curators and their immediate scientific staff. Many of the other scientists are members of their “labs” or “teams,” so speaking to the curators was more than sufficient.
  9. However, for all the unique and rare scientific data the Museum produces, there is no institution-wide plan to store, manage, or preserve this data. Each curator is left to their own devices, literally, to store their research. This is a function of the relationship each scientist has with the Museum: they are considered “free agents” and receive little funding for research from the Museum. The Museum pays their salaries but leaves it up to the curators to find their own funding for research initiatives, including data and storage management. Because it is such a timesuck, most curators at the Museum don’t manage their data with regularity. There are backups made, but no real plan for the data after publication. I am here to provide the recommendations that would enable the AMNH to remedy that situation.
  10. Before my arrival at the Museum, there was an effort to identify what storage projections would look like for the Museum, as well as to identify the most at-risk data. This was undertaken by the Science Support Committee, a committee in the Science Senate. The report, published in 2012, sought to review the digital storage needs across science by surveying the Microscopy and Imaging Facility and the Sackler Institute for Comparative Genomics. It gave special consideration to management and preservation, largely due to the federal data management plan that had come to be a part of NSF grants, which fund the majority of research at the Museum. The recommendation in 2012 was to build a trustworthy digital repository--one I agree with, I might add. Though the focus was decidedly on storage needs, the Committee did make strong mention of management in terms of accessibility: “It is not wise to build up large digital collections that have not been archived correctly from the start. Therefore, it is urgent that best practices be established for file naming and directory structuring of CT and genomics data, so that it remains accessible and discoverable in the future. This issue is best addressed at the level of management of individual collections, in consultation with Library staff.” As you can see, the need for storage, management, and preservation of scientific research data has been demonstrated at the Museum for at least three years. My project was a natural next step. With me, the Museum would get storage projections for the next five years with accompanying management and preservation recommendations.
  11. My project at the AMNH has three phases. The first is the SURVEY: to develop and implement a survey of existing digital assets at the AMNH. This includes interviewing those 41 curators and select scientific staff about their digital data, with questions that would inform choices made for digital preservation. The three main question categories were “storage,” i.e. do they have enough and what is their rate of growth; “management,” i.e. do they have any management practices and what are they; and “preservation,” which covers how long they think their data will be useful for others and what needs to be preserved alongside it to make it useful to others.
  12. You can see here some excerpts from the survey I developed. I based many of the interview questions on a great tool from Purdue called the “Digital Curation Profiles” and the toolkit they provide. It gave me a really good framework for the interview, so if you need to do something like this at your institution, I’d really recommend it.
  13. So, I would email or call each curator and schedule a time when I could come visit their office. I would sit down with them and start by asking them to describe their research cycle, which often gave me answers about their data’s lifecycle. This helped me understand how much and what kind of data is generated at each stage, so I could account for that in my recommendations.
  14. For example, if a curator tells me it is only important for them to store processed gene sequences, which are aligned and tend to be annotated, and not the raw sequence data, the storage and preservation concerns are significantly different than if they wanted to preserve it all. Most data have an expiration date, and these questions help me identify it.
  15. At the moment, I have done the majority of my ANALYSIS: this means I took the results of the survey and have begun to translate them into functional requirements to be met at the Museum.
  16. After sitting down with each curator, I would transfer the recording off of my phone and digitize my notes for the next step of the process: analysis. I transcribed all the audio interviews with free transcription software and, since all of my data are qualitative, used analysis methodologies from sociology. These methods are called “grounded theory.”
  17. As I reviewed the data collected, I found quite a few repeated ideas and tagged them with codes; a code is literally just a word or phrase that represents one of those ideas. As more data is collected, and as data is re-reviewed, codes can be grouped into concepts, and then into categories. These categories may become the basis for new theory. I did this using software called NVivo, which allows me to import transcriptions of audio interviews, or interview notes for those who didn’t want to be recorded, and code them. Because my background is in computer science and not sociology, there was a bit of a learning curve, but NVivo was a powerful yet simple tool to use as I worked through my first attempt at qualitative analysis. The key was to create codes for concepts that weren’t so abstract that they couldn’t be empirically proved--after all, I work primarily with scientists. This was difficult, but after re-reading all my 50+ interviews over and over again, I was able to nail down the core concepts.
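The grouping step described above can be sketched in a few lines of Python. This is only an illustration of the idea of rolling tagged excerpts up into themes, not how NVivo works internally. The two quotes and theme names come from the results slide; the code labels (`unknown_server_capacity`, `staff_manage_data`) and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical code labels mapped to the themes shown on the results slide.
CODE_TO_THEME = {
    "unknown_server_capacity": "Management Structure",
    "staff_manage_data": "Personnel Practices",
}

# (code, excerpt) pairs, as would result from tagging interview transcripts.
coded_excerpts = [
    ("unknown_server_capacity",
     "I think only the division chair knows how much server space we have in total."),
    ("staff_manage_data",
     "My research assistants manage all my data."),
]


def group_by_theme(excerpts, code_to_theme):
    """Collect excerpts under the theme that each code belongs to."""
    themes = defaultdict(list)
    for code, quote in excerpts:
        themes[code_to_theme[code]].append(quote)
    return dict(themes)


theme_table = group_by_theme(coded_excerpts, CODE_TO_THEME)
for theme, quotes in theme_table.items():
    print(theme)
    for quote in quotes:
        print("  -", quote)
```

The result has the same shape as the theme/response table: each theme on one side, the supporting participant quotes on the other.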
  18. I am currently in the reporting phase, where I am compiling all my analysis into a comprehensive report to be given to the Museum at the end of May. For our purposes, I’ll outline some of those results and speak briefly about some recommendations I’m thinking about.
  19. These are just a few of those core concepts I mentioned earlier. This is an example of what my semi-finalized data will look like, though obviously with more participant responses. For the sake of this presentation, I didn’t want to overwhelm you with too many. On the left side of this table you can see the specific overlapping themes (the codes), and on the right the corresponding participant quotes (the evidence).
  20. Though the majority of my data is qualitative, I have been able to discover how much digital data is currently stored by AMNH scientists, and the projected annual growth of digital data. The numbers for fiscal year 2014, as well as the projected numbers for FY 2015, are based on my interviews with curators and their staff. Using both these data points, I was able to construct five-year projections for the growth of digital data at the AMNH. You can see it’s exponential, with the AMNH generating close to 3 petabytes of data in the next five fiscal years.
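As a rough sketch of how a projection can be built from two yearly data points, the Python below assumes a constant year-over-year growth ratio and compounds it forward. That assumption, the function name, and the input values are all hypothetical: the numbers are placeholders, not the AMNH survey figures, and the talk does not specify the exact method used.

```python
def project_growth(first_year, second_year, years):
    """Compound the ratio between two yearly totals forward for `years` years.

    Assumes the growth ratio stays constant, which is what makes the
    resulting curve exponential.
    """
    ratio = second_year / first_year
    projections = []
    current = second_year
    for _ in range(years):
        current *= ratio
        projections.append(current)
    return projections


# Placeholder values in terabytes (hypothetical, NOT the AMNH figures).
fy2014_tb = 100.0
fy2015_tb = 150.0

projected = project_growth(fy2014_tb, fy2015_tb, years=5)
for offset, total in enumerate(projected, start=1):
    print(f"FY{2015 + offset}: {total:.1f} TB")
```

With only two data points the ratio is very sensitive to estimation error in either year, so projections like this are best read as an order-of-magnitude guide.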
  21. You can see here a pie chart depicting the number of users per storage medium; many curators use several different media to store their data. You can see the two most popular are desktops & laptops, and external hard drives. Each curator is largely responsible for storing their own data, so while the AMNH will usually provide them with both a laptop and a desktop, money comes out of the curator’s personal funding (grants or otherwise) to cover the additional storage. This means that the majority of the scientific research data at the AMNH is stored on external hard drives sitting in each curator’s office. Some departments and labs have internal servers, bought with department funds, though divvying that up between curators means that the allotment is usually quite small compared to their needs. With the institution-wide SAN, the situation is the same, only with many more users vying for space. Some rely on cloud-based services, like Amazon S3 or Dropbox, though that is mainly to share data with collaborators.
  22. These are the most popular management strategies employed by the curators at the AMNH. For research data, it’s often left to their scientific staff to manage, and even then it tends to be management in terms of making backups and selecting devices for storage rather than creating file naming conventions or file tree structures, though some staff do that too. After that, the curators manage their own data, although, as with the staff, there isn’t a lot of management over the actual files in terms of naming and placement; it’s more project-level management, where they will put all files from one project together and so forth. A couple of curators and their labs maintain research databases, but it’s a rare few. These are built in Access or FileMaker Pro, or are homegrown. For collections data, there are quite a few disparate collections databases spread across the divisions and nested departments, mainly FileMaker Pro, Access, or homegrown SQL. The Museum recently bought a database system from KE called EMu, which is in use in our Vertebrate Zoology department, with our Invert Zoology department in the process of moving their data over. In addition to those two systems, paper catalogues are maintained in a number of departments to this day. This is often described as a “romantic nod to the history of science,” but it also acts as a fail-safe if technology fails. So as you can see, there are about as many management systems as there are curators. This is a bit of a problem when combined with the many devices on which each curator stores their data. Sometimes, it’s a matter of finding one out of a thousand files on one out of a hundred projects on one out of five drives, two USB sticks, two computers, Dropbox, and Amazon cloud. With no standardized management, the risk of data loss is obvious.
  23. Here are what I believe are the data at greatest risk: CT data (click), genomics data (click), and digital images (click), both moving and still. These are not only some of the largest files at the AMNH, but also the ones most subject to data loss. To preserve CT data, it is important to store the raw CT data; from this, other scientists can reconstruct whatever part of the taxa they’d like. The same raw skull data can be used by many scientists: those focusing on inner ear development, or brain casing, or facial morphology. As with CT data, it is important to keep the raw genomic sequence data, because different scientists use different algorithms to align it. For digital images, well, the digital preservation world has a suite of recommendations for those, but the baseline is: keep the file formats open source. For these data, the usual causes of loss, file format or software obsolescence, sometimes take a backseat to data simply walking out of the door. Because the Museum has its own gene sequencing and imaging facilities, visiting scientists, students, and research collaborators will often come use AMNH specimens on AMNH machines, and take the resulting digital files with them because the Museum lacks the storage capacity and infrastructure to take a copy. This puts our specimens at risk, which is of extreme importance because many are unique.
  24. This is essentially where the AMNH is now: the first step is providing centralized, safe storage for all our data, before we can take the next steps towards better management and preservation practices.
  25. That being said, my ideal next step is the construction of a central, trustworthy institutional repository based on open-source software to house scientific data throughout its whole lifecycle, based out of the AMNH Research Library. I believe that providing a central storage, management, and preservation infrastructure at the AMNH would mitigate the risk of data loss, which only increases as the amount of data we generate does. A repository would answer the needs of the curators while putting the Library in a position to begin preserving the data so intrinsically important to the Museum.
  26. Thank you so much. You can find me on Twitter at Vicky Steeves and you can email me at my AMNH email listed there.