Claudia Bauzer Medeiros Digital preservation – caring for our data to foster knowledge discovery and dissemination

Published in: Education, Technology

  1. 1. Digital preservation – caring for our data to foster knowledge discovery and dissemination. Claudia Bauzer Medeiros, Institute of Computing, UNICAMP
  2. 2. Preserve: prae-servare (before + save) = save it before it disappears
  3. 3. Maintain: manu-tenere = being able to get/find it
  4. 4. Dec 2008, Feb 2010
  5. 5. Data deluge
      • At the end of 2011, information created and replicated exceeded 1.8 zettabytes
      • 90% of data was created in the last 2 years
      • A 5-hour flight: 240 Tbytes
      • Facebook: 200 million users, more than 70 languages
      • Each person in England is filmed 300 times/day
      • Teenagers in the US send an average of 110 phone text messages a day
      => We need to build arks during the deluge: PRESERVATION
  6. 6. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      • Where to preserve?
      And a few associated challenges
  8. 8. WHY PRESERVE
      • Costly to produce
      • Contribute to progress of science
      • Intrinsic value: culture/science/sustainability
  9. 9. WHY PRESERVE
      • Costly to produce
        – Infrastructure, power, software, models, visualization, people
        – Hardware, Software, Peopleware
      • Contribute to progress of science
        – Reproducibility and reusability
        – Publication and sharing
        – Quality
      • Intrinsic value: culture/science/sustainability
        – Digital humanities
        – Domesday project
        – Fonoteca Neotropical Jacques Vieillard
  12. 12. The Domesday Project 1086-1986
      • Digital decay
      • Equipment obsolescence
      • Software obsolescence
  13. 13. Domesday reloaded
  14. 14. Fonoteca Neotropical Jacques Vieillard
  15. 15. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      And associated challenges
  16. 16. What to preserve?
      • Data
      • BUT what is “data”?
      • Only data?
  17. 17. What to preserve?
      • Data
      • BUT what is “data”?
        – Files and records
        – Models, documentation, annotations, sketches, experiments, recordings
      • Only data?
  18. 18. What to preserve?
      • Data
      • BUT what is “data”?
        – Files and records
        – Models, documentation, annotations, sketches, experiments, recordings
      • Only data?
        – How it was produced: workflows, devices, methodologies, materials and methods, reasonings, logs (provenance)
  19. 19. What to preserve?
      • Data
      • The environment in which it was produced
      • The data needed for preservation occupies more space than the data itself
      • Preservation means storing more than the object itself
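A minimal sketch of the "store more than the object itself" idea, assuming a JSON sidecar is kept next to each data file. The record layout, field names, and sample values are hypothetical illustrations, not something the slides prescribe.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """Context needed to interpret a data file later (hypothetical layout)."""

    data_file: str
    produced_by: str
    instrument: str
    workflow_steps: list[str] = field(default_factory=list)
    methodology_notes: str = ""


def to_sidecar_json(record: ProvenanceRecord) -> str:
    """Serialize the record so it can be stored alongside the data file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Hypothetical example: a tiny data file and the context that explains it.
raw_data = b"12.5,13.1\n"
record = ProvenanceRecord(
    data_file="readings.csv",
    produced_by="field team (hypothetical)",
    instrument="hypothetical-sensor-01",
    workflow_steps=["collect", "calibrate", "export to CSV"],
    methodology_notes="Two temperature readings taken one hour apart.",
)
sidecar = to_sidecar_json(record)

# For small objects the context easily outweighs the data itself.
print(len(raw_data), len(sidecar.encode("utf-8")))
```

In this toy case the sidecar is several times larger than the data it describes, which is the point made on the slide.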
  20. 20. What about our research data? (slide adapted from Jim Gray)
      [Diagram labels: Experiments, Instruments, Simulations, Models, Files, Papers, DATA, Questions, Answers. Data-driven science, “Collaboratory”]
  21. 21. Data sources? Table of Product Characteristics
      id          Property name   Value
      MilkProd    productsrep     MilkA
      MilkProd    quantity        10000
      MilkProd    validity date   10/06/2006
      CheeseProd  productsrep     Minas
      CheeseProd  quantity        2000
      CheeseProd  validity date   12/02/2006
      CheeseProd  shape           Circular
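The table on this slide stores one (id, property name, value) triple per row. As a rough illustration, the sketch below pivots such rows into one record per product. The row values are a best-effort reading of the garbled slide text (the "Minas" value in particular is a guess), and the function name is hypothetical.

```python
from collections import defaultdict

# (id, property name, value) rows as read from the slide; "Minas" is a guess.
rows = [
    ("MilkProd", "productsrep", "MilkA"),
    ("MilkProd", "quantity", "10000"),
    ("MilkProd", "validity date", "10/06/2006"),
    ("CheeseProd", "productsrep", "Minas"),
    ("CheeseProd", "quantity", "2000"),
    ("CheeseProd", "validity date", "12/02/2006"),
    ("CheeseProd", "shape", "Circular"),
]


def pivot_properties(triples):
    """Group property/value pairs under the id they describe."""
    records = defaultdict(dict)
    for entity_id, property_name, value in triples:
        records[entity_id][property_name] = value
    return dict(records)


products = pivot_properties(rows)
for entity_id, properties in products.items():
    print(entity_id, properties)
```

Note that the two products do not share the same set of properties (only CheeseProd has a shape), which is what this row-per-property layout accommodates.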
  22. 22. eEnvironmental Science
      • Direct and indirect observations
  23. 23. Data sources
  24. 24.
  25. 25. We are DATASCOPE engineers. Software is the device/tool
  26. 26. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      And associated challenges
  27. 27. How to preserve? How to construct the ark during the deluge? Prae-servare, manu-tenere and share
  28. 28. How to preserve?
      • To ensure retrievability and sharing
        – Index structures
        – Ontologies, metadata, keywords, standards
        – Workflows
      • To ensure longevity
        – Media decay, software decay, hardware decay
      • To ensure quality
        – Curation procedures
      • To afford maintenance costs
        – Cloud? CAP theorem?
  31. 31. How to preserve?
      • To ensure retrievability and sharing
        – Index structures
        – Ontologies, metadata, keywords, standards
        – Workflows
      • To ensure longevity
        – Media decay, software decay, hardware decay
      • To ensure quality
        – Curation procedures, metadata, standards
      • To afford maintenance costs
        – Cloud? CAP theorem?
  32. 32. How to preserve?
      • To ensure retrievability and sharing
        – Index structures
        – Ontologies, metadata, keywords, standards
        – Workflows
      • To ensure longevity
        – Media decay, software decay, hardware decay
      • To ensure quality
        – Curation procedures, metadata, standards
      • To afford maintenance costs
        – Cloud? CAP theorem? => WHERE
  33. 33. How to preserve?
      • To ensure retrievability and sharing
        – Index structures
        – Ontologies, metadata, keywords, standards
        – Workflows
      • To ensure longevity
        – Media decay, software decay, hardware decay
        – PEOPLE DECAY
      • To ensure quality
        – Curation procedures, metadata, standards
      • To afford maintenance costs
        – Cloud? CAP theorem? => WHERE
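The slides name media decay as a threat to longevity but do not say how to detect it. One common practice, assumed here rather than taken from the talk, is a fixity check: record a checksum when a file is archived and recompute it periodically. The sketch below uses only the Python standard library; the manifest layout, function names, and file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_decayed(manifest: dict[str, str], root: Path) -> list[str]:
    """Return names that are missing or whose checksum no longer matches."""
    damaged = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(name)
    return damaged


# Demonstration in a throwaway directory: archive, verify, corrupt, verify.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "recording.wav").write_bytes(b"original audio bytes")
    manifest = {"recording.wav": sha256_of(root / "recording.wav")}
    before = find_decayed(manifest, root)
    (root / "recording.wav").write_bytes(b"corrupted audio bytes")
    after = find_decayed(manifest, root)

print(before, after)
```

A check like this only detects damage. Recovering from it still requires a second copy, and it does nothing about software, hardware, or people decay.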
  34. 34. Sharing and open access
      • NSF Data Management Policy
      • Paper and data publication
  35. 35. Sharing of Data Leads to Progress on Alzheimer’s. By Gina Kolata, The New York Times, published August 12, 2010.
      In 2003, a group of scientists and executives from the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups joined in a project that experts say had no precedent: a collaborative effort to find the biological markers that show the progression of Alzheimer’s disease in the human brain.
      Share all the data, making every single finding public immediately, available to anyone with a computer anywhere in the world
      => AVAILABILITY and REUSE
  36. 36. “Data must be properly curated throughout its life-cycle and released with the appropriate high-quality metadata.” (Medical Research Council, UK)
  37. 37. “Research data should be made available for use by other researchers. Researchers must retain research data, including electronic data, in a durable, indexed and retrievable form.” (Australian Government National Health and Medical Research Council)
  38. 38. Microsoft Academic Search
      • 40M publications
      • 19M authors
      • 75 publishers (Wiley, Springer, ACM, IEEE …)
  39. 39. Google Scholar Citations
  40. 40. Data citation
      • Citing data is as important as citing papers
      • For researchers, publishers, data centers
      • Over 1M DOIs, several major national research libraries
        – Germany, France, Korea, Netherlands, Australia, USA...
      • Present manager: German National Library of Science and Technology
  41. 41. Publish on the Cloud
      • Add metadata
      • Pre-print sharing
  42. 42. FNJV: proj.lis.ic.unicamp.br/fnjv
      • Sharing by publishing on the Web
      • Retrievability by extending metadata
  43. 43. CURATION AND USE OF STANDARDS
  44. 44. Workflows and model preservation
  45. 45. Workflows and model preservation: Comb-e-Chem
      [Diagram labels: Video, Simulation, Properties, Analysis, Diffractometer, Structures Database, X-Ray e-Lab, Properties e-Lab, Grid Middleware]
  46. 46. The cloud and CAP
  47. 47. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      • Where to preserve?
      And a few associated challenges
      PRE-SAVE and MANU-TENERE
  48. 48. Outline
      • Why preserve?
        – Costly to produce (hardware, software, peopleware)
        – Contribute to progress of science
        – Value: culture, science, sustainability
      • What to preserve?
        – Data [WHAT IS DATA?]
        – Context of production and use
      • How to preserve?
        – Accessibility and sharing: standards, metadata, ontologies
        – Integrity and quality: context to use (hw, sw), standards
  49. 49. References
  50. 50. References
      • NSF – CISE Data management policy
      • The Domesday Project: http://www.atsf.co.uk/dottext/domesday.html
      • The CLARIN Project (languages)
      • Eigenfactor.org
      • Altmetrics movement
  51. 51. Thank you!!!!
