Claudia Bauzer Medeiros  Digital preservation – caring for our data to foster knowledge discovery and dissemination

Published in Education, Technology
Transcript

  • 1. Digital preservation – caring for our data to foster knowledge discovery and dissemination. Claudia Bauzer Medeiros, Institute of Computing, UNICAMP
  • 2. Prae-servare: (before) + (save) = save before it disappears
  • 3. Maintain: manu-tenere = being able to get/find it
  • 4. Dec 2008, Feb 2010
  • 5. Data deluge
      • At end of 2011 – info created and replicated > 1.8 zettabytes
      • 90% of data created in the last 2 years
      • 5-hour flight – 240 Tbytes
      • Facebook – 200 million users, >70 languages
      • Each person in England is filmed 300 times/day
      • Teenagers in the US send an average of 110 phone text messages a day
      => We need to build arks during the deluge: PRESERVATION
  • 6. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      • Where to preserve?
      And a few associated challenges
  • 8. WHY PRESERVE
      • Costly to produce
      • Contribute to progress of science
      • Intrinsic value: culture/science/sustainability
  • 9. WHY PRESERVE
      • Costly to produce
          – Infrastructure, power, software, models, visualization, people
          – Hardware, Software, Peopleware
      • Contribute to progress of science
          – Reproducibility and reusability
          – Publication and sharing
          – Quality
      • Intrinsic value: culture/science/sustainability
          – Digital humanities
          – Domesday project
          – Fonoteca Neotropical Jacques Vielliard
  • 12. The Domesday Project 1086-1986
      • Digital decay
      • Equipment obsolescence
      • Software obsolescence
  • 13. Domesday reloaded
  • 14. Fonoteca Neotropical Jacques Vielliard
  • 15. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      And associated challenges
  • 16. What to preserve?
      • Data
      • BUT what is “data”?
      • Only data?
  • 17. What to preserve?
      • Data
      • BUT what is “data”?
          – Files and records
          – Models, documentation, annotations, sketches, experiments, recordings
      • Only data?
  • 18. What to preserve?
      • Data
      • BUT what is “data”?
          – Files and records
          – Models, documentation, annotations, sketches, experiments, recordings
      • Only data?
          – How it was produced: workflows, devices, methodologies, materials and methods, reasoning, logs (provenance)
  • 19. What to preserve?
      • Data
      • Environment in which it was produced
      • The data needed for preservation occupies more space than the data itself
      • Preservation means storing more than the object itself
  • 20. What about our research data? (slide adapted from Jim Gray)
      [Diagram labels: Experiments, Instruments, Files, Questions, Papers, Answers, Simulations, Models, DATA]
      Data-driven science; “Collaboratory”
  • 21. Data sources? Table of Product Characteristics
      id          Property name   Value
      MilkProd    productsrep     MilkA
      MilkProd    quantity        10000
      MilkProd    validity date   10/06/2006
      CheeseProd  productsrep     MinasCheese
      CheeseProd  quantity        2000
      CheeseProd  validity date   12/02/2006
      CheeseProd  shape           Circular
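The table on slide 21 stores one property per row, so different products can carry different sets of properties (only CheeseProd has a shape). As a minimal sketch, assuming that reading of the table, the rows could be regrouped into one record per product as follows. The function name `pivot` and the variable names are hypothetical and are not part of the original slides.

```python
from collections import defaultdict

# Rows copied from the slide's table: (id, property name, value).
ROWS = [
    ("MilkProd", "productsrep", "MilkA"),
    ("MilkProd", "quantity", "10000"),
    ("MilkProd", "validity date", "10/06/2006"),
    ("CheeseProd", "productsrep", "MinasCheese"),
    ("CheeseProd", "quantity", "2000"),
    ("CheeseProd", "validity date", "12/02/2006"),
    ("CheeseProd", "shape", "Circular"),
]


def pivot(rows):
    """Group (id, property, value) triples into one dict per id."""
    records = defaultdict(dict)
    for entity_id, prop, value in rows:
        records[entity_id][prop] = value
    return dict(records)


if __name__ == "__main__":
    for entity_id, properties in pivot(ROWS).items():
        print(entity_id, properties)
```

Values are kept as strings, exactly as they appear in the table; interpreting dates or quantities would need information the slide does not give.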
  • 22. e-Environmental Science
      • Direct and indirect observations
  • 23. Data sources
  • 25. We are DATASCOPE engineers. Software is the device/tool.
  • 26. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      And associated challenges
  • 27. How to preserve? How to construct the ark during the deluge? Prae-servare, manu-tenere and share
  • 28. How to preserve?
      • To ensure retrievability and sharing
          – Index structures
          – Ontologies, metadata, keywords, standards
          – Workflows
      • To ensure longevity
          – Media decay, software decay, hardware decay
      • To ensure quality
          – Curation procedures
      • To afford maintenance costs
          – Cloud? CAP theorem?
  • 31. How to preserve?
      • To ensure retrievability and sharing
          – Index structures
          – Ontologies, metadata, keywords, standards
          – Workflows
      • To ensure longevity
          – Media decay, software decay, hardware decay
      • To ensure quality
          – Curation procedures, metadata, standards
      • To afford maintenance costs
          – Cloud? CAP theorem?
  • 32. How to preserve?
      • To ensure retrievability and sharing
          – Index structures
          – Ontologies, metadata, keywords, standards
          – Workflows
      • To ensure longevity
          – Media decay, software decay, hardware decay
      • To ensure quality
          – Curation procedures, metadata, standards
      • To afford maintenance costs
          – Cloud? CAP theorem? => WHERE
  • 33. How to preserve?
      • To ensure retrievability and sharing
          – Index structures
          – Ontologies, metadata, keywords, standards
          – Workflows
      • To ensure longevity
          – Media decay, software decay, hardware decay
          – PEOPLE DECAY
      • To ensure quality
          – Curation procedures, metadata, standards
      • To afford maintenance costs
          – Cloud? CAP theorem? => WHERE
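The slides name metadata and keywords as the means for retrievability, and media decay as a threat to longevity. One common way to detect silent corruption is to store a checksum alongside the metadata and recheck it later. The sketch below illustrates that idea under stated assumptions: the slides do not prescribe checksums, and the field names, the function names `describe` and `is_intact`, and the sample bytes are all hypothetical.

```python
import hashlib
import json


def describe(data, keywords, produced_by):
    """Build a metadata record for a data object (field names are illustrative)."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the content
        "size_bytes": len(data),
        "keywords": sorted(keywords),                # supports keyword retrieval
        "produced_by": produced_by,                  # minimal provenance note
    }


def is_intact(data, record):
    """Return True if the data still matches the checksum stored in the record."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


if __name__ == "__main__":
    recording = b"placeholder bytes standing in for a sound recording"
    record = describe(recording, ["recording", "bird"], "hypothetical field recorder")
    print(json.dumps(record, indent=2))
    print("intact:", is_intact(recording, record))
    print("intact after one extra byte:", is_intact(recording + b"\x00", record))
```

A checksum only reveals that the bits changed; repairing the object still requires a second copy, and it does nothing against software, hardware or people decay.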
  • 34. Sharing and open access: NSF Data Management Policy; paper and data publication
  • 35. “Sharing of Data Leads to Progress on Alzheimer’s”, by Gina Kolata, The New York Times, published August 12, 2010
      In 2003, a group of scientists and executives from the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups joined in a project that experts say had no precedent: a collaborative effort to find the biological markers that show the progression of Alzheimer’s disease in the human brain.
      Share all the data, making every single finding public immediately, available to anyone with a computer anywhere in the world.
      => AVAILABILITY and REUSE
  • 36. “Data must be properly curated throughout its life-cycle and released with the appropriate high-quality metadata.” (Medical Research Council, UK)
  • 37. “Research data should be made available for use by other researchers. Researchers must retain research data, including electronic data, in a durable, indexed and retrievable form.” (Australian Government, National Health and Medical Research Council)
  • 38. Microsoft Academic Search
      • 40M publications
      • 19M authors
      • 75 publishers (Wiley, Springer, ACM, IEEE …)
  • 39. Google Scholar Citations
  • 40. Data citation
      • Citing data is as important as citing papers
      • For researchers, publishers, data centers
      • Over 1M DOIs, several major national research libraries
          – Germany, France, Korea, Netherlands, Australia, USA...
      • Present manager – German National Library of Science and Technology
  • 41. Publish on the Cloud; add metadata; pre-print sharing
  • 42. FNJV: proj.lis.ic.unicamp.br/fnjv
      • Sharing by publishing on the Web
      • Retrievability by extending metadata
  • 43. CURATION AND USE OF STANDARDS
  • 44. Workflows and model preservation
  • 45. Workflows and model preservation: Comb-e-Chem
      [Diagram labels: Video, Simulation, Properties, Analysis, Diffractometer, Structures Database, X-Ray e-Lab, Properties e-Lab, Grid Middleware]
  • 46. The cloud and CAP
  • 47. Outline
      • Why preserve?
      • What to preserve?
      • How to preserve?
      • Where to preserve?
      And a few associated challenges
      PRE-SAVE and MANU-TENERE
  • 48. Outline
      • Why preserve?
          – Costly to produce (hardware, software, peopleware)
          – Contribute to progress of science
          – Value – culture, science, sustainability
      • What to preserve?
          – Data [WHAT IS DATA?]
          – Context of production and use
      • How to preserve?
          – Accessibility and sharing – standards, metadata, ontologies
          – Integrity and quality – context to use (hw, sw), standards
  • 49. References
  • 50. References
      • NSF – CISE Data management policy
      • The Domesday Project: http://www.atsf.co.uk/dottext/domesday.html
      • The CLARIN Project (languages)
      • Eigenfactor.org
      • Altmetrics movement
  • 51. Thank you!!!!