Data Preservation


  • Note for slide 2: The Data Intensive Computing Environment group at the San Diego Supercomputer Center has 16 full-time staff members and 6-10 associated graduate students, working on the following topics:
      • data handling systems (Wan, Rajasekar)
      • collection management (Rajasekar)
      • collection building (Kremenek, Zhu)
      • information management (Baru, Ludäscher, Marciano)
      • knowledge management (Ludäscher, Gupta)
      • presentation systems and GIS systems (Zaslavsky)
      • user interfaces (Cowart, Ludäscher, Marciano, Zaslavsky, Zhu)

    1. Preservation and Long Term Access to Data and Records in a Knowledge-based Society
       Reagan W. Moore, San Diego Supercomputer Center, [email_address]
    2. Data and Knowledge Systems Group
         • Staff: Reagan Moore, Ilkai Altintas, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, M. Kulrul, Bertram Ludäscher, Richard Marciano, A. Memon, XuFei Qian, Roman Olshanowsky, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu
         • Graduate Students: A. Bagchi, S. Bansal, A. Behere, R. Bharath, S. Bharath, L. Sui
         • Undergraduate Interns: N. Cotofana, D. Le, J. Trang, L. Yin, +/- NN
    3. Topics
         • Building persistent archives
         • Data grids
         • Authenticity mechanisms
         • Managing technology evolution
         • Knowledge-based access
    4. Archival Processes
         • Appraisal - determine the archivable content.
         • Accession - determine the initial physical location for the data, and the relationship of the new collection to existing collections.
         • Arrangement - add administration control, describe the information content (provenance, authenticity, structure, administrative), and decompose digital objects into their components as needed.
         • Description - complete the definition of collection attributes by iterating between arrangement, reformatting, and representation.
         • Preservation - build an archivable form of the digital entities, characterize the collection context, and manage their storage.
         • Access - provide query mechanisms for discovering, retrieving, and presenting the digital entities.
    5. ERA Concept model
    6. Common Approach (digital library, persistent archive, data grid)
         • Logical name space used to organize digital entities, and associate attributes
         • Separation of information management from data storage management
         • Definition of abstraction mechanisms for dealing with repositories
         • Emergence of need for knowledge management
    7. SDSC Storage Resource Broker & Meta-data Catalog: Levels of Abstraction
       [Diagram; only its labels survive, grouped here by apparent layer]
         • Application (client interfaces): C, C++, Libraries; Unix Shell; Linux I/O; DLL / Python; Java, NT Browsers; Web; WSDL; Prolog Predicate
         • Servers: Consistency Management / Authorization-Authentication; Logical Name Space; Latency Management; Data Transport; Metadata Transport
         • Catalog Abstraction: Databases (DB2, Oracle, Sybase)
         • Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX)
         • Other labels: HRM; Clients; Prime Server
    8. Authenticity
         • Guarantee that the data has not been changed
             • Collection owned data, only accessible through the data handling system
             • Support roles defining access (curation, owner, annotation, read)
             • Support access controls mapping users to roles
         • Audit trails that record all operations on files
         • Digital signatures - cryptographic checksums
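
The authenticity mechanisms on this slide can be illustrated with a minimal Python sketch. It is not the SRB implementation: the `Collection` class, the permission sets attached to each role, and the audit record layout are all assumptions made for illustration. It uses a plain SHA-256 checksum rather than a full digital signature.

```python
import hashlib
from datetime import datetime, timezone

# Role names come from the slide; the permissions granted to each role
# are an assumption made for this sketch.
ROLE_PERMISSIONS = {
    "curation": {"read", "annotate", "write"},
    "owner": {"read", "annotate", "write"},
    "annotation": {"read", "annotate"},
    "read": {"read"},
}


class Collection:
    """Toy collection that owns its data and logs every operation."""

    def __init__(self):
        self._objects = {}     # logical name -> (data, checksum)
        self._user_roles = {}  # user -> role
        self.audit_trail = []  # (timestamp, user, operation, name, allowed)

    def grant(self, user, role):
        """Map a user to one of the known roles."""
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self._user_roles[user] = role

    def _check(self, user, operation, name):
        """Record the attempt in the audit trail, then enforce the role."""
        role = self._user_roles.get(user)
        allowed = role is not None and operation in ROLE_PERMISSIONS[role]
        self.audit_trail.append(
            (datetime.now(timezone.utc), user, operation, name, allowed)
        )
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def put(self, user, name, data):
        """Store bytes together with their SHA-256 checksum."""
        self._check(user, "write", name)
        self._objects[name] = (data, hashlib.sha256(data).hexdigest())

    def get(self, user, name):
        """Return bytes only if they still match the stored checksum."""
        self._check(user, "read", name)
        data, checksum = self._objects[name]
        if hashlib.sha256(data).hexdigest() != checksum:
            raise ValueError(f"checksum mismatch for {name}")
        return data
```

Because every access goes through `put` and `get`, denied attempts are logged as well as successful ones, which is the property an audit trail needs.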
    9. Managing Technology Evolution
         • Data grids provide interoperability mechanisms to access data in multiple administration domains and multiple types of storage systems.
         • Persistent archives migrate collections from old technology to new technology to support presentation on new systems.
         • Both require the ability to access heterogeneous systems.
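
One way to picture the shared requirement is a uniform driver interface over different storage systems, so the same migration routine works regardless of the back end. The sketch below is hypothetical: `StorageDriver`, `MemoryStore`, `DirectoryStore`, and `migrate` are illustrative names, not part of any SDSC software, and object names are assumed to be flat (no path separators).

```python
import os
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Minimal uniform interface over heterogeneous storage systems."""

    @abstractmethod
    def read(self, name):
        """Return the bytes stored under name."""

    @abstractmethod
    def write(self, name, data):
        """Store bytes under name."""

    @abstractmethod
    def names(self):
        """Return a sorted list of stored names."""


class MemoryStore(StorageDriver):
    """Stands in for the old storage technology."""

    def __init__(self):
        self._data = {}

    def read(self, name):
        return self._data[name]

    def write(self, name, data):
        self._data[name] = data

    def names(self):
        return sorted(self._data)


class DirectoryStore(StorageDriver):
    """Stands in for the new storage technology (a local directory)."""

    def __init__(self, root):
        self._root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, name):
        return os.path.join(self._root, name)

    def read(self, name):
        with open(self._path(name), "rb") as handle:
            return handle.read()

    def write(self, name, data):
        with open(self._path(name), "wb") as handle:
            handle.write(data)

    def names(self):
        return sorted(os.listdir(self._root))


def migrate(old, new):
    """Copy every object from old to new, verifying each copy."""
    copied = []
    for name in old.names():
        data = old.read(name)
        new.write(name, data)
        if new.read(name) != data:
            raise IOError(f"verification failed for {name}")
        copied.append(name)
    return copied
```

The migration code never refers to a concrete storage class, so adding a new back end means writing one more driver rather than changing the archive logic.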
    10. Presentation of Digital Objects
        [Diagram] Components: Storage System, Operating System, Application, Digital Object, Display System
    11. Technology Management - Emulation
        [Diagram] Components: New Storage System, New Operating System, Old Application, Digital Object, New Display System. Change: Wrap Application
    12. Technology Management
        [Diagram] Components: New Storage System, New Operating System, Old Application, Digital Object, New Display System. Change: Add Operating System Call
    13. Technology Management
        [Diagram] Components: Old Storage System, New Operating System, Old Application, Digital Object, Old Display System. Changes: Add Operating System Call (shown twice)
    14. Technology Management - Migration
        [Diagram] Components: New Storage System, New Operating System, New Application, Digital Object, New Display System. Change: Migrate Encoding Format
    15. Technology Management - SDSC
        [Diagram] Components: Old Storage System, New Operating System, New Application, Digital Object, Old Display System. Changes: Wrap Storage System, Wrap Display System, Migrate Encoding Format
    16. Accessing Archived Data
         • Name transparency
             • Access data without knowing the file name
             • Map from attributes to a local file name
         • Location transparency
             • Access data without knowing where it is stored
             • Map from global file name to local file name
         • Collection transparency
             • Access data without knowing the collection attributes
             • Map from concept space to collection attributes
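
The first two mappings can be sketched as a pair of catalog lookups. Everything in the example is made up for illustration (catalog contents, resource names, paths, and the function names `find_by_attributes` and `locate`); a real metadata catalog would live in a database rather than in dictionaries.

```python
# Name transparency: attributes -> global (logical) file names.
ATTRIBUTE_CATALOG = {
    "/archive/census/1990.dat": {"topic": "census", "year": 1990},
    "/archive/census/2000.dat": {"topic": "census", "year": 2000},
}

# Location transparency: global file name -> (resource, local file name).
REPLICA_CATALOG = {
    "/archive/census/1990.dat": [("hpss-sd", "/hpss/u12/f0031")],
    "/archive/census/2000.dat": [
        ("unix-fs", "/data/c/2000.dat"),
        ("hpss-sd", "/hpss/u12/f0032"),
    ],
}


def find_by_attributes(**wanted):
    """Return the global names whose attributes match every given value."""
    return sorted(
        name
        for name, attrs in ATTRIBUTE_CATALOG.items()
        if all(attrs.get(key) == value for key, value in wanted.items())
    )


def locate(global_name, preferred=None):
    """Return one (resource, local name) replica, preferring a resource."""
    replicas = REPLICA_CATALOG[global_name]
    for resource, local_name in replicas:
        if resource == preferred:
            return resource, local_name
    return replicas[0]
```

A caller would first run `find_by_attributes(topic="census", year=2000)` and then pass each result to `locate`, never handling a local file name directly. Collection transparency adds one more lookup in front, from concepts to attribute names.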
    17. Information Management - Logical Name Space
         • Set of attributes to describe digital entities that are registered into the logical name space
             • SRB metadata - Unix file system semantics
             • Provenance metadata - Dublin Core
             • Resource metadata - User access control lists
             • Discipline metadata - User defined attributes
         • Each digital entity may have unique attributes
    18. Knowledge Management - Discovery across Collections
         • Mapping from collection attributes to discipline concepts
             • Make queries based on discipline concepts
         • Characterization of relationships between attributes
             • Semantic / logical - cross-walks
             • Procedural / temporal - records management
             • Structural / spatial - GIS
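
A cross-walk of the semantic kind can be sketched as a table that maps one discipline concept to the attribute name each collection uses for it. The collections, attribute names, and records below are invented for illustration, as is the function `query_by_concept`.

```python
# Hypothetical cross-walk: discipline concept -> attribute name per collection.
CROSSWALK = {
    "creator": {"library": "dc_creator", "records": "originating_office"},
    "date": {"library": "dc_date", "records": "filing_date"},
}

# Two toy collections that describe the same concepts with different attributes.
COLLECTIONS = {
    "library": [
        {"id": "lib-1", "dc_creator": "Smith", "dc_date": "2001"},
        {"id": "lib-2", "dc_creator": "Jones", "dc_date": "2002"},
    ],
    "records": [
        {"id": "rec-7", "originating_office": "Smith", "filing_date": "2001"},
    ],
}


def query_by_concept(concept, value):
    """Search every collection using its own attribute for the concept."""
    hits = []
    for collection, attribute in CROSSWALK[concept].items():
        for item in COLLECTIONS[collection]:
            if item.get(attribute) == value:
                hits.append((collection, item["id"]))
    return hits
```

The query is phrased once, in terms of the concept, and the cross-walk translates it for each collection. Temporal and spatial relationships would need richer structures than this flat table.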
    19. Knowledge Based Data Grids
        [Diagram; only its labels survive]
         • Layers: Knowledge, Information, Data
         • Functions: Ingest Services, Management, Access Services
         • Ingest labels: Relationships Between Concepts; Attributes, Semantics; Fields, Containers, Folders
         • Management labels: Knowledge Repository for Rules; Information Repository; Storage (Replicas, Persistent IDs)
         • Access labels: Knowledge or Topic-Based Query / Browse; Attribute-based Query; Feature-based Query
         • Interface labels: Rules - KQL; XTM DTD; XML DTD; SDLIP; MCAT/HDF; Grids; (Model-based Access); (Data Handling System - SRB)
    20. Further Information