RDAP12 Wrap-Up: Reagan Moore

Presentation at Research Data Access & Preservation Summit
23 March 2012

Published in: Technology, Education

  1. RDAP Summary
     Topics that drive future digital libraries
     Reagan Moore
     4/4/2012, ASIST RDAP 2012
  2. Topics
     • Data Management Plans and Policies
       – Scientific research data support
       – Planning for NSF Data Management Plans
     • Data Citation Panel
       – Digital identifiers
       – Data representation (context)
     • Curation Service Models
       – Institution-based repositories
     • SIG-DL Sustainability Panel
       – Cost model
       – Business model
     • Training Data Management Practitioners
       – Theory for information and knowledge, but not digital data
       – Teaching eScience librarians how to manage data for researchers
  3. Data Management Plans
     • Enforcement of regulations:
       – IRB, FERPA, HIPAA
     • Enforcement of agency policies:
       – NSF Data management plans
     • Enforcement of institutional policies:
       – Trustworthiness
     • Compliance with community consensus on collection properties
       – Compliance with standards for discovery and access
     • Enforcement of management policies:
       – Integrity, authenticity, retention, disposition, replication
     • Automation of administrative tasks
       – Migration
     • Validation of assessment criteria
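
The "validation of assessment criteria" bullet can be illustrated with a small check that runs two of the management policies named above (integrity and replication) against a file record. This is a minimal sketch, not the mechanism used by any particular data grid: the `FileRecord` fields, the `assess` function, and the two-replica threshold are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical catalog entry for one file in a collection."""
    name: str
    content: bytes
    registered_sha256: str   # checksum recorded when the file was ingested
    replica_count: int


def assess(record, min_replicas=2):
    """Return the names of the policies this record currently violates."""
    violations = []
    # Integrity: the stored bytes must still match the registered checksum.
    if hashlib.sha256(record.content).hexdigest() != record.registered_sha256:
        violations.append("integrity")
    # Replication: the assumed policy requires at least min_replicas copies.
    if record.replica_count < min_replicas:
        violations.append("replication")
    return violations


checksum = hashlib.sha256(b"abc").hexdigest()
good = FileRecord("a.dat", b"abc", checksum, 2)
bad = FileRecord("b.dat", b"abd", checksum, 1)   # corrupted and under-replicated
print(assess(good), assess(bad))
```

Running such a check periodically over every record is one way to turn a written policy into an automated administrative task.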
  4. Data Identifiers
     • Generate identifiers that are location independent
       – Handle system, hash function
       – Data management system updates link from identifier to representation of location (replicas)
     • Given an identifier, what does it represent?
       – Landing page that provides context for the data
       – Data model that approximates data in space and time
       – Direct access to the data
       – Access to procedure that generates the data
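
The hash-function option for location-independent identifiers, together with a data management system that maintains the link from identifier to replica locations, can be sketched as follows. The `ReplicaCatalog` class and the site paths are hypothetical and stand in for whatever catalog a real system (for example a Handle service) would provide.

```python
import hashlib


def make_identifier(content):
    """Derive an identifier from the bytes themselves, not from where they live."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class ReplicaCatalog:
    """Hypothetical catalog mapping each identifier to its current replica locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, location):
        self._locations.setdefault(identifier, set()).add(location)

    def move(self, identifier, old, new):
        # Only the catalog entry changes; the identifier given to users does not.
        replicas = self._locations[identifier]
        replicas.discard(old)
        replicas.add(new)

    def resolve(self, identifier):
        return sorted(self._locations.get(identifier, set()))


data = b"temperature,12.5\n"
pid = make_identifier(data)
catalog = ReplicaCatalog()
catalog.register(pid, "siteA:/vault/obs.csv")
catalog.register(pid, "siteB:/archive/obs.csv")
catalog.move(pid, "siteA:/vault/obs.csv", "siteC:/vault/obs.csv")
print(catalog.resolve(pid))
```

A content hash also lets anyone holding a replica verify that it matches the identifier, which a purely assigned name cannot do.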
  5. Data Identifiers
     • For derived data
       – NASA Level 0 – raw data
       – NASA Level 1 – Calibrated
       – NASA Level 2 – Transformed to physical quantities
       – NASA Level 3 – Functional transformations, projections
     • Can we identify the process that created the data?
       – Generalization of workflow provenance
       – Re-execute the workflow to re-create the data
     • Create identifier for the workflow
       – Need workflow virtualization
     • Reproducible science
  6. Curation Service Models
     • Driven by user requirements
       – Unique services for each science and engineering domain
       – Different data formats, data analyses, semantics
     • Can generic software support each unique collection?
       – View curation as a continuum with varying policies and procedures for each stage of the data life cycle
       – Characterize domains by access methods, policies, and procedures
     • Are there standard best practices for a data center?
       – Data colocation – minimize administrative costs
       – Evolution of center to broaden range of supported communities
  7. Standard Services
     • Data discovery
     • Data access
     • Data manipulation
       – Re-creation of derived data products
       – Transformation
       – Feature detection
       – Indexing
       – Representation – fit polynomial in space and time
         • Manipulate data based on polynomial
  8. Sustainability
     • Business models
       – Identification of a sustaining community
       – Quantification of benefit
     • Cost model
       – Distribution of cost across entire community
       – Membership fee
       – Pro-rated per item cost
     • Minimizing cost
       – Automate curation
       – Transfer curation tasks to submitter
       – FITS file (astronomy)
         • Metadata for project/observatory
         • Metadata for each image
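
One way to read the cost-model bullets is that a flat membership fee is combined with a per-item charge, so that the total cost is distributed across the whole community. The sketch below works through that arithmetic. The function name `allocate_costs`, the three member names, and every dollar and item figure are invented for illustration and do not come from the presentation.

```python
def allocate_costs(total_cost, membership_fee, items_by_member):
    """Charge each member a flat fee plus a pro-rated share of the remaining cost."""
    member_count = len(items_by_member)
    total_items = sum(items_by_member.values())
    # Whatever the flat fees do not cover is spread evenly over all items.
    remainder = max(total_cost - membership_fee * member_count, 0.0)
    per_item = remainder / total_items if total_items else 0.0
    return {
        member: membership_fee + per_item * item_count
        for member, item_count in items_by_member.items()
    }


# Hypothetical repository: 100,000 per year, three member groups.
shares = allocate_costs(
    total_cost=100_000.0,
    membership_fee=10_000.0,
    items_by_member={"physics": 6_000, "biology": 3_000, "history": 1_000},
)
print(shares)
```

With these assumed figures the flat fees cover 30,000, the remaining 70,000 works out to 7 per item, and the three shares add back up to the full 100,000.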
  9. Creating a Repository
     • Identify a support community
       – Tie to requirements of researchers
       – Tie to new science and research initiatives
       – Tie to intellectual capital of the university
     • Identify cost benefit
       – Co-location of services
       – Benefit of scale
     • Demonstrate responsiveness
       – Support for users
  10. Educating Next Generation
      • Identify a motivating challenge
      • Curriculum development
        – Coupling of research to education
        – Competency in scientific data management and technology
      • Data intensive science
        – Interest driven by a domain
        – Multi-disciplinary problems
        – Treat as a skill
      • Work with live data
        – Enable students to make a discovery
  11. Data – Information – Knowledge (iRODS)
      • Data – instantiation of an approximation to reality
        – Form of representation of reality
        – Requires description of the physical approximation (context)
      • Information – application of label to data
        – Requires identification of the relationships that must be satisfied for the label to be applied
        – Reification of knowledge (extraction of features)
      • Knowledge – relationships between labels
        – Requires procedures to parse data to see if relationships are present
      • Data science – transformation of data into knowledge
        – Use case driven
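
The three definitions on this slide can be made concrete with a toy example: data carries its context, information is a label applied when a stated relationship holds, and knowledge is a relationship between labels. The sketch below is an illustration only and is not iRODS code. The `Datum` class, the two label rules, and the co-occurrence measure are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class Datum:
    """Data: values plus the context that describes the approximation."""
    values: list
    context: dict
    labels: set = field(default_factory=set)


# Information: each label names the relationship that must hold for it to apply.
LABEL_RULES = {
    "rising": lambda v: all(a < b for a, b in zip(v, v[1:])),
    "above_freezing": lambda v: min(v) > 0.0,
}


def apply_labels(datum):
    """Parse the data and attach every label whose relationship is satisfied."""
    for label, rule in LABEL_RULES.items():
        if rule(datum.values):
            datum.labels.add(label)


def co_occurrence(collection, first, second):
    """Knowledge: how often items labelled `first` are also labelled `second`."""
    with_first = [d for d in collection if first in d.labels]
    if not with_first:
        return 0.0
    return sum(second in d.labels for d in with_first) / len(with_first)


collection = [
    Datum([1.0, 2.5, 4.0], {"units": "degC"}),
    Datum([-3.0, -1.0, 2.0], {"units": "degC"}),
    Datum([5.0, 4.0, 6.0], {"units": "degC"}),
]
for datum in collection:
    apply_labels(datum)
print(co_occurrence(collection, "rising", "above_freezing"))
```

Adding a new entry to `LABEL_RULES` and re-running `apply_labels` corresponds to defining a new relationship and regenerating the labels for the collection.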
  12. Digital Library Evolution
      • Witnessing rapid evolution of digital libraries
        – Item level indexing
        – Item level searching
        – Data manipulation services
      • Driven by scale
        – Completeness of semantics
          • Represent every word in the English language (15 million)
          • Represent cultural knowledge (~ 1 Tbyte)
        – Types of reified relationships
          • Index based on more than 100 relationships present within documents (IBM-Watson)
          • Spatial, temporal, organizational, familial, …
        – Ability to couple indexing to data within storage
  13. Vision
      • Dynamic digital library
        – Continually extract features from data
        – Generate index based on features within the data
      • Create knowledge base
        – Link local index to community index
      • Support evolution of the library
        – Define new relationships
        – Analyze contents
        – Generate new index
  14. Implications
      • Characterize scientific data by the workflow that creates the published version
        – Transform from a library of data files into a library of workflows
      • Support re-execution of workflows
        – Modify input parameters, generate new version
      • Generate discovery semantics (features) through reification of relationships
        – Must be able to parse each file
        – Create algorithm that tests for the desired relationship
        – Apply algorithms within storage systems
        – Build terabyte index of reified relationships for each storage system
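
The idea of a library of workflows, where a published version is characterized by the procedure and parameters that create it, can be sketched as below. This is a simplified illustration under stated assumptions: the `calibrate` step, the `WorkflowLibrary` class, the `level0-demo` input name, and the truncated SHA-256 identifier scheme are all hypothetical.

```python
import hashlib
import json


def calibrate(raw, gain, offset):
    """Hypothetical processing step that turns raw values into calibrated ones."""
    return [gain * x + offset for x in raw]


WORKFLOWS = {"calibrate": calibrate}   # assumed registry of named procedures


def workflow_id(name, params, input_id):
    """Identify a derived product by the workflow, parameters, and input that make it."""
    record = json.dumps(
        {"workflow": name, "params": params, "input": input_id}, sort_keys=True
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()[:16]


class WorkflowLibrary:
    """Stores recipes rather than derived files; products are re-created on demand."""

    def __init__(self):
        self.versions = {}

    def run(self, name, raw, input_id, **params):
        wid = workflow_id(name, params, input_id)
        self.versions[wid] = {"workflow": name, "params": params, "input": input_id}
        return wid, WORKFLOWS[name](raw, **params)

    def rerun(self, wid, raw, **overrides):
        # Re-execute a recorded workflow; changed parameters yield a new version.
        recipe = self.versions[wid]
        params = {**recipe["params"], **overrides}
        return self.run(recipe["workflow"], raw, recipe["input"], **params)


library = WorkflowLibrary()
raw = [1.0, 2.0, 3.0]
v1_id, v1 = library.run("calibrate", raw, "level0-demo", gain=2.0, offset=0.5)
same_id, same = library.rerun(v1_id, raw)
v2_id, v2 = library.rerun(v1_id, raw, gain=3.0)
print(v1, v2, v1_id != v2_id)
```

Re-running with unchanged parameters reproduces the same identifier and output, while changing `gain` registers a second version alongside the first.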
  15. Virtualization
      • Digital library represents data as searchable metadata
      • Collection virtualization defines and manages the properties of the collection
        – Assertions about each file in the collection
        – Location independent naming and access
        – Management of state information
      • Workflow virtualization defines the properties of procedures
        – Provenance information for each procedure
        – Location independent naming and execution
        – Management of state information
  16. Digital Library in 2050
      • Links contents to cultural knowledge
        – Terabyte indices
      • Enables analysis of library contents
        – Feature detection services
      • Provides workspace in which research is conducted
        – Coupling of processing to data storage
      • Validates assertions about collection properties
        – Published policies
      • Scalable infrastructure
