1. RDAP Summary
Topics that drive future digital libraries
Reagan Moore
4/4/2012 ASIST RDAP 2012 1
2. Topics
• Data Management Plans and Policies
– Scientific research data support
– Planning for NSF Data Management Plans
• Data Citation Panel
– Digital identifiers
– Data representation (context)
• Curation Service Models
– Institution-based repositories
• SIG-DL Sustainability Panel
– Cost model
– Business model
• Training Data Management Practitioners
– Theory exists for information and knowledge, but not for digital data
– Teaching eScience librarians how to manage data for researchers
3. Data Management Plans
• Enforcement of regulations:
– IRB, FERPA, HIPAA
• Enforcement of agency policies:
– NSF Data management plans
• Enforcement of institutional policies:
– Trustworthiness
• Compliance with community consensus on collection properties
– Compliance with standards for discovery and access
• Enforcement of management policies:
– Integrity, authenticity, retention, disposition, replication
• Automation of administrative tasks
– Migration
• Validation of assessment criteria
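The integrity and replication policies listed above could be validated automatically along the following lines. This is a minimal sketch, not a description of any particular data management system: it assumes a SHA-256 checksum recorded at ingest and a hypothetical minimum of two good replicas, and the names `validate`, `record`, and `replicas` are illustrative only.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Checksum used here as the integrity criterion (an assumption)."""
    return hashlib.sha256(data).hexdigest()


def validate(record: dict, replicas: dict, min_replicas: int = 2) -> list:
    """Return the names of the policies that one collection item violates."""
    violations = []
    # A replica is good when its content still matches the recorded checksum.
    good = [loc for loc, data in replicas.items()
            if sha256_hex(data) == record["checksum"]]
    if len(good) != len(replicas):
        violations.append("integrity")
    if len(good) < min_replicas:
        violations.append("replication")
    return violations


# Hypothetical item with one intact and one corrupted replica.
original = b"sensor readings"
record = {"name": "run-001", "checksum": sha256_hex(original)}
replicas = {"site-a": original, "site-b": b"sensor readings (corrupted)"}
print(validate(record, replicas))
```

Running such a check periodically is one way to turn a stated policy into an assessment criterion that can be verified.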
4. Data Identifiers
• Generate identifiers that are location-independent
– Handle system, hash function
– Data management system updates link from identifier to representation of location (replicas)
• Given an identifier, what does it represent?
– Landing page that provides context for the data
– Data model that approximates data in space and time
– Direct access to the data
– Access to procedure that generates the data
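To illustrate the hash-function option above, the sketch below derives an identifier from the content itself and keeps a separate table from identifier to replica locations, so data can move without the identifier changing. It is an assumption-laden toy: the `Resolver` class, the `sha256:` prefix, and the location strings are hypothetical, and it does not model the Handle system, which assigns identifiers through a resolution service.

```python
import hashlib


class Resolver:
    """Toy mapping from location-independent identifiers to replica locations."""

    def __init__(self):
        self._locations = {}  # identifier -> set of replica locations

    def register(self, data: bytes, location: str) -> str:
        # The identifier depends only on the content, never on the location.
        identifier = "sha256:" + hashlib.sha256(data).hexdigest()
        self._locations.setdefault(identifier, set()).add(location)
        return identifier

    def move(self, identifier: str, old: str, new: str) -> None:
        # Only the link is updated; the identifier stays the same.
        locations = self._locations[identifier]
        locations.discard(old)
        locations.add(new)

    def resolve(self, identifier: str) -> list:
        return sorted(self._locations[identifier])


resolver = Resolver()
pid = resolver.register(b"abc", "store-a:/x")
same_pid = resolver.register(b"abc", "store-b:/y")  # second replica, same content
resolver.move(pid, "store-a:/x", "store-c:/z")
print(pid == same_pid, resolver.resolve(pid))
```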
5. Data Identifiers (continued)
• For derived data
– NASA Level 0 – raw data
– NASA Level 1 – Calibrated
– NASA Level 2 – Transformed to physical quantities
– NASA Level 3 – Functional transformations, projections
• Can we identify the process that created the data?
– Generalization of workflow provenance
– Re-execute the workflow to re-create the data
• Create identifier for the workflow
– Need workflow virtualization
• Reproducible science
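One possible reading of "create identifier for the workflow" is sketched below: the workflow is written down as data, its identifier is a hash of that description, and re-executing it on the same raw input re-creates the derived product. Everything here is hypothetical (the step names, the parameters, the `wf:` prefix), and a real system would also need to pin software versions and the execution environment.

```python
import hashlib
import json

# Hypothetical processing steps, loosely following the raw -> calibrated ->
# physical-quantity progression of the NASA levels.
STEPS = {
    "calibrate": lambda xs, p: [x * p["gain"] + p["offset"] for x in xs],
    "to_physical": lambda xs, p: [x * p["scale"] for x in xs],
}


def workflow_id(workflow: list) -> str:
    """Identifier derived from a canonical JSON form of the workflow."""
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return "wf:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run(workflow: list, raw: list) -> list:
    """Re-execute the workflow on raw data to re-create the derived data."""
    data = list(raw)
    for step in workflow:
        data = STEPS[step["op"]](data, step["params"])
    return data


workflow = [
    {"op": "calibrate", "params": {"gain": 2.0, "offset": 1.0}},
    {"op": "to_physical", "params": {"scale": 0.5}},
]
raw = [1.0, 2.0, 3.0]
print(workflow_id(workflow)[:15], run(workflow, raw))
```

Changing any parameter changes the identifier, so a modified workflow is distinguishable from the one that produced the published data.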
6. Curation Service Models
• Driven by user requirements
– Unique services for each science and engineering domain
– Different data formats, data analyses, semantics
• Can generic software support each unique collection?
– View curation as a continuum with varying policies and procedures for each stage of the data life cycle
– Characterize domains by access methods, policies, and procedures
• Are there standard best practices for a data center?
– Data colocation – minimize administrative costs
– Evolution of center to broaden range of supported communities
7. Standard Services
• Data discovery
• Data access
• Data manipulation
– Re-creation of derived data products
– Transformation
– Feature detection
– Indexing
– Representation – fit polynomial in space and time
• Manipulate data based on polynomial
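The representation service above can be illustrated with a least-squares polynomial fit. The sketch below is a simplification under stated assumptions: it fits in one dimension only (time), uses a pure-Python solution of the normal equations instead of a numerical library, and the sample values are made up. Once the polynomial is available, manipulations such as interpolation or differentiation act on the coefficients instead of the raw samples.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coeffs where coeffs[k] multiplies x**k."""
    n = degree + 1
    # Augmented normal equations [A^T A | A^T y] for the Vandermonde matrix A.
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - s) / m[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def derivative(coeffs):
    """Coefficients of the derivative polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]


# Made-up samples of 1 + 2t + 3t^2 at five time steps.
times = [0, 1, 2, 3, 4]
values = [1, 6, 17, 34, 57]
coeffs = polyfit(times, values, 2)
print(coeffs, evaluate(coeffs, 2.5), evaluate(derivative(coeffs), 2.0))
```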
8. Sustainability
• Business models
– Identification of a sustaining community
– Quantification of benefit
• Cost model
– Distribution of cost across entire community
– Membership fee
– Pro-rated per item cost
• Minimizing cost
– Automate curation
– Transfer curation tasks to submitter
– Example: FITS file (astronomy), which carries its metadata in the file
• Metadata for project/observatory
• Metadata for each image
9. Creating a Repository
• Identify a support community
– Tie to requirements of researchers
– Tie to new science and research initiatives
– Tie to intellectual capital of the university
• Identify cost benefit
– Co-location of services
– Benefit of scale
• Demonstrate responsiveness
– Support for users
10. Educating the Next Generation
• Identify a motivating challenge
• Curriculum development
– Coupling of research to education
– Competency in scientific data management and technology
• Data intensive science
– Interest driven by a domain
– Multi-disciplinary problems
– Treat as a skill
• Work with live data
– Enable students to make a discovery
11. Data – Information – Knowledge (iRODS)
• Data – instantiation of an approximation to reality
– Form of representation of reality
– Requires description of the physical approximation (context)
• Information – application of label to data
– Requires identification of the relationships that must be satisfied for the label to be applied
– Reification of knowledge (extraction of features)
• Knowledge – relationships between labels
– Requires procedures to parse data to see if relationships are present
• Data science – transformation of data into knowledge
– Use case driven
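The three definitions above can be made concrete with a small sketch. It is illustrative only: the weather readings, the label names, and the thresholds are all hypothetical, and "knowledge" is reduced to a single co-occurrence relationship between two labels.

```python
# Data: values that approximate reality, stored with their context.
readings = [
    {"station": "A", "temp_c": 31.0, "humidity": 0.80},
    {"station": "B", "temp_c": 12.0, "humidity": 0.30},
    {"station": "C", "temp_c": 33.5, "humidity": 0.75},
]

# Information: a label is applied when its defining relationship is satisfied.
LABEL_RULES = {
    "hot": lambda r: r["temp_c"] >= 30.0,      # hypothetical threshold
    "humid": lambda r: r["humidity"] >= 0.70,  # hypothetical threshold
}


def label(record: dict) -> set:
    """Parse one record and return every label whose rule it satisfies."""
    return {name for name, rule in LABEL_RULES.items() if rule(record)}


# Knowledge: a relationship between labels, tested by a procedure over the data.
def co_occurs(a: str, b: str, records: list) -> bool:
    """True if label a appears at least once and every record labeled a is also labeled b."""
    with_a = [r for r in records if a in label(r)]
    return bool(with_a) and all(b in label(r) for r in with_a)


print([sorted(label(r)) for r in readings])
print(co_occurs("hot", "humid", readings))
```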
12. Digital Library Evolution
• Witnessing rapid evolution of digital libraries
– Item level indexing
– Item level searching
– Data manipulation services
• Driven by scale
– Completeness of semantics
• Represent every word in the English language (15 million)
• Represent cultural knowledge (~ 1 Tbyte)
– Types of reified relationships
• Index based on more than 100 relationships present within documents (IBM Watson)
• Spatial, temporal, organizational, familial, …
– Ability to couple indexing to data within storage
13. Vision
• Dynamic digital library
– Continually extract features from data
– Generate index based on features within the data
• Create knowledge base
– Link local index to community index
• Support evolution of the library
– Define new relationships
– Analyze contents
– Generate new index
14. Implications
• Characterize scientific data by the workflow that creates the published version
– Transform from a library of data files into a library of workflows
• Support re-execution of workflows
– Modify input parameters, generate new version
• Generate discovery semantics (features) through reification of relationships
– Must be able to parse each file
– Create algorithm that tests for the desired relationship
– Apply algorithms within storage systems
– Build terabyte index of reified relationships for each storage system
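The parse, test, and index steps above might look roughly like the sketch below. It is a toy under clear assumptions: files are short text strings held in memory, the two regular-expression detectors are hypothetical stand-ins for real domain-specific algorithms, and the resulting index is tiny where the slide envisions terabytes.

```python
import re
from collections import defaultdict

# Hypothetical detectors: each tests one kind of relationship in a parsed file.
DETECTORS = {
    "temporal": re.compile(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b"),  # ISO-style date
    "spatial": re.compile(r"\blat=-?\d+(?:\.\d+)?\b"),          # latitude value
}


def build_index(files: dict) -> dict:
    """Map each relationship type to the set of files in which it was detected."""
    index = defaultdict(set)
    for name, text in files.items():
        for relationship, pattern in DETECTORS.items():
            if pattern.search(text):
                index[relationship].add(name)
    return index


# Made-up file contents standing in for data held in a storage system.
files = {
    "obs1.txt": "observed 2012-03-22 at lat=35.9 lon=-79.0",
    "notes.txt": "meeting notes, no coordinates",
    "obs2.txt": "sample collected 2011-11-05",
}
index = build_index(files)
print({key: sorted(names) for key, names in index.items()})
```

Defining a new relationship then amounts to adding a detector and regenerating the index.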
15. Virtualization
• Digital library represents data as searchable metadata
• Collection virtualization defines and manages the properties of the collection
– Assertions about each file in the collection
– Location-independent naming and access
– Management of state information
• Workflow virtualization defines the properties of procedures
– Provenance information for each procedure
– Location-independent naming and execution
– Management of state information
16. Digital Library in 2050
• Links contents to cultural knowledge
– Terabyte indices
• Enables analysis of library contents
– Feature detection services
• Provides workspace in which research is conducted
– Coupling of processing to data storage
• Validates assertions about collection properties
– Published policies
• Scalable infrastructure