RDA Work Groups Outputs and Adoption - Early WG Report back session

RDA and Adoption
Early WG Report back session
September 23, 2014

Happy Birthday!
http://cdn.cakecentral.com/d/d3/900x900px-LL-d3548099_gallery6680631282672149.jpeg

What did we learn?
§ Motivated groups of people can do a lot
§ But we are relying too much on volunteer labour
contributed on top of over-full lives
§ Looks like the RDA-challenge goal of 12-18 months is
achievable
§ But IGs also provide valuable space for longer-term
interaction
§ We need to reduce friction in our processes
§ But the organisation is maturing rapidly

RDA and Outputs
§ RDA will only deliver on its promise if it produces
deliverables, and those deliverables become adopted
outside the groups that created them
§ Consequential TAB foci:
§ proposals for new groups – adoption plans?
§ tracking groups underway – fit for purpose?
§ monitoring of adoption once groups conclude – actually
adopted?
§ So, how can we most usefully think through the process
of adoption?

Diffusion and RDA
§ Adoption can be seen as the end result of a diffusion
process. This diffusion process involves
§ awareness
§ interest
§ evaluation
§ trial
§ adoption
§ RDA has a role to play in
§ supporting each stage
§ making the transitions from one stage to the next more likely

Important questions
1. How do we talk about data?
2. How can we describe the data?
3. Can we optimize addressing the data?
4. How can we get trust in our infrastructure?

What
§ Base infrastructure
§ (Coincidence, also social groups!)
§ Let's agree on terms. (DFT)
§ Descriptions for Interoperability. (DTR)
§ Scaling across PID systems. (PIT)
§ Building policies into the infrastructure. (PP)

These groups
§ Amplify each other
§ Use each other's outputs
§ Have to interlock properly
§ Will continue the effort after they finish.

Data Foundation and Terminology
Chairs: Gary Berg-Cross, Raphael Ritz, Peter Wittenburg

Task
Bob Kahn:
You need to know where you are talking about.
DFT mission: understand what the core of the data domain
is, develop definitions of core terms based on data models.
DFT is part of coming to an agreed culture in RDA.
Scope: only speak about the domain of registered data,
§ knowing that there is a lot of non-registered data
§ knowing that some disciplines are further away from
what we are discussing as a necessity

DFT WG Activities & Accomplishments
§ Drafted 4 related Model Documents on core
work:
1. Data Models 1: Overview – 22+ models
2. Data Models 2: Analysis & Synthesis
3. Data Models 3: Term Snapshot
4. Data Models 4: Use Cases
(Work with other RDA WGs on use cases to illustrate data concepts)
§ Developed Semantic Media Wiki Term Tool to
capture initial list of terms and definitions for
discussions, demo held at P3
(open for others and “persistent”)
[Figure: Candidate List evolved to Consolidated List]

Our Core Terms in Simple Words :-)
§ digital object (DO)
§ persistent identifier
§ PID resolution system
§ metadata
§ aggregation
§ digital collection
§ (digital) repository
§ bitstream
§ state information
Need to put the relations between terms into the documents.
On purpose no formal ontology (yet) and no terminologist's exactness,
since we made definitions for data practitioners first.

Definitions & Process
§ A digital collection is an aggregation of DOs that is identified by
a PID and described by metadata.
§ Note: A digital collection is a (complex) DO.
§ Variants
§ A collection is a form of aggregation of elements that has an identity of its own separate from the
identity of the elements.
§ Collection is defined as a “group of objects gathered together for some intellectual, artistic, or curatorial
purpose.”
§ A digital collection is a type of aggregation formed by a collection process on existing data and data
sets where the collected data is in digital form.
§ Collection is a type of aggregation obeying part-role relations and is a digital object since it has a
PID to be referable and metadata describing its properties.
§ A Digital Collection is an organized aggregation or other grouping of distinct DOs that are related by
some criteria and where the collection is described by metadata. A Digital Collection may also be
identified by a unique persistent identifier, in which case the collection may be construed as a DO.
(Kahn et al.)
§ Conclusion points
§ purpose and process of aggregation/collection building and part relations not
relevant for definition
§ remember: only speak about domain of registered DOs.
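The working definition above (a digital collection is an aggregation of DOs that has its own PID and metadata, and is therefore itself a DO) can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, the field names, and the example PIDs are hypothetical and are not taken from the DFT documents.

```python
from dataclasses import dataclass


@dataclass
class DigitalObject:
    """A registered DO: identified by a PID and described by metadata (assumed shape)."""
    pid: str
    metadata: dict
    bitstream: bytes = b""


@dataclass
class DigitalCollection(DigitalObject):
    """An aggregation of DOs with its own PID and metadata, hence also a DO."""
    member_pids: tuple = ()


def is_registered(obj):
    """Only objects with both a PID and metadata fall inside the DFT scope."""
    return isinstance(obj, DigitalObject) and bool(obj.pid) and bool(obj.metadata)


# Hypothetical PIDs, for illustration only.
scan = DigitalObject(pid="hdl:0000/scan-1", metadata={"title": "Scan 1"})
series = DigitalCollection(
    pid="hdl:0000/series-1",
    metadata={"title": "Scan series"},
    member_pids=(scan.pid,),
)
print(isinstance(series, DigitalObject), is_registered(series))
```

The collection's identity (its own PID) is separate from the identities of its members, and nothing in the model records why or how the collection was built, in line with the conclusion points.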

Interactions with others
• Interacted with RDA WGs and IGs.
• Participated in Munich meeting and Chairs telcos.
• Part of WG forum discussions
• also “active” interactions with about 120 groups

RDA/EU & EUDAT          Interviews  Interactions  Total
Humanities & Soc Sci             8            13     21
Environmental                    7             2      9
Life Sciences                   10             7     17
Natural Sciences                11            13     24
Engineering & CS                 -            14     14
Various disciplines              -            24     24
others                           4             3      7
Total                           40            74    114

Adoption
• What does adoption mean in the case of a set of terms?
• it’s about the interaction process itself within and
outside of RDA
• it’s about influencing conceptualization and thus
harmonizing “language”
• it’s about changing cultures
• we have done a lot – many departments & communities
• why so relevant:
• report from 120 interactions tells us that data practices
are a nightmare (report is available)
• data organizations are so different that data federation
including “logical information” is too expensive
• current data science is not reproducible

Objectives until/for P4
1. Go out and intensify interaction based on Snapshot
§ create condensed statements for different groups (2-page flyer)
§ interact with other groups in RDA and early adopters
§ interact with the many communities (outside RDA) we already contacted
(in Europe ESFRI RI projects: 17th October, Brussels)
§ encourage people to use the term wiki
2. Come to new consolidated agreements
§ consolidated definitions until P5
§ present the consolidated definitions and tend core term set
§ identify some people from communities that have adoption talks (no PR!)
3. Finish some unsolved issues
§ synthesis: a generic model flexible enough to capture terms and their
relationships
§ add more use cases
§ see how to continue maintenance

Data Type Registries WG
Outcomes

Problem: Implicit Assumptions in Data
§ Data sharing requires that data can be parsed,
understood, and reused by people and applications
other than those that created the data
§ How do we do this now?
§ For documents – formats are enough, e.g., PDF, and then the
document explains itself to humans
§ This doesn’t work well with data – numbers are not self-explanatory
§ What does the number 7 mean in cell B27?
§ Data producers may not have explicitly specified certain
details in the data: measurement units, coordinate
systems, variable names, etc.
§ Need a way to precisely characterize those assumptions
such that they can be identified by humans and
machines that were not closely involved in the data's creation

Goals: Explicate and Share Assumptions using
Types and Type Registries
§ Evaluate and identify a few assumptions in data that can
be codified and shared in order to…
§ Produce a functioning Registry system that can easily
be evaluated by organizations before adoption
§ Highly configurable for changing scope of captured and shared
assumptions depending on the domain or organization
§ Supports several Type record dissemination variations
§ Design for allowing federation between multiple Registry
instances
§ The group’s emphasis is not on
§ Identifying every possible assumption and data characteristic
applicable for all domains
§ Technology
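To make the idea of a Type record concrete, here is a minimal sketch of how a registry entry could turn a bare number (such as the 7 in cell B27) into something interpretable. Everything in it is an assumption for illustration: the `TypeRecord` fields, the identifier `example/water-temperature`, and the in-memory `REGISTRY` are hypothetical and do not reflect the prototype's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeRecord:
    """One explicitly stated assumption about a value (illustrative fields only)."""
    type_id: str      # hypothetical identifier, not a real registry entry
    name: str
    unit: str
    description: str


# A toy in-memory registry standing in for a real Type Registry service.
REGISTRY = {
    "example/water-temperature": TypeRecord(
        type_id="example/water-temperature",
        name="water temperature",
        unit="degC",
        description="temperature of a water sample at collection time",
    ),
}


def explain(value, type_id, registry=REGISTRY):
    """Attach the registered unit and name to an otherwise bare value."""
    record = registry[type_id]
    return f"{value} {record.unit} ({record.name})"


print(explain(7, "example/water-temperature"))
```

A dataset would then carry the type identifier next to the value, and any person or program holding that identifier could look up the unit and meaning without contacting the data producer.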

Results
§ Produced a community consensus system – in this case the
consensus was between the group members
§ Input from folks from different backgrounds including
technologists, scientists, policy analysts, etc., is considered
§ Released a functioning prototype that can be adapted (with no s/w
changes) for domain-specific use
§ Not a turnkey solution
§ Adapt - Evaluate – Adopt cycle is expected at each organization
or community
§ Federation between different instances is technically possible
§ Organizational policies were not discussed due to the lack of
time
§ CNRI, a member of the group, has designed and implemented a
prototype, the latest of which is at: http://typeregistry.org
§ With the help of an RDA-provided scholar, we seeded the Registry
with Types that pertain to the geosciences community

Points to Keep in Mind
§ Data Type Registry is neither a turnkey system
nor an immediate ROI application
§ Every organization should nominate a domain
expert for defining the scope of Type records
and for seeding their Registry instance
§ Cross-domain interpretation beyond some basic
computability needs social processes in place
§ Data systems such as Type Registries are low-level
infrastructure systems with wide
applicability
§ Network effect plays a significant role in the success of any
infrastructure

Adoption and Impact
§ We expect multiple groups to put significant
efforts into exercising the prototype:
§ the EUDAT project in Europe,
§ National Institute of Standards and Technology
(NIST) in the US,
§ the International DOI Foundation
§ (Wo Chang, Digital Data Manager at NIST,
shares his evaluation plans)

Conclusion – For Now
§ Adoption plans will continue
§ The group, or some part of it, will continue to
work, we hope with RDA’s blessing and maybe
support. We will have more to say at P5
§ Future-proofing data is hard work, but is
essential for long-term data-driven science

WG PID Information Types
Outcomes

Problem & Goal
§ PIDs are associated with additional information and this
information needs to be typed
§ Harmonization across disciplines and PID providers
§ What are PID Information Types?
§ Specify a framework for defining types
§ Agree on some essential types
§ Provide technical solutions for interaction with PID types
§ Provide the tools first, then create types individually

Results
Insights gained:
§ Types depend on use cases and semantics differ between
disciplines
§ There is no single set of types fitting all cases
§ Community processes must define types from practical adoption
Final deliverables available:
§ Type examples and illustrating use cases
§ Types registered in the Type Registry prototype
§ API description and prototypic implementation
§ Client demonstrator GUI

Registered types enable cross-services
[Figure: records carrying the registered types Format, Checksum, and Size, connected to a verification service]
Adoption & Impact
§ Register your types so they can be adopted and reused,
making it easier for others to use your data
§ Information on how to register new types available in the report
§ Adopt types already being used in your domain to
increase interoperability
§ Decouple object management from contents
§ Simplify client access to data across domains, implementations
and changes in information models
§ More lightweight access to information on less accessible
objects

Possible follow-ups
§ Adoption of these capabilities by PID infrastructure
providers
§ Discipline-specific types, preferably from practical
adoption
§ Establish a type ecosystem
§ Refine data model
§ Enhance REST API

Conclusions
§ Draft final report available via the website
§ Demonstrator web GUI:
http://smw-rda.esc.rzg.mpg.de/PitApiGui/

WG Practical Policies
Scenario
§ Create research data repository
§ Data: 2 TB, 500,000 files + growing
+ integrity
+ access (IG FIM)
+ publish (publication+PID)
+ …
§ Some assertions: policies & rules attached to the data
Policy: assertion or assurance that is enforced about a collection or a dataset
Problem
Computer actionable policies
§ Enforce management,
§ Automate administrative tasks,
§ Validate assessment criteria,
§ Automate scientific analyses
§ etc.
A generic set of policies that can be revised and adapted
by user communities and site managers does not exist.
§ Domain scientists who want to build up a collection or
a repository
§ Data centers for automating policies
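As an illustration of what "computer actionable" means here, the sketch below expresses an integrity policy as a function that a data management system could run periodically over a collection. It is plain Python written for this example, not iRODS or GPFS rule syntax; the `Item` class, the `integrity_policy` function, and the use of SHA-256 are all assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Item:
    """A stored file with the checksum recorded at ingest (hypothetical shape)."""
    name: str
    content: bytes
    recorded_checksum: str


def integrity_policy(collection):
    """Return the names of items whose current checksum no longer matches."""
    return [
        item.name
        for item in collection
        if hashlib.sha256(item.content).hexdigest() != item.recorded_checksum
    ]


good = Item("a.dat", b"abc", hashlib.sha256(b"abc").hexdigest())
corrupted = Item("b.dat", b"abd", hashlib.sha256(b"abc").hexdigest())

print(integrity_policy([good, corrupted]))
```

The assertion ("every item matches its recorded checksum") is stated once and enforced automatically, which is the property that makes such policies shareable and adaptable across sites.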

Goals
§ To bring together practitioners in policy making and
policy implementation (nearly all RDA WG/IGs)
§ To identify typical application scenarios for policies
such as replication, preservation etc.
§ To collect and to register practical policies
§ To enable sharing, revising, adapting, and re-using of
computer actionable policies

Survey of 30 Institutions for Highest Priority Policies

Policy                  Importance
Integrity                      217
Preservation                   150
Access control                 126
Provenance                     108
Data Management plans           99
Publication                     75
Replication                     66
Data staging                    52
Federation                      37
Metadata sharing                23
Regulatory                      16
Collection properties            7
Identifiers                      7
Data sharing                     7
Versioning                       7
Licensing                        6
Format                           6
Data Life Cycle                  6
Arrangement                      5
Processing                       5

In close cooperation with the Engagement Group

[Figure: collection-based policies, with the 11 important policy areas identified]

Results
Identification of 11 important policy areas:
§ Contextual metadata extraction
§ Data access control
§ Data backup
§ Data format control
§ Data retention
§ Disposition
§ Integrity (including replication)
§ Notification
§ Restricted searching
§ Storage cost reports
§ Use agreements

Results
Templates
https://www.rd-alliance.org/filedepot?cid=104&fid=556
§ Interactions of policies and DO attributes
§ Policy descriptions
§ Technology independent
§ Reviews of the provided policy areas in progress

Results
§ Examples for implementations:
§ English language descriptions
§ iRODS
§ GPFS
§ ~50 pages

Impact
Result: List of policy categories and policies
§ Improved data center administration
§ By sharing policies, communities can interoperate and
share data more effectively
§ Transparency: basis of establishing trust
§ Implemented policies: can be used as examples and be
adapted to specific requirements and other data
management systems

Adoption
Target Communities:
§ Groups managing data collections
§ Data centers
First adopters are the institutions/organizations who
contributed to the results, e.g. RENCI, KIT, OSC, DARIAH,
RZG, etc.:
§ EUDAT
§ CESNET
§ (DataNet Federation Consortium, WDS ? )

Conclusions
§ “Outcomes Policy Templates: Practical Policy Working
Group, September 2014”
§ “Implementations: Practical Policy Working Group,
September 2014”
§ Work in Progress: Reviews

Conclusions: Next Steps
§ More interaction with other technical groups
à Data Fabric
à Publication policies
§ More interaction with domain specific groups
à Adopters
For information please contact
§ Reagan Moore rwmoore@renci.org and
§ Rainer Stotzka rainer.stotzka@kit.edu

WG Practical Policies
Breakout Session:
Tuesday September 23, 14:00 – 15:30
Agenda:
1. Introduction
2. Presentation of deliverables
3. David Antos & Petr Benedikt: “Policy implementations
on GPFS”
4. Discussions:
§ Policy reviews
§ Adding new policies
§ Interoperability with other WG/IGs
§ Adoption

P5 and Adoption Day
§ More groups will be presenting at P5
§ Starting to see how different WG outputs can fit together
§ Ex: Data Fabric
§ Planning to have a major focus at P5 on adoption of WG
outputs
§ Also thinking through how best to accelerate adoption
and support groups that want to integrate RDA outputs

How you can help!
§ Get involved in WGs, IGs to ensure outputs meet your
needs and the needs of your organisation
§ Encourage your organisation to become aware of RDA
outputs and evaluate or trial them
§ Look for places where RDA can make a difference
