Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery

Information Sciences Institute
Virtual Organizations 2.0: Social Constructs for
Data-centered Collaborative Discovery
Carl Kesselman
carl@isi.edu
Dean’s Professor, Industrial and Systems Engineering
Director: Informatics Systems Research Division, Information Sciences Institute
University of Southern California

Set the way back machine….
“The real and specific problem that underlies the Grid concept is
coordinated resource sharing and problem solving in dynamic, multi-
institutional virtual organizations.”
The Anatomy of the Grid: Enabling Scalable Virtual Organizations
Ian Foster, Carl Kesselman, Steven Tuecke, 2001
Globus Users Group 1999

Virtual Organization
A virtual organization (VO) is a group of individuals whose members and
resources may be dispersed geographically and institutionally, yet who
function as a coherent unit through the use of cyberinfrastructure (CI). A
VO is typically enabled by, and provides shared and often real-time access
to, centralized or distributed resources, such as community-specific tools,
applications, data, and sensors, and experimental operations. A VO may
be known as or composed of systems known as collaboratories, e-Science
or e-Research, distributed workgroups or virtual teams, virtual
environments, and online communities. Cummings, Jonathon & Finholt, Thomas & Foster, Ian
& Kesselman, Carl & Lawrence, Katherine. (2008).
Beyond Being There: A Blueprint for Advancing the
Design, Development, and Evaluation of Virtual
Organizations. 10.13140/RG.2.1.2694.2167.

VOs: A Pre-Millenial Perspective

How do we create a scientific result
Any time scientists disagree, it's because we have insufficient data. Then
we can agree on what kind of data to get; we get the data; and the data
solves the problem. Either I'm right, or you're right, or we're both wrong.
And we move on. That kind of conflict resolution does not exist in politics
or religion.
Neil deGrasse Tyson
• Science is about communities arguing over data
– How do those communities form
– How do communities argue: knowledge capture and communication

Data driven multidisciplinary collaboration is a
adaptive socio-technical eco-system
• How do manage social and technical subsystems
• How do we optimize across the ecosystem across time and space
Social Subsystem
Concerned with the attributes of people (e.g.
skills, attitudes, values), the relationships
among people, reward systems, and
authority structures
Technical Subsystem
Concerned with the processes, tasks, and
technology needed to transform inputs to
outputs
Joint
Optimization
Militello, e. al.. (2013). Sources of variation in
primary care clinical workflow: Implications for
the design of cognitive support. Health
informatics journal.
Man-computer symbiosis … aims are 1) to let computers facilitate
formulative thinking …., and 2) to enable men and computers to
cooperate in making decisions and controlling complex situations without
inflexible dependence on predetermined programs.
Licklider, 1960

Top down or bottom Up: The TIMn perspective
• Tribes
• Institutions
• Markets
• Networks
Ronfeldt, David, Tribes, Institutions, Markets, Networks: A
Framework About Societal Evolution. Santa Monica, CA:
RAND Corporation, 1996.
Traditional eScience projects look like markets built around
episodic exchange of papers/data (publication)
Transformational science requires networks.

Community of practice
• With diverse viewpoints, role, etc.
• Engaged in joint work
• Over a significant period of time ,
• In which they build things, solve problems, learn , invent, and
negotiate meaning,
• And evolve a way of talking and reading each other!
A group of people….

Networks of practice for transformational science
• Multiple communities working together in
integrated, dynamic, adaptive and open collaboration
groups
Concepts
knowing
that
Stories
“best
practices”
Intuition
skills
Knowing
“how”
Genres
Work
practices
Knowing - -
communities of
practice
groups
Concepts
knowing
that
Stories
“best
practices”
Intuition
skills
Knowing
“how”
Genres
Work
practices
Knowing - -
communities of
practice
individuals groups
Concepts
knowing
that
Stories
“best
practices”
Intuition
skills
Knowing
“how”
Genres
Work
practices
Knowing - -
communities of
practice
ExplicitTacit
VS.

’
Knowledge flow across communities of practice
• Boundary objects and knowledge brokers
Boundary
object
Star, Susan Leigh, and James R. Griesemer. “Institutional Ecology,
`Translations’ and Boundary Objects: Amateurs and Professionals in
Berkeley’s Museum of Vertebrate Zoology, 1907-39.” Social Studies of
Science 19, no. 3
Community
of
Practice
Community
of
Practice
Translator/Knowledge
Broker
Community
of Practice
Community
of Practice
Information used in different ways by different
communities. Boundary objects are plastic, interpreted
differently across communities but with enough
immutable content to maintain integrity.

Data as a boundary object
• Must be findable by network of practice,
• Must be accessible across the network while following norms of the
community
• Must be interpretable by other community (interoperable)
– The bits
– The terms we use to describe
– The relationships between elements
• Must provide value to other community
– Useable in new context
Where have we seen these requirements before…..

FAIR data…
• Findable
– identified by a unique identifier, characterized by rich metadata
• Accessible
– standard protocol with access control, metadata accessible even
when the data is not,
• Interoperable
– by standardized terms to describe it
• Reusable
– Accurate and relevant attributes.

FAIR collaboration
• Data as boundary object --- Create FAIR data at every part of a
scientific investigation
– Enable “long tail” science organized around data
• Requires accurate descriptions of data
–Characteristics of data element
–Relationships between data elements
• Question: How do we create and maintain these metadata and
relationships in a collaborative environment?
Madduri, Ravi, et al. "Reproducible big data
science: a case study in continuous
FAIRness." PloS one 14.4 (2019).

Scientific asset management system
• Data Driven Collaboration
– Think of it as a “photo manager” for scientific data
– Maintain FAIR data from creation through publication
• Captures data and relationships between data
– Catalog to capture relationships between data
– Object store to hold data
• Can rapidly change to follow changes in scientific knowledge
Discovery Environment for Relational Information and Versioned Assets
(DERIVA)

Relational Database
Turn scientists into active participants throughout the
data lifecycle
(Schuler, Czajkowski, Kesselman 2014; Schuler, Czajkowski, Kesselman 2016; Bugacov et al
2017)
acquisition
annotation
organizationsharing
derivation
Data Lifecycle
REPLACE:
• spreadsheets,
• “meaningful” filenames,
• inefficient EAV models,…
WITH:
• well-formed relational
models…
but… how to evolve model?
tossing it over the fence
✔ “self serve” data curaMon
Web
Interfaces

DERIVA promotes FAIR data production
• F: providing rich metadata using an Entity-Relationship model to
express relationships between diverse data elements;
• A: offering rich access control and access to metadata via standard
HTTP web service interfaces;
• I: integrating with standardized terms defined by collaborators,
consortium or communities; and
• R: supporting dynamic model evolution so that the data presented
accurately represents the current structure and state of knowledge
within an investigation.

DERIVA Ecosystem

Collaborative Metadata Management
• ERMRest: A web service for managing metadata and
relationships between data assets
– Everything is a resource with a URI and interface
• Entity/Relationship modeling

ERMRest and Globus
• Globus Auth for all authenticated user access
• Extensive use of Globus Groups for fine grain access control
– Use groups as roles
– Write policy against

Data API
• All operations mapped into URL:
ermrest/catalog/1@Y-1Z4B/entity/foo/bar/x::gt::7
–Queries can be viewed as naming data
–Tables exchanges as CSV or JSON
• Operate on whole or partial entity, projections, aggregates
• Joins, filter-based row selection, column projection,
configurable sort order, and pagination.
• Insertion, update, or deletion of data in one table at a time.

ERMRest Fine grain access control
• ACL = (resource, permission, roles)
–schema, table, column, or reference
–access permission (e.g., insert, update, etc.)
–a set of user or group identity
• ACL is associated with a resource in the hierarchy:
–catalog, schema, table, column, and constraint
• ACLs are inherited simplifying management
• Static and dynamic policy specifications
• Enables context dependent policy, e.g. curation pipelines

Snapshots and history
• Logical snapshots at every catalog mutation
– Prior values are stored in the catalog
– Snapshot ID created for every snapshot
• Catalog URI can include snapshot ID
–ermrest/catalog/1@Y-1Z4B/entity/foo/bar/x
• History Management
– Range Discovery, Truncation, Policy Amendment, Data Redaction

Persistent identifiers
• Every entity ERMRest has unique ID (RID)
– Assigned by catalog, can be used as foreign-key
• Every version of entire catalog has a unique ID.
• Query with snapshot ID uniquely names a entity set
– ERMRest URI with snapshot ID
• RID and Snapshot ID uniquely name an specific version of entity
• Support for identifiers from community controlled vocabulary
– E.g. Uberon, Schema.org, RRIDs

The “20 questions” approach, gets it started, but we
need more agile methods throughout formative
phases
• Drive database and system
design by ~20 key queries
However…
“Database technology has had
limited uptake [in science and data
science], in part due to the
overhead in designing a schema…
Changing data and changing
requirements make it difficult to
amortize these upfront costs...”
(Gray, Szalay et al
2009)
(Jain et al 2016)
http://jimgray.azurewebsites.net/talks/SciData.pp
t

Collaboration is an evolutionary process
• Data interoperability, e.g. structure and meaning is time-dependent
• Any evaluation of FAIRness is ill-posed unless we specify community
and point in time!
LFNG: Lunatic Fringe
LFNG: O-fucosylpeptide
3-beta-N-
acetylglucosaminyltransferase
Ribes and Bowker. "A learning
trajectory for ontology building.", 2005
Community
Domain Representatives
Targeted Future
Users

Suboptimal schema leads to suboptimal data:
e.g., “messy” data!
Unconstrained Domain
This
happened

• “Everyone starts with the same schema: <stuff/>. Then they refine it.”
– J. Widom

Numerous steps to evolve even “simple” database
ID A B Litte
r
Anatomy Age
73 abc def YC1 Mandible E10.5
74 mn
o
pq
r
YC1 Mandible E10.5
75 stu vx
y
YC1 Maxila E8.5
ID A B FK
73 abc def 1
74 mno pqr 1
75 stu vxy 2
ID Litte
r
Anatomy Age
1 YC1 Mandible E10.5
2 YC1 Maxila E8.5
ID A B FK
73 abc def 1
74 mno pqr 1
75 stu vxy 2
ID Litte
r
Anatomy Age
1 YC1 Mandible E10.5
2 YC1 Maxila E8.5
Anatomy
Mandible
Maxila
Measures
Specimens
Measures + Specimens
Vocabulary
Age
E10.5
E8.5
Measures
Specimens
1
3
4
2 Cleaning (not pictured)

Our approach: An open, extensible Data Evolution
Language
• Primitive SMO: indivisible operators,
extended definitions of more common
relational operators
• Composite SMO: defined as a functional
composition over other (primitive or
composite) SMOs
• Algebraic expressions can be decomposed
and rewritten into semantically equivalent,
more efficient expressions.
• Embed in general-purpose language with
user programs translated into the algebra.
Conceptual overview of the approach
Schuler, Robert E., and Carl Kesselman. "Towards an efficient and
effective framework for the evolution of scientific databases." Proceedings
of the 30th International Conference on Scientific and Statistical Database
Management. 2018
Define set of Schema Modification
Operators (SMO) and associated algebra

Problem scenario: free text “tags” of gene names…
ID A B List_of_Genes
73 abc def Msx1, Foxc2, Sox9
74 mno pqr Tgfb3, Sox9, Bmp4, …
… … … …
Assays:
ID Term Synony
ms
1 Msx1 …
2 Tgfb3 …
3 Foxc2 …
4 Sox9 …
5 Bmp4 …
… … …
Genes:
Table of
canonical
terms
Table of genetic
assays
Objective: to extract the
free text “List_of_Genes”
in a normalized and
semantically aligned
association with the
canonical term set.
Before
PROBLEM
Free text listing of
gene names
related to the
assay records

Problem scenario: free text “tags” of gene names…
ID A B List_of_Genes
73 abc def Msx1, Foxc2, Sox9
74 mno pqr Tgfb3, Sox9, Bmp4, …
… … … …
Assays:
ID Term Synony
ms
1 Msx1 …
2 Tgfb3 …
3 Foxc2 …
4 Sox9 …
5 Bmp4 …
… … …
Genes:
Table of
canonical
terms
Assay Gene
73 Msx1
73 Foxc2
73 Sox9
74 Tgfb3
74 Sox9
74 Bmp4
… …
Assay_Genes:
Table of
semantically
aligned assay
to gene name
association
Table of genetic
assays
Objective: to extract the
free text “List_of_Genes”
in a normalized and
semantically aligned
association with the
canonical term set.
After

Tagify: normalize and semantically align free form
tags…
d Tagifya
r
↦ d Aligna (Atomizea r)
↦ d Aligna (μ (ReifySub
a r))
↦ d Aligna (μ (𝜋key(r),a r))
↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d)
(Tagify
definition)
(Atomize definition)
(ReifySub definition)
(Align definition)
Input:
d: Genes
r: Assays
Parameters:
a: List_of_Genes
(others omitted for brevity)
Result:
Computes association table
Iteratively apply composite SMO
definitions until we have an
expression of only primitive SMOs

Tagify: normalize and semantically align free form
tags…
d Tagifya
r
↦ d Aligna (Atomizea r)
↦ d Aligna (μ (ReifySub
a r))
↦ d Aligna (μ (𝜋key(r),a r))
↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d)
(Tagify
definition)
(Atomize definition)
(ReifySub definition)
(Align definition)
Input:
d: Genes
r: Assays
Parameters:
a: List_of_Genes
(others omitted for brevity)
Result:
Computes association table

Big data bags
• A packaging format for encapsulating
– Payload: arbitrary content
– Tags: metadata describing the payload
– Checksums: supports verification of content
Bio_data_bag/
|-- data
| -- genomic
| -- 2a673.fastq
| -- 2a673.fastq
| -- manifest-md5.txt
| afbfa23123bfa data/genomic/2a673.fasta
| -- bagit.txt
Contact-Name: John Smith
Chard, Kyle, et al. "I'll take that to go:
Big data bags and minimal identifiers
for exchange of large, complex
datasets." 2016 IEEE International
Conference on Big Data (Big Data). IEEE,
2016

Rich metadata for Bags
{
"@context": {
"@vocab": "http://purl.org/dc/terms/",
"dcmi":
"http://purl.org/dc/dcmitype/Dataset"
},
"@id": "../../data/numbers.csv",
"@type": "dcmi:Dataset",
"title": "CSV files of beverage consumption",
”description": "A CSV file listing the number
of cups consumed per person.”
}
38
Research ObjectsOpen Knowledge Foundation
Table Schema
{
"name": "Method",
"schema": {
"fields": [
{
"name": "id",
"title": "A globally unique ID",
"type": "string",
"constraints": { "required": true, "unique": true}
},
],
"missingValues": [ "" ], "primaryKey": "id"
}
},

Making Bags big
• Manifest lists all content and checksums
• Available content contained in “data” directory
• Content may be “missing”
– Missing content must be listed in “fetch.txt”
– Fetch entries list local name in data directory, and URL of where to fetch data
• Have created tool for creating, validating and materializing bags
• Globus Auth, and Globus Transfer are supported

CFDE Metadata Loading Flow
CFDE Repo
DCC
Extract CFDE
Metadata
Pull referenced
BDBag
Create DERIVA
Catalog
Load asset
information
save to shared
metadata
repository
Automate
Common Fund Data Environment
Scalable approach to integrating data coordinating center asset information
DCC
DCC
Transform DCC
metadata into
C2M2 records
DCC
administrative
interface
dashboard
repeatable &
extensible
process
CFDE
metadata
catalog
reusable
data asset
information
queries

C2M2 Model
files
datasets
subjects
orgs
protocols
events
samples
We have defined an entity/relationship model
that captures the high-level asset descriptions
in a form that we can map different data into.
relationships
“You can only manage what
you can measure”
• Drives metadata choices
• Focused on collecting
information to answer
specific questions
• E.g.,
• How much data?
• What type?
• Where is it stored?

Chaise – An adaptive model-driven
Web user interface
• How little can we assume?
– discovery, analysis, visualization, editing, sharing and collaboration over tabular
data (ERMRest).
• Makes almost no assumptions about data model
– Introspect the data model from ERMrest.
– Use heuristics, for instance, how to flatten a hierarchical structure into a
simplified presentation for searching and viewing.
– Schema annotations are used to modify or override its rendering heuristics, for
instance, to hide a column of a table or to use a specific display name.
– Apply user preferences to override, for instance, to present a nested table of
data in a transposed layout.

Management of all scientific assets

Dashboards to track progress

Dynamic data collections with global IDs

Model driven data entry…

With controlled vocabulary.

Acknowledgments
• Rob Schuler, Karl Czajkowski, Hongsuda Tangmunarunkit, Mike D’Arcy
• Rick Wagner, Kyle Chard, Ravi Maduri, Ian Foster
• Many, many conversations with John Seely Brown (jsb)
– See: Design Unbound: Designing for Emergence in a White Water World, Brown
and Pendleton-Jullian, MIT Press

Summary
• Think of eScience infrastructure as part of socio-technical system
– Design for the interface across communities of practice
• Validated this approach with Deriva across many domains and scales
– Craniofacial dysmorphia, protein structure database, molecular atlas, kidney
reconstruction, dynamic synapse mapping, optimization models, pancreatic beta
cell modeling, developmental biology, …
• For more information: www.derivacloud.org

Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery

Similar to Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery (20)

More from Globus

More from Globus (20)

Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery