SlideShare a Scribd company logo
1 of 56
Download to read offline
Information Sciences Institute
Virtual Organizations 2.0: Social Constructs for
Data-centered Collaborative Discovery
Carl Kesselman
carl@isi.edu
Dean’s Professor, Industrial and Systems Engineering
Director: Informatics Systems Research Division, Information Sciences Institute
University of Southern California
Information Sciences Institute
Set the way back machine….
“The real and specific problem that underlies the Grid concept is
coordinated resource sharing and problem solving in dynamic, multi-
institutional virtual organizations.”
The Anatomy of the Grid: Enabling Scalable Virtual Organizations
Ian Foster, Carl Kesselman, Steven Tuecke, 2001
Globus Users Group 1999
Information Sciences Institute
Virtual Organization
A virtual organization (VO) is a group of individuals whose members and
resources may be dispersed geographically and institutionally, yet who
function as a coherent unit through the use of cyberinfrastructure (CI). A
VO is typically enabled by, and provides shared and often real-time access
to, centralized or distributed resources, such as community-specific tools,
applications, data, and sensors, and experimental operations. A VO may
be known as or composed of systems known as collaboratories, e-Science
or e-Research, distributed workgroups or virtual teams, virtual
environments, and online communities. Cummings, Jonathon & Finholt, Thomas & Foster, Ian
& Kesselman, Carl & Lawrence, Katherine. (2008).
Beyond Being There: A Blueprint for Advancing the
Design, Development, and Evaluation of Virtual
Organizations. 10.13140/RG.2.1.2694.2167.
Information Sciences Institute
VOs: A Pre-Millenial Perspective
Information Sciences Institute
How do we create a scientific result
Any time scientists disagree, it's because we have insufficient data. Then
we can agree on what kind of data to get; we get the data; and the data
solves the problem. Either I'm right, or you're right, or we're both wrong.
And we move on. That kind of conflict resolution does not exist in politics
or religion.
Neil deGrasse Tyson
• Science is about communities arguing over data
– How do those communities form
– How do communities argue: knowledge capture and communication
Information Sciences Institute
Data driven multidisciplinary collaboration is a
adaptive socio-technical eco-system
• How do manage social and technical subsystems
• How do we optimize across the ecosystem across time and space
Social Subsystem
Concerned with the attributes of people (e.g.
skills, attitudes, values), the relationships
among people, reward systems, and
authority structures
Technical Subsystem
Concerned with the processes, tasks, and
technology needed to transform inputs to
outputs
Joint
Optimization
Militello, e. al.. (2013). Sources of variation in
primary care clinical workflow: Implications for
the design of cognitive support. Health
informatics journal.
Man-computer symbiosis … aims are 1) to let computers facilitate
formulative thinking …., and 2) to enable men and computers to
cooperate in making decisions and controlling complex situations without
inflexible dependence on predetermined programs.
Licklider, 1960
Information Sciences Institute
Top down or bottom Up: The TIMn perspective
• Tribes
• Institutions
• Markets
• Networks
Ronfeldt, David, Tribes, Institutions, Markets, Networks: A
Framework About Societal Evolution. Santa Monica, CA:
RAND Corporation, 1996.
Traditional eScience projects look like markets built around
episodic exchange of papers/data (publication)
Transformational science requires networks.
Information Sciences Institute
Community of practice
• With diverse viewpoints, role, etc.
• Engaged in joint work
• Over a significant period of time ,
• In which they build things, solve problems, learn , invent, and
negotiate meaning,
• And evolve a way of talking and reading each other!
A group of people….
Information Sciences Institute
Networks of practice for transformational science
• Multiple communities working together in
integrated, dynamic, adaptive and open collaboration
groups
Concepts
knowing
that
Stories
“best
practices”
Intuition
skills
Knowing
“how”
Genres
Work
practices
Knowing - -
communities of
practice
groups
Concepts
knowing
that
Stories
“best
practices”
Intuition
skills
Knowing
“how”
Genres
Work
practices
Knowing - -
communities of
practice
individuals groups
Concepts
knowing
that
Stories
“best
practices”
Intuition
skills
Knowing
“how”
Genres
Work
practices
Knowing - -
communities of
practice
ExplicitTacit
VS.
Information Sciences Institute
’
Knowledge flow across communities of practice
• Boundary objects and knowledge brokers
Boundary
object
Star, Susan Leigh, and James R. Griesemer. “Institutional Ecology,
`Translations’ and Boundary Objects: Amateurs and Professionals in
Berkeley’s Museum of Vertebrate Zoology, 1907-39.” Social Studies of
Science 19, no. 3
Community
of
Practice
Community
of
Practice
Translator/Knowledge
Broker
Community
of Practice
Community
of Practice
Information used in different ways by different
communities. Boundary objects are plastic, interpreted
differently across communities but with enough
immutable content to maintain integrity.
Information Sciences Institute
Data as a boundary object
• Must be findable by network of practice,
• Must be accessible across the network while following norms of the
community
• Must be interpretable by other community (interoperable)
– The bits
– The terms we use to describe
– The relationships between elements
• Must provide value to other community
– Useable in new context
Where have we seen these requirements before…..
Information Sciences Institute
FAIR data…
• Findable
– identified by a unique identifier, characterized by rich metadata
• Accessible
– standard protocol with access control, metadata accessible even
when the data is not,
• Interoperable
– by standardized terms to describe it
• Reusable
– Accurate and relevant attributes.
Information Sciences Institute
FAIR collaboration
• Data as boundary object --- Create FAIR data at every part of a
scientific investigation
– Enable “long tail” science organized around data
• Requires accurate descriptions of data
–Characteristics of data element
–Relationships between data elements
• Question: How do we create and maintain these metadata and
relationships in a collaborative environment?
Madduri, Ravi, et al. "Reproducible big data
science: a case study in continuous
FAIRness." PloS one 14.4 (2019).
Information Sciences Institute
Scientific asset management system
• Data Driven Collaboration
– Think of it as a “photo manager” for scientific data
– Maintain FAIR data from creation through publication
• Captures data and relationships between data
– Catalog to capture relationships between data
– Object store to hold data
• Can rapidly change to follow changes in scientific knowledge
Discovery Environment for Relational Information and Versioned Assets
(DERIVA)
Information Sciences Institute
Relational Database
Turn scientists into active participants throughout the
data lifecycle
(Schuler, Czajkowski, Kesselman 2014; Schuler, Czajkowski, Kesselman 2016; Bugacov et al
2017)
acquisition
annotation
organizationsharing
derivation
Data Lifecycle
REPLACE:
• spreadsheets,
• “meaningful” filenames,
• inefficient EAV models,…
WITH:
• well-formed relational
models…
but… how to evolve model?
tossing it over the fence
✔ “self serve” data curaMon
Web
Interfaces
Information Sciences Institute
DERIVA promotes FAIR data production
• F: providing rich metadata using an Entity-Relationship model to
express relationships between diverse data elements;
• A: offering rich access control and access to metadata via standard
HTTP web service interfaces;
• I: integrating with standardized terms defined by collaborators,
consortium or communities; and
• R: supporting dynamic model evolution so that the data presented
accurately represents the current structure and state of knowledge
within an investigation.
Information Sciences Institute
DERIVA Ecosystem
Information Sciences Institute
Collaborative Metadata Management
• ERMRest: A web service for managing metadata and
relationships between data assets
– Everything is a resource with a URI and interface
• Entity/Relationship modeling
Information Sciences Institute
ERMRest and Globus
• Globus Auth for all authenticated user access
• Extensive use of Globus Groups for fine grain access control
– Use groups as roles
– Write policy against
Information Sciences Institute
Data API
• All operations mapped into URL:
ermrest/catalog/1@Y-1Z4B/entity/foo/bar/x::gt::7
–Queries can be viewed as naming data
–Tables exchanges as CSV or JSON
• Operate on whole or partial entity, projections, aggregates
• Joins, filter-based row selection, column projection,
configurable sort order, and pagination.
• Insertion, update, or deletion of data in one table at a time.
Information Sciences Institute
ERMRest Fine grain access control
• ACL = (resource, permission, roles)
–schema, table, column, or reference
–access permission (e.g., insert, update, etc.)
–a set of user or group identity
• ACL is associated with a resource in the hierarchy:
–catalog, schema, table, column, and constraint
• ACLs are inherited simplifying management
• Static and dynamic policy specifications
• Enables context dependent policy, e.g. curation pipelines
Information Sciences Institute
Snapshots and history
• Logical snapshots at every catalog mutation
– Prior values are stored in the catalog
– Snapshot ID created for every snapshot
• Catalog URI can include snapshot ID
–ermrest/catalog/1@Y-1Z4B/entity/foo/bar/x
• History Management
– Range Discovery, Truncation, Policy Amendment, Data Redaction
Information Sciences Institute
Persistent identifiers
• Every entity ERMRest has unique ID (RID)
– Assigned by catalog, can be used as foreign-key
• Every version of entire catalog has a unique ID.
• Query with snapshot ID uniquely names a entity set
– ERMRest URI with snapshot ID
• RID and Snapshot ID uniquely name an specific version of entity
• Support for identifiers from community controlled vocabulary
– E.g. Uberon, Schema.org, RRIDs
Information Sciences Institute
The “20 questions” approach, gets it started, but we
need more agile methods throughout formative
phases
• Drive database and system
design by ~20 key queries
However…
“Database technology has had
limited uptake [in science and data
science], in part due to the
overhead in designing a schema…
Changing data and changing
requirements make it difficult to
amortize these upfront costs...”
(Gray, Szalay et al
2009)
(Jain et al 2016)
http://jimgray.azurewebsites.net/talks/SciData.pp
t
Information Sciences Institute
Collaboration is an evolutionary process
• Data interoperability, e.g. structure and meaning is time-dependent
• Any evaluation of FAIRness is ill-posed unless we specify community
and point in time!
LFNG: Lunatic Fringe
LFNG: O-fucosylpeptide
3-beta-N-
acetylglucosaminyltransferase
Ribes and Bowker. "A learning
trajectory for ontology building.", 2005
Community
Domain Representatives
Targeted Future
Users
Information Sciences Institute
Suboptimal schema leads to suboptimal data:
e.g., “messy” data!
Unconstrained Domain
This
happened
Information Sciences Institute
• “Everyone starts with the same schema: <stuff/>. Then they refine it.”
– J. Widom
Information Sciences Institute
Numerous steps to evolve even “simple” database
ID A B Litte
r
Anatomy Age
73 abc def YC1 Mandible E10.5
74 mn
o
pq
r
YC1 Mandible E10.5
75 stu vx
y
YC1 Maxila E8.5
ID A B FK
73 abc def 1
74 mno pqr 1
75 stu vxy 2
ID Litte
r
Anatomy Age
1 YC1 Mandible E10.5
2 YC1 Maxila E8.5
ID A B FK
73 abc def 1
74 mno pqr 1
75 stu vxy 2
ID Litte
r
Anatomy Age
1 YC1 Mandible E10.5
2 YC1 Maxila E8.5
Anatomy
Mandible
Maxila
Measures
Specimens
Measures + Specimens
Vocabulary
Age
E10.5
E8.5
Measures
Specimens
1
3
4
2 Cleaning (not pictured)
Information Sciences Institute
Our approach: An open, extensible Data Evolution
Language
• Primitive SMO: indivisible operators,
extended definitions of more common
relational operators
• Composite SMO: defined as a functional
composition over other (primitive or
composite) SMOs
• Algebraic expressions can be decomposed
and rewritten into semantically equivalent,
more efficient expressions.
• Embed in general-purpose language with
user programs translated into the algebra.
Conceptual overview of the approach
Schuler, Robert E., and Carl Kesselman. "Towards an efficient and
effective framework for the evolution of scientific databases." Proceedings
of the 30th International Conference on Scientific and Statistical Database
Management. 2018
Define set of Schema Modification
Operators (SMO) and associated algebra
Information Sciences Institute
Problem scenario: free text “tags” of gene names…
ID A B List_of_Genes
73 abc def Msx1, Foxc2, Sox9
74 mno pqr Tgfb3, Sox9, Bmp4, …
… … … …
Assays:
ID Term Synony
ms
1 Msx1 …
2 Tgfb3 …
3 Foxc2 …
4 Sox9 …
5 Bmp4 …
… … …
Genes:
Table of
canonical
terms
Table of genetic
assays
Objective: to extract the
free text “List_of_Genes”
in a normalized and
semantically aligned
association with the
canonical term set.
Before
PROBLEM
Free text listing of
gene names
related to the
assay records
Information Sciences Institute
Problem scenario: free text “tags” of gene names…
ID A B List_of_Genes
73 abc def Msx1, Foxc2, Sox9
74 mno pqr Tgfb3, Sox9, Bmp4, …
… … … …
Assays:
ID Term Synony
ms
1 Msx1 …
2 Tgfb3 …
3 Foxc2 …
4 Sox9 …
5 Bmp4 …
… … …
Genes:
Table of
canonical
terms
Assay Gene
73 Msx1
73 Foxc2
73 Sox9
74 Tgfb3
74 Sox9
74 Bmp4
… …
Assay_Genes:
Table of
semantically
aligned assay
to gene name
association
Table of genetic
assays
Objective: to extract the
free text “List_of_Genes”
in a normalized and
semantically aligned
association with the
canonical term set.
After
Information Sciences Institute
Tagify: normalize and semantically align free form
tags…
d Tagifya
r
↦ d Aligna (Atomizea r)
↦ d Aligna (μ (ReifySub
a r))
↦ d Aligna (μ (𝜋key(r),a r))
↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d)
(Tagify
definition)
(Atomize definition)
(ReifySub definition)
(Align definition)
Input:
d: Genes
r: Assays
Parameters:
a: List_of_Genes
(others omitted for brevity)
Result:
Computes association table
Iteratively apply composite SMO
definitions until we have an
expression of only primitive SMOs
Information Sciences Institute
Tagify: normalize and semantically align free form
tags…
d Tagifya
r
↦ d Aligna (Atomizea r)
↦ d Aligna (μ (ReifySub
a r))
↦ d Aligna (μ (𝜋key(r),a r))
↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d)
(Tagify
definition)
(Atomize definition)
(ReifySub definition)
(Align definition)
Input:
d: Genes
r: Assays
Parameters:
a: List_of_Genes
(others omitted for brevity)
Result:
Computes association table
Information Sciences Institute
Tagify: normalize and semantically align free form
tags…
d Tagifya
r
↦ d Aligna (Atomizea r)
↦ d Aligna (μ (ReifySub
a r))
↦ d Aligna (μ (𝜋key(r),a r))
↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d)
(Tagify
definition)
(Atomize definition)
(ReifySub definition)
(Align definition)
Input:
d: Genes
r: Assays
Parameters:
a: List_of_Genes
(others omitted for brevity)
Result:
Computes association table
Information Sciences Institute
Tagify: normalize and semantically align free form
tags…
d Tagifya
r
↦ d Aligna (Atomizea r)
↦ d Aligna (μ (ReifySub
a r))
↦ d Aligna (μ (𝜋key(r),a r))
↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d)
(Tagify
definition)
(Atomize definition)
(ReifySub definition)
(Align definition)
Input:
d: Genes
r: Assays
Parameters:
a: List_of_Genes
(others omitted for brevity)
Result:
Computes association table
Information Sciences Institute
Tagify: normalize and semantically align free form
tags…
d Tagifya
r
↦ d Aligna (Atomizea r)
↦ d Aligna (μ (ReifySub
a r))
↦ d Aligna (μ (𝜋key(r),a r))
↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d)
(Tagify
definition)
(Atomize definition)
(ReifySub definition)
(Align definition)
Input:
d: Genes
r: Assays
Parameters:
a: List_of_Genes
(others omitted for brevity)
Result:
Computes association table
Information Sciences Institute
Big data bags
• A packaging format for encapsulating
– Payload: arbitrary content
– Tags: metadata describing the payload
– Checksums: supports verification of content
Bio_data_bag/
|-- data
| -- genomic
| -- 2a673.fastq
| -- 2a673.fastq
| -- manifest-md5.txt
| afbfa23123bfa data/genomic/2a673.fasta
| -- bagit.txt
Contact-Name: John Smith
Chard, Kyle, et al. "I'll take that to go:
Big data bags and minimal identifiers
for exchange of large, complex
datasets." 2016 IEEE International
Conference on Big Data (Big Data). IEEE,
2016
Information Sciences Institute
Rich metadata for Bags
{
"@context": {
"@vocab": "http://purl.org/dc/terms/",
"dcmi":
"http://purl.org/dc/dcmitype/Dataset"
},
"@id": "../../data/numbers.csv",
"@type": "dcmi:Dataset",
"title": "CSV files of beverage consumption",
”description": "A CSV file listing the number
of cups consumed per person.”
}
38
Research ObjectsOpen Knowledge Foundation
Table Schema
{
"name": "Method",
"schema": {
"fields": [
{
"name": "id",
"title": "A globally unique ID",
"type": "string",
"constraints": { "required": true, "unique": true}
},
],
"missingValues": [ "" ], "primaryKey": "id"
}
},
Information Sciences Institute
Making Bags big
• Manifest lists all content and checksums
• Available content contained in “data” directory
• Content may be “missing”
– Missing content must be listed in “fetch.txt”
– Fetch entries list local name in data directory, and URL of where to fetch data
• Have created tool for creating, validating and materializing bags
• Globus Auth, and Globus Transfer are supported
Information Sciences Institute
CFDE Metadata Loading Flow
CFDE Repo
DCC
Extract CFDE
Metadata
Pull referenced
BDBag
Create DERIVA
Catalog
Load asset
information
save to shared
metadata
repository
Automate
Common Fund Data Environment
Scalable approach to integrating data coordinating center asset information
DCC
DCC
Transform DCC
metadata into
C2M2 records
DCC
administrative
interface
dashboard
repeatable &
extensible
process
CFDE
metadata
catalog
reusable
data asset
information
queries
Information Sciences Institute
C2M2 Model
files
datasets
subjects
orgs
protocols
events
samples
We have defined an entity/relationship model
that captures the high-level asset descriptions
in a form that we can map different data into.
relationships
“You can only manage what
you can measure”
• Drives metadata choices
• Focused on collecting
information to answer
specific questions
• E.g.,
• How much data?
• What type?
• Where is it stored?
Information Sciences Institute
Chaise – An adaptive model-driven
Web user interface
• How little can we assume?
– discovery, analysis, visualization, editing, sharing and collaboration over tabular
data (ERMRest).
• Makes almost no assumptions about data model
– Introspect the data model from ERMrest.
– Use heuristics, for instance, how to flatten a hierarchical structure into a
simplified presentation for searching and viewing.
– Schema annotations are used to modify or override its rendering heuristics, for
instance, to hide a column of a table or to use a specific display name.
– Apply user preferences to override, for instance, to present a nested table of
data in a transposed layout.
Information Sciences Institute
Management of all scientific assets
Information Sciences Institute
Dashboards to track progress
Information Sciences Institute
Dynamic data collections with global IDs
Information Sciences Institute
Model driven data entry…
Information Sciences Institute
With controlled vocabulary.
Information Sciences Institute
Information Sciences Institute
Information Sciences Institute
Information Sciences Institute
Information Sciences Institute
Information Sciences Institute
Information Sciences Institute
Information Sciences Institute
Acknowledgments
• Rob Schuler, Karl Czajkowski, Hongsuda Tangmunarunkit, Mike D’Arcy
• Rick Wagner, Kyle Chard, Ravi Maduri, Ian Foster
• Many, many conversations with John Seely Brown (jsb)
– See: Design Unbound: Designing for Emergence in a White Water World, Brown
and Pendleton-Jullian, MIT Press
Information Sciences Institute
Summary
• Think of eScience infrastructure as part of socio-technical system
– Design for the interface across communities of practice
• Validated this approach with Deriva across many domains and scales
– Craniofacial dysmorphia, protein structure database, molecular atlas, kidney
reconstruction, dynamic synapse mapping, optimization models, pancreatic beta
cell modeling, developmental biology, …
• For more information: www.derivacloud.org

More Related Content

What's hot

Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...
Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...
Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...António Correia
 
DERI Stream Meeting 2010: What I'm working on
DERI Stream Meeting 2010: What I'm working onDERI Stream Meeting 2010: What I'm working on
DERI Stream Meeting 2010: What I'm working onjodischneider
 
Thinking about technology .... differently
Thinking about technology .... differentlyThinking about technology .... differently
Thinking about technology .... differentlylisld
 
KNOWLEDGE MANAGEMENT
KNOWLEDGE MANAGEMENTKNOWLEDGE MANAGEMENT
KNOWLEDGE MANAGEMENTNadeem Nazir
 
The research library: scalable efficiency and scalable learning
The research library: scalable efficiency and scalable learningThe research library: scalable efficiency and scalable learning
The research library: scalable efficiency and scalable learninglisld
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Stuart Weibel
 
Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04
Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04
Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04jodischneider
 
from local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspacefrom local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global DataspaceOpen Education Consortium
 
Ralph schroeder and eric meyer
Ralph schroeder and eric meyerRalph schroeder and eric meyer
Ralph schroeder and eric meyeroiisdp
 
2009-C&T-NodeXL and social queries - a social media network analysis toolkit
2009-C&T-NodeXL and social queries - a social media network analysis toolkit2009-C&T-NodeXL and social queries - a social media network analysis toolkit
2009-C&T-NodeXL and social queries - a social media network analysis toolkitMarc Smith
 
Managing Knowledge in a Network Environment
Managing Knowledge in a Network EnvironmentManaging Knowledge in a Network Environment
Managing Knowledge in a Network EnvironmentAlbert Simard
 
Evaluating Library Capacity to Manage Research Data
Evaluating Library Capacity to Manage Research DataEvaluating Library Capacity to Manage Research Data
Evaluating Library Capacity to Manage Research DataCharleston Conference
 
Activating Research Collaboratories with Collaboration Patterns
Activating Research Collaboratories with Collaboration PatternsActivating Research Collaboratories with Collaboration Patterns
Activating Research Collaboratories with Collaboration PatternsCommunitySense
 
Interactive Innovation Through Social Software And Web 2.0
Interactive Innovation Through Social Software And Web 2.0Interactive Innovation Through Social Software And Web 2.0
Interactive Innovation Through Social Software And Web 2.0Thomas Ryberg
 
Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?
Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?
Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?susan borda
 

What's hot (20)

Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...
Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...
Exploiting classical bibliometrics of CSCW: classification, evaluation, limit...
 
DERI Stream Meeting 2010: What I'm working on
DERI Stream Meeting 2010: What I'm working onDERI Stream Meeting 2010: What I'm working on
DERI Stream Meeting 2010: What I'm working on
 
Next Generation Systems
Next Generation SystemsNext Generation Systems
Next Generation Systems
 
ALTMETRICS
ALTMETRICSALTMETRICS
ALTMETRICS
 
Thinking about technology .... differently
Thinking about technology .... differentlyThinking about technology .... differently
Thinking about technology .... differently
 
KNOWLEDGE MANAGEMENT
KNOWLEDGE MANAGEMENTKNOWLEDGE MANAGEMENT
KNOWLEDGE MANAGEMENT
 
The research library: scalable efficiency and scalable learning
The research library: scalable efficiency and scalable learningThe research library: scalable efficiency and scalable learning
The research library: scalable efficiency and scalable learning
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?
 
Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04
Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04
Viewing universities as landscapes of scholarship, VIVO keynote, 2017-08-04
 
from local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspacefrom local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspace
 
Ralph schroeder and eric meyer
Ralph schroeder and eric meyerRalph schroeder and eric meyer
Ralph schroeder and eric meyer
 
2009-C&T-NodeXL and social queries - a social media network analysis toolkit
2009-C&T-NodeXL and social queries - a social media network analysis toolkit2009-C&T-NodeXL and social queries - a social media network analysis toolkit
2009-C&T-NodeXL and social queries - a social media network analysis toolkit
 
Managing Knowledge in a Network Environment
Managing Knowledge in a Network EnvironmentManaging Knowledge in a Network Environment
Managing Knowledge in a Network Environment
 
Evaluating Library Capacity to Manage Research Data
Evaluating Library Capacity to Manage Research DataEvaluating Library Capacity to Manage Research Data
Evaluating Library Capacity to Manage Research Data
 
Activating Research Collaboratories with Collaboration Patterns
Activating Research Collaboratories with Collaboration PatternsActivating Research Collaboratories with Collaboration Patterns
Activating Research Collaboratories with Collaboration Patterns
 
Oess NCRM Festival
Oess NCRM FestivalOess NCRM Festival
Oess NCRM Festival
 
Interactive Innovation Through Social Software And Web 2.0
Interactive Innovation Through Social Software And Web 2.0Interactive Innovation Through Social Software And Web 2.0
Interactive Innovation Through Social Software And Web 2.0
 
The Future of Research Communications and e-Scholarship: Are we there yet?
The Future of Research Communications and e-Scholarship: Are we there yet?The Future of Research Communications and e-Scholarship: Are we there yet?
The Future of Research Communications and e-Scholarship: Are we there yet?
 
Innovation ecosystems value co-creation
Innovation ecosystems value co-creationInnovation ecosystems value co-creation
Innovation ecosystems value co-creation
 
Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?
Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?
Can Repositories be Attractive—Even Sexy—Features of Digital Libraries?
 

Similar to Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery

Virtual Communities: Catalysts for Advancing Scholarship
Virtual Communities: Catalysts for Advancing ScholarshipVirtual Communities: Catalysts for Advancing Scholarship
Virtual Communities: Catalysts for Advancing ScholarshipJohn Butler
 
Supporting Content Curation Communities: The Case of the Encyclopedia of Life
Supporting Content Curation Communities: The Case of the Encyclopedia of LifeSupporting Content Curation Communities: The Case of the Encyclopedia of Life
Supporting Content Curation Communities: The Case of the Encyclopedia of LifeHarish Vaidyanathan
 
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...Robert H. McDonald
 
"Building Capacity for Open Research" - AAMC
"Building Capacity for Open Research" - AAMC"Building Capacity for Open Research" - AAMC
"Building Capacity for Open Research" - AAMCKaitlin Thaney
 
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.Carole Goble
 
Wild data: collaborative e-research and university libraries
Wild data: collaborative e-research and university librariesWild data: collaborative e-research and university libraries
Wild data: collaborative e-research and university librariesRAILS7
 
Investing in a time of desruptive change
Investing in a time of desruptive changeInvesting in a time of desruptive change
Investing in a time of desruptive changeJisc
 
California Ocean Science Trust " Building a Sustainable Knowledge Base for ...
California Ocean Science Trust " Building a Sustainable Knowledge Base for ...California Ocean Science Trust " Building a Sustainable Knowledge Base for ...
California Ocean Science Trust " Building a Sustainable Knowledge Base for ...Tom Moritz
 
FORCE11: Future of Research Communications and e-Scholarship
FORCE11:  Future of Research Communications and e-ScholarshipFORCE11:  Future of Research Communications and e-Scholarship
FORCE11: Future of Research Communications and e-ScholarshipMaryann Martone
 
ICDMWorkshopProposal.doc
ICDMWorkshopProposal.docICDMWorkshopProposal.doc
ICDMWorkshopProposal.docbutest
 
Making the web work for science - RIT Dean's Lecture Series
Making the web work for science - RIT Dean's Lecture SeriesMaking the web work for science - RIT Dean's Lecture Series
Making the web work for science - RIT Dean's Lecture SeriesKaitlin Thaney
 
Open Access and Research Communication: The Perspective of Force11
Open Access and Research Communication: The Perspective of Force11Open Access and Research Communication: The Perspective of Force11
Open Access and Research Communication: The Perspective of Force11Maryann Martone
 
ELPUB 2018 Feminist Open Science workshop
ELPUB 2018   Feminist Open Science workshopELPUB 2018   Feminist Open Science workshop
ELPUB 2018 Feminist Open Science workshopLeslie Chan
 
In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity DebateIn Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity DebateNeuroscience Information Framework
 
2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...
2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...
2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...eMadrid network
 

Similar to Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery (20)

Virtual Communities: Catalysts for Advancing Scholarship
Virtual Communities: Catalysts for Advancing ScholarshipVirtual Communities: Catalysts for Advancing Scholarship
Virtual Communities: Catalysts for Advancing Scholarship
 
Supporting Content Curation Communities: The Case of the Encyclopedia of Life
Supporting Content Curation Communities: The Case of the Encyclopedia of LifeSupporting Content Curation Communities: The Case of the Encyclopedia of Life
Supporting Content Curation Communities: The Case of the Encyclopedia of Life
 
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
 
"Building Capacity for Open Research" - AAMC
"Building Capacity for Open Research" - AAMC"Building Capacity for Open Research" - AAMC
"Building Capacity for Open Research" - AAMC
 
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
 
Wild data: collaborative e-research and university libraries
Wild data: collaborative e-research and university librariesWild data: collaborative e-research and university libraries
Wild data: collaborative e-research and university libraries
 
Investing in a time of desruptive change
Investing in a time of desruptive changeInvesting in a time of desruptive change
Investing in a time of desruptive change
 
Shifting from librarian to data manager
Shifting from librarian to data managerShifting from librarian to data manager
Shifting from librarian to data manager
 
California Ocean Science Trust " Building a Sustainable Knowledge Base for ...
California Ocean Science Trust " Building a Sustainable Knowledge Base for ...California Ocean Science Trust " Building a Sustainable Knowledge Base for ...
California Ocean Science Trust " Building a Sustainable Knowledge Base for ...
 
FORCE11: Future of Research Communications and e-Scholarship
FORCE11:  Future of Research Communications and e-ScholarshipFORCE11:  Future of Research Communications and e-Scholarship
FORCE11: Future of Research Communications and e-Scholarship
 
ICDMWorkshopProposal.doc
ICDMWorkshopProposal.docICDMWorkshopProposal.doc
ICDMWorkshopProposal.doc
 
Open Science
Open Science Open Science
Open Science
 
Alpsp final martone
Alpsp final martoneAlpsp final martone
Alpsp final martone
 
Making the web work for science - RIT Dean's Lecture Series
Making the web work for science - RIT Dean's Lecture SeriesMaking the web work for science - RIT Dean's Lecture Series
Making the web work for science - RIT Dean's Lecture Series
 
Open Access and Research Communication: The Perspective of Force11
Open Access and Research Communication: The Perspective of Force11Open Access and Research Communication: The Perspective of Force11
Open Access and Research Communication: The Perspective of Force11
 
ELPUB 2018 Feminist Open Science workshop
ELPUB 2018   Feminist Open Science workshopELPUB 2018   Feminist Open Science workshop
ELPUB 2018 Feminist Open Science workshop
 
In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity DebateIn Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
 
Sabina Leonelli
Sabina LeonelliSabina Leonelli
Sabina Leonelli
 
2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...
2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...
2015 03 19 (EDUCON2015) eMadrid UPM Towards a Learning Analytics Approach for...
 
Jonathan Breeze, Symplectic
Jonathan Breeze, SymplecticJonathan Breeze, Symplectic
Jonathan Breeze, Symplectic
 

More from Globus

Advanced Globus System Administration Topics
Advanced Globus System Administration TopicsAdvanced Globus System Administration Topics
Advanced Globus System Administration TopicsGlobus
 
Instrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowInstrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowGlobus
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaSGlobus
 
Reliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesReliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesGlobus
 
Best Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusBest Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusGlobus
 
An Introduction to Globus for Researchers
An Introduction to Globus for ResearchersAn Introduction to Globus for Researchers
An Introduction to Globus for ResearchersGlobus
 
Introduction to Research Automation with Globus
Introduction to Research Automation with GlobusIntroduction to Research Automation with Globus
Introduction to Research Automation with GlobusGlobus
 
Globus for System Administrators
Globus for System AdministratorsGlobus for System Administrators
Globus for System AdministratorsGlobus
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System AdministratorsGlobus
 
Introduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersIntroduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersGlobus
 
Introduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersIntroduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersGlobus
 
Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Globus
 
Automating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeAutomating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeGlobus
 
Automating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformAutomating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformGlobus
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System AdministratorsGlobus
 
Introduction to Globus for New Users
Introduction to Globus for New UsersIntroduction to Globus for New Users
Introduction to Globus for New UsersGlobus
 
Working with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsWorking with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsGlobus
 
Globus Automation
Globus AutomationGlobus Automation
Globus AutomationGlobus
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 

More from Globus (20)

Advanced Globus System Administration Topics
Advanced Globus System Administration TopicsAdvanced Globus System Administration Topics
Advanced Globus System Administration Topics
 
Instrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowInstrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a Flow
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaS
 
Reliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesReliable, Remote Computation at All Scales
Reliable, Remote Computation at All Scales
 
Best Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusBest Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using Globus
 
An Introduction to Globus for Researchers
An Introduction to Globus for ResearchersAn Introduction to Globus for Researchers
An Introduction to Globus for Researchers
 
Introduction to Research Automation with Globus
Introduction to Research Automation with GlobusIntroduction to Research Automation with Globus
Introduction to Research Automation with Globus
 
Globus for System Administrators
Globus for System AdministratorsGlobus for System Administrators
Globus for System Administrators
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
 
Introduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersIntroduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for Researchers
 
Introduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersIntroduction to the Globus Platform for Developers
Introduction to the Globus Platform for Developers
 
Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)
 
Automating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeAutomating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and Compute
 
Automating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformAutomating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus Platform
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
 
Introduction to Globus for New Users
Introduction to Globus for New UsersIntroduction to Globus for New Users
Introduction to Globus for New Users
 
Working with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsWorking with Globus Platform Services and Portals
Working with Globus Platform Services and Portals
 
Globus Automation
Globus AutomationGlobus Automation
Globus Automation
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 

Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery

  • 1. Information Sciences Institute Virtual Organizations 2.0: Social Constructs for Data-centered Collaborative Discovery Carl Kesselman carl@isi.edu Dean’s Professor, Industrial and Systems Engineering Director: Informatics Systems Research Division, Information Sciences Institute University of Southern California
  • 2. Information Sciences Institute Set the way back machine…. “The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi- institutional virtual organizations.” The Anatomy of the Grid: Enabling Scalable Virtual Organizations Ian Foster, Carl Kesselman, Steven Tuecke, 2001 Globus Users Group 1999
  • 3. Information Sciences Institute Virtual Organization A virtual organization (VO) is a group of individuals whose members and resources may be dispersed geographically and institutionally, yet who function as a coherent unit through the use of cyberinfrastructure (CI). A VO is typically enabled by, and provides shared and often real-time access to, centralized or distributed resources, such as community-specific tools, applications, data, and sensors, and experimental operations. A VO may be known as or composed of systems known as collaboratories, e-Science or e-Research, distributed workgroups or virtual teams, virtual environments, and online communities. Cummings, Jonathon & Finholt, Thomas & Foster, Ian & Kesselman, Carl & Lawrence, Katherine. (2008). Beyond Being There: A Blueprint for Advancing the Design, Development, and Evaluation of Virtual Organizations. 10.13140/RG.2.1.2694.2167.
  • 4. Information Sciences Institute VOs: A Pre-Millenial Perspective
  • 5. Information Sciences Institute How do we create a scientific result Any time scientists disagree, it's because we have insufficient data. Then we can agree on what kind of data to get; we get the data; and the data solves the problem. Either I'm right, or you're right, or we're both wrong. And we move on. That kind of conflict resolution does not exist in politics or religion. Neil deGrasse Tyson • Science is about communities arguing over data – How do those communities form – How do communities argue: knowledge capture and communication
  • 6. Information Sciences Institute Data driven multidisciplinary collaboration is a adaptive socio-technical eco-system • How do manage social and technical subsystems • How do we optimize across the ecosystem across time and space Social Subsystem Concerned with the attributes of people (e.g. skills, attitudes, values), the relationships among people, reward systems, and authority structures Technical Subsystem Concerned with the processes, tasks, and technology needed to transform inputs to outputs Joint Optimization Militello, e. al.. (2013). Sources of variation in primary care clinical workflow: Implications for the design of cognitive support. Health informatics journal. Man-computer symbiosis … aims are 1) to let computers facilitate formulative thinking …., and 2) to enable men and computers to cooperate in making decisions and controlling complex situations without inflexible dependence on predetermined programs. Licklider, 1960
  • 7. Information Sciences Institute Top down or bottom Up: The TIMn perspective • Tribes • Institutions • Markets • Networks Ronfeldt, David, Tribes, Institutions, Markets, Networks: A Framework About Societal Evolution. Santa Monica, CA: RAND Corporation, 1996. Traditional eScience projects look like markets built around episodic exchange of papers/data (publication) Transformational science requires networks.
  • 8. Information Sciences Institute Community of practice • With diverse viewpoints, role, etc. • Engaged in joint work • Over a significant period of time , • In which they build things, solve problems, learn , invent, and negotiate meaning, • And evolve a way of talking and reading each other! A group of people….
  • 9. Information Sciences Institute Networks of practice for transformational science • Multiple communities working together in integrated, dynamic, adaptive and open collaboration groups Concepts knowing that Stories “best practices” Intuition skills Knowing “how” Genres Work practices Knowing - - communities of practice groups Concepts knowing that Stories “best practices” Intuition skills Knowing “how” Genres Work practices Knowing - - communities of practice individuals groups Concepts knowing that Stories “best practices” Intuition skills Knowing “how” Genres Work practices Knowing - - communities of practice ExplicitTacit VS.
  • 10. Information Sciences Institute ’ Knowledge flow across communities of practice • Boundary objects and knowledge brokers Boundary object Star, Susan Leigh, and James R. Griesemer. “Institutional Ecology, `Translations’ and Boundary Objects: Amateurs and Professionals in Berkeley’s Museum of Vertebrate Zoology, 1907-39.” Social Studies of Science 19, no. 3 Community of Practice Community of Practice Translator/Knowledge Broker Community of Practice Community of Practice Information used in different ways by different communities. Boundary objects are plastic, interpreted differently across communities but with enough immutable content to maintain integrity.
  • 11. Information Sciences Institute Data as a boundary object • Must be findable by network of practice, • Must be accessible across the network while following norms of the community • Must be interpretable by other community (interoperable) – The bits – The terms we use to describe – The relationships between elements • Must provide value to other community – Useable in new context Where have we seen these requirements before…..
  • 12. Information Sciences Institute FAIR data… • Findable – identified by a unique identifier, characterized by rich metadata • Accessible – standard protocol with access control, metadata accessible even when the data is not, • Interoperable – by standardized terms to describe it • Reusable – Accurate and relevant attributes.
  • 13. Information Sciences Institute FAIR collaboration • Data as boundary object --- Create FAIR data at every part of a scientific investigation – Enable “long tail” science organized around data • Requires accurate descriptions of data –Characteristics of data element –Relationships between data elements • Question: How do we create and maintain these metadata and relationships in a collaborative environment? Madduri, Ravi, et al. "Reproducible big data science: a case study in continuous FAIRness." PloS one 14.4 (2019).
  • 14. Information Sciences Institute Scientific asset management system • Data Driven Collaboration – Think of it as a “photo manager” for scientific data – Maintain FAIR data from creation through publication • Captures data and relationships between data – Catalog to capture relationships between data – Object store to hold data • Can rapidly change to follow changes in scientific knowledge Discovery Environment for Relational Information and Versioned Assets (DERIVA)
  • 15. Information Sciences Institute Relational Database Turn scientists into active participants throughout the data lifecycle (Schuler, Czajkowski, Kesselman 2014; Schuler, Czajkowski, Kesselman 2016; Bugacov et al 2017) acquisition annotation organizationsharing derivation Data Lifecycle REPLACE: • spreadsheets, • “meaningful” filenames, • inefficient EAV models,… WITH: • well-formed relational models… but… how to evolve model? tossing it over the fence ✔ “self serve” data curaMon Web Interfaces
  • 16. Information Sciences Institute DERIVA promotes FAIR data production • F: providing rich metadata using an Entity-Relationship model to express relationships between diverse data elements; • A: offering rich access control and access to metadata via standard HTTP web service interfaces; • I: integrating with standardized terms defined by collaborators, consortium or communities; and • R: supporting dynamic model evolution so that the data presented accurately represents the current structure and state of knowledge within an investigation.
  • 18. Information Sciences Institute Collaborative Metadata Management • ERMRest: A web service for managing metadata and relationships between data assets – Everything is a resource with a URI and interface • Entity/Relationship modeling
  • 19. Information Sciences Institute ERMRest and Globus • Globus Auth for all authenticated user access • Extensive use of Globus Groups for fine grain access control – Use groups as roles – Write policy against
  • 20. Information Sciences Institute Data API • All operations mapped into URL: ermrest/catalog/1@Y-1Z4B/entity/foo/bar/x::gt::7 –Queries can be viewed as naming data –Tables exchanges as CSV or JSON • Operate on whole or partial entity, projections, aggregates • Joins, filter-based row selection, column projection, configurable sort order, and pagination. • Insertion, update, or deletion of data in one table at a time.
  • 21. Information Sciences Institute ERMRest Fine grain access control • ACL = (resource, permission, roles) –schema, table, column, or reference –access permission (e.g., insert, update, etc.) –a set of user or group identity • ACL is associated with a resource in the hierarchy: –catalog, schema, table, column, and constraint • ACLs are inherited simplifying management • Static and dynamic policy specifications • Enables context dependent policy, e.g. curation pipelines
  • 22. Information Sciences Institute Snapshots and history • Logical snapshots at every catalog mutation – Prior values are stored in the catalog – Snapshot ID created for every snapshot • Catalog URI can include snapshot ID –ermrest/catalog/1@Y-1Z4B/entity/foo/bar/x • History Management – Range Discovery, Truncation, Policy Amendment, Data Redaction
  • 23. Information Sciences Institute Persistent identifiers • Every entity ERMRest has unique ID (RID) – Assigned by catalog, can be used as foreign-key • Every version of entire catalog has a unique ID. • Query with snapshot ID uniquely names a entity set – ERMRest URI with snapshot ID • RID and Snapshot ID uniquely name an specific version of entity • Support for identifiers from community controlled vocabulary – E.g. Uberon, Schema.org, RRIDs
  • 24. Information Sciences Institute The “20 questions” approach, gets it started, but we need more agile methods throughout formative phases • Drive database and system design by ~20 key queries However… “Database technology has had limited uptake [in science and data science], in part due to the overhead in designing a schema… Changing data and changing requirements make it difficult to amortize these upfront costs...” (Gray, Szalay et al 2009) (Jain et al 2016) http://jimgray.azurewebsites.net/talks/SciData.pp t
  • 25. Information Sciences Institute Collaboration is an evolutionary process • Data interoperability, e.g. structure and meaning is time-dependent • Any evaluation of FAIRness is ill-posed unless we specify community and point in time! LFNG: Lunatic Fringe LFNG: O-fucosylpeptide 3-beta-N- acetylglucosaminyltransferase Ribes and Bowker. "A learning trajectory for ontology building.", 2005 Community Domain Representatives Targeted Future Users
  • 26. Information Sciences Institute Suboptimal schema leads to suboptimal data: e.g., “messy” data! Unconstrained Domain This happened
  • 27. Information Sciences Institute • “Everyone starts with the same schema: <stuff/>. Then they refine it.” – J. Widom
  • 28. Information Sciences Institute Numerous steps to evolve even “simple” database ID A B Litte r Anatomy Age 73 abc def YC1 Mandible E10.5 74 mn o pq r YC1 Mandible E10.5 75 stu vx y YC1 Maxila E8.5 ID A B FK 73 abc def 1 74 mno pqr 1 75 stu vxy 2 ID Litte r Anatomy Age 1 YC1 Mandible E10.5 2 YC1 Maxila E8.5 ID A B FK 73 abc def 1 74 mno pqr 1 75 stu vxy 2 ID Litte r Anatomy Age 1 YC1 Mandible E10.5 2 YC1 Maxila E8.5 Anatomy Mandible Maxila Measures Specimens Measures + Specimens Vocabulary Age E10.5 E8.5 Measures Specimens 1 3 4 2 Cleaning (not pictured)
  • 29. Information Sciences Institute Our approach: An open, extensible Data Evolution Language • Primitive SMO: indivisible operators, extended definitions of more common relational operators • Composite SMO: defined as a functional composition over other (primitive or composite) SMOs • Algebraic expressions can be decomposed and rewritten into semantically equivalent, more efficient expressions. • Embed in general-purpose language with user programs translated into the algebra. Conceptual overview of the approach Schuler, Robert E., and Carl Kesselman. "Towards an efficient and effective framework for the evolution of scientific databases." Proceedings of the 30th International Conference on Scientific and Statistical Database Management. 2018 Define set of Schema Modification Operators (SMO) and associated algebra
  • 30. Information Sciences Institute Problem scenario: free text “tags” of gene names… ID A B List_of_Genes 73 abc def Msx1, Foxc2, Sox9 74 mno pqr Tgfb3, Sox9, Bmp4, … … … … … Assays: ID Term Synony ms 1 Msx1 … 2 Tgfb3 … 3 Foxc2 … 4 Sox9 … 5 Bmp4 … … … … Genes: Table of canonical terms Table of genetic assays Objective: to extract the free text “List_of_Genes” in a normalized and semantically aligned association with the canonical term set. Before PROBLEM Free text listing of gene names related to the assay records
  • 31. Information Sciences Institute Problem scenario: free text “tags” of gene names… ID A B List_of_Genes 73 abc def Msx1, Foxc2, Sox9 74 mno pqr Tgfb3, Sox9, Bmp4, … … … … … Assays: ID Term Synony ms 1 Msx1 … 2 Tgfb3 … 3 Foxc2 … 4 Sox9 … 5 Bmp4 … … … … Genes: Table of canonical terms Assay Gene 73 Msx1 73 Foxc2 73 Sox9 74 Tgfb3 74 Sox9 74 Bmp4 … … Assay_Genes: Table of semantically aligned assay to gene name association Table of genetic assays Objective: to extract the free text “List_of_Genes” in a normalized and semantically aligned association with the canonical term set. After
  • 32. Information Sciences Institute Tagify: normalize and semantically align free form tags… d Tagifya r ↦ d Aligna (Atomizea r) ↦ d Aligna (μ (ReifySub a r)) ↦ d Aligna (μ (𝜋key(r),a r)) ↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d) (Tagify definition) (Atomize definition) (ReifySub definition) (Align definition) Input: d: Genes r: Assays Parameters: a: List_of_Genes (others omitted for brevity) Result: Computes association table Iteratively apply composite SMO definitions until we have an expression of only primitive SMOs
  • 33. Information Sciences Institute Tagify: normalize and semantically align free form tags… d Tagifya r ↦ d Aligna (Atomizea r) ↦ d Aligna (μ (ReifySub a r)) ↦ d Aligna (μ (𝜋key(r),a r)) ↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d) (Tagify definition) (Atomize definition) (ReifySub definition) (Align definition) Input: d: Genes r: Assays Parameters: a: List_of_Genes (others omitted for brevity) Result: Computes association table
  • 34. Information Sciences Institute Tagify: normalize and semantically align free form tags… d Tagifya r ↦ d Aligna (Atomizea r) ↦ d Aligna (μ (ReifySub a r)) ↦ d Aligna (μ (𝜋key(r),a r)) ↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d) (Tagify definition) (Atomize definition) (ReifySub definition) (Align definition) Input: d: Genes r: Assays Parameters: a: List_of_Genes (others omitted for brevity) Result: Computes association table
  • 35. Information Sciences Institute Tagify: normalize and semantically align free form tags… d Tagifya r ↦ d Aligna (Atomizea r) ↦ d Aligna (μ (ReifySub a r)) ↦ d Aligna (μ (𝜋key(r),a r)) ↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d) (Tagify definition) (Atomize definition) (ReifySub definition) (Align definition) Input: d: Genes r: Assays Parameters: a: List_of_Genes (others omitted for brevity) Result: Computes association table
  • 36. Information Sciences Institute Tagify: normalize and semantically align free form tags… d Tagifya r ↦ d Aligna (Atomizea r) ↦ d Aligna (μ (ReifySub a r)) ↦ d Aligna (μ (𝜋key(r),a r)) ↦ 𝜚t/a (𝜋-a,-s (μ (𝜋key(r),a r)) ⋈≡ d) (Tagify definition) (Atomize definition) (ReifySub definition) (Align definition) Input: d: Genes r: Assays Parameters: a: List_of_Genes (others omitted for brevity) Result: Computes association table
  • 37. Information Sciences Institute Big data bags • A packaging format for encapsulating – Payload: arbitrary content – Tags: metadata describing the payload – Checksums: supports verification of content Bio_data_bag/ |-- data | -- genomic | -- 2a673.fastq | -- 2a673.fastq | -- manifest-md5.txt | afbfa23123bfa data/genomic/2a673.fasta | -- bagit.txt Contact-Name: John Smith Chard, Kyle, et al. "I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets." 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016
  • 38. Information Sciences Institute Rich metadata for Bags { "@context": { "@vocab": "http://purl.org/dc/terms/", "dcmi": "http://purl.org/dc/dcmitype/Dataset" }, "@id": "../../data/numbers.csv", "@type": "dcmi:Dataset", "title": "CSV files of beverage consumption", ”description": "A CSV file listing the number of cups consumed per person.” } 38 Research ObjectsOpen Knowledge Foundation Table Schema { "name": "Method", "schema": { "fields": [ { "name": "id", "title": "A globally unique ID", "type": "string", "constraints": { "required": true, "unique": true} }, ], "missingValues": [ "" ], "primaryKey": "id" } },
  • 39. Information Sciences Institute Making Bags big • Manifest lists all content and checksums • Available content contained in “data” directory • Content may be “missing” – Missing content must be listed in “fetch.txt” – Fetch entries list local name in data directory, and URL of where to fetch data • Have created tool for creating, validating and materializing bags • Globus Auth, and Globus Transfer are supported
  • 40. Information Sciences Institute CFDE Metadata Loading Flow CFDE Repo DCC Extract CFDE Metadata Pull referenced BDBag Create DERIVA Catalog Load asset information save to shared metadata repository Automate Common Fund Data Environment Scalable approach to integrating data coordinating center asset information DCC DCC Transform DCC metadata into C2M2 records DCC administrative interface dashboard repeatable & extensible process CFDE metadata catalog reusable data asset information queries
  • 41. Information Sciences Institute C2M2 Model files datasets subjects orgs protocols events samples We have defined an entity/relationship model that captures the high-level asset descriptions in a form that we can map different data into. relationships “You can only manage what you can measure” • Drives metadata choices • Focused on collecting information to answer specific questions • E.g., • How much data? • What type? • Where is it stored?
  • 42. Information Sciences Institute Chaise – An adaptive model-driven Web user interface • How little can we assume? – discovery, analysis, visualization, editing, sharing and collaboration over tabular data (ERMRest). • Makes almost no assumptions about data model – Introspect the data model from ERMrest. – Use heuristics, for instance, how to flatten a hierarchical structure into a simplified presentation for searching and viewing. – Schema annotations are used to modify or override its rendering heuristics, for instance, to hide a column of a table or to use a specific display name. – Apply user preferences to override, for instance, to present a nested table of data in a transposed layout.
  • 43. Information Sciences Institute Management of all scientific assets
  • 45. Information Sciences Institute Dynamic data collections with global IDs
  • 46. Information Sciences Institute Model driven data entry…
  • 47. Information Sciences Institute With controlled vocabulary.
  • 55. Information Sciences Institute Acknowledgments • Rob Schuler, Karl Czajkowski, Hongsuda Tangmunarunkit, Mike D’Arcy • Rick Wagner, Kyle Chard, Ravi Maduri, Ian Foster • Many, many conversations with John Seely Brown (jsb) – See: Design Unbound: Designing for Emergence in a White Water World, Brown and Pendleton-Jullian, MIT Press
  • 56. Information Sciences Institute Summary • Think of eScience infrastructure as part of socio-technical system – Design for the interface across communities of practice • Validated this approach with Deriva across many domains and scales – Craniofacial dysmorphia, protein structure database, molecular atlas, kidney reconstruction, dynamic synapse mapping, optimization models, pancreatic beta cell modeling, developmental biology, … • For more information: www.derivacloud.org