We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
This document discusses lessons learned for achieving interoperability. It recommends having a clear purpose, starting with basic conventions like identifiers, monitoring commitments to build trust, and focusing on outward-facing interoperability through simple APIs and platforms rather than full software stacks. Observance of industry practices like authentication methods and cloud-based platforms is also advised to promote rapid development and distribution of applications.
Research Automation for Data-Driven Discovery - Globus
This document discusses research automation and data-driven discovery. It notes that data volumes are growing much faster than computational power, creating a productivity crisis in research. However, most labs have limited resources to handle these large data volumes. The document proposes applying lessons from industry to create cloud-based science services with standardized APIs that can automate and outsource common tasks like data transfer, sharing, publishing, and searching. This would help scientists focus on their core research instead of computational infrastructure. Examples of existing services from Globus and the Materials Data Facility are presented. The goal is to establish robust, scalable, and persistent cloud platforms to help address the challenges of data-driven scientific discovery.
Gateways 2020 Tutorial - Introduction to Globus - Globus
Globus provides a platform and services for simplifying data management and sharing for science gateways and applications. It offers fast and reliable file transfers between any storage systems, secure data sharing without copying data, and APIs and SDKs for building applications. Globus uses OAuth authentication and supports a variety of interfaces like CLI, Python SDK, and Jupyter notebooks to enable access.
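To make the platform concrete, here is a minimal sketch of submitting a file transfer with the Globus Python SDK; it assumes an access token already obtained through a Globus Auth OAuth2 flow, and the endpoint UUIDs and paths are placeholders.

```python
import globus_sdk

# Assumes an access token already obtained via a Globus Auth OAuth2 flow;
# the token and endpoint UUIDs below are placeholders.
TRANSFER_TOKEN = "..."
SRC_ENDPOINT = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"  # hypothetical source
DST_ENDPOINT = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"  # hypothetical destination

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Build a transfer task: one file, with integrity checking enabled.
task = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT,
    label="example transfer", verify_checksum=True
)
task.add_item("/data/results.h5", "/home/user/results.h5")

submission = tc.submit_transfer(task)
print("Task ID:", submission["task_id"])
```

The task runs asynchronously on the Globus service, so the script can exit immediately after submission and poll the task ID later.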
A presentation at the NIH Workshop on Advanced Networking for Data-Intensive Biomedical Research. The talk covers our work with the science community on using cloud computing to enhance basic research, data analysis, and scientific discovery.
Big Data Processing in the Cloud: a Hydra/Sufia Experience
Zhiwu Xie, Ph.D., Associate Professor and Technology Development Librarian, Center for Digital Research and Scholarship, University Libraries, Virginia Tech
This document provides information about Microsoft Azure for Research, a program that provides free cloud computing resources and services to academic and non-commercial researchers. It describes the various tools and services available on Azure including virtual machines, storage, databases, and tools for application development. It highlights case studies of researchers using Azure. It also details the Azure for Research Award Program that provides a free year of Azure services through a bi-monthly proposal process, as well as special opportunity requests on topics like machine learning, genomics, and climate data.
An Approach for RDF-based Semantic Access to NoSQL Repositories, presented as a partial requirement for the course "Metodologia da Pesquisa em Ciência da Computação" at UFSC, 2015.
Globus and Dataverse: Towards Big Data Publication - Globus
Globus is a non-profit service that aims to increase research efficiency through sustainable software. It unifies access to data across disparate systems like cloud storage, repositories, and research instruments. Globus Connectors support secure data sharing and transfer between these systems. The document proposes using Globus and Dataverse together for big data publication, allowing faceted searches of a Dataverse repository, authorization controls, and asynchronous bulk data transfer using Globus for larger datasets. A demonstration of this combination is available online.
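As a hedged illustration of the faceted-search half of this combination, the sketch below queries a Dataverse installation's Search API from Python; the server URL is a placeholder, while the /api/search endpoint is part of standard Dataverse.

```python
import requests

# Hypothetical Dataverse installation; the Search API path (/api/search)
# is standard across Dataverse deployments.
BASE = "https://dataverse.example.edu"

resp = requests.get(f"{BASE}/api/search",
                    params={"q": "climate", "type": "dataset"})
resp.raise_for_status()

# Each search hit for a dataset carries a persistent identifier and a name.
for item in resp.json()["data"]["items"]:
    print(item.get("global_id"), "-", item.get("name"))
```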
DataverseNL is a data repository service started in 2014 by DANS as a shared service for 15 Dutch institutions. It currently contains over 200 dataverses and 450 datasets that have been downloaded over 7,000 times. DANS aims to use DataverseNL for ongoing research projects and then archive finalized datasets in its Trusted Digital Repository (TDR) for permanent preservation. DataverseNL serves as a collaboration platform and integration point for sharing research data across Dutch universities and organizations. DANS is working to link DataverseNL metadata to semantic web vocabularies and expose it as linked open data.
OR2019 DSpace 7 Enhanced Submission & Workflow - 4Science
The last two years have been very intense for the DSpace community. A great effort has been put into finalizing the development of a DSpace release, 7.0, which has many changes from previous releases, particularly with regard to UI technology.
Among the activities related to the creation of DSpace 7, the submission and workflow process that can be associated with the different collections is particularly innovative.
The presentation will provide a deep dive into the new Enhanced Submission and Workflow features of DSpace 7, including how to configure, customize, and use them (and the differences from DSpace 6 and below).
iRODS UGM 2018: FAIR Data Management and DISQOVERability - Maarten Coonen
The document summarizes the DataHub at Maastricht University, which uses iRODS for FAIR data management and discoverability. DataHub goes beyond just iRODS to include additional services like a web portal, metadata entry, and semantic search using DISQOVER. It aims to make data both human and machine-readable by using ontologies and linked data principles when storing and enriching metadata. Major milestones since its start in 2014 include various software releases that expand its capabilities in support of the FAIR data principles.
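A minimal sketch of the kind of metadata enrichment described above, using the python-irodsclient library; the zone, credentials, path, and AVU values are illustrative assumptions.

```python
from irods.session import iRODSSession

# Connection parameters are placeholders for a hypothetical iRODS zone.
with iRODSSession(host="irods.example.org", port=1247,
                  user="alice", password="secret", zone="demoZone") as session:
    obj = session.data_objects.get("/demoZone/home/alice/scan_001.nii")

    # Attach machine-readable metadata as AVU triples (attribute, value, unit),
    # including a resolvable ontology term for linked-data style enrichment.
    obj.metadata.add("subject_species", "Homo sapiens")
    obj.metadata.add("ontology_term",
                     "http://purl.obolibrary.org/obo/NCBITaxon_9606")

    for avu in obj.metadata.items():
        print(avu.name, avu.value, avu.units)
```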
Architecting An Enterprise Storage Platform Using Object Stores - Niraj Tolia
This document discusses architecting an enterprise storage platform using object stores. It summarizes MagFS, a file system designed for the cloud that is layered on top of object storage. Key points include:
- MagFS provides a consistent, elastic, secure and mobile-enabled file system experience while leveraging low-cost object storage.
- The client architecture pushes intelligence to edges for heavy lifting like encryption, deduplication, and caching while coordinating with metadata servers.
- Metadata servers enforce strong consistency, authentication, and garbage collection while optimizing performance through virtualization and lease-based caching.
- Security is ensured through server-driven request signing and scrubbing of writes to object storage after client acknowledgment.
Leverage DSpace for an Enterprise, Mission-Critical Platform - Andrea Bollini
Conference: Open Repositories, Indianapolis, 8-12 June 2015
Presenters: Andrea Bollini, Michele Mennielli
Cineca, Italy
We would like to share with the DSpace community some useful tips, starting with how to embed DSpace in a larger IT ecosystem that adds value to the information it manages. We will then show how publication data in DSpace, enriched through proper use of the authority framework, can be combined with information from the HR system. The combined system can then provide rich, detailed reports and analysis through a business intelligence solution based on Pentaho's Mondrian OLAP open-source data integration tools.
We will also present other use cases related to managing publication information for reporting purposes: the publication record has a longer lifecycle than in a basic IR; the system load is much heavier, especially for writes, since researchers must be able to enrich records whenever new requirements arrive from the government or the university research office; and data quality requires the ability to make distributed changes to a publication even after a validation workflow has concluded.
Finally, we will present our direct experience and the challenges we faced in making DSpace easily and rapidly deployable to more than 60 sites.
PSICQUIC is a community effort to standardize access to and retrieval of molecular interaction data from decentralized databases. It uses a client-server model in which a single client can integrate information from multiple sources through a common query interface and standard formats. PSICQUIC exposes over 150 million binary interactions and supports discovering services through its registry, querying them with MIQL, and visualizing results.
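For example, a single client can retrieve MITAB-formatted interactions from one registered PSICQUIC service with a plain HTTP request, as in this sketch; the IntAct service URL shown is one of many and may change, so production code would first consult the PSICQUIC registry.

```python
import requests

# One registered PSICQUIC service (IntAct); service URLs can change, so in
# practice you would look them up in the PSICQUIC registry first.
SERVICE = ("https://www.ebi.ac.uk/Tools/webservices/psicquic/intact/"
           "webservices/current/search/query")

# MIQL query: all interactions involving BRCA2, returned in MITAB 2.5 format.
resp = requests.get(f"{SERVICE}/identifier:BRCA2", params={"format": "tab25"})
resp.raise_for_status()

# MITAB is tab-separated; the first two columns identify the interactors.
for line in resp.text.splitlines()[:5]:
    cols = line.split("\t")
    print(cols[0], "<->", cols[1])
```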
This presentation has been used to start the pilot phase of the OpenAIRE Advance-funded implementation project in DSpace-CRIS.
DSpace-CRIS now provides support for the OpenAIRE Guidelines for CRIS Managers, in addition to the previously supported guidelines for literature repositories and data archives.
This document summarizes several papers on integrating NoSQL databases. It discusses different approaches to integration such as global-as-view and local-as-view. Most solutions presented use a global-as-view approach and expose a unified REST API. The papers cover domains like healthcare, biodiversity, and graph matching. The BigDAWG system is highlighted as the most complete approach, federating access across different database models in a scalable way.
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin... - Continuent
Elasticsearch provides a quick and easy method to aggregate data, whether you want to use it for simplifying your search across multiple depots and databases, or as part of your analytics stack. Getting the data from your transactional engines into Elasticsearch is something that can be achieved within your application layer with all of the associated development and maintenance costs. Instead, offload the operation and simplify your deployment by using direct data replication to handle the insert, update and delete processes.
AGENDA
- Basic replication model
- How to concentrate data from multiple sources
- How the data is represented within Elasticsearch
- Customizations and configurations available to tailor the data format
- Filters and data modifications available
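Tying the agenda together, here is a hedged sketch of indexing a captured row change into Elasticsearch over its REST API; the index name, document shape, and keying scheme are illustrative assumptions, not Tungsten Replicator's actual output format.

```python
import json
import requests

# Hypothetical: a row change captured from a transactional database.
row_event = {
    "op": "INSERT",              # could also be UPDATE or DELETE
    "schema": "shop",
    "table": "orders",
    "id": 1042,
    "data": {"customer": "acme", "total": 99.50},
}

# Key the document by table + primary key so later UPDATEs and DELETEs
# for the same row address the same Elasticsearch document.
doc_id = f"{row_event['table']}-{row_event['id']}"
resp = requests.put(
    f"http://localhost:9200/shop.orders/_doc/{doc_id}",
    headers={"Content-Type": "application/json"},
    data=json.dumps(row_event["data"]),
)
print(resp.json()["result"])   # "created" on first insert, "updated" after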
This document discusses APIs and the API economy. It defines an API as an interface between software systems that allows them to interact. APIs provide programmatic access to systems and processes within organizations. Building APIs improves digital ecosystems by enabling data sharing, reuse, and rapid prototyping. The document advocates for an open innovation approach where organizations use both internal and external knowledge through APIs to accelerate innovation. It presents a vision of APIs managing complex systems and data as products that fuel collaboration across communities.
CPaaS.io Y1 Review Meeting - Holistic Data Management - Stephan Haller
Data management and governance aspects of the CPaaS.io platform as presented at the first year review meeting in Tokyo on October 5, 2017.
Disclaimer:
This document has been produced in the context of the CPaaS.io project which is jointly funded by the European Commission (grant agreement n° 723076) and NICT from Japan (management number 18302). All information provided in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability. For the avoidance of all doubts, the European Commission and NICT have no liability in respect of this document, which is merely representing the view of the project consortium. This document is subject to change without notice.
BlogMyData is a virtual research environment for collaboratively visualizing environmental data. It allows researchers to detect features in models, diagnose problems, preview data before downloading, make sense of large datasets, and communicate complex concepts. Existing scientific visualization software requires expert knowledge and has limited interoperability. BlogMyData addresses this by providing a web-based blogging tool for scientists to discuss, collaborate, and record discussions as part of the research record. It utilizes open authentication and spatial features from existing frameworks to overlay blogged data on other visualization clients and offer customized geospatial feeds of blog entries. The prototype received positive feedback and future features may include supporting more data types.
MongoDB does not work like other databases. Its document-oriented data model, range-based partitioning, and strong consistency are well suited to some problems and less suited to others. In this webinar, we will look at real-world examples of MongoDB deployments that take advantage of these unique features. We will discuss specific customers who use MongoDB and see how they implemented their solutions. We will also show you how to build a similar solution for your own organization.
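A minimal sketch of the document model the webinar discusses, using PyMongo; the connection string, database, and fields are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
orders = client["shop"]["orders"]

# A single document can nest structure that would span several
# relational tables (order header plus line items).
orders.insert_one({
    "customer": "acme",
    "items": [{"sku": "A-17", "qty": 2}, {"sku": "B-03", "qty": 1}],
    "status": "open",
})

# Queries address nested fields directly, no joins required.
for doc in orders.find({"items.sku": "A-17"}):
    print(doc["customer"], doc["status"])
```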
RightsDirect provides data-driven content solutions that help make copyright work for everyone. They offer document delivery, content workflow and analytics, text and data mining, licensing solutions, and copyright education for rightsholders and publishers with over 600 million rights. For content users, RightsDirect offers a Multinational Copyright License that provides a consistent set of rights from thousands of publishers to simplify content usage and sharing across borders. The license complements but does not replace publisher subscriptions. RightsDirect also offers document delivery through RightFind, personal and shared libraries, and content decision support services to help track content usage and spending.
1) SIDN Labs is the research branch of SIDN, which runs the .nl top-level domain, and aims to improve security and stability of .nl and the DNS through various research projects.
2) One focus of the research is analyzing how attackers abuse .nl domains for activities like phishing and malware in order to better detect and mitigate such abuse.
3) The research uses SIDN's data and resources like their ENTRADA big data platform to analyze domain registration patterns and DNS queries to detect suspicious domains.
Introduction to Globus for New Users (GlobusWorld Tour - Columbia University) - Globus
This document provides an introduction to Globus for new users. It discusses how Globus can be used to move, share, describe, discover, and reproduce research data across different storage locations and computing resources. It highlights key Globus features like transferring files securely between endpoints, sharing data with collaborators, and building applications and services using the Globus APIs. The document also covers sustainability topics like Globus usage metrics, subscription plans, and support resources.
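As one hedged example of the sharing capability, the sketch below grants a collaborator read access to a folder on a shared endpoint via the Globus Transfer API; all UUIDs are placeholders.

```python
import globus_sdk

# Assumes a TransferClient authorized as the owner of a shared endpoint
# (guest collection); token and UUIDs below are placeholders.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")
)
SHARED_ENDPOINT = "0dcb4a80-7d3b-11e8-9c8a-0a6d4e044368"
COLLABORATOR_IDENTITY = "c39f2ca2-7d3b-11e8-9c8a-0a6d4e044368"

# Grant a single collaborator read-only access to one folder.
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": COLLABORATOR_IDENTITY,
    "path": "/shared/project-a/",
    "permissions": "r",
}
result = tc.add_endpoint_acl_rule(SHARED_ENDPOINT, rule)
print("Created access rule:", result["access_id"])
```

Because the rule references a Globus identity rather than a local account, the collaborator needs no account on the storage system itself.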
Simplified Research Data Management with the Globus Platform - Globus
Overview of the Globus research data management platform, as presented at the Fall 2018 Membership Meeting of the Coalition for Networked Information (CNI), held in Washington, D.C., December 10-11, 2018
We provide a summary review of Globus features targeted at those new to Globus. We demonstrate how to transfer and share data, and install a Globus Connect Personal endpoint on your laptop.
Scalable Data Management: Automation and the Modern Research Data Portal - Globus
Globus is an established service from the University of Chicago that is widely used for managing research data in national laboratories, campus computing centers, and HPC facilities. While its interactive web browser interface addresses simple file transfer and sharing scenarios, large scale automation typically requires integration of the research data management platform it provides into bespoke applications.
We will describe one such example, the Petrel data portal (https://petreldata.net), used by researchers to manage data in diverse fields including materials science, cosmology, machine learning, and serial crystallography. The portal facilitates automated ingest of data, extraction and addition of metadata for creating search indexes, assignment of persistent identifiers, faceted search for rapid data discovery, and point-and-click downloading of datasets by authorized users. As security and privacy are often critical requirements, the portal employs fine-grained permissions that control both visibility of metadata and access to the datasets themselves. It is based on the Modern Research Data Portal design pattern, jointly developed by the ESnet and Globus teams, and leverages capabilities such as the Science DMZ for enhanced performance and to streamline the user experience.
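A hedged sketch of the metadata-ingest step such a portal automates, using the Globus Search client from the Python SDK; the index UUID, subject, group, and metadata fields are placeholders.

```python
import globus_sdk

# Placeholder token and index UUID; the GMetaEntry structure is the
# standard Globus Search ingest format.
sc = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("SEARCH_TOKEN")
)
INDEX_ID = "8b9f3dd2-0000-0000-0000-000000000000"

entry = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://endpoint-id/path/to/dataset-42",
        # Fine-grained visibility: only this group may see the metadata.
        "visible_to": ["urn:globus:groups:id:my-project-group-uuid"],
        "content": {"material": "perovskite", "technique": "XRD"},
    },
}
sc.ingest(INDEX_ID, entry)
```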
Managing Protected and Controlled Data with Globus - Globus
This document discusses how Globus can be used to manage protected and controlled data with high assurance. It describes features for restricting data handling according to standards like NIST 800-53 and 800-171. Compliance focuses on access control, configuration management, maintenance, and accountability. Restricted data passed to Globus does not include file contents. The initial release includes a new web app, Globus Connect Server v5.2, and Connect Personal. High assurance capabilities include additional authentication, application instance isolation, encryption, and detailed auditing. Subscription levels like High Assurance and BAA provide these features.
Globus is a non-profit service that aims to increase research efficiency by unifying access to disparate storage systems and simplifying secure data sharing. It allows users to easily, securely, and reliably transfer data between different resources like HPC systems, cloud storage, instruments, and personal computers. Globus also provides APIs and SDKs to help researchers build data-centric applications and automate workflows. Funding comes partly from government grants, with subscriptions enabling additional features and supporting ongoing operations.
Globus: A Data Management Platform for Collaborative Research (CHPC 2019 - So... - Globus
Globus is a non-profit data management platform developed by the University of Chicago to increase the efficiency of data-driven research. It allows researchers to easily transfer large datasets, securely share data with collaborators across different storage systems, and automate the movement of data from instruments and compute facilities. Globus sees growing adoption with over 120 subscribers and has transferred over 768 petabytes of data.
Introduction to the Globus SaaS (GlobusWorld Tour - STFC) - Globus
This document summarizes a presentation about the Globus data management platform. It includes an agenda covering an introduction to the Globus Software as a Service and Platform as a Service, automating research data workflows, facilitating collaboration, and building services. There are demonstrations of file transfers, data sharing, publication, and high assurance endpoints. The sustainability model is discussed, with standard and high assurance subscriptions, branded websites, premium storage connectors, and identity providers. Support resources like documentation, email lists, and professional services are also mentioned.
Facilitating Collaboration with Globus (GlobusWorld Tour - STFC) - Globus
This document discusses how Globus services can facilitate collaboration and data sharing through automated workflows. It describes how Globus Auth enables authentication for shared access to endpoints. APIs and command line tools allow applications to programmatically manage permissions and transfer data. JupyterHub can be configured with Globus Auth to provide tokens for accessing remote Globus services within notebooks. This enables collaborative and distributed data analysis. The document also outlines how Globus services can support automated publication of datasets through search, identifiers, and metadata.
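A sketch of the notebook side of this pattern; how tokens actually reach the notebook depends on the hub's configuration, so the GLOBUS_TOKENS environment variable and its JSON layout below are hypothetical.

```python
import json
import os

import globus_sdk

# Hypothetical: we assume the hub's Globus authenticator has serialized
# the user's tokens into an environment variable as JSON keyed by
# resource server. Real deployments vary.
tokens = json.loads(os.environ["GLOBUS_TOKENS"])
transfer_token = tokens["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# List a collaborator-shared folder directly from the notebook.
for entry in tc.operation_ls("ddb59aef-6d04-11e5-ba46-22000b92c6ec",
                             path="/shared/"):
    print(entry["name"])
```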
Introduction to Globus (GlobusWorld Tour West) - Globus
This document introduces Globus, which provides fast and reliable data transfer, sharing, and platform services across different storage systems and resources. It does this through software-as-a-service that uses existing user identities, with the goal of unifying access to data across tiers like HPC, storage, cloud, and personal resources. Key features include secure data transfer, data sharing without copying files, access control, and tools for building automations and integrating with science gateways. It also discusses options for handling protected data, such as health information, with additional security controls and business agreements.
Globus is a non-profit data management service that allows users to transfer, share, and access data across different storage systems and platforms through software-as-a-service. It has transferred over 1.34 exabytes of data and aims to unify access to research data across different tiers of storage through connectors, APIs, and user interfaces. Globus ensures secure data transfers and sharing by using user identities, access controls, encryption, and audit logging without storing user credentials or data.
Introduces the Globus software-as-a-service for file transfer and data sharing. Includes step-by-step instructions for creating a Globus account, transferring a file, and setting up a Globus endpoint on your laptop.
Introduction to Globus (GlobusWorld Tour - UMich) - Globus
This document provides an agenda for a Globus World Tour event taking place on Monday and Tuesday. On Monday, there will be sessions on introductions to Globus for new and administrative users. On Tuesday, sessions will focus on developing with Globus, including building research data portals, automating workflows, and working with instrument data. The document also provides background information on Globus and how it aims to make research data movement, sharing, and synchronization easy, reliable and secure for researchers.
GlobusWorld 2021 Tutorial: Introduction to Globus - Globus
An introduction to the core features of the Globus data management service. This tutorial was presented at the GlobusWorld 2021 conference in Chicago, IL by Greg Nawrocki.
This document provides an overview and introduction to Windows Azure SQL Database. It discusses key topics such as:
- SQL Database service tiers including Basic, Standard, and Premium, which are differentiated by performance levels measured in Database Transaction Units (DTUs) and other features.
- Database size limits and performance metrics for each tier.
- Database replication and high availability capabilities to ensure reliability.
- Support for common SQL Server features while noting some limitations compared to on-premises SQL Server.
- Considerations for database naming, users/logins, migrations, and automation in the SQL Database platform.
- Indexing requirements and compatibility differences to be aware of.
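For orientation, connecting to SQL Database from Python looks much like connecting to on-premises SQL Server, apart from the encrypted Azure endpoint; this sketch uses pyodbc with placeholder server, database, and credentials.

```python
import pyodbc

# Placeholder server, database, and credentials; encryption is required
# for Azure SQL Database connections.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=myuser;Pwd=mypassword;"
    "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
)
cursor = conn.cursor()

# Ordinary T-SQL works; the compatibility gaps versus on-premises
# SQL Server show up mostly in server-level features, not in
# day-to-day queries like this one.
cursor.execute("SELECT DB_NAME(), SYSDATETIME()")
print(cursor.fetchone())
conn.close()
```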
Similar to NIH Data Commons Architecture Ideas
Global Services for Global Science March 2023.pptx - Ian Foster
We are on the verge of a global communications revolution based on ubiquitous high-speed 5G, 6G, and free-space optics technologies. The resulting global communications fabric can enable new ultra-collaborative research modalities that pool sensors, data, and computation with unprecedented flexibility and focus. But realizing these modalities requires new services to overcome the tremendous friction currently associated with any actions that traverse institutional boundaries. The solution, I argue, is new global science services to mediate between user intent and infrastructure realities. I describe our experiences building and operating such services and the principles that we have identified as needed for successful deployment and operations.
The Earth System Grid Federation: Origins, Current State, Evolution - Ian Foster
The Earth System Grid Federation (ESGF) is a distributed network of climate data servers that archives and shares model output data used by scientists worldwide. ESGF has led data archiving for the Coupled Model Intercomparison Project (CMIP) since its inception. The ESGF Holdings have grown significantly from CMIP5 to CMIP6 and are expected to continue growing rapidly. A new ESGF2 project funded by the US Department of Energy aims to modernize ESGF to handle exabyte scale data volumes through a new architecture based on centralized Globus services, improved data discovery tools, and data proximate computing capabilities.
Better Information Faster: Programming the Continuum - Ian Foster
This document discusses the computing continuum and efforts to enable better information faster through computation. It provides examples of how techniques like executing tasks closer to data sources or on specialized hardware can significantly accelerate applications. Programming models and managed services are explored for specifying and executing workloads across diverse infrastructure. There are still open questions around optimizing networks, algorithms, and applications for the computing continuum.
ESnet6 provides an ultra-fast and reliable network that enables new smart instruments for 21st century science. The network capacity has increased dramatically over time, with 2022 bandwidth being 500,000 times greater than 1993. This network allows rapid data transfer between facilities, such as replicating 7 petabytes of climate data between three labs. It also enables fast assembly and use of new instruments like high energy diffraction microscopy that can perform an analysis in 31 seconds. The integrated research infrastructure provided by Globus further supports use of remote resources and smart instruments that will drive scientific discovery.
Linking Scientific Instruments and Computation - Ian Foster
[Talk presented at Monterey Data Conference, August 31, 2022]
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.
A Global Research Data Platform: How Globus Services Enable Scientific Discovery - Ian Foster
Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs reduce greatly the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and 100s of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
Daniel Lopresti, Bill Gropp, Mark D. Hill, Katie Schuman, and I put together a white paper on "Building a National Discovery Cloud" for the Computing Community Consortium (http://cra.org/ccc). I presented these slides at a Computing Research Association "Best Practices on using the Cloud for Computing Research Workshop" (https://cra.org/industry/events/cloudworkshop/).
Abstract from White Paper:
The nature of computation and its role in our lives have been transformed in the past two decades by three remarkable developments: the emergence of public cloud utilities as a new computing platform; the ability to extract information from enormous quantities of data via machine learning; and the emergence of computational simulation as a research method on par with experimental science. Each development has major implications for how societies function and compete; together, they represent a change in technological foundations of society as profound as the telegraph or electrification. Societies that embrace these changes will lead in the 21st Century; those that do not, will decline in prosperity and influence. Nowhere is this stark choice more evident than in research and education, the two sectors that produce the innovations that power the future and prepare a workforce able to exploit those innovations, respectively. In this article, we introduce these developments and suggest steps that the US government might take to prepare the research and education system for its implications.
Big Data, Big Computing, AI, and Environmental Science - Ian Foster
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Data Tribology: Overcoming Data Friction with Cloud Automation - Ian Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Research Automation for Data-Driven Discovery - Ian Foster
This document discusses research automation and data-driven discovery. It notes that data volumes are growing much faster than computational power, creating a productivity crisis in research. However, most labs have limited resources to handle these large data volumes. The document proposes applying lessons from industry to create cloud-based science services with standardized APIs that can automate and outsource common tasks like data transfer, sharing, publishing, and searching. This would help scientists focus on their core research instead of computational infrastructure. Examples of existing services from Argonne National Lab and the University of Chicago Globus project are provided. The goal is to establish robust, scalable, and persistent cloud platforms to help address the challenges of data-driven scientific discovery.
Scaling collaborative data science with Globus and Jupyter - Ian Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
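A hedged sketch of what invoking a published model might look like with the DLHub SDK; the servable name and input shape are placeholders for whatever a real published model expects.

```python
from dlhub_sdk.client import DLHubClient

# Hedged sketch: the servable name and its expected input format are
# hypothetical placeholders, not an actual published model.
dl = DLHubClient()  # authenticates via Globus Auth on first use

model_name = "someuser/formation_energy_predictor"  # hypothetical servable
predictions = dl.run(model_name, inputs=[{"composition": "Fe2O3"}])
print(predictions)
```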
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high- performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Team Argon proposes a commons platform using reusable components to promote continuous FAIRness of data. These components include Globus Connect Server for standardized data access and transfer across storage systems, Globus Auth for authentication and authorization, and BDBags for exchange of query results and cohorts using a common manifest format. Together these aim to provide uniform, secure, and reliable access, transfer, and sharing of data while supporting identification, search, and virtualization of derived data products.
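A minimal sketch of the BDBag piece, using the bdbag Python API to package a directory of query results and later validate it; the path and checksum algorithms are illustrative.

```python
from bdbag import bdbag_api

# Turn a directory of query results into a BDBag in place, recording
# checksums in the bag manifest; the path is a placeholder.
bag = bdbag_api.make_bag("/tmp/cohort-results", algs=["md5", "sha256"])

# Later, any recipient can verify completeness and integrity of the
# exchanged bag against its manifest.
bdbag_api.validate_bag("/tmp/cohort-results", fast=False)
```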
Going Smart and Deep on Materials at ALCF - Ian Foster
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ... - Ian Foster
This document discusses computing challenges posed by rapidly increasing data scales in scientific applications and high performance computing. It introduces the concept of online data analysis and reduction as an alternative to traditional offline analysis to help address these challenges. The key messages are that dramatic changes in HPC system geography due to different growth rates of technologies are driving new application structures and computational logistics problems, presenting exciting new computer science opportunities in online data analysis and reduction.
Software Infrastructure for a National Research Platform - Ian Foster
A presentation at the First National Research Platform workshop. "The purpose of this workshop is to bring together representatives from interested institutions to discuss implementation strategies for deployment of interoperable Science DMZs at a national scale." I present eight desirable properties for a software infrastructure for such a platforms, and describe our experience realizing these properties in the Globus system.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we walk through the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (the CI/CD process) involves many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chains and the lack of end-to-end governance and risk management.
Software teams must secure their delivery processes to avoid vulnerabilities and security breaches, and must do so with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms, and is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions), and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
My and Rik Marselis's slides at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curricula, so many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silos continue to crumble, many organizations still relegate monitoring & observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share foundational concepts to build on.
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help counter climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
NIH Data Commons Architecture Ideas
1. Team Argon
“A Commons Platform for Promoting Continuous FAIRness”
NIH Data Commons Pilot
Globus, University of Chicago
University of Southern California
Contact: Ian Foster, foster@uchicago.edu
PIs: Kyle Chard, Ian Foster, Carl Kesselman, Ravi Madduri
2. Three big picture themes
• Continuous FAIRness: Make all data findable, accessible, interoperable, reusable at every stage, via pervasive use of simple identifier and exchange format conventions
• Build on proven security, data, and computation building blocks that have large user communities inside and outside biomedicine (see subsequent slides for details)
• Solutions leverage industry best practices and professional services team to meet scalability, interoperability, sustainability, and reliability needs
3. Globus Auth: A foundational service for an authentication and authorization ecosystem
[Diagram: an App (Client) interacting with multiple Services (Resource Servers), the Resource Owner, and a resource server operator]
• A flexible security infrastructure that can be used across the Commons
• Enables federation across services using arbitrary linked identities (e.g., @gmail @xsede @uchicago)
• Facilitates secure/authorized communication between users, services, clients
• Supports arbitrary clients including REST, web, command line, software
• Flexible token management
• Secure sharing between services
• Fine-grain user consents and revocation
https://docs.globus.org/api/auth/
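To make the Auth flow concrete, here is a minimal sketch of a native-app login using the globus_sdk Python package. The client ID is a placeholder for a hypothetical app registration, and the requested scope is illustrative; real deployments would request the scopes of whichever Commons services they call.

```python
import globus_sdk

# Placeholder client ID for a hypothetical native app registered with Globus Auth
CLIENT_ID = "00000000-0000-0000-0000-000000000000"

client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow(
    requested_scopes="urn:globus:auth:scope:transfer.api.globus.org:all"
)

# The user logs in with any linked identity and pastes back an authorization code
print("Please log in at:", client.oauth2_get_authorize_url())
auth_code = input("Authorization code: ").strip()

# Exchange the code for tokens; Auth issues one token set per resource server
tokens = client.oauth2_exchange_code_for_tokens(auth_code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
```

Because tokens are scoped per resource server, a client holds only the consents the user granted for each service, and those consents can be individually revoked.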
4. Standards-based, reliable, performant data management
• Globus Connect Server: S3-compatible HTTP/OAuth interface for secure client-server transfer
• Endpoints have DNS names
• Globus Transfer: Managed, high-performance, secure, reliable bulk asynchronous transfer
• In-place data sharing with flexible and secure ACLs
• Standards compliant: S3, OAuth, OIDC, HTTP, GridFTP
https://docs.globus.org/api/transfer/
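A minimal sketch of a managed bulk transfer via the globus_sdk Python package, continuing from the token obtained above. The endpoint UUIDs and paths are placeholders, not real Commons endpoints.

```python
import globus_sdk

# transfer_token obtained via the Globus Auth flow in the previous sketch;
# endpoint UUIDs below are placeholders
SRC = "11111111-1111-1111-1111-111111111111"  # hypothetical source endpoint
DST = "22222222-2222-2222-2222-222222222222"  # hypothetical destination endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Managed, asynchronous bulk transfer with checksum-based verification
tdata = globus_sdk.TransferData(
    tc, SRC, DST, label="Commons pilot demo", sync_level="checksum"
)
tdata.add_item("/data/sample1.bam", "/ingest/sample1.bam")

task = tc.submit_transfer(tdata)
print("Submitted transfer, task id:", task["task_id"])
```

The service manages retries, integrity checking, and notification, so the submitting client can disconnect once the task is accepted.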
5. Interoperability: naming and exchange
Minid
• Lightweight identifiers for any product at any stage
• Easily created, dereferenced, validated
• Global integrity – validate content across the commons
BDBag
• Self-describing and flexible format for exchange
• Extended BagIt specification
• Standard manifest representation that supports different protocols
[Diagram: a BDBag containing data and metadata; its manifest lists files with checksums (File1 2AG230.., File2 A31FDC.. via FTP, File3 D0F142.. via HTTP, …), each referenced by a Minid (e.g., Minid 001, 007, 719)]
http://minid.bd2k.org http://bd2k.ini.usc.edu/tools/bdbag/
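A minimal sketch of creating and validating a BDBag, assuming the bdbag Python package and a local data directory named ./my_dataset; exact function options may differ across package versions.

```python
from bdbag import bdbag_api

# Turn a data directory into a bag in place, with checksum manifests
bdbag_api.make_bag("./my_dataset", algs=["md5", "sha256"])

# Validate bag structure and payload checksums before exchange
bdbag_api.validate_bag("./my_dataset", fast=False)

# Serialize the bag for transport (zip/tgz archives are supported)
archive_path = bdbag_api.archive_bag("./my_dataset", "zip")
print("Bag archived at:", archive_path)
```

The checksums recorded in the manifest are what a Minid carries, so any consumer can verify that dereferenced content matches what was named.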
6. Infrastructure
Workspaces: Scalable compute for distributed data
[Diagram: “My Workspace” bringing together Data and Tools]
• Workspaces bring together data and tools
• Infrastructure designed for scalability and portability
• Leverages federated identities & access control, secure access to distributed data, and data interoperability and exchange
• Provenance: tracking activity around data (by whom? with what?)
• Publication & sharing of tools and workflows
• Cost-aware resource allocation for both compute and data movement
7. Search, navigation, and virtual cohorts
• DERIVA: Digital asset management for heterogeneous data
• Organize, navigate, discover interrelated objects (e.g., assays from a sample over time)
• REST interface
• Entity/Relation model for organizing data
• Supports various DCPPC metadata models
• Fine-grain access control to support diverse collaboration models
• Model evolution to enable continuous publication and diverse, heterogeneous use cases
• Model-driven user interface that self-configures to the current data model
• Integration with Globus Auth, Minids, BDBags, and other components
• Complements Globus Search: Access-controlled search of derived data products
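To illustrate the REST interface, here is a sketch of an ERMrest-style entity query using plain HTTP. The catalog URL, schema:table path, filter, and access token are all hypothetical; real DERIVA deployments define their own models and endpoints.

```python
import requests

access_token = "..."  # bearer token obtained from Globus Auth

# Hypothetical ERMrest catalog and isa:sample table; real models differ
CATALOG = "https://deriva.example.org/ermrest/catalog/1"
resp = requests.get(
    f"{CATALOG}/entity/isa:sample/species=Homo%20sapiens",
    headers={
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
    },
)
resp.raise_for_status()

for row in resp.json():  # one JSON object per matching entity
    print(row)
```

Because queries are expressed against the entity/relation model rather than hand-built endpoints, the same pattern works as the catalog's model evolves.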
8. Workspace Manager: integrating scenario
[Diagram: the Workspace Manager connects Bags (minid_1, minid_2, minid_3), Workspaces (Galaxy, Jupyter, RStudio), and Pipelines (GTExRNA, GATKVar) with data sources (UCSC, GTEx, TOPMed, MOD) and user catalogs, supporting Search, Discover, Analyze, Visualize, and Publish & Reproduce]
• Uniform, secure, reliable access to storage
• Uniform search across multiple data sources
• Virtual cohorts in standard manifest with lightweight ID
• All results tracked via standard manifest and lightweight IDs
• Workspaces support Jupyter and Galaxy on different clouds
• Publication assigns DOIs and indexes datasets
9. Summary: Reusable components include...
• Globus Connect Server for data access, transfer, and sharing
• HTTP/S3 access to many storage systems (Posix, object store, etc.)
• GridFTP for managed, reliable, secure, efficient transfers
• Integration with Globus Auth for authentication and authorization
• Offers: A universal storage API
• Globus Auth for securing all REST API interactions
• OAuth2 and OIDC + fine-grained consents and revocation
• Offers: A universal authentication and authorization API
• BDBag (“big data bags”: profiles on BagIt) tools
• BagIt specification with profiles for “holey bags”, etc.
• Offers: Common manifest for exchange of query results, virtual cohorts
• Identifier service for creating lightweight identifiers
• ARKs, created on demand, associated checksum, simple metadata
• Offers: Common mechanism for naming and tracking derived data products
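As a final sketch, here is how a client might mint a lightweight identifier for a derived data product: compute the content checksum, then register it with an identifier service. The service URL and request payload shape are hypothetical; the real Minid/identifier APIs differ in detail.

```python
import hashlib
import requests

access_token = "..."  # bearer token obtained from Globus Auth

def sha256_of(path: str) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

checksum = sha256_of("cohort.bdbag.zip")

# Hypothetical identifier-service endpoint and payload shape
resp = requests.post(
    "https://identifiers.example.org/minid",
    json={
        "checksums": [{"function": "sha256", "value": checksum}],
        "locations": ["https://data.example.org/cohort.bdbag.zip"],
        "metadata": {"title": "Virtual cohort results"},
    },
    headers={"Authorization": f"Bearer {access_token}"},
)
print("Minted identifier:", resp.json())
```

Binding the checksum to the identifier at creation time is what lets any downstream consumer validate content integrity across the Commons.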