Todd Pihl, Ph.D., Technical Project Manager, and Mark Jensen, Ph.D., Director of Data Management and Interoperability, National Institutes of Health, Frederick National Laboratory for Cancer Research
Data repositories such as NCI’s Cancer Research Data Commons receive data that use a variety of data models and vocabularies. This presents a significant obstacle to finding and using the data outside of their original purpose. In this talk we’ll show how using Neo4j allows different data models to be represented and mapped to each other, giving data managers a new way to provide harmonized data to their users.
Government GraphSummit: And Then There Were 15 Standards
1. SPONSORED BY THE NATIONAL CANCER INSTITUTE
And then there were 15 standards
Using Neo4j to harmonize data in cancer research
Todd Pihl, Ph.D.
Mark Jensen, Ph.D.
4. Graph management by subject matter experts
Nodes
Edges
Property Defs
Props referenced here … and defined here
Entity names are the keys
Nodes at the ends, with direction
Other attributes specified
Constrain the data values to defined types
Model Description Files
https://github.com/CBIIT/bento-mdf
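The cross-check this slide describes (properties referenced in one place and defined in another, edges ending on declared nodes) can be sketched in Python. The dictionary below imitates the layout of a model description file held in memory. The key names (`Nodes`, `Relationships`, `PropDefinitions`, `Props`, `Ends`, `Src`, `Dst`) are modeled on the bento-mdf examples but should be treated as assumptions, and the `case`/`sample` model and the `check_model` helper are invented for illustration.

```python
# Hypothetical MDF-like model held as a plain dict (normally a YAML file).
MODEL = {
    "Nodes": {
        "case": {"Props": ["case_id", "diagnosis"]},
        "sample": {"Props": ["sample_id", "tissue_type"]},
    },
    "Relationships": {
        "of_case": {
            "Mul": "many_to_one",
            "Ends": [{"Src": "sample", "Dst": "case"}],
            "Props": [],
        },
    },
    "PropDefinitions": {
        "case_id": {"Type": "string"},
        "diagnosis": {"Type": "string"},
        "sample_id": {"Type": "string"},
        # "tissue_type" is deliberately left undefined to trigger a finding.
    },
}


def check_model(model):
    """Return a list of human-readable problems found in the model."""
    problems = []
    nodes = model.get("Nodes", {})
    rels = model.get("Relationships", {})
    defined = set(model.get("PropDefinitions", {}))

    # Every property referenced by a node or relationship must be defined.
    for kind, entities in (("node", nodes), ("relationship", rels)):
        for name, spec in entities.items():
            for prop in (spec or {}).get("Props") or []:
                if prop not in defined:
                    problems.append(f"{kind} '{name}': property '{prop}' is not defined")

    # Every relationship end must point at a declared node.
    for name, spec in rels.items():
        for end in (spec or {}).get("Ends") or []:
            for role in ("Src", "Dst"):
                if end.get(role) not in nodes:
                    problems.append(
                        f"relationship '{name}': {role} '{end.get(role)}' is not a node"
                    )
    return problems


problems = check_model(MODEL)
for line in problems:
    print(line)
```

A check of this kind is cheap enough to run on every commit to the model repository, which suits a workflow where subject matter experts edit the files directly.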
6. Installing a Bento Data Sharing Platform on a Cloud Platform
LOCAL MACHINE
GITHUB
CLOUD PLATFORM
Clone files from GitHub
Frontend
Backend
Neo4j
-Add test metadata to DB
-Edit UI config files
-View updates in real time
-Save updated files in bento-frontend
-Push to GitHub
bento-frontend
bento-backend
bento-data-model
bento-frontend
bento-backend
bento-data-model
Pull updated files from GitHub
Load data from a secure S3 bucket
Frontend
Backend
Neo4j
Data Sharing Platform
AWS Environment
7. Cancer Research Data Commons (CRDC)
Cancer Data Aggregator
Aggregate by patient, sample, study, disease, tissue, etc.
Clinical Proteomics Imaging
Genomics Immuno-oncology Animal Models Cancer Biomarkers
Cancer Research Data Commons
Data Standards Services
8. Cancer Data Aggregator (CDA)
• CDA Mission: Provide a single location to query across all CRDC data repositories
• API, Python library
• Currently contains data from Genomics, Proteomics and Imaging Data Commons
• Remaining CRDC data repositories in progress
• Released for CRDC production use on June 28th
• Documentation: https://cda.readthedocs.io/en/latest/
• The Examples page has many Python use cases
• CDA GitHub: https://github.com/CancerDataAggregator
• Swagger: https://cda.datacommons.cancer.gov/api/swagger-ui.html
• For the first time, CDA allows us to easily look across CRDC at how data are presented to users.
14. CRDC is a federation of going concerns
• Each CRDC node has its own data systems, business processes, stakeholders, and users
• Each has its own purpose-built data model that enables data ingestion, query, and distribution.
• Each has large, ongoing inflows and outflows of data today.
• So a top-down, prescriptive approach to standardization is not feasible. (Believe us; we know.)
• Standardization emphasizing carrots instead of sticks:
• Access to the CDA is a benefit for any node wanting to extend the reach of its data.
• Approach data standardization as a practical mapping goal: “If you can place your model in the context of the CDA’s data maps, the CDA can query and serve your data”
• Approach standardization as an iterative process: “Start with a high priority set of metadata, and expand mapping over time.”
15. Graphs as a common language for expressing data models
Representing custom data models as graphs can provide:
• a unified context for managing data and semantics, and
• a framework for integrating data with minimal impact on repository operations.
Creating graph versions of many kinds of data models is possible, since many popular modeling approaches find natural expression in the Property Graph:
Property Graph | Relational Data | OWL/RDF
Node | Table rows | Class
Property | Table columns/cells | Datatype Property
Relationship | Foreign keys/Linking tables | Object Property
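As a rough illustration of the relational column of that correspondence, the Python sketch below turns table rows into nodes, columns into properties, and foreign keys into relationships. The `patient` and `sample` tables, the `FOREIGN_KEYS` map, the assumption that every table has an `id` primary key, and the `to_property_graph` helper are all hypothetical and not taken from any CRDC system.

```python
# Hypothetical relational data: each table is a list of row dicts.
TABLES = {
    "patient": [
        {"id": 1, "name": "P-001"},
        {"id": 2, "name": "P-002"},
    ],
    "sample": [
        {"id": 10, "tissue": "lung", "patient_id": 1},
        {"id": 11, "tissue": "blood", "patient_id": 2},
    ],
}

# Foreign keys as (table, column) -> referenced table.
# Assumption: every table uses "id" as its primary key.
FOREIGN_KEYS = {("sample", "patient_id"): "patient"}


def to_property_graph(tables, foreign_keys):
    """Rows become nodes, columns become properties, foreign keys become relationships."""
    nodes, relationships = [], []
    for table, rows in tables.items():
        for row in rows:
            # Columns that are not foreign keys are kept as node properties.
            props = {c: v for c, v in row.items() if (table, c) not in foreign_keys}
            nodes.append({"label": table, "key": (table, row["id"]), "props": props})
            # Each populated foreign key column becomes a directed relationship.
            for (fk_table, column), target in foreign_keys.items():
                if fk_table == table and row.get(column) is not None:
                    relationships.append({
                        "type": f"{table}_to_{target}",
                        "src": (table, row["id"]),
                        "dst": (target, row[column]),
                    })
    return nodes, relationships


nodes, relationships = to_property_graph(TABLES, FOREIGN_KEYS)
print(len(nodes), "nodes,", len(relationships), "relationships")
```

Linking tables (many-to-many joins) are not handled here; under the same mapping, each row of a linking table would become a relationship rather than a node.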
16. Model Description Format (MDF) - simple, iterative model recording and schematizing
MDF is a compact, human-readable—and computable—format for defining a
property graph:
• Define Nodes
• Node Properties
• Define Relationships
• Relationship Properties
• Relationship Attributes
• Define and Describe Properties
• Property Attributes, including
• Allowable value types or sets
https://github.com/CBIIT/bento-mdf
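To show what "allowable value types or sets" can buy in practice, here is a minimal Python sketch that checks a data record against property definitions. The `Type` and `Enum` keys imitate MDF property attributes but are assumptions here, and the property names, the `PYTHON_TYPES` map, and the `validate_record` helper are hypothetical.

```python
# Hypothetical property definitions in the spirit of MDF property attributes.
PROP_DEFINITIONS = {
    "case_id": {"Type": "string"},
    "age_at_enrollment": {"Type": "integer"},
    "sex": {"Enum": ["female", "male", "unknown"]},
}

# Map of type names to Python types; only a few simple types are covered.
PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_record(record, prop_definitions):
    """Return a list of problems for one record (a dict of property -> value)."""
    problems = []
    for prop, value in record.items():
        definition = prop_definitions.get(prop)
        if definition is None:
            problems.append(f"{prop}: not defined in the model")
        elif "Enum" in definition:
            # Value sets: the value must be one of the listed terms.
            if value not in definition["Enum"]:
                problems.append(f"{prop}: {value!r} is not an allowed value")
        elif not isinstance(value, PYTHON_TYPES[definition["Type"]]):
            # Value types: the value must match the declared type.
            problems.append(f"{prop}: {value!r} is not of type {definition['Type']}")
    return problems


good = {"case_id": "C-1", "age_at_enrollment": 7, "sex": "female"}
bad = {"case_id": "C-2", "age_at_enrollment": "seven", "sex": "F"}
print(validate_record(good, PROP_DEFINITIONS))
print(validate_record(bad, PROP_DEFINITIONS))
```

Because the definitions are plain data, the same structure can drive both a loader-side check like this one and the choices offered in a user interface.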
17. MDF is simple and standardized
In the Bento framework:
• Data SMEs directly update MDF (in GitHub) to make model updates
• Backend data loader and frontend user interfaces are configured directly by MDF
Philip Musk 12:06
And let me tell you, with data needs driving many of ICDC's requirements as they are, and have been thus far, being able to both write the requirements, and make the required model changes ahead of engineers doing their thing, is really powerful. I don't have to explain what model changes we need to make to someone else - I can get the model changes done myself, and explain what we need the engineers and the UI to do with those changes.
SMEs
Engineering
18. Metamodel Database – the models as data
• Practical principles towards a practical goal led us to practical tools, enabling
• Rapid prototypes and production tier commons
• Integrated Canine Data Commons
• Clinical Trial Data Commons
• Rapid prototypes for data modeling and model visualization
• Cancer Data Service
• Childhood Cancer Data Initiative
• New practical problem: management of multiple dynamic data models over independent projects
• Creating new models: component reuse?
• Managing acceptable value sets for many Properties in models
• Understanding interrelationships between models for mapping and interoperability
19. Both data and model as property graphs
Data
Model ("Schema")
Label: Person
Label: Person
Label: Group
20. Metamodel Schema
Defines:
• Models
• Nodes, Relationships, Properties
• Origins, Terms, and Value Sets
• Concepts and Predicates
Schema is represented in MDF
https://github.com/CBIIT/bento-meta/blob/master/metamodel.yaml
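The idea of storing a model as data can be sketched in Python: the function below turns a small model description into entities and links that could themselves be loaded into a graph database. The labels (`node`, `property`, `relationship`) and link names (`has_property`, `has_src`, `has_dst`) are chosen to resemble the metamodel's vocabulary but are assumptions here, and the Person/Group model and the `model_to_graph` helper are hypothetical.

```python
# Hypothetical MDF-like model: a Person can be a member of a Group.
MODEL = {
    "Nodes": {
        "Person": {"Props": ["name"]},
        "Group": {"Props": ["name"]},
    },
    "Relationships": {
        "member_of": {"Ends": [{"Src": "Person", "Dst": "Group"}]},
    },
}


def model_to_graph(handle, model):
    """Turn a model description into (entities, links) that describe the model itself."""
    entities, links = [], []

    def add(label, name):
        # Create the entity once; later calls return the existing id.
        entity_id = f"{label}:{name}"
        if all(e["id"] != entity_id for e in entities):
            entities.append({"id": entity_id, "label": label, "handle": name, "model": handle})
        return entity_id

    for node, spec in model["Nodes"].items():
        node_id = add("node", node)
        for prop in spec.get("Props", []):
            # Properties are qualified by their node so "name" stays distinct per node.
            links.append((node_id, "has_property", add("property", f"{node}.{prop}")))

    for rel, spec in model["Relationships"].items():
        for end in spec["Ends"]:
            # One relationship entity per (type, source, destination) triple.
            rel_id = add("relationship", f"{rel}:{end['Src']}:{end['Dst']}")
            links.append((rel_id, "has_src", add("node", end["Src"])))
            links.append((rel_id, "has_dst", add("node", end["Dst"])))
    return entities, links


entities, links = model_to_graph("demo", MODEL)
for link in links:
    print(link)
```

Once several models are stored this way, questions such as "which models define a property on Person?" become ordinary graph queries over the `model` attribute.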
22. MDB as a model repository and reference
• In the simple context of Properties, Nodes, and Relationships, we have a functional repository for multiple graph models
• Python packages move MDF into an MDB, create MDF from models in an MDB
• Docker containers easily run a local MDB, or can provide an instantiated, loaded MDB
• Based directly on Neo4j Community server images
• Simple Terminology Server (STS) with MDB as backend
• Enables both GUI and API access to the models
• Model browsing and fulltext search across all entities
• STS is also intended to be easy to distribute and set up
23. MDB as a cross-model tool
The MDB schema also defines entities for relating models to one another and to external authorities:
• Concepts & Predicates (“semantics”)
• Origins, Terms, & Value Sets (“terminology”)
Patterns for connecting these to model entities create separable “layers” that can be added or modified without disrupting the repository function.
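One way to picture the semantic layer is as a set of annotations that tie properties from different models to a shared concept; a query in one model's vocabulary can then be translated into another's. The Python sketch below assumes such annotations exist. The model names, property names, concept identifiers, and both helper functions are invented for illustration and do not reflect actual MDB content.

```python
from collections import defaultdict

# Hypothetical (model, property) -> concept annotations; all names are invented.
REPRESENTS = {
    ("model_a", "sex"): "concept:biological_sex",
    ("model_b", "gender"): "concept:biological_sex",
    ("model_a", "primary_site"): "concept:anatomic_site",
    ("model_b", "tissue_origin"): "concept:anatomic_site",
    ("model_b", "cohort"): "concept:study_group",
}


def synonyms_by_concept(represents):
    """Group model properties by the concept they are mapped to."""
    grouped = defaultdict(list)
    for entity, concept in represents.items():
        grouped[concept].append(entity)
    return dict(grouped)


def translate(prop, source, target, represents):
    """Find properties in `target` that share a concept with `prop` in `source`."""
    concept = represents.get((source, prop))
    if concept is None:
        return []
    return [
        name
        for (model, name), mapped in represents.items()
        if model == target and mapped == concept
    ]


print(translate("sex", "model_a", "model_b", REPRESENTS))
print(translate("cohort", "model_b", "model_a", REPRESENTS))
```

Because the annotations sit beside the models rather than inside them, adding or correcting a mapping changes only this layer and leaves each repository's own model untouched.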
25. MDB Philosophy: keys to its utility
• Dynamic
• Like data and data models
• Pragmatic
• Not a repository of ultimate truth
• Tool to help us provide value to NCI today
• Friendly
• Communicates to humans and computers
• Simple, but well-defined
• Not necessarily exhaustive or “complete”
• Distributable
• Not necessarily “central”
• A platform for “mutual understanding” of data
https://cbiit.github.io/bento-meta/mdb-principles.html
26. Acknowledgements
• Mark Benson, PhD
• Phil Musk, PhD
• Ming Ying, MS
• Anjan Purkayastha, PhD
• Ye Wu, PhD
• Pat Dunn, PhD
• Nelson Moore, MS
• John Otridge, PhD