This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Jonathan Silverstein and Mike Davis, both from University of Pittsburgh.
Health Sciences Research Informatics, Powered by Globus
1. Enabling Cancer Research with End-to-end
Infrastructure for the Request, Exploration, and
Receipt of Cancer Registry Data
https://www.dbmi.pitt.edu/ https://rio.pitt.edu
Jonathan C. Silverstein, MD, MS, FACS, FACMI
Chief Research Informatics Officer
Health Sciences and Institute for Precision Medicine
Professor
Department of Biomedical Informatics (DBMI), University of Pittsburgh
Michael Davis, MS
Software Integration Architect
Department of Biomedical Informatics (DBMI), University of Pittsburgh
2. Research Informatics Office (RIO.pitt.edu)
• 25 staff in DBMI with mission “to support investigators
through innovative collection and use of biomedical data”
• Science-as-a-Service (scientists serving scientists)
• Clinical Data Science Projects
• Basic Data Science Projects
• Translational Data Science Projects
6. CR3 Collaborators
• University of Pittsburgh
– Biomedical Informatics (Jonathan Silverstein)
– Institute for Precision Medicine (Adrian Lee)
• UPMC
– Hillman Cancer Center (Robert Ferris)
– Network Cancer Registry (Sharon Winters)
• University of Chicago
– Globus (Ian Foster)
NCI U24 CA209996, PI: Ian T. Foster
7. CURRENT STATE: Step 1: Define desired data set
Researcher talks to registrar
about what kind of patient
cohort they want.
Registar queries
the registry for the
researchers.
RegistrarResearcher
Lengthy, iterative, labor-intensive process
No mechanism for ad hoc exploration of the data by the researcher
8. CURRENT STATE: Step 2: Deliver data to researcher
Registrar delivers data to
researcher
Registar extracts
data from Registry
RegistrarResearcher
Registrar delivers data via email, OneDrive, Sharepoint, mapped drive
No tracking/auditing, error prone, insecure
9. FUTURE STATE: Step 1: Define desired data set
Goal 1: Provide mechanism for self service exploration of registry data
Registar queries
the registry for the
researchers.
Researchers queries Registry
RegistrarResearcher
CR3
10. FUTURE STATE: Step 2: Deliver data to researcher
RegistrarResearcher
Registrar delivers data to
researcher
Registar extracts
data from Registry
Goal 2: Provide secure, auditable, access controlled mechanism to share data
11. Research Portal Advanced Features
• Modern technical approach (SaaS deployment)
• Federated Identity (each institution does authentication)
• Honest Brokerage with clinical data (local control)
• NAACCR Standard data (each institution can generate)
• Search via Globus (one search traverses institutions)
• Easy GUI to develop and use (includes faceted search)
• Easy search transmission (search “sentence” to registrar)
• Secure Data Liquidity (easy to move data to authorized)
13. Cancer Registry Records for Research (CR3)
Portal Feature Summary
• Portal code developed using Globus portal framework
(Django/python) based on Modern Research Data Portal
design pattern (PeerJ Articles:cs-144 https://peerj.com/articles/cs-144/)
• Federated logon using Globus Auth (GA) with Pitt/UPMC
as identity providers
• Cancer registry data ingested into Globus Search (GS)
• Aggregate charts (i.e., age range, gender)
• Google-like text search with collapsible facet panels for
filtering
14. CR3
Discovery
Portal
Cohort
aggregate
counts
Login with
UPMC/Pitt
credentials
Globus
Search (GS)
Globus
Auth (GA)
Elasticsearch
Cancer Registry De-identified Data Index
(minimal criteria data: e.g., staging)
UPMC/Pitt
Identity
Providers
Authentication
Auth
initiated
to GA
Cohort
search
initiated to
GS
Researcher
Cohort
aggregate
counts
returned
Email Request
sent to CR
CR3 Architecture v1.0
Registry Staff
(Globus Django Framework)
15. Summary of data stored in Globus Search Index
• Focused on 7 primary sites (i.e., ICD-O-3 codes)
• Breast (e.g., C500-C509), Colon, Lung, Skin, Head and Neck, Ovary, Prostate
• 30 common data elements (CDE) (based on North American Association of Central Cancer
Registries (NAACCR) standard and limited to Safe Harbor)
• Limit to 2 UPMC hospitals (UPMC operates 40 academic, community, and specialty
hospitals)
• UPMC Presbyterian Shadyside
• UPMC Magee-Womens Hospital
• Years 2004 -2017 (based on the availability of linking other clinical data from our Neptune data
warehouse in the future)
• Included Completed and Reportable cases only (NAACCR standards)
• Roughly 65,000 patients
Globus Search
Cancer
Registry DB
16. Extraction Pipeline into Globus Search
CSV
CSV
CSV
Globus Search
Transformation
Process
(python)
Cancer
Registry DB
*Data
extracts
Transformation
Process
(python)
Ingestion
Process
(python)
GMetaEntry
format
(Safe Harbor Data)
Cancer Registry (Honest Broker) Informatics (Honest Broker)
* Future version will generate/accept the NAACCR XML format
17. {
"@version": "2016-11-09",
"ingest_data": {
"@version": "2017-09-01",
"gmeta": [
{
"@version": "2017-09-01",
"subject": "https://dbmi.pitt.edu/subject/201201295:00:BRCA",
"visible_to": [
"urn:globus:groups:id:ddb13d88-d6df-11e8-9daa-0e368f3075e8"
],
"content": {
"clinical_t": "T2",
"mets_at_dx": "No distant metastasis",
"recurrence": "No",
"vitals": "Alive",
"clinical_n": "N0",
"race": "White",
"histology": "Infiltrating duct carcinoma, NOS",
"path_stage": "2A",
"facility": "Magee",
"histology_code": "85003",
"disease_category": "Breast",
"dx_age_range": "35-44",
"grade": "Grade II: Mod diff, mod well diff,",
"spanish_hispanic_origin": "Non-Spanish; non-Hispanic",
"year_last_contact": "2017",
"cs_stage": "2A",
"site": "Breast, NOS",
"cs_extension": "100",
"clinical_stage": "2A",
"cause_of_death": "Not applicable",
"path_n": "N0",
"path_t": "T2",
"cancer_status": "No evidence of this tumor",
"age_at_dx": "36",
"gender": "Female",
"summary_stage": "Localized",
"clinical_m": "M0",
"laterality": "Left - origin of primary",
"dx_year": "2012",
"site_code": "C509"
}
}
]
}
}
GMetaEntry data in GS
visible_to:
content:
(scope limited to defined
Globus Group)
26. Long Term Goals
• Create a network of federated cancer registries
– Deploy similar infrastructure at other cancer registries
– Enable queries across multiple registries
– Model proven in ITCR TIES/TCRN project with independent installation
of TIES or connect a TIES installation to TCRN for multi-institutional
queries
• Federation via Globus allows network to scale while
control remains with each institution
– Data owners input and export data sets, apply quality control
– Data owners control all access policies
– No overhead from ingestion of disparate datasets
– Registry data remains at the location where it was generated
– Identities are provided and authenticated by the institution, not Globus
– System can scale almost infinitely if data owners provide their own
storage resources
27. CR3
Discovery
Portal
Cohort
aggregate
counts
Login with
UPMC/Pitt
credentials
Globus
Search (GS)
Globus
Auth (GA)
UPMC/Pitt
Identity
Providers
Authentication
Auth
initiated
to GA
Cohort
search
initiated to
GS
Researcher
Cohort
aggregate
counts
returned
CR3 Architecture v2.0
Globus
Transfer (GT)
Registry Staff
Data transfer from registrar to
researcher mediated by GT
Manage
authorization
Elasticsearch
Cancer Registry De-identified Data Index
(minimal criteria data: e.g., staging)
Request
Service