Enabling Cancer Research with End-to-end
Infrastructure for the Request, Exploration, and
Receipt of Cancer Registry Data
https://www.dbmi.pitt.edu/ https://rio.pitt.edu
Jonathan C. Silverstein, MD, MS, FACS, FACMI
Chief Research Informatics Officer
Health Sciences and Institute for Precision Medicine
Professor
Department of Biomedical Informatics (DBMI), University of Pittsburgh
Michael Davis, MS
Software Integration Architect
Department of Biomedical Informatics (DBMI), University of Pittsburgh
Research Informatics Office (RIO.pitt.edu)
• 25 staff in DBMI with mission “to support investigators
through innovative collection and use of biomedical data”
• Science-as-a-Service (scientists serving scientists)
• Clinical Data Science Projects
• Basic Data Science Projects
• Translational Data Science Projects
R3 Health Record
Research Request
Neptune/R3 (w Shyam Visweswaran)
HuBMAP Service Architecture (w Nick Nystrom)
CR3 Collaborators
• University of Pittsburgh
– Biomedical Informatics (Jonathan Silverstein)
– Institute for Precision Medicine (Adrian Lee)
• UPMC
– Hillman Cancer Center (Robert Ferris)
– Network Cancer Registry (Sharon Winters)
• University of Chicago
– Globus (Ian Foster)
NCI U24 CA209996, PI: Ian T. Foster
CURRENT STATE: Step 1: Define desired data set
Researcher talks to registrar
about what kind of patient
cohort they want.
Registar queries
the registry for the
researchers.
RegistrarResearcher
Lengthy, iterative, labor-intensive process
No mechanism for ad hoc exploration of the data by the researcher
CURRENT STATE: Step 2: Deliver data to researcher
Registrar delivers data to
researcher
Registar extracts
data from Registry
RegistrarResearcher
Registrar delivers data via email, OneDrive, Sharepoint, mapped drive
No tracking/auditing, error prone, insecure
FUTURE STATE: Step 1: Define desired data set
Goal 1: Provide mechanism for self service exploration of registry data
Registar queries
the registry for the
researchers.
Researchers queries Registry
RegistrarResearcher
CR3
FUTURE STATE: Step 2: Deliver data to researcher
RegistrarResearcher
Registrar delivers data to
researcher
Registar extracts
data from Registry
Goal 2: Provide secure, auditable, access controlled mechanism to share data
Research Portal Advanced Features
• Modern technical approach (SaaS deployment)
• Federated Identity (each institution does authentication)
• Honest Brokerage with clinical data (local control)
• NAACCR Standard data (each institution can generate)
• Search via Globus (one search traverses institutions)
• Easy GUI to develop and use (includes faceted search)
• Easy search transmission (search “sentence” to registrar)
• Secure Data Liquidity (easy to move data to authorized)
Cancer Registry Records for Research (CR3)
Portal
Cancer Registry Records for Research (CR3)
Portal Feature Summary
• Portal code developed using Globus portal framework
(Django/python) based on Modern Research Data Portal
design pattern (PeerJ Articles:cs-144 https://peerj.com/articles/cs-144/)
• Federated logon using Globus Auth (GA) with Pitt/UPMC
as identity providers
• Cancer registry data ingested into Globus Search (GS)
• Aggregate charts (i.e., age range, gender)
• Google-like text search with collapsible facet panels for
filtering
CR3
Discovery
Portal
Cohort
aggregate
counts
Login with
UPMC/Pitt
credentials
Globus
Search (GS)
Globus
Auth (GA)
Elasticsearch
Cancer Registry De-identified Data Index
(minimal criteria data: e.g., staging)
UPMC/Pitt
Identity
Providers
Authentication
Auth
initiated
to GA
Cohort
search
initiated to
GS
Researcher
Cohort
aggregate
counts
returned
Email Request
sent to CR
CR3 Architecture v1.0
Registry Staff
(Globus Django Framework)
Summary of data stored in Globus Search Index
• Focused on 7 primary sites (i.e., ICD-O-3 codes)
• Breast (e.g., C500-C509), Colon, Lung, Skin, Head and Neck, Ovary, Prostate
• 30 common data elements (CDE) (based on North American Association of Central Cancer
Registries (NAACCR) standard and limited to Safe Harbor)
• Limit to 2 UPMC hospitals (UPMC operates 40 academic, community, and specialty
hospitals)
• UPMC Presbyterian Shadyside
• UPMC Magee-Womens Hospital
• Years 2004 -2017 (based on the availability of linking other clinical data from our Neptune data
warehouse in the future)
• Included Completed and Reportable cases only (NAACCR standards)
• Roughly 65,000 patients
Globus Search
Cancer
Registry DB
Extraction Pipeline into Globus Search
CSV
CSV
CSV
Globus Search
Transformation
Process
(python)
Cancer
Registry DB
*Data
extracts
Transformation
Process
(python)
Ingestion
Process
(python)
GMetaEntry
format
(Safe Harbor Data)
Cancer Registry (Honest Broker) Informatics (Honest Broker)
* Future version will generate/accept the NAACCR XML format
{
"@version": "2016-11-09",
"ingest_data": {
"@version": "2017-09-01",
"gmeta": [
{
"@version": "2017-09-01",
"subject": "https://dbmi.pitt.edu/subject/201201295:00:BRCA",
"visible_to": [
"urn:globus:groups:id:ddb13d88-d6df-11e8-9daa-0e368f3075e8"
],
"content": {
"clinical_t": "T2",
"mets_at_dx": "No distant metastasis",
"recurrence": "No",
"vitals": "Alive",
"clinical_n": "N0",
"race": "White",
"histology": "Infiltrating duct carcinoma, NOS",
"path_stage": "2A",
"facility": "Magee",
"histology_code": "85003",
"disease_category": "Breast",
"dx_age_range": "35-44",
"grade": "Grade II: Mod diff, mod well diff,",
"spanish_hispanic_origin": "Non-Spanish; non-Hispanic",
"year_last_contact": "2017",
"cs_stage": "2A",
"site": "Breast, NOS",
"cs_extension": "100",
"clinical_stage": "2A",
"cause_of_death": "Not applicable",
"path_n": "N0",
"path_t": "T2",
"cancer_status": "No evidence of this tumor",
"age_at_dx": "36",
"gender": "Female",
"summary_stage": "Localized",
"clinical_m": "M0",
"laterality": "Left - origin of primary",
"dx_year": "2012",
"site_code": "C509"
}
}
]
}
}
GMetaEntry data in GS
visible_to:
content:
(scope limited to defined
Globus Group)
Long Term Goals
• Create a network of federated cancer registries
– Deploy similar infrastructure at other cancer registries
– Enable queries across multiple registries
– Model proven in ITCR TIES/TCRN project with independent installation
of TIES or connect a TIES installation to TCRN for multi-institutional
queries
• Federation via Globus allows network to scale while
control remains with each institution
– Data owners input and export data sets, apply quality control
– Data owners control all access policies
– No overhead from ingestion of disparate datasets
– Registry data remains at the location where it was generated
– Identities are provided and authenticated by the institution, not Globus
– System can scale almost infinitely if data owners provide their own
storage resources
CR3
Discovery
Portal
Cohort
aggregate
counts
Login with
UPMC/Pitt
credentials
Globus
Search (GS)
Globus
Auth (GA)
UPMC/Pitt
Identity
Providers
Authentication
Auth
initiated
to GA
Cohort
search
initiated to
GS
Researcher
Cohort
aggregate
counts
returned
CR3 Architecture v2.0
Globus
Transfer (GT)
Registry Staff
Data transfer from registrar to
researcher mediated by GT
Manage
authorization
Elasticsearch
Cancer Registry De-identified Data Index
(minimal criteria data: e.g., staging)
Request
Service
Questions?
j.c.s@pitt.edu
midavis@pitt.edu

Health Sciences Research Informatics, Powered by Globus

  • 1.
    Enabling Cancer Researchwith End-to-end Infrastructure for the Request, Exploration, and Receipt of Cancer Registry Data https://www.dbmi.pitt.edu/ https://rio.pitt.edu Jonathan C. Silverstein, MD, MS, FACS, FACMI Chief Research Informatics Officer Health Sciences and Institute for Precision Medicine Professor Department of Biomedical Informatics (DBMI), University of Pittsburgh Michael Davis, MS Software Integration Architect Department of Biomedical Informatics (DBMI), University of Pittsburgh
  • 2.
    Research Informatics Office(RIO.pitt.edu) • 25 staff in DBMI with mission “to support investigators through innovative collection and use of biomedical data” • Science-as-a-Service (scientists serving scientists) • Clinical Data Science Projects • Basic Data Science Projects • Translational Data Science Projects
  • 3.
  • 4.
    Neptune/R3 (w ShyamVisweswaran)
  • 5.
  • 6.
    CR3 Collaborators • Universityof Pittsburgh – Biomedical Informatics (Jonathan Silverstein) – Institute for Precision Medicine (Adrian Lee) • UPMC – Hillman Cancer Center (Robert Ferris) – Network Cancer Registry (Sharon Winters) • University of Chicago – Globus (Ian Foster) NCI U24 CA209996, PI: Ian T. Foster
  • 7.
    CURRENT STATE: Step1: Define desired data set Researcher talks to registrar about what kind of patient cohort they want. Registar queries the registry for the researchers. RegistrarResearcher Lengthy, iterative, labor-intensive process No mechanism for ad hoc exploration of the data by the researcher
  • 8.
    CURRENT STATE: Step2: Deliver data to researcher Registrar delivers data to researcher Registar extracts data from Registry RegistrarResearcher Registrar delivers data via email, OneDrive, Sharepoint, mapped drive No tracking/auditing, error prone, insecure
  • 9.
    FUTURE STATE: Step1: Define desired data set Goal 1: Provide mechanism for self service exploration of registry data Registar queries the registry for the researchers. Researchers queries Registry RegistrarResearcher CR3
  • 10.
    FUTURE STATE: Step2: Deliver data to researcher RegistrarResearcher Registrar delivers data to researcher Registar extracts data from Registry Goal 2: Provide secure, auditable, access controlled mechanism to share data
  • 11.
    Research Portal AdvancedFeatures • Modern technical approach (SaaS deployment) • Federated Identity (each institution does authentication) • Honest Brokerage with clinical data (local control) • NAACCR Standard data (each institution can generate) • Search via Globus (one search traverses institutions) • Easy GUI to develop and use (includes faceted search) • Easy search transmission (search “sentence” to registrar) • Secure Data Liquidity (easy to move data to authorized)
  • 12.
    Cancer Registry Recordsfor Research (CR3) Portal
  • 13.
    Cancer Registry Recordsfor Research (CR3) Portal Feature Summary • Portal code developed using Globus portal framework (Django/python) based on Modern Research Data Portal design pattern (PeerJ Articles:cs-144 https://peerj.com/articles/cs-144/) • Federated logon using Globus Auth (GA) with Pitt/UPMC as identity providers • Cancer registry data ingested into Globus Search (GS) • Aggregate charts (i.e., age range, gender) • Google-like text search with collapsible facet panels for filtering
  • 14.
    CR3 Discovery Portal Cohort aggregate counts Login with UPMC/Pitt credentials Globus Search (GS) Globus Auth(GA) Elasticsearch Cancer Registry De-identified Data Index (minimal criteria data: e.g., staging) UPMC/Pitt Identity Providers Authentication Auth initiated to GA Cohort search initiated to GS Researcher Cohort aggregate counts returned Email Request sent to CR CR3 Architecture v1.0 Registry Staff (Globus Django Framework)
  • 15.
    Summary of datastored in Globus Search Index • Focused on 7 primary sites (i.e., ICD-O-3 codes) • Breast (e.g., C500-C509), Colon, Lung, Skin, Head and Neck, Ovary, Prostate • 30 common data elements (CDE) (based on North American Association of Central Cancer Registries (NAACCR) standard and limited to Safe Harbor) • Limit to 2 UPMC hospitals (UPMC operates 40 academic, community, and specialty hospitals) • UPMC Presbyterian Shadyside • UPMC Magee-Womens Hospital • Years 2004 -2017 (based on the availability of linking other clinical data from our Neptune data warehouse in the future) • Included Completed and Reportable cases only (NAACCR standards) • Roughly 65,000 patients Globus Search Cancer Registry DB
  • 16.
    Extraction Pipeline intoGlobus Search CSV CSV CSV Globus Search Transformation Process (python) Cancer Registry DB *Data extracts Transformation Process (python) Ingestion Process (python) GMetaEntry format (Safe Harbor Data) Cancer Registry (Honest Broker) Informatics (Honest Broker) * Future version will generate/accept the NAACCR XML format
  • 17.
    { "@version": "2016-11-09", "ingest_data": { "@version":"2017-09-01", "gmeta": [ { "@version": "2017-09-01", "subject": "https://dbmi.pitt.edu/subject/201201295:00:BRCA", "visible_to": [ "urn:globus:groups:id:ddb13d88-d6df-11e8-9daa-0e368f3075e8" ], "content": { "clinical_t": "T2", "mets_at_dx": "No distant metastasis", "recurrence": "No", "vitals": "Alive", "clinical_n": "N0", "race": "White", "histology": "Infiltrating duct carcinoma, NOS", "path_stage": "2A", "facility": "Magee", "histology_code": "85003", "disease_category": "Breast", "dx_age_range": "35-44", "grade": "Grade II: Mod diff, mod well diff,", "spanish_hispanic_origin": "Non-Spanish; non-Hispanic", "year_last_contact": "2017", "cs_stage": "2A", "site": "Breast, NOS", "cs_extension": "100", "clinical_stage": "2A", "cause_of_death": "Not applicable", "path_n": "N0", "path_t": "T2", "cancer_status": "No evidence of this tumor", "age_at_dx": "36", "gender": "Female", "summary_stage": "Localized", "clinical_m": "M0", "laterality": "Left - origin of primary", "dx_year": "2012", "site_code": "C509" } } ] } } GMetaEntry data in GS visible_to: content: (scope limited to defined Globus Group)
  • 26.
    Long Term Goals •Create a network of federated cancer registries – Deploy similar infrastructure at other cancer registries – Enable queries across multiple registries – Model proven in ITCR TIES/TCRN project with independent installation of TIES or connect a TIES installation to TCRN for multi-institutional queries • Federation via Globus allows network to scale while control remains with each institution – Data owners input and export data sets, apply quality control – Data owners control all access policies – No overhead from ingestion of disparate datasets – Registry data remains at the location where it was generated – Identities are provided and authenticated by the institution, not Globus – System can scale almost infinitely if data owners provide their own storage resources
  • 27.
    CR3 Discovery Portal Cohort aggregate counts Login with UPMC/Pitt credentials Globus Search (GS) Globus Auth(GA) UPMC/Pitt Identity Providers Authentication Auth initiated to GA Cohort search initiated to GS Researcher Cohort aggregate counts returned CR3 Architecture v2.0 Globus Transfer (GT) Registry Staff Data transfer from registrar to researcher mediated by GT Manage authorization Elasticsearch Cancer Registry De-identified Data Index (minimal criteria data: e.g., staging) Request Service
  • 28.