Data Quality in Multi-Site Health Services
and Comparative Effectiveness Research: Lessons
from PHIS+
Ram Gouripeddi
University of Utah
10th Annual Utah Health Services Research Conference
Considering Data Quality in Health Services Research
Monday, March 16, 2015
Acknowledgements
• Raj Srivastava, MD, MPH
• Ron Keren, MD, MPH
• OpenFurther Team members
• PHIS+ Team members across multiple institutions
• Apelon
• FURTHeR development was supported by the NCRR and the NCATS, NIH, through
Grant UL1RR025764 and supplement 3UL1RR025764-02S2. This project was
funded under grant number R01 HS019862-01 from the AHRQ, U.S. Department of
Health and Human Services (HHS). The opinions expressed [in this document] are
those of the authors and do not reflect the official position of AHRQ or the HHS.
• PHIS+: www.childrenshospitals.org/phisplus/index.html
2
PHIS+
• Augments the Children’s Hospital Association’s (CHA) existing electronic
database of administrative data, the Pediatric Health Information System
(PHIS), with clinical data to conduct Comparative Effectiveness Research
(CER) studies.
• University of Utah (UU) Biomedical Informatics Core – Informatics Partners
• Agency for Healthcare Research and Quality (AHRQ)
funded project.
3
PHIS+ Overview
CER Studies:
• Pneumonia
• Appendicitis
• Osteomyelitis
• Gastroesophageal Reflux Disease
4
The PHIS+ Process
1. Cincinnati Children’s Hospital Medical Center (CCHMC)
2. Children’s Hospital Boston (CHB)
3. Children’s Hospital of Philadelphia (CHOP)
4. Children’s Hospital of Pittsburgh (CHP)
5. Primary Children’s Medical Center, Intermountain Healthcare (PCMC)
6. Seattle Children’s Hospital (SCH)
5
OpenFurther
6
Developmental Process Overview
Narus et al., Federating Clinical Data from Six Pediatric Hospitals: Process and Initial Results from the PHIS+
Consortium. AMIA 2011.
7
Modeling & Terminology Phase
• Data Model Harmonization
• Semantic Mapping
• These steps ensured data quality by limiting the
information loss arising from data transformations.
8
Data Model Harmonization
• The informatics team worked with domain experts
to create representative common data models
for storing different domains of data.
• The team then worked with each hospital’s IT staff
to harmonize local data models with the common
data models.
9
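The harmonization step above can be sketched in code. This is a minimal sketch only: the model fields, column names, and `harmonize_row` helper below are illustrative, not the actual PHIS+ common data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabResult:
    """Hypothetical common data model for one laboratory result row."""
    patient_id: str
    test_code: str             # local test code as submitted by the site
    loinc_code: Optional[str]  # standard code, filled in after semantic mapping
    value: Optional[float]
    units: Optional[str]

def harmonize_row(row: dict, field_map: dict) -> LabResult:
    """Rename one hospital's local columns onto the common model.
    `field_map` holds that hospital's local-column -> common-field names."""
    common = {field_map[k]: v for k, v in row.items() if k in field_map}
    raw_value = common.get("value")
    return LabResult(
        patient_id=str(common["patient_id"]),
        test_code=str(common["test_code"]),
        loinc_code=common.get("loinc_code"),
        value=float(raw_value) if raw_value not in (None, "") else None,
        units=common.get("units"),
    )

# One hospital's column names mapped onto the common model (illustrative only)
site_a_map = {"PAT_ID": "patient_id", "TEST_CD": "test_code",
              "RESULT_VAL": "value", "UOM": "units"}
r = harmonize_row({"PAT_ID": "p1", "TEST_CD": "Glu",
                   "RESULT_VAL": "86", "UOM": "mg/dL"}, site_a_map)
```

Each site supplies only its own `field_map`; the downstream quality checks then run against one shared schema.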
Semantic Mapping
• Obtained detailed information
about distinct local data
elements using a metadata
collection toolkit.
• Mapped local data elements to
standard biomedical
terminologies.
• Doubtful mappings were discussed
with the respective hospital
team, including the site PI and
lab and EHR personnel.
• All mappings were peer-reviewed
within the informatics team and
with the contributing hospital
team, and were also run through
software checks.
10
Metadata Fields Example
Local Battery/Panel Name/Code:
Battery/Panel Description:
Local Test Name: Glucose
Local Test Code: Glu
Test Description: Blood Glucose
LOINC Code: -
Test Value Type: Numeric
Test Value Sample Data: 86
Test Start Date Format:
Test End Date Format:
Specimen: Serum
Units of Measure: mg/dL
Reference Range: 80 – 120
Interpretation Codes:
Test Status Codes:
Comments:
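Collected metadata like the above feeds the mapping step. A minimal sketch of such a mapping table, assuming a plain dictionary lookup: the entries and review statuses are illustrative (1988-5 for C-reactive protein appears in these slides; the glucose LOINC code and the "review" routing are assumptions, not the actual PHIS+ tooling).

```python
# Hypothetical mapping table: (local test code, local name) -> (LOINC code, status)
LOINC_MAP = {
    ("Glu", "Glucose"): ("2345-7", "exact"),
    ("CRPT", "C Reactive Protein"): ("1988-5", "exact"),
    ("700111", "CRP Test"): ("1988-5", "review"),  # doubtful: sent to site team
}

def map_to_loinc(code: str, name: str):
    """Return (loinc, status); unmapped elements are queued for manual review."""
    return LOINC_MAP.get((code, name), (None, "unmapped"))

loinc, status = map_to_loinc("Glu", "Glucose")
```

The "review" and "unmapped" statuses stand in for the peer review and software checks the slide describes: anything short of an exact match goes back to the hospital team.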
Differences in Local Coding Schemas
Laboratory Test (C-reactive protein), as named and coded across the six sites:
• C Reactive Protein [Mass/volume] in Serum or Plasma (1988-5)
• C Reactive Protein (8726)
• C Reactive Protein (CRPT)
• CRP (CRP)
• CRP Test (700111)
• C-Reactive Protein (801582)
• C R Protein (801679)
Unit of Measure (nanograms per decilitre), as recorded across the six sites:
• Nanogram/Decilitre (258805003)
• NG/DL
• ng/dL
• ng/dL
• ng per dL
• ng/Dl
• ng per dL
11
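Normalizing unit variants like these is one small, automatable quality step. A sketch assuming a hand-curated synonym table keyed on a canonicalized string; the table and canonicalization rules are illustrative, not the actual PHIS+ terminology service.

```python
import re

# Variant unit spellings observed across sites, normalized to one canonical form.
# The variants mirror the slide; the normalization rules are an assumption.
UNIT_SYNONYMS = {
    "ng/dl": "ng/dL",
    "ng per dl": "ng/dL",
    "nanogram/decilitre": "ng/dL",
}

def normalize_unit(raw: str) -> str:
    """Collapse case, whitespace, and slash-spacing variants onto one key."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    key = re.sub(r"\s*/\s*", "/", key)
    return UNIT_SYNONYMS.get(key, raw.strip())

variants = ["NG/DL", "ng/dL", "ng per dL", "ng/Dl", "Nanogram/ Decilitre"]
canonical = {normalize_unit(u) for u in variants}
```

A lookup miss falls through unchanged, so unrecognized units surface during review rather than being silently altered.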
Data Processing Phase
• Data collection phase: each hospital used a
combination of a data collection toolkit and data
validation scripts to assess its submitted data.
• Contributed data was then processed through the
OpenFurther platform for translation to selected
standard terminologies and storage in the common data
models.
• Each row of processed data was checked for different
data quality issues specific to each domain.
• Errors in the data were flagged with an error taxonomy
and reviewed for fixes or resubmissions.
12
Example Checks
• Is the lab test associated with a patient?
• Is there a valid lab test in each row of lab
result data?
• Does the lab test have a valid result?
• Are there proper relationships between
cultures, their test specimens, and results?
13
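The first three checks above can be expressed as a row-level function that returns error-taxonomy codes. The field names and error codes below are illustrative, not the actual PHIS+ taxonomy.

```python
def check_lab_row(row: dict, known_patients: set, known_tests: set) -> list:
    """Return error-taxonomy codes for one lab-result row (codes are illustrative)."""
    errors = []
    if row.get("patient_id") not in known_patients:
        errors.append("ERR_NO_PATIENT")    # test not associated with a patient
    if row.get("loinc_code") not in known_tests:
        errors.append("ERR_INVALID_TEST")  # no valid lab test in the row
    if row.get("value") in (None, ""):
        errors.append("ERR_NO_RESULT")     # test lacks a valid result
    return errors

clean = check_lab_row({"patient_id": "p1", "loinc_code": "2823-3", "value": 4.5},
                      known_patients={"p1"}, known_tests={"2823-3"})
flagged = check_lab_row({"patient_id": "p9", "loinc_code": "2823-3", "value": ""},
                        known_patients={"p1"}, known_tests={"2823-3"})
```

An empty list means the row passes; otherwise the attached codes drive the fix-or-resubmit review described above. The culture/specimen relationship check would need cross-row logic and is omitted here.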
Study Specific Quality Assessment
• Individual studies have
different granularities and
specificities in their data
requirements.
• We undertook a second set
of data quality assessments
at the study cohort level.
• This included a chart review
of a significant sample
within each study cohort.
14
[Figure: Histogram of result values for 2823-3: Potassium [Moles/volume] in
Serum or Plasma (bin counts up to ~35,000; values ranging roughly 0.6–13.7),
highlighting non-numeric and out-of-range entries such as “****”, “>9.0”, and
“QNStorepeat” surfaced during quality assessment.]
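Drawing the cohort sample for chart review can be sketched as follows. The sample size, seed, and IDs are illustrative; the slides do not specify the actual sampling design.

```python
import random

def chart_review_sample(cohort_ids: list, n: int, seed: int = 0) -> list:
    """Draw a reproducible random sample of a study cohort for manual chart review."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn exactly
    return rng.sample(cohort_ids, min(n, len(cohort_ids)))

cohort = [f"pt{i}" for i in range(500)]   # hypothetical study cohort IDs
sample = chart_review_sample(cohort, n=50)
```

Fixing the seed makes the review set auditable: anyone re-running the script against the same cohort gets the same charts.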
PHIS+ CER Database – 2007–11

Laboratory
Site  | Results     | LOINC Lab Test Codes
A     | 15,011,312  | 538
B     | 33,214,540  | 1,214
C     | 16,868,383  | 860
D     | 25,706,608  | 1,089
E     | 38,422,668  | 1,016
F     | 14,507,629  | 2,131
Total | 150,731,140 | *6,848 (2,992)

Microbiology
Site  | Culture Results | SNOMED Specimen Codes | SNOMED Culture Procedure Codes | SNOMED Organism Codes | RxNorm Antimicrobial Codes | Susceptibility Results | LOINC Susceptibility Test Codes
A     | 247,933   | 114        | 70        | 113        | 57        | 487,813   | 97
B     | 359,780   | 58         | 42        | 56         | 58        | 393,594   | 85
C     | 231,071   | 179        | 46        | 162        | 59        | 340,100   | 99
D     | 335,606   | 110        | 34        | 145        | 57        | 376,844   | 75
E     | 486,315   | 130        | 56        | 160        | 59        | 605,000   | 76
F     | 176,848   | 264        | 71        | 121        | 51        | 283,865   | 89
Total | 1,837,553 | *855 (451) | *319 (95) | *757 (203) | *341 (74) | 2,487,216 | *521 (136)

Radiology
Site  | Reports   | CPT Radiology Procedure Codes
A     | 445,681   | 280
B     | 1,151,383 | 349
C     | 635,458   | 296
D     | 980,740   | 482
E     | 1,098,693 | 497
F     | 201,708   | 477
Total | 4,513,663 | *2,381 (714)

1,854,406 kids

* The first number is the total number of standard codes; the number in parentheses is the distinct number of standard codes across all sites.
Discussion
• We developed an infrastructure that assesses the
quality of data being integrated from disparate data
sources.
• Using this infrastructure, we populated a database with
high-quality data to support HSR & CER.
• To ensure data quality, a combination of computerized
data assessment checks within OpenFurther and
manual checks was used.
• Global and study-specific data quality assessments
were both required
– to address systemic issues in data integration as well as
study-specific issues.
16
Discussion
• Informed by the framework developed by Kahn et al. in “A
Pragmatic Framework for Single-site and Multisite Data
Quality Assessment in Electronic Health Record-based
Clinical Research”.
• Inherent dimensions such as Accuracy, Objectivity, and
Believability, and Conceptual dimensions such as
Timeliness and Appropriate Amount of Data, were
measured.
• A software platform that complies with existing theoretical
frameworks of data quality can assist this process and
speed up the generation of new and reproducible
study results.
– A Data Model for Representation and Storage of Biomedical
Data Quality, Breakout Session 3 – Strategies for Identifying
Data Quality Issues
17
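One of these dimensions, Appropriate Amount of Data, can be approximated with a simple completeness ratio. This is an illustrative sketch only; the required fields and the metric itself are assumptions, not the measures actually used in PHIS+.

```python
def completeness(rows: list, required: tuple) -> float:
    """Fraction of rows in which every required field is present and non-empty
    (a rough proxy for the 'appropriate amount of data' dimension)."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(f) not in (None, "") for f in required) for r in rows)
    return ok / len(rows)

rows = [{"patient_id": "p1", "loinc_code": "2345-7", "value": 86},
        {"patient_id": "p2", "loinc_code": None, "value": 91}]
score = completeness(rows, required=("patient_id", "loinc_code", "value"))
```

Computing such a score per site and per domain gives a comparable number to track across submissions, in the spirit of the Kahn framework's measurable dimensions.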
THANK YOU
18

Editor's Notes

  • #8 Phases: Initial Data Analysis for Terminology & Modeling; Harmonization of Site-Specific Models to the FURTHeR Input Model; Input Data Submission Specifications; Trial Sample Submission; PHIS+ Storage Model; Software Development for Translations; Mapping of Local Terminologies to Standards; Evaluation with 1 Year of Data; Installation of Infrastructure at CHA; Terminology and Software Support – Current Status
  • #16 This slide has the hospital-wise and total numbers in the PHIS+ CER database. It currently stores all laboratory, microbiology, and radiology data for the years 2007–2011, with plans to add future-year data loads from these six pediatric hospitals. Using the FURTHeR infrastructure, data from each hospital has been converted to a central model that uses standard terminologies, which allows performing CER. The database at this time consists of about 143 million lab results, 1.8 million culture results, 2.4 million susceptibility results, and 4.2 million radiology reports, with distinct standard codes for each of the data streams.