Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

PEDSnet : 18 month summary on data integration and data quality


Published on

Panel Discussion: Enabling Large-Scale Observational Research through Data Integration

Published in: Science
  • Be the first to comment

  • Be the first to like this

PEDSnet : 18 month summary on data integration and data quality

  1. 1. PEDSnet:  A  Na-onal  Pediatric  Learning   Health  System   18-­‐month  Summary  on  Data  Integra3on  and  Data  Quality     Ritu  Khare             October  23  2015     1  
  2. 2. PEDSnet     A  Na-onal  Pediatric  Learning  Health  System  (Forrest  et  al.  2014)   •  What  is  PEDSnet?   –  Clinical  Data  Research  Network  (CDRN)     •  Facilitate  large-­‐scale  research   –  Aggregates  pediatric  EHR  data  from  mulNple   insNtuNons     •  PCORI  Support   –  One  of  the  11  CDRNs  supported  by  PaNent-­‐ Centered  Outcomes  Research  InsNtute  (PCORI)   –  Part  of  a  larger  network,  PCORnet,  which   combines  data  from  all  CDRNs  under  a  common   data  model     2  
  3. 3. PEDSnet  Organiza3onal  Architecture   3     •  Establish  an  interoperable  digital  infrastructure  to  support  learning   (research,  disseminaNon,  and  implementaNon),  recruitment,  and   engagement  of  stakeholders.     •  Validate  that  the  aggregated  data  is  high-­‐quality  and  ready  to  be   integrated  into  the  PCORnet  research  network   Leadership   Core   Network   Core   Science  Core   Engagement   Core   InformaNcs  Core   (>50%  of  team   members)   Steering  CommiWee   NaNonal  Advisory  CommiWee  
  4. 4. Building  PEDSnet   Challenges  and  OpportuniNes   4   S1   S2   S3   S4   S5   S6   S7   S8   PCORnet   Common   Data  Model   PEDSnet  sites     (at  various  loca-ons)   PEDSnet  Data  CoordinaNng  Center   •  lack  of  EHR  data’s   orientaNon  toward   research  analyNcs   •  diversity  in  EHR   configuraNons   •  semanNc  heterogeneity       •  data  peculiariNes  in   pediatrics   •  large-­‐scale   •  Interdisciplinary   informaNcs  team   •  ObservaNonal   medical  outcomes   partnership  (OMOP)   •  common  data   model     •  unified  vocabulary     •  Advances  in  data   science  (big  data  and   text  mining  tools)         Challenges   Opportuni3es  
  5. 5. Building  PEDSnet     Overall  Approach   5   S1   S2   S3   S4   S5   S6   S7   S8   PCORnet   Common   Data  Model   PEDSnet  sites     (at  various  loca-ons)     OMOP   Common  Data   Model     Post  hoc   Data  Quality   Assessment   OMOP  to   PCORnet   transformaNon   scripts   Extract-­‐transform-­‐ load  (ETL)?   Fitness  for   research?   •  More  granular     •  Vocabulary  support   •  OMOP  analyNc  tools   •  Pediatric  specific  edits   Specs  
  6. 6. Post  hoc  Data  Quality  Assessment   One  Data  Cycle       6   Data  Quality   Checks   OMOP   common  data   model     Communicate   data  quality   issues  to  sites     Two  Expert   Reviewers   Site  ETL   Team  
  7. 7. •  Formal  ViolaNons  of  SpecificaNons   –  Illegal  values  in  coded  fields  (e.g.  paNent’s  race)   –  Any  missing  data  or  informaNon    (discharge  flags  missing  for  outpaNent   visits)   •  Implausible  Events     –  Death  date  in  2017     –  Visit  date  in  1930   –  Height  =  6  cm     •  Outliers  (by  Comparison)   –  “diabetes”    -­‐    most  frequent  condiNon   –  Over  50  procedures/paNent  at  a  given  site  (other  sites  having  15-­‐20)   •  Others  (catch  our  aWenNon!)   –  a  provider  responsible  for  recording  over  2M  measurements   –  An  outpaNent  visit  having  30k  measurements     –  sudden  increase  in  number  of  facts  aier  a  certain  date/year   7   Post  hoc  Data  Quality  Assessment   Types  of  Checks  and  Example  Issues    
  8. 8. Post  hoc  Data  Quality  Assessment   Communica-on  with  Sites       •  Prepare  data  quality  reports   •  IdenNfy  causes  of  issues  and  propose  soluNons   8   Visual  Report   GitHub  issue  report   Ranking  
  9. 9. 18-­‐month  Experience  Summary   9   Data  Content  of  the  Network     (8  sites)   ~4.7M  children   ~98M  clinical  encounters         ~120M  condiNons     ~68M  procedures     ~100M  medicaNons   ~42M  observaNons   ~428M  labs  and  vital  signs   Prospec3ve  Methods   Communica-on   >  90  ETL  MeeNngs   ETL  Code    95  GitHub  commits/site   ETL   Documenta-on   ~150  GitHub  commits     Post  hoc   Data  Cycle   OMOP  Model   version   Total  Issues   Jan  ‘15   OMOP  v4   117   Mar  ‘15   OMOP  v4   313   May  ‘15   OMOP  v5   440   Jul  ‘15   OMOP  v5   591   Sep  ’15   OMOP  v5   505  
  10. 10. 18-­‐month  Experience  Summary   Data  Quality  in  the  Most  Recent  Data  Cycle   10   Most  Frequent  Issues   Missing  Data   31%   EnNty  Outliers   15%   Unexpected  difference   from  previous  ETL  run   9%   Implausible  Dates   9%   Temporal  Outliers   6%   Unexpected  Facts   4%   Domains  with  Most  Issues   MedicaNon   21%   Labs  and  Vital  Signs   13%   ObservaNon   11%   CondiNon   10%   Procedure   9%   Demographic   7%   Encounter   7%  
  11. 11. 18-­‐month  Experience  Summary   Causes  of  Data  Quality  Issues   •  ETL  code     –  Programming  bug/mistake     –  Ambiguous  or  incomplete  ETL  specificaNons     –  AdministraNve  reasons     –  Can  be  fixed   •  Source  Data  CharacterisNcs   –  Data  Entry  Error     –  Clinical  or  administraNve  workflow     –  True  anomaly     –  Cannot  be  fixed     11  
  12. 12. 18-­‐month  Experience  Summary   Causes  of  Issues  and  Lessons  Learnt   12   u  ETL     u  Data  CharacterisNcs   u  I2b2  transform   u  Non-­‐issue  0   20   40   60   Jan  '15   (OMOP  v4)   Mar  '15   (OMOP  v4)   May  '15   (OMOP  v5)   Jul  '15   (OMOP  v5)   Percentage  of   Issues   Distribu3on  of  Issue  Causes  across  Data  Cycles   •  A  strong  post  hoc  program  is  important  in  assessment  of  data   validity  in  CDRNs   •  The  percentage  of  “ETL”  issues  decreased  across  successive  data  cycles   •  Avg.  22  ETL  issues  /  site.     •  This  study  serves  as  a  data  educaNon  plarorm     •  The  total  number  of  “data  characterisNc”  issues  increased  across  cycles  
  13. 13. Future  Work   •  Advanced  data  quality  checks     –  Current  work  largely  based  on  aWribute-­‐level  checks   –  Comparison  with  external  datasets  and  published   findings     •  DeterminaNon  of  research  fitness   –  based  on  the  source  data  characterisNcs  idenNfied   through  post  hoc  assessments   •  AutomaNon     –  The  data  quality  check  scripts  applied  to  other  OMOP-­‐ based  network.     13  
  14. 14. Acknowledgments   14   •  PEDSnet  Data   CoordinaNng  Center   –  Aaron  Browne   –  Byron  Ruth   –  EvaneWe  Burrows   –  Jeff  Pennington     –  Kevin  Murphy   –  Lemar  Davidson   –  Lena  Kushleyeva   –  Levon  UNdjian       –  Toan  Ong   •  PEDSnet  ETL  teams  at   various  sites   •  PEDSnet  InformaNcs   Leadership   –  Charles  Bailey     –  Michael  Kahn     –  Keith  Marsolo     –  Christopher  Forrest     •  OHDSI  ConsorNum       •  PaNents  and  Families       •  This  work  was  supported   by  PCORI  Contract   CDRN-­‐1306-­‐01556.