PEDSnet : 18 month summary on data integration and data quality
1. PEDSnet:
A
Na-onal
Pediatric
Learning
Health
System
18-‐month
Summary
on
Data
Integra3on
and
Data
Quality
Ritu
Khare
October
23
2015
1
2. PEDSnet
A
Na-onal
Pediatric
Learning
Health
System
(Forrest
et
al.
2014)
• What
is
PEDSnet?
– Clinical
Data
Research
Network
(CDRN)
• Facilitate
large-‐scale
research
– Aggregates
pediatric
EHR
data
from
mulNple
insNtuNons
• PCORI
Support
– One
of
the
11
CDRNs
supported
by
PaNent-‐
Centered
Outcomes
Research
InsNtute
(PCORI)
– Part
of
a
larger
network,
PCORnet,
which
combines
data
from
all
CDRNs
under
a
common
data
model
2
3. PEDSnet
Organiza3onal
Architecture
3
• Establish
an
interoperable
digital
infrastructure
to
support
learning
(research,
disseminaNon,
and
implementaNon),
recruitment,
and
engagement
of
stakeholders.
• Validate
that
the
aggregated
data
is
high-‐quality
and
ready
to
be
integrated
into
the
PCORnet
research
network
Leadership
Core
Network
Core
Science
Core
Engagement
Core
InformaNcs
Core
(>50%
of
team
members)
Steering
CommiWee
NaNonal
Advisory
CommiWee
4. Building
PEDSnet
Challenges
and
OpportuniNes
4
S1
S2
S3
S4
S5
S6
S7
S8
PCORnet
Common
Data
Model
PEDSnet
sites
(at
various
loca-ons)
PEDSnet
Data
CoordinaNng
Center
• lack
of
EHR
data’s
orientaNon
toward
research
analyNcs
• diversity
in
EHR
configuraNons
• semanNc
heterogeneity
• data
peculiariNes
in
pediatrics
• large-‐scale
• Interdisciplinary
informaNcs
team
• ObservaNonal
medical
outcomes
partnership
(OMOP)
• common
data
model
• unified
vocabulary
• Advances
in
data
science
(big
data
and
text
mining
tools)
Challenges
Opportuni3es
5. Building
PEDSnet
Overall
Approach
5
S1
S2
S3
S4
S5
S6
S7
S8
PCORnet
Common
Data
Model
PEDSnet
sites
(at
various
loca-ons)
OMOP
Common
Data
Model
Post
hoc
Data
Quality
Assessment
OMOP
to
PCORnet
transformaNon
scripts
Extract-‐transform-‐
load
(ETL)?
Fitness
for
research?
• More
granular
• Vocabulary
support
• OMOP
analyNc
tools
• Pediatric
specific
edits
Specs
6. Post
hoc
Data
Quality
Assessment
One
Data
Cycle
6
Data
Quality
Checks
OMOP
common
data
model
Communicate
data
quality
issues
to
sites
Two
Expert
Reviewers
Site
ETL
Team
7. • Formal
ViolaNons
of
SpecificaNons
– Illegal
values
in
coded
fields
(e.g.
paNent’s
race)
– Any
missing
data
or
informaNon
(discharge
flags
missing
for
outpaNent
visits)
• Implausible
Events
– Death
date
in
2017
– Visit
date
in
1930
– Height
=
6
cm
• Outliers
(by
Comparison)
– “diabetes”
-‐
most
frequent
condiNon
– Over
50
procedures/paNent
at
a
given
site
(other
sites
having
15-‐20)
• Others
(catch
our
aWenNon!)
– a
provider
responsible
for
recording
over
2M
measurements
– An
outpaNent
visit
having
30k
measurements
– sudden
increase
in
number
of
facts
aier
a
certain
date/year
7
Post
hoc
Data
Quality
Assessment
Types
of
Checks
and
Example
Issues
8. Post
hoc
Data
Quality
Assessment
Communica-on
with
Sites
• Prepare
data
quality
reports
• IdenNfy
causes
of
issues
and
propose
soluNons
8
Visual
Report
GitHub
issue
report
Ranking
9. 18-‐month
Experience
Summary
9
Data
Content
of
the
Network
(8
sites)
~4.7M
children
~98M
clinical
encounters
~120M
condiNons
~68M
procedures
~100M
medicaNons
~42M
observaNons
~428M
labs
and
vital
signs
Prospec3ve
Methods
Communica-on
>
90
ETL
MeeNngs
ETL
Code
95
GitHub
commits/site
ETL
Documenta-on
~150
GitHub
commits
Post
hoc
Data
Cycle
OMOP
Model
version
Total
Issues
Jan
‘15
OMOP
v4
117
Mar
‘15
OMOP
v4
313
May
‘15
OMOP
v5
440
Jul
‘15
OMOP
v5
591
Sep
’15
OMOP
v5
505
10. 18-‐month
Experience
Summary
Data
Quality
in
the
Most
Recent
Data
Cycle
10
Most
Frequent
Issues
Missing
Data
31%
EnNty
Outliers
15%
Unexpected
difference
from
previous
ETL
run
9%
Implausible
Dates
9%
Temporal
Outliers
6%
Unexpected
Facts
4%
Domains
with
Most
Issues
MedicaNon
21%
Labs
and
Vital
Signs
13%
ObservaNon
11%
CondiNon
10%
Procedure
9%
Demographic
7%
Encounter
7%
11. 18-‐month
Experience
Summary
Causes
of
Data
Quality
Issues
• ETL
code
– Programming
bug/mistake
– Ambiguous
or
incomplete
ETL
specificaNons
– AdministraNve
reasons
– Can
be
fixed
• Source
Data
CharacterisNcs
– Data
Entry
Error
– Clinical
or
administraNve
workflow
– True
anomaly
– Cannot
be
fixed
11
12. 18-‐month
Experience
Summary
Causes
of
Issues
and
Lessons
Learnt
12
u ETL
u Data
CharacterisNcs
u I2b2
transform
u Non-‐issue
0
20
40
60
Jan
'15
(OMOP
v4)
Mar
'15
(OMOP
v4)
May
'15
(OMOP
v5)
Jul
'15
(OMOP
v5)
Percentage
of
Issues
Distribu3on
of
Issue
Causes
across
Data
Cycles
• A
strong
post
hoc
program
is
important
in
assessment
of
data
validity
in
CDRNs
• The
percentage
of
“ETL”
issues
decreased
across
successive
data
cycles
• Avg.
22
ETL
issues
/
site.
• This
study
serves
as
a
data
educaNon
plarorm
• The
total
number
of
“data
characterisNc”
issues
increased
across
cycles
13. Future
Work
• Advanced
data
quality
checks
– Current
work
largely
based
on
aWribute-‐level
checks
– Comparison
with
external
datasets
and
published
findings
• DeterminaNon
of
research
fitness
– based
on
the
source
data
characterisNcs
idenNfied
through
post
hoc
assessments
• AutomaNon
– The
data
quality
check
scripts
applied
to
other
OMOP-‐
based
network.
13
14. Acknowledgments
14
• PEDSnet
Data
CoordinaNng
Center
– Aaron
Browne
– Byron
Ruth
– EvaneWe
Burrows
– Jeff
Pennington
– Kevin
Murphy
– Lemar
Davidson
– Lena
Kushleyeva
– Levon
UNdjian
– Toan
Ong
• PEDSnet
ETL
teams
at
various
sites
• PEDSnet
InformaNcs
Leadership
– Charles
Bailey
– Michael
Kahn
– Keith
Marsolo
– Christopher
Forrest
• OHDSI
ConsorNum
• PaNents
and
Families
• This
work
was
supported
by
PCORI
Contract
CDRN-‐1306-‐01556.