Geisinger Health System is well known in the healthcare community as a pioneer in data and analytics. We have had an Electronic Health Record (EHR) since 1996 and an Electronic Data Warehouse (EDW) since 2008. Much of our daily and weekly operational reporting, as well as an abundance of ad hoc analytics, comes from the EDW.
Approximately 18 months ago, the Data Management team implemented Hadoop via the Hortonworks Data Platform (HDP), and successes in implementation and development convinced the organization to retire the traditional EDW in favor of the Big Data (HDP) platform.
In less than 18 months, we stood up the platform, created a data ingestion pipeline, duplicated all source feeds from the EDW into HDP, and developed several analytics with HDP and Tableau. Furthermore, we have exploited the new capabilities of the platform, using Natural Language Processing (NLP) to interrogate valuable (but previously hidden) clinical notes. The new platform has data that is modeled and governed, setting the stage to push Geisinger Health System from a pioneer to a leader in Big Data and Analytics.
This session will focus on Hortonworks Data Platform, covering data architecture, security, data process flow, and development. It is geared toward Data Architects, Data Scientists, and Operations/I.T. audiences.
2. Integrated health services organization
Innovative care delivery models
Serves >3 million residents in 45 counties
>30,000 employees
>1,500 employed physicians
12 hospital campuses
551,000 member health plan
3.
4. A good first start.
• Data assembled in a central location
• Allowed for self-service
• Could link disparate data
[Diagram: Data Warehouse at the center, linking Health Record, Surveys, Cardiology, Oncology, Financials, Codesets, External Data, and Claims]
5. “There are too many undocumented data sources.”
“There is no documented understanding of business requirements for CDIS business analytics.”
“We don’t have the transformations that the business users really need.”
“Cannot provide data that is fit for purpose.”
“Data dictionary does not exist today.”
“Can’t ‘match’ from encounters to bills to claim.”
“Much of my group’s time is spent entering data manually.”
“The platform/architecture in place for CDIS analytics is not correct for the types of work being performed.”
“Clinical data quality problems related to patient safety exist.”
“Hierarchies exist at many levels.”
“The level of detail that I need is not there in the data.”
“There are too many pockets of data.”
“The CDIS ‘lift and shift’ model perpetuates the problem with too many views/analytics.”
6. • If data isn’t accurate, it is worse than nothing.
• Incomplete data isn’t useful.
• Data that isn’t timely is less than desirable.
• When multiple versions of data exist, relying on the wrong value can lead to bad decisions.
• There must be ONE source of truth for data.
• Data without documentation is of questionable value.
Often, the first exposure of new data highlights data quality issues.
7. A unified data architecture (UDA) is a more comprehensive view of the overall enterprise architecture; a collection of services, platforms, applications, and tools that help customers define and deploy an architecture that makes the best use of available technologies to unleash the optimal value of data. (TDWI, Jun 6, 2013)
The UDA at Geisinger Health System is the integration of key analytic platforms (e.g., Hadoop, EDW, EHR, etc.) with a common semantic layer, all performing under the umbrella of the same Data Governance structure.
8. • Less expensive due to commodity hardware
• Could be as little as 10% of the cost of our traditional EDW
• Faster ingestion of data
• Because of early binding, mapping and modeling are typically done up front in traditional data warehousing. Hadoop’s late binding allows the data to simply be loaded without detailed analysis and preparation.
• Multiple views of the data
• Our multi-zoned Hadoop system allows many views of the data, including temporal, modeled, etc.
• Unstructured and semi-structured data
• Hadoop is not confined to structured data in discrete fields, as traditional analytic platforms are.
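A minimal Python sketch of the late-binding idea (paths, file layout, and column names are illustrative, not Geisinger's actual pipeline): raw files land partitioned only by source system and load date, and a schema is applied at read time rather than at ingestion.

```python
import csv
import io
from datetime import date

def landing_path(source: str, load_date: date) -> str:
    """Late binding: files are stored as-is, partitioned only by
    source system and load date -- no upfront modeling."""
    # Path layout is illustrative, not the actual zone layout.
    return f"/data/raw/{source}/load_date={load_date.isoformat()}/part-0000.csv"

def read_with_schema(raw_bytes: bytes, columns: list) -> list:
    """Schema-on-read: column names are applied only when the data
    is queried, so ingestion never blocks on analysis and prep."""
    rows = csv.reader(io.StringIO(raw_bytes.decode("utf-8")))
    return [dict(zip(columns, row)) for row in rows]

path = landing_path("epic_clarity", date(2018, 5, 1))
records = read_with_schema(b"123,DOE JOHN\n456,ROE JANE\n",
                           ["patient_id", "name"])
```

The point of the sketch: nothing about the file's content had to be known before it was loaded; the `columns` list can be supplied (or changed) long after ingestion.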
9.
10. THE V’S OF BIG DATA
Controlling Data Volume, Velocity, and Variety
13. Variety: different forms and views
[Word cloud: non-traditional sources, home devices, KeyHIE, social media, patient apps, device integration, genomics, struct, multi-zoned, Lawson]
16. • ROI: use the open-source, commodity-hardware argument
• Change: the SQL team is unfamiliar with the Big Data ecosystem
• Data Load: load EVERYTHING into Hadoop by building prototypes, not use cases
• Self-service: push for self-service as much as possible
• Adoption: develop valuable early wins; invest in visualization (e.g., Tableau)
• Data Zones: create separate data zones; split PHI from non-PHI data
• Surge capacity: pop off to cloud-based options when surge capacity is needed
17. PRODUCTION FOOTPRINT
CDIS (Teradata production server)
– Version 14.10
– ~13TB uncompressed
– ~30TB compressed
Hadoop (production cluster)
– Hortonworks Data Platform v2.6
– 30 nodes
– 600TB total
– 200TB usable (3 copies)
18. MAJOR DATA SOURCES
Traditional EDW
• Health Record (clinical) data
• Financial
• Claims
• Pulmonary
• Pathology
• Oncology
Hadoop
• All EDW sources, plus:
• Lawson
– Fin, supply chain, A/P
• RIS (Radiology)
• Microbiology
• KeyHIE (Health Info Exchange)
• Lab System Data
• Phone Systems
• Lumedx (Cardiology)
19. LLAP STATISTICS
Configuration
• Running on 10 nodes
• Using 40% of the cluster
• 100GB cache available
Teradata vs. LLAP
• Queries under 1 minute: 80% of queries performed better than on Teradata
• Queries over 1 minute: 95% of queries performed better than on Teradata
20. [Data flow diagram]
Epic Caché → .ext files (ETL files feeding the clinical reporting database) → Epic Clarity (clinical reporting DB) → EDW (traditional enterprise data warehouse; primary clinical dataset containing patient health records)
Epic Caché → .ext files → Hadoop (new Big Data platform)
The Hadoop path results in data available hours before the traditional EDW.
21. • More tables loaded nightly
• ~1,100 in Teradata
• ~7,200 in Hadoop
• Incremental EXTs (~3,500 EXT files/night)
• Automated Epic loading process using MapReduce and Java
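The incremental selection step can be sketched as follows (the file-naming convention `<table>.<seq>.ext` and the selection logic are hypothetical; the production process is implemented in MapReduce and Java):

```python
from datetime import datetime

def select_incremental(ext_files: list, last_load: datetime) -> list:
    """Pick up only EXT files produced since the previous nightly run,
    so each load is incremental rather than a full refresh."""
    return [f for f in ext_files if f["modified"] > last_load]

def group_by_table(ext_files: list) -> dict:
    """One load batch per target table. File naming is illustrative:
    <table>.<seq>.ext."""
    groups = {}
    for f in ext_files:
        table = f["name"].split(".")[0]
        groups.setdefault(table, []).append(f["name"])
    return groups

files = [
    {"name": "pat_enc.001.ext", "modified": datetime(2018, 5, 2, 1, 0)},
    {"name": "pat_enc.002.ext", "modified": datetime(2018, 5, 1, 1, 0)},
    {"name": "order_med.001.ext", "modified": datetime(2018, 5, 2, 2, 0)},
]
new_files = select_incremental(files, datetime(2018, 5, 1, 23, 0))
batches = group_by_table(new_files)
```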
22. Landing Zone → Raw Zone → Refined Zone → Current Zone → Integrated Zone
Landing Zone
• Source system pushes to landing zone
• Stored separately by source system
• Securely transferred
• Auditing, traceability, compliance, and lineage
Raw Zone
• New source data is appended, not deleted
• Partitioned by load date
• Compressed
Refined Zone
• Data still temporal
• Data types match source
• Partitioned by load date
• Organized by business attributes and load date
Current Zone
• Current snapshot (temporal history is merged to give the latest view)
Integrated Zone
• Purpose-built datasets for quicker analytics
• Patient/member uniquely identified across systems
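The current-zone merge, collapsing append-only temporal history to the latest version of each record, can be sketched in Python (keys and field names are illustrative):

```python
def current_snapshot(history: list, key: str, ts: str) -> list:
    """Collapse append-only temporal history (as kept in the raw and
    refined zones) to the latest version of each record, which is what
    the current zone serves."""
    latest = {}
    for row in history:
        k = row[key]
        # Keep the row with the most recent load date per key.
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

history = [
    {"patient_id": 1, "load_date": "2018-05-01", "status": "admitted"},
    {"patient_id": 1, "load_date": "2018-05-02", "status": "discharged"},
    {"patient_id": 2, "load_date": "2018-05-01", "status": "admitted"},
]
current = current_snapshot(history, "patient_id", "load_date")
```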
23. • Encryption at rest for Hadoop data
• Authentication/Authorization
• LDAPS and AD Integration using Ranger/Knox
• Connections
• SSL endpoint encryption active for all network connections
• ODBC – SSL Secured
• JDBC – SSL Secured
• Data
• Appropriate access and roles as required. These roles will continue to be
defined by the Data Manager or his designate.
• All PHI data will be masked in the Development environment
• Kerberos Authentication: To thwart impersonation threats
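One common way to mask PHI in a development environment is keyed (HMAC) hashing, sketched below. The field list and key handling are assumptions for illustration, not Geisinger's actual masking rules:

```python
import hashlib
import hmac

# Fields treated as PHI in this sketch; the real masking rules are
# defined by data governance, not by this list.
PHI_FIELDS = {"name", "ssn", "mrn"}

def mask_phi(record: dict, secret: bytes) -> dict:
    """Replace PHI values with a keyed hash so dev datasets stay
    joinable (same input -> same token) but not re-identifiable
    without the key."""
    masked = {}
    for field, value in record.items():
        if field in PHI_FIELDS:
            digest = hmac.new(secret, str(value).encode(), hashlib.sha256)
            masked[field] = digest.hexdigest()[:16]
        else:
            masked[field] = value
    return masked

rec = {"mrn": "000123", "name": "DOE JOHN", "dx_code": "J98.4"}
masked = mask_phi(rec, b"dev-environment-key")
```

Keyed hashing (rather than plain hashing) matters here: without the secret, an attacker cannot rebuild the token table from known identifiers.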
24. • Bundled Payments Care Initiative
• Data Model
• De-identification of PHI/BSI
• Natural Language Processing
• Sepsis
• O.R. Workflows
• Bactec
• Social Security Death File
• Supply Chain
• Registries
• MPOG, AAA, Ortho Infection, Ortho Trauma
26. • Problem
• Patients with lung nodules found on imaging are lost to follow-up
• Solution
• Ingestion of data from radiology imaging notes
• NLP
• Value
• Identify lung nodules
27. LUNG NODULES – TEXT ANALYTICS WORKFLOW
Radiology notes (~10 million) → NLP and dictionary annotator (annotates with UMLS concept codes) → lung nodule filter annotator (identifies lung nodule notes)
Lung nodule in note? NO: excluded (~9.7 million notes). YES: continue (~300 thousand notes).
→ Negation annotator → Measurement/Lung-RADS calculator
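The filter and negation steps can be sketched with a toy Python version. The production pipeline uses cTAKES and UMLS dictionaries and builds a grammar tree for negation; the regex and negation cues below are deliberate simplifications:

```python
import re

NODULE_PATTERN = re.compile(r"\b(pulmonary|lung)\s+nodule", re.IGNORECASE)
# Toy negation cues; cTAKES associates negation via a grammar tree,
# not a simple prefix check like this.
NEGATION_CUES = ("no ", "without ", "negative for ")

def mentions_nodule(note: str) -> bool:
    """Filter-annotator step: does the note mention a lung nodule?"""
    return bool(NODULE_PATTERN.search(note))

def is_negated(note: str) -> bool:
    """Negation-annotator step (crude): is the nodule mention preceded
    by a negation cue within the same sentence?"""
    for sentence in re.split(r"[.;]", note):
        match = NODULE_PATTERN.search(sentence)
        if match:
            prefix = sentence[: match.start()].lower()
            if any(cue in prefix for cue in NEGATION_CUES):
                return True
    return False

notes = [
    "CT chest: 6 mm lung nodule in the right upper lobe.",
    "No pulmonary nodule identified.",
    "Degenerative changes of the spine.",
]
# Keep only notes that mention a nodule and are not negated.
positives = [n for n in notes if mentions_nodule(n) and not is_negated(n)]
```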
30. • Problem
• Patients with AAA (abdominal aortic aneurysm) are lost to follow-up
• Solution
• Ingestion of data from radiology imaging notes
• Use NLP and care-gap closure technologies
• Value
• Ensure proper follow-up
32. • Use case
• Provide capabilities to perform retrospective analysis of OR data
• Solution
• Ingest key data elements and metrics into a data model on Hadoop
• Provide advanced visualization and drill down capabilities using Tableau
• Value
• Improve OR utilization and quality of care using learnings from retrospective
analysis
33. • Scheduled vs Actual Analysis
• OR Staff Summary Information
• Various filters to slice and dice
the data in different ways
• Next day data availability
34. • Use case
• Understand the supply costs associated with OR procedures and variance by
provider/service/location
• Solution
• Ingest key data elements from EMR, Billing and Supply Chain systems
• Provide advanced visualization and drill down capabilities using Tableau
• Value
• Identify areas of greatest potential variance/opportunity to manage costs
• Opportunities for isolation of data issues, best practices across platforms, supply chain cost optimization, and process improvement
35. • Compare supply cost for multiple
providers for same procedure
• Cost band indicates +/- 1 standard
deviation
• Compare cost for same procedure
by surgical role
36. • Heatmap of cost variance across all
service lines
• Heatmap of cost variance by
service lines
• Can be filtered by lead procedures
per case
• Drill down capability to show
implants/explants and supply cost
per procedure and per case
Editor's Notes
Brief introduction about Geisinger
EHR in mid-90s. By 2006, leadership wanted EDW. CDIS (clin dec intel syst) live in 2008. Big win early. Few Healthcare orgs had this integration platform at this time. Internally, depts. (research) no longer had to request extracts from Epic for analytics.
One platform of data (clin, fin, claims) for analytics, to transform the delivery of care. It has gone through a number of iterations, and currently supports much of the analytics running our day-to-day operations. Over 2100 users.
2012, switched to TD (higher performance).
2016, UDA. Integrate all key analytics platforms (Hadoop, Cerner, Epic EDW)
Next phase of our analytics platform: Hadoop (Big Data)
Late binding of Hadoop allows for the data to simply be loaded without detailed analysis and preparation up-front.
Our multi-zoned Hadoop system allows for many views of the data, including temporal, modeled, etc.
Hadoop is not confined to structured data in discrete fields, as is the case with traditional analytic platforms.
LDAP and AD Integration using Ranger/Knox
Encryption at rest
SSL endpoint encryption active for all network connections
Kerberos Authentication: To thwart impersonation threats
Appropriate access and roles as required. These roles will continue to be defined by the Data Manager or his designate
All PHI data will be masked in the Development environment
Less costly hardware for storing increasing data (structured and unstructured)
$5 million to purchase new Teradata hardware
Prevent “one-off” data systems (e.g. IoT data capture, ICU real-time data capture, Cybersecurity)
Lung nodules are commonly identified in free text within radiology reports and can easily be lost to follow up with potential for delayed cancer diagnosis.
A treasure trove of useful, relevant, and unstructured clinical information in the form of text blobs and semi-templated data is locked inside EHRs.
We used Apache Solr to expose the data and let users perform rapid search.
The ability to sort through over 184M clinical notes across 20 years’ worth of in/outpatient records.
Serves as a framework to run cTAKES and other Natural Language Processing programs to find signal in the text noise and make the data actionable.
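A rapid note search of the kind described would go through Solr's standard select API. A minimal sketch of building such a query; the core name (`clinical_notes`), field name (`note_text`), and host are illustrative assumptions:

```python
def build_note_query(phrase: str, rows: int = 10) -> dict:
    """Build parameters for a Solr /select request that phrase-matches
    the note body. Field and parameter choices are illustrative."""
    return {
        "q": f'note_text:"{phrase}"',  # phrase query against the note text field
        "rows": rows,                  # page size
        "wt": "json",                  # JSON response writer
    }

params = build_note_query("lung nodule", rows=25)
# The request itself would look something like:
# requests.get("http://solr-host:8983/solr/clinical_notes/select", params=params)
```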
UMLS: Unified Medical Language System
Negations
Nearly 30% of identified lung nodule notes are negative results.
The NLP engine constructs a grammar tree and associates negation words with the identified lung nodule text.
Calculate Lung RADS scores based on nodule size and description
Future tasks
Measure accuracy of predicted Lung-RADS scores and improve performance