4. Mercy is the 5th largest Catholic health system in the U.S.
serving in 140 communities over a multi-state footprint through several touch
points including outreach ministries and virtual care.
35 Acute Care Hospitals
700 Clinic and Outpatient Facilities
2,100 Mercy Clinic Physicians
4,231 Acute Licensed Beds
40,000 Co-workers
22,486 Births
158,768 Acute Inpatient Discharges
150,595 Surgeries (In/Outpatient)
650,702 Emergency Visits
8,361,683 Outpatient Visits
$4.48 billion Operating Revenue
14. Looks like the same reporting database…
smells a lot better!
• Use existing skills (SQL)
• Reuse data model expertise
• Large batch jobs run faster
• Large analytics run faster
• Near real-time updates
20. Free text search on lab results
Inventory data archival
Medical documentation improvement
EMR audit trail archival and reporting
o Text extraction and analytics
o Device data
o Real-time alerting
o New Integrated Patient Database
Here we are, Paul & Adam.
Paul’s been with Mercy for more than 8 years, and has been working in data warehousing, business intelligence, analytics, etc. for more than 15 years. He’s currently serving as a director in the data engineering and analytics group, focusing his energy on Mercy’s big data implementation and analytics strategy.
Adam is Mercy’s technical lead for big data projects. He’s been with Mercy for more than 2 years, and has been in software development and consulting for more than 15 years.
Paul and Adam both live in St. Louis, MO.
Here are some facts about who Mercy is.
You might know the name Mercy from other places around the country, but if you aren’t near one of these dots, then it’s a different organization. We have a common heritage, but don’t have any business relationships related to the name.
The point is that we’re a fairly large healthcare system. In fact, we’re in the very top tier of customers for our EHR vendor, Epic. We’re large enough that we have to have three separate installations of their software to support our size, and one of those is the largest single installation that Epic had ever done at the time (7 years ago).
Virtual Care is one of our largest initiatives right now, building on a successful history with the nation’s largest centralized electronic ICU monitoring service. Remote monitoring and virtual access to specialists is a significant part of Mercy’s growth and commercialization strategy… which will obviously lead to even more data for us to play with.
We’re here to talk about Hadoop. We do that.
We’re a year into our first real Hadoop project.
We spent a couple of years before that doing proofs of concept and looking for the first solid business case.
Why do we do that?
Major problems we’re trying to address:
EHR gives us data too late.
Many of the use cases involved TODAY’S data, not yesterday’s.
We didn’t feel like we could do that on our existing data and reporting systems without huge investment.
Hadoop gave us a way to move into the real-time space at a lower cost.
It’s because our current analytical systems are simply too slow, both at getting data and at analyzing it.
We have a pretty traditional assortment of operational data store and data mart kinds of data structures. Our largest single database is a structure provided by our EHR vendor. And we have well over 100 report developers across the ministry who have been trained on that data model and write SQL or Crystal Reports against that database.
But that data is always going to be at least a day behind reality.
And over time, that database hasn’t been able to keep up with the increasing demands.
It’s over 26 TB (with the largest single dataset, the audit trail, already pulled out onto a different platform).
Users regularly wait 15+ minutes for their standard reports to run.
Those factors and the cost of continuing to grow and improve performance make it an unsustainable long-term solution. We felt like Hadoop gave us a place to build and scale solutions much more affordably, and more flexible tools for bringing in low-latency data.
Sepsis example
The batch side of our architecture is
To make it easier to add new tables into our RDBMS synchronization process, we built a configuration-driven utility that works off of a few assumptions.
1) We can sqoop data out of source tables (or receive data files from the source) either in total or using a last update timestamp.
2) Every table has a primary key
3) If we have to handle deletes, they come to us in a separate file
4) Otherwise we can do a pseudo-upsert (using delete / insert; or merge/replace)
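The assumptions above can be sketched in a few lines. This is a minimal illustration only, not our actual utility: the function name `pseudo_upsert`, the `id` key column, and the sample rows are all hypothetical, and it assumes deletes are applied after the incoming changes. It shows the merge / replace idea: rebuild the whole table from the old rows, the changed rows, and the delete list, keyed on the primary key.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def pseudo_upsert(base: List[Row], incoming: Iterable[Row],
                  deletes: Iterable[object], key: str = "id") -> List[Row]:
    """Rebuild a table from base rows, changed rows, and a list of deleted keys."""
    # Index the existing rows by primary key (assumption 2: every table has one).
    merged = {row[key]: row for row in base}
    # Incoming rows replace existing rows with the same key, or are added as new.
    for row in incoming:
        merged[row[key]] = row
    # Deletes arrive separately (assumption 3) and are applied last in this sketch.
    for deleted_key in deletes:
        merged.pop(deleted_key, None)
    return [merged[k] for k in sorted(merged)]


# Hypothetical sample data.
base = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]
incoming = [{"id": 2, "name": "B2"}, {"id": 4, "name": "D"}]
deletes = [3]
result = pseudo_upsert(base, incoming, deletes)
```

Note that the whole table is rewritten even when only a few rows change, so the old and new copies both exist while the rebuild runs.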
Challenges with this process:
No upsert in Hive
Takes a lot of extra space to do that kind of merge / replace
Not all tools support the right data file types to make this really efficient (ORC)
The real-time process is a bit more complicated. After receiving the data from the EMR (as small batch files), the process has three phases.
First is the translation of the data format. What we receive is a somewhat complex variable-length / variable-format record type. So, we have to have several rules for interpreting incoming records.
Second is the conversion of data from the EMR semantics into the reporting database semantics. There are thousands of surrogate key translations and foreign key mappings that happen. Luckily, the metadata on how these should occur is maintained in database tables by the EMR vendor.
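As a rough sketch of that second phase, a metadata-driven translation might look something like the following. Everything here is hypothetical: the field names, the `FIELD_RULES` and `SURROGATE_KEYS` tables, and the key values are invented for illustration and are not the vendor's actual metadata. The point is only that the mapping rules live in data, not in code.

```python
from typing import Dict, Optional, Tuple

# Hypothetical metadata: source field -> (target column, key domain or None).
# A key domain means the value must be swapped for a surrogate key.
FIELD_RULES: Dict[str, Tuple[str, Optional[str]]] = {
    "pat_id": ("patient_key", "patient"),
    "dept_id": ("department_key", "department"),
    "result_value": ("result_value", None),
}

# Hypothetical surrogate key lookup: (key domain, source id) -> reporting key.
SURROGATE_KEYS: Dict[Tuple[str, str], int] = {
    ("patient", "Z1001"): 501,
    ("department", "CARD"): 12,
}


def to_reporting(record: Dict[str, str]) -> Dict[str, object]:
    """Rename fields and translate source ids using the metadata tables."""
    out: Dict[str, object] = {}
    for field, value in record.items():
        target, domain = FIELD_RULES[field]
        # Plain values pass through; keyed values are looked up by domain.
        out[target] = SURROGATE_KEYS[(domain, value)] if domain else value
    return out


converted = to_reporting(
    {"pat_id": "Z1001", "dept_id": "CARD", "result_value": "7.2"}
)
```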
Finally, the data is mapped into the appropriate target tables and stored in corresponding HBase tables.
It’s important to note that our real-time updates only come as cell-level information. The source system doesn’t transmit full records, only those fields that have been updated or populated. So, to use our real-time data store, we have to merge the full records from the nightly batch with the individual cell updates from the real-time process.
So, we use Hive/HBase integration to bring the real-time data into Hive.
Then we use Hive views to merge the two datasets together.
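To make the merge concrete, here is a small Python sketch of the logic such a view has to express. It is not our Hive view definition: the `merge_view` name, the `(row key, column, value, timestamp)` tuple layout, and the sample data are assumptions made for illustration. Full records come from the nightly batch, and each cell-level update overwrites a single column, with the latest timestamp winning.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
# Hypothetical cell update layout: (row key, column name, new value, timestamp).
CellUpdate = Tuple[object, str, object, int]


def merge_view(batch_rows: List[Row], cell_updates: Iterable[CellUpdate],
               key: str = "id") -> Dict[object, Row]:
    """Overlay cell-level real-time updates on full records from the batch."""
    # Start from copies of the full nightly records, indexed by primary key.
    merged = {row[key]: dict(row) for row in batch_rows}
    # Apply updates oldest first so the most recent value for a cell wins.
    for row_key, column, value, _ts in sorted(cell_updates, key=lambda u: u[3]):
        # A row seen only in real time starts with just its key populated.
        merged.setdefault(row_key, {key: row_key})[column] = value
    return merged


# Hypothetical sample data.
batch = [{"id": 1, "status": "admitted", "bed": "4A"}]
updates = [
    (1, "bed", "5B", 10),
    (1, "bed", "6C", 20),
    (2, "status", "admitted", 15),
]
current = merge_view(batch, updates)
```

Row 2 in the example exists only in the real-time feed, so until the next nightly batch it carries only the columns that have been updated.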
Not shown here, we also have a complicated way to distinguish a field that has been set to NULL from a field that simply hasn’t seen a real-time update.
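We don't show our own approach, but one common way to handle this kind of problem is a sentinel value. The sketch below is an assumption, not what we actually run: `NULL_MARKER` and `resolve` are hypothetical names. The real-time store writes an explicit marker when a field is cleared, so that an absent cell can safely mean "no update yet".

```python
from typing import Optional

# Hypothetical sentinel written by the real-time process when a field is cleared.
NULL_MARKER = "\x00NULL"


def resolve(batch_value: Optional[str],
            realtime_value: Optional[str]) -> Optional[str]:
    """Pick the current value of one cell from the batch and real-time stores."""
    if realtime_value is None:
        # No real-time cell exists, so the nightly batch value still stands.
        return batch_value
    if realtime_value == NULL_MARKER:
        # The source explicitly cleared the field since the last batch.
        return None
    return realtime_value


unchanged = resolve("4A", None)
cleared = resolve("4A", NULL_MARKER)
updated = resolve("4A", "5B")
```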