Big Data Analytics in Hospitals
By Dr. Mahboob Ali Khan, PhD
Big data is generating a lot of hype in every
industry including healthcare. As my
colleagues and I talk to leaders at health
systems, we’ve learned that they’re looking
for answers about big data. They’ve heard
that it’s something important and that they
need to be thinking about it. But they don’t
really know what they’re supposed to do
with it. So they turn to us with questions
like:
When will I need big data?
What should I do to prepare for big data?
What’s the best way to use big data?
What is Health Catalyst doing with big
data?
This piece will tackle such questions head-on. It’s important to separate the reality
from the hype and clearly describe the place of big data in healthcare today, along
with the role it will play in the future.
Big Data in Healthcare Today
A number of use cases in healthcare are well suited for a big data solution. Some
academic- or research-focused healthcare institutions are either experimenting with
big data or using it in advanced research projects. Those institutions draw upon data
scientists, statisticians, graduate students, and the like to wrangle the complexities of
big data. In the following sections, we’ll address some of those complexities and
what’s being done to simplify big data and make it more accessible.
A Brief History of Big Data in Healthcare
In 2001, Doug Laney, now at Gartner, coined "the 3 V's" to define big data:
Volume, Velocity, and Variety. Other analysts have argued that this is too
simplistic, and that there are more things to think about when defining big data.
They suggest more V's, such as Variability and Veracity, and even a C for
Complexity. We'll stick with the simpler 3 V's definition for this piece.
In healthcare, we do have large volumes of data coming in. EMRs alone collect huge
amounts of data. Most of that data is collected for recreational purposes, according to
Brent James of Intermountain Healthcare. But neither the volume nor the velocity of
data in healthcare is truly high enough to require big data today. Our work with
health systems shows that only a small fraction of the tables in an EMR database
(perhaps 400 to 600 tables out of thousands) are relevant to the current practice of
medicine and its corresponding analytics use cases. So, the vast majority of the data
collection in healthcare today could be considered recreational. Although that data
may have value down the road as the number of use cases expands, there aren’t
many real use cases for much of that data today.
There is certainly variety in the data, but most systems collect very similar data
objects with an occasional tweak to the model. That said, new use cases supporting
genomics will certainly require a big data approach.
Health Systems Without Big Data
Most health systems can do plenty today without big data, including meeting most of
their analytics and reporting needs. We haven’t even come close to stretching the
limits of what healthcare analytics can accomplish with traditional relational
databases—and using these databases effectively is a more valuable focus than
worrying about big data.
Currently, the majority of healthcare institutions are swamped with some very
pedestrian problems such as regulatory reporting and operational dashboards. Most
just need the proverbial "air and water" right now. But once basic needs are met
and some of the initial advanced applications are in place, new use cases will
arrive (e.g., wearable medical devices and sensors), driving the need for
big-data-style solutions.
Barriers Exist for Using Big Data in Healthcare
Today
Several challenges with big data have yet to be addressed in the current big data
distributions. Two roadblocks to the general use of big data in healthcare are the
technical expertise required to use it and a lack of robust, integrated security
surrounding it.
Expertise
The value for big data in healthcare today is largely limited to research because
using big data requires a very specialized skill set. Hospital IT experts familiar with
the SQL programming language and traditional relational databases aren't prepared for
the steep learning curve and other complexities surrounding big data.
In fact, most organizations need data scientists to manipulate and get data out of a
big data environment. These are usually Ph.D.-level thinkers with significant
expertise—and typically, they’re not just floating around an average health system.
These experts are hard to come by and expensive, and only research institutions
usually have access to them. Data scientists are in huge demand across industries
like banking and internet companies with deep pockets.
The good news is that, thanks to changes in tooling, people with less-specialized
skill sets will be able to work with big data more easily in the future. Big data
is coming to embrace SQL as the lingua franca for querying, and when that happens,
it will become useful in a health system setting.
Security
In healthcare, HIPAA compliance is non-negotiable. Nothing is more important than
the privacy and security of patient data. But, frankly, there aren’t many good,
integrated ways to manage security in big data. Although security is coming along, it
has been an afterthought up to this point. And for good reason. If a hospital only has
to grant access to a couple of data scientists, it really doesn’t have too much to worry
about. But when opening up access to a large, diverse group of users, security
cannot be an afterthought.
Healthcare organizations can take some steps today to ensure better security of big
data. Big data runs on open source technology, and the security surrounding it is inconsistent.
To avoid big problems, organizations should be selective about big data vendors and
avoid assuming that any big data distribution they select will be secure.
The best option for healthcare organizations looking to implement big data is to
purchase a well-supported, commercial distribution rather than starting with a raw
Apache distribution. Another option is to select a cloud-based solution like Azure
HDInsight to get started quickly. An example of a company with a well-supported,
secure distribution is Cloudera. This company has created a Payment Card Industry
(PCI) compliant Hadoop environment supporting authentication, authorization, data
protection, and auditing. Other commercial distributions are surely working hard
to add more sophisticated security that will be well suited for HIPAA compliance
and the other security requirements unique to the healthcare industry.
Big Data Differs from the Databases Currently
Used in Healthcare
Big data differs from a typical relational database. This is obvious to a CIO or an IT
director, but a brief explanation of how the two systems differ will show why big data
is currently a work in progress—yet still holds so much potential.
Big Data Has Minimal Structure
The biggest difference between big data and relational databases is that big data
doesn’t have the traditional table-and-column structure that relational databases
have. In classic relational databases, a schema for the data is required (for example,
demographic data is housed in one table joined to other tables by a shared identifier
like a patient identifier). Every piece of data exists in its well-defined place. In
contrast, big data has hardly any structure at all. Data is extracted from source
systems in its raw form and stored in a massive, somewhat chaotic distributed file
system. The Hadoop Distributed File System (HDFS) stores data across multiple
data nodes in a simple hierarchical form of directories of files. Conventionally, data is
stored in 64 MB chunks (files) on the data nodes with a high degree of compression.
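To make that chunked, distributed layout concrete, here is a minimal sketch. This is purely illustrative Python, not HDFS code; the node names and the round-robin placement are simplifying assumptions (real HDFS also replicates each block to several nodes):

```python
# Illustrative sketch (not HDFS itself): split a byte stream into
# fixed-size blocks and assign each block to a data node round-robin,
# the way HDFS distributes 64 MB chunks across a cluster.

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS's conventional 64 MB block size

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield fixed-size blocks of `data`; the last block may be shorter."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]

def assign_blocks(data: bytes, data_nodes: list, block_size: int = BLOCK_SIZE):
    """Map each block index to a (node, block length) pair, round-robin."""
    placement = {}
    for i, block in enumerate(split_into_blocks(data, block_size)):
        placement[i] = (data_nodes[i % len(data_nodes)], len(block))
    return placement

if __name__ == "__main__":
    # A toy 150-byte "file" with a 64-byte block size stands in for a
    # multi-gigabyte file with 64 MB blocks.
    placement = assign_blocks(b"x" * 150, ["node-1", "node-2", "node-3"],
                              block_size=64)
    print(placement)  # blocks 0 and 1 hold 64 bytes; block 2 the 22-byte remainder
```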
Big Data Is Raw Data
By convention, big data is typically not transformed in any way. Little or no
"cleansing" is done, and generally no business rules are applied. Some people refer
to this raw data in terms of the "Sushi Principle" (i.e., data is best when it's raw,
fresh, and ready to consume). Interestingly, the Health Catalyst Late-Binding™ Data
Warehouse follows the same principles. This approach doesn't transform data, apply
business rules, or bind the data semantically until the last responsible moment; in
other words, it binds as close to the application layer as possible.
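The late-binding idea can be sketched in a few lines. This is an illustrative toy, not Health Catalyst code: the raw records and the blood-pressure rule are made up for the example, and the point is only that raw data stays untouched while a business rule is applied at the last moment, when an analysis-ready mart is built:

```python
# Toy late-binding sketch: source data is stored raw, and a semantic rule
# (here, a hypothetical hypertension threshold) is applied only at query
# time, when a data mart is packaged for an analysis.

RAW_SOURCE_MART = [
    {"patient_id": 1, "sbp": 118},   # systolic blood pressure, raw from source
    {"patient_id": 2, "sbp": 152},
    {"patient_id": 3, "sbp": 141},
]

def bind_hypertension_rule(record, threshold=140):
    """Apply a business rule at query time; the raw record is untouched."""
    bound = dict(record)
    bound["hypertensive"] = record["sbp"] >= threshold
    return bound

def build_data_mart(raw_records, rule):
    """Package raw data into an analysis-ready mart at the last moment."""
    return [rule(r) for r in raw_records]

mart = build_data_mart(RAW_SOURCE_MART, bind_hypertension_rule)
print([r["hypertensive"] for r in mart])  # [False, True, True]
```

Note that the source mart is never modified; if the rule changes (say, a new threshold), only the mart is rebuilt.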
Big Data Is Less Expensive
Due to its unstructured nature and open source roots, big data is much less
expensive to own and operate than a traditional relational database. A Hadoop
cluster is built from inexpensive, commodity hardware, and it typically runs on
traditional disk drives in a direct-attached (DAS) configuration rather than an
expensive storage area network (SAN). Most relational database engines are
proprietary software and require expensive licensing and maintenance agreements.
Relational databases also require significant, specialized resources to design,
administer, and maintain. In contrast, big data doesn't need much design work and
is fairly simple to maintain. Extensive storage redundancy makes hardware failures
more tolerable, and Hadoop clusters are designed to simplify rebuilding failed
nodes.
Big Data Has No Roadmap
The lack of pre-defined structure means a big data environment is cheaper and
simpler to create. So what’s the catch? The difficulty with big data is that it’s not
trivial to find needed data within that massive, unstructured data store. A structured
relational database essentially comes with a roadmap—an outline of where each
piece of data exists. On the big data side, there are no traditional schemas, and
therefore not much guidance. With a relational database, a simple, structured query
language (i.e. SQL) pulls the needed data using a sophisticated query engine
optimized for finding data.
With big data, the query languages are much more complicated. A sophisticated
data user—such as a data scientist—is needed to find the subset of data required for
applications. Creating the required MapReduce algorithms for querying big data
instances isn't for the faint of heart. Fortunately, that's changing at a fairly rapid
pace with tools like SparkSQL and other query tools that leverage conventional SQL.
Some big data query engines can now convert SQL queries into MapReduce jobs, while
others, like Microsoft PolyBase, can join queries across a traditional relational
database and Hadoop and return a single result set.
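To give a feel for what such engines generate behind a simple SQL GROUP BY, here is a toy MapReduce pass in plain Python. The encounter records are invented for the example, and the three explicit phases are a single-machine sketch of what a real cluster distributes across many nodes:

```python
# A toy MapReduce pass: the hand-written equivalent of
#   SELECT dx, COUNT(*) FROM encounters GROUP BY dx;

from collections import defaultdict

encounters = [
    {"dx": "hypertension"}, {"dx": "diabetes"},
    {"dx": "hypertension"}, {"dx": "asthma"},
]

# Map phase: emit a (key, 1) pair for every record.
mapped = [(e["dx"], 1) for e in encounters]

# Shuffle phase: group the pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'hypertension': 2, 'diabetes': 1, 'asthma': 1}
```

Writing this by hand for every query is exactly the burden that SQL-on-Hadoop tools remove.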
In short, big data is cheap but more difficult to use. Relational databases are
expensive but very usable. The maturity level of big data technology is still low;
after all, the big data journey began only a few short years ago. As the tooling and
security catch up with its potential, health systems will be able to do exciting
things with it.
It’s Coming: Big Data Will Be Important in
Healthcare
When healthcare organizations envision the future of big data, they often think of
using it for analyzing text-based notes. Current analytics technologies for the most
part make use of discrete data and struggle to capitalize on all of the valuable clinical
information captured in physicians’ and nurses’ notes. Big data indexing techniques,
and some of the new work finding information in textual fields, could indeed add real
value to healthcare analytics in the future.
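As a small illustration of the indexing idea, the sketch below builds a toy inverted index over hypothetical note text, mapping each word to the notes that contain it. Real clinical text processing is far more involved (negation, abbreviations, medical vocabularies), but the core lookup structure is the same:

```python
# Toy inverted index over free-text notes: each word maps to the set of
# note IDs containing it, enabling fast "which notes mention X?" queries.

notes = {
    101: "patient reports chest pain and shortness of breath",
    102: "no chest pain today, breathing improved",
    103: "follow-up for hypertension medication",
}

index = {}
for note_id, text in notes.items():
    for word in text.lower().replace(",", " ").split():
        index.setdefault(word, set()).add(note_id)

print(sorted(index["chest"]))         # [101, 102]
print(sorted(index["hypertension"]))  # [103]
```

Note 102 shows why indexing alone isn't enough: it matches "chest pain" even though the note says "no chest pain," which is why negation handling matters in clinical NLP.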
Big Data and the Internet of Things
Big data will really become valuable to healthcare in what’s known as the internet of
things (IoT). SAS describes the IoT as:
The Internet of Things is a growing network of everyday objects from industrial
machines to consumer goods that can share information and complete tasks while
you are busy with other activities, like work, sleep, or exercise. Soon, our cars, our
homes, our major appliances, and even our city streets will be connected to the
Internet, creating this network of objects that is called the Internet of Things, or IoT
for short. Made up of millions of sensors and devices that generate incessant
streams of data, the IoT can be used to improve our lives and our businesses in
many ways.
The analyst firm Gartner projects that by 2020 there will be more than 25 billion
connected devices in the IoT. For healthcare, any device that generates data about a
person’s health and sends that data into the cloud will be part of this IoT. Wearables
are perhaps the most familiar example of such a device. Many people now wear a
fitness device that tracks how many steps they've taken, their heart rate, their
weight, and how it's all trending. Smartphone apps track how often and how
intensely a user exercises. And medical devices can send data into the cloud as
well: blood pressure monitors, pulse oximeters, glucose monitors, and much, much
more.
Big Data and Care Management
Accountable care organizations (ACOs) focus on managed care and want to keep people at home and out of the
hospital. Sensors and wearables will collect health data on patients in their homes
and push all of that data into the cloud. Electronic scales, BP monitors, SpO2
sensors, proximity sensors like iBeacon, and soon-to-be-invented sensors will blast
data from millions of patients continually. Healthcare institutions and care managers,
using sophisticated tools, will monitor this massive data stream and the IoT to keep
their patients healthy.
And all of this disparate sensor data will come into healthcare organizations at an
unprecedented volume and velocity. In a healthcare future predicated on keeping
people out of the hospital, a health system’s ability to manage all this data will be
crucial. These volumes of data are best managed as streams coming into a big data
cluster. As the data streams in, organizations will need to be able to identify any
potential health issues and alert a care manager to intervene. For example, if a
patient’s blood pressure spikes, the system will send an alert in real time to a care
manager who can then interact with the patient to get his blood pressure back into a
healthy range.
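The alerting loop just described can be sketched as follows. The threshold, patient IDs, and stream of readings are invented for illustration and are not clinical guidance; a production system would consume a real message stream rather than a Python list:

```python
# Minimal streaming-alert sketch: scan incoming (patient_id, systolic)
# readings and notify a (hypothetical) care manager when one crosses a
# threshold.

SYSTOLIC_ALERT_THRESHOLD = 180  # illustrative cutoff, not clinical guidance

def monitor(readings, alert):
    """Scan a stream of (patient_id, systolic) readings; fire alerts."""
    for patient_id, systolic in readings:
        if systolic >= SYSTOLIC_ALERT_THRESHOLD:
            alert(patient_id, systolic)

alerts = []
stream = [("p1", 122), ("p2", 185), ("p1", 131), ("p3", 190)]
monitor(stream, lambda pid, sbp: alerts.append((pid, sbp)))
print(alerts)  # [('p2', 185), ('p3', 190)]
```

At IoT scale, the same shape runs continuously over a distributed stream, which is where big data clusters earn their keep.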
Big data is the only hope for managing the volume, velocity, and variety of this
sensor data.
The Fun Stuff: Using Big Data for Predictive
Analytics, Prescriptive Analytics, and Genomics
Real-time alerting is just one important future use of big data. Another is predictive
analytics. The use cases for predictive analytics in healthcare have been limited up
to the present because we simply haven’t had enough data to work with. Big data
can help fill that gap.
One example of data that can play a role in predictive analytics is socioeconomic
data. Socioeconomic factors influence patient health in significant ways.
Socioeconomic data might show that people in a certain zip code are unlikely to
have a car. There is a good chance, therefore, that a patient in that zip code who has
just been discharged from the hospital will have difficulty making it to a follow-up
appointment at a distant physician’s office. (Health systems have, in fact, found that
it is cheaper to send a taxi to pick a patient up for an appointment than it is for her to
miss the appointment and be readmitted to the hospital.)
This and similar data can help organizations predict missed appointments,
noncompliance with medications, and more. That is just a small example of how big
data can fuel predictive analytics. The possibilities are endless.
Patient Flight Paths and Prescriptive Analytics
Another use for predictive analytics is predicting the "flight path" of a patient.
Leveraging historical data from other patients with similar conditions, predictive
algorithms can be created using programming languages such as R and big data
machine learning libraries to faithfully predict the trajectory of a patient over time.
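As a deliberately simple stand-in for such models, the sketch below fits a least-squares line to a short series of hypothetical readings and extrapolates the next point. The data is made up, and real flight-path models would use far richer features and proper machine learning libraries:

```python
# Toy trajectory prediction: ordinary least squares over a patient's past
# measurements, extrapolated one step forward.

def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares; return (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

# Hypothetical weekly readings trending upward.
weeks = [0, 1, 2, 3]
values = [6.0, 6.2, 6.4, 6.6]
a, b = fit_line(weeks, values)
projected = a * 4 + b  # predicted value at week 4
print(round(projected, 1))  # 6.8
```

The clinical value comes from acting on the projection: a rising trajectory is the trigger for the interventions described next.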
Once we can accurately predict patient trajectories, we can shift to the Holy Grail:
prescriptive analytics. Intervening to interrupt the patient's trajectory and set him
on the proper course will become a reality very soon. Big data is well suited for
these futuristic use cases.
Genomic Sequencing and Big Data
As someone who's spent many years working on the Human Genome Project, I am
personally very excited about the increasing use of genomic data in patient
treatment. The cost of sequencing an individual’s full genome has plunged in recent
years. Sequencing, once an art, will soon become commonplace and eventually
become a commodity lab test. Genomic sequences are huge files and the analysis of
genomes generates even more data. Again, big data serves this use case well.
Loading a genetic sequence into a relational database would require a huge
Character Large Object (CLOB) or separate storage just to manage the sequence.
With big data, you just toss it into the Hadoop cluster, and it's ready for analysis.
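As a tiny example of why raw sequence data suits a schema-less, file-oriented store: a typical analysis is just a streaming pass over characters, with no tables or joins needed. Here is a toy GC-content calculation over a made-up sequence:

```python
# Toy sequence analysis: GC content is the fraction of bases that are
# G or C, computed in a single pass over the raw string.

def gc_content(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGCGCTA"))  # 0.5
```

The same pass scales to full genomes by streaming file chunks through the cluster instead of loading one string into memory.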
The Future of Healthcare Data Warehousing and
the Transition to Big Data
I’ve talked about the present limitations for big data in healthcare and the truly
fascinating future possibilities that big data enables. An important question to
address at this point is, of course, this: What should a health system do in the
meantime? Today, health systems’ need for data-driven quality and cost
improvement is urgent. Healthcare organizations cannot afford to wait for big data
technology to mature before diving into analytics. The important factor will be
choosing a data warehousing solution that can easily adapt to the future of big data.
A Late-Binding™ enterprise data warehouse (EDW) architecture is ideal for making
the transition from relational databases to unstructured big data. As stated earlier,
the late-binding approach is, in fact, very similar to the big data approach. In a Late-
Binding EDW like Health Catalyst’s, data from source systems (EHRs, financial
systems, etc.) are placed into source marts. In this process—as in big data—it is
best practice to keep the data as raw as possible, relying on the natural data models
of the source systems. As much as possible, late-binding methods minimize
remodeling data in the source marts until the analytic use case requires it. The data
remains in its raw state until someone needs it. At that point, analysts package the
data into a separate data mart and apply meaning and semantic context so that
effective analysis can occur. Because this approach is so similar to big data, it is a
natural transition to replace the source-mart layer of the EDW architecture with a big
data cluster.
Real-World Example: Healthcare's Transition to Big Data
In conclusion, here is a brief example of how the transition from relational databases
to big data is happening in the real world. We, at Health Catalyst, are working with
one of our large health system clients and Microsoft to create a massively parallel
data warehouse in a Microsoft APS Appliance that also includes a Hortonworks
Hadoop Cluster. This means we can run a traditional relational database and a big
data cluster in parallel. We can query both data stores simultaneously, which
significantly improves our data processing power. Together, we are beginning to
experiment with big data in important ways, such as performing natural language
processing (NLP) with physician notes, predictive analytics, and other use cases.