healthDB: A Primer
Parag Patel, Shahid Shah
healthDB is an incrementally scalable, fault-tolerant, ACID-compliant, document-based
key/value database designed to hold very large amounts of data with high read/write
throughput and high availability. It is based on an open source project
called couchDB. It is designed to be a data warehouse for the disparate systems that
might be part of a healthcare practice or hospital. Because its data storage is
semi-structured, it can hold data of any type. The end user need not worry about
structuring the data in the data warehouse; the data is stored in the warehouse for
future extraction and structuring as the user sees fit. Future versions of healthDB will
help the end user structure data from its semi-structured state. Conceptually,
one can think of lazy evaluation in Scheme, Lisp, or Haskell. Once the user knows the
structure they want to put the data in, it will be a cinch to implement the structure in
The design of the database follows a “just works” philosophy: the database should
work as advertised. The end user should only have to worry about building their
application or service, not about the storage of their data or its
performance. Most of the work traditionally done by a DBA will be done by
healthDB. All the end user has to do is start it up initially and add servers when
healthDB indicates they are needed in order to scale. healthDB will have a connector engine that
connects to common interfaces such as HL7, JMS, ODBC, and various delimited file formats,
and that allows custom connectors to be developed for unusual interfaces.
In the future, healthDB will support a health query language (HQL), as either an external or
internal component (to be determined), which will allow users to search all their structured and
semi-structured data to find the knowledge they seek in a health domain. healthDB will come
with some sample applications to show end users the power it holds.
healthDB uses couchDB primarily to take care of low-level storage. It
communicates with couchDB using encrypted REST (couchDB might need to be modified to
support encryption). A diagram shows the basic outline of healthDB.
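As a rough illustration of what talking to couchDB over REST could look like, the sketch below builds (but does not send) the HTTP request that would store one JSON document. It assumes that "encrypted REST" means HTTPS; the host, port, database name, and document ID are made-up values, and `build_put_request` is a hypothetical helper, not part of healthDB or couchDB. The only couchDB behaviour relied on is its documented convention of storing a document with a PUT to /database/document-id.

```python
import json
import urllib.parse
import urllib.request


def build_put_request(base_url, database, doc_id, document):
    """Build (but do not send) a PUT request that would store one document.

    base_url, database and doc_id are illustrative assumptions; couchDB
    stores a JSON document with PUT /{database}/{doc_id}.
    """
    body = json.dumps(document).encode("utf-8")
    # Percent-encode the document ID so it is safe inside a URL path.
    url = f"{base_url}/{database}/{urllib.parse.quote(doc_id, safe='')}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, port, database name and document.
request = build_put_request(
    "https://couch.example:6984",
    "healthdb",
    "document_lab_42",
    {"data": {"test": "HbA1c"}},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out here because the server in the example does not exist.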
The healthDB engine is the main control unit of healthDB. Its job is to ensure that
the user can store data seamlessly. It takes care of tasks such as automatic
partitioning, replication, encryption of the data, automatic load balancing, automatic
system backup, and error logging.
The healthDB engine is made up of various components: the partitioner,
replicator, connector engine, healthCPU, security, and healthDB API. (healthSearch will
be an additional component; it is undetermined whether it should sit in the healthDB engine
or in couchDB.) We shall look at each component of the engine briefly. Note: additional
components may be added, and components may be merged or deleted.
The healthDB API provides the healthDB interface to the outside world. It will be the only way to
communicate with the database. Multiple APIs should be developed, for example for Python,
Ruby, Java, C#, and REST.
The connector engine converts data from a variety of different formats into
a format that healthDB can understand while preserving its integrity.
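To make the idea of a connector concrete, here is a minimal sketch for one of the simpler inputs, a delimited file with a header row. It is not the healthDB connector engine: the function name `convert_delimited`, the output layout, and the use of a SHA-256 digest of each source line as an integrity check are all assumptions made for illustration.

```python
import csv
import hashlib


def convert_delimited(raw_text, delimiter="|"):
    """Convert delimited text (header row first) into a list of record dicts.

    Each record keeps a SHA-256 digest of its original line, so a later
    consumer can verify that the source line was not altered. The output
    layout is a hypothetical one, not a format defined by healthDB.
    """
    # Ignore blank lines; assume one record per line.
    lines = [line for line in raw_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = next(csv.reader([lines[0]], delimiter=delimiter))
    records = []
    for line in lines[1:]:
        values = next(csv.reader([line], delimiter=delimiter))
        records.append({
            "fields": dict(zip(header, values)),
            "source_sha256": hashlib.sha256(line.encode("utf-8")).hexdigest(),
        })
    return records


sample = "id|name\n7|Ada\n8|Grace\n"
for record in convert_delimited(sample):
    print(record["fields"], record["source_sha256"][:12])
```

Connectors for richer interfaces such as HL7 or ODBC would need real parsers or drivers; the point here is only the shape of the conversion step.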
The healthCPU is the brain of the healthDB database. It controls when healthDB should
replicate data and when it should partition data. It looks up data in
the datastore (couchDB) and formats, structures, and semi-structures data that will be
stored in the datastore. It ensures that data is HIPAA compliant by having the security
component encrypt it. The healthCPU also tracks which nodes are alive and what their
status is, and it performs load balancing. It also filters out data based on the users
The security component performs encryption and authentication, and tells the healthCPU
whether the user has permission to access certain data.
The replicator creates a new database replica based on what the healthCPU tells it.
The partitioner creates new partitions of the data and places the data on the server(s) the healthCPU
Diagram of the healthDB engine below.
The healthCPU will store unstructured data as follows. It will maintain a series of
documents that keep track of data from various sources. Each source will have its own
document(s). Each such document will contain (key, value) pairs of the form
(hash(document_sourcesystem_objectID), document_sourcesystem_objectID). A record
from a source system will be stored in its own separate document, which will hold system
values such as the last modified date along with the actual data itself. The record is called a
DBobject. The document name is used to identify the DBobject.
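The layout just described can be sketched in a few lines of Python. This is one possible reading, not the healthDB implementation: it assumes the document name is the literal word "document" joined to the source system and object ID with underscores, uses SHA-256 as the unspecified hash, and invents the field names `last_modified` and `data`.

```python
import hashlib
from datetime import datetime, timezone


def dbobject_name(source_system, object_id):
    """Name of the document holding one source record (assumed naming scheme)."""
    return f"document_{source_system}_{object_id}"


def make_dbobject(source_system, object_id, data):
    """Return (document name, DBobject) for one record from a source system."""
    name = dbobject_name(source_system, object_id)
    dbobject = {
        # System value; the field name is a hypothetical choice.
        "last_modified": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    return name, dbobject


def add_to_index(index_doc, name):
    """Record hash(name) -> name in a per-source index document."""
    # SHA-256 is an assumption; the text does not say which hash is used.
    key = hashlib.sha256(name.encode("utf-8")).hexdigest()
    index_doc[key] = name
    return key


lab_index = {}  # the tracking document for a hypothetical "lab" source system
name, dbobject = make_dbobject("lab", "42", {"test": "HbA1c", "value": 6.1})
key = add_to_index(lab_index, name)
print(name, "is indexed under", key[:12])
```

Keeping the index document separate from the DBobjects means a source's records can be enumerated without reading every record document.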
Other entities like DBobject can be created. We might have a person entity, which
would be identified by document_person_personID. This is very similar to the DBobject
concept, in which a series of documents contain references or indexes to the actual