Healthcare data comes in many shapes and sizes, which makes ingestion difficult across a variety of batch and near-real-time use cases. By evolving its architecture around Apache Kafka, Cerner was able to build a modular platform for current and future use cases. Reviewing how Cerner's usage evolved can help developers avoid the same mistakes and set themselves up for success.
17. Kafka-Based Notifications
● Kafka topic per listener
● Small Google Protobuf payloads
○ Gzip compression for a higher compression ratio (see the producer sketch after this list)
● Could consolidate down to fewer listeners
○ A single topic and partition vs. hundreds of NoSQL rows
● Able to give up fairness concerns in favor of speed
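As a rough illustration of this setup, the sketch below configures a Kafka producer with gzip compression and sends a small notification payload to a per-listener topic. The broker address, topic name, key, and raw byte[] (standing in for the serialized Protobuf message) are placeholders, not details from the talk; gzip costs more CPU per batch than Snappy but squeezes small, repetitive payloads harder, which is the trade-off the slide points at.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ListenerNotificationProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Gzip: more CPU per batch, but a higher compression ratio on small payloads.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // One topic per listener; in the talk the value is a small Protobuf message,
            // so a raw byte[] stands in for the serialized notification here.
            byte[] serializedNotification = new byte[] {/* protobuf bytes */};
            producer.send(new ProducerRecord<>("listener-alpha-notifications", // hypothetical topic
                    "source-123", serializedNotification));
        }
    }
}
```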
19. Kafka Staging Area
● Single location for one copy of the data
● Consumption based on type and source of data
○ ~500 types and 100-1000 sources
○ Chose source-based topics to cut down on topic count
○ Default to 8 partitions
● Snappy compression for low latency
● Huge variation in data sizes and frequency
○ Infrequent MB-GB file uploads (daily, weekly, monthly, yearly)
○ Streaming uploads of 100 B-10 MB
● Time-based retention to prevent data loss
○ Ambitiously set to 30 days, later lowered to 7 days (see the topic setup sketch after this list)
○ Archive data to HDFS for reprocessing or lagging/offline consumers
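A minimal sketch of what a per-source staging topic might look like with these settings, using the Kafka AdminClient: 8 partitions and 7-day time-based retention as the deck describes. The topic name and replication factor are assumptions, and Snappy compression would be set on the producers (compression.type=snappy) rather than on the topic itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class StagingTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, String> configs = new HashMap<>();
            // 7-day time-based retention, as the deck settled on after starting at 30 days.
            configs.put(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000));

            // One topic per source, defaulting to 8 partitions; the replication factor is a guess.
            NewTopic topic = new NewTopic("staging.source-123", 8, (short) 3) // hypothetical name
                    .configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```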
20. Kafka Payloads And Delivery
● Avro schema to wrap ingested data
○ Source, Type, Id, Version, Value (byte[]), Metadata (byte[]), Properties
○ Common payload regardless of the actual byte[]
● Set a threshold for payloads stored in Kafka (see the wrapper sketch after this list)
○ Store 95-98% of data in Kafka
○ Data larger than 50 MB stored in HDFS, with the path stored in the Avro wrapper
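A sketch of the wrapper logic this slide describes, built as an Avro GenericRecord: payloads under the threshold ride inline in Kafka, while oversized payloads are written to HDFS and only the path travels in the envelope. The schema field names and nullability, the hdfsPath property, and the writeToHdfs helper are all assumptions made for illustration, not Cerner's actual schema.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class IngestWrapper {

    // Field names and nullability are guesses at the wrapper the deck describes.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"IngestEnvelope\",\"fields\":["
      + "{\"name\":\"source\",\"type\":\"string\"},"
      + "{\"name\":\"type\",\"type\":\"string\"},"
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"version\",\"type\":\"long\"},"
      + "{\"name\":\"value\",\"type\":[\"null\",\"bytes\"],\"default\":null},"
      + "{\"name\":\"metadata\",\"type\":[\"null\",\"bytes\"],\"default\":null},"
      + "{\"name\":\"properties\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");

    private static final long INLINE_THRESHOLD_BYTES = 50L * 1024 * 1024; // ~50 MB cutoff

    public static GenericRecord wrap(String source, String type, String id,
                                     long version, byte[] payload) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("source", source);
        record.put("type", type);
        record.put("id", id);
        record.put("version", version);
        Map<String, String> properties = new HashMap<>();

        if (payload.length <= INLINE_THRESHOLD_BYTES) {
            // The common case (95-98% of data in the deck): carry the raw bytes in Kafka.
            record.put("value", ByteBuffer.wrap(payload));
        } else {
            // Oversized payload: write it to HDFS and carry only the path in the wrapper.
            String hdfsPath = writeToHdfs(source, id, payload);
            properties.put("hdfsPath", hdfsPath); // property name is an assumption
        }
        record.put("properties", properties);
        return record;
    }

    private static String writeToHdfs(String source, String id, byte[] payload) {
        // Hypothetical helper: a real implementation would use the Hadoop FileSystem API.
        return "/ingest/" + source + "/" + id;
    }
}
```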
21. Most Surprising Lesson Learned
● Rate of ingestion changes with Kafka
○ Lack of backpressure can increase the rate of ingestion
○ Capacity and retention planning could end up inaccurate
23. [Capacity planning diagram: rate of data ingested per day per source × number of sources × number of days to keep in Kafka = total storage needed in Kafka]
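For a rough sense of scale using the stats quoted later in the deck: ~1.2 TB/day of raw data across all sources × 7 days of retention ≈ 8.4 TB held in Kafka before broker replication; at a replication factor of 3 (an assumption, not stated in the talk) that is on the order of 25 TB of disk.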
24. [Chart: Initial Crawl - Kafka; msg/sec over days, comparing crawling all historical data vs. crawling only recent changes; crawls went from weeks to days]
25. [Capacity planning diagram revisited: rate of data ingested per day per source × number of sources × number of days to keep in Kafka = total storage needed in Kafka, with actuals running 10-30x the original estimate]
26. Kafka Storage Woes
● Monitor ALL THE THINGS
○ Broker free space
○ Disk usage per topic
○ Consumer lag in message count and max latency (see the lag-check sketch after this list)
○ Rate of data per source to detect anomalies vs. steady state
● Re-evaluate default retention with more evidence
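The deck does not say how these metrics were collected; one way to watch consumer lag in message count with a current Kafka AdminClient is sketched below. The broker address and consumer group name are placeholders, and the max-latency half of the bullet would need record timestamps on top of this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            String groupId = "staging-archiver"; // hypothetical consumer group
            // Offsets the group has committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Log-end (latest) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            // Lag in message count = log-end offset minus committed offset.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```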
27. Kafka Storage Woes Solution
● When storage gets tight know your options
○ Automate building new servers
○ Adjust the retention policy for one or more topics (see the sketch after this list)
● Balancing partitions is hard to do by hand
○ Balance in small batches
○ Automate, Automate, Automate
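A sketch of the "adjust retention" option using the AdminClient's incrementalAlterConfigs call; the topic name and the 3-day target are illustrative, not values from the talk. Partition balancing is a separate exercise, usually driven by the kafka-reassign-partitions tool in small batches as the slide suggests, and is not shown here.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class LowerRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Drop one topic's retention from 7 days to 3 days to relieve disk pressure.
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "staging.source-123"); // hypothetical
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(3L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(
                    Map.of(topic, Collections.singleton(setRetention))).all().get();
        }
    }
}
```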
31. Current Stats
● Deployed in 3 (soon to be 4) data centers
● 440 sources currently (⅓ of all clients)
● Ingesting 2 billion messages per day
○ Spiked as high as 6 billion
● Ingest 1.2 TB/day of raw data
● Archive job runs hourly and takes ~10 mins to pull ~50 GB of data
● Latency
○ NoSQL: 2-3 seconds (subset of data)
○ Replication (Kafka to Kafka): 700 milliseconds (all the data)