Healthcare data comes in many shapes and sizes, which makes ingestion difficult across a variety of batch and near-real-time use cases. By evolving its architecture around Apache Kafka, Cerner was able to build a modular platform for current and future use cases. Reviewing how Cerner's usage evolved can help developers avoid the same mistakes and set themselves up for success.
17. Kafka-Based Notifications
● Kafka topic per listener
● Small Google Protobuf payloads
○ Gzip compression for a higher compression ratio (see the producer sketch after this list)
● Could consolidate down to fewer listeners
○ A single topic and partition vs. hundreds of NoSQL rows
● Able to give up fairness concerns in favor of speed
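As a rough illustration of this setup, the sketch below configures a Kafka producer with gzip compression and sends a small notification payload to a per-listener topic. The broker address, topic name, key, and raw byte[] (standing in for the serialized Protobuf message) are placeholders, not details from the talk; gzip costs more CPU per batch than Snappy but squeezes small, repetitive payloads harder, which is the trade-off the slide points at.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ListenerNotificationProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Gzip: more CPU per batch, but a higher compression ratio on small payloads.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // One topic per listener; in the talk the value is a small Protobuf message,
            // so a raw byte[] stands in for the serialized notification here.
            byte[] serializedNotification = new byte[] {/* protobuf bytes */};
            producer.send(new ProducerRecord<>("listener-alpha-notifications", // hypothetical topic
                    "source-123", serializedNotification));
        }
    }
}
```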
19. Kafka Staging Area
● Single location for one copy of the data
● Consumption based on type and source of data
○ ~500 types and 100-1000 sources
○ Chose source-based topics to cut down on topic count
○ Default to 8 partitions
● Snappy compression for low latency
● Huge variation in data sizes and frequency
○ Infrequent MB-GB file uploads (daily, weekly, monthly, yearly)
○ Streaming uploads of 100 B-10 MB
● Time-based retention to prevent data loss
○ Ambitiously set to 30 days, later lowered to 7 days (see the topic setup sketch after this list)
○ Archive data to HDFS for reprocessing or lagging/offline consumers
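A minimal sketch of what a per-source staging topic might look like with these settings, using the Kafka AdminClient: 8 partitions and 7-day time-based retention as the deck describes. The topic name and replication factor are assumptions, and Snappy compression would be set on the producers (compression.type=snappy) rather than on the topic itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class StagingTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, String> configs = new HashMap<>();
            // 7-day time-based retention, as the deck settled on after starting at 30 days.
            configs.put(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000));

            // One topic per source, defaulting to 8 partitions; the replication factor is a guess.
            NewTopic topic = new NewTopic("staging.source-123", 8, (short) 3) // hypothetical name
                    .configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```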
20. Kafka Payloads And Delivery
● Avro schema to wrap ingested data
○ Source, Type, Id, Version, Value (byte[]), Metadata (byte[]), Properties
○ Common payload regardless of the actual byte[]
● Set a threshold for payloads stored in Kafka (see the wrapper sketch after this list)
○ Store 95-98% of data in Kafka
○ Data larger than 50 MB stored in HDFS, with the path stored in the Avro wrapper
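A sketch of the wrapper logic this slide describes, built as an Avro GenericRecord: payloads under the threshold ride inline in Kafka, while oversized payloads are written to HDFS and only the path travels in the envelope. The schema field names and nullability, the hdfsPath property, and the writeToHdfs helper are all assumptions made for illustration, not Cerner's actual schema.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class IngestWrapper {

    // Field names and nullability are guesses at the wrapper the deck describes.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"IngestEnvelope\",\"fields\":["
      + "{\"name\":\"source\",\"type\":\"string\"},"
      + "{\"name\":\"type\",\"type\":\"string\"},"
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"version\",\"type\":\"long\"},"
      + "{\"name\":\"value\",\"type\":[\"null\",\"bytes\"],\"default\":null},"
      + "{\"name\":\"metadata\",\"type\":[\"null\",\"bytes\"],\"default\":null},"
      + "{\"name\":\"properties\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");

    private static final long INLINE_THRESHOLD_BYTES = 50L * 1024 * 1024; // ~50 MB cutoff

    public static GenericRecord wrap(String source, String type, String id,
                                     long version, byte[] payload) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("source", source);
        record.put("type", type);
        record.put("id", id);
        record.put("version", version);
        Map<String, String> properties = new HashMap<>();

        if (payload.length <= INLINE_THRESHOLD_BYTES) {
            // The common case (95-98% of data in the deck): carry the raw bytes in Kafka.
            record.put("value", ByteBuffer.wrap(payload));
        } else {
            // Oversized payload: write it to HDFS and carry only the path in the wrapper.
            String hdfsPath = writeToHdfs(source, id, payload);
            properties.put("hdfsPath", hdfsPath); // property name is an assumption
        }
        record.put("properties", properties);
        return record;
    }

    private static String writeToHdfs(String source, String id, byte[] payload) {
        // Hypothetical helper: a real implementation would use the Hadoop FileSystem API.
        return "/ingest/" + source + "/" + id;
    }
}
```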
21. Most Surprising Lesson Learned
● Rate of ingestion changes with Kafka
○ Lack of backpressure can increase the rate of ingestion
○ Capacity and retention planning could end up inaccurate
23. [Capacity planning diagram: rate of data ingested per day per source × number of sources × number of days to keep in Kafka = total storage needed in Kafka]
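For a rough sense of scale using the stats quoted later in the deck: ~1.2 TB/day of raw data across all sources × 7 days of retention ≈ 8.4 TB held in Kafka before broker replication; at a replication factor of 3 (an assumption, not stated in the talk) that is on the order of 25 TB of disk.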
24. [Chart: Initial Crawl - Kafka; msg/sec over days, comparing crawling all historical data vs. crawling only recent changes; crawls went from weeks to days]
25. [Capacity planning diagram revisited: rate of data ingested per day per source × number of sources × number of days to keep in Kafka = total storage needed in Kafka, with actuals running 10-30x the original estimate]
26. Kafka Storage Woes
● Monitor ALL THE THINGS
○ Broker free space
○ Disk usage per topic
○ Consumer lag in message count and max latency (see the lag-check sketch after this list)
○ Rate of data per source to detect anomalies vs. steady state
● Re-evaluate default retention with more evidence
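The deck does not say how these metrics were collected; one way to watch consumer lag in message count with a current Kafka AdminClient is sketched below. The broker address and consumer group name are placeholders, and the max-latency half of the bullet would need record timestamps on top of this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            String groupId = "staging-archiver"; // hypothetical consumer group
            // Offsets the group has committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Log-end (latest) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            // Lag in message count = log-end offset minus committed offset.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```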
27. Kafka Storage Woes Solution
● When storage gets tight know your options
○ Automate building new servers
○ Adjust the retention policy for one or more topics (see the sketch after this list)
● Balancing partitions is hard to do by hand
○ Balance in small batches
○ Automate, Automate, Automate
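A sketch of the "adjust retention" option using the AdminClient's incrementalAlterConfigs call; the topic name and the 3-day target are illustrative, not values from the talk. Partition balancing is a separate exercise, usually driven by the kafka-reassign-partitions tool in small batches as the slide suggests, and is not shown here.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class LowerRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Drop one topic's retention from 7 days to 3 days to relieve disk pressure.
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "staging.source-123"); // hypothetical
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(3L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(
                    Map.of(topic, Collections.singleton(setRetention))).all().get();
        }
    }
}
```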
31. Current Stats
● Deployed in 3 (soon to be 4) data centers
● 440 sources currently (⅓ of all clients)
● Ingesting 2 billion messages per day
○ Spiked as high as 6 billion
● Ingest 1.2 TB/day of raw data
● Archive job runs hourly and takes ~10 mins to pull ~50 GB of data
● Latency
○ NoSQL: 2-3 seconds (subset of data)
○ Replication (Kafka to Kafka): 700 milliseconds (all the data)