___________________________________________
Meetup #7 | Session 2 | 21/03/2018 | Taboola
___________________________________________
In this talk, we will present our multi-DC Kafka architecture, and discuss how we tackle sending and handling 10B+ messages per day, with maximum availability and no tolerance for data loss.
Our architecture includes technologies such as Cassandra, Spark, HDFS, and Vertica - with Kafka as the backbone that feeds them all.
6. Data Infrastructure Requirements
• Fresh data
• Fast queries
• Exact (billing)
• Flexible – simple to add data and extend
• Scale faster than the business
• Endure traffic spikes
7. Our Strategy to Scale
• Best-of-breed technologies
• A lot of custom development
• Highly optimized distributions and hardware
• Software designed with scale and self-healing in mind
• Everything is monitored and profiled
• Infrastructure to support extremely agile development cycles
8. Architecture Principles
Any server - and even any data center - can go down at any time without service interruption:
• Share nothing - each server is independent
• Application is stateless - any server can accept any traffic
• Data is fully replicated in real time
• Dynamic load balancing
Data must be exact:
• All processing is idempotent
• “Exactly Once” semantics - never count twice (see the dedup sketch at the end of this slide)
Data infrastructure is fully pluggable:
• Connect into it at any point
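How "never count twice" can be enforced is worth pausing on. A minimal sketch of idempotent counting (class and method names are hypothetical, not Taboola's actual code): every message carries a unique id, and an aggregate is updated only the first time that id is seen, so retries and replays cannot inflate billing numbers.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// A sketch of idempotent, "exactly once" counting - names are hypothetical.
public class IdempotentCounter {
    // In production the seen-id set would live in a replicated store
    // (e.g. Cassandra), not in memory; a map is enough to show the idea.
    private final ConcurrentHashMap<String, Boolean> seenMessageIds = new ConcurrentHashMap<>();
    private final AtomicLong clicks = new AtomicLong();

    /** Returns true if the message was counted, false if it was a duplicate. */
    public boolean countClick(String messageId) {
        if (seenMessageIds.putIfAbsent(messageId, Boolean.TRUE) != null) {
            return false; // already processed - counting again would corrupt billing
        }
        clicks.incrementAndGet();
        return true;
    }

    public long total() {
        return clicks.get();
    }
}
```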
9. Backend Processing
Our code deals with:
• Real-time joining of multiple streams
• Real-time access to data
• Distributed processing
(Architecture diagram: FE Servers, Jade, Cloud Storage, TensorFlow Serving, SQL)
11. High Throughput Using Custom Buffering
Volume
• 50B protobufs / day - billing-related protobufs
• 25B protobufs / day - monitoring-related protobufs
Requirements
• Can’t interrupt the recommendations service
• Can’t lose data
How do we handle this volume?
• Custom message buffering
• Async sending
• Off-heap - no GC pressure, using pre-allocated DirectByteBuffers (sketched below)
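A minimal sketch of that buffering idea (the pool, class, and method names are invented, not Taboola's implementation): serialized protobufs are staged in DirectByteBuffers that are allocated once up front - off-heap memory the GC never scans - and handed to the Kafka producer asynchronously, so the serving path is never blocked.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OffHeapBufferedSender {
    private static final int BUFFER_SIZE = 1 << 20; // 1 MB per slot
    private final BlockingQueue<ByteBuffer> freeBuffers;
    private final KafkaProducer<byte[], byte[]> producer;

    public OffHeapBufferedSender(KafkaProducer<byte[], byte[]> producer, int poolSize) {
        this.producer = producer;
        this.freeBuffers = new ArrayBlockingQueue<>(poolSize);
        // Pre-allocate the whole pool: DirectByteBuffers live off-heap,
        // so the GC never scans or copies this memory.
        for (int i = 0; i < poolSize; i++) {
            freeBuffers.add(ByteBuffer.allocateDirect(BUFFER_SIZE));
        }
    }

    /** Never blocks the caller; returns false if the message could not be staged. */
    public boolean send(String topic, byte[] serializedProtobuf) {
        if (serializedProtobuf.length > BUFFER_SIZE) {
            return false; // oversized - a real system would spill to disk here
        }
        ByteBuffer buf = freeBuffers.poll(); // non-blocking on purpose
        if (buf == null) {
            return false; // pool exhausted - never stall the recommendations service
        }
        buf.clear();
        buf.put(serializedProtobuf);
        buf.flip();
        // Copy on-heap only at the moment of sending; the async callback
        // returns the slot to the pool once Kafka acks (or fails) the record.
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        producer.send(new ProducerRecord<>(topic, payload),
                (metadata, error) -> freeBuffers.offer(buf));
        return true;
    }
}
```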
13. Monitoring
• Messages waiting on the filesystem (should be zero)
• Message producing rate
• Number of buffers being used
• Blocking Queue Size + Dropped Messages (gauge wiring sketched after this list)
• Message Size
• Payload Size
• Send to Kafka times
• Errors
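A hedged illustration of how a few of these could be wired with Codahale (Dropwizard) Metrics - the metric names and the spill-directory convention are assumptions for the sketch, not Taboola's actual setup:

```java
import java.io.File;
import java.util.concurrent.BlockingQueue;
import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class BufferingMetrics {
    final Counter droppedMessages;   // incremented when the buffer pool rejects a message
    final Timer kafkaSendTime;       // wrapped around each producer.send()

    BufferingMetrics(MetricRegistry registry, BlockingQueue<?> sendQueue, File spillDir) {
        // Messages waiting on the filesystem - should be zero in steady state.
        registry.register("buffering.files-waiting", (Gauge<Integer>) () -> {
            String[] files = spillDir.list();
            return files == null ? 0 : files.length;
        });
        // Blocking queue size - how far behind the async sender is running.
        registry.register("buffering.queue-size", (Gauge<Integer>) sendQueue::size);
        droppedMessages = registry.counter("buffering.dropped-messages");
        kafkaSendTime = registry.timer("buffering.kafka-send-time");
    }
}
```

The Timer would wrap each producer.send() call, e.g. try (Timer.Context ctx = kafkaSendTime.time()) { ... }, giving the "send to Kafka times" distribution from the slide.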
16. Schema Management
Protostuff with schema evolution
• Separate git repo with strict ALM
• Testing for backward/forward compatibility (see the sketch after this list)
• Feature branches cannot be deployed to production
• Became critical once many developers were adding data
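The slide doesn't show what such a compatibility test looks like; here is a minimal illustration of the property being tested, using Protostuff runtime schemas (the event classes and fields are invented for the example):

```java
import io.protostuff.LinkedBuffer;
import io.protostuff.ProtostuffIOUtil;
import io.protostuff.Schema;
import io.protostuff.Tag;
import io.protostuff.runtime.RuntimeSchema;

public class SchemaEvolutionTest {
    // "v1" of an event, as an old producer would have written it.
    static class EventV1 {
        @Tag(1) String id;
    }

    // "v2" adds a field with a NEW tag number - the only safe kind of change.
    static class EventV2 {
        @Tag(1) String id;
        @Tag(2) long timestampMs;
    }

    public static void main(String[] args) {
        Schema<EventV1> v1Schema = RuntimeSchema.getSchema(EventV1.class);
        Schema<EventV2> v2Schema = RuntimeSchema.getSchema(EventV2.class);

        EventV1 old = new EventV1();
        old.id = "click-42";
        byte[] v1Bytes = ProtostuffIOUtil.toByteArray(old, v1Schema, LinkedBuffer.allocate(256));

        // Forward compatibility: a v2 reader decodes v1 bytes;
        // the missing field simply keeps its default value.
        EventV2 upgraded = v2Schema.newMessage();
        ProtostuffIOUtil.mergeFrom(v1Bytes, upgraded, v2Schema);
        System.out.println(upgraded.id + " / " + upgraded.timestampMs); // click-42 / 0
    }
}
```

The evolution rule this encodes: new fields get new tag numbers and are never required, so old bytes decode under the new schema (missing fields take defaults) and new bytes decode under the old schema (unknown tags are skipped).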
18. Multi DC Deployment
• 6 FE data centers
• 4 BE data centers
• Brokers: 36 FE + 50 BE
• Broker = 7TB NVMe disk, 128GB RAM, 32 CPU cores, 10Gb Ethernet
• Full Data Replication to BE DCs
19. Mirroring With Kafka Mirror Maker
(Diagram: topics 1..N on FE Kafka, each with partitions 1..K, consumed by Mirror Maker consumers C1..Cm and written through a small, fixed set of producers P into topics 1..N on BE Kafka.)
Message size varies: O(10KB) - O(MB)
20. Mirroring With Kafka Mirror Maker
(Diagram: the same FE-to-BE mirroring, now with consumers C1..Ck matching the K partitions, still funneling through the same fixed set of producers P into the BE topics’ partitions 1..K.)
21. Mirroring Done Right - KFC
(Diagram: FE Kafka topics 1..N with partitions 1..K, mirrored by KFC Mirror - consumers C1..Ck, each fronting its own producer pool, as many producers as we need - into BE Kafka topics 1..N with partitions 1..K.)
22. Introducing KFC
Multi-purpose framework for consuming from Kafka and processing messages in parallel, with built-in monitoring.
(Class diagram: TaboolaKafkaConsumer with Consumer, Runnable, MessageProcessor, KafkaConsumerParallelismStrategy, KafkaCommitStrategy. A structural sketch follows.)
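The slide names the parts but not their contracts; the following is a guess at how they might fit together (the interface shapes and the loop are assumptions, not Taboola's actual API):

```java
import java.time.Duration;
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Pluggable pieces named on the slide - their shapes here are assumptions.
interface MessageProcessor {
    void process(ConsumerRecord<byte[], byte[]> record);
}

interface KafkaConsumerParallelismStrategy {
    // Decide how a polled batch is fanned out across worker threads.
    void dispatch(ConsumerRecords<byte[], byte[]> records, MessageProcessor processor);
}

interface KafkaCommitStrategy {
    // Decide when/what to commit (sync, async, per-batch, per-partition...).
    void maybeCommit(KafkaConsumer<byte[], byte[]> consumer);
}

// The consumer loop ties the strategies together.
class TaboolaKafkaConsumerSketch implements Runnable {
    private final KafkaConsumer<byte[], byte[]> consumer;
    private final MessageProcessor processor;
    private final KafkaConsumerParallelismStrategy parallelism;
    private final KafkaCommitStrategy commit;

    TaboolaKafkaConsumerSketch(KafkaConsumer<byte[], byte[]> consumer,
                               Collection<String> topics,
                               MessageProcessor processor,
                               KafkaConsumerParallelismStrategy parallelism,
                               KafkaCommitStrategy commit) {
        this.consumer = consumer;
        this.processor = processor;
        this.parallelism = parallelism;
        this.commit = commit;
        consumer.subscribe(topics);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
            parallelism.dispatch(records, processor);  // parallel processing
            commit.maybeCommit(consumer);              // pluggable commit policy
        }
    }
}
```

Splitting parallelism and commit policy into strategies is what makes the framework multi-purpose: a billing topic can pair in-order processing with synchronous commits, while a monitoring topic trades ordering for throughput.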
25. Monitoring
● Registering every org.apache.kafka.common.Metric as a Codahale Gauge (sketched below)
● Alerting on poll cycle time:
○ Message processing is stuck
○ Can’t find group coordinator
● Alerting on partition lag:
○ Lag by number of messages
○ Lag by delta from produce time (bursty topics)
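A small sketch of that metric bridge (the naming scheme is an assumption; the Kafka and Codahale calls are standard):

```java
import java.util.Map;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class KafkaMetricsBridge {
    /** Expose the Kafka client's internal metrics through the Codahale registry. */
    public static void register(MetricRegistry registry, KafkaConsumer<?, ?> consumer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
            Metric metric = entry.getValue();
            // e.g. "kafka.consumer-fetch-manager-metrics.records-lag-max"
            String name = MetricRegistry.name("kafka",
                    entry.getKey().group(), entry.getKey().name());
            if (!registry.getGauges().containsKey(name)) {
                registry.register(name, (Gauge<Object>) metric::metricValue);
            }
        }
    }
}
```

Lag alerting can then key off gauges such as records-lag-max, while the delta-from-produce-time variant compares each record's produce timestamp to the wall clock.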
30. Apache Spark as Our Distributed Processing Engine
Serves multiple functions:
• Runs our code that joins data streams into pageviews and sessions (toy sketch below)
• SQL engine for analysts and the algo group
• Raw data feed into the deep learning engine
• 100s of data aggregators constantly running, feeding data to Backstage (in beta)
Fun facts
• ~15K cores + >70TB of RAM (plus >8PB of historic data on disk)
• Clusters will grow by 50% in size by year’s end
• Using Spark in production since the end of 2013
• Contributed multiple critical fixes back to the community
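For the stream-join bullet, a deliberately toy batch version of the idea (paths, schemas, and column names are invented; the real pipeline joins live streams): impressions and clicks are joined on a shared pageview id to produce enriched pageview records.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PageviewJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pageview-join-sketch")
                .getOrCreate();

        // In production these would be live streams; reading parquet
        // batches keeps the sketch self-contained and runnable.
        Dataset<Row> impressions = spark.read().parquet("/data/impressions");
        Dataset<Row> clicks = spark.read().parquet("/data/clicks");

        // Join the two event types on their shared pageview id
        // (inner join; the key column is deduplicated in the result).
        Dataset<Row> pageviews = impressions.join(clicks, "pageviewId");

        pageviews.write().parquet("/data/pageviews");
        spark.stop();
    }
}
```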