Kafka has become extremely popular for streaming data, but it imposes very few constraints on the format of the data being streamed. As we wanted all of our Data Engineers and Data Scientists to use the data in our Kafka clusters, we soon faced the challenge of keeping data quality high. We developed a tool to monitor the quality of the streams in real time, and we had to make it scalable and fault tolerant. In this talk, we will look at the technical difficulties we encountered with our Kafka Streams implementation, and how we went through a major rewrite of the application to make it scale.
Data Quality Monitoring in Real-Time and at Scale with Kafka
1. DATA QUALITY MONITORING IN REAL TIME AND AT SCALE
ALEXIS SEIGNEURIN - @aseigneurin
OCTOBER 2017
2. MYSELF
• Data Engineer
• aseigneurin.github.io or @aseigneurin
3. DATA QUALITY MONITORING
PART 1
4. THE PROJECT
• A few Kafka clusters, lots of topics
• Analyze all the messages of all the Kafka topics
• Count the number of valid or invalid messages per second
• Push metrics to InfluxDB, graph with Grafana
5. LOTS OF COMPLEXITY TO HANDLE
• Topics = multiple partitions → Final results must be aggregated
• High volumes: 100k+ messages / second
• Fault tolerance + exactly once processing
• Count per window of 1 second with low latency
• Event-time processing (Kafka 0.10+)
• Data can arrive late → Update results
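The points above can be sketched in code. Below is a minimal, illustrative Java sketch (not the talk's actual implementation) of event-time counting in 1-second windows: each message is bucketed by its event time, and a late arrival simply updates the count of an already-emitted window, which is why downstream results must be updatable.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: count valid/invalid messages per 1-second event-time
// window. Class and key format are hypothetical, chosen for this example.
public class WindowedCounts {
    // Key = "<windowStartMs>/<status>", value = count
    final Map<String, Long> counts = new HashMap<>();

    void record(long eventTimeMs, boolean valid) {
        long window = (eventTimeMs / 1000) * 1000; // align to 1-second window
        String key = window + "/" + (valid ? "valid" : "invalid");
        counts.merge(key, 1L, Long::sum);
    }

    public static void main(String[] args) {
        WindowedCounts wc = new WindowedCounts();
        wc.record(1501273548123L, true);
        wc.record(1501273548456L, true);
        wc.record(1501273548789L, false);
        wc.record(1501273549100L, true);
        // Late arrival for the first window: the count is updated in place
        wc.record(1501273548999L, true);
        System.out.println(wc.counts.get("1501273548000/valid"));   // 3
        System.out.println(wc.counts.get("1501273548000/invalid")); // 1
    }
}
```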
6. EXAMPLE
         t0                     t1                t2
--------------------------------------------------------------
tx-0 | v | v | v | i | v |  | v | i |       | i | v | v |
--------------------------------------------------------------
tx-1 | v | i | v | v |      | v | v | v |   | i | i | v | v |
--------------------------------------------------------------
• t0
- 7 valid messages (4 in partition tx-0 + 3 in partition tx-1)
- 2 invalid messages (1 in each partition)
• t1
- 4 valid messages
- 1 invalid message
• t2
- 4 valid messages
- 3 invalid messages
7. KAFKA → KAFKA
• Kafka (raw data) → Kafka (metrics)
• Microservice (one function, only depends on Kafka)
$ kafka-console-consumer --topic data-checker-changelog --property print.key=true ...
{"topic":"tx","window":1501273548000,"status":"valid"} 7
{"topic":"tx","window":1501273548000,"status":"invalid"} 2
{"topic":"tx","window":1501273549000,"status":"valid"} 4
{"topic":"tx","window":1501273549000,"status":"invalid"} 1
{"topic":"tx","window":1501273550000,"status":"valid"} 4
{"topic":"tx","window":1501273550000,"status":"invalid"} 3
8. LATE DATA
• E.g. one more valid message with event time = t1
• Outputs one new metric:
• This topic is a change log
• Can use a compacted topic
{"topic":"tx","window":1501273549000,"status":"valid"} 4
...
{"topic":"tx","window":1501273549000,"status":"valid"} 5
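This is why a compacted topic works as the metrics change log: Kafka's log compaction eventually keeps only the latest record per key, so a late-data update (same key, new count) supersedes the old value. A minimal Java sketch of that semantics, using a hypothetical in-memory log rather than Kafka itself:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of log-compaction semantics: for each key, only the
// latest value survives. Not the talk's code; the log is simulated in memory.
public class CompactionSketch {
    static Map<String, Long> compact(List<Map.Entry<String, Long>> log) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (Map.Entry<String, Long> rec : log) {
            latest.put(rec.getKey(), rec.getValue()); // later records win
        }
        return latest;
    }

    public static void main(String[] args) {
        String key = "{\"topic\":\"tx\",\"window\":1501273549000,\"status\":\"valid\"}";
        List<Map.Entry<String, Long>> log = List.of(
            Map.entry(key, 4L),   // original count for the window
            Map.entry(key, 5L));  // update after one late message
        System.out.println(compact(log).get(key)); // 5
    }
}
```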
9. KAFKA → INFLUXDB
• Kafka (metrics change log) → InfluxDB (time series of metrics)
• Microservice: one function, can be restarted independently
> select valid, invalid from tx
name: tx
time valid invalid
---- ----- -------
1501273548000000000 7 2
1501273549000000000 4 1
1501273550000000000 4 3
10. LATE DATA
• Only the latest value is stored in InfluxDB
• New change log item = update in InfluxDB
> select valid, invalid from tx where time=1501273549000000000
name: tx
time valid invalid
---- ----- -------
1501273549000000000 5 1
11. FIRST IMPLEMENTATION WITH KAFKA STREAMS
PART 2
12. KAFKA STREAMS
• docs.confluent.io/current/streams/index.html
• Library to process data from Kafka
• Built on top of the Java Kafka client
• DSL + low-level API
• Leverages Consumer Groups → Horizontal scalability
13. IMPLEMENTATION
• Thin Scala wrapper for the Kafka Streams API
github.com/aseigneurin/kafka-streams-scala
messages
.map((_, message) => message match {
case _: GoodMessage => ("valid", 1)
case _: BadMessage => ("invalid", 1)
})
.groupByKey
.count(TimeWindows.of(1000), "metrics-agg-store")
.toStream
.map((k, v) => (MetricKey(inputTopic, k.window.start, k.key), v))
.to(metricsTopic)
14. REPARTITIONING
• Aggregations can only be done by key
• Repartition topic (internal topic) created by Streams
• (timestamps are preserved)
.map((_, message) => message match {
case _: GoodMessage => ("valid", 1)
case _: BadMessage => ("invalid", 1)
})
.groupByKey
valid 1
valid 1
invalid 1
valid 1
invalid 1
15. COUNTING
• Count per window of 1 second
• Streams creates an in-memory state store
• Backed by an internal change log (internal topic) for fault tolerance
.count(TimeWindows.of(1000), "metrics-agg-store")
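The state-store-plus-change-log mechanism above can be sketched as follows, as a minimal Java illustration (hypothetical names, not Kafka Streams' internals): every update to the in-memory store is also appended to a change log, and after a crash, replaying that log rebuilds the store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a state store backed by a change log for fault
// tolerance. The List stands in for the internal Kafka topic.
public class BackedStateStore {
    final Map<String, Long> store = new HashMap<>();
    final List<String[]> changelog;

    BackedStateStore(List<String[]> changelog) { this.changelog = changelog; }

    void increment(String key) {
        long v = store.merge(key, 1L, Long::sum);
        changelog.add(new String[]{key, Long.toString(v)}); // write-through
    }

    // Replay the change log to rebuild the in-memory state after a crash
    static BackedStateStore restore(List<String[]> changelog) {
        BackedStateStore s = new BackedStateStore(changelog);
        for (String[] rec : changelog) s.store.put(rec[0], Long.parseLong(rec[1]));
        return s;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        BackedStateStore s = new BackedStateStore(log);
        s.increment("1501273548000/valid");
        s.increment("1501273548000/valid");
        // Simulate a crash: rebuild a fresh store from the change log
        BackedStateStore recovered = BackedStateStore.restore(log);
        System.out.println(recovered.store.get("1501273548000/valid")); // 2
    }
}
```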
16. OUTPUT
• Aggregation result is a KTable → Turn it into a KStream
• Must read:
Duality of Streams and Tables, Confluent
docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
• Write the result to a topic
.toStream
.map((k, v) => (MetricKey(inputTopic, k.window.start, k.key), v))
.to(metricsTopic)
17. OUTPUT & LATE DATA
• Write with a key to preserve ordering
• Must read:
The world beyond batch: Streaming 101, Tyler Akidau
www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
$ kafka-console-consumer --topic metrics --property print.key=true ...
{"topic":"tx","window":1501273548000,"status":"valid"} 7
{"topic":"tx","window":1501273548000,"status":"invalid"} 2
{"topic":"tx","window":1501273549000,"status":"valid"} 4
...
{"topic":"tx","window":1501273549000,"status":"valid"} 5
18. PROBLEM 1 - HOT PARTITION
• Most messages are valid → Repartitioning creates a hot partition
• Can only be processed by 1 thread
-----------------------------
repartition-0
-----------------------------
repartition-1 |v|v|v|v|v|v|v|v|v|v|v|v|v|v|
-----------------------------
repartition-2 |i|i|i|
-----------------------------
repartition-3
-----------------------------
19. PROBLEM 2 - INTERNAL TOPICS
• 2 “internal” topics per input topic
• repartition
• changelog
• Need to allocate threads to read from the repartition topic
• Cannot reuse these topics for multiple input topics
20. REIMPLEMENTATION WITH PLAIN KAFKA CONSUMERS
PART 3
21. DESIGN - KEY IDEAS
• No repartitioning
• Directly aggregate per partition
• Final aggregation (across partitions) made by InfluxDB
• A single change log topic shared by all instances
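The final aggregation step is cheap precisely because it is a plain sum: each instance only counts per partition, and the cross-partition total for a window is the sum of the per-partition counts, which the design delegates to InfluxDB at query time. A minimal, illustrative Java sketch (hypothetical helper, not the talk's code):

```java
import java.util.Map;

// Illustrative sketch of the cross-partition aggregation that InfluxDB's
// sum() performs at query time.
public class CrossPartitionSum {
    static long total(Map<Integer, Long> countsPerPartition) {
        return countsPerPartition.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Valid counts for window 1501273548000, per partition (from the example)
        System.out.println(total(Map.of(0, 4L, 1, 3L))); // 7
    }
}
```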
22. DESIGN - IDEAS FROM KAFKA STREAMS
• Multi-threaded with one state store per thread
• 1 thread = 1 or multiple partitions
• No sharing of state across threads
• State store backed by the change log
• Used to read the state in case of crash / repartitioning
• Event time processing + Handling of late data
• Expiration of old windows
23. AGGREGATE PER PARTITION
         t0                     t1                t2
--------------------------------------------------------------
tx-0 | v | v | v | i | v |  | v | i |       | i | v | v |
--------------------------------------------------------------
tx-1 | v | i | v | v |      | v | v | v |   | i | i | v | v |
--------------------------------------------------------------
• Metrics per partition
$ kafka-console-consumer --topic metrics --property print.key=true ...
{"topic":"tx","partition":0,"window":1501273548000,"status":"valid"} 4
{"topic":"tx","partition":0,"window":1501273548000,"status":"invalid"} 1
{"topic":"tx","partition":1,"window":1501273548000,"status":"valid"} 3
{"topic":"tx","partition":1,"window":1501273548000,"status":"invalid"} 1
{"topic":"tx","partition":0,"window":1501273549000,"status":"valid"} 1
{"topic":"tx","partition":0,"window":1501273549000,"status":"invalid"} 1
{"topic":"tx","partition":1,"window":1501273549000,"status":"valid"} 3
{"topic":"tx","partition":0,"window":1501273550000,"status":"valid"} 2
{"topic":"tx","partition":0,"window":1501273550000,"status":"invalid"} 1
{"topic":"tx","partition":1,"window":1501273550000,"status":"valid"} 2
{"topic":"tx","partition":1,"window":1501273550000,"status":"invalid"} 2
24. STORAGE IN INFLUXDB
• Add the partition number as a tag
> select valid, invalid from tx
name: tx
time partition valid invalid
---- --------- ----- -------
1501273548000000000 0 4 1
1501273548000000000 1 3 1
1501273549000000000 0 1 1
...
25. AGGREGATION WITH INFLUXDB
• Leverage InfluxDB’s aggregation functionality
• Supported by Grafana
> select sum(valid) as valid, sum(invalid) as invalid from "topic-health" where time>=1501273548000000000
and time<=1501273550000000000
name: topic-health
time valid invalid
---- ----- -------
1501273548000000000 7 2
1501273549000000000 4 1
1501273550000000000 4 3
26. CHANGE LOG
• Change log = Metrics topic + source offsets
• When partitions are assigned, populate the state store from the change log
• Filter using the topic + partition number
$ kafka-console-consumer --topic changelog --property print.key=true ...
{"topic":"tx","partition":0,"window":1501273548000,"status":"valid"} {"value":4,"offset":5}
{"topic":"tx","partition":0,"window":1501273548000,"status":"invalid"} {"value":1,"offset":4}
{"topic":"tx","partition":1,"window":1501273548000,"status":"valid"} {"value":3,"offset":4}
...
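The recovery path above can be sketched in Java: because each change log record carries both the count and the source offset, a consumer that is assigned a partition can filter the change log on topic + partition, rebuild its state store, and seek past the messages it has already counted. The record shape and method names below are hypothetical, chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of restoring state + resume offset from a change log
// that stores (count, source offset) per metric key.
public class ChangelogRestore {
    record Entry(String topic, int partition, long window, String status,
                 long value, long offset) {}

    // Rebuild the state store for one assigned partition from the change log
    static Map<String, Long> restore(List<Entry> log, String topic, int partition) {
        Map<String, Long> store = new HashMap<>();
        for (Entry e : log)
            if (e.topic().equals(topic) && e.partition() == partition)
                store.put(e.window() + "/" + e.status(), e.value());
        return store;
    }

    // First source offset whose message has not been counted yet
    static long nextOffset(List<Entry> log, String topic, int partition) {
        long max = -1;
        for (Entry e : log)
            if (e.topic().equals(topic) && e.partition() == partition)
                max = Math.max(max, e.offset());
        return max + 1;
    }

    public static void main(String[] args) {
        List<Entry> changelog = List.of(
            new Entry("tx", 0, 1501273548000L, "valid", 4, 5),
            new Entry("tx", 0, 1501273548000L, "invalid", 1, 4),
            new Entry("tx", 1, 1501273548000L, "valid", 3, 4));
        // This instance was assigned tx-0 only
        System.out.println(restore(changelog, "tx", 0).get("1501273548000/valid")); // 4
        System.out.println(nextOffset(changelog, "tx", 0)); // 6
    }
}
```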
27. MAIN COMPONENTS
• StateStore
• In-memory store of active counts
• Cleanup every few minutes
• ChangelogWriter / ChangelogReader
• State store ↔ change log topic
• DataCheckerThread
• Main consumer loop
• ConsumerRebalanceListener
• Reconfigures the application when a partition rebalancing happens
28. LIFECYCLE
• On startup
• Wait for partitions to be assigned
• Read the change log → State store
• Start consuming data
• Every second
• Dump new values of the State Store to the change log
• Every 5 minutes
• Discard old windows
• On repartitioning = On startup
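The "discard old windows" step keeps the in-memory store bounded: windows older than a retention period are dropped on each cleanup pass. A minimal Java sketch (hypothetical retention value, not the talk's code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of periodic window expiration in the state store.
public class WindowExpiry {
    static void expire(Map<Long, Long> countsByWindow, long nowMs, long retentionMs) {
        // Drop every window that started before the retention horizon
        countsByWindow.keySet().removeIf(window -> window < nowMs - retentionMs);
    }

    public static void main(String[] args) {
        Map<Long, Long> counts = new HashMap<>();
        counts.put(1501273548000L, 7L);   // old window
        counts.put(1501273848000L, 4L);   // recent window
        expire(counts, 1501273900000L, 300_000L); // 5-minute retention (assumed)
        System.out.println(counts.containsKey(1501273548000L)); // false
        System.out.println(counts.containsKey(1501273848000L)); // true
    }
}
```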
29. CATCHING UP
• After an interruption
• Catches up at a different rate for each partition
31. SUMMARY
• New implementation
✓ Very robust: fault-tolerant, exactly once processing
✓ Event-time + Late data processing
✓ 10,000 messages per second with 2 threads
• Kafka Streams?
• Great library but needs 2-step aggregation: per partition + across partitions
• Flink?
• Cluster management :-/
• aseigneurin.github.io/2017/08/04/why-kafka-streams-didnt-work-for-us-part-1.html