Kafka has become extremely popular for streaming data, but it imposes very few constraints on the format of the data being streamed. As we wanted all of our Data Engineers and Data Scientists to use the data in our Kafka clusters, we soon faced the challenge of keeping data quality high. We developed a tool to monitor the quality of the streams in real time, and we had to make it scalable and fault tolerant. In this talk, we will look at the technical difficulties we encountered with our Kafka Streams implementation, and how we went through a major rewrite of the application to make it scale.
Data Quality Monitoring in Real-Time and at Scale with Kafka
1. DATA QUALITY MONITORING IN REAL TIME AND AT SCALE
ALEXIS SEIGNEURIN - @aseigneurin
OCTOBER 2017
2. MYSELF
• Data Engineer
• aseigneurin.github.io or @aseigneurin
3. DATA QUALITY MONITORING
PART 1
4. THE PROJECT
• A few Kafka clusters, lots of topics
• Analyze all the messages of all the Kafka topics
• Count the number of valid or invalid messages per second
• Push metrics to InfluxDB, graph with Grafana
5. LOTS OF COMPLEXITY TO HANDLE
• Topics = multiple partitions → Final results must be aggregated
• High volumes: 100k+ messages / second
• Fault tolerance + exactly once processing
• Count per window of 1 second with low latency
• Event-time processing (Kafka 0.10+)
• Data can arrive late → Update results
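The points above can be sketched in code. Below is a minimal, illustrative Java sketch (not the talk's actual implementation) of event-time counting in 1-second windows: each message is bucketed by its event time, and a late arrival simply updates the count of an already-emitted window, which is why downstream results must be updatable.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: count valid/invalid messages per 1-second event-time
// window. Class and key format are hypothetical, chosen for this example.
public class WindowedCounts {
    // Key = "<windowStartMs>/<status>", value = count
    final Map<String, Long> counts = new HashMap<>();

    void record(long eventTimeMs, boolean valid) {
        long window = (eventTimeMs / 1000) * 1000; // align to 1-second window
        String key = window + "/" + (valid ? "valid" : "invalid");
        counts.merge(key, 1L, Long::sum);
    }

    public static void main(String[] args) {
        WindowedCounts wc = new WindowedCounts();
        wc.record(1501273548123L, true);
        wc.record(1501273548456L, true);
        wc.record(1501273548789L, false);
        wc.record(1501273549100L, true);
        // Late arrival for the first window: the count is updated in place
        wc.record(1501273548999L, true);
        System.out.println(wc.counts.get("1501273548000/valid"));   // 3
        System.out.println(wc.counts.get("1501273548000/invalid")); // 1
    }
}
```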
6. EXAMPLE
         t0                     t1                t2
--------------------------------------------------------------
tx-0 | v | v | v | i | v |  | v | i |       | i | v | v |
--------------------------------------------------------------
tx-1 | v | i | v | v |      | v | v | v |   | i | i | v | v |
--------------------------------------------------------------
• t0
- 7 valid messages (4 in partition tx-0 + 3 in partition tx-1)
- 2 invalid messages (1 in each partition)
• t1
- 4 valid messages
- 1 invalid message
• t2
- 4 valid messages
- 3 invalid messages
7. KAFKA → KAFKA
• Kafka (raw data) → Kafka (metrics)
• Microservice (one function, only depends on Kafka)
$ kafka-console-consumer --topic data-checker-changelog --property print.key=true ...
{"topic":"tx","window":1501273548000,"status":"valid"} 7
{"topic":"tx","window":1501273548000,"status":"invalid"} 2
{"topic":"tx","window":1501273549000,"status":"valid"} 4
{"topic":"tx","window":1501273549000,"status":"invalid"} 1
{"topic":"tx","window":1501273550000,"status":"valid"} 4
{"topic":"tx","window":1501273550000,"status":"invalid"} 3
8. LATE DATA
• E.g. one more valid message with event time = t1
• Outputs one new metric:
• This topic is a change log
• Can use a compacted topic
{"topic":"tx","window":1501273549000,"status":"valid"} 4
...
{"topic":"tx","window":1501273549000,"status":"valid"} 5
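This is why a compacted topic works as the metrics change log: Kafka's log compaction eventually keeps only the latest record per key, so a late-data update (same key, new count) supersedes the old value. A minimal Java sketch of that semantics, using a hypothetical in-memory log rather than Kafka itself:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of log-compaction semantics: for each key, only the
// latest value survives. Not the talk's code; the log is simulated in memory.
public class CompactionSketch {
    static Map<String, Long> compact(List<Map.Entry<String, Long>> log) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (Map.Entry<String, Long> rec : log) {
            latest.put(rec.getKey(), rec.getValue()); // later records win
        }
        return latest;
    }

    public static void main(String[] args) {
        String key = "{\"topic\":\"tx\",\"window\":1501273549000,\"status\":\"valid\"}";
        List<Map.Entry<String, Long>> log = List.of(
            Map.entry(key, 4L),   // original count for the window
            Map.entry(key, 5L));  // update after one late message
        System.out.println(compact(log).get(key)); // 5
    }
}
```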
9. KAFKA → INFLUXDB
• Kafka (metrics change log) → InfluxDB (time series of metrics)
• Microservice: one function, can be restarted independently
> select valid, invalid from tx
name: tx
time valid invalid
---- ----- -------
1501273548000000000 7 2
1501273549000000000 4 1
1501273550000000000 4 3
10. LATE DATA
• Only the latest value is stored in InfluxDB
• New change log item = update in InfluxDB
> select valid, invalid from tx where time=1501273549000000000
name: tx
time valid invalid
---- ----- -------
1501273549000000000 5 1
11. FIRST IMPLEMENTATION WITH KAFKA STREAMS
PART 2
12. KAFKA STREAMS
• docs.confluent.io/current/streams/index.html
• Library to process data from Kafka
• Built on top of the Java Kafka client
• DSL + low-level API
• Leverages Consumer Groups → Horizontal scalability
13. IMPLEMENTATION
• Thin Scala wrapper for the Kafka Streams API
github.com/aseigneurin/kafka-streams-scala
messages
.map((_, message) => message match {
case _: GoodMessage => ("valid", 1)
case _: BadMessage => ("invalid", 1)
})
.groupByKey
.count(TimeWindows.of(1000), "metrics-agg-store")
.toStream
.map((k, v) => (MetricKey(inputTopic, k.window.start, k.key), v))
.to(metricsTopic)
14. REPARTITIONING
• Aggregations can only be done by key
• Repartition topic (internal topic) created by Streams
• (timestamps are preserved)
.map((_, message) => message match {
case _: GoodMessage => ("valid", 1)
case _: BadMessage => ("invalid", 1)
})
.groupByKey
valid 1
valid 1
invalid 1
valid 1
invalid 1
15. COUNTING
• Count per window of 1 second
• Streams creates an in-memory state store
• Backed by an internal change log (internal topic) for fault tolerance
.count(TimeWindows.of(1000), "metrics-agg-store")
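The state-store-plus-change-log mechanism above can be sketched as follows, as a minimal Java illustration (hypothetical names, not Kafka Streams' internals): every update to the in-memory store is also appended to a change log, and after a crash, replaying that log rebuilds the store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a state store backed by a change log for fault
// tolerance. The List stands in for the internal Kafka topic.
public class BackedStateStore {
    final Map<String, Long> store = new HashMap<>();
    final List<String[]> changelog;

    BackedStateStore(List<String[]> changelog) { this.changelog = changelog; }

    void increment(String key) {
        long v = store.merge(key, 1L, Long::sum);
        changelog.add(new String[]{key, Long.toString(v)}); // write-through
    }

    // Replay the change log to rebuild the in-memory state after a crash
    static BackedStateStore restore(List<String[]> changelog) {
        BackedStateStore s = new BackedStateStore(changelog);
        for (String[] rec : changelog) s.store.put(rec[0], Long.parseLong(rec[1]));
        return s;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        BackedStateStore s = new BackedStateStore(log);
        s.increment("1501273548000/valid");
        s.increment("1501273548000/valid");
        // Simulate a crash: rebuild a fresh store from the change log
        BackedStateStore recovered = BackedStateStore.restore(log);
        System.out.println(recovered.store.get("1501273548000/valid")); // 2
    }
}
```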
16. OUTPUT
• Aggregation result is a KTable → Turn it into a KStream
• Must read:
Duality of Streams and Tables, Confluent
docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
• Write the result to a topic
.toStream
.map((k, v) => (MetricKey(inputTopic, k.window.start, k.key), v))
.to(metricsTopic)
17. OUTPUT & LATE DATA
• Write with a key to preserve ordering
• Must read:
The world beyond batch: Streaming 101, Tyler Akidau
www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
$ kafka-console-consumer --topic metrics --property print.key=true ...
{"topic":"tx","window":1501273548000,"status":"valid"} 7
{"topic":"tx","window":1501273548000,"status":"invalid"} 2
{"topic":"tx","window":1501273549000,"status":"valid"} 4
...
{"topic":"tx","window":1501273549000,"status":"valid"} 5
18. PROBLEM 1 - HOT PARTITION
• Most messages are valid → Repartitioning creates a hot partition
• Can only be processed by 1 thread
-----------------------------
repartition-0
-----------------------------
repartition-1 |v|v|v|v|v|v|v|v|v|v|v|v|v|v|
-----------------------------
repartition-2 |i|i|i|
-----------------------------
repartition-3
-----------------------------
19. PROBLEM 2 - INTERNAL TOPICS
• 2 “internal” topics per input topic
• repartition
• changelog
• Need to allocate threads to read from the repartition topic
• Cannot reuse these topics for multiple input topics
20. REIMPLEMENTATION WITH PLAIN KAFKA CONSUMERS
PART 3
21. DESIGN - KEY IDEAS
• No repartitioning
• Directly aggregate per partition
• Final aggregation (across partitions) made by InfluxDB
• A single change log topic shared by all instances
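The final aggregation step is cheap precisely because it is a plain sum: each instance only counts per partition, and the cross-partition total for a window is the sum of the per-partition counts, which the design delegates to InfluxDB at query time. A minimal, illustrative Java sketch (hypothetical helper, not the talk's code):

```java
import java.util.Map;

// Illustrative sketch of the cross-partition aggregation that InfluxDB's
// sum() performs at query time.
public class CrossPartitionSum {
    static long total(Map<Integer, Long> countsPerPartition) {
        return countsPerPartition.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Valid counts for window 1501273548000, per partition (from the example)
        System.out.println(total(Map.of(0, 4L, 1, 3L))); // 7
    }
}
```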
22. DESIGN - IDEAS FROM KAFKA STREAMS
• Multi-threaded with one state store per thread
• 1 thread = 1 or multiple partitions
• No sharing of state across threads
• State store backed by the change log
• Used to read the state in case of crash / repartitioning
• Event time processing + Handling of late data
• Expiration of old windows
23. AGGREGATE PER PARTITION
         t0                     t1                t2
--------------------------------------------------------------
tx-0 | v | v | v | i | v |  | v | i |       | i | v | v |
--------------------------------------------------------------
tx-1 | v | i | v | v |      | v | v | v |   | i | i | v | v |
--------------------------------------------------------------
• Metrics per partition
$ kafka-console-consumer --topic metrics --property print.key=true ...
{"topic":"tx","partition":0,"window":1501273548000,"status":"valid"} 4
{"topic":"tx","partition":0,"window":1501273548000,"status":"invalid"} 1
{"topic":"tx","partition":1,"window":1501273548000,"status":"valid"} 3
{"topic":"tx","partition":1,"window":1501273548000,"status":"invalid"} 1
{"topic":"tx","partition":0,"window":1501273549000,"status":"valid"} 1
{"topic":"tx","partition":0,"window":1501273549000,"status":"invalid"} 1
{"topic":"tx","partition":1,"window":1501273549000,"status":"valid"} 3
{"topic":"tx","partition":0,"window":1501273550000,"status":"valid"} 2
{"topic":"tx","partition":0,"window":1501273550000,"status":"invalid"} 1
{"topic":"tx","partition":1,"window":1501273550000,"status":"valid"} 2
{"topic":"tx","partition":1,"window":1501273550000,"status":"invalid"} 2
24. STORAGE IN INFLUXDB
• Add the partition number as a tag
> select valid, invalid from tx
name: tx
time partition valid invalid
---- --------- ----- -------
1501273548000000000 0 4 1
1501273548000000000 1 3 1
1501273549000000000 0 1 1
...
25. AGGREGATION WITH INFLUXDB
• Leverage InfluxDB’s aggregation functionality
• Supported by Grafana
> select sum(valid) as valid, sum(invalid) as invalid from "topic-health" where time>=1501273548000000000
and time<=1501273550000000000
name: topic-health
time valid invalid
---- ----- -------
1501273548000000000 7 2
1501273549000000000 4 1
1501273550000000000 4 3
26. CHANGE LOG
• Change log = Metrics topic + source offsets
• When partitions are assigned, populate the state store from the change log
• Filter using the topic + partition number
$ kafka-console-consumer --topic changelog --property print.key=true ...
{"topic":"tx","partition":0,"window":1501273548000,"status":"valid"} {"value":4,"offset":5}
{"topic":"tx","partition":0,"window":1501273548000,"status":"invalid"} {"value":1,"offset":4}
{"topic":"tx","partition":1,"window":1501273548000,"status":"valid"} {"value":3,"offset":4}
...
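The recovery path above can be sketched in Java: because each change log record carries both the count and the source offset, a consumer that is assigned a partition can filter the change log on topic + partition, rebuild its state store, and seek past the messages it has already counted. The record shape and method names below are hypothetical, chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of restoring state + resume offset from a change log
// that stores (count, source offset) per metric key.
public class ChangelogRestore {
    record Entry(String topic, int partition, long window, String status,
                 long value, long offset) {}

    // Rebuild the state store for one assigned partition from the change log
    static Map<String, Long> restore(List<Entry> log, String topic, int partition) {
        Map<String, Long> store = new HashMap<>();
        for (Entry e : log)
            if (e.topic().equals(topic) && e.partition() == partition)
                store.put(e.window() + "/" + e.status(), e.value());
        return store;
    }

    // First source offset whose message has not been counted yet
    static long nextOffset(List<Entry> log, String topic, int partition) {
        long max = -1;
        for (Entry e : log)
            if (e.topic().equals(topic) && e.partition() == partition)
                max = Math.max(max, e.offset());
        return max + 1;
    }

    public static void main(String[] args) {
        List<Entry> changelog = List.of(
            new Entry("tx", 0, 1501273548000L, "valid", 4, 5),
            new Entry("tx", 0, 1501273548000L, "invalid", 1, 4),
            new Entry("tx", 1, 1501273548000L, "valid", 3, 4));
        // This instance was assigned tx-0 only
        System.out.println(restore(changelog, "tx", 0).get("1501273548000/valid")); // 4
        System.out.println(nextOffset(changelog, "tx", 0)); // 6
    }
}
```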
27. MAIN COMPONENTS
• StateStore
• In-memory store of active counts
• Cleanup every few minutes
• ChangelogWriter / ChangelogReader
• State store ↔ change log topic
• DataCheckerThread
• Main consumer loop
• ConsumerRebalanceListener
• Reconfigures the application when a partition rebalancing happens
28. LIFECYCLE
• On startup
• Wait for partitions to be assigned
• Read the change log → State store
• Start consuming data
• Every second
• Dump new values of the State Store to the change log
• Every 5 minutes
• Discard old windows
• On repartitioning = On startup
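The "discard old windows" step keeps the in-memory store bounded: windows older than a retention period are dropped on each cleanup pass. A minimal Java sketch (hypothetical retention value, not the talk's code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of periodic window expiration in the state store.
public class WindowExpiry {
    static void expire(Map<Long, Long> countsByWindow, long nowMs, long retentionMs) {
        // Drop every window that started before the retention horizon
        countsByWindow.keySet().removeIf(window -> window < nowMs - retentionMs);
    }

    public static void main(String[] args) {
        Map<Long, Long> counts = new HashMap<>();
        counts.put(1501273548000L, 7L);   // old window
        counts.put(1501273848000L, 4L);   // recent window
        expire(counts, 1501273900000L, 300_000L); // 5-minute retention (assumed)
        System.out.println(counts.containsKey(1501273548000L)); // false
        System.out.println(counts.containsKey(1501273848000L)); // true
    }
}
```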
29. CATCHING UP
• After an interruption
• Catches up at a different rate for each partition
31. SUMMARY
• New implementation
✓ Very robust: fault-tolerant, exactly once processing
✓ Event-time + Late data processing
✓ 10,000 messages per second with 2 threads
• Kafka Streams?
• Great library but needs 2-step aggregation: per partition + across partitions
• Flink?
• Cluster management :-/
• aseigneurin.github.io/2017/08/04/why-kafka-streams-didnt-work-for-us-part-1.html