Building event-driven
microservices with
Kafka Streams
Stathis Souris
Lead Software Engineer
Copyright ©2020 ThousandEyes, Inc. All Rights Reserved.  @ThousandEyes
Agenda
• Kafka
• Kafka Streams
• Endpoint Agent
• Kafka Streams Use Cases
• Production Issues
• Takeaways
• Q&A
Why Kafka
• Simple at first!
Why Kafka
• Complicated
Decoupling of data streams
Why Kafka
• Distributed, resilient architecture, fault tolerant
• Horizontal scalability
• High performance (latency of less than 10ms) - real time
• Used by well-known companies
– LinkedIn, Netflix, Airbnb, etc.
Apache Kafka: Use cases
• Messaging System
• Activity Tracking tool
• Gather metrics from different locations
• Application logs
• Stream processing (e.g. Kafka Streams or Spark)
• Decoupling of systems
• Works with Spark, Flink, Hadoop etc
What is Kafka Streams?
• Easy data processing and
transformation library within
Kafka
• Standard Java Application
• No need to create a separate
cluster
• Highly scalable, elastic and fault
tolerant (inherits from Kafka)
• Exactly Once Capabilities
• One record at a time processing
(no batching)
• Works for any application size
Kafka Streams Architecture Design
Kafka Streams history
• The API / Library was introduced as part of Kafka 0.10 (2016)
• Serious contender to other processing frameworks such as
Spark, Flink, NiFi etc
About the Endpoint Agent
• Agents that run on users' laptops or desktops
• Collect metrics from customers' browser interactions
• Perform network tests, e.g. ping and pathtrace, against various targets
• Check in every 10 minutes
• Alerts & Reports
High-level Architecture Overview
Why event-driven microservices?
• Operate at large scale (100K agents)
• Complex logic that needs to run at scale
• As real time as possible
• Asynchronous communication
Why Kafka Streams?
✓ Inherits Kafka's properties
✓ Simple DSL for
– Aggregations
– Windowing
✓ Streams & Tables
✓ <Key, Value>
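The "Streams & Tables" point can be illustrated without Kafka at all: a table is the latest value per key of a changelog stream. The sketch below is a plain-Java stand-in, not the Kafka Streams API; the class name and the sample keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the stream/table duality; not the Kafka Streams API.
class StreamTableDuality {

    // Replays a changelog of <key, value> records; the last value per key wins.
    static Map<String, String> toTable(List<Map.Entry<String, String>> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> record : changelog) {
            table.put(record.getKey(), record.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> changelog = List.of(
                Map.entry("agent-1", "online"),
                Map.entry("agent-2", "online"),
                Map.entry("agent-1", "offline"));
        System.out.println(toTable(changelog)); // {agent-1=offline, agent-2=online}
    }
}
```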
Scheduled Tests
Use case
Synthetic tests at an interval
Schedule tests on agents dynamically
Powerful visualization and filtering capabilities
Batch Job approach
• Agent checks in every 10 minutes
• Batch job runs to assign tests every 15 minutes
• Pull state from various DBs
• Run business logic
• Save assignments
After stress testing:
■ Latency increased as we added more agents
■ Could only scale vertically - not an option at
that point
Event Driven approach
• Stream of check-ins
• Use that stream to power the Scheduler
• Assign tasks on check-in event
Event Driven approach
✓ Application scales with number of
Kafka partitions
✓ Join with GlobalKTables
✓ Run the business logic
✓ Save assignments in KTable
Facts:
➢ All state lives in Kafka
➢ At least once delivery
➢ Materialize assignments in MongoDB:
○ Historical queries
○ Timeline of assignments
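The shape of the event-driven scheduler can be sketched in plain Java: each check-in event triggers a lookup and an assignment write, instead of a periodic batch job. Everything here is an assumption for illustration: the class and method names, keying tests by account, and the trivial "business logic". The two maps merely stand in for a GlobalKTable and the assignments KTable.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "assign tasks on check-in"; not the production code.
class CheckInScheduler {
    // Stand-in for a GlobalKTable of configured tests (keyed by account here, as an assumption).
    private final Map<String, List<String>> testsByAccount;
    // Stand-in for the KTable holding the current assignment per agent.
    private final Map<String, List<String>> assignments = new HashMap<>();

    CheckInScheduler(Map<String, List<String>> testsByAccount) {
        this.testsByAccount = testsByAccount;
    }

    // Called once per check-in event: join with the lookup table, then upsert the assignment.
    List<String> onCheckIn(String agentId, String accountId) {
        List<String> tests = testsByAccount.getOrDefault(accountId, List.of());
        assignments.put(agentId, tests);
        return tests;
    }

    List<String> currentAssignment(String agentId) {
        return assignments.getOrDefault(agentId, List.of());
    }
}
```

Because the write is an overwrite keyed by agent, processing the same check-in twice leaves the same assignment in place, which is the property that makes at-least-once delivery tolerable in a design like this.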
Interactive Queries
• Query in-memory KTable for assignments
directly
• Expose through a REST API
• Very fast
• When the state store is temporarily unavailable,
fall back to a MongoDB query
– zero-downtime deployments
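The fallback described above is a small pattern: try the local state store first, and go to MongoDB only when the store cannot be queried. A minimal sketch, assuming both lookups are simple functions (hypothetical stand-ins, as is the class name). In Kafka Streams an unavailable store surfaces as InvalidStateStoreException; the sketch uses IllegalStateException so it runs with the standard library alone.

```java
import java.util.Optional;
import java.util.function.Function;

// Fallback pattern only; both lookups are hypothetical stand-ins.
class AssignmentLookup {
    private final Function<String, String> stateStore; // stand-in for the local KTable store
    private final Function<String, String> mongo;      // stand-in for the MongoDB query

    AssignmentLookup(Function<String, String> stateStore, Function<String, String> mongo) {
        this.stateStore = stateStore;
        this.mongo = mongo;
    }

    Optional<String> find(String agentId) {
        try {
            // Fast path: serve the request from the in-process store.
            return Optional.ofNullable(stateStore.apply(agentId));
        } catch (IllegalStateException storeUnavailable) {
            // Slow path: the store is unavailable, e.g. during a rebalance or deployment.
            return Optional.ofNullable(mongo.apply(agentId));
        }
    }
}
```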
Checkin Reconciler:
React to application
events
Checkin Reconciler
Problem:
■ updating the KTable on every event
■ creating hot partitions that took too long to process
After 20K agents
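Why a coarse key produces hot partitions: every record with the same key goes to the same partition, so one busy key means one busy partition and one busy stream task. The sketch below uses String.hashCode() as a stand-in for Kafka's murmur2-based default partitioner; the key name is hypothetical and is not the key used in the talk.

```java
import java.util.List;

// Illustrates key skew; String.hashCode() stands in for Kafka's default partitioner.
class HotPartitionDemo {

    // Same key, same partition: this is what preserves per-key ordering.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many records land on each partition.
    static int[] load(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }
}
```

With 1,000 records that all share one key, a single partition receives all 1,000 and adding partitions does not help.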
Use KTable cache
Tune the commit interval of the application; the cache is flushed downstream on every commit.
StreamsConfig.COMMIT_INTERVAL_MS_CONFIG
Temporary solution
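How the cache and the commit interval interact: the record cache in front of a KTable collapses successive updates to the same key and forwards only the latest value when it is flushed, which happens on every commit or when the cache is full. A minimal sketch of the two settings follows, using plain string keys so it runs without the Kafka jars. The values and the application id are placeholders, not the ones used in production; check the StreamsConfig documentation for the names that apply to your Kafka version.

```java
import java.util.Properties;

// Placeholder values; only the two cache-related keys matter for this sketch.
class StreamsCacheSettings {

    static Properties build() {
        Properties props = new Properties();
        props.setProperty("application.id", "checkin-reconciler"); // hypothetical name
        // StreamsConfig.COMMIT_INTERVAL_MS_CONFIG: how often caches are flushed and offsets committed.
        props.setProperty("commit.interval.ms", Long.toString(30_000L));
        // StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG: total record cache size; 0 disables caching.
        props.setProperty("cache.max.bytes.buffering", Long.toString(10L * 1024 * 1024));
        return props;
    }
}
```

A longer interval gives the cache more time to absorb repeated updates per key, at the cost of downstream consumers seeing changes later.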
Long term fix
Removed repartitioning step and stored active check-ins in Redis instead
Alert Aggregator
Browser Session Metrics
✓ Real User Monitoring events coupled with network
tests
✓ No set interval
✓ Alerter needs binned data
✓ One minute window and emit aggregated metrics
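Binning into one-minute windows boils down to rounding each event timestamp down to the start of its window and aggregating per window. The sketch below shows that arithmetic in plain Java; it is not the Kafka Streams windowing API, and the class name and event layout are assumptions. Tumbling windows in Kafka Streams are likewise aligned to the epoch and cover [start, start + size).

```java
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java view of tumbling-window binning; not the Kafka Streams API.
class OneMinuteBins {
    static final long WINDOW_MS = 60_000L;

    // Start of the epoch-aligned window that contains the timestamp.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    // Each event is {timestampMs, metricValue}; returns statistics keyed by window start.
    static Map<Long, LongSummaryStatistics> bin(List<long[]> events) {
        Map<Long, LongSummaryStatistics> bins = new TreeMap<>();
        for (long[] event : events) {
            bins.computeIfAbsent(windowStart(event[0]), k -> new LongSummaryStatistics())
                .accept(event[1]);
        }
        return bins;
    }
}
```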
Window operator
Problem:
The alerting use case needs the aggregated event to be emitted once, when the window closes, not on every update.
Suppress operator
Problem:
Windowed aggregates took too long to reach the Alerter.
Aggregation was delayed?
Closing a window is driven by
events that advance the stream
time.
Solution:
Created a cron job that generates
an event every (window size +
grace period) to force the window
to close.
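The delay is easier to see in a small simulation. Stream time is the highest event timestamp seen so far, so it stands still when no events arrive, and a window is only considered closed once stream time reaches the window end plus the grace period. The sketch below models that rule in plain Java; it is not the Kafka Streams API, the class name is hypothetical, and exact boundary handling in the library may differ.

```java
// Simulates why a suppressed window result waits for newer events; not the Kafka Streams API.
class StreamTimeDemo {
    private final long windowMs;
    private final long graceMs;
    private long streamTimeMs = Long.MIN_VALUE; // highest event timestamp seen so far

    StreamTimeDemo(long windowMs, long graceMs) {
        this.windowMs = windowMs;
        this.graceMs = graceMs;
    }

    // Stream time only moves when an event arrives; wall-clock time plays no part.
    void onEvent(long timestampMs) {
        streamTimeMs = Math.max(streamTimeMs, timestampMs);
    }

    // The window starting at windowStartMs is closed once stream time reaches its end plus grace.
    boolean isClosed(long windowStartMs) {
        return streamTimeMs >= windowStartMs + windowMs + graceMs;
    }
}
```

The cron job plays the role of a caller that invokes onEvent with a fresh timestamp at a fixed cadence. Stream time in Kafka Streams is tracked per task rather than globally, so such synthetic events presumably need to reach every partition that holds open windows.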
Production issues
• Compaction wasn’t working in some cases
• Avoid repartitioning to hot keys
• Interactive queries misbehaved
– Metadata incorrect
– Created loop between services
Key Takeaways
✓ Use the KTable cache to de-duplicate events before
sending them downstream. Use “commit.interval.ms” to
your advantage.
✓ Avoid hot partition keys if possible, especially when
you are going big.
✓ Make sure compaction works for your topics
✓ If you don’t really use RocksDB, disable it
✓ Use binary format from the beginning if you are
going big
✓ Kafka as a DB is possible, but don’t overdo it
✓ Small latencies at the processor level add up
once you have lag (100 ms * 10,000 records = 1,000 s, about 16.7 min)
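The last takeaway is plain arithmetic: with one-record-at-a-time processing, the time to drain a backlog is per-record latency times backlog size. A small sketch, with a hypothetical class name:

```java
import java.time.Duration;

// Back-of-the-envelope drain time for a backlog processed one record at a time.
class LagDrainTime {

    static Duration drainTime(Duration perRecord, long backlog) {
        return perRecord.multipliedBy(backlog);
    }

    public static void main(String[] args) {
        Duration d = drainTime(Duration.ofMillis(100), 10_000);
        System.out.println(d.getSeconds() + " s = " + d.toMinutes() + " min " + d.toSecondsPart() + " s");
        // prints "1000 s = 16 min 40 s"
    }
}
```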
Q&A
Twitter: @efsouris
Blogpost:
https://medium.com/thousandeyes-engineering/kafka-streams-in-the-endpoint-agent-670a098ae7a4