Join me on a journey of exploration upriver with "Kongo", a scalable streaming IoT logistics demonstration application built with Apache Kafka, the popular open source distributed streaming platform. Along the way you'll discover: an example logistics IoT problem domain (the rapid movement of thousands of goods by truck between warehouses, with real-time checking of complex business and safety rules against sensor data); an overview of the Apache Kafka architecture and components; lessons learned from critical Kafka application design decisions; an example of Kafka Streams for checking truck load limits; and, to finish the journey, overcoming final performance challenges and shooting the rapids to scale Kongo on a production Kafka cluster.
https://aceu19.apachecon.com/session/kongo-building-scalable-streaming-iot-application-using-apache-kafka
ApacheCon Berlin 2019: Kongo: Building a Scalable Streaming IoT Application using Apache Kafka
1. Kongo: Building a Scalable
Streaming IoT Application
using Apache Kafka
Paul Brebner
Technology Evangelist, instaclustr.com
2. Overview
What we’ll see on the journey upriver
1. Kafka introduction
2. The Kongo problem
3. Application, Architecture and Design(s)
4. Streams extension
5. Scaling
5. What is Kafka?
Distributed streams processing
Message flow: distributed Producers (1) send messages (2) to distributed Consumers (3) via a Kafka cluster (4)
6. Kafka
Key Benefits
■ Fast – high throughput and low latency
■ Scalable – horizontally scalable, just add nodes and
partitions
■ Reliable – distributed and fault tolerant
■ Zero data loss
■ Open Source
■ Heterogeneous data sources and sinks
■ Available as an Instaclustr Managed service
9. Filtering, or which consumers get which messages, is topic based.
- Producers send messages to topics.
- Consumers subscribe to topics of interest, e.g. Parties.
- When they poll they only receive messages sent to those topics.
None of these consumers will receive messages sent to the “Work” topic.
[Diagram: a Producer sends to two topics, “Parties” and “Work”. Four Consumers are subscribed to Topic “Parties” and poll to receive messages from “Parties”; none of them are subscribed to “Work”, so none receive “Work” messages.]
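To make topic-based filtering concrete, here is a minimal sketch (not from the Kongo code) of a producer sending to the “Parties” topic and a consumer that subscribes to “Parties” only, using the standard Java clients; the broker address, serializers, record contents and group id are assumptions for the example.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartiesExample {
    public static void main(String[] args) {
        // Producer: sends a message to the "Parties" topic (assumed broker address)
        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "localhost:9092");
        pprops.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pprops)) {
            producer.send(new ProducerRecord<>("Parties", "invite", "Party at the warehouse!"));
        }

        // Consumer: subscribes to "Parties" only, so it never sees "Work" messages
        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "localhost:9092");
        cprops.put("group.id", "parties-consumers");
        cprops.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cprops.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops)) {
            consumer.subscribe(List.of("Parties"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.topic() + ": " + r.value());
            }
        }
    }
}
```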
11. Kafka also works like the Clone Army
It supports delivery of the same message to
multiple consumers using consumer groups.
Kafka doesn’t discard messages as soon as
they are delivered, so the same message
can be delivered to multiple consumer
groups.
Image: AKKHARAT JARUSILAWONG / Shutterstock.com
12. Consumers subscribed to ”Parties” topic are allocated partitions.
When they poll they will only get messages from their allocated
partitions.
[Diagram: a Producer writes to Topic “Parties” (Partitions 1, 2, 3); Consumers in two Consumer Groups are each allocated a subset of the partitions.]
13. This enables consumers in the same group to share the work
around. Each consumer gets only a subset of the available
messages.
[Diagram: a Producer writes to Topic “Parties” (Partitions 1, 2, 3); the partitions are divided among the Consumers in a Consumer Group, so consumers share work within groups.]
14. Multiple groups enable message broadcasting. Messages
are duplicated across groups, as each consumer group
receives a copy of each message.
[Diagram: a Producer writes to Topic “Parties” (Partitions 1, 2, 3); two Consumer Groups each receive a copy of every message, i.e. messages are duplicated across Consumer Groups.]
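To make the group behaviour concrete, here is a minimal sketch (assumed group names and broker address, not from the Kongo code): consumers that share a group.id split the partitions of “Parties” between them, while a consumer with a different group.id receives its own copy of every message.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupsExample {
    // Helper that builds a consumer subscribed to "Parties" in the given group
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("group.id", groupId);                      // group membership
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("Parties"));
        return consumer;
    }

    public static void main(String[] args) {
        // Work sharing: two consumers in the SAME group split the partitions between them
        var workerA = consumerInGroup("parties-workers");
        var workerB = consumerInGroup("parties-workers");

        // Broadcasting: a consumer in a DIFFERENT group gets its own copy of every message
        var auditor = consumerInGroup("parties-audit");
        // ... each consumer then polls in its own thread (KafkaConsumer is not thread safe)
    }
}
```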
15. 2 The Kongo
Problem
Why “Kongo”? Amazon was taken. The Congo is the 2nd biggest river (5,000 km) and the deepest. Congo -> Kongo (the Kingdom of Kongo).
35. The Story
RFID Unload events: Goods are now in the Warehouse.
Repeat from the start, with lots more warehouses, goods and trucks!
36. Rules
Goods
categories
■ Each Goods has 0 or more general Categories:
● Perishable
● Hazardous
● Fragile
● Edible
● Medicinal
● Bulky
● Dry
■ The real world is more complex
● 97 categories in Australia
37. Rules
Goods
categories
■ And 0 or 1 temperature category
● Frozen Temp
● Heat Sensitive Temp
● Cool Temp
● Room Temp
● Ambient Temp
■ Some warehouses/trucks are temperature controlled
39. Sensor rules
Goods have rules that check whether they are safe in the environment of a location (Warehouse or Truck): 20 metrics, some in common between Warehouses and Trucks.
E.g. keep your chickens cool
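As a sketch of what such a rule check might look like (hypothetical class names and temperature ranges, not the actual Kongo rules), temperature-sensitive Goods can be checked against the latest temperature reading from their current location:

```java
import java.util.Map;

public class TemperatureRuleCheck {
    // Hypothetical temperature categories with allowed ranges in degrees Celsius
    enum TempCategory {
        FROZEN(-25, -12), COOL(2, 8), ROOM(15, 25), AMBIENT(-5, 40);

        final double min, max;
        TempCategory(double min, double max) { this.min = min; this.max = max; }

        // A reading violates the rule if it falls outside the allowed range
        boolean violatedBy(double reading) { return reading < min || reading > max; }
    }

    public static void main(String[] args) {
        // Latest "temperature" sensor value for a location (truck or warehouse)
        Map<String, Double> sensorValues = Map.of("temperature", 12.5);

        // Chickens are COOL-category goods: keep your chickens cool!
        TempCategory chickens = TempCategory.COOL;
        double temp = sensorValues.get("temperature");
        if (chickens.violatedBy(temp)) {
            System.out.println("Violation: temperature " + temp + " outside "
                + chickens.min + ".." + chickens.max + " for COOL goods");
        }
    }
}
```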
40. 3 Application
Simulation: logical steps (sketched in code after this list)
■ Create Goods, Warehouses, Trucks
■ Simulate the next hour:
● Unload Goods into Warehouses
● Simulate Sensor values (Trucks and Warehouses)
● Check Goods + Sensor violations (Goods in Trucks and Warehouses)
● Check Goods + co-location violations (Goods on Trucks)
● Load Trucks with Goods, move Trucks to random Warehouses
■ Repeat
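A minimal sketch of that loop (hypothetical class and method names invented for illustration, not the actual Kongo code):

```java
// Hypothetical sketch of the Kongo simulation loop; the step methods are stubs
public class SimulationSketch {
    public static void main(String[] args) {
        createGoodsWarehousesAndTrucks();
        for (int hour = 0; hour < 24; hour++) {   // simulate a day, one hour per step
            unloadGoodsIntoWarehouses();
            simulateSensorValues();               // trucks and warehouses emit readings
            checkGoodsSensorViolations();         // Goods + Sensor rules
            checkColocationViolations();          // Goods + co-location rules on trucks
            loadTrucksAndMoveToRandomWarehouses();
        }
    }

    // Stubs standing in for the real simulation steps
    static void createGoodsWarehousesAndTrucks() { }
    static void unloadGoodsIntoWarehouses() { }
    static void simulateSensorValues() { }
    static void checkGoodsSensorViolations() { }
    static void checkColocationViolations() { }
    static void loadTrucksAndMoveToRandomWarehouses() { }
}
```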
41. Architecture
Monolithic
[Diagram: the simulation steps from slide 40 all run inside a single monolithic application, which outputs Rule violations.]
42. Architecture
De-coupled with event streams
[Diagram: the same simulation steps, now de-coupled by event streams: Load events, Unload events and Sensor events flow between the steps, and Rule violations are produced as output.]
43. Distributed Architecture
Separate Kafka producers and consumers: the Simulation runs as Kafka Producers, and Violation rules checking runs as Kafka Consumers.
The Simulation has perfect knowledge, but violation rules checking relies only on event stream data.
[Diagram: the simulation steps split into a Simulation side (Kafka Producers) and a rules-checking side (Kafka Consumers), connected by event streams; Rule violations are the output.]
44. Design Goal
Deliver events produced from each location to the Goods in the same location
[Diagram: for every location, events produced at that location are delivered to the Goods currently in that location.]
46. Design Variables
■ Topics: 1 (all locations in one topic) vs Many (one topic per location)
■ Consumers: 1 (Goods de-coupled from Consumers, Consumers << Goods) vs Many (every Goods is a Consumer Group, Goods == Consumers)
■ Problems? 100s of location topics are OK, more are not; too many consumer groups are not OK
53. Streams
1 of 4 Kafka APIs
■ Kafka has 4 APIs: Producer, Consumer, Connector and Streams!
■ The Streams API allows an application to act as a
stream processor:
● consuming an input stream from one or more topics,
● transforming the input streams to output streams, and
● producing an output stream to one or more topics
■ A stream is an unbounded, ordered, replayable,
continuously updating data set, consisting of
strongly typed key-value records.
54. Processor
Topology
DAGs of stream
processors (nodes)
that are connected by
streams (edges)
Processors transform
data by receiving one
input record, applying
an operation to it, and
producing output
records.
55. Streams
DSL
Streams and Tables
■ The Streams DSL has built-in abstractions for
streams and tables
● KStream, KTable, GlobalKTable, KGroupedStream, and
KGroupedTable.
■ The DSL supports a declarative, functional
programming style (see the sketch after this list), with
● stateless transformations (e.g. map and filter) as well as
● stateful transformations such as aggregations (e.g. count and
reduce), joins, and windowing.
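As a small illustration of the DSL (a minimal sketch with assumed topic names, serdes and application id, not the actual Kongo Streams topology), truck load/unload events could be grouped by truck and summed into a running weight per truck:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class TruckWeightStream {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Input stream: key = truckId, value = weight change (+ on load, - on unload)
        KStream<String, Long> weightChanges = builder.stream(
            "truck-weight-changes",                                   // assumed topic name
            Consumed.with(Serdes.String(), Serdes.Long()));

        // Stateful transformation: running total weight per truck (a KTable)
        KTable<String, Long> truckWeights = weightChanges
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            .reduce(Long::sum);

        // Output stream: downstream consumers can check it against each truck's load limit
        truckWeights.toStream()
            .to("truck-weights", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "truck-weight-demo");  // assumed
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed
        new KafkaStreams(builder.build(), props).start();
    }
}
```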
61. Streaming
Problems?
■ Anti-gravity?
● Sometimes truck weights went negative!
● Solution: turned on the “exactly-once” transactional setting (config sketch below)
● The transactional producer allows an application to send messages
to multiple partitions atomically
● Weights no longer go negative
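For reference, exactly-once processing in Kafka Streams is enabled with a single configuration setting, and a plain producer achieves the same atomicity via the transactional settings; a minimal sketch (assumed application id, transactional id and broker address):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfig {
    public static void main(String[] args) {
        // Kafka Streams: turn on exactly-once processing with one setting
        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "kongo-streams-demo"); // assumed
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed
        streamsProps.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
                         StreamsConfig.EXACTLY_ONCE); // the "exactly-once" transactional setting

        // Plain producer equivalent: a transactional producer sends to multiple
        // partitions atomically (initTransactions / beginTransaction / commitTransaction)
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");       // assumed
        producerProps.put("transactional.id", "kongo-weight-producer"); // assumed id
        producerProps.put("enable.idempotence", "true");
    }
}
```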
64. Scalability
alternatives
Scale out, scale up, and multiple clusters
Multiple clusters enable flexible scaling (e.g. a separate cluster for violations)
Different instance sizes have different network speeds
67. Scaling is
hard (1)
Actually hard to
achieve linear
scalability
Why? Kafka is
scalable, but:
■ Hash Collisions
● “Too many open files” exceptions, due to an increasing (and eventually too large) number of consumers
● Some consumers were timing out. Why? They were not receiving any events
● 300 locations and 300 partitions, but only 200 unique hash values, so only 200 consumers receive events and the rest time out
● This is due to hashing collisions: some partitions get more than one location, others get none
68. Key parking
problem
A well-known problem (Knuth, 1962)
Ensure: number of keys >>> number of partitions >= number of consumers (in a group)
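The effect is easy to reproduce. The following sketch (using String.hashCode() modulo the partition count as a simplified stand-in for Kafka's default murmur2-based partitioner) counts how many of 300 partitions actually receive any of 300 location keys, and typically finds only around two thirds of them occupied:

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionCollisionDemo {
    public static void main(String[] args) {
        int numPartitions = 300;
        Set<Integer> occupied = new HashSet<>();

        // 300 distinct location keys hashed onto 300 partitions
        for (int i = 0; i < 300; i++) {
            String locationKey = "location-" + i;
            // Simplified stand-in for the default partitioner (which uses murmur2)
            int partition = Math.abs(locationKey.hashCode() % numPartitions);
            occupied.add(partition);
        }

        // Typically only ~190-200 partitions are occupied; consumers assigned the
        // empty partitions never receive events and eventually time out
        System.out.println("Occupied partitions: " + occupied.size() + " of " + numPartitions);
    }
}
```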
70. Rebalancing
storms
■ Rebalancing storms result in some consumers not
receiving events (drop in throughput) and a very
slow start up time for new consumers (> 20s)
■ Need to ensure consumers are started and are
polling before trying to add lots more consumers
■ So try to keep total number of consumers as low as
practical (next…)
72. Consumers
Less is more
■ Even though we used the design with least
consumers…
■ If Kafka consumers take too long to read and process events, then more consumer threads (and more partitions) are needed, which impacts Kafka cluster scalability
■ Solution? Minimize consumer response time (see the sketch after this list)
● Only use consumers for reading events
● Do event processing asynchronously or in a separate thread pool
■ My #1 Kafka rule is
● “Kafka is easy to scale with the smallest number of consumers”
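A common way to keep consumer response time low, sketched below (assumed topic name, group id and pool size; error handling and offset management omitted), is to have the polling thread do nothing but read, handing processing off to a separate thread pool:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AsyncProcessingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("group.id", "violation-checkers");         // assumed group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService pool = Executors.newFixedThreadPool(8);  // processing happens here

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-events"));     // assumed topic
            while (true) {
                // The consumer thread only reads; it returns to poll() quickly
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                    pool.submit(() -> checkViolationRules(record));  // process asynchronously
                }
            }
        }
    }

    // Placeholder for the (potentially slow) rule-checking work
    static void checkViolationRules(ConsumerRecord<String, String> record) {
        // ... evaluate sensor and co-location rules for the Goods in this event
    }
}
```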
73. More
information
The End
■ Kongo code:
● https://github.com/instaclustr/kongo2
● https://github.com/instaclustr/kongokafkastreams
■ All blog series, including Kongo, and the latest:
● Anomalia Machina
ᐨ Kafka+Cassandra+Kubernetes, and
● Geospatial Anomalia Machina
ᐨ Kafka+Cassandra+Kubernetes+Geospatial queries & indexing
● https://www.instaclustr.com/paul-brebner/
■ Visual Introduction To Kafka
● https://www.instaclustr.com/resource/apache-kafka-a-visual-introduction/
■ The Instaclustr Managed Platform
● https://www.instaclustr.com/platform/
● Free Trial
ᐨ https://console.instaclustr.com/user/signup