Real-Time Processing of Spatial Data Using Kafka Streams, Ian Feeney & Roman Kolesnev | Current 2022
Kafka Streams applications can process fast-moving, unbounded streams of data. This gives us the capability to process and react to events from many sources in near real time as they converge in Kafka. However, if the events in these data streams have a spatial component and their spatial relationships with each other determine how they should be processed or reacted to, this raises some fundamental challenges. Determining that, for example, a person is within an area or that routes are intersecting requires access to geospatial operations which are not readily available in Kafka Streams.
In this talk, we will first set the scene with a geospatial 101. Then, using a simplified taxi-hailing use case, we will look at two approaches for processing spatial data with Kafka Streams. The first is a naive approach which uses the Kafka Streams DSL, geohashing and the Java Spatial4j library. The second is a prototype which replaces the RocksDB state store with Apache Lucene (an embedded storage engine with powerful indexing, search and geospatial capabilities), and implements a stateful spatial join with the Transformer API.
This talk will give you an appreciation of geospatial use cases and how Kafka Streams could enable them. You will see the role the state store plays in stateful processing and the implications for geospatial processing. It will also show you what is involved in integrating a custom state store with Kafka Streams. Overall, this talk will give you an understanding of how you might go about building custom processing capabilities on top of Kafka Streams for your own use cases.
1. Near real time processing of spatial data in Kafka Streams
Current 2022
Ian Feeney and Roman Kolesnev
October 2022
Austin, Texas
Spatial data in motion
2. Who are we?
Ian Feeney
Customer Innovation Engineer
ifeeney@confluent.io
Roman Kolesnev
Staff Customer Innovation Engineer
rkolesnev@confluent.io
Confluent Solution and Innovation Division
https://www.confluent.io/confluent-accelerators
3. Everything happens somewhere
● Proliferation of smartphones, GPS devices and IoT
● Vast quantities of spatial data being produced
● Real-time geolocation at the heart of UX, business processes and business models
● Consider Uber
● Every event happens somewhere
4. Spatial data in Kafka Streams
● Processing fast-moving, unbounded event streams
● React to events from many sources in real time
● How to process geospatial data in Kafka Streams?
5. Agenda
1. Geospatial 101 - Spatial data and operations
2. Use case and demo application
3. Standard Kafka Streams approach
4. Custom state store approach
6. Geospatial Data
● Represent and encode the geographies of real world entities
● Raster
○ image based format
○ pixels carry value
● Vector
○ represents points, lines and polygons using pairs of x,y coordinates
10. Geospatial Operations
● Geometry based functions
● Analysis of spatial data
○ Determine spatial relationships
○ Resolve spatial interactions
● Produce new data and insights
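Two of the operations the talk relies on, "is this point within a radius of another point" and "is this point inside an area", can be sketched in plain Java. This is only an illustration and not the talk's code: the class name `GeoOps`, the method names and the spherical-Earth radius constant are assumptions, and a library such as Spatial4j or Lucene would normally do this work.

```java
final class GeoOps {
    // Assumption: mean Earth radius in miles, treating the Earth as a sphere.
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between two lat/lon points (haversine formula). */
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    /** Spatial relationship: is the second point within radiusMiles of the first? */
    static boolean withinRadius(double lat1, double lon1,
                                double lat2, double lon2, double radiusMiles) {
        return distanceMiles(lat1, lon1, lat2, lon2) <= radiusMiles;
    }

    /**
     * Spatial relationship: is (x, y) inside the polygon?
     * Ray casting on planar coordinates; the polygon is an array of {x, y} vertices.
     */
    static boolean contains(double[][] polygon, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            // Does a horizontal ray from (x, y) cross the edge between vertices j and i?
            boolean crosses = (yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi;
            if (crosses) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

The planar polygon test is adequate for small areas; it ignores the curvature of the Earth and polygons that cross the antimeridian.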
17. Standard Kafka Streams Approach
● Kafka Streams Domain Specific Language (DSL) for the processing topology
● Geohashing for rekeying
● Java Spatial4j library for spatial operations
20. Rekey
Records on both sides of a join must be:
● keyed by the same field
● routed to the same partition and handled by the same processing task
[Diagram: processing topology. Inputs: Taxis, People. Steps: Buffer (mapValues), Rekey (flatMap), Rekey (map), Join, Filter. Output: TaxiPerson]
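The two Rekey steps can be illustrated without Kafka at all. The sketch below is plain Java standing in for what `KStream#map` and `KStream#flatMap` would do with the key. It assumes the person side is the one fanned out to several keys (the slides only say that person events are duplicated), and it swaps geohashes for a simple fixed-size grid so the neighbouring cells are easy to compute. `GridRekey`, `CELL_DEGREES` and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class GridRekey {
    // Hypothetical cell size in degrees. Each cell side must be at least as long
    // as the match radius, otherwise a 3x3 block of cells can miss real matches.
    static final double CELL_DEGREES = 0.05;

    static long cellIndex(double degrees) {
        return (long) Math.floor(degrees / CELL_DEGREES);
    }

    static String cellKey(long col, long row) {
        return col + ":" + row;
    }

    /** Taxi side (the role of KStream#map): one key, the cell that contains the taxi. */
    static String taxiKey(double lat, double lon) {
        return cellKey(cellIndex(lon), cellIndex(lat));
    }

    /**
     * Person side (the role of KStream#flatMap): nine keys, the person's own cell
     * plus its eight neighbours, so a taxi just across a cell border still
     * shares a key with the person. Poles and the antimeridian are ignored.
     */
    static List<String> personKeys(double lat, double lon) {
        long col = cellIndex(lon);
        long row = cellIndex(lat);
        List<String> keys = new ArrayList<>();
        for (long dc = -1; dc <= 1; dc++) {
            for (long dr = -1; dr <= 1; dr++) {
                keys.add(cellKey(col + dc, row + dr));
            }
        }
        return keys;
    }
}
```

Every person event becomes nine records under this scheme, and sharing a key only means "roughly nearby", so the join output still has to be filtered by exact distance.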
27. How are geohashes useful for us?
● Geospatial event key
● Taxi and person events that are within 2.5 miles of each other can be:
○ keyed on the same field
○ routed to the same partition
○ matched in the join
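For readers who have not met geohashes before, here is a minimal sketch of the standard encoding: longitude and latitude are bisected alternately, and every five bits become one base-32 character. The class and method names are illustrative, and the talk may well have used a library for this rather than hand-written code.

```java
final class Geohash {
    // Standard geohash alphabet: digits and lowercase letters without a, i, l, o.
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a lat/lon pair as a geohash with the given number of characters. */
    static String encode(double lat, double lon, int precision) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder();
        boolean useLon = true; // bits alternate, starting with longitude
        int bitCount = 0;
        int current = 0;
        while (hash.length() < precision) {
            double[] range = useLon ? lonRange : latRange;
            double value = useLon ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            if (value >= mid) {
                current = (current << 1) | 1; // upper half: emit 1
                range[0] = mid;
            } else {
                current = current << 1;       // lower half: emit 0
                range[1] = mid;
            }
            useLon = !useLon;
            if (++bitCount == 5) {            // five bits make one character
                hash.append(BASE32.charAt(current));
                bitCount = 0;
                current = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(57.64911, 10.40744, 6)); // prints "u4pruy"
    }
}
```

Because a shorter geohash is always a prefix of a longer one for the same point, the chosen precision sets the cell size, and every event that falls in the same cell gets the same key.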
33. Standard Kafka Streams Approach Downsides
● False positives = Extra processing
● Person events duplicated = Extra data
● Limitations of state store
○ No spatial operations
○ Non key based lookups not possible*
*Even foreign key joins use key based lookups under the hood
35. Custom State Store
Requirements:
• Embeddable in Java
• Supports Spatial operations
Apache Lucene - High performance, embeddable, written in Java. Supports indexing, search and spatial operations.
Incremental steps:
• WindowStore implementation using Lucene as storage engine.
• Extension of WindowStore interface.
• Custom SpatialJoin DSL operation implementation.
36. Custom State Store - Lucene Window Store
WindowStore implementation using Lucene as storage engine
37. Custom State Store - Lucene Window Store
Lucene WindowStore construction flow
38. Custom State Store - Lucene Spatial Window Store
Extension of WindowStore interface - adding spatial operations
39. Custom State Store - SpatialJoin
Custom SpatialJoin operation implementation
• Recreates KStreamToKStream join logic flow
• Join based on Key, Time and Point within radius lookup
• ValueTransformer implementation to stay at DSL level
SpatialJoinTransformer
• SpatialWindowStore - this
• SpatialWindowStore - other
• JoinWindows
• ValueJoiner
• PointExtractor
• joinRadius
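The lookup behind "Join based on Key, Time and Point within radius" can be pictured with a small in-memory model. This is not the prototype's code and uses neither the Kafka Streams nor the Lucene API: `SpatialWindowSketch`, `Entry` and `fetchWithinRadius` are hypothetical names, and the linear scan stands in for the indexed query a Lucene-backed store would run.

```java
import java.util.ArrayList;
import java.util.List;

final class SpatialWindowSketch {
    /** One stored record: join key, event time, location and payload. */
    record Entry(String key, long timestampMs, double lat, double lon, String value) {}

    // Assumption: mean Earth radius in miles, treating the Earth as a sphere.
    private static final double EARTH_RADIUS_MILES = 3958.8;

    private final List<Entry> entries = new ArrayList<>();

    void put(Entry entry) {
        entries.add(entry);
    }

    /**
     * Returns entries that match on all three join conditions: same key,
     * timestamp inside [fromMs, toMs], and location within radiusMiles of (lat, lon).
     */
    List<Entry> fetchWithinRadius(String key, long fromMs, long toMs,
                                  double lat, double lon, double radiusMiles) {
        List<Entry> matches = new ArrayList<>();
        for (Entry e : entries) {
            boolean sameKey = e.key().equals(key);
            boolean inWindow = e.timestampMs() >= fromMs && e.timestampMs() <= toMs;
            if (sameKey && inWindow
                    && distanceMiles(lat, lon, e.lat(), e.lon()) <= radiusMiles) {
                matches.add(e);
            }
        }
        return matches;
    }

    /** Great-circle distance in miles (haversine formula). */
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }
}
```

In a join built this way, each incoming record would query the other side's store with its own key, the bounds taken from the join window, the point produced by the point extractor and the join radius, then pass each match to the value joiner. The radius check happens inside the store, so no separate Filter step is needed afterwards.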
40. Custom State Store - SpatialJoin
Custom SpatialJoin operation implementation architecture
41. Vanilla Kafka Streams vs. Custom State Store
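[Diagram: the two topologies side by side.
Vanilla Kafka Streams: Taxis, People; Buffer (mapValues); Rekey (flatMap); Rekey (map); Join; Filter; TaxiPerson. State stores: RocksDB.
Custom state store: Taxis, People; Buffer (mapValues); Rekey (flatMap); Rekey (map); SpatialJoin; TaxiPerson. State store: Lucene.]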
43. And Finally……
1. You can use spatial semantics for event processing in Kafka Streams
2. Implementation of custom state store enables:
a. Spatial operations at the state store layer
b. Usage of secondary indexes and composite queries in state store data lookups, including non-key-based lookups
44. We would love to hear from you.
Ian Feeney
ifeeney@confluent.io
Roman Kolesnev
rkolesnev@confluent.io
https://www.confluent.io/confluent-accelerators
https://confluent.io/community/ask-the-community