This presentation will explore how we added location data to a scalable real-time anomaly detection application built around Apache Kafka and Apache Cassandra. Kafka and Cassandra are designed for time-series data; however, it’s not so obvious how they can efficiently process spatiotemporal data (space and time). In order to find location-specific anomalies, we need ways to represent, index, and query locations. We explore alternative geospatial representations, including latitude/longitude points, bounding boxes and geohashes, and go vertical with 3D representations, including 3D geohashes. For each representation we also explore possible Cassandra implementations, including clustering columns, secondary indexes, denormalized tables, and the Cassandra Lucene Index Plugin. To conclude, we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
ApacheCon NA 2020 Geospatial track presentation https://www.apachecon.com/acah2020/tracks/geospatial.html
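To make the geohash representation concrete before diving in: a geohash encodes a latitude/longitude pair as a short base32 string in which nearby points share a common prefix, which is what makes it a natural fit for Cassandra clustering columns and range queries. Below is a minimal sketch of the encoding in pure Python; it is illustrative only, not code from the talk, and the coordinates are arbitrary.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Interleave longitude/latitude bisection bits, 5 bits per character."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True  # even-numbered bits encode longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:
            bits |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits make one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(-35.28, 149.13))  # Canberra; longer hashes mean smaller cells

Each additional character shrinks the cell, so truncating a geohash gives a cheap, coarse bounding box: that prefix property is what makes clustering-column and denormalized-table implementations of location queries possible.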
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
Geospatial data makes it possible to leverage location, location, location! Geospatial data is taking off, as companies realize that just about everyone needs the benefits of geospatially aware applications. As a result, there is no shortage of unique but demanding use cases of how enterprises are leveraging large-scale and fast geospatial big data processing. The data must be processed in large quantities - and quickly - to reveal hidden spatiotemporal insights vital to businesses and their end users. In the rush to tap into geospatial data, many enterprises will find that representing, indexing and querying geospatially-enriched data is more complex than they anticipated - and might bring about tradeoffs between accuracy, latency, and throughput.
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
Updated version of presentation for 30 April 2020 Melbourne Distributed Meetup (online)
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
This is a slightly shorter version of previous ones.
Google Cloud Special Edition, Sydney Data Engineering Meetup
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/269146076/
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su... - DataStax
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTLs.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect and a day-one employee of ProtectWise, Inc., specializing in massive real-time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full-fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
What is in All of Those SSTable Files Not Just the Data One but All the Rest ... - DataStax
Have you ever wondered what is in all of those SSTable files and how they help Cassandra find and manage your data? If you go to the DataStax website they will give you a high-level explanation of what is in each file. In this talk we will go much deeper, explaining each file and walking through a dump of its contents. We will also explore the differences between Cassandra 2.1 and 3.4.
About the Speaker
John Schulz Principal Consultant, The Pythian Group
John has 40 years of experience working with data: data in files and in databases, from flat files through ISAM to relational databases and, most recently, NoSQL. For the last 15 years he's worked on a variety of open source technologies including MySQL, PostgreSQL, Cassandra, Riak, Hadoop and HBase. He has been working with Cassandra since 2010. For the last eighteen months he has been working for The Pythian Group, helping their customers improve their existing databases and select new ones.
Discussion about the evolution of metrics in Cassandra from 1.0 to 3.0: how the metric changes impact operational tooling, the pros and cons of different metric representations, and how and why DataStax OpsCenter collects and stores metrics. Includes a deep dive into how DataStax OpsCenter represents and stores the different kinds of metrics to provide visibility beyond simple cluster averages, both behind the scenes and in the rendering.
About the Speaker
Chris Lohfink Software Engineer, DataStax
I am a Java, Python, and Clojure developer who has been using Cassandra in application development and operational contexts for the last five years. For nearly two years I have been working with the OpsCenter Monitoring team at DataStax to improve the accuracy and breadth of the visualization tooling available.
What We Learned About Cassandra While Building go90 (Christopher Webster & Th... - DataStax
Go90 is a mobile entertainment platform offering access to live and on-demand videos. We built the web services platform and social features like the activity feed for go90 by making heavy use of Cassandra and Scala, and would like to share what we learned while developing and operating go90. In this presentation, we cover our data model evolution from the initial prototypes to the current production version, and the significant performance gain from a better data model. We will explain how we apply time-series data modeling and the benefits of using expiring columns with DateTieredCompactionStrategy. We will also talk about interesting experiences with table modifications, tombstones and table pagination. On the operations side, we will discuss our findings on Java driver usage, performance, monitoring, cluster maintenance, version upgrades, two-way SSL and more. We hope you can learn from our mistakes instead of making them yourself!
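For readers unfamiliar with the expiring-columns pattern mentioned above, here is a hedged sketch (my own illustration, not go90's actual schema) of a time-series table with a default TTL and DateTieredCompactionStrategy, using the Python driver:

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("demo")  # assumes a local test cluster
session.execute("""
    CREATE TABLE IF NOT EXISTS activity_feed (
        user_id    text,
        event_time timestamp,
        event      text,
        PRIMARY KEY ((user_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
      AND default_time_to_live = 2592000   -- rows expire after 30 days
      AND compaction = {'class': 'DateTieredCompactionStrategy'}
""")

Because whole SSTables age out together under DTCS, expired time-series data can be dropped without the tombstone-heavy compactions that per-row deletes cause. (Later Cassandra versions favour TimeWindowCompactionStrategy for the same job.)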
About the Speakers
Christopher Webster Software Engineer, AOL
Christopher Webster works on the web services platform for the go90 AOL project. Previously he was a Computer Scientist for the Mission Control Technologies project at NASA Ames Research Center. Chris worked as a senior staff engineer at Sun Microsystems on Project zembly, the cloud development and deployment environment, as well as being technical lead on many NetBeans projects. Chris is an author of the NetBeans Field Guide and Assemble the Social Web With Zembly.
Thomas Ng Software Engineer, AOL
Thomas Ng is a software engineer at AOL, building web services for the go90 mobile entertainment platform using Cassandra, Scala and Kafka.
Cassandra is the dominant data store used at Netflix and its health is critical to many of its services. In this talk we will share details of the recent redesign of our health monitoring system and how we leveraged a reactive stream processing system to give us a real-time view of our entire fleet while dramatically improving accuracy and reducing false alarms in our alerting.
About the Speaker
Jason Cacciatore Senior Software Engineer, Netflix
Jason Cacciatore is a Senior Software Engineer at Netflix, where he's been working for the past several years. He's interested in stateful distributed systems and has a diverse background in technology. In his spare time he enjoys spending time with his wife and two sons, reading non-fiction, and watching Netflix documentaries.
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra... - DataStax
Cassandra is getting more and more buzz, and that means two things: more development and more issues. Some issues are unavoidable, but some of them are avoidable just by understanding how our tooling works.
In this talk I'd like to review the core concepts on which Cassandra is built and how they dictate the way we should work with it, using some examples that will hopefully give you both a 'Quick Reference' and a 'Checklist' to go through every time you want to build scalable data models.
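As a flavour of that query-first approach (my own minimal example, not from Carlos's slides): choose the partition key from the query you need to serve, and let clustering columns return sorted, single-partition results.

import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")  # assumes a local test cluster
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_sensor_day (
        sensor_id text,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)   -- one bounded partition per sensor per day
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
# The query the table was designed for: hits one partition, already sorted.
rows = session.execute(
    "SELECT ts, value FROM readings_by_sensor_day "
    "WHERE sensor_id = %s AND day = %s LIMIT 100",
    ("sensor-42", datetime.date(2016, 9, 1)))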
About the Speaker
Carlos Alonso Software Engineer, Job and Talent
Carlos received his Masters in CS at Salamanca University, Spain. He worked there for a few years in a digital agency, gaining expertise across a very wide range of technologies, before moving to London, where he narrowed his focus to the backend and data engineering disciplines. The latest step in his professional career was moving back to Madrid to work for Job and Talent, where he currently helps build the best candidate-job-opening matching technology. Aside from work he likes sharing as much as he can through public speaking, mentoring, and getting involved in OSS and OpenData initiatives.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise - Patrick McFadin
Wait! Back away from the Cassandra secondary index. It's OK for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
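For context, DSE Search exposes Solr through CQL via the special solr_query pseudo-column, so a search looks something like the hedged sketch below (table and fields are invented for illustration):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("store")  # assumes a DSE Search-enabled node
rows = session.execute(
    """SELECT * FROM products
       WHERE solr_query = '{"q": "description:waterproof", "fq": "price:[10 TO 100]"}'""")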
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*... - DataStax
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. For a team of three to maintain, this is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open-sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically, I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a Cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered, and I will walk through examples of using these commands to identify and remediate cluster-wide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off - Timescale
The earliest relational databases were monolithic on-premises systems that were powerful and full-featured. Fast forward to the Internet and NoSQL: BigTable, DynamoDB and Cassandra. These distributed systems were built to scale out for ballooning user bases and operations. As more and more companies vied to be the next Google, Amazon, or Facebook, they too "required" horizontal scalability.
But in a real way, NoSQL and even NewSQL have forgotten single node performance where scaling out isn't an option. And single node performance is important because it allows you to do more with much less. With a smaller footprint and simpler stack, overhead decreases and your application can still scale.
In this talk, we describe TimescaleDB's methods for single node performance. The nature of time-series workloads and how data is partitioned allows users to elastically scale up even on single machines, which provides operational ease and architectural simplicity, especially in cloud environments.
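The core of that single-node approach is a one-line conversion of an ordinary PostgreSQL table into a chunked "hypertable". A minimal sketch via psycopg2 (connection details and table are placeholders; assumes the TimescaleDB extension is installed):

import psycopg2

conn = psycopg2.connect("dbname=tsdb user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            time   TIMESTAMPTZ NOT NULL,
            device TEXT,
            cpu    DOUBLE PRECISION
        )""")
    # Partition the table into time-based chunks; existing queries are unchanged.
    cur.execute("SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE)")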
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr... - DataStax Academy
Video: http://youtu.be/B-bTPSwhsDY
Abstract
Patrick McFadin (@PatrickMcFadin), Chief Evangelist for Apache Cassandra at DataStax, will be presenting an introduction to Cassandra as a key player in database technologies. Large and small companies alike have chosen Apache Cassandra as their database solution, and Patrick will be presenting on why they made that choice.
Patrick will also be discussing Cassandra's architecture, including: data modeling, time-series storage and replication strategies, providing a holistic overview of how Cassandra works and the best way to get started.
About Patrick McFadin
Prior to working for DataStax, Patrick was the Chief Architect at Hobsons, an education services company. His responsibilities included ensuring product availability and scaling for all higher education products. Prior to this position, he was the Director of Engineering at Hobsons, which he came to after they acquired his company, Link-11 Systems, a software services company. While at Link-11 Systems, he built the first widely popular CRM system for universities, Connect. He obtained a BS in Computer Engineering from Cal Poly, San Luis Obispo and holds the distinction of being the only recipient of a medal (as far as anyone can find out) for hacking while serving in the US Navy.
A detailed design for a robust counter, as well as a design for a completely online multi-armed bandit implementation that uses the new Bayesian Bandit algorithm.
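The Bayesian Bandit algorithm referred to here is usually realised as Thompson sampling; a minimal online sketch (mine, not the slide deck's design) keeps a Beta posterior per arm, samples from each, plays the winner, and updates:

import random

class BayesianBandit:
    def __init__(self, n_arms):
        self.wins   = [1] * n_arms   # Beta(1, 1) uniform priors
        self.losses = [1] * n_arms

    def choose(self):
        # Sample a plausible success rate per arm, play the best sample.
        samples = [random.betavariate(w, l)
                   for w, l in zip(self.wins, self.losses)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1

bandit = BayesianBandit(3)
hidden_rates = [0.1, 0.5, 0.3]              # unknown to the bandit
for _ in range(1000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < hidden_rates[arm])
print(bandit.wins)  # arm 1 should have attracted most of the plays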
Cassandra Community Webinar | In Case of Emergency Break Glass - DataStax
The design of Apache Cassandra allows applications to provide constant uptime. Peer-to-peer technology ensures there are no single points of failure, and the consistency guarantees allow applications to function correctly while some nodes are down. There is also a wealth of information provided by the JMX API and the system log. All of this means that when things go wrong you have the time, information and platform to resolve them without downtime. This presentation will cover some of the common, and not so common, performance issues, failures and management tasks observed in running clusters. Aaron will discuss how to gather information and how to act on it. Operators, developers and managers will all benefit from this exposition of Cassandra in the wild.
Aggregation is ubiquitous and data is no exception. This slide deck presents the data aggregation concept and The HDF Group's approach to the data aggregation problem in Earth Science. A JPSS data aggregation tool called "nagg" is explained as a showcase example.
Slides from my talk at Cassandra Summit 2016 on troubleshooting Cassandra. This is a reprise of my popular talk from last summit, reorganized, expanded, and updated for Cassandra 3.0. In it I share the secrets I've learned in four years of supporting hundreds of customers using Apache Cassandra and DataStax Enterprise. Be sure to check out presenter notes for additional tips and links to further resources.
InfiniCortex and the Renaissance in Polish Supercomputing - inside-BigData.com
In this deck from the DDN User Group at SC16, Marek Michalewicz, Deputy Director, ICM, University of Warsaw, presents: From Singapore to Warsaw: Infinicortex and the Renaissance in Polish Supercomputing.
"Over the past two years, InfiniCortex has clearly demonstrated that IB can perform over trans-continental distances, exploiting this technology to create a “Galaxy of Supercomputers” (a term coined by Marek Michalewicz and Yuefan Deng whose research focus is on mathematically optimal network topologies for supercomputers), a worldwide IB network spanning sites across Asia, Europe and North America. Initiated and led by A*STAR CRC (Agency for Science, Technology and Research - Computational Resource Centre in Singapore), the project hit its first major breakthrough at SuperComputing14 (SC14), showcasing a first-time-ever 100G IB transcontinental connection from Singapore to the SC14 venue in New Orleans (USA). This was made possible primarily thanks to the support of TATA Telecommunications, which provided the 100G trans-pacific link from Singapore to the US, and Obisidian Strategics, the Canadian manufacturer of the IB long-range equipment, that made available a number of units to be deployed in the participating sites."
Watch the video presentation: http://wp.me/p3RLHQ-g5g
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Introduction to the Oakforest-PACS Supercomputer in Japan - inside-BigData.com
In this deck from the DDN User Group at SC16, Prof. Taisuke Boku from the University of Tsukuba & JCAHPC presents: Oakforest-PACS: Overview of the New JCAHPC Computing Facility.
The University of Tokyo, the University of Tsukuba, and Fujitsu Limited recently announced that the Oakforest-PACS massively parallel cluster-type supercomputer, built by Fujitsu and operated by the Joint Center for Advanced High Performance Computing (JCAHPC), has achieved a LINPACK performance result of 13.55 petaflops, as ranked in the November Top500 list for supercomputer performance. Given this, Oakforest-PACS has surpassed the K computer to officially become the highest performance supercomputer in Japan. The system's peak performance is 25 petaflops, which is about 2.2 times that of the K computer.
"Thanks to DDN’s IME Burst Buffer, researchers using Oakforest-PACS at the Joint Center for Advanced High Performance Computing (JCAHPC) are able to improve modeling of fundamental physical systems and advance understanding of requirements for Exascale-level systems architectures. With DDN’s advanced technology, JCAHPC has achieved effective I/O performance exceeding 1TB/s in writing tens of thousands of processes to the same file."
Watch the video presentation: http://wp.me/p3RLHQ-g3D
Learn more: http://ddn.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Learn how to achieve scale with MongoDB. In this presentation, we cover three different ways to scale MongoDB, including optimization, vertical scaling, and horizontal scaling.
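As a taste of the horizontal option, here is a hedged pymongo sketch of enabling sharding and distributing a collection by a hashed key (database and collection names are placeholders; the client must point at a mongos router):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # a mongos, not a bare mongod
client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.events",
                     key={"user_id": "hashed"})     # hashed key spreads writes evenly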
Cassandra gives operations a lot of control over the system by forcing them to make a lot of decisions they'd rather not make around cluster topology changes. Hecuba2 is a tool that helps to automate that. Hecuba2 has a library component and an agent component. The library provides an API for manipulating Cassandra topologies, and the agent runs on all Cassandra hosts and converges the existing topology to the generated topology.
Hecuba2 is running in production at Spotify and has been remarkably bug free since being rolled out. It supports creating a cluster, expanding a cluster, and replacing nodes.
This talk will cover the design of Hecuba2 and how to deploy it.
About the Speaker
Radovan Zvoncek Backend Engineer, Spotify
After graduating with a master's degree in distributed systems, I joined Spotify as a backend engineer. For the past three years I've been involved in Cassandra operations, as well as the cultivation of the Cassandra ecosystem at Spotify.
C* for Deep Learning (Andrew Jefferson, Tractable) | Cassandra Summit 2016 - DataStax
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark and Kafka running on Mesos in AWS is a scalable architecture that is fast and easy to set up and maintain to deliver a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
RasterFrames: Enabling Global-Scale Geospatial Machine Learning - Astraea, Inc.
RasterFrames™, a proposed LocationTech project, brings the power of Spark SQL and Spark ML to the analysis of global-scale geospatial-temporal raster data. Employing the rich geospatial primitives of LocationTech GeoTrellis and GeoMesa, RasterFrames provides scientists, data scientists and software developers with a unified data and compute model for building image processing pipelines for ETL, data-product creation, statistical analysis, supervised & unsupervised machine learning, and deep learning. Data scientists particularly benefit from the DataFrame-centric entrypoint into big data geospatial analytics.
This talk will introduce RasterFrames, explaining the need it fulfills, the capabilities it provides, and context for determining if RasterFrames is right for the problems you're trying to solve.
By Simeon Fitch
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App... - InfluxData
InfluxDB IOx Tech Talks - December 2020
A Rusty Introduction to Apache Arrow and How it Applies to a Time Series Database
This session will start with a tech talk from an InfluxDB IOx team member. This is your chance to interact directly with Influxers who are available to answer your questions about all things InfluxDB IOx and time series — including Paul Dix, Founder and CTO of InfluxData. This event will last about an hour and there will be time for live Q&A.
3D visualization today has ever-expanding applications in science, education, engineering, medicine, interactive multimedia like games, etc. Producers of graphics processing units (GPUs), specialized electronic circuits designed to rapidly manipulate and alter computer memory so as to massively accelerate the visualization of 3D environments, bring ever-faster products to the market every six months, which is rapidly increasing the possibilities of near-future visualization/simulation methods.
Cracking the Nut, Solving Edge AI with Apache Tools and Frameworks - Timothy Spann
27-April-2021. Developer Week Europe. OPEN Stage A. 11:00
Using Apache Flink, Apache Airflow, Apache Arrow, Apache NiFi, Apache Kafka, Apache MXNet, DJL.AI, Apache Tika, Apache OpenNLP, Apache Kudu, Apache Impala, Apache HBase and more open source tools for edge AI.
Database@Home - Maps and Spatial Analyses: How to Use Them - Tammy Bednar
The converged Oracle Database lets you store, manage, and query location information. Built-in SQL, REST, Java, and JavaScript APIs with JSON let you analyze addresses, GPS coordinates, coverage areas, territories and other geographic information. The first half of the session will describe the rich set of spatial analysis and spatial application development features of the database. The second half will be a case study of how these features are used in a Marketing Analytics platform that helps identify prospective customers and provide a compelling customer experience.
This is a slide deck that I have been using to present on GeoTrellis at various meetings and workshops. The information speaks to the GeoTrellis pre-1.0 release in Q4 of 2016.
Storage, Manipulation and Analysis of Raster Data with PostGIS Raster - ACSG Section Montréal
One of the most important new features of the open source spatial database PostgreSQL/PostGIS 2.0 is support for raster data. PostGIS Raster includes an import tool similar to shp2pgsql, based on GDAL, and a series of SQL operators for manipulating and analyzing raster data. The new RASTER type is georeferenced, multi-resolution and multi-band, and it supports a nodata value and a per-band pixel type. PostGIS Raster draws on the simplicity of the vector experience offered by PostGIS to make all raster operations as simple as possible. As with a vector coverage, a raster coverage is divided into a set of records (one row = one tile) stored in a single table (unlike Oracle Spatial, which uses two types and therefore two or more tables). A complete coverage can be imported and retiled in a single command with the import tool, and multiple resolutions of the same coverage can be imported into adjacent tables. The properties of raster objects and of each band can be inspected and modified, as can the pixel values. Functions exist to obtain the minimum, maximum, sum, mean, standard deviation and histogram of a tile or of a complete coverage. The ST_Intersection() and ST_Intersects() functions work almost transparently between raster and vector data, and a series of map algebra functions (ST_MapAlgebra()) enables raster-style analysis. Bands can be reclassified and converted to any GDAL-writable format. Functions for generating rasters and bands also exist for PL/pgSQL development. A GDAL driver for converting raster coverages to image files is under development, and plugins for QGIS and gvSIG already exist for viewing them.
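To make the raster/vector interplay concrete, here is a small hedged sketch querying a raster coverage at a point from Python (table name and coordinates are invented; ST_Value and ST_Intersects are standard PostGIS raster functions):

import psycopg2

conn = psycopg2.connect("dbname=gis")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT ST_Value(rast, 1, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
        FROM elevation   -- one row per tile
        WHERE ST_Intersects(rast, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
    """, (-73.6, 45.5, -73.6, 45.5))
    print(cur.fetchone())   # pixel value of band 1 at the point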
I gave this talk at Buzzwords just now to fill in for an ill speaker.
The topics include things that are being added to or taken out of Mahout. These include cruft (out), fast clustering (in), nearest neighbor search (in), Pig bindings for Mahout (who knows).
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Faceted searching is a must-have feature for enhancing findability and user engagement in an enterprise search UI. The faceted searching features of Apache Solr have been a major factor in its popularity, but many Solr users don't fully appreciate all of the capabilities that are available. In this session we will deep dive into the different types of data facets that Solr supports, discussing in detail the various options that can be used to explore them. We will also review some specific techniques for dealing with several complex use cases, and discuss some performance "gotchas" and how to avoid them.
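For readers new to it, field faceting is driven by a handful of request parameters; a minimal sketch against Solr's HTTP API (collection and field names are placeholders):

import requests

resp = requests.get("http://localhost:8983/solr/products/select", params={
    "q": "*:*",
    "rows": 0,                 # counts only, no documents
    "facet": "true",
    "facet.field": "category",
    "facet.limit": 10,
})
print(resp.json()["facet_counts"]["facet_fields"]["category"])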
Crunching Data with Google BigQuery. Jordan Tigani at Big Data Spain 2012 - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani
Data Wars: The Bloody Enterprise Strikes Back - Victor_Cr
I would like to describe cases where we create problems for "future us" just by accident. I will show how different Java data types can ease or increase the pain of supporting the application later, covering the most common pitfalls and tricky corner cases you have probably never thought about.
This talk will show what is possible with the huge datasets that are becoming more prevalent in the era of big data. I will demonstrate this, and the 3D visualization, in the Jupyter notebook, by now the almost-standard environment of (data) scientists.
With large astronomical catalogues containing more than a billion stars becoming common, we are preparing methods to visualize and explore these large datasets. Data volumes of this size require different visualization techniques, since scatter plots become too slow and meaningless due to overplotting. We solve the performance and visualization issue using binned statistics, e.g. histograms, density maps, and volume rendering in 3D. The calculation of statistics on N-dimensional grids is handled by a Python library called vaex, which I will introduce. It can process at least a billion samples per second, to produce, for instance, the mean of a quantity on a regular grid. These statistics can be calculated for any mathematical expression on the data (numpy style), on the full dataset or on subsets specified by queries/selections.
However, to visualize higher-dimensional data in the notebook interactively, no proper solution existed. This led to the development of ipyvolume, which can render 3D volumes and up to a million glyphs (scatter and quiver plots) in the Jupyter notebook as a widget. With the browser as a platform, and the release of ipywidgets 6.0, these 3D plots can also be embedded in static HTML files and render on nbviewer. This allows for sharing with colleagues, rendering on your tablet (paperless office), outreach, press release material, etc. Full-screen stereo rendering allows for a virtual reality experience using your phone and Google Cardboard, a minor investment compared to other VR head-mounted displays. Overlaying 3D quiver plots on a 3D volume rendering allows exploring a 6D (or higher) space.
Vaex and ipyvolume can be used together to explore and visualize any large tabular dataset, or separately to calculate statistics and render 3D plots in the notebook and outside it.
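A brief sketch of that workflow (APIs as of the vaex/ipyvolume versions current at the time; treat this as indicative rather than definitive):

import vaex
import ipyvolume as ipv

df = vaex.example()                       # bundled simulated-galaxy sample dataset
# Binned statistic on a regular grid: mean vz on a 64x64 x-y grid.
grid = df.mean(df.vz, binby=[df.x, df.y], shape=64)
# 3D scatter of the first 100k rows, rendered as a notebook widget.
x, y, z = (df.evaluate(c, i1=0, i2=100_000) for c in ("x", "y", "z"))
ipv.quickscatter(x, y, z, size=0.2)
ipv.show()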
Similar to Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)
The Impact of Hardware and Software Version Changes on Apache Kafka Performan... - Paul Brebner
Apache Kafka's performance and scalability can be impacted by both hardware and software dimensions. In this presentation, we explore two recent experiences from running a managed Kafka service.
The first example recounts our experiences with running Kafka on AWS's Graviton2 (ARM) instances. We performed extensive benchmarking but didn't initially see the expected performance benefits. We developed multiple hypotheses to explain the unrealized performance improvement, but we could not experimentally determine the cause. We then profiled the Kafka application, and after identifying and confirming a likely cause, we found a workaround and obtained the hoped-for improved price/performance.
The second example explores the ability of Kafka to scale with increasing partitions. We revisit our previous benchmarking experiments with the newest version of Kafka (3.X), which has the option to replace Zookeeper with the new KRaft protocol. We test the theory that Kafka with KRaft can 'scale to millions of partitions' and also provide valuable experimental feedback on how close KRaft is to being production-ready.
Presentation for the ApacheCon NA Performance Engineering Track, October 6, 2022, Sheraton Hotel, New Orleans.
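A hedged sketch of the kind of partition-scaling experiment described above: create benchmark topics with increasing partition counts via the admin API, then drive producers against each (kafka-python assumed; broker address and counts are placeholders):

from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics(new_topics=[
    NewTopic(name=f"bench-{n}", num_partitions=n, replication_factor=3)
    for n in (10, 100, 1_000, 10_000)
])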
Spinning your Drones with Cadence Workflows and Apache Kafka - Paul Brebner
The rapid rise in Big Data use cases over the last decade has been accelerated by popular massively scalable open-source technologies such as Apache Cassandra® for storage, Apache Kafka® for streaming, and OpenSearch® for search. Now there’s a new member of the peloton, Cadence, for orchestration - code-based scalable fault-tolerant workflow orchestration. To illustrate the most important Cadence concepts (and more) we’ll build a realistic drone delivery service demonstration application. We’ll also explore what happens when orchestration meets choreography, and use the drone application to illustrate different ways to integrate Cadence with Apache Kafka, including reusing Kafka microservices. But how scalable is Cadence in practice? We’ll fill the sky with drones - how many drones can we get flying at once?
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Paul Brebner
Modern event-based/streaming distributed systems embrace the idea that change is inevitable and actually desirable! Without being change-aware, systems are inflexible, can’t evolve or react, and are simply incapable of keeping up with real-time real-world data. But how can we speed up an “Elephant” (PostgreSQL) to be as fast as a “Cheetah” (Kafka)? In this talk, we'll introduce the Debezium PostgreSQL Connector, and explain how to deploy, configure and run it on a Kafka Connect cluster, explore the semantics and format of the change data events (including Schemas and Table/Topic mapping), and test the performance. Finally, we'll show how to stream the change data events into an example downstream system, Elasticsearch, using an open source sink connector.
Presentation for PostgresConf.CN and PGConf.Asia 2021 https://www.highgo.ca/2022/01/19/2021-pg-asia-conference-delivered-another-successful-online-conference-again/
Scaling Open Source Big Data Cloud Applications is Easy/HardPaul Brebner
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
Invited keynote for 5th Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2022) https://hotcloudperf.spec.org/ at ICPE 2022 https://icpe2022.spec.org/
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/HardPaul Brebner
DeveloperWeek Management 2022 Conference Presentation https://www.developerweek.com/global/conference/management/schedule/
In this Cartoon Style Visual Introduction to Apache Kafka we're going to build a "Postal Service" to deliver party invitations to two groups, Nerds and Pugsters – find out who goes to the party. Along the way we'll learn about Kafka Producers, Consumers, Groups, Topics, Partitions, Keys, Records, and Delivery Semantics (guaranteed delivery, and who gets what messages). We'll also have a quick look at Streams (mail sorting) and Connectors (how mail gets delivered between post offices).
Presentation for Open Source 101 2022: https://opensource101.com/sessions/a-visual-introduction-to-apache-kafka/
Video: https://youtu.be/NUnsHFn52sE
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Paul Brebner
With the rapid onset of the global Covid-19 pandemic at the start of this year, the USA Centers for Disease Control and Prevention (CDC) had to quickly implement a new Covid-19-specific pipeline to collect testing data from all of the USA's states and territories, and carry out other critical steps including integration, cleaning, checking, enrichment, analysis, and enforcing data governance and privacy. The pipeline then produces multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka. In this presentation we'll build a similar (but simpler) pipeline for ingesting, integrating, indexing, searching/analysing and visualising some publicly available tidal data. We'll briefly introduce each technology and component, and walk through the steps of using Apache Kafka, Kafka Connect, Elasticsearch and Kibana to build the pipeline and visualise the results.
Grid Middleware – Principles, Practice and PotentialPaul Brebner
A presentation I gave at UCL in 2004 while managing the UK OGSA Evaluation Project, on leave from CSIRO at the UCL Computer Science department, working with Wolfgang Emmerich.
Paul Brebner, University College London, Computer Science Department Seminar: "Grid Middleware - Principles, Practice, and Potential", 1 November 2004.
The project page was still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Grid middleware is easy to install, configure, secure, debug and manage acros...Paul Brebner
A presentation made in 2004 while I was managing the UK OGSA Evaluation Project, on leave from CSIRO at the UCL Computer Science department, working with Wolfgang Emmerich: in which we "believe 6 impossible things before breakfast". This project encountered, and partially solved, many of the problems that Cloud computing finally solved.
Paul Brebner, Oxford University Computing Laboratory invited talk: "Grid middleware is easy to install, configure, debug and manage - across multiple sites (One can't believe impossible things)", 15 October 2004.
The project web site is still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Paul Brebner
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps automate application deployment and the scaling of application clusters.
In this presentation, Paul will reveal how he architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing massive amounts of events. Anomaly detection is a method used to detect unusual events in an event stream.
It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. Paul will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from his experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day.
Melbourne Big Data Meetup, March 5 2020
https://www.eventbrite.com/e/melbourne-big-data-meetup-realtime-anomaly-detection-with-cassandra-kafka-tickets-93028445585
0b101000 years of computing: a personal timeline - decade "0", the 1980'sPaul Brebner
With the arrival of the 2020's I realised I've now been involved in Computing for 4 decades. So I probably know more about the past of Computing than I will about the future! Here's a personal timeline of hopefully interesting things from the 1980's in Computing (at Waikato University, NZ, and UNSW in Australia).
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...Paul Brebner
Join with me in a journey of exploration upriver with "Kongo", a scalable streaming IoT logistics demonstration application using Apache Kafka, the popular open source distributed streaming platform. Along the way you'll discover: an example logistics IoT problem domain (involving the rapid movement of thousands of goods by trucks between warehouses, with real-time checking of complex business and safety rules from sensor data); an overview of the Apache Kafka architecture and components; lessons learned from making critical Kafka application design decisions; an example of Kafka Streams for checking truck load limits; and finish the journey by overcoming final performance challenges and shooting the rapids to scale Kongo on a production Kafka cluster.
https://aceu19.apachecon.com/session/kongo-building-scalable-streaming-iot-application-using-apache-kafka
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...Paul Brebner
As distributed applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical. Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works. In this presentation we’ll explore two complementary Open Source technologies: Prometheus for monitoring application metrics; and OpenTracing and Jaeger for distributed tracing. We’ll discover how they improve the observability of a massively scalable Anomaly Detection system - an application which is built around Apache Cassandra and Apache Kafka for the data layers, and dynamically deployed and scaled on Kubernetes, a container orchestration technology. We will give an overview of Prometheus and OpenTracing/Jaeger, explain how the application is instrumented, and describe how Prometheus and OpenTracing are deployed and configured in a production environment running Kubernetes, to dynamically monitor the application at scale. We conclude by exploring the benefits of monitoring and tracing technologies for understanding, debugging and tuning complex dynamic distributed systems built on Kafka, Cassandra and Kubernetes, and introduce a new use case to enable Cassandra Elastic Autoscaling, by combining Prometheus alerts, Instaclustr’s Provisioning API for Dynamic Resizing, and the new Prometheus monitoring API.
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
As distributed cloud applications grow more complex, dynamic, and massively scalable, "observability" becomes more critical. Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works. We'll explore two complementary Open Source technologies: Prometheus for monitoring application metrics, and OpenTracing and Jaeger for distributed tracing. We'll discover how they improve the observability of an Anomaly Detection application, deployed on AWS Kubernetes, and using Instaclustr managed Apache Cassandra and Kafka clusters.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
High School level (years 9-10 in Australia, ages 14-16) introduction to programming course, based on the language Processing, includes class material, exercises, examples, and tests. Course ran for 2 terms in 2014. Feel free to use as is, borrow ideas, etc. 13th and final class, 3d graphics.
46. Issues?
■ Cardinality for the partition key
● should be > 100,000
● i.e. use a geohash of >= 4 characters
■ Unbounded partitions are bad
● may need a composite partition key in production (see the sketch after this list)
● e.g. an extra time bucket (hour, day, etc.)
■ Space vs. time
● could have different sized time buckets for different sized spaces
● e.g. bigger areas with more frequent events may need shorter time buckets to limit partition size
● this may depend on the space-time scales of the underlying systems/processes
● e.g. the spatial and temporal scales of oceanographic processes (figure on the original slide)
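To make the composite partition key idea concrete, here is a minimal CQL sketch, assuming a hypothetical table (the names events_by_geohash_day, geohash4 and day are illustrative, not from the talk), that buckets each geohash partition by day so partitions stay bounded:
-- Hypothetical table: the partition key is a (geohash, day) pair, so a busy
-- geohash cell is split into one bounded partition per day.
CREATE TABLE events_by_geohash_day (
    geohash4 text,       -- 4+ character geohash (cardinality > 100,000)
    day date,            -- time bucket limiting partition size
    time timestamp,
    value double,
    PRIMARY KEY ((geohash4, day), time)
) WITH CLUSTERING ORDER BY (time DESC);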
49. Cassandra Table and Lucene Indexes
• Geopoint example
• Under the hood, indexing is done using a tree structure with geohashes (configurable precision)
CREATE TABLE latlong_lucene (
    geohash1 text,
    value double,
    time timestamp,
    latitude double,
    longitude double,
    PRIMARY KEY (geohash1, time)
) WITH CLUSTERING ORDER BY (time DESC);
CREATE CUSTOM INDEX latlong_index ON latlong_lucene ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
        fields: {
            geohash1: {type: "string"},
            value: {type: "double"},
            time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"},
            place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
        }
    }'
};
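As a usage sketch (not from the slides; the syntax follows the Stratio plugin's documented search JSON, and the coordinates and distance are illustrative), a proximity query against this table could filter rows by distance from a point using a geo_distance search:
-- Find all rows within 5km of a point; expr() hands the JSON search to the
-- Lucene index, which filters by distance rather than by exact geohash match.
SELECT * FROM latlong_lucene
WHERE expr(latlong_index, '{
    filter: {
        type: "geo_distance",
        field: "place",
        latitude: -35.28,
        longitude: 149.13,
        max_distance: "5km"
    }
}');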
60. Proximity rules
> 50m from people and property
> 150m from congested areas
> 1000m from airports
> 5000m from exclusion zones
These just happen to correspond to different length 3D geohashes.
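As a rough illustration of that correspondence (using standard 2D geohash cell widths; the talk's 3D geohashes add an altitude dimension): a length-8 cell is roughly 38m across, length 7 roughly 153m, length 6 roughly 1.2km, and length 5 roughly 4.9km, which line up approximately with the 50m, 150m, 1000m and 5000m rules. A minimal sketch, assuming a hypothetical denormalized table of hazards keyed by geohash cell, of how a rule then becomes a single-partition lookup:
-- Hypothetical table: each hazard is written at several geohash precisions,
-- one row per cell, so each proximity rule is a single-partition read.
CREATE TABLE hazards_by_geohash (
    geohash text,         -- cell at the precision matching the rule
    hazard_type text,     -- e.g. 'airport', 'exclusion_zone'
    hazard_id text,
    PRIMARY KEY (geohash, hazard_type, hazard_id)
);
-- "> 1000m from airports": truncate the drone's geohash to 6 characters and
-- check whether any airport shares that cell (the geohash value is illustrative).
SELECT hazard_id FROM hazards_by_geohash
WHERE geohash = 'r3gx2f' AND hazard_type = 'airport';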