Palo Alto Networks processes terabytes of events each day. One of their many challenges is to understand which of those events, arriving from many different sensors, actually describe the same story from different viewpoints.
Traditionally, such a system would need some sort of database to store the events and a message queue to notify consumers about new events arriving into the system. Palo Alto Networks wanted to avoid the cost and operational overhead of deploying yet another stateful component, so they designed a solution that uses ScyllaDB both as the database for the events *and* as a message queue that lets their consumers consume the correct events each time. In this talk, Daniel Belenky, Principal Software Engineer at Palo Alto Networks, walks you through their process.
To watch all of the recordings from Scylla Summit 2022, visit our website: https://www.scylladb.com/summit.
2. Daniel Belenky
Principal Software Engineer
■ Kubernetes & Virtualization
■ Distributed applications
■ Big data and stream processing
3. Agenda
■ A brief overview of the product and my team - 3 min
■ The challenge we were facing - 5 min
■ The solutions we considered - 5 min
■ How we solved the problem with Scylla - 12 min
5. Our product
A security product that performs analytics, detection, and response.
■ Millions of records per second
■ Multiple data sources and schemas
■ Has to provide insights in near real time (security...)
6. About my team
We are responsible for the infrastructure that:
■ Handles stream processing of data coming from multiple sources.
■ Cleans, normalizes, and processes the data, preparing it for further analysis.
■ Builds stories - multiple data sources emit different events and provide different views on the same network session, and we want to fuse the events that tell the same story from different perspectives.
We mostly develop in Go and Python, and we deploy on K8s.
10-14. Problem description (part 1)
Various sensors see a network event at different times, each reporting it in its own form:
10:00:01 - {event: dns-query, id: 6c92e}, {event: dns-query, id: 873a1}, ...
10:00:02 - {kind: login, id: 13}, {kind: signup, id: 17}, ...
10:08:05 - {type: GET, id: CHJW}, {type: POST, id: KQJD}, ...
15-18. Problem description (part 2)
Data from different sensors arrives in different forms and formats, at different times. It is turned into normalized data in a canonical form, ready for processing.
The result: millions of normalized but unassociated entries per second, from many different sources.
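To make the canonical form concrete, here is a minimal sketch of what a normalized event record could look like; the field names are illustrative assumptions, not the actual schema.

```python
# A minimal sketch of a "normalized, canonical" event record.
# Field names are illustrative assumptions, not the actual schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class NormalizedEvent:
    event_id: str                # id assigned by the sensor, e.g. "6c92e"
    sensor: str                  # which sensor produced the event (dns, auth, http, ...)
    kind: str                    # normalized event kind, e.g. "dns-query", "login", "GET"
    event_time: datetime         # when the sensor observed the event
    ingestion_time: datetime     # when our pipeline received it
    attributes: dict = field(default_factory=dict)  # session attributes used for correlation
```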
19. The question is:
How do we associate discrete entries that describe the same network session?
21-25. Why is it a challenge?
■ Clock skew across different sensors
Clocks across sensors might not be synchronized to the second
■ We have thousands of deployments to manage
Deployments also vary in size (from bytes per second to GBs per second)
■ Each sensor's viewpoint on the session
Different sensors have different views of the same session
■ Zero tolerance for data loss
Data is pushed to us, and if we lose it, it's lost for good
■ Continuous out-of-order stream
Sensors send data at different times, and event time != ingestion time != processing time
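A tiny illustration of the last point, with made-up timestamps: the three clocks attached to a single event rarely agree, so arrival order alone cannot drive correlation.

```python
# Made-up timestamps for one event, to illustrate event time != ingestion time != processing time.
from datetime import datetime, timedelta

event_time = datetime(2022, 2, 10, 10, 0, 1)             # when the sensor saw the DNS query
ingestion_time = event_time + timedelta(minutes=8)        # the sensor buffered before sending
processing_time = ingestion_time + timedelta(seconds=30)  # the pipeline picked it up later still

assert event_time != ingestion_time != processing_time
```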
39-45. So… what do we need here?
■ Receive a stream of events
■ Wait some amount of time to allow related events to arrive
■ Decide which events are related to each other
■ Publish the results
■ Single-tenant deployment - we need isolation
■ Support rates from several KB per hour up to several GBs per second, at a reasonable cost
A minimal skeleton of the first four steps is sketched below.
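As referenced above, a minimal, hedged skeleton of the first four requirements (receive, wait, correlate, publish); the correlation key, the wait window, and all names are illustrative placeholders rather than the production logic.

```python
# Skeleton of the pipeline: buffer events, wait for related ones, then publish stories.
from collections import defaultdict

WAIT_SECONDS = 60                 # how long to let related events accumulate (assumed)
buffer = defaultdict(list)        # correlation key -> events buffered so far

def correlation_key(event: dict) -> tuple:
    # Placeholder: in reality the key would be derived from normalized session attributes.
    return (event["attributes"].get("src_ip"), event["attributes"].get("dst_ip"))

def receive(event: dict) -> None:
    # Steps 1-2: receive an event and buffer it while related events arrive.
    buffer[correlation_key(event)].append(event)

def flush(now: float, publish) -> None:
    # Steps 3-4: once a group has waited long enough, treat it as a story and publish it.
    for key, events in list(buffer.items()):
        oldest = min(e["ingestion_time"] for e in events)   # epoch seconds, for illustration
        if now - oldest >= WAIT_SECONDS:
            publish({"story_key": key, "events": events})
            del buffer[key]
```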
49-50. Proposed solution #1
Normalized data in a canonical form, ready for processing, is stored in a relational DB. Periodic tasks compute stories and publish them for other components to consume.
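A rough sketch of this approach, with sqlite3 standing in for the relational database; the schema and the grouping query are illustrative assumptions, not the actual design.

```python
# Relational-DB approach: store records, then periodically query for correlated groups.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        session_ip TEXT,      -- attribute the periodic task groups on (illustrative)
        event_time INTEGER,   -- unix seconds
        payload    TEXT
    )
""")

def periodic_story_task(window_start: int, window_end: int):
    # Group all events in the time window that share a session attribute into one story.
    return conn.execute(
        """
        SELECT session_ip, GROUP_CONCAT(event_id)
        FROM events
        WHERE event_time BETWEEN ? AND ?
        GROUP BY session_ip
        """,
        (window_start, window_end),
    ).fetchall()
```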
52-56. Pros & Cons
Pros
■ Relatively simple implementation: we have to orchestrate the data and the queries, but not write any complex logic ourselves
Cons
■ Operational overhead - we have to deploy, maintain and operate another database
■ Limited performance - relational database queries are slower when compared to queries on a NoSQL database (if the data model allows utilizing a NoSQL database)
■ Operational cost - complex queries require more CPU, hence are more expensive
61-64. Proposed solution #2
Normalized data in a canonical form, ready for processing, is stored in ScyllaDB. The records can't be sent over Kafka because they are too big, so only the primary keys needed to fetch them from Scylla are published to a Kafka topic. Multiple consumers read the keys from the topic, fetch the records from Scylla, and publish stories for other components to consume.
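A hedged sketch of this flow, assuming the kafka-python and cassandra-driver client libraries; the topic, keyspace, and table names are invented for illustration.

```python
# Option #2: full records live in Scylla, only their keys travel over Kafka.
from kafka import KafkaProducer, KafkaConsumer
from cassandra.cluster import Cluster

scylla = Cluster(["scylla-host"]).connect("events_ks")
producer = KafkaProducer(bootstrap_servers="kafka:9092")

def store_and_publish(event_id: str, payload: bytes) -> None:
    # The full record goes to Scylla; only the small primary key is sent over Kafka.
    scylla.execute(
        "INSERT INTO events (event_id, payload) VALUES (%s, %s)",
        (event_id, payload),
    )
    producer.send("event-keys", event_id.encode())

def consume_and_fetch():
    # Consumers read keys from the topic and fetch the full records from Scylla.
    consumer = KafkaConsumer("event-keys",
                             bootstrap_servers="kafka:9092",
                             group_id="story-builder")
    for msg in consumer:
        key = msg.value.decode()
        row = scylla.execute(
            "SELECT payload FROM events WHERE event_id = %s", (key,)
        ).one()
        yield key, row.payload
```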
69-71. Pros & Cons
Pros
■ High throughput
■ One less database to maintain
Cons
■ We have to write our own logic to find correlations and build stories
■ Complex architecture and deployment
■ We have to maintain thousands of Kafka deployments
73. Proposed solution #3
Normalized data in a canonical form, ready for processing, is stored in ScyllaDB, and the keys needed to fetch the records are published to a queue. Consumers fetch the records from Scylla using the keys received from the queue, process and compute stories, and publish the stories for other components to consume.
76-79. Pros & Cons
Pros
■ High throughput when compared to the relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Much slower performance when compared to Kafka
80. The solution that solved our use case
Using ScyllaDB - no message queue
81-87. Accepted solution - high level
Normalized data in a canonical form, ready for processing, is stored in ScyllaDB.
■ The data is sharded into hundreds of shards
■ The partition key is a tuple of (shard-number, insert_time); the clustering key is (event id)
■ Multiple consumers fetch records from Scylla using their assigned shard numbers and the time they want to consume; the step resolution is configurable
■ Consumers process and compute stories, then publish them for other components to consume
A sketch of the schema and the producer/consumer flow follows below.
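As referenced above, a minimal sketch of the accepted design, assuming the Python cassandra-driver; the shard count, bucket resolution, and table/column names are illustrative assumptions, while the partition key (shard, insert_time) and clustering key (event id) follow the scheme described on the slide.

```python
# ScyllaDB as both the event store and the queue: producers write into (shard, time bucket)
# partitions, consumers poll the buckets they own - no separate message queue.
import zlib
from datetime import datetime, timezone
from cassandra.cluster import Cluster

NUM_SHARDS = 240        # "hundreds of shards" (assumed value)
BUCKET_SECONDS = 10     # the configurable "step resolution" (assumed value)

session = Cluster(["scylla-host"]).connect("events_ks")
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        shard       int,
        insert_time timestamp,   -- time bucket, rounded down to BUCKET_SECONDS
        event_id    text,
        payload     blob,
        PRIMARY KEY ((shard, insert_time), event_id)
    )
""")

def bucket(ts: datetime) -> datetime:
    secs = int(ts.timestamp()) // BUCKET_SECONDS * BUCKET_SECONDS
    return datetime.fromtimestamp(secs, tz=timezone.utc)

def produce(event_id: str, payload: bytes) -> None:
    # Producers spread events across shards and write them into the current time bucket.
    shard = zlib.crc32(event_id.encode()) % NUM_SHARDS
    session.execute(
        "INSERT INTO events (shard, insert_time, event_id, payload) VALUES (%s, %s, %s, %s)",
        (shard, bucket(datetime.now(timezone.utc)), event_id, payload),
    )

def consume(shard: int, ts: datetime) -> list:
    # Each consumer owns some shards and walks time buckets that are old enough to be
    # considered complete, reading its events directly from Scylla.
    rows = session.execute(
        "SELECT event_id, payload FROM events WHERE shard = %s AND insert_time = %s",
        (shard, bucket(ts)),
    )
    return [(row.event_id, row.payload) for row in rows]
```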
89-95. Pros & Cons
Pros
■ Since we already have Scylla deployed for other parts, we don't have to add any new creatures to the system
■ High throughput when compared to the relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Our code became more complex
■ Producers and consumers have to have synchronized clocks (up to a certain resolution)