Stream Processing
with Scylla
Daniel Belenky
Principal Software Engineer
Daniel Belenky
■ Kubernetes & Virtualization
■ Distributed applications
■ Big data and stream processing
Principal Software Engineer
Agenda
■ A brief about the product and my team - 3 min
■ The challenge we were facing - 5 min
■ What solutions were considered - 5 min
■ How we solved the problem with Scylla - 12 min
A brief about the
product and the team
Our product
A security product that performs analytics, detection and response.
■ Millions of records per second
■ Multiple data sources and schemas
■ Has to provide insights in near real time (it’s a security product...)
About my team
We are responsible for the infrastructure that:
■ Processes streams of data coming from multiple sources
■ Cleans, normalizes and processes the data - prepares it for further analysis
■ Builds stories - multiple data sources emit different events and provide different views on the same network session; we want to fuse the events that tell the same story from different perspectives
■ Mostly developing in Go and Python
■ Deployment is on K8s
The challenge
Problem description (part 1)
Various sensors see a network event at different times:
10:00:01 → {event: dns-query, id: 6c92e}, {event: dns-query, id: 873a1}, ...
10:00:02 → {kind: login, id: 13}, {kind: signup, id: 17}, ...
10:08:05 → {type: GET, id: CHJW}, {type: POST, id: KQJD}, ...
Problem description (part 2)
Data from different sensors comes in different forms and formats, at different times.
It is normalized into a canonical form, ready for processing.
What remains: millions of normalized but unassociated entries per second, from many different sources.
The question is:
How do we associate discrete entries that describe the same network session?
Why is it a challenge?
■ Clock skew across different sensors
Clocks across sensors might not be synchronized to the second
■ We have thousands of deployments to manage
Deployments also vary in size (from Bps to GBps)
■ The sensor’s viewpoint on the session
Different sensors have different views on the same session
■ Zero tolerance for data loss
Data is pushed to us and if we lose it, it’s lost for good
■ Continuous out-of-order stream
Sensors send data at different times, and event time != ingestion time != processing time
Example
(Diagram from the original slides: events T0-T5 shown on three timelines - event time, ingestion time and processing time - illustrating that the three do not line up and that events arrive out of order.)
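To make the three timelines concrete, here is an illustrative (made-up) record annotated with the three timestamps; the field names and values are assumptions for illustration, not taken from the real pipeline.

```python
# Illustrative values only: the three timestamps attached to one event rarely agree.
event = {
    "event_time":      "10:00:01",  # when the sensor observed the network event
    "ingestion_time":  "10:00:02",  # when the event reached our pipeline
    "processing_time": "10:08:05",  # when we actually got to correlate it
}
```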
So… what do we need here?
■ Receive a stream of events
■ Wait some amount of time to allow related events to arrive
■ Decide which events are related to each other
■ Publish the results
■ Single-tenant deployment - we need isolation
■ Support rates from several KB per hour up to several GBs per second, at a reasonable cost
Option #1: Using a relational database
Proposed solution #1
Normalized data in a canonical form ready for processing → Store the records in a relational DB → Periodic tasks to compute stories → Publish stories for other components to consume
Pros
■ Relatively simple implementation: we have to orchestrate the data and the queries, but not write any complex logic ourselves
Cons
■ Operational overhead - we have to deploy, maintain and operate another database
■ Limited performance - relational database queries are slower than queries on a NoSQL database (if the data model allows using a NoSQL database)
■ Operational cost - complex queries require more CPU and hence are more expensive
Option #2: Using Scylla + Kafka
Proposed solution #2
Normalized data in a canonical form ready for processing → Store the records in ScyllaDB → Publish keys to fetch the records → Multiple consumers read the keys from a Kafka topic → Fetch the records from Scylla → Publish stories for other components to consume
Records can’t be sent over Kafka because they are too big, so we send only the primary key needed to fetch them from Scylla.
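For illustration only (this is not the option that was ultimately chosen), a minimal sketch of the key-publishing step, assuming the kafka-python client; the topic name and key fields are hypothetical, chosen to mirror the Scylla queries shown later in the deck.

```python
import json
from kafka import KafkaProducer

# The full records are too big to send over Kafka, so only the primary key
# needed to fetch the row back from Scylla is published to the topic.
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_key(shard: int, insert_time: str, event_id: str) -> None:
    # Hypothetical topic name; consumers read these keys and then
    # SELECT the matching payloads from Scylla.
    producer.send("normalized-event-keys",
                  {"shard": shard, "insert_time": insert_time, "event_id": event_id})
```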
Pros
■ High throughput
■ One less database to maintain
Cons
■ We have to write our own logic to find correlations and build stories
■ Complex architecture and deployment
■ We have to maintain thousands of Kafka deployments
Option #3: Using Scylla + a cloud-managed message queue
Proposed solution #3
Normalized data in a canonical form ready for processing → Store the records in ScyllaDB → Publish keys to fetch the records to a managed queue → Fetch the records from Scylla using keys received from the queue → Process and compute stories → Publish stories for other components to consume
Pros
■ High throughput when compared to the relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Much slower performance when compared to Kafka
The solution that solved our use case
Using ScyllaDB - no message queue
Accepted solution - high level
Normalized data in a canonical form ready for processing → Store the records in ScyllaDB → Process and compute stories → Publish stories for other components to consume
The data is sharded into hundreds of shards.
The partition key is a tuple of (shard_number, insert_time); the clustering key is (event_id).
Multiple consumers fetch records from Scylla using their assigned shard numbers and the time slot they want to consume. The step resolution is configurable.
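A minimal sketch of what this data model could look like, created here via the Python cassandra driver (which also works with ScyllaDB). The keyspace, table and column names are assumptions inferred from the queries shown in the detailed walkthrough below; the real schema (replication, TTLs, payload encoding) is not described in the talk.

```python
from cassandra.cluster import Cluster

session = Cluster(["scylla-node"]).connect()

# Hypothetical keyspace; replication settings are illustrative only.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS streaming
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Events: partition key (shard, insert_time), clustering key event id, so one
# partition holds everything written to a given shard during one time step.
session.execute("""
    CREATE TABLE IF NOT EXISTS streaming.data (
        shard       int,
        insert_time timestamp,
        id          uuid,
        payload     blob,
        PRIMARY KEY ((shard, insert_time), id)
    )
""")

# Per-shard read offsets: the last time step each consumer has fully processed.
session.execute("""
    CREATE TABLE IF NOT EXISTS streaming.read_offsets (
        shard       int PRIMARY KEY,
        offset_time timestamp
    )
""")
```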
Pros
■ Since we already have Scylla deployed for other parts, we don’t have to add any new components to the system
■ High throughput when compared to the relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Our code became more complex
■ Producers and consumers have to have synchronized clocks (up to a certain resolution)
Detailed overview
Accepted solution - detailed
Normalized events keep arriving: {event_id: 1, payload: …}, {event_id: 2, payload: …}, {event_id: 3, payload: …}, {event_id: 4, payload: …}, {event_id: 5, payload: …}, ...
Producers write each event into one of the shards (Shard 1, Shard 2, Shard 3, …):
INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...)
Each consumer reads the last offset it processed for its assigned shard:
SELECT offset_time FROM read_offsets WHERE shard = X
→ e.g. {shard: 1, read_offset: 1000}, {shard: 2, read_offset: 983}, {shard: 3, read_offset: 999}
It then fetches the records for that shard and time slot:
SELECT id, payload FROM data WHERE shard = X and insert_time = Y
→ e.g. one consumer gets {event_id: 1, payload: …} and {event_id: 3, payload: …}, another gets {event_id: 2, payload: …} and {event_id: 4, payload: …}
Once a time slot has been processed, the consumer advances its offset:
INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y)
Later events (e.g. {event_id: 5, payload: …}) land in later time slots and are picked up on subsequent iterations.
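A minimal consumer-loop sketch under the same assumptions (the hypothetical schema above, the Python cassandra driver, a fixed step resolution); the real implementation adds the story-building logic, error handling, and tolerance for clock skew between producers and consumers.

```python
import time
from datetime import datetime, timedelta
from cassandra.cluster import Cluster

STEP = timedelta(seconds=1)    # configurable step resolution
GRACE = timedelta(seconds=30)  # stay behind "now" so late or skewed events still land in a slot

session = Cluster(["scylla-node"]).connect("streaming")

get_offset = session.prepare("SELECT offset_time FROM read_offsets WHERE shard = ?")
get_events = session.prepare("SELECT id, payload FROM data WHERE shard = ? AND insert_time = ?")
set_offset = session.prepare("INSERT INTO read_offsets (shard, offset_time) VALUES (?, ?)")

def consume(shard: int):
    """Walk one shard forward one time slot at a time, yielding its events."""
    row = session.execute(get_offset, (shard,)).one()
    # The driver returns timestamps as naive UTC datetimes, so compare against utcnow().
    # Assumes producers truncate insert_time to the same STEP boundaries.
    offset = row.offset_time if row else datetime.utcnow() - GRACE

    while True:
        if offset > datetime.utcnow() - GRACE:
            # This slot may still be receiving events; wait before reading it.
            time.sleep(STEP.total_seconds())
            continue

        for event in session.execute(get_events, (shard, offset)):
            yield event  # correlation / story building happens downstream

        # Record progress so a restarted consumer resumes from the next slot.
        offset += STEP
        session.execute(set_offset, (shard, offset))
```

A producer would mirror this with the INSERT INTO data statement shown above, truncating insert_time to the same step and choosing a shard (for example by hashing a session attribute or round-robin).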
Our results
Reduced operational cost
Reduced operational complexity
Increased throughput
Thank you!
Stay in touch
Daniel Belenky
dbelenky@paloaltonetworks.com