Stream Processing
with Scylla
Daniel Belenky
Principal Software Engineer
Daniel Belenky
■ Kubernetes & Virtualization
■ Distributed applications
■ Big data and stream processing
Principal Software Engineer
Agenda
■ A brief about the product and my team - 3 min
■ The challenge we were facing - 5 min
■ The solutions we considered - 5 min
■ How we solved the problem with Scylla - 12 min
A brief about the
product and the team
Our product
A security product that performs analytics, detection and response.
■ Millions of records per second
■ Multiple data sources and schemas
■ Has to provide insights in a near real-time timeframe (security...)
About my team
We are responsible for the infrastructure that:
■ Stream-processes data that comes from multiple sources
■ Cleans, normalizes and processes the data - prepares it for further analysis
■ Builds stories - multiple data sources emit different events and provide different
views of the same network session; we want to fuse the events that tell the
same story from different perspectives
■ Mostly developing with Go and Python
■ Deployment is on K8s
The challenge
Problem description (part 1)
Various sensors see a network event at different times:
10:00:01  {event: dns-query, id: 6c92e}, {event: dns-query, id: 873a1}, ...
10:00:02  {kind: login, id: 13}, {kind: signup, id: 17}, ...
10:08:05  {type: GET, id: CHJW}, {type: POST, id: KQJD}, ...
Problem description (part 2)
Data from different sensors arrives in different forms and formats, at different
times. It has to be turned into normalized data in a canonical form, ready for
processing.
The result: millions of normalized but unassociated entries per second, from
many different sources.
The question is:
How do we associate discrete entries that describe the same network session?
Why is it a challenge?
■ Clock skew across different sensors
Clocks across sensors might not be synchronized to the second
■ Thousands of deployments to manage
Deployments also vary in size (from bytes per second to gigabytes per second)
■ A sensor’s viewpoint on the session
Different sensors have different views of the same session
■ Zero tolerance for data loss
Data is pushed to us; if we lose it, it’s lost for good
■ Continuous out-of-order stream
Sensors send data at different times, and
event time != ingestion time != processing time
Example
[Timeline diagram: events T0-T5 plotted against event time, ingestion time and
processing time. The three timelines diverge, and some events (marked X) arrive
and are processed out of order.]
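The gap between the three clocks can be made concrete with a small sketch (hypothetical events and timestamps, standard library only): sorting the same events by event time and by ingestion time produces different orders, which is exactly the out-of-order stream the sensors deliver.

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: str
    event_time: int      # when the sensor observed it
    ingestion_time: int  # when it reached our pipeline
    processing_time: int # when a consumer handled it

# Hypothetical events: the order sensors saw them (event_time) does not
# match the order they arrived (ingestion_time).
events = [
    Event("a", event_time=0, ingestion_time=3, processing_time=4),
    Event("b", event_time=1, ingestion_time=2, processing_time=5),
    Event("c", event_time=2, ingestion_time=5, processing_time=6),
]

by_event = [e.event_id for e in sorted(events, key=lambda e: e.event_time)]
by_ingest = [e.event_id for e in sorted(events, key=lambda e: e.ingestion_time)]
print(by_event)   # order the sensors saw the events
print(by_ingest)  # order we received them
```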
So… what do we need here?
■ Receive a stream of events
■ Wait some amount of time to allow related events to arrive
■ Decide which events are related to each other
■ Publish the results
■ Single-tenant deployment - we need isolation
■ Support rates from several KB per hour up to several GB per second,
at a reasonable cost
Proposed solution #1
Using a relational database
Proposed solution #1
Normalized data in a canonical form, ready for processing
→ Store the records in a relational DB
→ Periodic tasks compute stories
→ Publish stories for other components to consume
Pros
■ Relatively simple implementation:
We have to orchestrate the data and the queries,
but we don’t have to write any complex logic ourselves
Cons
■ Operational overhead - we have to deploy,
maintain and operate another database
■ Limited performance - relational database queries
are slower than queries on a NoSQL database
(where the data model allows using one)
■ Operational cost - complex queries require
more CPU and are therefore more expensive
Proposed solution #2
Using Scylla + Kafka
Proposed solution #2
Normalized data in a canonical form, ready for processing
→ Store the records in ScyllaDB
→ Publish keys to fetch the records
(Records can’t be sent over Kafka because they are too big,
so we send only the primary key needed to fetch them from Scylla)
→ Multiple consumers read the keys from a Kafka topic
→ Fetch the records from Scylla
→ Publish stories for other components to consume
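The "publish only the key" idea (sometimes called the claim-check pattern) can be sketched with in-memory stand-ins, where the hypothetical `store` plays ScyllaDB and `topic` plays the Kafka topic:

```python
from collections import deque

store = {}       # primary key -> full (large) record, stands in for Scylla
topic = deque()  # carries only the keys, stands in for the Kafka topic

def produce(key: str, record: dict) -> None:
    store[key] = record   # store the full record in the database
    topic.append(key)     # publish just the key, since records are too big

def consume() -> dict:
    key = topic.popleft() # read a key from the topic
    return store[key]     # fetch the full record from the database

produce("6c92e", {"event": "dns-query", "payload": "x" * 1000})
produce("873a1", {"event": "dns-query", "payload": "y" * 1000})
record = consume()        # gets the first full record back via its key
```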
Pros
■ High throughput
■ One less database to maintain
Cons
■ We have to write our own logic to
find correlations and build stories
■ Complex architecture and deployment
■ We have to maintain thousands
of Kafka deployments
Proposed solution #3
Using Scylla + a cloud-managed message queue
Proposed solution #3
Normalized data in a canonical form, ready for processing
→ Store the records in ScyllaDB
→ Publish the keys to fetch the records to a cloud-managed queue
→ Fetch the records from Scylla using keys received from the queue
→ Process and compute stories
→ Publish stories for other components to consume
Pros
■ High throughput when compared to the
relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Much slower performance when
compared to Kafka
The solution that solved our use case
Using ScyllaDB - no message queue
Accepted solution - high level
Normalized data in a canonical form, ready for processing
→ Store the records in ScyllaDB
The data is sharded into hundreds of shards
Partition key is the tuple (shard_number, insert_time)
Clustering key is (event_id)
→ Multiple consumers fetch records from Scylla using their
assigned shard numbers and the time window they want to consume
The step resolution is configurable
→ Process and compute stories
→ Publish stories for other components to consume
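A minimal sketch of the write path under this scheme, using an in-memory stand-in for the `data` table (`NUM_SHARDS`, `BUCKET_SECONDS` and the shard-assignment rule are all hypothetical; the real deployment uses hundreds of shards):

```python
from collections import defaultdict

NUM_SHARDS = 4       # hypothetical; real deployments use hundreds
BUCKET_SECONDS = 1   # the configurable "step resolution"

# Stand-in for the Scylla `data` table:
# partition key (shard, insert_time bucket) -> list of (event_id, payload)
table = defaultdict(list)

def bucket(ts: float) -> int:
    """Round a timestamp down to the start of its time bucket."""
    return int(ts // BUCKET_SECONDS) * BUCKET_SECONDS

def insert(event_id: int, payload: str, now: float) -> None:
    shard = hash(event_id) % NUM_SHARDS  # spread events across shards
    table[(shard, bucket(now))].append((event_id, payload))

# Eight events inserted at the same moment land in four partitions,
# one per shard, all in the same time bucket.
for i in range(8):
    insert(i, f"payload-{i}", now=100.0)
```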
Pros
■ Since we already have Scylla deployed for
other parts, we don’t have to add any new
components to the system
■ High throughput when compared to the
relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Our code became more complex
■ Producers and consumers have to have
synchronized clocks
(up to a certain resolution)
Detailed overview
Accepted solution - detailed
Incoming events ({event_id: 1, payload: …} through {event_id: 5, payload: …})
are distributed across shards (Shard 1, Shard 2, Shard 3).
Producers write each record into its shard’s current time bucket:
INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...)
Each consumer reads the current offset for its assigned shard:
SELECT offset_time FROM read_offsets WHERE shard = X
e.g. {shard: 1, read_offset: 1000}, {shard: 2, read_offset: 983}, {shard: 3, read_offset: 999}
It then fetches the records for that time bucket:
SELECT id, payload FROM data WHERE shard = X AND insert_time = Y
And finally commits the offset it has consumed:
INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y)
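The consumer side of this flow can be sketched with in-memory stand-ins for the `data` and `read_offsets` tables (hypothetical values; a real consumer would issue the CQL queries from the slides instead of dictionary lookups):

```python
# Stand-in for the `data` table: (shard, insert_time) -> [(event_id, payload)]
data = {
    (1, 1000): [(1, "p1"), (3, "p3")],
    (1, 1001): [(5, "p5")],
}
# Stand-in for the `read_offsets` table: shard -> next insert_time to consume
read_offsets = {1: 1000}

def consume_bucket(shard: int):
    """One iteration of a consumer assigned to `shard`."""
    offset = read_offsets[shard]          # SELECT offset_time FROM read_offsets ...
    rows = data.get((shard, offset), [])  # SELECT id, payload FROM data WHERE ...
    read_offsets[shard] = offset + 1      # INSERT INTO read_offsets ... (commit)
    return rows

first = consume_bucket(1)   # events from bucket 1000
second = consume_bucket(1)  # events from bucket 1001
```

Advancing the offset only after the bucket is fetched is what gives the "zero tolerance for data loss" property: if a consumer crashes mid-bucket, it simply re-reads the same bucket on restart.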
Our results
Reduced operational cost
Reduced operational complexity
Increased throughput
Thank you!
Stay in touch
Daniel Belenky
dbelenky@paloaltonetworks.com
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles Judith
NETWAYS
 
Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB Cluster
MongoDB
 
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real WorldWSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World
WSO2
 
Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"
Fwdays
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
PyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive applicationPyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive application
Hua Chu
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
SoftServe
 

Similar to Scylla Summit 2022: Stream Processing with ScyllaDB (20)

From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
TenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience SharingTenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience Sharing
 
Solving the Database Problem
Solving the Database ProblemSolving the Database Problem
Solving the Database Problem
 
OVHcloud – Enterprise Cloud Databases
OVHcloud – Enterprise Cloud DatabasesOVHcloud – Enterprise Cloud Databases
OVHcloud – Enterprise Cloud Databases
 
WW Historian 10
WW Historian 10WW Historian 10
WW Historian 10
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles Judith
 
Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB Cluster
 
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real WorldWSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World
 
Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
PyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive applicationPyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive application
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 

More from ScyllaDB

Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
ScyllaDB
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
ScyllaDB
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
ScyllaDB
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
ScyllaDB
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
ScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
ScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
ScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
ScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
ScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
ScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
ScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
ScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
ScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
ScyllaDB
 

More from ScyllaDB (20)

Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 

Recently uploaded

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 

Scylla Summit 2022: Stream Processing with ScyllaDB

  • 1. Stream Processing with Scylla Daniel Belenky Principal Software Engineer
  • 2. Daniel Belenky ■ Kubernetes & Virtualization ■ Distributed applications ■ Big data and stream processing Principal Software Engineer
  • 3. Agenda ■ A brief about the product and my team - 3 min ■ The challenge we were facing - 5 min ■ The solutions we considered - 5 min ■ How we solved the problem with Scylla - 12 min
  • 4. A brief about the product and the team
  • 5. Our product is a security product that performs analytics, detection and response. ■ Millions of records per second ■ Multiple data sources and schemas ■ Has to provide insights in a near real-time timeframe (security...)
  • 6. About my team We are responsible for the infrastructure that: ■ Processes streams of data that come from multiple sources ■ Cleans, normalizes and processes the data - prepares it for further analysis ■ Builds stories - multiple data sources emit different events and provide different views of the same network session ■ We want to fuse those events that tell the same story from different perspectives ■ Mostly developing in Go and Python ■ Deployed on K8s
  • 14. Problem description (part 1) Various sensors see a network event {kind: login, id: 13} {kind: signup, id: 17} ... {type: GET, id: CHJW} {type: POST, id: KQJD} ... {event: dns-query, id: 6c92e} {event: dns-query, id: 873a1} ... 10:00:01 10:08:05 10:00:02
  • 18. Problem description (part 2) Data from different sensors comes in different forms and formats In different times Normalized data in a canonical form ready for processing ? Millions of normalized but unassociated entries per second from many different sources
  • 19. The question is: How to associate discrete entries that describe the same network session
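The association step above can be sketched in plain Python. This is a hypothetical illustration, not the product's code: it assumes each normalized entry carries a `session_key` (e.g., derived from the network 5-tuple) and an event `time`, and it fuses entries that share a key within a time window into one story.

```python
from collections import defaultdict

def build_stories(events, window_seconds=60):
    """Group normalized events into 'stories' by a shared correlation key.

    Hypothetical sketch: events with the same `session_key` whose times
    fall within `window_seconds` of the story's first event are fused
    into the same story; a later burst starts a new story.
    """
    stories = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        key = event["session_key"]
        group = stories[key]
        if group and event["time"] - group[0]["time"] > window_seconds:
            # Too far apart in time: close this story, start a new one.
            yield {"session_key": key, "events": group}
            stories[key] = group = []
        group.append(event)
    for key, group in stories.items():
        if group:
            yield {"session_key": key, "events": group}
```

A DNS query and an HTTP request seen seconds apart on the same session key would land in one story, while the same key reappearing minutes later starts a fresh one.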
  • 25. Why is it a challenge? ■ Clock skew across different sensors Clocks across sensors might not be synchronized to the second ■ We have thousands of deployments to manage Deployments also vary in size (from Bps to GBps) ■ Sensor’s viewpoint on the session Different sensors have different views on the same session ■ Zero tolerance for data loss Data is pushed to us and if we lose it, it’s lost for good ■ Continuous out-of-order stream Sensors send data at different times, and event time != ingestion time != processing time
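The clock-skew and out-of-order points can be illustrated with a toy watermark sketch (an assumption, not the actual pipeline): buffer events into event-time windows and only emit a window once the watermark, taken here as the largest event time seen minus an allowed-lateness budget, has passed the window's end.

```python
def assign_windows(stream, window=60, allowed_lateness=120):
    """Toy watermark sketch: events arrive in ingestion order, which may
    differ from event-time order because of clock skew and network delay.
    A window is emitted only when no sufficiently-late event can still
    land in it."""
    buffers = {}              # window start time -> buffered events
    watermark = float("-inf")
    for event in stream:
        # Advance the watermark, holding back the lateness budget.
        watermark = max(watermark, event["time"] - allowed_lateness)
        start = event["time"] - event["time"] % window
        buffers.setdefault(start, []).append(event)
        # Emit every window that can no longer receive late events.
        for s in sorted(list(buffers)):
            if s + window <= watermark:
                yield s, buffers.pop(s)
    # End of stream: flush whatever remains.
    for s in sorted(buffers):
        yield s, buffers[s]
```

An event timestamped 30 that arrives after one timestamped 50 still lands in the correct window, as long as it arrives within the lateness budget.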
  • 45. So… what do we need here? ■ Receive a stream of events ■ Wait some amount of time to allow related events to arrive ■ Decide which events are related to each other ■ Publish the results ■ Single tenant deployment - we need isolation ■ Support rates from several KB per hour up to several GBs per second at a reasonable cost
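The first four requirements above can be condensed into a minimal sketch, under stated assumptions: the `correlate` and `publish` callables stand in for the story-building logic and the downstream publisher, and a pluggable clock keeps the example testable.

```python
import time
from collections import deque

class BufferedCorrelator:
    """Minimal sketch of the requirements: receive a stream of events,
    hold each one for `wait_seconds` so related events can arrive,
    correlate the ready batch, then publish the result."""

    def __init__(self, correlate, publish, wait_seconds=5.0, clock=time.monotonic):
        self.correlate = correlate
        self.publish = publish
        self.wait_seconds = wait_seconds
        self.clock = clock
        self.buffer = deque()  # (arrival_time, event) in arrival order

    def receive(self, event):
        self.buffer.append((self.clock(), event))

    def flush_ready(self):
        """Correlate and publish every event older than the wait budget."""
        now = self.clock()
        ready = []
        while self.buffer and now - self.buffer[0][0] >= self.wait_seconds:
            ready.append(self.buffer.popleft()[1])
        if ready:
            self.publish(self.correlate(ready))
```

In practice `flush_ready` would run on a timer; the deck's remaining requirements (isolation per tenant, KB/hour to GB/second rates) are deployment concerns rather than code.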
  • 46. Optional solution #1 Using a relational database
  • 50. Proposed solution #1 Normalized data in a canonical form ready for processing Store the records in a relational DB Periodic tasks compute stories Publish stories for other components to consume
  • 56. Pros ■ Relatively simple implementation: We have to orchestrate the data and the queries, but we don’t have to write any complex logic ourselves Cons ■ Operational overhead - we have to deploy, maintain and operate another database ■ Limited performance - relational database queries are slower when compared to queries on a NoSQL database (if the data model allows utilizing a NoSQL database) ■ Operational cost - complex queries require more CPU, hence are more expensive
  • 57. Optional solution #2 Using Scylla + Kafka
  • 64. Proposed solution #2 Normalized data in a canonical form ready for processing Store the records in ScyllaDB Publish keys to fetch the records (the records can’t be sent over Kafka because they are too big, so we send only the primary key to fetch from Scylla) Multiple consumers read data from a Kafka topic Fetch records from Scylla Publish stories for other components to consume
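The "publish only the key" idea is the claim-check pattern, and it can be sketched with in-memory stand-ins (a hypothetical illustration: `table` plays ScyllaDB and `topic` plays the Kafka topic; the real system would use the Scylla and Kafka client libraries).

```python
from collections import deque

table = {}       # primary key -> full record (stand-in for ScyllaDB)
topic = deque()  # carries only small keys, never full payloads (stand-in for Kafka)

def produce(record):
    """Claim-check pattern: store the large record, publish only its key."""
    key = record["event_id"]
    table[key] = record
    topic.append(key)

def consume():
    """A consumer pops a key from the topic and fetches the full record."""
    key = topic.popleft()
    return table[key]
```

The queue stays cheap regardless of payload size; the cost is one extra round trip to the store per consumed message.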
  • 71. Pros ■ High throughput ■ One less database to maintain Cons ■ We have to write our own logic to find correlations and build stories ■ Complex architecture and deployment ■ We have to maintain thousands of Kafka deployments
  • 72. Optional solution #3 Using Scylla + a cloud-managed message queue
  • 73. Proposed solution #3 Normalized data in a canonical form ready for processing Store the records ScyllaDB Publish stories for other components to consume Publish keys to fetch the records Fetch records from Scylla using keys received from the queue Process and compute stories Queue
  • 79. Pros ■ High throughput when compared to the relational database approach ■ One less database to maintain ■ No need to maintain Kafka deployments Cons ■ Much slower performance when compared to Kafka
  • 80. The solution that solved our use case Using ScyllaDB - no message queue
  • 87. Accepted solution - high level Normalized data in a canonical form ready for processing Store the records in ScyllaDB The data is sharded into hundreds of shards Partition key is a tuple of (shard-number, insert_time) Clustering key is (event id) Publish stories for other components to consume Multiple consumers fetch records from Scylla using their assigned shard numbers and the time they want to consume The step resolution is configurable Process and compute stories
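The partition key described above can be sketched as follows. The constants are assumptions for illustration (the deck only says "hundreds of shards" and "the step resolution is configurable"): the shard is a stable hash of the event id, and the insert time is truncated to the step resolution so every row written in the same step lands in the same partition.

```python
import zlib
from datetime import datetime, timezone

NUM_SHARDS = 256    # hypothetical; the deck says "hundreds of shards"
STEP_SECONDS = 10   # hypothetical; the configurable step resolution

def partition_key(event_id: str, now: datetime):
    """Compute the (shard, insert_time) partition key.

    crc32 gives a cheap, stable hash so the same event id always maps
    to the same shard; truncating the epoch to the step resolution
    makes each (shard, insert_time) pair a bounded, time-boxed partition
    that consumers can read as a unit.
    """
    shard = zlib.crc32(event_id.encode()) % NUM_SHARDS
    epoch = int(now.timestamp())
    bucket = epoch - epoch % STEP_SECONDS
    return shard, bucket
```

Bounding each partition to one shard and one time step is what lets a consumer ask Scylla for exactly "shard X at time Y" instead of scanning.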
  • 95. Pros ■ Since we already have Scylla deployed for other parts, we don’t have to add any new creatures to the system ■ High throughput when compared to the relational database approach ■ One less database to maintain ■ No need to maintain Kafka deployments Cons ■ Our code became more complex ■ Producers and consumers have to have synchronized clocks (up to a certain resolution)
  • 97. Accepted solution - detailed {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …}
  • 98. Accepted solution - detailed SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …}
  • 99. Accepted solution - detailed SELECT offset_time FROM read_offsets WHERE shard = X {shard: 1, read_offset: 1000} {shard: 2, read_offset: 983} {shard: 3, read_offset: 999} {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …}
  • 100. Accepted solution - detailed SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...)
  • 101. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...)
  • 102. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...)
  • 104. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...) SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 3, payload: …} {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …}
  • 106. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...) SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 3, payload: …} {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y)
  • 107. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...) SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 3, payload: …} {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} {event_id: 2, payload: …} INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y)
  • 108. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...) SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 3, payload: …} {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} {event_id: 2, payload: …} {event_id: 4, payload: …} INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y)
  • 110. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...) SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y)
  • 111. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...) SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y) {event_id: 5, payload: …}
  • 113. Accepted solution - detailed Shard 1 Shard 2 Shard 3 SELECT id, payload FROM data WHERE shard = X and insert_time = Y INSERT INTO data (shard, insert_time, payload) VALUES (X, NOW(), ...) SELECT offset_time FROM read_offsets WHERE shard = X {event_id: 1, payload: …} {event_id: 2, payload: …} {event_id: 3, payload: …} {event_id: 4, payload: …} INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y) {event_id: 5, payload: …} {event_id: 5, payload: …}
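The consumer loop walked through in the slides above can be sketched with dict stand-ins for the two Scylla tables (an illustration only: `data` maps (shard, insert_time) to records, `read_offsets` maps shard to the last consumed insert_time; the real system issues the CQL shown in the slides).

```python
def consume_shard(shard, data, read_offsets, step=1):
    """One iteration of a shard consumer: read the shard's offset,
    fetch the next time bucket, process it, then advance the offset."""
    # SELECT offset_time FROM read_offsets WHERE shard = X
    offset = read_offsets.get(shard, 0)
    next_time = offset + step
    # SELECT id, payload FROM data WHERE shard = X AND insert_time = Y
    records = data.get((shard, next_time), [])
    # ... process and compute stories from `records` here ...
    # INSERT INTO read_offsets (shard, offset_time) VALUES (X, Y)
    read_offsets[shard] = next_time
    return records
```

Because the offset is committed only after the bucket is processed, a crashed consumer re-reads the same (shard, insert_time) partition on restart, which is what gives the zero-data-loss property at the cost of at-least-once processing.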
  • 117. Our results Reduced operational cost Reduced operational complexity Increased throughput
  • 118. Thank you! Stay in touch Daniel Belenky dbelenky@paloaltonetworks.com