Palo Alto Networks processes terabytes of events each day. One of their many challenges is to understand which of those events, arriving from many different sensors, actually describe the same story from different viewpoints.
Traditionally, such a system would need some sort of database to store the events and a message queue to notify consumers about new events arriving into the system. Palo Alto Networks wanted to avoid the cost and operational overhead of deploying yet another stateful component, so they designed a solution that uses ScyllaDB both as the database for the events *and* as a message queue that lets their consumers consume the correct events each time. In this talk, Daniel Belenky, Principal Software Engineer at Palo Alto Networks, walks you through their process.
To watch all of the recordings from Scylla Summit 2022, visit our website: https://www.scylladb.com/summit.
2. Daniel Belenky
Principal Software Engineer
■ Kubernetes & Virtualization
■ Distributed applications
■ Big data and stream processing
3. Agenda
■ A brief overview of the product and my team - 3 min
■ The challenge we were facing - 5 min
■ The solutions we considered - 5 min
■ How we solved the problem with Scylla - 12 min
5. Our product
A security product that performs analytics, detection, and response.
■ Millions of records per second
■ Multiple data sources and schemas
■ Has to provide insights in near real time (security...)
6. About my team
We are responsible for the infrastructure that:
■ Handles stream processing of data coming from multiple sources.
■ Cleans, normalizes, and processes the data, preparing it for further analysis.
■ Builds stories - multiple data sources emit different events and provide different views on the same network session, and we want to fuse the events that tell the same story from different perspectives.
We mostly develop in Go and Python, and we deploy on K8s.
10-14. Problem description (part 1)
Various sensors see a network event at different times, each reporting it in its own form:
10:00:01 - {event: dns-query, id: 6c92e}, {event: dns-query, id: 873a1}, ...
10:00:02 - {kind: login, id: 13}, {kind: signup, id: 17}, ...
10:08:05 - {type: GET, id: CHJW}, {type: POST, id: KQJD}, ...
15-18. Problem description (part 2)
Data from different sensors arrives in different forms and formats, at different times. It is turned into normalized data in a canonical form, ready for processing.
The result: millions of normalized but unassociated entries per second, from many different sources.
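To make the canonical form concrete, here is a minimal sketch of what a normalized event record could look like; the field names are illustrative assumptions, not the actual schema.

```python
# A minimal sketch of a "normalized, canonical" event record.
# Field names are illustrative assumptions, not the actual schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class NormalizedEvent:
    event_id: str                # id assigned by the sensor, e.g. "6c92e"
    sensor: str                  # which sensor produced the event (dns, auth, http, ...)
    kind: str                    # normalized event kind, e.g. "dns-query", "login", "GET"
    event_time: datetime         # when the sensor observed the event
    ingestion_time: datetime     # when our pipeline received it
    attributes: dict = field(default_factory=dict)  # session attributes used for correlation
```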
19. The question is:
How do we associate discrete entries that describe the same network session?
21-25. Why is it a challenge?
■ Clock skew across different sensors
Clocks across sensors might not be synchronized to the second
■ We have thousands of deployments to manage
Deployments also vary in size (from bytes per second to GBs per second)
■ Each sensor's viewpoint on the session
Different sensors have different views of the same session
■ Zero tolerance for data loss
Data is pushed to us, and if we lose it, it's lost for good
■ Continuous out-of-order stream
Sensors send data at different times, and event time != ingestion time != processing time
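A tiny illustration of the last point, with made-up timestamps: the three clocks attached to a single event rarely agree, so arrival order alone cannot drive correlation.

```python
# Made-up timestamps for one event, to illustrate event time != ingestion time != processing time.
from datetime import datetime, timedelta

event_time = datetime(2022, 2, 10, 10, 0, 1)             # when the sensor saw the DNS query
ingestion_time = event_time + timedelta(minutes=8)        # the sensor buffered before sending
processing_time = ingestion_time + timedelta(seconds=30)  # the pipeline picked it up later still

assert event_time != ingestion_time != processing_time
```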
39-45. So… what do we need here?
■ Receive a stream of events
■ Wait some amount of time to allow related events to arrive
■ Decide which events are related to each other
■ Publish the results
■ Single-tenant deployment - we need isolation
■ Support rates from several KB per hour up to several GBs per second, at a reasonable cost
A minimal skeleton of the first four steps is sketched below.
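As referenced above, a minimal, hedged skeleton of the first four requirements (receive, wait, correlate, publish); the correlation key, the wait window, and all names are illustrative placeholders rather than the production logic.

```python
# Skeleton of the pipeline: buffer events, wait for related ones, then publish stories.
from collections import defaultdict

WAIT_SECONDS = 60                 # how long to let related events accumulate (assumed)
buffer = defaultdict(list)        # correlation key -> events buffered so far

def correlation_key(event: dict) -> tuple:
    # Placeholder: in reality the key would be derived from normalized session attributes.
    return (event["attributes"].get("src_ip"), event["attributes"].get("dst_ip"))

def receive(event: dict) -> None:
    # Steps 1-2: receive an event and buffer it while related events arrive.
    buffer[correlation_key(event)].append(event)

def flush(now: float, publish) -> None:
    # Steps 3-4: once a group has waited long enough, treat it as a story and publish it.
    for key, events in list(buffer.items()):
        oldest = min(e["ingestion_time"] for e in events)   # epoch seconds, for illustration
        if now - oldest >= WAIT_SECONDS:
            publish({"story_key": key, "events": events})
            del buffer[key]
```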
49-50. Proposed solution #1
Normalized data in a canonical form, ready for processing, is stored in a relational DB. Periodic tasks compute stories and publish them for other components to consume.
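A rough sketch of this approach, with sqlite3 standing in for the relational database; the schema and the grouping query are illustrative assumptions, not the actual design.

```python
# Relational-DB approach: store records, then periodically query for correlated groups.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        session_ip TEXT,      -- attribute the periodic task groups on (illustrative)
        event_time INTEGER,   -- unix seconds
        payload    TEXT
    )
""")

def periodic_story_task(window_start: int, window_end: int):
    # Group all events in the time window that share a session attribute into one story.
    return conn.execute(
        """
        SELECT session_ip, GROUP_CONCAT(event_id)
        FROM events
        WHERE event_time BETWEEN ? AND ?
        GROUP BY session_ip
        """,
        (window_start, window_end),
    ).fetchall()
```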
52-56. Pros & Cons
Pros
■ Relatively simple implementation: we have to orchestrate the data and the queries, but not write any complex logic ourselves
Cons
■ Operational overhead - we have to deploy, maintain and operate another database
■ Limited performance - relational database queries are slower when compared to queries on a NoSQL database (if the data model allows utilizing a NoSQL database)
■ Operational cost - complex queries require more CPU, hence are more expensive
61-64. Proposed solution #2
Normalized data in a canonical form, ready for processing, is stored in ScyllaDB. The records can't be sent over Kafka because they are too big, so only the primary keys needed to fetch them from Scylla are published to a Kafka topic. Multiple consumers read the keys from the topic, fetch the records from Scylla, and publish stories for other components to consume.
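A hedged sketch of this flow, assuming the kafka-python and cassandra-driver client libraries; the topic, keyspace, and table names are invented for illustration.

```python
# Option #2: full records live in Scylla, only their keys travel over Kafka.
from kafka import KafkaProducer, KafkaConsumer
from cassandra.cluster import Cluster

scylla = Cluster(["scylla-host"]).connect("events_ks")
producer = KafkaProducer(bootstrap_servers="kafka:9092")

def store_and_publish(event_id: str, payload: bytes) -> None:
    # The full record goes to Scylla; only the small primary key is sent over Kafka.
    scylla.execute(
        "INSERT INTO events (event_id, payload) VALUES (%s, %s)",
        (event_id, payload),
    )
    producer.send("event-keys", event_id.encode())

def consume_and_fetch():
    # Consumers read keys from the topic and fetch the full records from Scylla.
    consumer = KafkaConsumer("event-keys",
                             bootstrap_servers="kafka:9092",
                             group_id="story-builder")
    for msg in consumer:
        key = msg.value.decode()
        row = scylla.execute(
            "SELECT payload FROM events WHERE event_id = %s", (key,)
        ).one()
        yield key, row.payload
```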
69-71. Pros & Cons
Pros
■ High throughput
■ One less database to maintain
Cons
■ We have to write our own logic to find correlations and build stories
■ Complex architecture and deployment
■ We have to maintain thousands of Kafka deployments
73. Proposed solution #3
Normalized data in a canonical form, ready for processing, is stored in ScyllaDB, and the keys needed to fetch the records are published to a queue. Consumers fetch the records from Scylla using the keys received from the queue, process and compute stories, and publish the stories for other components to consume.
76-79. Pros & Cons
Pros
■ High throughput when compared to the relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Much slower performance when compared to Kafka
80. The solution that solved our use case
Using ScyllaDB - no message queue
81-87. Accepted solution - high level
Normalized data in a canonical form, ready for processing, is stored in ScyllaDB.
■ The data is sharded into hundreds of shards
■ The partition key is a tuple of (shard-number, insert_time); the clustering key is (event id)
■ Multiple consumers fetch records from Scylla using their assigned shard numbers and the time they want to consume; the step resolution is configurable
■ Consumers process and compute stories, then publish them for other components to consume
A sketch of the schema and the producer/consumer flow follows below.
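As referenced above, a minimal sketch of the accepted design, assuming the Python cassandra-driver; the shard count, bucket resolution, and table/column names are illustrative assumptions, while the partition key (shard, insert_time) and clustering key (event id) follow the scheme described on the slide.

```python
# ScyllaDB as both the event store and the queue: producers write into (shard, time bucket)
# partitions, consumers poll the buckets they own - no separate message queue.
import zlib
from datetime import datetime, timezone
from cassandra.cluster import Cluster

NUM_SHARDS = 240        # "hundreds of shards" (assumed value)
BUCKET_SECONDS = 10     # the configurable "step resolution" (assumed value)

session = Cluster(["scylla-host"]).connect("events_ks")
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        shard       int,
        insert_time timestamp,   -- time bucket, rounded down to BUCKET_SECONDS
        event_id    text,
        payload     blob,
        PRIMARY KEY ((shard, insert_time), event_id)
    )
""")

def bucket(ts: datetime) -> datetime:
    secs = int(ts.timestamp()) // BUCKET_SECONDS * BUCKET_SECONDS
    return datetime.fromtimestamp(secs, tz=timezone.utc)

def produce(event_id: str, payload: bytes) -> None:
    # Producers spread events across shards and write them into the current time bucket.
    shard = zlib.crc32(event_id.encode()) % NUM_SHARDS
    session.execute(
        "INSERT INTO events (shard, insert_time, event_id, payload) VALUES (%s, %s, %s, %s)",
        (shard, bucket(datetime.now(timezone.utc)), event_id, payload),
    )

def consume(shard: int, ts: datetime) -> list:
    # Each consumer owns some shards and walks time buckets that are old enough to be
    # considered complete, reading its events directly from Scylla.
    rows = session.execute(
        "SELECT event_id, payload FROM events WHERE shard = %s AND insert_time = %s",
        (shard, bucket(ts)),
    )
    return [(row.event_id, row.payload) for row in rows]
```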
89-95. Pros & Cons
Pros
■ Since we already have Scylla deployed for other parts, we don't have to add any new creatures to the system
■ High throughput when compared to the relational database approach
■ One less database to maintain
■ No need to maintain Kafka deployments
Cons
■ Our code became more complex
■ Producers and consumers have to have synchronized clocks (up to a certain resolution)