When building distributed data systems architectures, you need to pair your event streams with database systems that can keep up with their rate and volume. Discover from Raouf Chebri, Developer Advocate for ScyllaDB, why ScyllaDB’s NoSQL database is a perfect complement to event streaming technologies like Apache Pulsar.
11. Change Data Capture

Table t before the update:

pk | ck | v
0  | 0  | 1

Table t after the update:

pk | ck | v
0  | 0  | 3

The CDC log records the change:
change at 2020-01-29 14:37:32: UPDATE ks.t SET v = 3 WHERE pk = 0 AND ck = 0;
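To make the slide concrete, here is a minimal CQL sketch of that flow using ScyllaDB’s CDC feature (table and column names taken from the slide; the log-table name follows ScyllaDB’s CDC naming convention):

```cql
-- Enable CDC when creating the table.
CREATE TABLE ks.t (pk int, ck int, v int, PRIMARY KEY (pk, ck))
  WITH cdc = {'enabled': true};

-- The update from the slide.
UPDATE ks.t SET v = 3 WHERE pk = 0 AND ck = 0;

-- Changes land in an auto-created log table alongside the base table.
SELECT * FROM ks.t_scylla_cdc_log;
```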
Intro self
THANKS for attending the Distributed Data Systems Masterclass!
So maybe this is a good time, before we dive into the demo, to take a moment and explain what ScyllaDB is and why we’re using it.
Here is an illustration of how data is replicated across a cluster. The query is parsed and processed by one node, then the data is replicated to other nodes for high availability.
We’ve seen this representation of a cluster in a previous slide and you might also have seen this architecture before if you’re familiar with distributed systems and other databases such as DynamoDB, CosmosDB and Apache Cassandra.
In fact, ScyllaDB is modeled after Apache Cassandra and is API-compatible with both Cassandra and DynamoDB.
But what makes ScyllaDB a lot faster than the competition, and what makes it stand out, comes down to two things:
It’s implemented in C++, not Java.
And its shard-per-core architecture.
Let’s talk briefly about what shard-per-core means.
Shard-per-core means that every CPU core in your machine is responsible for a portion of the data.
So imagine the following: when you send a query to the database, the partition key determines which node in the cluster will process the query.
With shard-awareness, it also determines which core will process the query.
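As a toy illustration of the idea (this is not ScyllaDB’s actual algorithm; real drivers hash the partition key with Murmur3 against the cluster’s token ring, and the node and shard counts here are made up):

```python
# Toy sketch of shard-aware routing: the partition key hashes to a token;
# the token picks the owning node and, with shard-awareness, the CPU core.
NODES = ["node-1", "node-2", "node-3"]   # assumed 3-node cluster
SHARDS_PER_NODE = 8                      # assumed 8 cores per node

def route(partition_key: bytes) -> tuple[str, int]:
    token = hash(partition_key)          # real drivers use Murmur3 tokens
    node = NODES[token % len(NODES)]     # node that owns this token range
    shard = token % SHARDS_PER_NODE      # core on that node that owns the data
    return node, shard

print(route(b"pk=0"))  # e.g. ('node-2', 5)
```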
NEXT
Let’s talk first a little bit about connectors.
So let’s say you’re using a streaming platform such as Kafka or Pulsar.
Connectors help you import and export data from some of the most commonly used data systems with just a simple configuration, and the built-in connectors help you avoid writing all of that bug-prone, lengthy integration code yourself.
Connectors that import data into the streaming platform are called Sources.
Connectors that export data from the streaming platform are called Sinks.
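For instance, Pulsar lets you list the connectors that ship with the distribution (assuming a local Pulsar installation; commands per the pulsar-admin CLI):

```bash
bin/pulsar-admin sources available-sources   # built-in Source connectors
bin/pulsar-admin sinks available-sinks       # built-in Sink connectors
```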
We don’t have time to cover this today, but if you’re interested in knowing more about ScyllaDB, you should check out Scylla University, or actually attend Scylla University LIVE in July.
Alright! We said ScyllaDB and Pulsar are both very fast! So let’s now get to the demo!
Cheers to you for keeping up with the current pace of technology and always trying to improve.
Life used to be simple
Your application became popular
Scale up
Virtual machines
Worried about the data: multiple instances, Geo-replication
Worried about latency because of geographical distance: caching and CDNs
You heard of containers and their advantages
You heard of microservices
Added an event-bus
Code is a liability, not an asset
I actually stole this quote from Matt Coulter, who is an AWS Serverless Hero. I tried to trace the quote back to its original author, but I couldn’t find them.
So please ping me on Twitter or send me an email if you know more about that.
But basically, code is a liability. The more code you write, the more bugs you introduce and the more you have to fix and maintain.
Summary: streaming systems serve as data pipelines and help move data from one place to another.
In this demo, we’ll
spin up a ScyllaDB instance
Configure Pulsar to use ScyllaDB as a Sink
And walk through the code to set up a producer
I already have Pulsar running using Docker. I need an instance of ScyllaDB as well.
Here is the command to run a cluster locally: I’m using Docker to run a ScyllaDB container on my machine. That said, I don’t always like to run many things on my machine, and to avoid the “it worked on my machine” kind of issues I like to use Scylla Cloud, which you can try for free.
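The exact flags weren’t captured in the notes, but a plausible single-node setup looks like this (image name per Docker Hub; --smp 1 keeps the container on one core, and -p exposes the CQL port):

```bash
docker run --name scylla -d -p 9042:9042 scylladb/scylla --smp 1
```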
In the dashboard, I can create a cluster and deploy it on AWS or GCP in the geographical location that makes the most sense for my application. The closer it is to your users, the better your latency.
I’m going to skip this part as I already have a cluster ready for the demo.
On my cluster, I get information about my nodes and whether they are running. I can see that everything is okay.
In the Connect tab, I have instructions to connect to the cluster. I can use any language I like, or connect from the command line using cqlsh. CQL is the Cassandra Query Language which, if you’re familiar with SQL, should feel very intuitive.
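Connecting from the command line looks roughly like this (the host and credentials are placeholders; Scylla Cloud requires the user and password shown in the Connect tab):

```bash
cqlsh 203.0.113.10 9042 -u scylla -p 'your-password'
```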
So I’m going to connect to the cluster and see what’s in there. Here are the keyspaces that I have. A keyspace is the logical container where your data is stored; it’s also where you define things like the replication strategy and replication factor.
Let me use pulsar_test_keyspace and have a look at the tables. You can see I created a pulsar_test_table that has a key and a col.
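In CQL, those objects were presumably created along these lines (the replication settings are an assumption for a small test cluster; the key and col columns match what the demo shows):

```cql
CREATE KEYSPACE pulsar_test_keyspace
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE pulsar_test_keyspace.pulsar_test_table (
  key text PRIMARY KEY,
  col text
);
```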
So let’s configure the sink.
In pulsar, I created a config file sink-config.json where I specify the keyspace, table and topic.
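The file wasn’t shown in full, but based on the fields Pulsar’s built-in Cassandra-compatible sink expects, it plausibly looks like this (roots is the database contact point and columnFamily is the table; the topic itself is usually passed when the sink is created rather than in this file):

```json
{
  "roots": "localhost:9042",
  "keyspace": "pulsar_test_keyspace",
  "columnFamily": "pulsar_test_table",
  "keyname": "key",
  "columnName": "col"
}
```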
Now I’m ready to create the sink using the following command.
You can see I have tenant, namespace, the sink-type and the config file as input.
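Reconstructed from that description, the command plausibly looks like this (the sink name, tenant, namespace, and input topic are assumptions for a default setup; flag names per the pulsar-admin CLI):

```bash
bin/pulsar-admin sinks create \
  --tenant public \
  --namespace default \
  --name scylla-test-sink \
  --sink-type cassandra \
  --sink-config-file sink-config.json \
  --inputs test_scylla_topic
```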
Now that I’ve successfully created the sink, let’s test it out.
I have a simple Python script that produces messages to the topic.
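The script itself wasn’t captured in the notes, so here is a minimal sketch of what such a producer looks like with the pulsar-client Python package (the service URL, topic name, and message contents are assumptions; the sink presumably maps the message key to the key column and the payload to the col column):

```python
import pulsar

# Connect to the local Pulsar broker (assumed default service URL).
client = pulsar.Client('pulsar://localhost:6650')

# Produce to the topic the sink consumes from (assumed name).
producer = client.create_producer('test_scylla_topic')

for i in range(10):
    # partition_key becomes the record key; the payload is the value.
    producer.send(('value-%d' % i).encode('utf-8'),
                  partition_key='key-%d' % i)

client.close()
```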
The reason you’d consider using ScyllaDB is that latency matters at scale.
This is a chart from the Kafka vs. Pulsar benchmark, which you can find on StreamNative’s website and which I highly encourage you to read.
The p99, and anything beyond that, is where it gets interesting.
With 1KB messages, we clearly see Pulsar (here in blue) consistently operating at a few milliseconds, or even microseconds.
The database you choose can easily become a bottleneck in your system. That’s why you need a system that can digest that huge influx of data very quickly.
ScyllaDB is designed to do just that: it’s designed to handle millions of operations per second with sub-millisecond p99 latency.
NEXT