Pulsar Summit
San Francisco
Hotel Nikko
August 18 2022
Ecosystem
Distributed Database
Design Decisions to
Support High
Performance Event
Streaming
Peter Corless
Director of Technical Advocacy • ScyllaDB
Peter Corless is the Director of
Technical Advocacy at ScyllaDB, the
company behind the monstrously fast
and scalable NoSQL database.
He is the editor of and frequent
contributor to the ScyllaDB blog, and
program chair of the ScyllaDB Summit
and P99 CONF.
He recently hosted the Distributed
Systems Masterclass, co-sponsored by
StreamNative+ScyllaDB
Peter Corless
Director of Technical Advocacy
ScyllaDB
Distributed Database Design Decisions to Support High Performance Event Streaming
Requirements for
This Next Tech Cycle
This Next Tech Cycle: 2000 → 2025+
+ Transistor Count: 42M Pentium 4 (2000); 228M Pentium D (2005); 2.3B Xeon Nehalem-EX (2010); 10B SPARC M7 (2015); 39B Epyc Rome (2019); ~60B? Epyc Genoa (2022); ~80B? Epyc Bergamo (2023)
+ Core Count: 1 → 2 → 8 → 32 → 64 → 96 → 128
+ Zettabyte Era (1 ZB = 10²¹ bytes): 2 ZB data stored (2010); 1.2 ZB IP traffic (2016); 64 ZB data stored (2020); ~180 ZB data stored (2025)
+ Broadband Speeds: 1.5 Mbps (2002); 16 Mbps (2008); 105 Mbps (2014); 1 Gbps (2018); 3 Gbps (2021)
+ Wireless Services: 3G (2002); 4G (2014); 5G (2018)
+ Public Cloud to Multicloud: AWS (2006); GCP (2008); Azure (2010); Azure Arc
Hardware Infrastructure is Evolving
+ Compute
+ From 100+ cores → 1,000+ cores per server
+ From multicore CPUs → full System on a Chip (SoC) designs (CPU,
GPU, Cache, Memory)
+ Memory
+ Terabyte-scale RAM per server
+ DDR4 — 800 to 1600 MHz, 2011-present
+ DDR5 — 4600 MHz in 2020, 8000 MHz by 2024
+ DDR6 — 9600 MHz by 2025
+ Storage
+ Petabyte-scale storage per server
+ NVMe 2.0 [2021] — separation of base and transport
Databases are Evolving
+ Consistency Models [CAP Model: AP vs. CP]
+ Strong, Eventual, Tunable
+ ACID vs. BASE
+ Data Model / Query Languages [SQL vs. NoSQL]
+ RDBMS / SQL
+ NoSQL [Document, Key-Value, Wide-Column, Graph]
+ Big Data → HUGE Data
+ Data Stored: Gigabytes? Terabytes? Petabytes? Exabytes?
+ Payload Sizes: Kilobytes? Megabytes?
+ OPS / TPS: Hundreds of thousands? Millions?
+ Latencies: Sub-millisecond? Single-digit milliseconds?
Databases are [or should be] designed for
specific kinds of data, specific kinds of
workloads, and specific kinds of queries.
How closely a database's design and
implementation align with your specific use
case determines the resistance of the system.
Variable resistors, anyone?
Sure you can use various databases for tasks they were
never designed for — but should you?
DATA ENGINEERS
Δdata / t, where t ≈ n × 0.001 s
For a database to be appropriate for event streaming, it needs to
support managing changes to data over time in “real time” —
measured in single-digit milliseconds or less — even when those
changes are produced at a rate of hundreds of thousands or
millions of events per second. [And greater rates in future.]
Cloud Native Qualities:
+ DBaaS
+ Single-cloud vs. Multi-cloud?
+ Multi-datacenter
+ Elasticity
+ Serverless
+ Orchestration
+ DevSecOps
All the “-ilities”:
+ Scalability
+ Reliability
+ Durability
+ Manageability
+ Observability
+ Flexibility
+ Facility / Usability
+ Compatibility
+ Interoperability
+ Linearizability
Event-Driven:
+ “Batch” → “Stream”
+ Change Data Capture (CDC)
+ Sink & Source
+ Time Series
+ Event Streaming
+ Event Sourcing* [* ≠ Event Streaming]
Best Fit to Use Case:
+ SQL or NoSQL?
+ Query Language
+ Data Model
+ Data Distribution
+ Workload [R/W]
+ Speed
+ Price/TCO/ROI
While many database systems have been incrementally adapted to cloud native
environments, they still have underlying architectural limits or presumptions.
+ Strong consistency / record-locking — limits latencies & throughput
+ Single primary server for read/writes — replicas are read-only or only for failover;
bottlenecks write-heavy workloads
+ Local clustering/single datacenter design — inappropriate for high availability;
hampers global distribution; lack of topology-awareness induces fragility
Two flavors of responses:
+ NoSQL — Designed for non-relational data models, various query languages,
high availability distributed systems
+ Key value, document, wide column, graph, etc.
+ NewSQL — Still RDBMS, still SQL, but designed to operate as a highly
available distributed system
Database-as-a-Service (DBaaS)
+ Lift-and-Shift to Cloud — Same base offering as on-premises version,
offered as a cloud-hosted managed service
+ Easy/fast to bring to market, but no fundamental design changes
+ Cloud Native — Designed from-the-ground-up for cloud [only] usage
+ Elasticity — Dynamic provisioning, scale up/down for throughput, storage
+ Serverless — Do I need to know what hardware I’m running on?
+ Microservices & API integration — App integration, connectors, DevEx
+ Billing — making it easy to consume & measure ROI/TCO
+ Governance: Privacy Compliance / Data Localization
What does a database need to be, or have, or do, to properly support
event streaming in 2022?
+ High Availability [“Always On”]
+ Impedance Match of Database to Event Streaming Systems
+ Similar characteristics for throughput, latency
+ All the Appropriate “Goesintos/Goesouttas”
+ Sink Connector
+ Change Data Capture (CDC) / Source Connector
+ Supports your favorite streaming flavor of the day
+ Kafka, Pulsar, RabbitMQ Streams, etc.
Event Streaming Journey of a
NoSQL Database: ScyllaDB
ScyllaDB: Building on “Good Bones”
+ Performant: Shard-per-core, async-everywhere, shared-nothing architecture
+ Scalable: both horizontal [100s/1000s of nodes] & vertical [100s/1000s cores]
+ Available: Peer-to-Peer, Active-Active; no single point of failure
+ Distribution: Multi-datacenter clustering & replication, auto-sharding
+ Consistency: tunable; primarily eventual, but also Lightweight Transactions (LWT)
+ Topology Aware: Shard-aware, Node-aware, Rack-aware, Datacenter-aware
+ Compatible: Cassandra CQL & Amazon DynamoDB APIs
ScyllaDB Journey to Event Streaming — Starting with Kafka
+ Shard-Aware Kafka Sink Connector [January 2020]
+ Github: https://github.com/scylladb/kafka-connect-scylladb
+ Blog: https://www.scylladb.com/2020/02/18/introducing-the-kafka-scylla-connector/
+ Change Data Capture [January 2020 – October 2021]
+ January 2020: ScyllaDB Open Source 3.2 — Experimental
+ Course of 2020 - 3.3, 3.4, 4.0, 4.1, 4.2 — Experimental iterations
+ January 2021: 4.3: Production-ready, new API
+ March 2021: 4.4: new API
+ October 2021: 4.5: performance & stability
+ CDC Kafka Source Connector [April 2021]
+ Github: https://github.com/scylladb/scylla-cdc-source-connector
+ Blog: https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/
ScyllaDB Journey to Event Streaming with Pulsar
+ Pulsar Consumer: Cassandra Sink Connector
+ Comes by default with Pulsar
+ ScyllaDB is Cassandra CQL compatible
+ Docs: https://pulsar.apache.org/docs/io-cassandra-sink/
+ Github: https://github.com/apache/pulsar/blob/master/site2/docs/io-cassandra-sink.md
+ Pulsar Producer: Can use ScyllaDB CDC Source Connector using Kafka Compatibility
+ Pulsar makes it easy to bring Kafka topics into Pulsar
+ Docs: https://pulsar.apache.org/docs/adaptors-kafka/
+ Potential Developments:
+ Native Pulsar Shard-Aware ScyllaDB Consumer Connector — even faster ingestion
+ Native CDC Pulsar Producer — unwrap your topics
ScyllaDB CDC:
How Does It Work?
ScyllaDB Quickstart: Create a Table and Enable CDC
CREATE TABLE ks.tbl (
pk int,
ck int,
val int,
col set<int>,
PRIMARY KEY (pk, ck)
) WITH cdc = { 'enabled': true };
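With CDC enabled, writes to the base table transparently generate records in an auto-created log table. A minimal sketch, assuming the ks.tbl definition above (the log table name follows ScyllaDB's <base>_scylla_cdc_log convention):

```cql
-- Write to the base table as usual; no application changes are needed.
INSERT INTO ks.tbl (pk, ck, val, col) VALUES (1, 10, 100, {1, 2});

-- Read the generated change records from the auto-created log table.
SELECT * FROM ks.tbl_scylla_cdc_log;
```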
CDC Options - Record Types
Delta (“What was changed?”):
+ 'keys': only the primary key of the change is recorded
+ 'full': contains information about every modified column
Preimage (“What was before?”):
+ 'false': disables the feature
+ 'true': contains only the columns that were changed by the write
+ 'full': contains the entire row (as it was before the write was made)
Postimage (“What’s the end result?”):
+ 'false': disables the feature
+ 'true': shows the affected row’s state after the write; a postimage row always contains all the columns, whether or not they were affected by the change
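These record types are configured in the same cdc map used to enable the feature. A hedged sketch (option names per the ScyllaDB CDC documentation, applied to the quickstart's ks.tbl):

```cql
-- Record the full prior row ('preimage') and the resulting row
-- ('postimage') alongside the default delta records.
ALTER TABLE ks.tbl WITH cdc = {
    'enabled': true,
    'preimage': 'full',
    'postimage': true
};
```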
CDC Options - Enabled & TTL
Enabled:
+ 'true': enables the CDC feature
+ 'false': disables the CDC feature
TTL:
+ 86400: in seconds; by default, records in the CDC log table expire within 24 hours
+ If set to 0, records do not expire, and a separate cleaning mechanism is recommended
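The TTL is likewise set in the cdc map. A sketch, assuming the ks.tbl table from the quickstart:

```cql
-- Expire CDC log records after 1 hour instead of the default 86400 s.
ALTER TABLE ks.tbl WITH cdc = { 'enabled': true, 'ttl': 3600 };
```

Setting 'ttl': 0 keeps records indefinitely, which is why a separate cleaning mechanism is recommended in that case.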
CDC Log Table

cqlsh> DESC TABLE ks.tbl_scylla_cdc_log;

CREATE TABLE ks.tbl_scylla_cdc_log (
    "cdc$stream_id" blob,          -- partition key
    "cdc$time" timeuuid,           -- sorted by time
    "cdc$batch_seq_no" int,        -- batch sequence
    "cdc$deleted_col" boolean,
    "cdc$deleted_elements_col" frozen<set<int>>,
    "cdc$deleted_val" boolean,
    "cdc$end_of_batch" boolean,
    "cdc$operation" tinyint,
    "cdc$ttl" bigint,
    ck int,
    col frozen<set<int>>,
    pk int,
    val int,
    PRIMARY KEY ("cdc$stream_id", "cdc$time", "cdc$batch_seq_no")
)
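Consumers can query this log like any other CQL table; within a stream, records come back ordered by time and then batch sequence number. A sketch:

```cql
-- Inspect recent changes along with their cdc$ metadata columns.
SELECT "cdc$stream_id", "cdc$time", "cdc$operation", pk, ck, val
FROM ks.tbl_scylla_cdc_log
LIMIT 10;
```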
How Do NoSQL CDC Implementations Compare?

                        Cassandra      DynamoDB             MongoDB              ScyllaDB
Consumer location       on-node        off-node             off-node             off-node
Replication             duplicated     deduplicated         deduplicated         deduplicated
Deltas                  yes            no                   partial              optional
Pre-image               no             yes                  no                   optional
Post-image              no             yes                  yes                  optional
Slow consumer reaction  table stopped  consumer loses data  consumer loses data  consumer loses data
Ordering                no             yes                  yes                  yes
Writing to Base Table [No CDC]
CQL write goes to
coordinator node.
INSERT INTO base_table(...)...
Coordinator node creates
write calls to replica nodes.
INSERT INTO base_table(...)...
CQL
Replicated writes
Writing to Base Table [No CDC]
Writing to CDC Enabled Table
CQL write goes to
coordinator node.
INSERT INTO base_table(...)...
Writing to CDC Enabled Table (Pre-/Postimage)
If required, the coordinator reads existing
row data for pre-/postimage generation.
INSERT INTO base_table(...)...
CQL
(Opt) preimage read
Writing to CDC Enabled Table
The coordinator creates CDC log table writes
and piggybacks them on base table writes to the
same replica nodes. While the data size written
is larger, the number of write requests does not
change.
INSERT INTO base_table(...)...
CQL
CDC write
▪ CDC data is grouped into streams
• Divides the token ring space
• Each stream represents a tokenization “slot” in
current topology
• Stream is log partition key
• Stream chosen for given write based on base table
PK tokenization
▪ Can read from all, one or some streams at a time
• Allows “round-robin” traversal of data space to
avoid too large or cross-node queries
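Because the stream id is the log table's partition key, a consumer can poll one stream at a time with an ordinary CQL query. A sketch with a hypothetical stream id (real ids come from the cluster's current CDC generation):

```cql
-- Read changes for a single stream; the 0x... blob is a placeholder.
SELECT "cdc$time", "cdc$operation", pk, ck, val
FROM ks.tbl_scylla_cdc_log
WHERE "cdc$stream_id" = 0x0dc6e8e48a6f4f15a8dbcd2b4e85bd92;
```

In practice, the scylla-cdc client libraries and connectors handle stream discovery and round-robin traversal for you.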
[Diagram: token ring divided into Streams 1, 2, 3, 4...]
CDC Streams
[Diagram: CDC streams (Stream 1, 2, 3, 4...) on the token ring feed the Java driver and the Kafka source connector, which publishes to a Kafka broker.]
The Java driver handles round-robin traversal.
Change Data Capture (CDC) lesson here:
https://university.scylladb.com/courses/scylla-operations/lessons/change-data-capture-cdc/
Learn NoSQL for free!
university.scylladb.com
Peter Corless
Thank you!
peter@scylladb.com
@petercorless
Pulsar Summit
San Francisco
Hotel Nikko
August 18 2022
