Calle Wilund presented on Change Data Capture (CDC) in Scylla. CDC in Scylla captures changes made to tables in the database and makes them available asynchronously to consumers. It is enabled per table and generates a log of modifications including pre-image, delta, and post-image data. This log is stored as another table in the database and can be consumed through normal CQL queries. CDC provides an easy way to integrate data duplication, replication, and analytics use cases without external tools.
2. Presenter
Calle Wilund, Software Developer at ScyllaDB
Co-founder of Appeal Virtual Machines and one of the principal
architects behind the JRockit JVM, Calle Wilund has an
extensive background in software development, specializing in
virtual machines, compiler technologies and high performance
computing as well as systems manageability.
4. Change Data Capture - CDC
Consumable modification record for one or more tables in the database
■ Capture changes (write/delete)
■ Asynchronously readable by a consumer
■ Key feature in Scylla 2020
5. Use cases
■ Transaction analysis
● Fraud detection
● Kafka pipeline
■ Direct integration without third-party adaptor
■ Data duplication
● Database mirroring
● Database replication
■ <Insert your use case>
7. What does it do
■ Enabled per table
■ On modification of a row
● Read pre-image (current state of the row) - optional
■ If row exists
■ For affected columns
● Add a log write to the modification
■ Pre-image data
■ Changes per column (delta)
■ Post-image (current state of row) - optional
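The write path above can be sketched as a small in-memory model. This is only an illustration of the order of steps, not Scylla's implementation: the class `ToyCdcTable`, its fields, and the tuple layout of the log entries are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
# (operation timestamp, batch sequence, record kind, column values)
LogEntry = Tuple[int, int, str, Row]


@dataclass
class ToyCdcTable:
    """In-memory stand-in for one CDC-enabled table (hypothetical, not a Scylla API)."""

    preimage: bool = False   # optional read-before-write
    postimage: bool = False  # optional full row state after the write
    rows: Dict[Any, Row] = field(default_factory=dict)
    log: List[LogEntry] = field(default_factory=list)

    def write(self, key: Any, changes: Row, timestamp: int) -> None:
        seq = 0
        existing = self.rows.get(key)
        # Pre-image: only if the row exists, and only for affected columns.
        if self.preimage and existing is not None:
            before = {c: existing[c] for c in changes if c in existing}
            self.log.append((timestamp, seq, "preimage", before))
            seq += 1
        # Delta: the changes per column, added alongside the modification.
        self.log.append((timestamp, seq, "delta", dict(changes)))
        seq += 1
        # Apply the modification to the base table.
        self.rows.setdefault(key, {}).update(changes)
        # Post-image: state of the row after the modification.
        if self.postimage:
            self.log.append((timestamp, seq, "postimage", dict(self.rows[key])))
```

With both options enabled, a first write to a new key logs a delta and a post-image only, since there is no existing row to read. A second write to the same key also logs a pre-image, limited to the columns it touches.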
8. How does the log work
■ CDC log per enabled table
■ CDC log is just another table
● Stored distributed on nodes in cluster
● Rows ordered by operation timestamp and batch sequence
● Mirrored columns for preimage/delta records
■ Every column record contains information about the modification operation and TTL
● Topology matches source table
■ CDC log is colocated with original data
● But can use different consistency level (more or less reliable)
■ Data is transient
● CDC data expires via TTL after 24h (configurable)
● Less risk for uncontrolled metadata buildup
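To make the ordering and expiry behaviour concrete, here is a minimal sketch of a log that keeps rows sorted by operation timestamp and batch sequence, and hides rows older than the TTL. The class `ToyCdcLog` and its methods are assumptions made for illustration; timestamps are plain integers in seconds, and real expiry is handled by the database rather than at read time.

```python
import bisect
from typing import Any, List, Tuple

# (operation timestamp in seconds, batch sequence, payload)
LogRow = Tuple[int, int, Any]


class ToyCdcLog:
    """Toy model of a transient, ordered CDC log (hypothetical, not a Scylla API)."""

    def __init__(self, ttl_seconds: int = 24 * 60 * 60) -> None:
        self.ttl_seconds = ttl_seconds  # 24h default, configurable
        self._rows: List[LogRow] = []

    def append(self, timestamp: int, batch_seq: int, payload: Any) -> None:
        # Keep rows ordered by (timestamp, batch sequence), whatever the arrival order.
        keys = [(r[0], r[1]) for r in self._rows]
        idx = bisect.bisect_right(keys, (timestamp, batch_seq))
        self._rows.insert(idx, (timestamp, batch_seq, payload))

    def read(self, now: int) -> List[LogRow]:
        # Rows whose age has reached the TTL are no longer visible.
        return [r for r in self._rows if now - r[0] < self.ttl_seconds]
```

A consumer that reads later than the TTL allows simply finds fewer rows, so the log cannot grow without bound while a consumer is down.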
9. Downsides
■ Read before write
● Additional latency
■ CDC log is eventually consistent, like everything else
● Concept of change is based on client view
■ I.e. data as seen and written via coordinator
● Does not contain information on how availability etc. impacts the values actually resolved (read) later
● Can get partial logs in case of node crashes
11. It is up to you. And us.
■ CDC data is available through normal CQL
● Easy to read raw stream
● Already de-duplicated
● All delta and pre image values are normal CQL data
● Can consume without knowledge of server internals
■ Layered approach
● CDC core functionality relatively simple. Allows for more sophisticated adaptors
■ Push models etc.
■ Integrators
● Kafka
● Alternator (dynamo API)
● More...
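Because the log is readable as ordinary rows, a pull-style consumer needs little more than a checkpoint. The sketch below assumes a hypothetical `fetch` callable that returns log rows as `(timestamp, batch_seq, payload)` tuples; in a real deployment this would be a CQL query against the log table, which is not shown here. `ToyCdcConsumer` is likewise an illustrative name.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# (operation timestamp, batch sequence, payload)
LogRow = Tuple[int, int, str]
Position = Tuple[int, int]


class ToyCdcConsumer:
    """Polling consumer that remembers the last log position it has seen."""

    def __init__(self, fetch: Callable[[], Iterable[LogRow]]) -> None:
        self._fetch = fetch  # hypothetical stand-in for a CQL read of the log
        self._last: Optional[Position] = None

    def poll(self) -> List[LogRow]:
        # Order rows by (timestamp, batch sequence) and keep only unseen ones.
        ordered = sorted(self._fetch(), key=lambda r: (r[0], r[1]))
        fresh = [
            r for r in ordered
            if self._last is None or (r[0], r[1]) > self._last
        ]
        if fresh:
            self._last = (fresh[-1][0], fresh[-1][1])
        return fresh
```

This simple checkpoint skips any row that arrives late with a timestamp at or before the saved position. Since the log is eventually consistent, a more careful consumer would probably re-read a trailing time window instead of trusting a single position.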
20. CDC in Scylla
■ Easy to integrate and consume
● Plain CQL tables
■ Robust
● Replicated in same way as normal data
■ Reasonable overhead
● Coalesced writes to same replica ranges
■ Does not overflow if consumer fails to act
● Data expires via TTL
21. Comparison chart
Feature                 Cassandra      DynamoDB             MongoDB              Scylla
Consumer location       on-node        off-node             off-node             off-node
Replication             duplicated     deduplicated         deduplicated         deduplicated
Deltas                  yes            no                   partial              yes
Pre-image               no             yes                  no                   optional
Post-image              no             yes                  yes                  optional
Slow consumer reaction  Table stopped  Consumer loses data  Consumer loses data  Consumer loses data
Ordering                no             yes                  yes                  yes
22. CDC roadmap
■ Experimental feature in master branch
■ Production feature in Q1 2020
■ Additional features in Q2 2020 and forward
● Performance improvements
● Integrators
● Scylla Cloud