1. Making Cassandra more capable, faster, and more reliable
Hiroyuki Yamada – CTO/CEO at Scalar, Inc.
Yuji Ito – Architect at Scalar, Inc.
APACHECON @HOME
Sep. 29 – Oct. 1, 2020
2. © 2020 Scalar, inc.
Speakers
• Hiroyuki Yamada
– CTO at Scalar, Inc.
– Passionate about
Database Systems and
Distributed Systems
– Ph.D. in Computer
Science, the University of
Tokyo
– Formerly IIS the University
of Tokyo, Yahoo! Japan,
IBM Japan
• Yuji Ito
– Architect at Scalar, Inc.
– Improve the performance and
the reliability of Scalar DLT
– Love failure analysis
– Formerly an SSD firmware
engineer at Fixstars, Hitachi
3. © 2020 Scalar, inc.
Cassandra @ Scalar
• Scalar tries to take Cassandra to the next level
– More capable: ACID transactions with Scalar DB
– Faster: Group CommitLog Sync
– More reliable: Jepsen tests for LWT
• This talk will present why and how we do them
4. © 2020 Scalar, inc.
ACID transactions on Cassandra with Scalar DB
5. © 2020 Scalar, inc.
What is Scalar DB
• A universal transaction manager
– A Java library that makes non-ACID databases ACID-compliant
– The architecture is inspired by Deuteronomy [CIDR’09,11]
• Cassandra is the first supported database
https://github.com/scalar-labs/scalardb
6. © 2020 Scalar, inc.
Why ACID Transactions with Cassandra? Why with Scalar DB?
• ACID is a must-have feature in some mission-critical applications
– C* has been getting widely used for such applications
– C* is one of the major open-source distributed databases
• Modifying C* entails a lot of risk and burden
– Scalar DB enables ACID transactions without modifying C* at all,
since it depends only on the exposed APIs
– No risk of breaking the existing code
7. © 2020 Scalar, inc.
Pros and Cons of Scalar DB on Cassandra
• Pros
– Non-invasive: no modifications in C*
– High availability and scalability: C* properties are fully sustained by the client-coordinated approach
– Flexible deployment: the transaction layer and the storage layer can be scaled independently
• Cons
– Slower than NewSQLs: more abstraction layers and a storage-oblivious transaction manager
– Hard to optimize: the transaction manager has little information about the storage
– No CQL support: a transaction has to be written procedurally in a programming language
8. © 2020 Scalar, inc.
Programming Interface and System Architecture
• CRUD interface
– put, get, scan, delete
• Begin and commit semantics
– An arbitrary number of operations can be handled
• Client-coordinated
– Transaction code runs in the library
– No middleware needs to be managed
DistributedTransactionManager manager = …;
DistributedTransaction transaction = manager.start();
Get get = createGet();
Optional<Result> result = transaction.get(get);
Put put = createPut(result);
transaction.put(put);
transaction.commit();
[Diagram: client programs / web applications → Scalar DB (command execution / HTTP) → Cassandra]
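Since commits can fail on conflicts, a retry loop around this interface is typical. The following is a minimal sketch only: the abort() call and the catch-all exception handling are assumptions based on common Scalar DB usage, not verbatim API; consult the Scalar DB javadoc for the exact exception types thrown by commit().

DistributedTransactionManager manager = …;  // configured elsewhere
int retriesLeft = 3;
while (true) {
  DistributedTransaction transaction = manager.start();
  try {
    Optional<Result> result = transaction.get(createGet());
    transaction.put(createPut(result));
    transaction.commit();
    break;                   // committed
  } catch (Exception e) {    // e.g., a commit conflict with another transaction
    transaction.abort();     // assumed API: roll back this attempt
    if (--retriesLeft == 0) throw e;
  }
}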
9. © 2020 Scalar, inc.
Data Model
• Multi-dimensional map [OSDI’06]
– (partition-key, clustering-key, value-name) -> value-content
– Assumed to be hash partitioned
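To make the model concrete, here is a minimal self-contained Java sketch of the map. The types are illustrative only: Scalar DB exposes this model through its CRUD API, not as a literal Java map.

import java.util.*;

public class DataModelSketch {
  public static void main(String[] args) {
    // (partition-key, clustering-key, value-name) -> value-content
    Map<String, NavigableMap<String, Map<String, Object>>> table = new HashMap<>();

    // Partitions are hash-distributed across nodes; clustering keys are
    // sorted within a partition, hence the sorted map.
    table.computeIfAbsent("user1", k -> new TreeMap<>())
         .computeIfAbsent("2020-09-29", k -> new HashMap<>())
         .put("balance", 100);

    System.out.println(table); // {user1={2020-09-29={balance=100}}}
  }
}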
10. © 2020 Scalar, inc.
Transaction Management - Overview
• Based on Cherry Garcia [ICDE’15]
– Two-phase commit on linearizable operations (for Atomicity); the protocol correction is our extended work
– Distributed WAL records (for Atomicity and Durability)
– Single-version optimistic concurrency control (for Isolation); Serializability support is our extended work
• Requirements for the underlying databases/storages
– Linearizable read and linearizable conditional/CAS write
– The ability to store metadata for each record
11. © 2020 Scalar, inc.
Transaction Commit Protocol (for Atomicity)
• Two-phase commit protocol on linearizable operations
– Similar to Paxos Commit [TODS’06]
– Data records are assumed to be distributed
• The protocol (sketched below)
– Prepare phase: prepare the records
– Commit phase 1: commit the status record
– This is where a transaction is regarded as committed or aborted
– Commit phase 2: commit the records
• Lazy recovery
– Uncommitted records are rolled forward or rolled back, based on the status of the transaction, when the records are read
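The protocol can be summarized in a compact Java sketch. The Store interface and all method names are invented for illustration; in Scalar DB on Cassandra, the conditional writes map to LWTs.

import java.util.Map;

public class CommitProtocolSketch {
  // Invented for illustration: a store offering linearizable conditional
  // writes, which Cassandra provides via LWT.
  interface Store {
    boolean prepareIfUnchanged(String key, StoredRecord read, StoredRecord prepared);
    boolean commitStatusIfAbsent(String txId);              // commit phase 1
    void commitRecordIfPreparedBy(String key, String txId); // commit phase 2
  }
  record StoredRecord(Object value, long version, String txId, String status) {}

  static boolean commit(Store store, String txId,
                        Map<String, StoredRecord> readSet,
                        Map<String, StoredRecord> writeSet) {
    // Prepare phase: conditionally install "P" (prepared) versions.
    for (Map.Entry<String, StoredRecord> e : writeSet.entrySet()) {
      if (!store.prepareIfUnchanged(e.getKey(), readSet.get(e.getKey()), e.getValue())) {
        return false; // conflict: abort; prepared records are rolled back lazily
      }
    }
    // Commit phase 1: the transaction is committed iff this single write succeeds.
    if (!store.commitStatusIfAbsent(txId)) return false;
    // Commit phase 2: flip prepared records to committed (can be done lazily).
    for (String key : writeSet.keySet()) {
      store.commitRecordIfPreparedBy(key, txId);
    }
    return true;
  }
}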
12. © 2020 Scalar, inc.
Distributed WAL (for Atomicity and Durability)
• WAL (Write-Ahead Logging) is distributed into records
– Each record in a user table holds both an after image and a before image
– After image: application data (managed by users) + transaction metadata (managed by Scalar DB): Status, Version, TxID
– Before image: application data (before) + transaction metadata (before): Status (before), Version (before), TxID (before)
– A status record in the coordinator table holds: TxID, Status, and other metadata
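As a sketch, the layout above maps to plain classes like these; the field names follow the slide, while the shapes are illustrative rather than Scalar DB's actual schema.

public class WalRecordSketch {
  // A record in a user table: the after image plus the before image,
  // with transaction metadata stored next to the application data.
  static class UserRecord {
    Object applicationData;                    // after image (managed by users)
    String txId; String status; long version;  // after-image metadata (managed by Scalar DB)
    Object beforeApplicationData;              // before image, used for rollback
    String beforeTxId; String beforeStatus; long beforeVersion;
  }
  // A status record in the coordinator table: the transaction's commit point.
  static class StatusRecord {
    String txId; String status;                // plus other metadata
  }
}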
13. © 2020 Scalar, inc.
Concurrency Control (for Isolation)
• Single-version OCC
– A simple implementation of Snapshot Isolation
– Conflicts are detected by linearizable conditional writes (LWT)
– No clock dependency; no use of HLC (Hybrid Logical Clock)
• Supported isolation levels
– Read-committed Snapshot Isolation (RCSI)
– Read-skew, write-skew, read-only, and phantom anomalies could happen
– Serializable
– No anomalies (Strict Serializability)
– RCSI-based, but non-serializable schedules are aborted
14. © 2020 Scalar, inc.
Transaction With Example – Prepare Phase
• Initial state in Cassandra:

  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY

• Tx1 (Client1): transfer 20 from account 1 to account 2
– Client1 reads both records into its local memory space
– Client1 computes the after images and prepares them in its memory space:

  UserID  Balance  Status  Version  TxID
  1       80       P       6        Tx1
  2       120      P       5        Tx1

– Client1 writes the prepared records back with a conditional write (LWT): update only if the versions and the TxIDs are the same as the ones it read
• Tx2 (Client2): transfer 10 from account 1 to account 2, running concurrently
– Client2 read the same initial versions and prepared its own records (Balance 90 / Version 6 / TxID Tx2 and Balance 110 / Version 5 / TxID Tx2) in its memory space
– Client2's conditional write fails due to the condition mismatch: Tx1 has already changed the versions and TxIDs in Cassandra
22. © 2020 Scalar, inc.
Transaction With Example – Commit Phase 1
• State in Cassandra after the prepare phase:

  User table:
  UserID  Balance  Status  Version  TxID
  1       80       P       6        Tx1
  2       120      P       5        Tx1

  Coordinator table:
  TxID  Status
  XXX   C
  YYY   C
  ZZZ   A

• Client1 commits the status record with a conditional write (LWT): insert (Tx1, C) only if the TxID does not exist

  Coordinator table:
  TxID  Status
  XXX   C
  YYY   C
  ZZZ   A
  Tx1   C

24. © 2020 Scalar, inc.
Transaction With Example – Commit Phase 2
• Client1 commits the prepared records with a conditional write (LWT): update the status to C only if the record is prepared by the TxID

  User table:
  UserID  Balance  Status  Version  TxID
  1       80       C       6        Tx1
  2       120      C       5        Tx1
25. © 2020 Scalar, inc.
Recovery
• Recovery is done lazily when a record is read
– Crash before the prepare phase: nothing is needed (the local memory space is automatically cleared)
– Crash after the prepare phase but before commit phase 1: the prepared records are rolled back lazily by another TX using the before images
– Crash after commit phase 1 (status committed): the prepared records are rolled forward lazily by another TX by updating their status to C
– Crash after commit phase 2: no recovery is needed
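A sketch of the lazy recovery decision on the read path follows. The names are invented, and the real logic also has to handle in-flight transactions (for example, by aborting their status after a timeout before rolling back).

public class LazyRecoverySketch {
  enum TxStatus { COMMITTED, ABORTED, UNKNOWN }
  interface Coordinator { TxStatus statusOf(String txId); }

  // Called when a read finds a record still in the "P" (prepared) state.
  static void resolve(Coordinator coordinator, String txId) {
    switch (coordinator.statusOf(txId)) {
      case COMMITTED -> rollForward(txId); // update the record's status to C
      case ABORTED   -> rollBack(txId);    // restore the before image
      case UNKNOWN   -> {
        // No status record yet: the owning transaction may still be running;
        // after a timeout its status can be aborted and the record rolled back.
      }
    }
  }
  static void rollForward(String txId) { /* elided */ }
  static void rollBack(String txId) { /* elided */ }
}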
26. © 2020 Scalar, inc.
Serializable Strategy
• Basic strategy
– Avoid the dangerous structure of anti-dependencies (rw-dependencies) [TODS’05]
– No use of SSI [SIGMOD’08] or its variant [EuroSys’12], which would require many linearizable operations for managing inConflicts/outConflicts or a correct clock
• Two implementations
– Extra-write: converts reads into writes; extra care is needed if a record doesn’t exist (a delete is written for the record)
– Extra-read: after the prepare phase, checks that the read set has not been updated by other transactions (see the sketch below)
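For instance, the extra-read check could look like the following sketch, reusing the shapes from the earlier commit-protocol sketch plus an assumed linearizable read method.

import java.util.Map;

public class ExtraReadSketch {
  interface Store { StoredRecord read(String key); } // assumed linearizable read
  record StoredRecord(long version, String txId) {}

  // After the prepare phase, verify that no record in the read set has
  // been updated by another transaction; if one has, abort before commit.
  static boolean validateReadSet(Store store, Map<String, StoredRecord> readSet) {
    for (Map.Entry<String, StoredRecord> e : readSet.entrySet()) {
      StoredRecord current = store.read(e.getKey());
      if (current == null
          || current.version() != e.getValue().version()
          || !current.txId().equals(e.getValue().txId())) {
        return false; // an anti-dependency was detected
      }
    }
    return true;
  }
}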
27. © 2020 Scalar, inc.
Benchmark Results with Scalar DB on Cassandra
• Workloads: Workload1 (Payment) and Workload2 (Evidence)
• Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3
• Achieved 90% scalability in a 100-node cluster
(compared to the ideal TPS extrapolated from the performance of a 3-node cluster)
28. © 2020 Scalar, inc.
Verification Results for Scalar DB on Cassandra
• Scalar DB on Cassandra has been heavily tested with Jepsen
and our destructive tools
– Jepsen tests are created and conducted by Scalar
– See https://github.com/scalar-labs/scalar-jepsen for more details
• The transaction commit protocol is verified with TLA+
– See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit
Jepsen: Passed
TLA+: Passed
29. © 2020 Scalar, inc.
Speakers
• Hiroyuki Yamada
– CTO at Scalar, Inc.
– Passionate about
Database Systems and
Distributed Systems
– Ph.D. in Computer Science,
the University of Tokyo
– Formerly IIS the University of
Tokyo, Yahoo! Japan, IBM
Japan
• Yuji Ito
– Architect at Scalar, Inc.
– Improve the performance
and the reliability of Scalar
DLT
– Love failure analysis
– Formerly an SSD firmware
engineer at Fixstars, Hitachi
31. © 2020 Scalar, inc.
Why do we need a new mode?
• Scalar DB transactions rely on Cassandra’s
– Durability
– Performance
• Synchronous commitlog sync is required for durability
– Periodic mode might lose commitlogs
• Commitlog sync performance is the key factor
– Batch mode tends to issue lots of IOs
32. © 2020 Scalar, inc.
Group CommitLog Sync
• A new commitlog sync mode in 4.0
– https://issues.apache.org/jira/browse/CASSANDRA-13530
• The mode syncs multiple commitlogs at once periodically
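For reference, enabling the mode in cassandra.yaml (4.0, per CASSANDRA-13530) looks like the lines below; the window value is a workload-dependent choice, not a recommendation.

# cassandra.yaml (Cassandra 4.0)
commitlog_sync: group
commitlog_sync_group_window_in_ms: 15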
33. © 2020 Scalar, inc.
Commitlog
• A log of all mutations to a Cassandra node
– Every write appends to the commitlog, and the mutation is also written to the memtable
• Write data is recovered from the commitlog on startup
– Data that exists only in the memtable is lost on a crash
[Diagram: a write goes to both the commitlog disk and the memtable; on restart, the memtable is rebuilt by replaying the commitlog]
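The write and recovery paths described above can be sketched as follows; the names are invented, and Cassandra's actual implementation differs.

public class CommitlogSketch {
  interface Commitlog { void append(byte[] mutation); Iterable<byte[]> replay(); }
  interface Memtable { void apply(byte[] mutation); }
  Commitlog commitlog; Memtable memtable;

  // Write path: append to the commitlog, then apply to the memtable.
  // Only the memtable contents are lost on a crash.
  void write(byte[] mutation) {
    commitlog.append(mutation);
    memtable.apply(mutation);
  }

  // Startup: replay the commitlog to rebuild the memtable.
  void recover() {
    for (byte[] mutation : commitlog.replay()) {
      memtable.apply(mutation);
    }
  }
}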
36. © 2020 Scalar, inc.
Existing mode: Periodic (default mode)
• Syncs commitlogs periodically, every commitlog_sync_period_in_ms
• Does NOT wait for the completion of the sync (asynchronous sync)
⇒ Commitlogs (write data) might be lost on a crash
[Diagram: request threads are acked immediately; the commitlog sync thread syncs once per period, so commitlogs written after the last completed sync are lost on a crash]
38. © 2020 Scalar, inc.
Existing mode: Batch
• Syncs commitlogs immediately
– Waits for the completion of the sync (synchronous sync)
– Commitlogs issued at about the same time can be synced together
⇒ Throughput is degraded due to many small IOs
– "commitlog_sync_batch_window_in_ms" is the maximum length of a window; in practice the mode always syncs immediately
[Diagram: each request thread waits for its own sync; the commitlog sync thread issues many small syncs and acks after each]
39. © 2020 Scalar, inc.
Issues in the existing modes
• Periodic
– Commitlogs might be lost when Cassandra crashes
• Batch
– Performance could be degraded due to many small IOs
– Batch mode doesn’t work as users would expect from its name
40. © 2020 Scalar, inc.
Grouping commitlogs
• Syncs multiple commitlogs at once, periodically (synchronous sync)
– Reduces IOs by grouping syncs (see the sketch below)
[Diagram: request threads wait; the commitlog sync thread syncs once per commitlog_sync_group_window_in_ms and then acks all requests in that window]
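The grouping idea can be sketched as a small sync loop. This illustrates the technique only and is not Cassandra's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class GroupSyncSketch {
  private final Object lock = new Object();
  private List<CompletableFuture<Void>> waiters = new ArrayList<>();

  // Request threads: append the mutation, then wait for the next group sync.
  public CompletableFuture<Void> append(byte[] mutation) {
    writeToCommitlogBuffer(mutation);
    CompletableFuture<Void> ack = new CompletableFuture<>();
    synchronized (lock) { waiters.add(ack); }
    return ack; // completed only after the group fsync
  }

  // Sync thread: once per window, fsync once and ack every waiter.
  public void syncLoop(long groupWindowMillis) throws InterruptedException {
    while (true) {
      Thread.sleep(groupWindowMillis);
      List<CompletableFuture<Void>> group;
      synchronized (lock) { group = waiters; waiters = new ArrayList<>(); }
      if (group.isEmpty()) continue;
      fsyncCommitlog();                      // one IO for the whole group
      group.forEach(f -> f.complete(null));  // ack all requests at once
    }
  }

  private void writeToCommitlogBuffer(byte[] m) { /* elided */ }
  private void fsyncCommitlog() { /* elided */ }
}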
41. © 2020 Scalar, inc.
Evaluation
• Workload
– Small (<< 1 KB) update operations, with IF EXISTS (LWT) and without IF EXISTS (non-LWT)
• Environment
– Instance type: AWS EC2 m4.large
– Disk type: AWS EBS io1, 200 IOPS
– # of nodes: 3
– Replication factor: 3
– Window time: Batch: 2 ms (default) and 10 ms; Group: 10 ms and 15 ms
42. © 2020 Scalar, inc.
Evaluation result
• Results with the 2 ms and 10 ms batch windows are almost the same
• Group mode is slightly better than Batch mode
– The difference becomes smaller with a faster disk
[Charts: throughput (ops/sec) vs. number of threads for UPDATE, and average latency (ms) vs. throughput (ops); series: Batch 2 ms, Batch 10 ms, Group 10 ms, Group 15 ms]
43. © 2020 Scalar, inc.
Evaluation result
• Between 8 and 32 threads, the throughput of Group mode is up to 75% better than that of Batch mode
– With LWT, many commitlogs are issued, which affects performance
[Charts: average latency (ms) vs. throughput (ops), and throughput (ops/sec) vs. number of threads for UPDATE at low concurrency; series: Batch 2 ms, Batch 10 ms, Group 10 ms, Group 15 ms; Group peaks up to 75% above Batch]
44. © 2020 Scalar, inc.
Evaluation result
• Without LWT, the latency of Batch mode is better than that of Group mode at low request rates
[Chart: average latency (ms) vs. throughput (ops) for UPDATE without LWT; series: Batch 2 ms, Group 15 ms]
45. © 2020 Scalar, inc.
When to use Group mode?
• When durability is required
• When the commitlog disk IOPS is lower than the request arrival rate
– Group mode can remedy the latency increase due to IO saturation
47. © 2020 Scalar, inc.
Why do we do Jepsen tests for LWT?
• Scalar DB transactions rely on the “correctness” of LWT
– Jepsen can check the correctness (linearizability)
• The existing Jepsen test for Cassandra has not been maintained
– https://github.com/riptano/jepsen
– Last commit: Feb 3, 2016
48. © 2020 Scalar, inc.
Jepsen tests for Cassandra
• Our tests cover LWT, Batch, Set, Map, and Counter with various faults
[Diagram: five-node clusters subjected to node join/leave/rejoin, network faults (bridge, isolation, halves), node crashes, and clock drift]
49. © 2020 Scalar, inc.
Our contributions to Jepsen testing for Cassandra
• Replaced Cassaforte with Alia (a Clojure wrapper for Cassandra)
– Cassaforte has not been maintained
– There seems to be a bug in getting results
• Rewrote the tests with the latest Jepsen
– The previous LWT test failed due to OOM
– The new Jepsen can check the logs by dividing a test into parts
50. © 2020 Scalar, inc.
Our contributions to Jepsen testing for Cassandra
• Report the results of short tests when a new version is released
– 1 minute per test
– Without fault injection
• Run tests with fault injection against the 4.0 beta every week
– Sometimes a node cannot join the cluster before testing, as in the nodetool output below
– This issue didn’t happen with the 4.0 alpha
jepsen@node0:~$ sudo /root/cassandra/bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.0.1.7 978.53 KiB 256 ? b7713da3-2ac6-4f10-bea0-6374f23b907a rack1
UN 10.0.1.9 1003.29 KiB 256 ? c5c961fa-b585-41a0-ad19-1c51590ccfb0 rack1
UN 10.0.1.8 975.07 KiB 256 ? 981dd1aa-fd12-472e-9fb6-41d24470716e rack1
UJ 10.0.1.4 182.66 KiB 256 ? 9cc222d5-ba45-4e61-ac2d-b42a31cb74b1 rack1
51. © 2020 Scalar, inc.
[Discussion] Jepsen tests migration
• The Jepsen tests are now maintained in https://github.com/scalar-labs/scalar-jepsen
• It would probably be more beneficial to many developers if they were migrated into the official Cassandra repo
– Thoughts?
52. © 2020 Scalar, inc.
Summary
• Scalar has enhanced Cassandra from various perspectives
– More capable: ACID transactions with Scalar DB
– Faster: Group CommitLog Sync
– More reliable: Jepsen tests for LWT
• They are mainly done without updating the core of C*
– Making C* more loosely coupled makes such contributions much easier to do