How to Migrate a Counter Table for 68 Billion Records
Robert Czupioł
Senior Platform Engineer
Robert Czupioł
■ Cassandra Certified Expert since 2015
■ Introduced and managed C*/Scylla at different companies
■ Attendee at the first Scylla Summit in 2016
■ …and 2017, 2018… :)
Senior Platform Engineer
■ Dating app
■ Top 3 in Western Europe
■ 100M+ customers
■ 9 Scylla clusters (previously 16 Cassandra)
■ 200+ TB of data
■ avg 300k req/sec
Find the people you've crossed paths with
Decision
In May 2021, we decided to migrate to ScyllaDB
■ Targets
• TCO
• Technical debt
• Data volume
• Latency
• Monitoring
Crossings Cluster
■ Second biggest cluster
• 48 nodes (4 CPU, 32 GB, 1TB PD-SSD)
■ Debt
• Debian 8
• Uptime of 580+ days
• Cassandra 2.1
■ 8 active Tables
• Crossings_count
Migrating a table with 68B+ records
Type of migration
■ Offline
• Not our case
■ Online, in 3 steps
• Double writes by the μservice (sketch below)
• Leverage the data
• Open a bottle of Champagne
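A minimal sketch of that double-write step, assuming gocql and a hypothetical crossings_count table with an id partition key and a cnt counter column (not the production schema):

```go
// Hypothetical double-write helper: every counter increment goes to both
// clusters. Table and column names (crossings_count, id, cnt) are
// illustrative assumptions, not the real schema.
package migration

import "github.com/gocql/gocql"

func IncrementBoth(cass, scylla *gocql.Session, id gocql.UUID, delta int64) error {
	const stmt = `UPDATE crossings_count SET cnt = cnt + ? WHERE id = ?`
	// The old cluster stays the source of truth, so fail fast on it.
	if err := cass.Query(stmt, delta, id).Exec(); err != nil {
		return err
	}
	// Best effort on the new cluster: any gap is closed later by the
	// "leverage data" compare-and-set pass.
	return scylla.Query(stmt, delta, id).Exec()
}
```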
Leverage the data
Different strategies
■ CSV
• CQLSH/DSBulk (writetime issue)
■ SSTable
• sstableloader (stresses the cluster)
• nodetool refresh (IMO best with network disks)
■ Dual connect
• Scylla Migrator (Spark cluster)
• Own application
Counter table
■ Breaks the idempotency rule
■ Update-only
■ Weird delete semantics
■ Different implementations in the past (local and remote shards)
■ No USING TIMESTAMP
■ No TTL
■ Not accurate
Counter table
■ Value has the range of a 64-bit long (sketch below)
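To make those constraints concrete, here is a hedged sketch of what a counter table looks like and why it is update-only; the keyspace and table names are illustrative:

```go
// Illustrative counter-table shape (names assumed, not the real schema).
// A counter column is a signed 64-bit long; it supports no TTL and no
// USING TIMESTAMP, and no INSERT - only delta UPDATEs, which are not
// idempotent when retried.
package migration

import "github.com/gocql/gocql"

func SetupAndBump(session *gocql.Session, id gocql.UUID) error {
	if err := session.Query(`CREATE TABLE IF NOT EXISTS app.crossings_count (
		id  uuid PRIMARY KEY,
		cnt counter
	)`).Exec(); err != nil {
		return err
	}
	// The only write path: add a delta. "SET cnt = 5" is not valid CQL.
	return session.Query(
		`UPDATE app.crossings_count SET cnt = cnt + ? WHERE id = ?`,
		int64(1), id).Exec()
}
```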
How counters work
■ A dedicated node counter id (shard) is created for each node [RF=2]
NODE A
Node counter id | Shard's logical clock | Shard's value
A_1             | 0                     | 0
B_1             | 1                     | 1

NODE B
Node counter id | Shard's logical clock | Shard's value
A_1             | 0                     | 0
B_1             | 1                     | 1
Update operation
■ On node B, increment by 2
■ Read the previous shard value
■ Generate the newest logical clock
■ Save the new value and send it to the replica (toy model after the tables)
NODE A
Node counter id | Shard's logical clock | Shard's value
A_1             | 0                     | 0
B_1             | 2                     | 3

NODE B
Node counter id | Shard's logical clock | Shard's value
A_1             | 0                     | 0
B_1             | 2                     | 3

Node B increments by 2
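The update path above, restated as a toy model in Go; this mirrors the slides' description, not Scylla's actual implementation:

```go
// Toy model of a counter replica, following the slides: each replica
// stores (logical clock, value) per node counter id, and a node only
// ever rewrites its own shard before shipping it to the other replicas.
package migration

type Shard struct {
	Clock int64
	Value int64
}

type Node struct {
	CounterID string           // this node's own shard id, e.g. "B_1"
	Shards    map[string]Shard // all shards this replica knows about
}

// ApplyDelta follows the four steps: read the previous shard value,
// generate the newest logical clock, save the new value, and return the
// shard so it can be sent to the replica.
func (n *Node) ApplyDelta(delta int64) Shard {
	s := n.Shards[n.CounterID]
	s.Clock++
	s.Value += delta
	n.Shards[n.CounterID] = s
	return s
}
```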
Update operations
■ On node A, decrement by 5
■ Read the previous shard value
■ Generate the newest logical clock
■ Save the new value and send it to the replica
NODE A
Node counter id | Shard's logical clock | Shard's value
A_1             | 1                     | -5
B_1             | 2                     | 3

NODE B
Node counter id | Shard's logical clock | Shard's value
A_1             | 1                     | -5
B_1             | 2                     | 3

Node A decrements by 5
Read operations
■ While reading, a node merges the values from all shards (sketch after the table)
Value = 3 + (-5) = -2
Node counter id | Shard's logical clock | Shard's value
A_1             | 1                     | -5
B_1             | 2                     | 3
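And the read side, continuing the same toy model (same package, reusing the Shard type from the sketch above): per shard, the entry with the higher logical clock wins, and the counter's value is the sum of all shard values, 3 + (-5) = -2 here.

```go
// Merge two replicas' shard maps (higher logical clock wins per shard)
// and read the counter as the sum of shard values. Toy model only.
package migration

func Merge(a, b map[string]Shard) map[string]Shard {
	out := make(map[string]Shard, len(a))
	for id, s := range a {
		out[id] = s
	}
	for id, s := range b {
		if cur, ok := out[id]; !ok || s.Clock > cur.Clock {
			out[id] = s // newer logical clock wins for this shard
		}
	}
	return out
}

func Read(shards map[string]Shard) int64 {
	var total int64
	for _, s := range shards {
		total += s.Value // e.g. 3 + (-5) = -2
	}
	return total
}
```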
How to migrate that s..tuff?
Counter migration approach
[Diagram: double write — the old cluster's counter keeps growing (20, 21, 22, …) while the μservice also applies every increment to the new cluster, which counts up from zero (1, 2, 3, …)]
[Diagram: leverage the data — backfilling the historical difference brings the new cluster's counter in line with the old one (…, 28, 29, 30)]
…and we’ve written our own app
Counter-migrator
■ Java
• All μservices were written in that language
• Well known
■ Spread across the token ring
• 6144 ranges
• select * from table where token(a) >= ? and token(a) < ?
■ Compare and set (sketch below)
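A hedged sketch of that compare-and-set pass over one token range, assuming gocql and the same illustrative crossings_count schema; since counters cannot be set directly, the "set" is an UPDATE by the delta between source and destination:

```go
// Migrate one token range: scan the source, read the destination, and
// apply the difference as a counter delta. Schema names are assumptions.
package migration

import (
	"log"

	"github.com/gocql/gocql"
)

func MigrateRange(src, dst *gocql.Session, lo, hi int64) {
	iter := src.Query(`SELECT id, cnt FROM crossings_count
		WHERE token(id) >= ? AND token(id) < ?`, lo, hi).Iter()

	var id gocql.UUID
	var srcCnt int64
	for iter.Scan(&id, &srcCnt) {
		// A row missing on the destination simply reads as 0.
		var dstCnt int64
		_ = dst.Query(`SELECT cnt FROM crossings_count WHERE id = ?`, id).
			Scan(&dstCnt)

		// Compare and "set": counters only accept deltas.
		if delta := srcCnt - dstCnt; delta != 0 {
			if err := dst.Query(
				`UPDATE crossings_count SET cnt = cnt + ? WHERE id = ?`,
				delta, id).Exec(); err != nil {
				log.Printf("update %v: %v", id, err)
			}
		}
	}
	if err := iter.Close(); err != nil {
		log.Printf("range [%d, %d): %v", lo, hi, err)
	}
}
```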
Some pitfalls
Out of memory
■ Split into more ranges
■ 68B / 6144 ≈ 11M records per range
■ Split into 600,000 ranges ≈ 100k records each
■ And shuffle those ranges - remember about shards! (sketch below)
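A sketch of that fix under the stated numbers: carve the Murmur3 token ring into ~600,000 ranges and shuffle them so neighbouring ranges, which tend to land on the same shards, don't run back to back:

```go
// Split the full Murmur3 token ring into n ranges (n should be large,
// e.g. 600000, so each range holds roughly 100k of the 68B records)
// and shuffle them to spread the load across shards.
package migration

import (
	"math"
	"math/rand"
)

type TokenRange struct{ Lo, Hi int64 } // scanned as token >= Lo AND token < Hi

func SplitAndShuffle(n int) []TokenRange {
	ranges := make([]TokenRange, 0, n)
	step := int64(math.MaxUint64 / uint64(n)) // ring width / n
	lo := int64(math.MinInt64)
	for i := 0; i < n; i++ {
		hi := lo + step
		if i == n-1 {
			hi = math.MaxInt64 // close the ring at the top
		}
		ranges = append(ranges, TokenRange{lo, hi})
		lo = hi
	}
	// Shuffle so consecutive workers don't hammer the same shard.
	rand.Shuffle(len(ranges), func(i, j int) {
		ranges[i], ranges[j] = ranges[j], ranges[i]
	})
	return ranges
}
```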
Java and Spring…
■ 30-second Spring context start
■ JVM heap
■ A bunch of machines
■ Switched to Golang
Missing alerting and swap
■ One node went down (no swap, and another process on the host)
■ Alerts were not set up properly
■ Aggressive hinted-handoff workload
Tune driver and CL
■ The defaults are always wrong
■ Choose the consistency level deliberately, even ALL
■ Use the Scylla (shard-aware) driver (sketch below)
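For illustration, tuning the Go driver along those lines; the shard-aware Scylla fork (github.com/scylladb/gocql) is a drop-in replacement for gocql, typically wired in via a go.mod replace directive, and the hosts and values here are placeholders:

```go
// Driver tuning sketch: explicit consistency level, token-aware routing,
// and deliberate timeouts instead of the defaults. Hosts are placeholders.
package migration

import (
	"time"

	"github.com/gocql/gocql" // swap in github.com/scylladb/gocql for shard awareness
)

func NewSession() (*gocql.Session, error) {
	cluster := gocql.NewCluster("scylla-1", "scylla-2", "scylla-3")
	cluster.Consistency = gocql.All // strict CL for the migration pass
	cluster.Timeout = 5 * time.Second
	cluster.NumConns = 4
	cluster.PoolConfig.HostSelectionPolicy =
		gocql.TokenAwareHostPolicy(gocql.RoundRobinHostPolicy())
	return cluster.CreateSession()
}
```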
Avoid network latency
■ Use batches (sketch below)
■ Increase the warning threshold so batch warnings don't flood the journal
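A sketch of the batching idea; note that counter updates must go into a counter batch (they cannot be mixed into logged batches), and the row type here is an assumption:

```go
// Batch counter deltas to cut network round trips. Counter updates
// require gocql.CounterBatch; table/column names remain illustrative.
package migration

import "github.com/gocql/gocql"

type Delta struct {
	ID    gocql.UUID
	Value int64
}

func ApplyBatch(dst *gocql.Session, deltas []Delta) error {
	batch := dst.NewBatch(gocql.CounterBatch)
	for _, d := range deltas {
		batch.Query(`UPDATE crossings_count SET cnt = cnt + ? WHERE id = ?`,
			d.Value, d.ID)
	}
	return dst.ExecuteBatch(batch)
}
```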
Result?
Result
■ 2x n2-standard-8
■ 5 days
Metrics
■ API calls hitting the DB:
• p99: 80 ms → 20 ms
• p90: 50 ms → 15 ms
Disk space
■ Cassandra 2.1
• 48 TB
• 45% disk utilization
■ Scylla 4.4
• 18 TB
• 55% disk utilization
[Bar chart: disk usage, Cassandra vs. Scylla]
We’ve checked that all the data exists :)
Disk space
■ 'md' SSTable format
■ Aggressive compaction
■ Zstd compression (sketch after the chart)
[Bar chart: disk usage, Cassandra vs. Scylla]
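For example, switching the table to Zstd compression looks roughly like this; a sketch only, since the exact options depend on the Scylla version, and the table name is assumed:

```go
// Illustrative: enable Zstd SSTable compression on the migrated table.
package migration

import "github.com/gocql/gocql"

func EnableZstd(session *gocql.Session) error {
	return session.Query(`ALTER TABLE app.crossings_count WITH compression = {
		'sstable_compression': 'ZstdCompressor'
	}`).Exec()
}
```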
Final result
■ 48 C* nodes => 6 Scylla nodes
■ Improved GCS cost with incremental snapshots
■ N2 and local SSD - committed use discounts
■ TCO reduced by 75%
Lessons learned
■ Before migration
• Clean up and repair the cluster
• Sometimes even compact
■ Remember about table properties
■ Adjust scylla.yaml (sketch below)
• Hints window time
• max_partition_key_restrictions_per_query (or better, improve the μservice code)
• internode_compression
• batch_size_warn/fail thresholds
■ Keep improvements and changes in IaC such as Ansible playbooks
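A sketch of the kind of scylla.yaml overrides meant here; the keys are standard options, but the values are illustrative examples, not recommendations:

```yaml
# Illustrative scylla.yaml overrides (example values, not advice).
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000           # hints window, 3 hours
max_partition_key_restrictions_per_query: 100
internode_compression: all                # none | all | dc
batch_size_warn_threshold_in_kb: 128
batch_size_fail_threshold_in_kb: 1024
```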
Thank you!
Stay in touch
Robert Czupioł
linkedin.com/in/robert-czupioł-2a34b394
robert.czupiol@gmail.com
