How Cassandra Deletes Data (Alain Rodriguez, The Last Pickle) | Cassandra Summit 2016

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
HOW CASSANDRA DELETES DATA
Alain Rodriguez

• Tombstone issues
• Why tombstones
• Tombstone removal

About The Last Pickle and Alain Rodriguez

About deletes in Cassandra
Deleted data in Cassandra does not just disappear;
instead, a tombstone is added.

OK, so what's the matter? Why this talk?
Tombstones are needed in Cassandra, and are not an issue…
…until an SSTable or a query result looks like this…

Then we see questions about it on the user
mailing list and in other community tools.
So I thought I could share something about this topic:
thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Tombstone issues: impacts
The read path: reading tombstones induces
latencies, timeouts or exceptions.

The disk space: tombstones can fill up the disk (up to 100%).
I am facing one of these issues; is it caused by tombstones?

Tombstone issues: Read Path
grep -i -e "ERROR" -e "WARN" /var/log/cassandra/system.log

WARN [SharedPool-Worker-7] 2016-07-16 16:31:09,048 SliceQueryFilter.java:319 - Read 276 live and 1104 tombstone
cells in mykeyspace.mytable for key: ItV9kZC8mFNiSvYM8AwufBU8tTtJkW5dUH5MNcq1H18 (see
tombstone_warn_threshold). 500 columns were requested, slices=[-]

ERROR [ReadStage:290729] 2016-07-16 17:00:18,708 SliceQueryFilter.java (line 206) Scanned over 100000
tombstones in mykeyspace.mytable; query aborted (see tombstone_failure_threshold)
ERROR [ReadStage:290729] 2016-04-22 17:00:18,709 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:290729,5,main]
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
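
To see which tables trigger these warnings most often, the log can be summarised with a short script. This is only a sketch: the regular expression is modelled on the WARN line shown above, other Cassandra versions may word the message differently, and the function names are made up for illustration.

```python
import re

# Modelled on the WARN line above; the wording may differ between versions.
TOMBSTONE_WARN = re.compile(
    r"Read (?P<live>\d+) live and (?P<tombstones>\d+) tombstone\s+cells in "
    r"(?P<table>[\w.]+)"
)


def tombstone_warnings(log_text):
    """Yield (table, live_cells, tombstone_cells) for each warning found."""
    for m in TOMBSTONE_WARN.finditer(log_text):
        yield m.group("table"), int(m.group("live")), int(m.group("tombstones"))


def worst_tables(log_text):
    """Return (table, total tombstones read) pairs, highest total first."""
    totals = {}
    for table, _live, tombstones in tombstone_warnings(log_text):
        totals[table] = totals.get(table, 0) + tombstones
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Feeding it the content of /var/log/cassandra/system.log gives a quick ranking of the tables worth investigating first.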

tombstoneScannedHistogram metric
Available through nodetool cfstats, JMX, or a plugged-in monitoring tool such as
Datadog, Grafana, SPM, OpsCenter… (commercial or free)

Tombstone issues: Disk space
The DroppableTombstoneRatio metric provides useful information.
Available through the sstablemetadata tool, JMX and plugged-in monitoring tools such as Datadog,
Grafana, SPM, OpsCenter, etc.
It is possible to write a script that checks the ratio of the biggest SSTables, for example.
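
Such a script could look like the sketch below. It assumes that sstablemetadata is on the PATH and prints a line of the form "Estimated droppable tombstones: <ratio>"; the function names and the choice of the five biggest files are illustrative only.

```python
import glob
import os
import re
import subprocess

# Assumed output format: "Estimated droppable tombstones: 0.25"
RATIO_LINE = re.compile(r"Estimated droppable tombstones:\s*([0-9.eE+-]+)")


def parse_droppable_ratio(metadata_output):
    """Extract the estimated droppable tombstone ratio, or None if absent."""
    m = RATIO_LINE.search(metadata_output)
    return float(m.group(1)) if m else None


def biggest_sstables(table_dir, count=5):
    """Return the `count` largest *-Data.db files of a table directory."""
    files = glob.glob(os.path.join(table_dir, "*-Data.db"))
    return sorted(files, key=os.path.getsize, reverse=True)[:count]


def droppable_ratio(data_file):
    """Run sstablemetadata on one data file and parse its output."""
    out = subprocess.run(
        ["sstablemetadata", data_file],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_droppable_ratio(out)


def report(table_dir, count=5):
    """Return (path, ratio) for the biggest SSTables of a table."""
    return [(path, droppable_ratio(path))
            for path in biggest_sstables(table_dir, count)]
```

Calling report() on a table's data directory gives the ratio for the SSTables that hold most of the disk space, which is usually where removing tombstones pays off most.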

Why tombstones?
I want to remove data!

Why tombstones: Cassandra write path
[Diagram: write path on a Cassandra node. A client write lands in the memtable (memory) and in the commit log (disk). The memtable is flushed to SSTables on disk, which are immutable. Client reads combine the memtable and the SSTables. Because SSTables are immutable, a delete cannot erase data in place; it has to be written as a new marker.]

Why tombstones: Distributed system
Cassandra is a distributed system.
Distributed deletes are tricky!

Why tombstones: Cassandra consistency
Cassandra cluster: 4 nodes, RF = 3
Write CL = Quorum = 2
Read CL = Quorum = 2
=> Strong consistency
[Diagram, in steps: the client writes "A"; two of the three replicas store it and acknowledge (Ack, Ack), while the third has not received it yet ("?"). The client then reads "A" at Quorum: at least one of the two replicas that answer holds "A", so the read is consistent.]
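
The arithmetic behind "Quorum = 2" and "strong consistency" can be checked with a few lines. This is a sketch; the function names are illustrative. A quorum is a majority of the replicas, and reads overlap with writes whenever the number of replicas written plus the number of replicas read exceeds RF.

```python
def quorum(replication_factor):
    """Majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(write_replicas, read_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3
w = r = quorum(rf)
print(w, r, is_strongly_consistent(w, r, rf))   # 2 2 True
print(is_strongly_consistent(1, 1, rf))         # False (CL ONE for both)
```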

Why tombstones: Cassandra consistency & availability
[Diagram: same cluster (4 nodes, RF = 3), with "A" stored on two replicas and the third replica down. The client read of "A" still succeeds, because two acknowledgements are enough: high availability.]

Why tombstones: Distributed deletes
WITHOUT tombstones (4 nodes, RF = 3)
[Diagram, in steps: "A" is stored on all three replicas. The client deletes "A"; two replicas remove it and acknowledge (Ack, Ack), while the third still holds "A". A later Quorum read returns "empty" (correct) when both answering replicas removed the value, but returns "A" (wrong) when the replica that still holds it answers.]

WITH tombstones (4 nodes, RF = 3, Read CL = Quorum = 2; A* = tombstone on A)
[Diagram, in steps: "A" is stored on all three replicas. The client deletes "A"; two replicas write the tombstone A* and acknowledge (Ack, Ack), while the third still holds "A". A later Quorum read returns A*, meaning "empty": correct, where the same read without tombstones was wrong.]
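
The two scenarios can be replayed with a toy model. This is a deliberately simplified sketch, not Cassandra's implementation: each replica is a dictionary mapping a key to a (timestamp, value) pair, a tombstone is stored as the value None, and a read keeps the answer with the newest timestamp. All names are made up for illustration.

```python
TOMBSTONE = None  # a deletion marker, stored like any other write


def write(replicas, key, value, ts):
    """Store (timestamp, value) on every replica that received the write."""
    for rep in replicas:
        rep[key] = (ts, value)


def delete_without_tombstone(replicas, key):
    """Naive delete: simply drop the value where the delete arrived."""
    for rep in replicas:
        rep.pop(key, None)


def delete_with_tombstone(replicas, key, ts):
    """Delete by writing a marker that is newer than the data."""
    write(replicas, key, TOMBSTONE, ts)


def read(replicas, key):
    """Newest timestamp wins among the replicas that answered."""
    answers = [rep[key] for rep in replicas if key in rep]
    if not answers:
        return None
    return max(answers, key=lambda a: a[0])[1]


# Without tombstones: n3 missed the delete and resurrects "A".
n1, n2, n3 = {}, {}, {}
write([n1, n2, n3], "k", "A", ts=1)
delete_without_tombstone([n1, n2], "k")
naive_result = read([n2, n3], "k")        # "A": wrong

# With tombstones: the marker on n2 is newer than the value on n3.
m1, m2, m3 = {}, {}, {}
write([m1, m2, m3], "k", "A", ts=1)
delete_with_tombstone([m1, m2], "k", ts=2)
tombstone_result = read([m2, m3], "k")    # None, meaning "empty": correct
```

The point of the model is that an absent value carries no timestamp and cannot win against older data, while a tombstone can.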

Cool story, but I really want to remove the data!
Tombstone removal!

When are tombstones removed?
When should tombstones be removed?
• Once the tombstone is fully replicated
• When the deleted data has been removed
When are tombstones actually removed?
• After gc_grace_seconds
• During compactions,
IF all the deleted data and the tombstone itself are involved
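
The two conditions above can be written as a small predicate. This is a simplification for illustration, with made-up names: Cassandra's real check works on timestamps of overlapping SSTables rather than on sets of file names. The value 864000 is the default gc_grace_seconds (10 days).

```python
def tombstone_droppable(deletion_time, now, sstables_with_key,
                        sstables_in_compaction, gc_grace_seconds=864000):
    """True if this compaction could drop the tombstone (simplified model)."""
    past_grace = now >= deletion_time + gc_grace_seconds
    # Every SSTable that may hold data shadowed by the tombstone must take
    # part in the compaction, otherwise the older data would reappear.
    all_involved = set(sstables_with_key) <= set(sstables_in_compaction)
    return past_grace and all_involved


day = 86400
# Past gc_grace, and every SSTable holding the key is compacted: droppable.
print(tombstone_droppable(0, 11 * day, {"s1", "s2"}, {"s1", "s2", "s3"}))  # True
# Past gc_grace, but s2 is left out of the compaction: kept.
print(tombstone_droppable(0, 11 * day, {"s1", "s2"}, {"s1"}))              # False
# All SSTables involved, but still within gc_grace: kept.
print(tombstone_droppable(0, 5 * day, {"s1", "s2"}, {"s1", "s2"}))         # False
```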

How tombstones are removed: Compaction!
[Diagram, in steps: the write path on a Cassandra node (client write, memtable in memory, commit log and immutable SSTables on disk, flush, client read). Four SSTables are compacted into a single SSTable.]

Implications in the real world
• No compaction = no eviction
• + TTLs or deletes: tombstones pile up (up to 100%)
• Overlapping SSTables = no eviction
• Fragmented data = eviction unlikely
• LCS: tombstone level ≠ data level = no eviction
• TTL << gc_grace_seconds = high % of useless data

Some tuning!
Good news:
the Cassandra community and committers are awesome!

Some tuning!
Issue: No compaction = No eviction
CASSANDRA-3442: tombstone_threshold (C* 1.2.b1)
Compaction option, default:
tombstone_threshold = 0.2 (ratio: an estimated 20% of the SSTable has been deleted)
Single-SSTable compaction is triggered, based on an estimate!
Low risk: worst case -> no-op

Some tuning!
Issue: Tombstone compaction loop!
CASSANDRA-4022: Check for key overlaps (C* 1.2.b1)
Internals improvement, not an option:
The estimate of droppable tombstones is improved.
It now considers keys overlapping with other SSTables.

Some tuning!
Issue: Tombstone compaction loop!
CASSANDRA-4781: tombstone_compaction_interval (C* 1.2.b2)
tombstone_compaction_interval = 86400 (in seconds = 1 day)
Definitely prevents loops
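
How tombstone_threshold and tombstone_compaction_interval work together can be sketched as a predicate. This is an approximation under stated assumptions, not Cassandra's code: the function name is made up, and the real check also looks at overlaps with other SSTables (CASSANDRA-4022).

```python
def worth_tombstone_compaction(estimated_droppable_ratio, sstable_age_seconds,
                               tombstone_threshold=0.2,
                               tombstone_compaction_interval=86400):
    """Decide whether a single-SSTable tombstone compaction is a candidate."""
    # An SSTable created less than one interval ago is skipped; this is
    # what stops the same file from being recompacted in a loop.
    old_enough = sstable_age_seconds > tombstone_compaction_interval
    return old_enough and estimated_droppable_ratio > tombstone_threshold


print(worth_tombstone_compaction(0.35, 3 * 86400))   # True
print(worth_tombstone_compaction(0.35, 3600))        # False: too recent
print(worth_tombstone_compaction(0.10, 3 * 86400))   # False: below threshold
```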

Some tuning!
Issue: Compacting to remove tombstones is expensive
CASSANDRA-5228: Expired SSTables (C* 2.0.b1)
Internals improvement, not an option:
fully expired SSTables can be dropped as a whole.
Effective with time series, DTCS / TWCS and TTLs!

Some tuning!
Issue: Tombstone compactions not triggering
CASSANDRA-6563: unchecked_tombstone_compaction (C* 2.0.9)
unchecked_tombstone_compaction = false (default)
The CASSANDRA-4022 overlap check becomes optional (set to true to skip it)
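
These compaction options are set per table. The helper below is a hypothetical convenience that only builds the CQL text; the table name comes from the log example earlier, and SizeTieredCompactionStrategy is an assumed strategy, to be replaced by whatever the table actually uses.

```python
def alter_compaction_cql(table, strategy, **options):
    """Build an ALTER TABLE statement setting compaction sub-options."""
    opts = {"class": strategy}
    for name, value in options.items():
        # CQL accepts these sub-options as quoted strings.
        opts[name] = str(value).lower() if isinstance(value, bool) else str(value)
    body = ", ".join(f"'{k}': '{v}'" for k, v in opts.items())
    return f"ALTER TABLE {table} WITH compaction = {{{body}}};"


statement = alter_compaction_cql(
    "mykeyspace.mytable",
    "SizeTieredCompactionStrategy",       # assumption: adapt to the table
    tombstone_threshold=0.2,
    tombstone_compaction_interval=86400,
    unchecked_tombstone_compaction=True,
)
print(statement)
```

The compaction map replaces the previous one as a whole, so the statement should carry every sub-option the table needs, not only the ones being changed.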

Some tuning!
Issue: Overlaps prevent efficient tombstone compactions
CASSANDRA-7019: provide_overlapping_tombstones (C* 3.10)
provide_overlapping_tombstones = NONE (CELL / ROW / NONE)
Risky:
• Not yet released, so not really tested
• Heavier tombstone compactions

Some tuning - Tombstone distribution!
[Diagram: the delete scenario WITH tombstones (4 nodes, RF = 3; A* = tombstone on A). The tombstone is not fully replicated: two replicas hold A* and the third still holds "A". A Quorum read still returns A*, meaning "empty": correct.]

Case where a node fails + no repair
=
Case without tombstone
=
Zombie data!
[Diagram: two replicas hold A*, the third still holds "A" and is never repaired. Once A* is removed from the two replicas, only "A" remains, and a client read returns "A": wrong.]
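
The zombie scenario can be replayed with a toy model. It is a simplified sketch with made-up names, not Cassandra's implementation: replicas are dictionaries of key to (timestamp, value), a tombstone is the value None, and purge_tombstones stands in for a compaction that runs after gc_grace_seconds.

```python
GC_GRACE = 864000  # seconds; Cassandra's default gc_grace_seconds (10 days)


def read(replicas, key):
    """Newest timestamp wins among the replicas that answered."""
    answers = [rep[key] for rep in replicas if key in rep]
    return max(answers, key=lambda a: a[0])[1] if answers else None


def purge_tombstones(replica, now, gc_grace=GC_GRACE):
    """Compaction stand-in: drop tombstones older than gc_grace."""
    for key, (ts, value) in list(replica.items()):
        if value is None and now >= ts + gc_grace:
            del replica[key]


n1 = {"k": (100, None)}   # holds the tombstone A*
n2 = {"k": (100, None)}   # holds the tombstone A*
n3 = {"k": (1, "A")}      # was down, missed the delete, never repaired

before = read([n2, n3], "k")   # None: the tombstone shadows "A"

later = 100 + GC_GRACE
purge_tombstones(n1, later)
purge_tombstones(n2, later)

after = read([n2, n3], "k")    # "A": the deleted value is back
```

A repair run before the purge would have copied the tombstone to n3, and the value would have stayed deleted.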

CASSANDRA-6434 (C* 3.0.b1):
only_purge_repaired_tombstones
(Default: False)
[Diagram: same scenario (4 nodes, RF = 3; A* = tombstone on A). With this option, A* is not removed from the two replicas while the data is unrepaired, so the read still returns A*, meaning "empty": correct.]
Limitation:
Repair failing or no repair
=
permanent tombstone

Things we know about tombstones
• Tombstones come from deletes and TTLs
• Tombstones fit with the Cassandra write path
• Tombstones ensure consistency
• Reading tombstones is expensive and can produce failures
• Tombstones take space on disk and might be tricky to remove
• Tombstones need to be distributed before being removed

Takeaways
• Model data and workflow to avoid reading many tombstones
• Deleted data = repair the table within gc_grace_seconds
• Monitor tombstones, keep control! (Set some alerts?)
• Use compaction options to tackle problems; there is always a way.
• Is there no way? Ask, or create a Jira and keep improving Cassandra!
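
An alert for the "repair within gc_grace_seconds" rule could be as simple as the sketch below. The function name and the 80% warning margin are assumptions chosen for illustration; where the time of the last successful repair comes from depends on the repair tooling in use.

```python
def repair_status(last_repair_ts, now, gc_grace_seconds=864000, warn_ratio=0.8):
    """Classify how close a table is to missing its repair window."""
    elapsed = now - last_repair_ts
    if elapsed >= gc_grace_seconds:
        return "overdue"    # tombstones may be purged before being replicated
    if elapsed >= warn_ratio * gc_grace_seconds:
        return "warning"    # time to schedule or check the repair
    return "ok"


day = 86400
print(repair_status(0, 3 * day))    # ok
print(repair_status(0, 9 * day))    # warning
print(repair_status(0, 10 * day))   # overdue
```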

Thank you
Questions?
thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
