Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Summit 2016
2. ©2016 ProtectWise, Inc. All rights reserved.
About ProtectWise
An enterprise security company that records, analyzes, and visualizes your network on demand to detect
complex threats that others can’t see
Big Data
Data Ingestion and Availability
● Well north of a billion new records
per day
● Processed, analyzed, and stored
in soft real time
● Fully indexed and searchable with
p95 query response times <1
second
○ Shortening the OODA loop
● Hundreds of Cassandra servers
● Hundreds of Billions of Records
● Multiple Petabytes of Data
3.
With one sensor, ProtectWise captured the
following data at Super Bowl 50:
● 8.806 Terabytes of data seen. Primarily HTTP,
SSL and traffic to Amazon AWS, Facebook,
Twitter, and Instagram.
● 1.550 Terabytes of data captured (82%
optimization)
● 17 million URLs hit
● 8,085,949 DNS requests
With a single sensor deployed on the Levi's
Public Wi-Fi Network, ProtectWise captured
8.806 Terabytes of Data and was able to optimize
it by 82% to just 1.550 Terabytes of data, a true
testament to the scale and power of our platform.
Use Case – Super Bowl 50
The Broncos weren’t the only team from Denver in Levi’s Stadium
4.
● How Deletes (tombstones) in Cassandra Work Today
● The Limitations of Tombstones
● Misconceptions about Tombstones
● How TTL (Time to Live) in Cassandra works today
● The limitations of TTLs
● Why neither strategy works for ProtectWise
● Our unconventional solution
● Advantages of our solution
● Disadvantages of our solution
Overview
5.
Terrible
● Increases both write and read I/O pressure
● Not an effective means of reclaiming disk capacity
● May be difficult to locate correct records for deletion
● Makes reads more expensive
● Actual tombstones can often greatly outlive their deleted data (much longer than gc_grace)
Terrific
● Surgically target data for removal
● Easy to reason about from a read consistency perspective
The Trouble with Tombstones
6.
When do tombstones (and expired TTL’d
records) go away?
● Never before it’s gc_grace old (this is a good thing, and you get to control it)
● During compaction, for a tombstone past gc_grace, its partition key is checked
against the bloom filters of all other SSTables for the given CQL table.
● If there is a bloom filter collision, the tombstone will remain, even if the bloom
filter collision was a false positive
● If there is ANY data, even other tombstones for that partition in any SSTable,
the tombstone will not get cleaned up
● If bloom filters indicate there is no chance of overlap on that partition key, the
tombstone will get cleaned up
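The checklist above can be condensed into a small predicate. This is a simplified model for illustration only, not Cassandra's actual implementation; the toy `BloomFilter` class just remembers keys, with an optional extra set standing in for false-positive collisions.

```python
class BloomFilter:
    """Toy stand-in for an SSTable bloom filter: remembers exact keys,
    plus an optional set of keys that mimic false-positive collisions."""
    def __init__(self, keys, false_positives=()):
        self._maybe = set(keys) | set(false_positives)

    def might_contain(self, key):
        return key in self._maybe


def tombstone_purgeable(tombstone_age, gc_grace_seconds, partition_key,
                        other_sstable_filters):
    """May this tombstone be dropped during compaction?"""
    # Never before it is gc_grace old.
    if tombstone_age < gc_grace_seconds:
        return False
    # If ANY other SSTable might hold this partition, even via a
    # bloom filter false positive, the tombstone must remain.
    return not any(f.might_contain(partition_key)
                   for f in other_sstable_filters)


# A tombstone past gc_grace whose partition appears nowhere else: purgeable.
print(tombstone_purgeable(900, 600, "pk1", [BloomFilter({"pk2"})]))  # True
# The same tombstone with a (false-positive) filter hit: kept.
print(tombstone_purgeable(900, 600, "pk1",
                          [BloomFilter({"pk2"}, false_positives={"pk1"})]))  # False
```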
7.
Misconception about Tombstone Performance
● The performance degradation from tombstones isn’t from the tombstone itself.
● If you do
○ for (n <- 0 to 100000) {
INSERT INTO table (partitionKey, clusterKey) VALUES ( 1, n )
}
● You can later create a range tombstone that is only a few bytes in size:
○ DELETE FROM table WHERE partitionKey = 1 AND clusterKey < 99999
● But if you then
○ SELECT * FROM table WHERE partitionKey = 1 LIMIT 1
● Cassandra will have to read then discard rows with clusterKey values from 0
to 99998 before the LIMIT 1 can be reached
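Where the work goes can be modelled in a few lines. This is a toy sketch, not Cassandra's actual read path (which walks SSTables through indexes and iterators), but the scan-and-discard behaviour is the same:

```python
def read_partition(rows, range_tombstone_upper, limit):
    """Toy model of the read above: `rows` is the ordered list of
    clustering keys in one partition; a range tombstone shadows every
    clustering key strictly below `range_tombstone_upper`."""
    scanned = 0
    live = []
    for clustering_key in rows:
        scanned += 1
        if clustering_key < range_tombstone_upper:
            continue  # shadowed by the range tombstone: read, then discarded
        live.append(clustering_key)
        if len(live) >= limit:
            break
    return live, scanned


rows = list(range(100001))            # clusterKey 0 .. 100000
live, scanned = read_partition(rows, 99999, limit=1)
print(live)     # [99999] -- the first live row
print(scanned)  # 100000 -- 99999 dead rows read and discarded, plus 1 live
```

The range tombstone itself is tiny; the cost is the hundred thousand shadowed rows that must be read before `LIMIT 1` is satisfied.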
8.
[Diagram: SSTable 1 holds partition PK1 with clustering keys CK1 … CKn, each with columns 1, 2, …; SSTable 2 holds a range tombstone PK1 DELETE 1 – n-1.]
SELECT * FROM table WHERE pk1 LIMIT 1
10.
Tombstones in Compaction
[Diagram: the Delete is written to a new SSTable; an older SSTable contains the record to delete.]
11.
Tombstones in Compaction
[Animation frames (slides 11–13): other writes continue to arrive and compact together, while the SSTable containing the record to delete remains untouched on disk.]
16.
Setup
cqlsh> CREATE TABLE testing(
… p blob,
… c blob,
… v blob,
… PRIMARY KEY(p,c)
… ) WITH gc_grace_seconds=0;
17.
Setup
cqlsh> INSERT INTO testing
(p,c,v) VALUES (0xcafebabe,
0xdeadbeef, 0xfacefeed);
$ nodetool flush && ls *-Data.db
testing-testing-ka-1-Data.db
cqlsh> INSERT INTO testing
(p,c,v) VALUES (0xcafebabe,
0xdeadbeef, 0xdeadc0de);
$ nodetool flush && ls *-Data.db
testing-testing-ka-1-Data.db
testing-testing-ka-2-Data.db
SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
18.
Setup
cqlsh> select * from testing;
p | c | v
------------+------------+------------
0xcafebabe | 0xdeadbeef | 0xdeadc0de
cqlsh> DELETE FROM testing WHERE
p=0xcafebabe AND c=0xdeadbeef;
$ nodetool flush && ls *-Data.db
testing-testing-ka-1-Data.db
testing-testing-ka-2-Data.db
testing-testing-ka-3-Data.db
SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
SSTable 3: 0xcafebabe:0xdeadbeef:DELETE
19.
Let’s look at the data
$ hexdump testing-testing-ka-1-Data.db
0000000 4b 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
0000010 00 01 00 72 0a 00 04 de ad be ef 0e 00 71 05 34
0000020 3b d8 4e df f1 0d 00 14 0b 19 00 29 01 76 1a 00
0000030 70 04 fa ce fe ed 00 00 6f 9b 15 17
SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
20.
Let’s look at the data
$ hexdump testing-testing-ka-2-Data.db
0000000 4b 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
0000010 00 01 00 72 0a 00 04 de ad be ef 0e 00 71 05 34
0000020 3b e3 86 df 23 0d 00 14 0b 19 00 29 01 76 1a 00
0000030 70 04 de ad c0 de 00 00 62 de 14 02
SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
21.
Let’s look at the data
$ hexdump testing-testing-ka-3-Data.db
0000000 33 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
0000010 00 01 00 94 07 00 04 de ad be ef ff 10 0a 00 f0
0000020 00 01 57 4f 2d 69 00 05 34 3b e6 ab 47 c8 00 00
0000030 db 77 12 69
SSTable 3: 0xcafebabe:0xdeadbeef:DELETE
22.
Time to Compact
Simulate compaction happening on data that has been deleted, but where the tombstone is not involved in the compaction:
% jmx_invoke -m org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction testing-testing-ka-1-Data.db,testing-testing-ka-2-Data.db
$ ls *-Data.db
testing-testing-ka-3-Data.db
testing-testing-ka-4-Data.db
SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
SSTable 4: 0xcafebabe:0xdeadbeef:??????????
23.
Let’s look again:
$ hexdump testing-testing-ka-4-Data.db
0000000 4b 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
0000010 00 01 00 72 0a 00 04 de ad be ef 0e 00 71 05 34
0000020 3b e3 86 df 23 0d 00 14 0b 19 00 29 01 76 1a 00
0000030 70 04 de ad c0 de 00 00 62 de 14 02
SSTable 4: 0xcafebabe:0xdeadbeef:0xdeadc0de
24.
What happened?
● The tombstone for primary key (0xcafebabe,0xdeadbeef) was written in
SSTable 3
● SSTable 3 wasn’t involved in the compaction
● ∴ The data at rest didn’t get cleaned up
25.
Why is this a problem?
● In all mainline compaction strategies:
○ Data written close together chronologically tends to compact together relatively quickly
○ Data written chronologically far apart tends to take a long time to compact together
■ This is why it’s an anti-pattern to append to or overwrite the same partition over long
periods of time: reads of that partition will end up needing to touch a large
number of SSTables
○ Because disk capacity is not recovered until the tombstone and its underlying data are
involved in the same compaction, it can take a long time to recover disk capacity
● Some compaction strategies (DateTiered, TimeWindowed) have controls that
allow for data to permanently stop compacting.
○ Under these conditions it can become impossible to ever recover disk capacity
Note: see CASSANDRA-7019 for an upcoming alternative.
Also “Improving Tombstone Compactions” today at 4:10 in 210C
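The experiment from slides 17–24 can be modelled with a toy merge. This sketch is illustrative only (real compaction merges cells per column by timestamp and also handles tombstone purging); the keys and values mirror the demo:

```python
def compact(sstables):
    """Toy compaction: merge only the SSTables handed to it, keeping the
    newest cell per (partition, clustering) key. A tombstone shadows older
    data only if it is part of the SAME compaction."""
    merged = {}
    for table in sstables:
        for key, (cell, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (cell, ts)
    return merged


sstable1 = {("0xcafebabe", "0xdeadbeef"): ("0xfacefeed", 1)}
sstable2 = {("0xcafebabe", "0xdeadbeef"): ("0xdeadc0de", 2)}
sstable3 = {("0xcafebabe", "0xdeadbeef"): ("TOMBSTONE", 3)}

# User-defined compaction of SSTables 1 and 2 only: the tombstone in
# SSTable 3 is not involved, so the deleted data survives at rest.
print(compact([sstable1, sstable2]))
# {('0xcafebabe', '0xdeadbeef'): ('0xdeadc0de', 2)}

# Only when the tombstone joins the same compaction is the data shadowed.
print(compact([sstable1, sstable2, sstable3]))
# {('0xcafebabe', '0xdeadbeef'): ('TOMBSTONE', 3)}
```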
26.
Trouble
● Once a TTL has been written, there is no way to change your mind except to write the record again with a new TTL
● Rows written to more than once may have inconsistent TTLs, leading to dirty or incomplete reads
● TTL’d records may remain at rest much longer than you realize in some circumstances
Terrific
● Fire and forget: your data will “go away” fairly predictably
The Trouble with TTLs
27.
● Customers get to change their mind about how
long they want us to retain their data
● Changing TTLs is expensive, both in terms of
I/O pressure and temporarily doubling the size
of your data at rest
● Disks are cheap… lots of disks are not
● Cassandra data at rest has an ongoing cost, if
a customer stops paying for it, we need to as
well
● Timeliness of deletes is important
● Sensitive data spillage means we need to
remove some data quickly
Why Neither Strategy Works for Us
29.
Step 1: Set RF=1
● There are some weird anti-entropy corner cases that are solved if you disable replication
Step 2: Disconnect Drive
● If you have hot-swappable drives, this is a lot easier; if not, you might have some temporary downtime due to the RF change
Basic Strategy
Successfully used to delete significant amounts of data with little to no performance impact
32.
Delete While Compacting
● During compaction, use deterministic logic to determine which records should be removed
● Prevent those records from surviving the compaction process
● Clean up indexes at the time a record is removed
Evicting Compaction Strategy
● Records are removed by the next compaction after they become eligible for eviction
● If we need to recover capacity quickly, we can use user defined compaction to selectively target our oldest files
Basic Strategy
For real this time.
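A sketch of the idea, assuming a hypothetical age-based retention rule (`RETENTION_SECONDS` and the record shape are inventions for illustration, not the project's API):

```python
RETENTION_SECONDS = 30 * 24 * 3600  # hypothetical per-tenant retention window


def should_evict(record_timestamp, now):
    """Deterministic conviction test: evict anything older than the
    retention window. The verdict is a pure function of the record and
    the eviction boundary, so every replica reaches the same answer."""
    return now - record_timestamp > RETENTION_SECONDS


def compact_with_eviction(records, now):
    """Records convicted by should_evict never survive the compaction."""
    return [r for r in records if not should_evict(r["ts"], now)]


now = 1_000_000_000
records = [
    {"pk": "a", "ts": now - 10},                     # fresh: kept
    {"pk": "b", "ts": now - RETENTION_SECONDS - 1},  # expired: evicted
]
print(compact_with_eviction(records, now))  # [{'pk': 'a', 'ts': 999999990}]
```

No tombstone is ever written: the expired record simply fails to survive into the newly compacted SSTable.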
33.
Backing up your deletes
● If you choose to, you can automatically create a backup of the deleted records
● Save yourself from deletion remorse:
○ Incorrect deletion logic
○ Change of heart by you(r customer)
● Move those records to cheaper storage
Wrapping Compaction Strategy
● Acts as a parent strategy with your preferred child compaction strategy
● The child strategy is responsible for SSTable selection
● You get the characteristics of your strategy, with the deletes of our strategy
Features
Does it support feature X of my preferred compaction strategy?
34.
● Configurable and extensible
○ Several provided implementations can be surgically controlled by reading deletion rules out of a table you specify
○ Extend one of several base classes to provide more sophisticated custom logic
● Restoring backups
○ To restore accidentally deleted records, copy the backup files to the right path and run nodetool refresh
○ Or, if your topology has changed, restore them with sstableloader
Features
35.
ALTER TABLE bar WITH compaction = {
  'class': 'DeletingCompactionStrategy',
  'dcs_underlying_compactor': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160
};
ALTER TABLE foo WITH compaction = {
  'class': 'DeletingCompactionStrategy',
  'dcs_underlying_compactor': 'SizeTieredCompactionStrategy',
  'min_threshold': '2',
  'max_threshold': '8'
};
A Wrapping Compaction Strategy
Doesn’t change the fundamental characteristics
of your preferred compaction strategy
36.
Compaction’s Inner Workings
Credit: DataStax
https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_write_path_c.html
37.
Compaction’s Inner Workings
[Annotation: the compaction strategy selects the SSTables and returns SSTableIterators.]
38.
Compaction’s Inner Workings
[Annotation: FilteringSSTableIterators exclude data which should be deleted, and also notify the IndexManager, if appropriate, to clean up associated indexes.]
39.
An Evicting Compaction Strategy
Records involved in compaction which are convicted do not survive into the newly compacted SSTable.
Rules:
A => ✓
B => ✗
C => ✓
D => ✗
E => ✓
SSTable 1: A, B, C
SSTable 2: A, B, D
SSTable 3: C, D, E
New SSTable: A, C, E
Backup SSTable*: B, D
* if configured to backup convicted records
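The picture above, as a runnable sketch (the rules table and list-of-lists SSTables are stand-ins for illustration, not the strategy's real interfaces):

```python
# True = the record survives; False = the record is convicted.
RULES = {"A": True, "B": False, "C": True, "D": False, "E": True}


def evicting_compaction(sstables, rules, backup=True):
    """Merge the input SSTables; convicted records are dropped from the
    new SSTable and, optionally, written to a backup SSTable instead."""
    seen, new_sstable, backup_sstable = set(), [], []
    for table in sstables:
        for record in table:
            if record in seen:
                continue  # duplicate already merged from an earlier SSTable
            seen.add(record)
            if rules[record]:
                new_sstable.append(record)
            elif backup:
                backup_sstable.append(record)
    return sorted(new_sstable), sorted(backup_sstable)


print(evicting_compaction(
    [["A", "B", "C"], ["A", "B", "D"], ["C", "D", "E"]], RULES))
# (['A', 'C', 'E'], ['B', 'D'])
```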
40.
● Compaction performance is often bounded
by available write capacity
● Fewer records surviving into the target table
reduces write pressure during compaction
● Testing of records for conviction is
lightweight (depending on the complexity of
your business logic), and mostly CPU
bound
Often Faster than Existing Compaction
41.
Boundary Consistency
● Records past the deletion boundary may
still be visible to your application
○ You may get inconsistent reads for
such records
● Evicted records may resurrect temporarily
due to repair
○ They’ll end up in a new SSTable and
will evict again during the next auto
compaction
Eventual Deletes
● Like all other baked-in deletion options, disk
capacity is reclaimed only eventually
○ Old SSTables still tend not to compact
very frequently
○ However, by triggering user defined
compaction, you can reclaim space
immediately without resorting to major
compaction
Limitations
42.
Repair = Resurrection
● Read repair, and in general any repair, may
cause a record to fully resurrect temporarily
● The resurrected record will appear in the
youngest SSTables
● It will disappear again when those new
SSTables next compact (generally relatively
quickly on an active cluster)
Requires deletion determinism
● Logic for deletes needs to be deterministic,
or you’ll end up with consistency issues
● Probably not a good idea to base deletion
logic on anything outside of the primary key,
except in narrow use cases
Limitations
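The determinism requirement can be shown with a toy convictor. Here `day_bucket` is a hypothetical primary-key component invented for this sketch; the point is that a pure function of primary-key data gives every replica the same verdict, no matter when it compacts:

```python
def deterministic_convict(record, boundary):
    """Pure function of primary-key data: convict (evict) any record
    whose day_bucket falls before the eviction boundary."""
    return record["day_bucket"] < boundary


# Three replicas holding the same two records, compacting independently.
replicas = [
    [{"pk": "a", "day_bucket": 100}, {"pk": "b", "day_bucket": 200}]
    for _ in range(3)
]
verdicts = [
    [deterministic_convict(r, boundary=150) for r in replica]
    for replica in replicas
]
print(verdicts)  # [[True, False], [True, False], [True, False]]

# Because the verdicts agree everywhere, reads stay consistent; a convictor
# that consulted mutable external state (or randomness) would let replicas
# disagree about which rows exist.
```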
43.
● Supported and tested against the Cassandra 2.x
series
● In 3.x the package and class names
changed; it needs to be ported
● Tests are written in Scala; they cover a lot
of surface area but would need to be
rewritten prior to contribution
● Needs additional general purpose
convictors
● Principally tested against STCS and
deserves better coverage for other child
strategies
Current Project Status
44.
https://github.com/protectwise/cassandra-util
Also includes:
● Our DataStax Driver Wrapper for Scala
● Our CCM wrapper lib for automating unit tests in Scala
GitHub
Availability & Compatibility
47. Cold Storage that Isn’t Glacial
Tomorrow 10:45 Room LL20D
Using Approximate Data for Small,
Insightful Analytics
Tomorrow 2:00 Room LL20A
See Our Other Talks
Editor’s Notes
Essentially, tombstones will never go away as long as a partition contains data in more than one SSTable, and sometimes not even then (bloom filter collisions).
When you write to Cassandra, the writes initially go to Memtables.
When the memtables get full, they flush to disk as an immutable SSTable
When you perform a read, Cassandra needs to consider all the SSTables on disk, so as you accumulate lots of small SSTables, read performance will degrade
What do you think will be in SSTable 4?
Optimally it would be an empty SSTable, since the only record in it has been deleted.
However… What happened?