C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
1.
June 19, 2013
#Cassandra13
Axel Liljencrantz
liljencrantz@spotify.com
How not to use
Cassandra
3.
#Cassandra13
The Spotify backend
• Around 3000 servers in 3 datacenters
• Volumes
o We have ~ 12 soccer fields of music
o Streaming ~ 4 Wikipedias/second
o ~ 24 000 000 active users
4.
#Cassandra13
The Spotify backend
• Specialized software powering Spotify
o ~ 70 services
o Mostly Python, some Java
o Small, simple services, each responsible for a single task
5.
#Cassandra13
Storage needs
• Used to be a pure PostgreSQL shop
• Postgres is awesome, but...
o Poor cross-site replication support
o Write master failure requires manual intervention
o Sharding throws most relational advantages out the
window
6.
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra ~2 years ago
• About a dozen services use it by now
• Back then, there was little information about how to
design efficient, scalable storage schemas for
Cassandra
7.
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra ~2 years ago
• About a dozen services use it by now
• Back then, there was little information about how to
design efficient, scalable storage schemas for
Cassandra
• So we screwed up
• A lot
9.
#Cassandra13
Read repair
• Repairs data that diverged during outages, as part of regular read operations
• With RR, reads request hash digests from all replicas
• Result is still returned as soon as enough nodes have
replied
• If there is a mismatch, perform a repair
10.
#Cassandra13
Read repair
• Useful factoid: Read repair is performed across all data
centers
• So in a multi-DC setup, all reads will result in requests being
sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair
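A minimal sketch of the knob this slide points at, assuming a hypothetical playlist_head table and the DataStax Python driver (which is newer than the Thrift-era clients the talk describes): turn the global, cross-DC read repair chance down and lean on the DC-local variant instead.

    from cassandra.cluster import Cluster

    # Contact point and keyspace/table names are made up for illustration.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("playlists")

    # Prefer DC-local read repair so reads stop fanning out to every data center.
    session.execute("""
        ALTER TABLE playlist_head
        WITH read_repair_chance = 0.0
        AND dclocal_read_repair_chance = 0.1
    """)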
11.
#Cassandra13
Row cache
• Cassandra can be configured to cache entire data rows in
RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
12.
#Cassandra13
Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate entire row
• Don't use unless you understand all use cases
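If you do decide a small, hot, read-mostly table is a genuine fit, row caching is a per-table setting rather than a global switch. A hedged sketch follows; the table name is hypothetical and the map syntax is from later Cassandra versions (2.1+), while the 1.x era used values like caching = 'rows_only'.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # Only cache rows where whole rows comfortably fit in RAM; bounding
    # rows_per_partition avoids pulling huge rows into the cache on a miss.
    session.execute("""
        ALTER TABLE user_settings
        WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}
    """)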
13.
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• Compression algorithm (snappy) is super fast
• So you can just enable it and everything will be better, right?
14.
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• Compression algorithm (snappy) is super fast
• So you can just enable it and everything will be better, right?
• NO!
• Compression disables a bunch of fast paths, slowing down
fast reads
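Compression is likewise a per-table setting, so it can be benchmarked per workload instead of being flipped on everywhere. A rough sketch, assuming hypothetical table names and the 1.x/2.x option spelling (newer versions express "off" differently):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # Enable Snappy block compression for one table...
    session.execute("""
        ALTER TABLE playlist_changes
        WITH compression = {'sstable_compression': 'SnappyCompressor'}
    """)

    # ...and disable it on another if latency-sensitive reads regress
    # (an empty sstable_compression meant "off" in the 1.x/2.x syntax).
    session.execute("""
        ALTER TABLE playlist_head
        WITH compression = {'sstable_compression': ''}
    """)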
16.
#Cassandra13
Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But when you keep writing to the same columns over a long time, performance goes down
• We've seen clusters where reads touch a dozen SSTables
on average
• nodetool cfhistograms is your friend
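A tiny wrapper around the command the slide recommends, just to make the habit concrete; the keyspace and column family names are placeholders, and the output format varies between Cassandra versions, so this only prints it for a human to read.

    import subprocess

    def cfhistograms(keyspace, column_family, host="localhost"):
        # Returns raw nodetool output; the "SSTables" column shows how many
        # SSTables reads are touching on average.
        out = subprocess.check_output(
            ["nodetool", "-h", host, "cfhistograms", keyspace, column_family]
        )
        return out.decode()

    if __name__ == "__main__":
        print(cfhistograms("playlists", "playlist_head"))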
17.
#Cassandra13
Performance worse over time
• CASSANDRA-5514
• Every SSTable stores its first and last column
• Time series-like data is effectively partitioned
18.
#Cassandra13
Few cross-continent clusters
• Few cross-continent Cassandra users
• We are kind of on our own when it comes to some problems
• CASSANDRA-5148
• Disable TCP nodelay on inter-DC links
• Reduced packet count by 20 %
20.
#Cassandra13
How not to upgrade Cassandra
• Very few total cluster outages
o Clusters have been up and running since the early
0.7 days, through rolling upgrades, expansions, full
hardware replacements etc.
• Never lost any data!
o No matter how spectacularly Cassandra fails, it has
never written bad data
o Immutable SSTables FTW
21.
#Cassandra13
Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed rolling upgrade would work
o It did not
• One would expect 0.8.6 to have this fixed
• Patched Cassandra and rolled it a day later
• Takeaways:
o ALWAYS try rolling upgrades in a testing environment
o Don't believe what people on the Internet tell you
22.
#Cassandra13
Upgrade 0.8 to 1.0
• We tried upgrading in test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
23.
#Cassandra13
Upgrade 0.8 to 1.0
• We tried upgrading in test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrub data ⇾ fixed
• Takeaway: ALWAYS test upgrades using production data
24.
#Cassandra13
Upgrading 1.0 to 1.1
• After the previous upgrades, we did all the tests with
production data and everything worked fine...
• Until we redid it in production, and we had reports of
missing rows
• Scrub ⇾ restart made them reappear
• This was in December, have not been able to reproduce
• PEBKAC?
• Takeaway: ?
25.
#Cassandra13
How not to deal with large clusters
26.
#Cassandra13
Coordinator
• The coordinator performs partitioning and passes requests on to the right nodes
• Merges all responses
27.
#Cassandra13
What happens if one node is slow?
28.
#Cassandra13
What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Net hiccup
• Major GC
• Reality
29.
#Cassandra13
What happens if one node is slow?
• Coordinator has a request queue
• If a node goes down completely, gossip will notice
quickly and drop the node
• But what happens if a node is just super slow?
30.
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in
the cluster fills up
• And the entire cluster stops accepting requests
31.
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in
the cluster fills up
• And the entire cluster stops accepting requests
• No single point of failure?
32.
#Cassandra13
What happens if one node is slow?
• Solution: Partitioner awareness in client
• Worst case, only the ~3 replica nodes for a key are affected, not the whole cluster
• Available in Astyanax
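The deck points at Astyanax on the Java side; here is a rough Python analogue using the DataStax driver's token-aware routing, so requests go straight to a replica that owns the key and a slow node only drags down its own token ranges. Contact points and the DC name are made up, and driver versions differ in how the policy is wired up.

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    cluster = Cluster(
        contact_points=["cassandra1.example.net", "cassandra2.example.net"],
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc="DC1")
        ),
    )
    session = cluster.connect("playlists")
    # Requests are now routed to a node that actually holds the data,
    # instead of an arbitrary coordinator that has to fan out.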
34.
#Cassandra13
Deleting data
How is data deleted?
• SSTables are immutable, so the data can't be removed in place
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other
write
35.
#Cassandra13
How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into
SSTables that hold the original data, making the
tombstones redundant
• Once a tombstone is the only value for a specific
column, the tombstone can go away
• Still need grace time to handle node downtime
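To make the mechanics concrete: a DELETE just writes a tombstone that shadows older values, and gc_grace_seconds controls how long that tombstone must be kept so a node that was down during the delete can still learn about it via repair. A minimal sketch with a hypothetical table; the default grace period is ten days, and shortening it only makes sense if repairs reliably run more often than that.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # This reclaims no space; it adds a tombstone that wins over older values.
    session.execute(
        "DELETE FROM playlist_head WHERE playlist_id = %s",
        ("spotify:user:example:playlist:123",),
    )

    # Tombstones must survive at least this long before compaction may drop them.
    session.execute("ALTER TABLE playlist_head WITH gc_grace_seconds = 864000")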
36.
#Cassandra13
How not to delete data
• Tombstones can only be deleted once all non-tombstone values have been deleted
• If you're using SizeTiered compaction, 'old' rows will
rarely get deleted
37.
#Cassandra13
How not to delete data
• Tombstones are a problem even when using leveled
compaction
• In theory, 90 % of all rows should live in a single
SSTable
• In production, we've found that 20 - 50 % of all reads hit
more than one SSTable
• Frequently updated columns will exist in many levels,
causing tombstones to stick around
38.
#Cassandra13
How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will
rarely get deleted from «popular» rows
• Avoid schemas that delete data!
39.
#Cassandra13
TTL'd data
• Cassandra supports TTL'd data
• Once TTL'd data expires, it should just be compacted
away, right?
• We know we don't need the data anymore, no need for
a tombstone, so it should be fast, right?
40.
#Cassandra13
TTL'd data
• Cassandra supports TTL'd data
• Once TTL'd data expires, it should just be compacted
away, right?
• We know we don't need the data anymore, no need for
a tombstone, so it should be fast, right?
• Noooooo...
• (Without a tombstone, overwritten data could theoretically bounce back)
41.
#Cassandra13
TTL'd data
• CASSANDRA-5228
• Drop entire SSTables once all their columns have expired
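For reference, writing with a TTL looks like the sketch below; the table and values are hypothetical. The point of the preceding slides is that the column silently turns into a tombstone on expiry, so expired data still costs compaction work until optimizations like CASSANDRA-5228 can drop fully expired SSTables.

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # Keep this row for 30 days, then let it expire.
    session.execute(
        "INSERT INTO recent_plays (user_id, played_at, track) "
        "VALUES (%s, %s, %s) USING TTL 2592000",
        ("user123", datetime.utcnow(), "spotify:track:example"),
    )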
42.
#Cassandra13
The Playlist service
Our most complex service
• ~1 billion playlists
• 40 000 reads per second
• 22 TB of compressed data
43.
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
44.
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
• Perfect test case for
Cassandra!
45.
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that
works really well!
• That's actually a really useful feature
46.
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that
works really well!
• That's actually a really useful feature said no one ever
47.
#Cassandra13
Playlist data model
• Sequence of changes
• The changes are the authoritative data
• Everything else is optimization
• Cassandra is pretty neat for storing this kind of data
• Can use consistency level ONE safely
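A rough sketch of the "sequence of changes" idea: each playlist is a partition and every change is appended as a new clustering row, so the change log is the authoritative data. The schema and names are illustrative, not Spotify's actual model; appends never overwrite each other, which is why consistency level ONE is good enough here.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("playlists")

    session.execute("""
        CREATE TABLE IF NOT EXISTS playlist_changes (
            playlist_id text,
            change_id   timeuuid,
            change      blob,
            PRIMARY KEY (playlist_id, change_id)
        )
    """)

    # Append a change at consistency ONE; readers replay the change log
    # (plus any cached "head" optimization) to reconstruct the playlist.
    append = SimpleStatement(
        "INSERT INTO playlist_changes (playlist_id, change_id, change) "
        "VALUES (%s, now(), %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(append, ("spotify:user:example:playlist:123", b"add track X"))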
49.
#Cassandra13
Tombstone hell
Noticed that HEAD requests took several seconds for some
lists
Easy to reproduce in cassandra-cli
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency - should be < 0.1 s
Copy head SSTables to development machine for
investigation
The Cassandra tool sstable2json showed that the row contained
600 000 tombstones!
50.
#Cassandra13
Tombstone hell
We expected tombstones would be deleted after 30 days
• Nope, all tombstones from the past 1.5 years were still there
Revelation: Rows existing in 4+ SSTables never have
tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
Major compaction (CF size cut in half)
51.
#Cassandra13
Zombie tombstones
• Ran major compaction manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds
again‽‽‽
52.
#Cassandra13
Repair vs major compactions
A repair between the major compactions "resurrected" the
tombstones :(
New solution:
• Repairs during Monday-Friday
• Major compaction Saturday-Sunday
A (by now) well-known Cassandra anti-pattern:
Don't use Cassandra to store queues
53.
#Cassandra13
Cassandra counters
• There are lots of places in the Spotify UI where we
count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that
sounds suitable
• Is this awesome?
54.
#Cassandra13
Cassandra counters
• They've actually worked reasonably well for us.
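For completeness, a minimal sketch of the counter pattern the deck mentions, with hypothetical names: a dedicated counter table incremented via UPDATE. One caveat worth a comment is that counter writes are not idempotent, so a timed-out increment cannot safely be retried.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    session.execute("""
        CREATE TABLE IF NOT EXISTS playlist_followers (
            playlist_id text PRIMARY KEY,
            followers   counter
        )
    """)

    # Increment the follower count for one playlist.
    # Note: if this times out, retrying may double-count.
    session.execute(
        "UPDATE playlist_followers SET followers = followers + 1 "
        "WHERE playlist_id = %s",
        ("spotify:user:example:playlist:123",),
    )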
55.
#Cassandra13
Lessons
• There are still various esoteric problems with large-scale
Cassandra installations
• Debugging them is interesting
• If you agree with the above statements, you should
totally come work with us
56.
#Cassandra13
Lessons
• Cassandra read performance is heavily dependent on
the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns
make read performance slowly decrease
• Super hard to perform realistic benchmarks
57.
#Cassandra13
Lessons
• Avoid repeatedly writing data to the same row over very
long spans of time
• If you're working at scale, you'll need to know how
Cassandra works under the hood
• nodetool cfhistograms is your friend