C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
1.
June 19, 2013
#Cassandra13
Axel Liljencrantz
liljencrantz@spotify.com
How not to use
Cassandra
3.
#Cassandra13
The Spotify backend
• Around 3000 servers in 3 datacenters
• Volumes
o We have ~ 12 soccer fields of music
o Streaming ~ 4 Wikipedias/second
o ~ 24 000 000 active users
4.
#Cassandra13
The Spotify backend
• Specialized software powering Spotify
o ~ 70 services
o Mostly Python, some Java
o Small, simple services, each responsible for a single task
5.
#Cassandra13
Storage needs
• Used to be a pure PostgreSQL shop
• Postgres is awesome, but...
o Poor cross-site replication support
o Write master failure requires manual intervention
o Sharding throws most relational advantages out the
window
6.
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra ~2 years ago
• About a dozen services use it by now
• Back then, there was little information about how to
design efficient, scalable storage schemas for
Cassandra
7.
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra ~2 years ago
• About a dozen services use it by now
• Back then, there was little information about how to
design efficient, scalable storage schemas for
Cassandra
• So we screwed up
• A lot
9.
#Cassandra13
Read repair
• Repairs data that diverged during outages, as part of regular read operations
• With RR, reads request hash digests from all replicas
• Result is still returned as soon as enough nodes have
replied
• If there is a mismatch, perform a repair
10.
#Cassandra13
Read repair
• Useful factoid: Read repair is performed across all data
centers
• So in a multi-DC setup, all reads will result in requests being
sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair
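A minimal sketch of the knob this slide points at, assuming a hypothetical playlist_head table and the DataStax Python driver (which is newer than the Thrift-era clients the talk describes): turn the global, cross-DC read repair chance down and lean on the DC-local variant instead.

    from cassandra.cluster import Cluster

    # Contact point and keyspace/table names are made up for illustration.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("playlists")

    # Prefer DC-local read repair so reads stop fanning out to every data center.
    session.execute("""
        ALTER TABLE playlist_head
        WITH read_repair_chance = 0.0
        AND dclocal_read_repair_chance = 0.1
    """)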
11.
#Cassandra13
Row cache
• Cassandra can be configured to cache entire data rows in
RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
12.
#Cassandra13
Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate entire row
• Don't use unless you understand all use cases
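If you do decide a small, hot, read-mostly table is a genuine fit, row caching is a per-table setting rather than a global switch. A hedged sketch follows; the table name is hypothetical and the map syntax is from later Cassandra versions (2.1+), while the 1.x era used values like caching = 'rows_only'.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # Only cache rows where whole rows comfortably fit in RAM; bounding
    # rows_per_partition avoids pulling huge rows into the cache on a miss.
    session.execute("""
        ALTER TABLE user_settings
        WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}
    """)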
13.
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• Compression algorithm (snappy) is super fast
• So you can just enable it and everything will be better, right?
14.
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• Compression algorithm (snappy) is super fast
• So you can just enable it and everything will be better, right?
• NO!
• Compression disables a bunch of fast paths, slowing down
fast reads
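Compression is likewise a per-table setting, so it can be benchmarked per workload instead of being flipped on everywhere. A rough sketch, assuming hypothetical table names and the 1.x/2.x option spelling (newer versions express "off" differently):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # Enable Snappy block compression for one table...
    session.execute("""
        ALTER TABLE playlist_changes
        WITH compression = {'sstable_compression': 'SnappyCompressor'}
    """)

    # ...and disable it on another if latency-sensitive reads regress
    # (an empty sstable_compression meant "off" in the 1.x/2.x syntax).
    session.execute("""
        ALTER TABLE playlist_head
        WITH compression = {'sstable_compression': ''}
    """)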
16.
#Cassandra13
Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But when you keep writing to the same columns over a long time, performance goes down
• We've seen clusters where reads touch a dozen SSTables
on average
• nodetool cfhistograms is your friend
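A tiny wrapper around the command the slide recommends, just to make the habit concrete; the keyspace and column family names are placeholders, and the output format varies between Cassandra versions, so this only prints it for a human to read.

    import subprocess

    def cfhistograms(keyspace, column_family, host="localhost"):
        # Returns raw nodetool output; the "SSTables" column shows how many
        # SSTables reads are touching on average.
        out = subprocess.check_output(
            ["nodetool", "-h", host, "cfhistograms", keyspace, column_family]
        )
        return out.decode()

    if __name__ == "__main__":
        print(cfhistograms("playlists", "playlist_head"))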
17.
#Cassandra13
Performance worse over time
• CASSANDRA-5514
• Every SSTable stores its first and last column
• Time series-like data is effectively partitioned
18.
#Cassandra13
Few cross-continent clusters
• Few cross-continent Cassandra users
• We are kind of on our own when it comes to some problems
• CASSANDRA-5148
• Disable TCP nodelay on inter-DC links
• Reduced packet count by 20 %
20.
#Cassandra13
How not to upgrade Cassandra
• Very few total cluster outages
o Clusters have been up and running since the early
0.7 days, through rolling upgrades, expansions, full
hardware replacements etc.
• Never lost any data!
o No matter how spectacularly Cassandra fails, it has
never written bad data
o Immutable SSTables FTW
21.
#Cassandra13
Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed rolling upgrade would work
o It did not
• One would expect 0.8.6 to have this fixed
• Patched Cassandra and rolled it a day later
• Takeaways:
o ALWAYS try rolling upgrades in a testing environment
o Don't believe what people on the Internet tell you
22.
#Cassandra13
Upgrade 0.8 to 1.0
• We tried upgrading in test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
23.
#Cassandra13
Upgrade 0.8 to 1.0
• We tried upgrading in test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrub data ⇾ fixed
• Takeaway: ALWAYS test upgrades using production data
24.
#Cassandra13
Upgrading 1.0 to 1.1
• After the previous upgrades, we did all the tests with
production data and everything worked fine...
• Until we redid it in production, and we had reports of
missing rows
• Scrub ⇾ restart made them reappear
• This was in December, have not been able to reproduce
• PEBKAC?
• Takeaway: ?
25.
#Cassandra13
How not to deal with large clusters
26.
#Cassandra13
Coordinator
• The coordinator performs partitioning and passes requests on to the right nodes
• Merges all responses
27.
#Cassandra13
What happens if one node is slow?
28.
#Cassandra13
What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Net hiccup
• Major GC
• Reality
29.
#Cassandra13
What happens if one node is slow?
• Coordinator has a request queue
• If a node goes down completely, gossip will notice
quickly and drop the node
• But what happens if a node is just super slow?
30.
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in
the cluster fills up
• And the entire cluster stops accepting requests
31.
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in
the cluster fills up
• And the entire cluster stops accepting requests
• No single point of failure?
32.
#Cassandra13
What happens if one node is slow?
• Solution: Partitioner awareness in client
• Worst case, only the ~3 replica nodes for a key are affected, not the whole cluster
• Available in Astyanax
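The deck points at Astyanax on the Java side; here is a rough Python analogue using the DataStax driver's token-aware routing, so requests go straight to a replica that owns the key and a slow node only drags down its own token ranges. Contact points and the DC name are made up, and driver versions differ in how the policy is wired up.

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    cluster = Cluster(
        contact_points=["cassandra1.example.net", "cassandra2.example.net"],
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc="DC1")
        ),
    )
    session = cluster.connect("playlists")
    # Requests are now routed to a node that actually holds the data,
    # instead of an arbitrary coordinator that has to fan out.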
34.
#Cassandra13
Deleting data
How is data deleted?
• SSTables are immutable, so the data can't be removed in place
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other
write
35.
#Cassandra13
How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into
SSTables that hold the original data, making the
tombstones redundant
• Once a tombstone is the only value for a specific
column, the tombstone can go away
• Still need grace time to handle node downtime
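To make the mechanics concrete: a DELETE just writes a tombstone that shadows older values, and gc_grace_seconds controls how long that tombstone must be kept so a node that was down during the delete can still learn about it via repair. A minimal sketch with a hypothetical table; the default grace period is ten days, and shortening it only makes sense if repairs reliably run more often than that.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # This reclaims no space; it adds a tombstone that wins over older values.
    session.execute(
        "DELETE FROM playlist_head WHERE playlist_id = %s",
        ("spotify:user:example:playlist:123",),
    )

    # Tombstones must survive at least this long before compaction may drop them.
    session.execute("ALTER TABLE playlist_head WITH gc_grace_seconds = 864000")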
36.
#Cassandra13
How not to delete data
• Tombstones can only be deleted once all non-tombstone values have been deleted
• If you're using SizeTiered compaction, 'old' rows will
rarely get deleted
37.
#Cassandra13
How not to delete data
• Tombstones are a problem even when using leveled
compaction
• In theory, 90 % of all rows should live in a single
SSTable
• In production, we've found that 20 - 50 % of all reads hit
more than one SSTable
• Frequently updated columns will exist in many levels,
causing tombstones to stick around
38.
#Cassandra13
How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will
rarely get deleted from «popular» rows
• Avoid schemas that delete data!
39.
#Cassandra13
TTL'd data
• Cassandra supports TTL'd data
• Once TTL'd data expires, it should just be compacted
away, right?
• We know we don't need the data anymore, no need for
a tombstone, so it should be fast, right?
40.
#Cassandra13
TTL'd data
• Cassandra supports TTL'd data
• Once TTL'd data expires, it should just be compacted
away, right?
• We know we don't need the data anymore, no need for
a tombstone, so it should be fast, right?
• Noooooo...
• (Without a tombstone, overwritten data could theoretically bounce back)
41.
#Cassandra13
TTL'd data
• CASSANDRA-5228
• Drop entire SSTables once all their columns have expired
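For reference, writing with a TTL looks like the sketch below; the table and values are hypothetical. The point of the preceding slides is that the column silently turns into a tombstone on expiry, so expired data still costs compaction work until optimizations like CASSANDRA-5228 can drop fully expired SSTables.

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    # Keep this row for 30 days, then let it expire.
    session.execute(
        "INSERT INTO recent_plays (user_id, played_at, track) "
        "VALUES (%s, %s, %s) USING TTL 2592000",
        ("user123", datetime.utcnow(), "spotify:track:example"),
    )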
42.
#Cassandra13
The Playlist service
Our most complex service
• ~1 billion playlists
• 40 000 reads per second
• 22 TB of compressed data
43.
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
44.
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
• Perfect test case for
Cassandra!
45.
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that
works really well!
• That's actually a really useful feature
46.
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that
works really well!
• That's actually a really useful feature said no one ever
47.
#Cassandra13
Playlist data model
• Sequence of changes
• The changes are the authoritative data
• Everything else is optimization
• Cassandra is pretty neat for storing this kind of data
• Can use consistency level ONE safely
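A rough sketch of the "sequence of changes" idea: each playlist is a partition and every change is appended as a new clustering row, so the change log is the authoritative data. The schema and names are illustrative, not Spotify's actual model; appends never overwrite each other, which is why consistency level ONE is good enough here.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("playlists")

    session.execute("""
        CREATE TABLE IF NOT EXISTS playlist_changes (
            playlist_id text,
            change_id   timeuuid,
            change      blob,
            PRIMARY KEY (playlist_id, change_id)
        )
    """)

    # Append a change at consistency ONE; readers replay the change log
    # (plus any cached "head" optimization) to reconstruct the playlist.
    append = SimpleStatement(
        "INSERT INTO playlist_changes (playlist_id, change_id, change) "
        "VALUES (%s, now(), %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(append, ("spotify:user:example:playlist:123", b"add track X"))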
49.
#Cassandra13
Tombstone hell
Noticed that HEAD requests took several seconds for some
lists
Easy to reproduce in cassandra-cli
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency - should be < 0.1 s
Copy head SSTables to development machine for
investigation
The Cassandra tool sstable2json showed that the row contained
600 000 tombstones!
50.
#Cassandra13
Tombstone hell
We expected tombstones would be deleted after 30 days
• Nope, all tombstones from the past 1.5 years were still there
Revelation: Rows existing in 4+ SSTables never have
tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
Major compaction (CF size cut in half)
51.
#Cassandra13
Zombie tombstones
• Ran major compaction manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds
again‽‽‽
52.
#Cassandra13
Repair vs major compactions
A repair between the major compactions "resurrected" the
tombstones :(
New solution:
• Repairs during Monday-Friday
• Major compaction Saturday-Sunday
A (by now) well-known Cassandra anti-pattern:
Don't use Cassandra to store queues
53.
#Cassandra13
Cassandra counters
• There are lots of places in the Spotify UI where we
count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that
sounds suitable
• Is this awesome?
54.
#Cassandra13
Cassandra counters
• They've actually worked reasonably well for us.
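For completeness, a minimal sketch of the counter pattern the deck mentions, with hypothetical names: a dedicated counter table incremented via UPDATE. One caveat worth a comment is that counter writes are not idempotent, so a timed-out increment cannot safely be retried.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("playlists")

    session.execute("""
        CREATE TABLE IF NOT EXISTS playlist_followers (
            playlist_id text PRIMARY KEY,
            followers   counter
        )
    """)

    # Increment the follower count for one playlist.
    # Note: if this times out, retrying may double-count.
    session.execute(
        "UPDATE playlist_followers SET followers = followers + 1 "
        "WHERE playlist_id = %s",
        ("spotify:user:example:playlist:123",),
    )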
55.
#Cassandra13
Lessons
• There are still various esoteric problems with large-scale
Cassandra installations
• Debugging them is interesting
• If you agree with the above statements, you should
totally come work with us
56.
#Cassandra13
Lessons
• Cassandra read performance is heavily dependent on
the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns
make read performance slowly decrease
• Super hard to perform realistic benchmarks
57.
#Cassandra13
Lessons
• Avoid repeatedly writing data to the same row over very
long spans of time
• If you're working at scale, you'll need to know how
Cassandra works under the hood
• nodetool cfhistograms is your friend