Cassandra at Spotify




                   28th of March 2012
About this talk
 An introduction to Spotify, our service, and our persistent storage needs
 What Cassandra brings
 What we have learned
 What I would have liked to have known a year ago
 Not a comparison between different NoSQL solutions
 The real reason: yes, we are hiring.
Noa Resare
 Stockholm, Sweden
 Service Reliability Engineering
 noa@spotify.com
 @blippie
Spotify — all music, all the time
 A better user experience than file sharing.
 Native desktop and mobile clients.
 Custom backend, built for performance and scalability.

 13 markets. More than ten million users.
 3 datacenters.
 Tens of gigabits of data pushed per datacenter.
 Backend systems that support a large set of innovative features.
Innovative features in practice
  Playlist
  A named list of tracks
  Keep multiple devices in sync
  Support nested playlists
  Offline editing, pubsub
  Scale: more than half a billion lists currently in the system
  About 10 kHz (10,000 requests per second) at peak traffic
  Result: we accidentally implemented a version control system
Suggested solutions
 Flat files
   We don’t need ACID
   Linux page cache kicks ass. (Not really)
 SQL
   Tried and true. Facebook does this.
 Simple key-value store
   Tokyo Cabinet; some prior experience
 Clustered key-value store
   Evaluated a lot; the final contenders were HBase and Cassandra
Enter Cassandra
 Solves a large subset of storage-related problems
 Sharding, replication
 No single point of failure
 Free software
 Active community, commercial backing
 66 + 18 + 9 + 28 production nodes
 About twenty nodes for various testing clusters
 Datasets ranging from a few gigabytes to 8 TB
Cassandra, winning!
 Major upgrades without service interruptions (in theory)
 Crazy fast writes
   Not just because you have a hardware RAID card that is good at lying to you
   Exploits the fact that sequential I/O is faster than random I/O
 In case of inconsistencies, knows what to do
 Cross-datacenter replication support
 Tinker friendly
   Readable code
Cassandra flexibility for Playlist
 The main use cases for playlist:
   Get me all changes since version N of playlist P
   Apply the following changes on top of version M of playlist Q
 This translates to two column families: head and change
 Asymmetric sizes
 Neat trick: read change with consistency level ONE, fall back to LOCAL_QUORUM
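The fallback trick above can be sketched as follows. This is a hypothetical illustration, not Spotify's actual code: the Cassandra client call is abstracted behind a `Function`, and `ConsistencyLevel` here is a local stand-in for the driver's enum.

```java
// Hypothetical sketch of the ONE -> LOCAL_QUORUM read fallback.
import java.util.Optional;
import java.util.function.Function;

enum ConsistencyLevel { ONE, LOCAL_QUORUM }

final class FallbackReader {
    // Try the cheap read first; only pay the quorum price when it misses.
    static Optional<String> readWithFallback(
            Function<ConsistencyLevel, Optional<String>> read) {
        Optional<String> fast = read.apply(ConsistencyLevel.ONE);
        return fast.isPresent() ? fast : read.apply(ConsistencyLevel.LOCAL_QUORUM);
    }
}
```

The appeal is that the common case costs a single-replica read, and the quorum read is only issued for the occasional replica miss.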
Let me tell you a story
 Latest stable kernel from Debian Squeeze, 2.6.32-5
 What happens after 209 days of uptime?
 Load average around 120
 No CPU activity reported by top
 Mattias de Zalenski:

 log((209 days) / (1 nanosecond)) / log(2) = 54.0034557

 (2^54) nanoseconds = 208.499983 days

 Somewhere, nanosecond values are shifted ten bits?

 Downtime for payments
 Downtime for account creation
 No downtime for Cassandra-backed systems
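The arithmetic on the slide is easy to verify: a nanosecond counter that effectively loses its top ten bits (54 usable bits out of 64) wraps after roughly 208.5 days, matching the observed ~209-day uptime.

```java
// Check the slide's back-of-the-envelope math: 2^54 nanoseconds in days.
final class OverflowMath {
    static double daysUntilOverflow() {
        double ns = Math.scalb(1.0, 54); // 2^54 nanoseconds
        return ns / 1e9 / 86_400.0;      // ns -> seconds -> days
    }
}
```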
Backups
 A few terabytes of live data, many nodes. Painful.
 Inefficient: a copy of the on-disk structure is at least 3 times the data
 Non-compacted: possibly a few tens of old versions
 Pulling data off nodes evicts hot data from the page cache
 Initially, only full backups (pre-0.8)
Our solution to backups
 NetworkTopologyStrategy is cool
 Separate datacenter for backups, with RF=1
 Beware: tricky
 Once removed from production performance considerations
 Application-level incremental backups
 As of this week, Cassandra-level incremental backups
 Still some issues: lots of SSTables
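Expressed as a keyspace definition, the layout above looks roughly like this (modern CQL syntax shown for clarity; the datacenter names are invented for illustration):

```sql
CREATE KEYSPACE playlist
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'main': 3,    -- production datacenter, full replication
    'backup': 1   -- backup "datacenter", a single copy of each row
  };
```

With RF=1 in the backup datacenter, each row lands on exactly one backup node, so the backup cluster holds one compacted copy of the data instead of three.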
Solid state is a game changer
 Asymmetrically sized datasets
 I Can Haz superlarge SSD? No.
 With small disks, on-disk data structure size matters a lot
 Our plan:
   Leveled compaction strategy, new in 1.0
   Hack Cassandra to support configurable data directories per keyspace
   Our patch is integrated in Cassandra 1.1
Some unpleasant surprises
 Immaturity
   Has anyone written nodetool -h ring?
 Broken on-disk bloom filters in 0.8; a very painful upgrade to 1.0
 Small disks plus high load make an out-of-disk condition very possible
 Logging is lacking
Lessons learned from backup datacenter
 Asymmetric cluster sizes are painful
 60 production nodes, 6 backup nodes
 Repairs that replicate all data 10 times
 The workaround: manual repairs
   Remove SSTables from the broken node (to free up space)
   Start it so it takes writes while repopulating
   Snapshot and move SSTables from 4 evenly spaced nodes
   Do a full compaction
   Do a repair and hope for the best
Spot the bug
 The Hector Java Cassandra driver:
 private AtomicInteger counter = new AtomicInteger();

 private Server getNextServer() {
     counter.compareAndSet(16384, 0);
     return servers[counter.getAndIncrement() % servers.length];
 }


 Race condition
 java.lang.ArrayIndexOutOfBoundsException
 After close to 2**31 requests
 Took a few days
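For contrast, one way to make the selector overflow-safe (a sketch, not the actual Hector fix): drop the racy reset entirely and derive a non-negative index from the wrapping counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of an overflow-safe round-robin server selector.
final class RoundRobin {
    private final AtomicInteger counter = new AtomicInteger();
    private final String[] servers;

    RoundRobin(String... servers) { this.servers = servers; }

    String next() {
        // getAndIncrement wraps to Integer.MIN_VALUE after 2^31-1 calls;
        // Math.floorMod stays non-negative even for negative inputs,
        // so the index can never go out of bounds.
        int i = Math.floorMod(counter.getAndIncrement(), servers.length);
        return servers[i];
    }
}
```

The original `compareAndSet(16384, 0)` only resets when it happens to observe exactly 16384; under concurrency the counter can skip past that value, run up to `Integer.MAX_VALUE`, wrap negative, and make `%` return a negative index.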
Thrift payload size limits
 Communication with Cassandra is based on Thrift
 On large mutations (larger than 15 MiB), Thrift drops the underlying TCP connection
 Hector considers the connection drop a node-specific problem
 Retries on all Cassandra nodes
 Effectively shutting down all Cassandra traffic
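A generic defensive measure against this failure mode is to cap mutation payload size on the client side. A minimal sketch, assuming mutations already serialized to byte arrays and a configurable byte limit (both assumptions for illustration; this is not Hector or Spotify code):

```java
import java.util.ArrayList;
import java.util.List;

// Split a batch of serialized mutations into chunks that each stay
// under a transport payload limit (e.g. well below the ~15 MiB that
// caused the Thrift connection drops described above).
final class MutationChunker {
    static List<List<byte[]>> chunk(List<byte[]> mutations, long maxBytes) {
        List<List<byte[]>> chunks = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        long size = 0;
        for (byte[] m : mutations) {
            // Flush the current chunk before it would exceed the limit.
            // A single oversize mutation is still sent alone rather than dropped.
            if (!current.isEmpty() && size + m.length > maxBytes) {
                chunks.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(m);
            size += m.length;
        }
        if (!current.isEmpty()) chunks.add(current);
        return chunks;
    }
}
```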
Conclusions
 In the 0.6–1.0 timeframe, developers and operations engineers are needed
 You need to keep an eye on bugs created, and be part of the community
 Exotic stuff (such as asymmetrically sized datacenters) is tricky
 Lots of things get fixed; you need to keep up with upstream
 You need to integrate with monitoring and graphing
 Consider it a toolkit for constructing solutions.
Questions? Answers.

Edge AI and Vision Alliance
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 

Recently uploaded (20)

Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 

Spotify cassandra london

  • 1. Cassandra at Spotify 28th of March 2012
  • 3. About this talk An introduction to Spotify, our service and our persistent storage needs
  • 4. About this talk An introduction to Spotify, our service and our persistent storage needs What Cassandra brings
  • 5. About this talk An introduction to Spotify, our service and our persistent storage needs What Cassandra brings What we have learned
  • 6. About this talk An introduction to Spotify, our service and our persistent storage needs What Cassandra brings What we have learned What I would have liked to have known a year ago
  • 7. About this talk An introduction to Spotify, our service and our persistent storage needs What Cassandra brings What we have learned What I would have liked to have known a year ago Not a comparison between different NoSQL solutions
  • 8. About this talk An introduction to Spotify, our service and our persistent storage needs What Cassandra brings What we have learned What I would have liked to have known a year ago Not a comparison between different NoSQL solutions The real reason: yes, we are hiring.
  • 11. Noa Resare Stockholm, Sweden Service Reliability Engineering
  • 12. Noa Resare Stockholm, Sweden Service Reliability Engineering noa@spotify.com
  • 13. Noa Resare Stockholm, Sweden Service Reliability Engineering noa@spotify.com @blippie
  • 14. Spotify — all music, all the time
  • 15. Spotify — all music, all the time A better user experience than file sharing.
  • 16. Spotify — all music, all the time A better user experience than file sharing. Native desktop and mobile clients.
  • 17. Spotify — all music, all the time A better user experience than file sharing. Native desktop and mobile clients. Custom backend, built for performance and scalability.
  • 18. Spotify — all music, all the time A better user experience than file sharing. Native desktop and mobile clients. Custom backend, built for performance and scalability.
  • 19. Spotify — all music, all the time A better user experience than file sharing. Native desktop and mobile clients. Custom backend, built for performance and scalability. 13 markets. More than ten million users.
  • 20. Spotify — all music, all the time A better user experience than file sharing. Native desktop and mobile clients. Custom backend, built for performance and scalability. 13 markets. More than ten million users. 3 datacenters.
  • 21. Spotify — all music, all the time A better user experience than file sharing. Native desktop and mobile clients. Custom backend, built for performance and scalability. 13 markets. More than ten million users. 3 datacenters. Tens of gigabits of data pushed per datacenter.
  • 22. Spotify — all music, all the time A better user experience than file sharing. Native desktop and mobile clients. Custom backend, built for performance and scalability. 13 markets. More than ten million users. 3 datacenters. Tens of gigabits of data pushed per datacenter. Backend systems that support a large set of innovative features.
  • 24. Innovative features in practice Playlist
  • 25. Innovative features in practice Playlist A named list of tracks
  • 26. Innovative features in practice Playlist A named list of tracks Keep multiple devices in sync
  • 27. Innovative features in practice Playlist A named list of tracks Keep multiple devices in sync Support nested playlists
  • 28. Innovative features in practice Playlist A named list of tracks Keep multiple devices in sync Support nested playlists Offline editing, pubsub
  • 29. Innovative features in practice Playlist A named list of tracks Keep multiple devices in sync Support nested playlists Offline editing, pubsub Scale. More than half a billion lists currently in the system
  • 30. Innovative features in practice Playlist A named list of tracks Keep multiple devices in sync Support nested playlists Offline editing, pubsub Scale. More than half a billion lists currently in the system About 10 kHz at peak traffic.
  • 31. Innovative features in practice Playlist A named list of tracks Keep multiple devices in sync Support nested playlists Offline editing, pubsub Scale. More than half a billion lists currently in the system About 10 kHz at peak traffic. Result: accidentally implemented a VCS
  • 34. Suggested solutions Flat files We don’t need ACID
  • 35. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass.
  • 36. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass. (Not really)
  • 37. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass. (Not really) SQL
  • 38. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass. (Not really) SQL Tried and true. Facebook does this
  • 39. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass. (Not really) SQL Tried and true. Facebook does this Simple Key-Value store
  • 40. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass. (Not really) SQL Tried and true. Facebook does this Simple Key-Value store Tokyo cabinet, some experience
  • 41. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass. (Not really) SQL Tried and true. Facebook does this Simple Key-Value store Tokyo cabinet, some experience Clustered Key-Value store
  • 42. Suggested solutions Flat files We don’t need ACID Linux page cache kicks ass. (Not really) SQL Tried and true. Facebook does this Simple Key-Value store Tokyo cabinet, some experience Clustered Key-Value store Evaluated a lot, end game contestants HBase and Cassandra
  • 44. Enter Cassandra Solves a large subset of storage related problems
  • 45. Enter Cassandra Solves a large subset of storage related problems Sharding, replication
  • 46. Enter Cassandra Solves a large subset of storage related problems Sharding, replication No single point of failure
  • 47. Enter Cassandra Solves a large subset of storage related problems Sharding, replication No single point of failure Free software
  • 48. Enter Cassandra Solves a large subset of storage related problems Sharding, replication No single point of failure Free software Active community, commercial backing
  • 49. Enter Cassandra Solves a large subset of storage related problems Sharding, replication No single point of failure Free software Active community, commercial backing 66 + 18 + 9 + 28 production nodes
  • 50. Enter Cassandra Solves a large subset of storage related problems Sharding, replication No single point of failure Free software Active community, commercial backing 66 + 18 + 9 + 28 production nodes About twenty nodes for various testing clusters
  • 51. Enter Cassandra Solves a large subset of storage related problems Sharding, replication No single point of failure Free software Active community, commercial backing 66 + 18 + 9 + 28 production nodes About twenty nodes for various testing clusters Datasets ranging from 8T to a few gigs.
  • 53. Cassandra, winning! Major upgrades without service interruptions (in theory)
  • 54. Cassandra, winning! Major upgrades without service interruptions (in theory) Crazy fast writes
  • 55. Cassandra, winning! Major upgrades without service interruptions (in theory) Crazy fast writes Not just because you have a hardware RAID card that is good at lying to you
  • 56. Cassandra, winning! Major upgrades without service interruptions (in theory) Crazy fast writes Not just because you have a hardware RAID card that is good at lying to you Uses the knowledge that sequential I/O is faster than random I/O
  • 57. Cassandra, winning! Major upgrades without service interruptions (in theory) Crazy fast writes Not just because you have a hardware RAID card that is good at lying to you Uses the knowledge that sequential I/O is faster than random I/O In case of inconsistencies, knows what to do
  • 58. Cassandra, winning! Major upgrades without service interruptions (in theory) Crazy fast writes Not just because you have a hardware RAID card that is good at lying to you Uses the knowledge that sequential I/O is faster than random I/O In case of inconsistencies, knows what to do Cross datacenter replication support
  • 59. Cassandra, winning! Major upgrades without service interruptions (in theory) Crazy fast writes Not just because you have a hardware RAID card that is good at lying to you Uses the knowledge that sequential I/O is faster than random I/O In case of inconsistencies, knows what to do Cross datacenter replication support Tinker friendly
  • 60. Cassandra, winning! Major upgrades without service interruptions (in theory) Crazy fast writes Not just because you have a hardware RAID card that is good at lying to you Uses the knowledge that sequential I/O is faster than random I/O In case of inconsistencies, knows what to do Cross datacenter replication support Tinker friendly Readable code
  • 62. Cassandra flexibility for Playlist The main use cases for playlist:
  • 63. Cassandra flexibility for Playlist The main use cases for playlist: Get me all changes since version N of playlist P
  • 64. Cassandra flexibility for Playlist The main use cases for playlist: Get me all changes since version N of playlist P Apply the following changes on top of version M of playlist Q
  • 65. Cassandra flexibility for Playlist The main use cases for playlist: Get me all changes since version N of playlist P Apply the following changes on top of version M of playlist Q This translates to CFs head and change
  • 66. Cassandra flexibility for Playlist The main use cases for playlist: Get me all changes since version N of playlist P Apply the following changes on top of version M of playlist Q This translates to CFs head and change Asymmetric sizes
  • 67. Cassandra flexibility for Playlist The main use cases for playlist: Get me all changes since version N of playlist P Apply the following changes on top of version M of playlist Q This translates to CFs head and change Asymmetric sizes Neat trick: read change with level=ONE, fallback to LOCAL_QUORUM
  • 68. Cassandra flexibility for Playlist The main use cases for playlist: Get me all changes since version N of playlist P Apply the following changes on top of version M of playlist Q This translates to CFs head and change Asymmetric sizes Neat trick: read change with level=ONE, fallback to LOCAL_QUORUM
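The consistency-level trick above can be sketched roughly like this, against a hypothetical one-method client interface (Hector's real API differs): read at consistency level ONE first, and only fall back to the more expensive LOCAL_QUORUM read if the cheap read is behind the version the caller already knows exists.

```java
// Hedged sketch; PlaylistStore, FallbackReader and Playlist are
// illustrative names, not Hector or Spotify classes.
enum Consistency { ONE, LOCAL_QUORUM }

class Playlist {
    final String id;
    final long version;
    Playlist(String id, long version) { this.id = id; this.version = version; }
}

interface PlaylistStore {
    // Single-method store; a lambda works as a stub in tests.
    Playlist read(String id, Consistency level);
}

class FallbackReader {
    private final PlaylistStore store;
    FallbackReader(PlaylistStore store) { this.store = store; }

    Playlist readAtLeast(String id, long minVersion) {
        Playlist p = store.read(id, Consistency.ONE); // fast path: one replica
        if (p != null && p.version >= minVersion) {
            return p;                                 // that replica was fresh enough
        }
        return store.read(id, Consistency.LOCAL_QUORUM); // stronger, slower read
    }
}
```

The point of the trick is that most reads hit a replica that is already up to date, so the quorum round-trip is only paid when the cheap read is demonstrably stale.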
  • 69. Let me tell you a story
  • 70. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5
  • 71. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5 What happens after 209 days of uptime?
  • 72. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5 What happens after 209 days of uptime? Load average around 120.
  • 73. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5 What happens after 209 days of uptime? Load average around 120. No CPU activity reported by top
  • 74. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5 What happens after 209 days of uptime? Load average around 120. No CPU activity reported by top Mattias de Zalenski: log((209 days) / (1 nanoseconds)) / log(2) = 54.0034557 (2^54) nanoseconds = 208.499983 days Somewhere nanosecond values are shifted ten bits?
  • 75. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5 What happens after 209 days of uptime? Load average around 120. No CPU activity reported by top Mattias de Zalenski: log((209 days) / (1 nanoseconds)) / log(2) = 54.0034557 (2^54) nanoseconds = 208.499983 days Somewhere nanosecond values are shifted ten bits? Downtime for payment
  • 76. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5 What happens after 209 days of uptime? Load average around 120. No CPU activity reported by top Mattias de Zalenski: log((209 days) / (1 nanoseconds)) / log(2) = 54.0034557 (2^54) nanoseconds = 208.499983 days Somewhere nanosecond values are shifted ten bits? Downtime for payment Downtime for account creation
  • 77. Let me tell you a story Latest stable kernel from Debian Squeeze 2.6.32-5 What happens after 209 days of uptime? Load average around 120. No CPU activity reported by top Mattias de Zalenski: log((209 days) / (1 nanoseconds)) / log(2) = 54.0034557 (2^54) nanoseconds = 208.499983 days Somewhere nanosecond values are shifted ten bits? Downtime for payment Downtime for account creation No downtime for cassandra backed systems
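The arithmetic behind the 209-day observation is worth spelling out: if the kernel's nanosecond clock is effectively shifted by ten bits somewhere, trouble starts once bit 54 of the counter is set, and 2^54 nanoseconds is almost exactly 209 days of uptime.

```java
// Verifies the slide's calculation: 2^54 ns converted to days.
class UptimeOverflow {
    public static void main(String[] args) {
        double ns = Math.pow(2, 54);       // 2^54 nanoseconds
        double days = ns / 1e9 / 86400.0;  // ns -> seconds -> days
        System.out.printf("2^54 ns = %.6f days%n", days); // ~208.5 days
    }
}
```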
  • 79. Backups A few terabytes of live data, many nodes. Painful.
  • 80. Backups A few terabytes of live data, many nodes. Painful. Inefficient. Copy of on disk structure, at least 3 times the data
  • 81. Backups A few terabytes of live data, many nodes. Painful. Inefficient. Copy of on disk structure, at least 3 times the data Non-compacted. Possibly a few tens of old versions.
  • 82. Backups A few terabytes of live data, many nodes. Painful. Inefficient. Copy of on disk structure, at least 3 times the data Non-compacted. Possibly a few tens of old versions. Pulling data off nodes evicts hot data from the page cache.
  • 83. Backups A few terabytes of live data, many nodes. Painful. Inefficient. Copy of on disk structure, at least 3 times the data Non-compacted. Possibly a few tens of old versions. Pulling data off nodes evicts hot data from the page cache. Initially, only full backups (pre 0.8)
  • 84. Our solution to backups
  • 85. Our solution to backups NetworkTopologyStrategy is cool
  • 86. Our solution to backups NetworkTopologyStrategy is cool Separate datacenter for backups with RF=1
  • 87. Our solution to backups NetworkTopologyStrategy is cool Separate datacenter for backups with RF=1 Beware: tricky
  • 88. Our solution to backups NetworkTopologyStrategy is cool Separate datacenter for backups with RF=1 Beware: tricky Once removed from production performance considerations
  • 89. Our solution to backups NetworkTopologyStrategy is cool Separate datacenter for backups with RF=1 Beware: tricky Once removed from production performance considerations Application level incremental backups
  • 90. Our solution to backups NetworkTopologyStrategy is cool Separate datacenter for backups with RF=1 Beware: tricky Once removed from production performance considerations Application level incremental backups This week, cassandra level incremental backups
  • 91. Our solution to backups NetworkTopologyStrategy is cool Separate datacenter for backups with RF=1 Beware: tricky Once removed from production performance considerations Application level incremental backups This week, cassandra level incremental backups Still some issues: lots of SSTables
  • 92. Solid state is a game changer
  • 93. Solid state is a game changer Asymmetrically sized datasets
  • 94. Solid state is a game changer Asymmetrically sized datasets I Can Haz superlarge SSD?
  • 95. Solid state is a game changer Asymmetrically sized datasets I Can Haz superlarge SSD? No.
  • 96. Solid state is a game changer Asymmetrically sized datasets I Can Haz superlarge SSD? No. With small disks, on disk data structure size matters a lot
  • 97. Solid state is a game changer Asymmetrically sized datasets I Can Haz superlarge SSD? No. With small disks, on disk data structure size matters a lot Our plan:
  • 98. Solid state is a game changer Asymmetrically sized datasets I Can Haz superlarge SSD? No. With small disks, on disk data structure size matters a lot Our plan: Leveled compaction strategy, new in 1.0
  • 99. Solid state is a game changer Asymmetrically sized datasets I Can Haz superlarge SSD? No. With small disks, on disk data structure size matters a lot Our plan: Leveled compaction strategy, new in 1.0 Hack cassandra to have configurable datadirs per keyspace.
  • 100. Solid state is a game changer Asymmetrically sized datasets I Can Haz superlarge SSD? No. With small disks, on disk data structure size matters a lot Our plan: Leveled compaction strategy, new in 1.0 Hack cassandra to have configurable datadirs per keyspace. Our patch is integrated in Cassandra 1.1
  • 103. Some unpleasant surprises Immaturity. Has anyone written nodetool -h ring?
  • 104. Some unpleasant surprises Immaturity. Has anyone written nodetool -h ring? Broken on disk bloom filters in 0.8. Very painful upgrade to 1.0
  • 105. Some unpleasant surprises Immaturity. Has anyone written nodetool -h ring? Broken on disk bloom filters in 0.8. Very painful upgrade to 1.0 Small disk, high load, very possible to get into an Out Of Disk condition
  • 106. Some unpleasant surprises Immaturity. Has anyone written nodetool -h ring? Broken on disk bloom filters in 0.8. Very painful upgrade to 1.0 Small disk, high load, very possible to get into an Out Of Disk condition Logging is lacking
  • 107. Lessons learned from backup datacenter
  • 108. Lessons learned from backup datacenter Asymmetric cluster sizes are painful.
  • 109. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes
  • 110. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes Repairs that replicate all data 10 times
  • 111. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes Repairs that replicate all data 10 times The workaround: manual repairs
  • 112. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes Repairs that replicate all data 10 times The workaround: manual repairs Remove sstables from broken node (to free up space)
  • 113. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes Repairs that replicate all data 10 times The workaround: manual repairs Remove sstables from broken node (to free up space) Start it to have it take writes while repopulating
  • 114. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes Repairs that replicate all data 10 times The workaround: manual repairs Remove sstables from broken node (to free up space) Start it to have it take writes while repopulating Snapshot and move SSTables from 4 evenly spaced nodes
  • 115. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes Repairs that replicate all data 10 times The workaround: manual repairs Remove sstables from broken node (to free up space) Start it to have it take writes while repopulating Snapshot and move SSTables from 4 evenly spaced nodes Do a full compaction
  • 116. Lessons learned from backup datacenter Asymmetric cluster sizes are painful. 60 production nodes, 6 backup nodes Repairs that replicate all data 10 times The workaround: manual repairs Remove sstables from broken node (to free up space) Start it to have it take writes while repopulating Snapshot and move SSTables from 4 evenly spaced nodes Do a full compaction Do a repair and hope for the best
  • 118. Spot the bug Hector java cassandra driver:
  • 119. Spot the bug Hector java cassandra driver: private AtomicInteger counter = new AtomicInteger(); private Server getNextServer() { counter.compareAndSet(16384, 0); return servers[counter.getAndIncrement() % servers.length]; }
  • 120. Spot the bug Hector java cassandra driver: private AtomicInteger counter = new AtomicInteger(); private Server getNextServer() { counter.compareAndSet(16384, 0); return servers[counter.getAndIncrement() % servers.length]; } Race condition
  • 121. Spot the bug Hector java cassandra driver: private AtomicInteger counter = new AtomicInteger(); private Server getNextServer() { counter.compareAndSet(16384, 0); return servers[counter.getAndIncrement() % servers.length]; } Race condition java.lang.ArrayIndexOutOfBoundsException
  • 122. Spot the bug Hector java cassandra driver: private AtomicInteger counter = new AtomicInteger(); private Server getNextServer() { counter.compareAndSet(16384, 0); return servers[counter.getAndIncrement() % servers.length]; } Race condition java.lang.ArrayIndexOutOfBoundsException After close to 2**31 requests
  • 123. Spot the bug Hector java cassandra driver: private AtomicInteger counter = new AtomicInteger(); private Server getNextServer() { counter.compareAndSet(16384, 0); return servers[counter.getAndIncrement() % servers.length]; } Race condition java.lang.ArrayIndexOutOfBoundsException After close to 2**31 requests Took a few days
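The bug: the `compareAndSet(16384, 0)` reset is racy, so the counter can skip past 16384, never be reset again, and after about 2^31 increments overflow to a negative value, making the modulo expression negative and indexing the array out of bounds. A race-free sketch of the same round-robin selection (one possible fix, not necessarily the one Hector shipped) is to drop the reset entirely and let `Math.floorMod` map any counter value, including negative ones after wraparound, into the valid index range:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of a race-free round-robin server selector.
class RoundRobin {
    private final AtomicInteger counter = new AtomicInteger();
    private final String[] servers;

    RoundRobin(String[] servers) { this.servers = servers; }

    String next() {
        // getAndIncrement eventually wraps to negative values after ~2^31
        // calls; floorMod maps any int into [0, servers.length), so no
        // explicit (and racy) reset is needed.
        return servers[Math.floorMod(counter.getAndIncrement(), servers.length)];
    }
}
```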
  • 124. Thrift payload size limits Communication with Cassandra is based on Thrift. With large mutations (larger than 15 MiB) Thrift drops the underlying TCP connection. Hector considers the connection drop a node-specific problem and retries on all Cassandra nodes, effectively shutting down all Cassandra traffic.
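One client-side defense against this failure mode (our illustration, not a Hector feature) is to split a large batch into chunks whose payload size stays safely under the limit before handing anything to Thrift:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: split a list of column values into batches that each stay
// below a cap chosen under the ~15 MiB limit from the slide.
class MutationBatcher {
    static final int MAX_BYTES = 14 * 1024 * 1024; // safety margin under 15 MiB

    static List<List<byte[]>> split(List<byte[]> values) {
        List<List<byte[]>> batches = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        int size = 0;
        for (byte[] v : values) {
            // Start a new batch when adding this value would exceed the cap.
            if (size + v.length > MAX_BYTES && !current.isEmpty()) {
                batches.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(v);
            size += v.length;
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }
}
```

Note this counts raw value bytes only; real Thrift framing adds per-column overhead, which is why the cap is set with a margin.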
  • 126. Conclusions In the 0.6-1.0 timeframe, developers and operations engineers are needed
  • 127. Conclusions In the 0.6-1.0 timeframe, developers and operations engineers are needed You need to keep an eye on bugs created, be part of the community
  • 128. Conclusions In the 0.6-1.0 timeframe, developers and operations engineers are needed You need to keep an eye on bugs created, be part of the community Exotic stuff (such as asymmetrically sized datacenters) is tricky
  • 129. Conclusions In the 0.6-1.0 timeframe, developers and operations engineers are needed You need to keep an eye on bugs created, be part of the community Exotic stuff (such as asymmetrically sized datacenters) is tricky Lots of things get fixed. You need to keep up with upstream
  • 130. Conclusions In the 0.6-1.0 timeframe, developers and operations engineers are needed You need to keep an eye on bugs created, be part of the community Exotic stuff (such as asymmetrically sized datacenters) is tricky Lots of things get fixed. You need to keep up with upstream You need to integrate with monitoring and graphing
  • 131. Conclusions In the 0.6-1.0 timeframe, developers and operations engineers are needed You need to keep an eye on bugs created, be part of the community Exotic stuff (such as asymmetrically sized datacenters) is tricky Lots of things get fixed. You need to keep up with upstream You need to integrate with monitoring and graphing Consider it a toolkit for constructing solutions.
