10. About this talk
An introduction to Spotify, our service, and our persistent storage needs
What Cassandra brings
What we have learned
What I would have liked to have known a year ago
Not a comparison between different NoSQL solutions
Not a hands-on introduction to Cassandra
We work with physical hardware in production
24. Spotify — all music, all the time
A better user experience than file sharing.
Native desktop and mobile clients.
Custom backend, built for performance and scalability.
12 markets. More than ten million users.
3 datacenters.
Tens of gigabits per second pushed per datacenter.
Backend systems that support a large set of innovative features.
39. Innovative features in practice
Playlist
Should be simple, right?
A named list of tracks
It gets more complicated:
Keep multiple devices in sync
Support nested playlists
Offline editing on multiple devices
Changes pushed to connected devices
Scale: more than half a billion lists currently in the system
Peak traffic around 10 kHz (roughly ten thousand requests per second)
Resulting storage requirements (sketched below):
Full history
Really fast access to the latest version number and content
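A hypothetical sketch (not Spotify's actual data model) of how those two storage requirements can coexist: keep every change keyed by an ever-increasing version number, so the full history is an ordered scan while the latest version number and content are single cheap lookups. PlaylistHistory and the byte[] change payload are illustrative names only.

import java.util.concurrent.ConcurrentSkipListMap;

class PlaylistHistory {
    // version -> serialized change, kept in version order so the full history is an ordered scan
    private final ConcurrentSkipListMap<Long, byte[]> changes = new ConcurrentSkipListMap<>();

    void append(long version, byte[] change) {
        changes.put(version, change);              // history is append-only, never rewritten
    }

    long latestVersion() {
        return changes.lastKey();                  // fast access to the latest version number...
    }

    byte[] latestContent() {
        return changes.lastEntry().getValue();     // ...and to the latest content
    }
}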
50. Suggested solutions
Flat files
We don't need ACID
Linux page cache kicks ass.
(Not really)
SQL
Tried and true. Facebook does this.
Simple key-value store
Tokyo Cabinet, some prior experience
Clustered key-value store
Evaluated a lot of options; the end-game contestants were HBase and Cassandra
62. Enter Cassandra
Solves a large subset of storage-related problems
Sharding, replication
No single point of failure
Ability to make the performance/reliability tradeoff per request (see the quorum sketch below)
Free software
Active community, commercial backing
66 + 18 + 9 + 28 production nodes
About twenty nodes for various testing clusters
Datasets ranging from 8 TB down to a few gigabytes.
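A minimal sketch of the arithmetic behind that per-request tradeoff (standard quorum reasoning, not Spotify-specific code): with replication factor RF, a read answered by R replicas and a write acknowledged by W replicas are guaranteed to overlap on at least one replica whenever R + W > RF, so each request can choose between latency and consistency.

class QuorumMath {
    // True when a read of r replicas must overlap a write acknowledged by w replicas.
    static boolean stronglyConsistent(int rf, int r, int w) {
        return r + w > rf;
    }

    public static void main(String[] args) {
        System.out.println(stronglyConsistent(3, 2, 2)); // QUORUM/QUORUM at RF=3: true
        System.out.println(stronglyConsistent(3, 1, 1)); // ONE/ONE at RF=3: false (fast, may be stale)
    }
}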
63. Cassandra key concepts, on a node
Log-structured storage
Sorted String Table — SSTable
Immutable files on disk
Compaction — many-to-one merge sort (sketched below)
[Diagram: writes land in a Memtable, which is flushed to immutable SSTables on disk]
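A minimal sketch of the many-to-one merge behind compaction (illustrative only, nothing like Cassandra's real implementation): every SSTable is already sorted by key, so merging several of them amounts to a merge where newer values win, producing one new immutable SSTable.

import java.util.*;

class CompactionSketch {
    // sstables are ordered oldest to newest; each maps row key -> value and is already sorted.
    static NavigableMap<String, String> compact(List<SortedMap<String, String>> sstables) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> sstable : sstables) {
            merged.putAll(sstable);   // later (newer) SSTables overwrite older values per key
        }
        return merged;                // written back out as a single new immutable SSTable
    }
}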
64. Cassandra key concepts, in a cluster
Nodes form a ring, ordered by key (sketched below)
All data is typically written to several nodes — the Replication Factor
Rings can be expanded in production
Gossip detects nodes being up / down / joining
Anti-entropy mechanisms repair inconsistencies between replicas
Many read operations can be done sequentially
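A simplified sketch of how a key finds its replicas on the ring (hypothetical tokens and node names, not Cassandra's real partitioner code): the first replica is the node whose token follows the key's token, and the remaining RF - 1 replicas are the next nodes walking clockwise.

import java.util.*;

class RingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();   // token -> node

    void addNode(long token, String name) {
        ring.put(token, name);
    }

    // Pick up to rf nodes, starting at the first token >= keyToken and wrapping around.
    List<String> replicasFor(long keyToken, int rf) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) return replicas;
        Long token = ring.ceilingKey(keyToken);
        if (token == null) token = ring.firstKey();                // wrap past the highest token
        while (replicas.size() < rf && replicas.size() < ring.size()) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) token = ring.firstKey();            // keep walking clockwise
        }
        return replicas;
    }
}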
76. Cassandra, winning!
Major upgrades without service interruptions (in theory)
Crazy fast writes
Not just because you have a hardware RAID card that is good at lying to you
Somewhat predictable number of seeks needed for a read
Knows that sequential I/O is faster than random I/O
In case of inconsistencies, knows what to do
Replacing broken nodes is straightforward
Cross-datacenter replication support
Tinker friendly
Readable code
85. Let me tell you a story
Latest stable kernel from Debian Squeeze, 2.6.32-5
What happens after 209 days of uptime?
Load average around 120.
No CPU activity reported by top
Mattias de Zalenski:
log((209 days) / (1 nanosecond)) / log(2) = 54.0034557
(2^54) nanoseconds = 208.499983 days
Somewhere, nanosecond values are shifted ten bits?
Downtime for payment
Downtime for account creation
No downtime for Cassandra-backed systems
91. Backups
A few terabytes of live data across many nodes. Painful.
Inefficient. Copying the on-disk structure means at least 3 times the data (one copy per replica)
Non-compacted. Possibly a few tens of old versions lying around.
Initially, only full backups (pre 0.8)
97. Our solution to backups
Separate datacenter for backups, with RF=1
Beware: tricky
Once removed from production performance considerations
Application-level incremental backups
Soon: Cassandra incremental backups
110. Solid state is a game changer
Large datasets, light read load
Small datasets, heavy read load
I Can Haz superlarge SSD?
No.
With small disks, on-disk data structure size matters a lot
Our plan:
Leveled compaction strategy, new in 1.0
Hack Cassandra to make data directories configurable per keyspace.
Our patch is integrated in Cassandra 1.1
116. Some unpleasant surprises
Immaturity
Hector: mutations larger than 15 MB cause connection drops in Thrift.
Broken on-disk bloom filters in 0.8. Very painful upgrade to 1.0
Small disks and high load: very possible to get into an out-of-disk condition
Logging is lacking
123. Spot the bug
Hector Java Cassandra driver:
private AtomicInteger counter = new AtomicInteger();
private Server getNextServer() {
    counter.compareAndSet(16384, 0);
    return servers[counter.getAndIncrement() % servers.length];
}
Race condition
java.lang.ArrayIndexOutOfBoundsException
After close to 2**31 requests
Took about 5 days
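What actually goes wrong: with concurrent calls the counter can jump from 16383 straight past 16384 between two checks, so compareAndSet(16384, 0) never succeeds again; the counter then keeps climbing for roughly 2^31 requests (about 5,000 per second over those 5 days), wraps negative, and Java's % of a negative number yields a negative index, hence the ArrayIndexOutOfBoundsException. A minimal sketch of one overflow-safe variant (illustrative only, not the fix Hector shipped; assumes Java 8+ for Math.floorMod, with Server and servers as in the snippet above):

import java.util.concurrent.atomic.AtomicInteger;

private final AtomicInteger counter = new AtomicInteger();

private Server getNextServer() {
    int next = counter.getAndIncrement();                  // free to wrap negative after 2^31 calls
    return servers[Math.floorMod(next, servers.length)];   // index is always in [0, servers.length)
}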
132. Conclusions
In the 0.6-1.0 timeframe, both development engineers and operations are needed
You need to keep an eye on the bugs being filed and be part of the community
Exotic stuff (such as asymmetrically sized datacenters) is tricky
Lots of things get fixed. You need to keep up with upstream
You need to integrate with monitoring and graphing
Consider it a toolkit for constructing solutions.