Spotify Cassandra London

A presentation from the Cassandra Europe conference about Cassandra use at Spotify.


Presentation Transcript

  • Cassandra at Spotify, 28th of March 2012
  • About this talk
    An introduction to Spotify, our service and our persistent storage needs
    What Cassandra brings
    What we have learned
    What I would have liked to have known a year ago
    Not a comparison between different NoSQL solutions
    The real reason: yes, we are hiring.
  • Noa Resare
    Stockholm, Sweden
    Service Reliability Engineering
    noa@spotify.com
    @blippie
  • Spotify — all music, all the time
    A better user experience than file sharing.
    Native desktop and mobile clients.
    Custom backend, built for performance and scalability.
    13 markets. More than ten million users.
    3 datacenters.
    Tens of gigabits of data pushed per datacenter.
    Backend systems that support a large set of innovative features.
  • Innovative features in practice: Playlist
    A named list of tracks
    Keep multiple devices in sync
    Support nested playlists
    Offline editing, pubsub
    Scale: more than half a billion lists currently in the system
    About 10 kHz at peak traffic
    Result: accidentally implemented a VCS (a minimal sketch of the idea follows below)
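
    The "accidentally implemented a VCS" remark can be made concrete with a small sketch: a playlist becomes a head pointer plus an append-only log of changes, which is what lets a client at version N ask for "everything after N". The class and field names below are illustrative only, not Spotify's actual data model (Java):

        import java.util.List;

        // Illustrative only: a playlist modelled as a head version plus an
        // append-only change log, so clients can sync by fetching the changes
        // made after the last version they have seen.
        final class PlaylistHead {
            final String playlistId;
            final long currentVersion;   // latest version known to the backend

            PlaylistHead(String playlistId, long currentVersion) {
                this.playlistId = playlistId;
                this.currentVersion = currentVersion;
            }
        }

        final class PlaylistChange {
            final long version;              // version produced by applying this change
            final List<String> addedTracks;  // track ids added in this change
            final List<String> removedTracks;

            PlaylistChange(long version, List<String> addedTracks, List<String> removedTracks) {
                this.version = version;
                this.addedTracks = addedTracks;
                this.removedTracks = removedTracks;
            }
        }
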
  • Suggested solutions
    Flat files: we don't need ACID, and the Linux page cache kicks ass. (Not really.)
    SQL: tried and true; Facebook does this.
    Simple key-value store: Tokyo Cabinet, where we had some experience.
    Clustered key-value store: evaluated a lot of options; the end-game contestants were HBase and Cassandra.
  • Enter Cassandra
    Solves a large subset of storage-related problems: sharding, replication
    No single point of failure
    Free software
    Active community, commercial backing
    66 + 18 + 9 + 28 production nodes (121 in total)
    About twenty nodes for various testing clusters
    Datasets ranging from 8 TB down to a few gigabytes
  • Cassandra, winning!
    Major upgrades without service interruptions (in theory)
    Crazy fast writes, and not just because you have a hardware RAID card that is good at lying to you
    Uses the knowledge that sequential I/O is faster than random I/O
    In case of inconsistencies, knows what to do
    Cross-datacenter replication support
    Tinker friendly, readable code
  • Cassandra flexibility for Playlist
    The main use cases for playlist:
    Get me all changes since version N of playlist P
    Apply the following changes on top of version M of playlist Q
    This translates to two column families, head and change
    Asymmetric sizes
    Neat trick: read change at consistency level ONE, falling back to LOCAL_QUORUM (see the sketch below)
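
    A minimal sketch of the fallback trick, assuming a hypothetical ChangeStore client rather than the real Hector or Spotify API: read the change column family at ONE first (cheap, and usually sufficient for an append-only log), and only retry at LOCAL_QUORUM if the replica we hit turns out to be behind the head version we already know about (Java):

        import java.util.List;

        // Hypothetical stand-ins; not the actual Hector or Spotify interfaces.
        enum Consistency { ONE, LOCAL_QUORUM }

        final class Change {
            final long version;
            Change(long version) { this.version = version; }
        }

        interface ChangeStore {
            // All changes to the playlist after sinceVersion, read at the given level.
            List<Change> changesSince(String playlistId, long sinceVersion, Consistency level);
        }

        final class PlaylistReader {
            private final ChangeStore store;

            PlaylistReader(ChangeStore store) { this.store = store; }

            List<Change> changesSince(String playlistId, long sinceVersion, long headVersion) {
                // Cheap read first.
                List<Change> changes = store.changesSince(playlistId, sinceVersion, Consistency.ONE);
                if (reachesHead(changes, headVersion)) {
                    return changes;
                }
                // The replica we hit was behind; pay for a stronger read in the local DC.
                return store.changesSince(playlistId, sinceVersion, Consistency.LOCAL_QUORUM);
            }

            // Assumed staleness check: the returned changes must reach the head
            // version we already know (e.g. from the head column family); if in
            // doubt this errs on the side of the stronger read.
            private static boolean reachesHead(List<Change> changes, long headVersion) {
                return !changes.isEmpty() && changes.get(changes.size() - 1).version >= headVersion;
            }
        }
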
  • Let me tell you a story
    Latest stable kernel from Debian Squeeze, 2.6.32-5
    What happens after 209 days of uptime?
    Load average around 120, yet no CPU activity reported by top
    Mattias de Zalenski: log((209 days) / (1 nanosecond)) / log(2) = 54.0034557, and 2^54 nanoseconds = 208.499983 days. Somewhere, nanosecond values are shifted ten bits? (2^64 / 2^10 = 2^54, so a 64-bit nanosecond counter that has been shifted up ten bits wraps after roughly 208.5 days.)
    Downtime for payment
    Downtime for account creation
    No downtime for the Cassandra-backed systems
  • Backups
    A few terabytes of live data, many nodes. Painful.
    Inefficient: a copy of the on-disk structure is at least 3 times the size of the data
    Non-compacted: possibly a few tens of old versions
    Pulling data off nodes evicts hot data from the page cache
    Initially, only full backups (pre 0.8)
  • Our solution to backups
    NetworkTopologyStrategy is cool
    Separate datacenter for backups with RF=1 (see the sketch below)
    Beware: tricky
    Once removed from production performance considerations
    Application-level incremental backups
    This week, Cassandra-level incremental backups
    Still some issues: lots of SSTables
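
    For illustration only: the same layout expressed with today's CQL and the DataStax Java driver, both of which postdate this talk (the clusters described here were defined with thrift-era tooling, and the datacenter names below are made up). The point is simply that NetworkTopologyStrategy lets one "datacenter" carry a single replica whose only job is to be backed up:

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Session;

        public final class BackupKeyspaceSketch {
            public static void main(String[] args) {
                try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                     Session session = cluster.connect()) {
                    // Three replicas in each serving datacenter, one lonely replica
                    // in a "backup" datacenter that only takes writes and gets snapshotted.
                    session.execute(
                        "CREATE KEYSPACE IF NOT EXISTS playlist WITH replication = {"
                        + " 'class': 'NetworkTopologyStrategy',"
                        + " 'sto': 3, 'lon': 3, 'backup': 1 }");
                }
            }
        }
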
  • Solid state is a game changer
    Asymmetrically sized datasets
    I Can Haz superlarge SSD? No.
    With small disks, on-disk data structure size matters a lot
    Our plan:
    Leveled compaction strategy, new in 1.0
    Hack Cassandra to have configurable data directories per keyspace
    Our patch is integrated in Cassandra 1.1
  • Some unpleasant surprises
    Immaturity. Has anyone written nodetool -h ring?
    Broken on-disk bloom filters in 0.8; a very painful upgrade to 1.0
    With small disks and high load it is very possible to get into an out-of-disk condition
    Logging is lacking
  • Lessons learned from the backup datacenter
    Asymmetric cluster sizes are painful: 60 production nodes, 6 backup nodes
    Repairs that replicate all data 10 times
    The workaround, manual repairs:
    Remove SSTables from the broken node (to free up space)
    Start it so that it takes writes while repopulating
    Snapshot and move SSTables from 4 evenly spaced nodes
    Do a full compaction
    Do a repair and hope for the best
  • Spot the bug
    Hector Java Cassandra driver:

        private AtomicInteger counter = new AtomicInteger();

        private Server getNextServer() {
            counter.compareAndSet(16384, 0);
            return servers[counter.getAndIncrement() % servers.length];
        }

    Race condition: java.lang.ArrayIndexOutOfBoundsException after close to 2^31 requests. Took a few days to hit.
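
    Why it breaks: under concurrent calls the counter can jump straight past 16384, so the compareAndSet reset never fires again; after roughly 2^31 increments the counter overflows to a negative value, and a negative number modulo servers.length is negative, hence the ArrayIndexOutOfBoundsException. A sketch of one possible fix (not necessarily the patch Hector actually shipped) is to make the index non-negative regardless of overflow:

        import java.util.concurrent.atomic.AtomicInteger;

        // Sketch only. Server is a stand-in for Hector's host type.
        final class RoundRobinSelector {
            private final AtomicInteger counter = new AtomicInteger();
            private final Server[] servers;

            RoundRobinSelector(Server[] servers) { this.servers = servers; }

            Server getNextServer() {
                // Clearing the sign bit keeps the index valid even after the
                // counter wraps past Integer.MAX_VALUE, with no reset needed.
                int next = counter.getAndIncrement() & Integer.MAX_VALUE;
                return servers[next % servers.length];
            }
        }
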
  • Thrift payload size limits
    Communication with Cassandra is based on Thrift
    For large mutations, larger than 15 MiB, Thrift drops the underlying TCP connection
    Hector considers the connection drop a node-specific problem and retries on all Cassandra nodes
    Effectively shutting down all Cassandra traffic
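
    One defensive pattern against this failure mode (a sketch, not something the talk prescribes, and assuming the client can estimate each mutation's serialized size) is to split large batches so that no single Thrift call comes anywhere near the limit:

        import java.util.ArrayList;
        import java.util.List;

        // Sketch: chunk a large batch of mutations so each Thrift call stays
        // under a conservative payload budget. Mutation is a hypothetical type.
        interface Mutation {
            long estimatedSizeInBytes();
        }

        final class MutationBatcher {
            static final long MAX_BATCH_BYTES = 8L * 1024 * 1024;  // well under the ~15 MiB limit

            static List<List<Mutation>> split(List<Mutation> mutations) {
                List<List<Mutation>> batches = new ArrayList<>();
                List<Mutation> current = new ArrayList<>();
                long currentBytes = 0;
                for (Mutation m : mutations) {
                    long size = m.estimatedSizeInBytes();
                    if (!current.isEmpty() && currentBytes + size > MAX_BATCH_BYTES) {
                        batches.add(current);
                        current = new ArrayList<>();
                        currentBytes = 0;
                    }
                    current.add(m);
                    currentBytes += size;
                }
                if (!current.isEmpty()) {
                    batches.add(current);
                }
                return batches;
            }
        }
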
  • Conclusions
    In the 0.6-1.0 timeframe, you need both developers and operations engineers
    You need to keep an eye on the bugs being created and be part of the community
    Exotic stuff (such as asymmetrically sized datacenters) is tricky
    Lots of things get fixed; you need to keep up with upstream
    You need to integrate with monitoring and graphing
    Consider it a toolkit for constructing solutions.
  • Questions? Answers.