10. About this talk
An introduction to Spotify, our service, and our persistent storage needs
What Cassandra brings
What we have learned
What I would have liked to have known a year ago
Not a comparison between different NoSQL solutions
Not a hands-on introduction to Cassandra
We work with physical hardware in production
24. Spotify — all music, all the time
A better user experience than file sharing.
Native desktop and mobile clients.
Custom backend, built for performance and scalability.
12 markets. More than ten million users.
3 datacenters.
Tens of gigabits per second pushed per datacenter.
Backend systems that support a large set of innovative features.
39. Innovative features in practice
Playlist
Should be simple, right?
A named list of tracks
It gets more complicated:
Keep multiple devices in sync
Support nested playlists
Offline editing on multiple devices
Changes pushed to connected devices
Scale: more than half a billion lists currently in the system
Peak traffic around 10 kHz (roughly ten thousand requests per second)
Resulting storage requirements (sketched below):
Full history
Really fast access to the latest version number and content
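A hypothetical sketch (not Spotify's actual data model) of how those two storage requirements can coexist: keep every change keyed by an ever-increasing version number, so the full history is an ordered scan while the latest version number and content are single cheap lookups. PlaylistHistory and the byte[] change payload are illustrative names only.

import java.util.concurrent.ConcurrentSkipListMap;

class PlaylistHistory {
    // version -> serialized change, kept in version order so the full history is an ordered scan
    private final ConcurrentSkipListMap<Long, byte[]> changes = new ConcurrentSkipListMap<>();

    void append(long version, byte[] change) {
        changes.put(version, change);              // history is append-only, never rewritten
    }

    long latestVersion() {
        return changes.lastKey();                  // fast access to the latest version number...
    }

    byte[] latestContent() {
        return changes.lastEntry().getValue();     // ...and to the latest content
    }
}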
50. Suggested solutions
Flat files
We don't need ACID
Linux page cache kicks ass.
(Not really)
SQL
Tried and true. Facebook does this.
Simple key-value store
Tokyo Cabinet, some prior experience
Clustered key-value store
Evaluated a lot of options; the end-game contestants were HBase and Cassandra
62. Enter Cassandra
Solves a large subset of storage-related problems
Sharding, replication
No single point of failure
Ability to make the performance/reliability tradeoff per request (see the quorum sketch below)
Free software
Active community, commercial backing
66 + 18 + 9 + 28 production nodes
About twenty nodes for various testing clusters
Datasets ranging from 8 TB down to a few gigabytes.
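A minimal sketch of the arithmetic behind that per-request tradeoff (standard quorum reasoning, not Spotify-specific code): with replication factor RF, a read answered by R replicas and a write acknowledged by W replicas are guaranteed to overlap on at least one replica whenever R + W > RF, so each request can choose between latency and consistency.

class QuorumMath {
    // True when a read of r replicas must overlap a write acknowledged by w replicas.
    static boolean stronglyConsistent(int rf, int r, int w) {
        return r + w > rf;
    }

    public static void main(String[] args) {
        System.out.println(stronglyConsistent(3, 2, 2)); // QUORUM/QUORUM at RF=3: true
        System.out.println(stronglyConsistent(3, 1, 1)); // ONE/ONE at RF=3: false (fast, may be stale)
    }
}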
63. Cassandra key concepts, on a node
Log-structured storage
Sorted String Table — SSTable
Immutable files on disk
Compaction — many-to-one merge sort (sketched below)
[Diagram: writes land in a Memtable, which is flushed to immutable SSTables on disk]
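A minimal sketch of the many-to-one merge behind compaction (illustrative only, nothing like Cassandra's real implementation): every SSTable is already sorted by key, so merging several of them amounts to a merge where newer values win, producing one new immutable SSTable.

import java.util.*;

class CompactionSketch {
    // sstables are ordered oldest to newest; each maps row key -> value and is already sorted.
    static NavigableMap<String, String> compact(List<SortedMap<String, String>> sstables) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> sstable : sstables) {
            merged.putAll(sstable);   // later (newer) SSTables overwrite older values per key
        }
        return merged;                // written back out as a single new immutable SSTable
    }
}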
64. Cassandra key concepts, in a cluster
Nodes form a ring, ordered by key (sketched below)
All data is typically written to several nodes — the Replication Factor
Rings can be expanded in production
Gossip detects nodes being up / down / joining
Anti-entropy mechanisms repair inconsistencies between replicas
Many read operations can be done sequentially
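A simplified sketch of how a key finds its replicas on the ring (hypothetical tokens and node names, not Cassandra's real partitioner code): the first replica is the node whose token follows the key's token, and the remaining RF - 1 replicas are the next nodes walking clockwise.

import java.util.*;

class RingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();   // token -> node

    void addNode(long token, String name) {
        ring.put(token, name);
    }

    // Pick up to rf nodes, starting at the first token >= keyToken and wrapping around.
    List<String> replicasFor(long keyToken, int rf) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) return replicas;
        Long token = ring.ceilingKey(keyToken);
        if (token == null) token = ring.firstKey();                // wrap past the highest token
        while (replicas.size() < rf && replicas.size() < ring.size()) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) token = ring.firstKey();            // keep walking clockwise
        }
        return replicas;
    }
}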
76. Cassandra, winning!
Major upgrades without service interruptions (in theory)
Crazy fast writes
Not just because you have a hardware RAID card that is good at lying to you
Somewhat predictable number of seeks needed for a read
Knows that sequential I/O is faster than random I/O
In case of inconsistencies, knows what to do
Replacing broken nodes is straightforward
Cross-datacenter replication support
Tinker friendly
Readable code
85. Let me tell you a story
Latest stable kernel from Debian Squeeze, 2.6.32-5
What happens after 209 days of uptime?
Load average around 120.
No CPU activity reported by top
Mattias de Zalenski:
log((209 days) / (1 nanosecond)) / log(2) = 54.0034557
(2^54) nanoseconds = 208.499983 days
Somewhere, nanosecond values are shifted ten bits?
Downtime for payment
Downtime for account creation
No downtime for Cassandra-backed systems
91. Backups
A few terabytes of live data across many nodes. Painful.
Inefficient. Copying the on-disk structure means at least 3 times the data (one copy per replica)
Non-compacted. Possibly a few tens of old versions lying around.
Initially, only full backups (pre 0.8)
97. Our solution to backups
Separate datacenter for backups, with RF=1
Beware: tricky
Once removed from production performance considerations
Application-level incremental backups
Soon: Cassandra incremental backups
110. Solid state is a game changer
Large datasets, light read load
Small datasets, heavy read load
I Can Haz superlarge SSD?
No.
With small disks, on-disk data structure size matters a lot
Our plan:
Leveled compaction strategy, new in 1.0
Hack Cassandra to make data directories configurable per keyspace.
Our patch is integrated in Cassandra 1.1
116. Some unpleasant surprises
Immaturity
Hector: mutations larger than 15 MB cause connection drops in Thrift.
Broken on-disk bloom filters in 0.8. Very painful upgrade to 1.0
Small disks and high load: very possible to get into an out-of-disk condition
Logging is lacking
123. Spot the bug
Hector Java Cassandra driver:
private AtomicInteger counter = new AtomicInteger();
private Server getNextServer() {
    counter.compareAndSet(16384, 0);
    return servers[counter.getAndIncrement() % servers.length];
}
Race condition
java.lang.ArrayIndexOutOfBoundsException
After close to 2**31 requests
Took about 5 days
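What actually goes wrong: with concurrent calls the counter can jump from 16383 straight past 16384 between two checks, so compareAndSet(16384, 0) never succeeds again; the counter then keeps climbing for roughly 2^31 requests (about 5,000 per second over those 5 days), wraps negative, and Java's % of a negative number yields a negative index, hence the ArrayIndexOutOfBoundsException. A minimal sketch of one overflow-safe variant (illustrative only, not the fix Hector shipped; assumes Java 8+ for Math.floorMod, with Server and servers as in the snippet above):

import java.util.concurrent.atomic.AtomicInteger;

private final AtomicInteger counter = new AtomicInteger();

private Server getNextServer() {
    int next = counter.getAndIncrement();                  // free to wrap negative after 2^31 calls
    return servers[Math.floorMod(next, servers.length)];   // index is always in [0, servers.length)
}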
132. Conclusions
In the 0.6-1.0 timeframe, both development engineers and operations are needed
You need to keep an eye on the bugs being filed and be part of the community
Exotic stuff (such as asymmetrically sized datacenters) is tricky
Lots of things get fixed. You need to keep up with upstream
You need to integrate with monitoring and graphing
Consider it a toolkit for constructing solutions.