2. Who am I?
• I’m Ricardo!
• Lead Engineer at Spotify
• ricardovice on twitter, spotify, about.me, kiva, slideshare, github, bitbucket, delicious…
• Portuguese
• Previously worked in the video streaming industry
• Only discovered Spotify in late 2009
• Joined in 2010
3. spotifiera
spo·ti·fie·ra Verb: to use Spotify; to provide a service free of cost;
4. What’s Spotify all about?
• A big catalogue, tons of music
• Available everywhere
• Great user experience
• More convenient than piracy
• Reliable, high availability
• Scalable for many, many users
But what really got me hooked:
• Free, legal ad-supported service
• Very fast
5. The importance of being fast
• High latency can be a problem, not only in First
Person Shooters
• Slow performance is a major user experience killer
• At Velocity 2009, Eric Schurman (Bing) and Jake
Brutlag (Google Search) showed that increased
latency directly hurt usage and revenue per user[1].
• Latency leads to users leaving, and many won't ever come back
• Users will share their experience with friends
[1] http://radar.oreilly.com/2009/07/velocity-making-your-site-fast.html
6. So how fast is Spotify?
• We monitor playback latency on the client side
• Current median latency to play any track is 265ms
• On average, the human notion of “instant” is
anything under 200ms
• Due to disk seek times, it's at times actually faster to start playing a track from the network than from local disk
• Fewer than 1% of playbacks experience stutter
7. “Spotify is fast due to P2P”
• This is something I read a lot around the web
• P2P does play a crucial role in the picture, but…
• Experience at Spotify showed me that most latency issues are
directly linked to backend problems
• It’s a mistake to think that we could be this fast without a smart and
scalable backend architecture
So let’s give credit where credit is due.
9. Handling growth
Things to keep in mind:
• Scaling is not an exact science
• There is no such thing as a magic formula
• Usage patterns differ
• There is always a limit to what you can handle
• Fail gracefully
• Continuous evolution process
10. Scaling horizontally
• You can always add more machines!
• Stateless services
• Several processes can share memcached (see the sketch after this list)
• Possible to run in “the cloud” (EC2, Rackspace)
• Need some kind of load balancer
• Data sharing/synchronization can be hard
• Complexity: many pieces, maybe hidden SPOFs
• Fundamental to the application’s design
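A minimal cache-aside sketch of how stateless instances might share a cache, assuming a Python service; FakeCache, load_from_backend and the key names are made up for illustration and stand in for a real shared memcached client.

class FakeCache:
    """Stand-in for a shared memcached instance; same get/set shape."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, expire=300):
        self._data[key] = value  # a real cache would honour `expire`

cache = FakeCache()

def load_from_backend(key):
    # Stand-in for the expensive lookup (database, index, another service).
    return "value-for-" + key

def handle_request(key):
    """Stateless handler: check the shared cache, fall back to the backend."""
    value = cache.get(key)
    if value is None:
        value = load_from_backend(key)
        cache.set(key, value)
    return value

print(handle_request("user:42"))  # miss: loaded from backend, then cached
print(handle_request("user:42"))  # hit: served from the shared cache

Because every process talks to the same cache, no request has to stick to a particular instance.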
11. Usage patterns
Typically, some services are more demanding than
others, this can be due to:
• Higher popularity
• Higher complexity
• Low latency expectation
• All combined
12. Decoupling
• Divide and conquer!
• The Unix way
• Resources assigned individually
• Using the right tools to address each problem
• Organization and delegation
• Problems are isolated
• Easier to handle growth
13. Read only services
• The easiest to scale
• Stateless
• Use indices: large, read-optimized data containers (sketched after this list)
• Each node has its local copy
• Data structured according to service
• Updated periodically, during off-peak hours
• Take advantage of OS page cache
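A minimal sketch of what such a node-local, read-optimized container could look like in Python: a sorted file of fixed-width records is mmap'ed and binary-searched, so repeated lookups are served from the OS page cache. The record layout (16-byte key plus 64-bit value) and the file name are assumptions for illustration, not Spotify's actual format.

import mmap
import os
import struct

RECORD = struct.Struct(">16sQ")  # 16-byte key, 64-bit value (e.g. an offset)

def lookup(index_path, key):
    """Binary-search a sorted file of fixed-width records for a 16-byte key."""
    size = os.path.getsize(index_path)
    if size == 0:
        return None
    with open(index_path, "rb") as f:
        # mmap lets the kernel cache hot pages of the index between requests.
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, size // RECORD.size
            while lo < hi:
                mid = (lo + hi) // 2
                k, value = RECORD.unpack_from(mm, mid * RECORD.size)
                if k == key:
                    return value
                if k < key:
                    lo = mid + 1
                else:
                    hi = mid
    return None

# e.g. lookup("tracks.idx", some_16_byte_key) -> stored value, or None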
14. Read-write services
• User-generated content, e.g. playlists
• Hard to ensure consistency of data across instances
Solutions:
• Eventual consistency:
• Reads of just-written data are not guaranteed to be up-to-date
• Locking, atomic operations
• Creating globally unique keys, e.g. usernames (sketched below)
• Transactions, e.g. billing
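A minimal sketch of the unique-key case: claiming a username must be an atomic check-and-set, so two concurrent requests cannot both win. The in-memory dict and lock below are stand-ins for whatever coordinated storage the real service would use.

import threading

class UniqueKeyStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}

    def claim(self, key, owner):
        """Atomically claim `key`; return False if it is already taken."""
        with self._lock:
            if key in self._owners:
                return False
            self._owners[key] = owner
            return True

store = UniqueKeyStore()
print(store.claim("alice", "account-1"))  # True: first claim wins
print(store.claim("alice", "account-2"))  # False: already taken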
16. Finding a service via DNS
Each service has an SRV DNS record:
• One record with the same name for each service instance
• Clients (AP) resolve to find servers providing that service
• Records with the lowest priority value are preferred; among those, one is picked by weighted shuffle (sketched below)
• Clients retry other instances in case of failures
Example SRV record:
_frobnicator._http.example.com. 3600 SRV 10 50 8081 frob1.example.com.
(fields: name, TTL, type, priority, weight, port, host)
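A minimal Python sketch of that selection rule: keep only the records with the lowest priority value, then pick one with probability proportional to its weight. The first record is the example above; the other two instances are hypothetical, and in practice the list would come from an SRV lookup.

import random
from collections import namedtuple

SRV = namedtuple("SRV", "priority weight port host")

records = [
    SRV(10, 50, 8081, "frob1.example.com."),
    SRV(10, 30, 8081, "frob2.example.com."),        # hypothetical extra instance
    SRV(20, 100, 8081, "frob-backup.example.com."), # only used if prio 10 fails
]

def pick_instance(records):
    """Weighted pick among the records sharing the lowest priority."""
    lowest = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == lowest]
    point = random.uniform(0, sum(r.weight for r in candidates))
    for r in candidates:
        point -= r.weight
        if point <= 0:
            return r
    return candidates[-1]

print(pick_instance(records))  # frob1 about 5/8 of the time, frob2 about 3/8

Retrying another instance on failure then just means dropping the failed record and picking again.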
17. Request assignment
• Hardware load balancers
• Round-robin DNS
• Proxy servers
• Sharding:
• Each server/instance is responsible for a subset of the data (sketched below)
• Directs client to instance that has its data
• Easy if nothing is shared
• Hard if you require replication
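A minimal sketch of the naive version of sharding, with made-up instance names: hash the request key and take it modulo the number of instances. It works while the instance list is fixed, but adding or removing an instance remaps almost every key, which is what the DHT on the next slide avoids.

import hashlib

instances = ["shard0.example.com", "shard1.example.com", "shard2.example.com"]

def instance_for(key, instances):
    """Route a request key to the instance responsible for it."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return instances[int.from_bytes(digest, "big") % len(instances)]

print(instance_for("some-user-id", instances))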
18. Sharding using a DHT
Some Spotify services use Dynamo-inspired DHTs[1]:
• Each request has a key
• Each service node is responsible for a range of hash keys
• Data is distributed among service nodes
• Redundancy is ensured by re-hashing and writing to a replica node (ring sketched below)
• Data must be transitioned when the ring changes
[1] http://dl.acm.org/citation.cfm?id=1294281
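A minimal Python sketch of the general technique, not Spotify's implementation: each node owns the key range ending at its token, and a replica is the next node clockwise around the ring. The node names, tokens and example key are made up.

import bisect
import hashlib

def hash_key(key):
    # 128-bit hash, same space as the ring tokens below
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, tokens_to_nodes):
        self._tokens = sorted(tokens_to_nodes)
        self._nodes = [tokens_to_nodes[t] for t in self._tokens]

    def responsible_nodes(self, key, replicas=1):
        """Primary node for `key`, plus the next `replicas` nodes clockwise."""
        i = bisect.bisect_left(self._tokens, hash_key(key)) % len(self._tokens)
        return [self._nodes[(i + n) % len(self._nodes)]
                for n in range(replicas + 1)]

ring = Ring({2**125: "frob1", 2**126: "frob2", 2**127: "frob3"})
print(ring.responsible_nodes("some-user-id", replicas=1))

When a node joins or leaves, only the keys in the affected segments have to move, which is the transition step mentioned above.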
20. Spotify’s DNS powered DHT
Configuration of the DHT:
config._frobnicator._http.example.com. 3600 TXT "slaves=0"
(config.srv_name, TTL, type: no replication)

config._frobnicator._http.example.com. 3600 TXT "slaves=2 redundancy=host"
(config.srv_name, TTL, type: three replicas on separate hosts)

Ring segment, one TXT record per node:
tokens.8081.frob1.example.com. 3600 TXT "00112233445566778899aabbccddeeff"
(tokens.port.host, TTL, type: last key of the node's ring segment)
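A minimal sketch of how a client might turn records like these into a ring description: the config TXT yields the replica count and each tokens TXT yields the last key of one node's segment. The record values are hard-coded here (the second tokens entry is made up); a real client would fetch them with TXT lookups.

def parse_config(txt):
    """'slaves=2 redundancy=host' -> {'slaves': '2', 'redundancy': 'host'}"""
    return dict(item.split("=", 1) for item in txt.split())

def parse_tokens(records):
    """{'tokens.<port>.<host>': <hex last key>} -> sorted [(token, host, port)]"""
    ring = []
    for name, last_key in records.items():
        _, port, host = name.split(".", 2)
        ring.append((int(last_key, 16), host, int(port)))
    return sorted(ring)

config = parse_config("slaves=2 redundancy=host")
ring = parse_tokens({
    "tokens.8081.frob1.example.com.": "00112233445566778899aabbccddeeff",
    "tokens.8081.frob2.example.com.": "7fffffffffffffffffffffffffffffff",  # made up
})
print(config)
print(ring)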
21. And if none of this works for you
Remember
/dev/null is
web scale!!
http://www.xtranormal.com/watch/6995033/
22. Questions?
get in touch!
@ricardovice
ricardo@spotify.com