DiscoRank: optimizing discoverability on SoundCloud

These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013.
The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/

Published in: Technology

  1. DiscoRank: Optimizing Discoverability on SoundCloud
     Amélie Anglade
  2. • Developer at SoundCloud
     • SoundCloud is the world’s largest social sound platform
     • Academic background in Music Information Retrieval (MIR)
     • Design, prototype and implement Machine Learning algorithms for music discovery
  3. DISCOVERABILITY?
  4. PAGERANK
  5. WEB AND PAGERANK
     • The web is a graph:
       • nodes = web pages
       • edges = hyperlinks
     • The (Page)rank of a node depends on the link structure of the graph
  6. RANDOM SURFER
  7-8. RANDOM SURFER
     (Diagram: nodes A, B, C, D; edges labeled 1/3, 1/3, 1/3)
  9. RANDOM SURFER
     (Diagram: nodes A, B, C, D, E)
     Nodes visited more often:
     • Nodes with many links
     • Coming from frequently visited nodes
  10-15. COMPUTING THE PAGERANK
     (Diagram: nodes A, B, C, D, E)
     • Adjacency matrix A
     • Transition probability matrix M
     • Probability distribution of surfer’s position
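The step from the adjacency matrix A to the transition probability matrix M can be sketched as a row normalization. This is a minimal illustration, not the code from the talk: the 4-node graph is hypothetical (the matrices on the slides were not preserved in the transcript), and the uniform row used for a node without out-links is an assumption.

```python
def transition_matrix(adjacency):
    """Row-normalize a 0/1 adjacency matrix into transition probabilities."""
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Assumption: a node without out-links jumps to any node uniformly.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([value / out_degree for value in row])
    return matrix


# Hypothetical graph: node 0 links to nodes 1, 2 and 3; each of them links back to 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
M = transition_matrix(A)
for row in M:
    print(row)
```

Each row of M sums to 1, so row i can be read as the surfer's next-step distribution when standing on node i.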
  16-18. TELEPORT
     (Diagram: nodes A, B, C, D, E)
  19. TELEPORT
     (Diagram: nodes A, B, C, D, E; each teleport edge labeled 1/N)
     If the graph has N nodes, the probability of teleporting to any given node (including the current one) is 1/N.
  20. TELEPORT
     (Diagram: nodes A, B, C, D, E; teleport edges labeled 1/N, branches labeled α and 1 - α)
     At a regular node: invoke the teleport operation with probability α and the standard random walk with probability (1 - α).
  21. COMPUTING THE PAGERANK
     The probability distribution of the surfer at any time is a vector.
     That vector converges to a steady state: the PageRank vector.
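The convergence to a steady state can be illustrated with a plain power iteration. This is a sketch under stated assumptions: the transition matrix `M`, the value `alpha=0.15`, the tolerance and the function name `pagerank` are all illustrative choices, not values given in the talk. Here α is the teleport probability, as on the teleport slide.

```python
def pagerank(matrix, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate the surfer's distribution until it stops changing.

    `matrix[i][j]` is the probability of walking from node i to node j.
    With probability `alpha` the surfer teleports to a uniformly chosen node.
    """
    n = len(matrix)
    rank = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_rank = []
        for j in range(n):
            walk = sum(rank[i] * matrix[i][j] for i in range(n))
            new_rank.append(alpha / n + (1 - alpha) * walk)
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical row-stochastic matrix: node 0 is a hub linked to by all others.
M = [
    [0.0, 1 / 3, 1 / 3, 1 / 3],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
ranks = pagerank(M)
print(ranks)
```

In this toy graph the hub node ends up with the largest share, which matches the intuition that nodes linked from frequently visited nodes are visited more often.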
  22. PAGERANK EQUATION
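The equation shown on this slide did not survive in the transcript. For reference, the standard formulation that is consistent with the teleport and steady-state slides can be written as follows. The notation is an assumption: π is the surfer's probability distribution as a row vector, M is the row-stochastic transition probability matrix, N is the number of nodes, α is the teleport probability, and 1 is the all-ones column vector of length N.

```latex
\pi^{(t+1)} = \pi^{(t)} G,
\qquad
G = \frac{\alpha}{N}\,\mathbf{1}\mathbf{1}^{\top} + (1 - \alpha)\,M,
\qquad
\pi = \pi\,G \quad \text{(steady state: the PageRank vector)}
```

The first term of G is the teleport step (every node reached with probability 1/N), the second is the standard random walk.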
  23. SOUNDCLOUD DISCORANK
  24. DISCORANK
     (Diagram: nodes A, B, C, D, E of types User, User, Track, Playlist; edges labeled favorite, follow, featured in)
  25. UNIVERSAL SEARCH
     • Search across People, Sounds, Sets, Groups
     • One unique rank vector that contains all entities
     • Weight the links based on the type of event:
       • User favorites Track
       • Track is featured in Playlist
       • ...
     • New big (but sparse) adjacency matrix:
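Weighting links by event type in a single graph of mixed entities might look like the sketch below. The weight values, the entity id strings and the name `build_weighted_graph` are hypothetical; the talk does not give the actual weights or data model. A nested dictionary stands in for the sparse adjacency matrix, storing only the non-zero entries.

```python
from collections import defaultdict

# Hypothetical weights per event type; the real values are not given in the talk.
EVENT_WEIGHTS = {"follow": 1.0, "favorite": 2.0, "featured_in": 3.0}


def build_weighted_graph(events):
    """Build a sparse weighted adjacency map: source -> {target: weight}."""
    graph = defaultdict(lambda: defaultdict(float))
    for source, event_type, target in events:
        # Several events between the same pair of entities add up.
        graph[source][target] += EVENT_WEIGHTS[event_type]
    return graph


events = [
    ("user:1", "follow", "user:2"),
    ("user:1", "favorite", "track:7"),
    ("user:2", "favorite", "track:7"),
    ("track:7", "featured_in", "playlist:3"),
]
graph = build_weighted_graph(events)
for source, targets in graph.items():
    print(source, dict(targets))
```

Users, tracks and playlists share one id space here, which is what allows a single rank vector to cover all entity types.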
  26. BACK TO EXPLORE
     • How do we identify content that is trending?
     • The more recent a listen, favorite, etc. (event), the higher the weight
     • Multiply each event (= edge) by a time decay:
     • New adjacency matrix:
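The decay formula from this slide is missing from the transcript, so the following is only one plausible shape for it: an exponential decay with a half-life. The half-life of 7 days, the use of Unix timestamps in seconds and the name `decayed_weight` are all assumptions made for illustration.

```python
import math

SECONDS_PER_DAY = 86400.0


def decayed_weight(base_weight, event_time, now, half_life_days=7.0):
    """Scale an edge weight down by how old the event is.

    Assumption: exponential decay, so the weight halves every `half_life_days`.
    """
    age_days = (now - event_time) / SECONDS_PER_DAY
    return base_weight * math.pow(0.5, age_days / half_life_days)


now = 100 * 86400  # arbitrary "current" timestamp, in seconds
fresh = decayed_weight(2.0, now, now)
week_old = decayed_weight(2.0, now - 7 * 86400, now)
month_old = decayed_weight(2.0, now - 28 * 86400, now)
print(fresh, week_old, month_old)
```

Any monotonically decreasing function of the event age would serve the same purpose of favoring recent events; the exponential form is simply a common choice.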
  27. PERFORMANCE OPTIMIZATION
  28. A VERY LARGE GRAPH
     • Millions of entities (= nodes) and events (= edges)
     • First DiscoRank: several hours of computation
     • Trimmed down to a few minutes using:
       • Sparse matrix
       • Optimized storage of the graph in memory
       • Versioned copies of the DiscoRank
     • So technically we could compute the DiscoRank in real time
  29. USING SPARSITY
     • Re-mapping entity ids
     • Memory optimization so the graph fits in memory:
       • All edge details are stored in memory in a byte[]
       • buffer the byte[] into an opaque byte block pool
       • no objects
       • sort the buffered byte[] in place
     • On disk and when computing the DiscoRank:
       • Delta-encoded ordered adjacency lists:
         • One “from” node, several “to” nodes
         • Delta encode the “to” node ids
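Delta encoding an ordered adjacency list can be sketched as follows: for one “from” node, the sorted “to” node ids are stored as differences between consecutive ids. The ids and function names are hypothetical, and the byte-level layout used in the real system is not described in the talk, so this only shows the integer transformation.

```python
def delta_encode(sorted_ids):
    """Replace each id by its difference from the previous id (first id kept as is)."""
    deltas = []
    previous = 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original sorted ids by accumulating the differences."""
    ids = []
    current = 0
    for delta in deltas:
        current += delta
        ids.append(current)
    return ids


# Hypothetical "to" node ids for a single "from" node.
to_nodes = sorted([1000042, 1000007, 1000003, 1000019])
encoded = delta_encode(to_nodes)
print(encoded)                # [1000003, 4, 12, 23]
print(delta_decode(encoded))  # [1000003, 1000007, 1000019, 1000042]
```

The differences are much smaller than the ids themselves, so they can be stored in fewer bytes if a variable-length integer encoding is used (an assumption; the talk does not name one). Re-mapping entity ids to a dense range keeps the differences small.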
  30. VERSIONED DISCORANK
     • We keep versioned copies of:
       • the DiscoRank vector of results
       • the DiscoRank graph
     • We rebuild the entire DiscoRank graph from scratch once a week
     • In between:
       • we create additional graph segments with new entities and events
       • and use the results of the previous DiscoRank run as the prior for the DiscoRank computation
     • Side effect:
       • Also allows for experimentation
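Using the previous run as a prior can be sketched as a warm start of the power iteration: the iteration begins from the old rank vector instead of the uniform distribution. This is an illustrative sketch, not the DiscoRank implementation. The graph, `alpha=0.15`, the handling of nodes without out-links, and the uniform share given to entities missing from the prior are all assumptions.

```python
def pagerank(links, nodes, prior=None, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration over `links` (node -> list of targets).

    Returns (ranks, iterations). `prior` optionally seeds the iteration.
    """
    n = len(nodes)
    if prior is None:
        rank = {node: 1.0 / n for node in nodes}
    else:
        # Assumption: nodes missing from the prior get a uniform share,
        # then the vector is renormalized to sum to 1.
        rank = {node: prior.get(node, 1.0 / n) for node in nodes}
        total = sum(rank.values())
        rank = {node: value / total for node, value in rank.items()}
    for iteration in range(1, max_iter + 1):
        # Mass sitting on nodes without out-links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: alpha / n + (1 - alpha) * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                share = (1 - alpha) * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank, iteration


# Previous run on a hypothetical graph.
old_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
old_nodes = ["A", "B", "C"]
old_rank, _ = pagerank(old_links, old_nodes)

# New segment: one new entity D with one new event (edge) D -> C.
new_links = dict(old_links, D=["C"])
new_nodes = old_nodes + ["D"]

cold, cold_iters = pagerank(new_links, new_nodes)
warm, warm_iters = pagerank(new_links, new_nodes, prior=old_rank)
print(cold_iters, warm_iters)
```

Both runs converge to the same vector, since the steady state does not depend on the starting point. When the graph has changed only slightly, the warm start usually begins closer to that steady state and tends to need fewer iterations.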
  31. INTEGRATION IN OUR INFRASTRUCTURE
     • MySQL batch jobs
     • DiscoRank results stored in HDFS
     • At the end of every DiscoRank run we re-load it in ElasticSearch:
       • For each item we combine its Lucene score with its DiscoRank
  32. Amélie Anglade
     Sound/Music Information Retrieval Engineer
     about.me/utstikkar
     @utstikkar
     We’re hiring! www.soundcloud.com
